General
Q: What is Amazon Managed Service for Apache Flink?
With Amazon Managed Service for Apache Flink, you can transform and analyze streaming data in real time with Apache Flink. Apache Flink is an open source framework and engine for processing data streams. Amazon Managed Service for Apache Flink reduces the complexity of building, managing, and integrating Apache Flink applications with other Amazon Web Services services.
Amazon Managed Service for Apache Flink takes care of everything required to continuously run streaming applications and scales automatically to match the volume and throughput of your incoming data. With Amazon Managed Service for Apache Flink, there are no servers to manage, there is no minimum fee or setup cost, and you only pay for the resources your streaming applications consume.
Q: What is real-time stream processing and why do I need it?
Companies are ingesting data faster than ever because of an explosive growth of real-time data sources. Whether you are handling log data from mobile and web applications, purchase data from ecommerce platforms, or sensor data from IoT devices, ingesting data in real time helps you learn what your customers, organization, and business are doing right now.
Q: What can I do with Amazon Managed Service for Apache Flink?
You can use Amazon Managed Service for Apache Flink for many use cases to continuously process data, getting insights in seconds or minutes rather than waiting days or even weeks. With Amazon Managed Service for Apache Flink, you can quickly build end-to-end stream processing applications for log analytics, clickstream analytics, IoT, ad tech, gaming, and more. The four most common use cases are streaming extract, transform, and load (ETL) applications, continuous metric generation, responsive real-time analytics, and interactive querying of data streams.
Streaming ETL
With streaming ETL applications, you can clean, enrich, organize, and transform raw data prior to loading your data lake or data warehouse in real time, reducing or eliminating batch ETL steps. These applications can buffer small records into larger files prior to delivery and perform sophisticated joins across streams and tables. For example, you can build an application that continuously reads IoT sensor data stored in Amazon Managed Streaming for Apache Kafka (Amazon MSK), organizes the data by sensor type, removes duplicate data, normalizes data per a specified schema, and then delivers the data to Amazon Simple Storage Service (Amazon S3).
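The steps in the IoT example above (organize by sensor type, remove duplicates, normalize to a schema) can be sketched in plain Python. This is a conceptual illustration, not Apache Flink API code. The record fields (`id`, `type`, `value`) and the function name `etl_batch` are hypothetical.

```python
from collections import defaultdict

def etl_batch(records):
    """Deduplicate by record id, normalize fields, and group by sensor type."""
    seen_ids = set()
    by_type = defaultdict(list)
    for rec in records:
        if rec["id"] in seen_ids:  # drop duplicate deliveries
            continue
        seen_ids.add(rec["id"])
        normalized = {  # coerce to a fixed (hypothetical) schema
            "id": str(rec["id"]),
            "type": rec["type"].strip().lower(),
            "value": float(rec["value"]),
        }
        by_type[normalized["type"]].append(normalized)
    return dict(by_type)
```

In a real application, the grouped output would be written to a sink such as Amazon S3 rather than returned from a function.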
Continuous metric generation
With continuous metric generation applications, you can monitor and understand how your data is trending over time. Your applications can aggregate streaming data into critical information and seamlessly integrate it with reporting databases and monitoring services to serve your applications and users in real time. With Amazon Managed Service for Apache Flink, you can use Apache Flink code (in Java, Scala, Python, or SQL) to continuously generate time series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon DynamoDB. You can also track the traffic to your website by calculating the number of unique website visitors every 5 minutes and then sending the processed results to Amazon Redshift.
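As a rough sketch of the unique-visitors example above, the following plain-Python function assigns each event to a five-minute tumbling window and counts distinct visitors per window. It is not Flink code. The event shape `(epoch_seconds, visitor_id)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # five-minute tumbling windows

def unique_visitors_per_window(events, window_seconds=WINDOW_SECONDS):
    """events: iterable of (epoch_seconds, visitor_id). Returns {window_start: count}."""
    visitors = defaultdict(set)
    for ts, visitor_id in events:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        visitors[window_start].add(visitor_id)
    return {start: len(ids) for start, ids in sorted(visitors.items())}
```

A streaming engine computes the same result incrementally and emits each window when it closes, instead of scanning a finished list.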
Responsive real-time analytics
Responsive real-time analytics applications send real-time alarms or notifications when certain metrics reach predefined thresholds or, in more advanced cases, when your application detects anomalies using machine learning (ML) algorithms. With these applications, you can respond immediately to changes in your business, such as predicting user abandonment in mobile apps and identifying degraded systems. For example, an application can compute the availability or success rate of a customer-facing API over time and then send results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Amazon Kinesis Data Streams and Amazon Simple Notification Service (Amazon SNS).
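The threshold case described above can be illustrated with a short plain-Python sketch that computes a success rate per time window and flags windows that fall below a threshold. The input shape, the 0.99 default threshold, and the function name are hypothetical. A real application would publish the alerts to a service such as Amazon CloudWatch or Amazon SNS.

```python
def success_rate_alerts(windows, threshold=0.99):
    """windows: iterable of (label, successes, total). Returns breaching (label, rate) pairs."""
    alerts = []
    for label, successes, total in windows:
        if total == 0:
            continue  # no traffic in this window, nothing to evaluate
        rate = successes / total
        if rate < threshold:
            alerts.append((label, round(rate, 4)))
    return alerts
```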
Interactive analysis of data streams
Interactive analysis helps you explore streaming data in real time. With ad hoc queries or programs, you can inspect streams from Amazon MSK or Amazon Kinesis Data Streams and visualize what data looks like within those streams. For example, you can view how a real-time metric that computes the average over a time window behaves and send the aggregated data to a destination of your choice. Interactive analysis also helps with iterative development of stream processing applications. The queries you build continuously update as new data arrives. With Amazon Managed Service for Apache Flink Studio, you can deploy these queries to run continuously with auto scaling and durable state backups enabled.
Getting started
Q: How do I get started with Apache Flink applications for Amazon Managed Service for Apache Flink?
Sign in to the Amazon Managed Service for Apache Flink console and create a new stream processing application. You can also use the Amazon CLI and Amazon SDKs. Once you create an application, go to your preferred integrated development environment, connect to Amazon Web Services, and install the open source Apache Flink libraries and Amazon Web Services SDKs in your language of choice. Apache Flink is an open source framework and engine for processing data streams. The extensible libraries include more than 25 prebuilt stream processing operators, such as window and aggregate, and Amazon Web Services integrations such as Amazon MSK, Amazon Kinesis Data Streams, Amazon DynamoDB, and Amazon Kinesis Data Firehose. Once built, upload your code to Amazon Managed Service for Apache Flink. The service then takes care of everything required to continuously run your real-time applications, including scaling automatically to match the volume and throughput of your incoming data.
Q: How do I get started with Apache Beam applications for Amazon Managed Service for Apache Flink?
Using Apache Beam to create your Amazon Managed Service for Apache Flink application is very similar to getting started with Apache Flink. You can follow the instructions in the question above. Ensure you install any components necessary for applications to run on Apache Beam, following the instructions in the Developer Guide. Note that Amazon Managed Service for Apache Flink supports only the Java SDK when running on Apache Beam.
Q: How do I get started with Amazon Managed Service for Apache Flink Studio?
You can get started from the Amazon Managed Service for Apache Flink console and create a new Studio notebook. Once you start the notebook, you can open it in Apache Zeppelin to immediately write code in SQL, Python, or Scala. You can interactively develop applications using the notebook interface for Amazon Kinesis Data Streams, Amazon MSK, and Amazon S3 using built-in integrations and other Apache Flink-supported sources and destinations with custom connectors. You can use all the operators that Apache Flink supports in Flink SQL and the Table API to perform ad hoc data stream querying and develop your stream processing application. Once you are ready, you can build and promote your code to a continuously running stream processing application with auto scaling and durable state in a few steps.
Q: What are the limits of Amazon Managed Service for Apache Flink?
Amazon Managed Service for Apache Flink elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. For detailed information on service limits for Apache Flink applications, visit the Limits section in the Amazon Managed Service for Apache Flink Developer Guide.
Q: Does Amazon Managed Service for Apache Flink support schema registration?
Yes, by using Apache Flink DataStream Connectors, Amazon Managed Service for Apache Flink applications can use Amazon Glue Schema Registry, a serverless feature of Amazon Glue. You can integrate Apache Kafka, Amazon MSK, and Amazon Kinesis Data Streams, as a sink or a source, with your Amazon Managed Service for Apache Flink workloads. Visit the Amazon Glue Schema Registry Developer Guide to get started and learn more.
Key concepts
Q: What is an Amazon Managed Service for Apache Flink application?
An application is the Amazon Managed Service for Apache Flink entity that you work with. Amazon Managed Service for Apache Flink applications continuously read and process streaming data in real time. You write application code in an Apache Flink–supported language to process the incoming streaming data and produce output. Then, Amazon Managed Service for Apache Flink writes the output to a configured destination.
Each application consists of three primary components:
Input
Input is the streaming source for your application. In the input configuration, you map the streaming sources to data streams. Data flows from your data sources into your data streams. You process data from these data streams using your application code, sending processed data to subsequent data streams or destinations. You add inputs inside application code for Apache Flink applications and Studio notebooks and through the API for Amazon Managed Service for Apache Flink applications.
Application code
Application code is a series of Apache Flink operators that process input and produce output. In its simplest form, application code can be a single Apache Flink operator that reads from a data stream associated with a streaming source and writes to another data stream associated with an output. For a Studio notebook, this could be a simple Flink SQL select query, with the results shown in context within the notebook. You can write Apache Flink code in its supported languages for Amazon Managed Service for Apache Flink applications or Studio notebooks.
Output
You can then optionally configure an application output to persist data to an external destination. You add these outputs inside application code for Amazon Managed Service for Apache Flink applications and Studio notebooks.
Q: What application code is supported?
Amazon Managed Service for Apache Flink supports applications built using Java, Scala, and Python with the open source Apache Flink libraries and your own custom code. Amazon Managed Service for Apache Flink also supports applications built using Java with the open source Apache Beam libraries and your own custom code. Amazon Managed Service for Apache Flink Studio supports code built using Apache Flink–compatible SQL, Python, and Scala.
Managing applications
Q: How can I monitor the operations and performance of my Amazon Managed Service for Apache Flink applications?
Amazon Web Services provides various tools that you can use to monitor your Amazon Managed Service for Apache Flink applications, including access to the Flink Dashboard for Apache Flink applications. You can configure some of these tools to do the monitoring for you. For more information about how to monitor your application, explore the Amazon Managed Service for Apache Flink Developer Guide.
Q: How do I manage and control access to my Amazon Managed Service for Apache Flink applications?
Amazon Managed Service for Apache Flink needs permissions to read records from the streaming data sources you specify in your application. Amazon Managed Service for Apache Flink also needs permissions to write your application output to specified destinations in your application output configuration. You can grant these permissions by creating an Amazon Identity and Access Management (IAM) role that Amazon Managed Service for Apache Flink can assume. The permissions you grant to this role determine what Amazon Managed Service for Apache Flink can do when the service assumes the role. For more information, see the Amazon Managed Service for Apache Flink Developer Guide.
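As a hedged illustration of the two pieces involved (who may assume the role, and what the role may do), the sketch below builds IAM policy documents as Python dictionaries. The service principal `kinesisanalytics.amazonaws.com` and the listed Kinesis read actions are assumptions that you should verify against the Developer Guide for your Region. The function names and any stream ARN you pass in are hypothetical.

```python
import json

def trust_policy(service_principal="kinesisanalytics.amazonaws.com"):
    """Trust policy letting the service assume the role (principal is an assumption)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }

def read_stream_policy(stream_arn):
    """Permissions policy granting read access to one Kinesis data stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
            ],
            "Resource": stream_arn,
        }],
    }

def to_json(policy):
    """Serialize a policy document for use with the console, CLI, or an SDK."""
    return json.dumps(policy, indent=2)
```

Scoping the permissions policy to specific stream ARNs, rather than a wildcard, limits what the service can do when it assumes the role.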
Q: How does Amazon Managed Service for Apache Flink scale my application?
Amazon Managed Service for Apache Flink elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. Amazon Managed Service for Apache Flink provisions capacity in the form of Amazon KPUs. One KPU provides you with 1 vCPU and 4 GB memory.
For Apache Flink applications and Studio notebooks, Amazon Managed Service for Apache Flink assigns 50 GB of running application storage per KPU. Your application uses this storage for checkpoints, and it is also available for you to use as temporary disk. A checkpoint is an up-to-date backup of a running application used to recover immediately from an application disruption. You can also control the parallel execution for your Amazon Managed Service for Apache Flink application tasks (such as reading from a source or creating an operator) using the Parallelism and ParallelismPerKPU parameters in the API. Parallelism defines the number of concurrent instances of a task. All operators, sources, and sinks run with a defined parallelism, which is one by default. ParallelismPerKPU defines the number of parallel tasks that can be scheduled per KPU of your application, which is also one by default. For more information, see Scaling in the Amazon Managed Service for Apache Flink Developer Guide.
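The per-KPU figures above (1 vCPU, 4 GB memory, 50 GB running application storage) lend themselves to a small arithmetic sketch. The relationship used here, KPUs = ceil(Parallelism / ParallelismPerKPU), is an assumption for illustration; confirm it in the Scaling section of the Developer Guide. The function name is hypothetical.

```python
import math

VCPU_PER_KPU = 1
MEMORY_GB_PER_KPU = 4
STORAGE_GB_PER_KPU = 50

def kpu_resources(parallelism=1, parallelism_per_kpu=1):
    """Estimate resources for an application; both parameters default to one."""
    # Assumption: KPUs needed = ceil(Parallelism / ParallelismPerKPU).
    kpus = math.ceil(parallelism / parallelism_per_kpu)
    return {
        "kpus": kpus,
        "vcpus": kpus * VCPU_PER_KPU,
        "memory_gb": kpus * MEMORY_GB_PER_KPU,
        "storage_gb": kpus * STORAGE_GB_PER_KPU,
    }
```

For example, `kpu_resources(8, 2)` yields 4 KPUs, which corresponds to 4 vCPUs, 16 GB of memory, and 200 GB of running application storage. The estimate excludes any additional KPU billed for orchestration.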
Q: What are the best practices for building and managing my Amazon Managed Service for Apache Flink applications?
For information about best practices for Apache Flink, see the Best Practices section of the Amazon Managed Service for Apache Flink Developer Guide. The section covers best practices for fault tolerance, performance, logging, coding, and more.
For information about best practices for Amazon Managed Service for Apache Flink Studio, see the Best Practices section of the Amazon Managed Service for Apache Flink Studio Developer Guide. In addition to best practices, this section covers samples for SQL, Python, and Scala applications, requirements for deploying your code as a continuously running stream processing application, performance, logging, and more.
Q: Can I access resources behind an Amazon VPC with an Amazon Managed Service for Apache Flink application?
Yes. You can access resources behind an Amazon VPC. Learn how to configure your application for VPC access in the Using an Amazon VPC section of the Amazon Managed Service for Apache Flink Developer Guide.
Q: Can a single Amazon Managed Service for Apache Flink application have access to multiple VPCs?
No. If multiple subnets are specified, they must all be in the same VPC. You can connect to other VPCs by peering your VPCs.
Q: Can an Amazon Managed Service for Apache Flink application that’s connected to a VPC access the internet and Amazon Web Services service endpoints?
Amazon Managed Service for Apache Flink applications and Amazon Managed Service for Apache Flink Studio notebooks that are configured to access resources in a particular VPC do not have access to the internet as a default configuration. You can learn how to configure access to the internet for your application in the Internet and Service Access section of the Amazon Managed Service for Apache Flink Developer Guide.
Pricing and billing
Q: How much does Amazon Managed Service for Apache Flink cost?
With Amazon Managed Service for Apache Flink, you pay only for what you use. There are no resources to provision or upfront costs associated with Amazon Managed Service for Apache Flink.
You are charged an hourly rate based on the number of Amazon KPUs used to run your streaming application. A single KPU is a unit of stream processing capacity comprised of 1 vCPU compute and 4 GB memory. Amazon Managed Service for Apache Flink automatically scales the number of KPUs required by your stream processing application as the demands of memory and compute vary in response to processing complexity and the throughput of streaming data processed.
For Apache Flink and Apache Beam applications, you are charged a single additional KPU per application for application orchestration. Apache Flink and Apache Beam applications are also charged for running application storage and durable application backups. Running application storage is used for stateful processing capabilities in Amazon Managed Service for Apache Flink and charged per GB-month. Durable application backups are optional, charged per GB-month, and provide a point-in-time recovery point for applications.
For Amazon Managed Service for Apache Flink Studio, in development or interactive mode, you are charged an additional KPU for application orchestration and 1 KPU for interactive development. You are also charged for running application storage. You are not charged for durable application backups.
For more pricing information, see the Amazon Managed Service for Apache Flink pricing page.
Q: Am I charged for an Amazon Managed Service for Apache Flink application that is running but not processing any data from the source?
For Apache Flink and Apache Beam applications, you are charged a minimum of 2 KPUs and 50 GB running application storage if your Amazon Managed Service for Apache Flink application is running.
For Amazon Managed Service for Apache Flink Studio notebooks, you are charged a minimum of 3 KPUs and 50 GB running application storage if your application is running.
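To make the minimums above concrete, the following sketch estimates the charge for a running but idle application. The rates are parameters, not real prices; take actual rates from the pricing page. Prorating the GB-month storage charge by hours, the 730-hour month, and the function name are assumptions for illustration.

```python
# Minimum billed footprint while an application is running (from the answers above).
MINIMUMS = {
    "flink": {"kpus": 2, "storage_gb": 50},
    "beam": {"kpus": 2, "storage_gb": 50},
    "studio": {"kpus": 3, "storage_gb": 50},
}

def idle_cost(app_type, hours, kpu_hourly_rate, storage_gb_month_rate,
              hours_per_month=730):
    """Estimate the minimum charge for an idle application over `hours`."""
    minimum = MINIMUMS[app_type]
    compute = minimum["kpus"] * hours * kpu_hourly_rate
    # Assumption: GB-month storage is prorated by the fraction of the month used.
    storage = minimum["storage_gb"] * storage_gb_month_rate * (hours / hours_per_month)
    return compute + storage
```

With placeholder rates of 0.10 per KPU-hour and 0.10 per GB-month, a Flink application left running for a full 730-hour month would cost about 151 currency units under these assumptions, and a Studio notebook about 224.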
Q: Other than Amazon Managed Service for Apache Flink costs, are there any other costs that I might incur?
Amazon Managed Service for Apache Flink is a fully managed stream processing solution, independent from the streaming source that it reads data from and the destinations it writes processed data to. You will be independently billed for the services you read from and write to in your application.
Q: Is Amazon Managed Service for Apache Flink available in the Amazon Web Services for China Region Free Tier?
No. Amazon Managed Service for Apache Flink is not currently available in the Amazon Web Services for China Regions Free Tier.
Building Apache Flink applications
Q: What is Apache Flink?
Apache Flink is an open source framework and engine for stream and batch data processing. It makes streaming applications easy to build because it provides powerful operators and solves core streaming problems such as duplicate processing. Apache Flink provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Q: How do I develop applications?
You can start by downloading the open source libraries including the Amazon SDK, Apache Flink, and connectors for Amazon Web Services services. Get instructions on how to download the libraries and create your first application in the Amazon Managed Service for Apache Flink Developer Guide.
Q: How do I use the Apache Flink operators?
Operators take an application data stream as input and send processed data to an application data stream as output. Operators can be connected to build applications with multiple steps and don’t require advanced knowledge of distributed systems to implement and operate.
Q: What operators are supported?
Amazon Managed Service for Apache Flink supports all operators from Apache Flink that can be used to solve a wide variety of use cases including map, KeyBy, aggregations, windows, joins, and more. For example, the map operator allows you to perform arbitrary processing, taking one element from an incoming data stream and producing another element. KeyBy logically organizes data using a specified key so that you can process similar data points together. Aggregations perform processing across multiple keys such as sum, min, and max. Window Join joins two data streams together on a given key and window.
You can build custom operators if these do not meet your needs. You can find a full list of Apache Flink operators in the Apache Flink documentation.
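The behavior of map, KeyBy, and aggregations described above can be mimicked in a few lines of plain Python. This is a conceptual sketch over an in-memory list, not the Apache Flink DataStream API. The helper names `map_op`, `key_by`, and `aggregate` are hypothetical.

```python
from collections import defaultdict

def map_op(stream, fn):
    """Map: take one element in, produce one element out."""
    return (fn(element) for element in stream)

def key_by(stream, key_fn):
    """KeyBy: logically group elements that share a key."""
    groups = defaultdict(list)
    for element in stream:
        groups[key_fn(element)].append(element)
    return groups

def aggregate(keyed, value_fn, agg_fn):
    """Aggregation: reduce each key's values with sum, min, max, and so on."""
    return {key: agg_fn(value_fn(e) for e in items) for key, items in keyed.items()}
```

A possible usage, chaining the three steps:

```python
events = [("a", "1"), ("b", "2"), ("a", "3")]
parsed = map_op(events, lambda e: (e[0], int(e[1])))
totals = aggregate(key_by(parsed, lambda e: e[0]), lambda e: e[1], sum)
```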
Q: What integrations are supported in an Amazon Managed Service for Apache Flink application?
You can set up prebuilt integrations provided by Apache Flink with minimal code or build your own integration to connect to virtually any data source. The open source libraries based on Apache Flink support streaming sources and destinations, or sinks, to process data delivery. This also includes data enrichment support through asynchronous I/O connectors. Some of these connectors include the following:
- Streaming data sources: Amazon MSK and Amazon Kinesis Data Streams
- Destinations or sinks: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon DynamoDB, Amazon OpenSearch Service, and Amazon S3 (through Streaming File Sink)
Apache Flink also includes other connectors such as Apache Kafka, Apache Cassandra, Elasticsearch, and more.
Q: Can Amazon Managed Service for Apache Flink applications replicate data across streams and topics?
Yes. You can use Amazon Managed Service for Apache Flink applications to replicate data between Amazon Kinesis Data Streams, Amazon MSK, and other systems.
Q: Are custom integrations supported?
You can add a source or destination to your application by building upon a set of primitive types so that you can read and write from files, directories, sockets, or anything that you can access over the internet. Apache Flink provides these primitive types for data sources and data sinks. The primitive types come with configurations such as the ability to read and write data continuously or once, asynchronously or synchronously, and much more. For example, you can set up an application to continuously read from Amazon S3 by extending the existing file-based source integration.
Q: What delivery and processing model do Amazon Managed Service for Apache Flink applications provide?
Apache Flink applications in Amazon Managed Service for Apache Flink use an exactly-once delivery model if an application is built using idempotent operators, including sources and sinks. This means the processed data impacts downstream results once and only once.
By default, Amazon Managed Service for Apache Flink applications use the Apache Flink exactly-once semantics. Your application supports exactly-once processing semantics if you design your applications using sources, operators, and sinks that use Apache Flink’s exactly-once semantics.
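To see why idempotent sinks matter for the "once and only once" effect described above, consider this plain-Python sketch. It simulates a recovery in which some records are delivered a second time and compares an idempotent (upsert) sink with an append-only sink. The class and function names are hypothetical, and this is not how Flink implements its guarantees internally.

```python
class IdempotentSink:
    """Upserts by record id, so a replayed record does not change the result."""
    def __init__(self):
        self.rows = {}

    def write(self, record_id, value):
        self.rows[record_id] = value

class AppendSink:
    """Appends blindly, so a replayed record shows up twice."""
    def __init__(self):
        self.rows = []

    def write(self, record_id, value):
        self.rows.append((record_id, value))

def deliver_with_replay(sink, records, replay_from):
    """Deliver all records, then redeliver those from index `replay_from` onward."""
    for record_id, value in records:
        sink.write(record_id, value)
    for record_id, value in records[replay_from:]:  # simulated recovery replay
        sink.write(record_id, value)
```

After a replay, the idempotent sink holds exactly one row per record id, while the append sink contains duplicates.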
Q: Do I have access to local storage from my application storage?
Yes. Amazon Managed Service for Apache Flink applications provide your application 50 GB of running application storage per KPU. Amazon Managed Service for Apache Flink scales storage with your application. Running application storage is used for saving application state using checkpoints. It is also accessible to your application code to use as temporary disk for caching data or any other purpose. Amazon Managed Service for Apache Flink can remove data from running application storage not saved through checkpoints (such as operators, sources, sinks) at any time. All data stored in running application storage is encrypted at rest.
Q: How does Amazon Managed Service for Apache Flink automatically back up my application?
Amazon Managed Service for Apache Flink automatically backs up your running application’s state using checkpoints and snapshots. Checkpoints save the current application state and enable Amazon Managed Service for Apache Flink applications to recover the application position to provide the same semantics as a failure-free processing. Checkpoints use running application storage. Checkpoints for Apache Flink applications are provided through Apache Flink’s checkpointing functionality. Snapshots save a point-in-time recovery point for applications and use durable application backups. Snapshots are analogous to Flink savepoints.
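The idea that a checkpoint lets an application recover "the same semantics as a failure-free processing" can be sketched with a toy counter that saves its state together with its position in the source. This is a simplified illustration under the assumption of a replayable source; it is not Apache Flink's checkpointing mechanism, and the class name is hypothetical.

```python
import copy

class CountingApp:
    """Counts occurrences of each key while tracking its position in the source."""
    def __init__(self):
        self.offset = 0
        self.counts = {}

    def process(self, source, upto):
        """Process source elements from the current offset up to index `upto`."""
        while self.offset < upto:
            key = source[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        """Capture state and source position together."""
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, saved):
        """Rebuild an application from a previously captured checkpoint."""
        app = cls()
        app.offset = saved["offset"]
        app.counts = copy.deepcopy(saved["counts"])
        return app
```

Because the offset and the counts are saved together, an application restored from the checkpoint reprocesses only the records after that position and arrives at the same counts as a run that never failed.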
Q: What are application snapshots?
With snapshots, you can create a point-in-time recovery point and restore your application to it. This lets you maintain previous application state and roll back your application at any time. You control how many snapshots you have at any given time, from zero to thousands of snapshots. Snapshots use durable application backups and Amazon Managed Service for Apache Flink charges you based on their size. Amazon Managed Service for Apache Flink encrypts data saved in snapshots by default. You can delete individual snapshots through the API or all snapshots by deleting your application.
Q: What versions of Apache Flink are supported?
To learn more about supported Apache Flink versions, visit the Amazon Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Beam, Java, Scala, Python, and Amazon SDKs that Amazon Managed Service for Apache Flink supports.
Q: Can Amazon Managed Service for Apache Flink applications run Apache Beam?
Yes, Amazon Managed Service for Apache Flink supports streaming applications built using Apache Beam. You can build Apache Beam streaming applications in Java and run them in different engines and services including using Apache Flink on Amazon Managed Service for Apache Flink. You can find information regarding supported Apache Flink and Apache Beam versions in the Amazon Managed Service for Apache Flink Developer Guide.
Building Amazon Managed Service for Apache Flink Studio applications in a managed notebook
Q: How do I develop a Studio application?
You can start from the Amazon Managed Service for Apache Flink Studio, Amazon Kinesis Data Streams, or Amazon MSK consoles in a few steps to launch a serverless notebook to immediately query data streams and perform interactive data analytics.
Interactive data analytics: You can write code in the notebook in SQL, Python, or Scala to interact with your streaming data, with query response times in seconds. You can use built-in visualizations to explore the data, view real-time insights on your streaming data from within your notebook, and develop stream processing applications powered by Apache Flink.
Once your code is ready to run as a production application, you can transition with a single step to a stream processing application that processes gigabytes of data per second, without servers.
Stream processing application: Once you are ready to promote your code to production, you can build your code by clicking “Deploy as stream processing application” in the notebook interface or issue a single command in the CLI. Studio takes care of all the infrastructure management necessary for you to run your stream processing application at scale, with auto scaling and durable state enabled, just as in an Amazon Managed Service for Apache Flink application.
Q: What does my application code look like?
You can write code in the notebook in your preferred language of SQL, Python, or Scala using Apache Flink’s Table API. The Table API is a high-level abstraction and relational API that supports a superset of SQL’s capabilities. It offers familiar operations, such as select, filter, join, group by, aggregate, and so on, along with stream-specific concepts, such as windowing. You use % followed by an interpreter name to specify the language to be used in a section of the notebook and can switch between languages. Interpreters are Apache Zeppelin plugins, so you can specify a language or data processing engine for each section of the notebook. You can also build user-defined functions and reference them to improve code functionality.
Q: What SQL operations are supported?
You can perform SQL operations such as the following:
- Scan and filter (SELECT, WHERE)
- Aggregations (GROUP BY, GROUP BY WINDOW, HAVING)
- Set (UNION, UNION ALL, INTERSECT, IN, EXISTS)
- Order (ORDER BY, LIMIT)
- Joins (INNER, OUTER, Timed Window – BETWEEN, AND, Joining with Temporal Tables – tables that track changes over time)
- Top-N
- Deduplication
- Pattern recognition
Some of these queries, such as GROUP BY, OUTER JOIN, and Top-N, produce updating results on streaming data, which means that the results are continuously updated as the streaming data is processed. Other DDL statements, such as CREATE, ALTER, and DROP, are also supported.
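What continuously updating results mean for a query such as Top-N can be shown with a plain-Python generator that emits a revised ranking after every incoming event. This is a conceptual sketch rather than Flink SQL. The function name and the tie-breaking rule (alphabetical by key) are assumptions for illustration.

```python
def running_top_n(events, n=2):
    """Yield the current top-n (key, count) list after each event, highest count first."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        # Sort by descending count, then by key to break ties deterministically.
        yield sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Each yielded list supersedes the previous one, which mirrors how an updating query revises rows it has already emitted as new data arrives.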
Q: How are Python and Scala supported?
Apache Flink’s Table API supports Python and Scala through language integration using Python strings and Scala expressions. The operations supported are very similar to the SQL operations supported, including select, order, group, join, filter, and windowing. A full list of operations and samples are included in our developer guide.
Q: What versions of Apache Flink and Apache Zeppelin are supported?
To learn more about supported Apache Flink versions, visit the Amazon Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Zeppelin, Apache Beam, Java, Scala, Python, and Amazon SDKs that Amazon Managed Service for Apache Flink supports.
Q: What integrations are supported by default in an Amazon Managed Service for Apache Flink Studio application?
- Data sources: Amazon MSK, Amazon Kinesis Data Streams, Amazon S3
- Destinations or sinks: Amazon MSK, Amazon Kinesis Data Streams, and Amazon S3
Q: Are custom integrations supported?
You can configure additional integrations with a few more steps and lines of Apache Flink code (Python, Scala, or Java) to define connections with all Apache Flink supported integrations. This includes destinations such as Amazon OpenSearch Service, Amazon ElastiCache for Redis, Amazon Aurora, Amazon Redshift, Amazon DynamoDB, Amazon Keyspaces, and more. You can attach executables for these custom connectors when you create or configure your Amazon Managed Service for Apache Flink Studio application.
Service Level Agreement
Q: What does the Amazon Managed Service for Apache Flink SLA guarantee?
Our service level agreement (SLA) guarantees a Monthly Uptime Percentage of at least 99.9% for Amazon Managed Service for Apache Flink.
Q: How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA Service Credit for Amazon Managed Service for Apache Flink under the Amazon Managed Service for Apache Flink SLA if more than one Availability Zone in which you are running a task, within the same Amazon Web Services Region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle. For full details on all the SLA terms and conditions as well as details on how to submit a claim, visit the Amazon Kinesis SLA page.