Services or capabilities described in this page might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China Regions. Only “Region Availability” and “Feature Availability and Implementation Differences” sections for specific services (in each case exclusive of content referenced via hyperlink) in Getting Started with Amazon Web Services in China Regions form part of the Documentation under the agreement between you and Sinnet or NWCD governing your use of services of Amazon Web Services China (Beijing) Region or Amazon Web Services China (Ningxia) Region (the “Agreement”). Any other content contained in the Getting Started pages does not form any part of the Agreement.

Amazon Glue Documentation

Amazon Glue is a serverless data integration service that helps you prepare data for analytics, machine learning, and application development. Amazon Glue provides capabilities needed for data integration, so you can gain insights and put your data to use in minutes. 

Data Discovery

Discover and search across all your Amazon Web Services data sets

The Amazon Glue Data Catalog is designed to be a persistent metadata store for all your Amazon Web Services data assets, regardless of where they are located. The Data Catalog contains table definitions, job definitions, schemas, and other control information to help you manage your Amazon Glue environment. It is designed to automatically compute statistics and register partitions to make queries against your data efficient and effective. It also maintains a schema version history so you can understand how your data has changed over time.
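As a sketch of how the Data Catalog is typically read programmatically, the snippet below builds the parameters for the Glue `GetTable` API call; the database and table names are illustrative assumptions, and the actual call (commented out) requires boto3 and credentials.

```python
# Sketch: looking up a table definition in the Amazon Glue Data Catalog.
# The database and table names below are illustrative assumptions.

get_table_request = {
    "DatabaseName": "sales_db",  # hypothetical catalog database
    "Name": "orders",            # hypothetical table
}

# With boto3 and credentials configured, the lookup would be:
#   import boto3
#   response = boto3.client("glue").get_table(**get_table_request)
#   columns = response["Table"]["StorageDescriptor"]["Columns"]
```

The returned table definition carries the column schema, partition keys, and storage location that ETL jobs and query engines use to read the underlying data.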

Automatic schema discovery

Amazon Glue crawlers are designed to connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your Amazon Glue Data Catalog. The metadata stored in tables in your Data Catalog can be used when authoring your ETL jobs. You can run crawlers on a schedule, on demand, or in response to an event to keep your metadata up to date.

Manage and enforce schemas for data streams

Amazon Glue Schema Registry, a serverless feature of Amazon Glue, helps you validate and control the evolution of streaming data using registered Apache Avro schemas. Through Apache-licensed serializers and deserializers, the Schema Registry is designed to integrate with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and Amazon Lambda. When you integrate data streaming applications with the Schema Registry, it can help improve data quality and assist in safeguarding against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update Amazon Glue tables and partitions using schemas stored within the registry.
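As a sketch of how a schema is registered and its evolution governed, the snippet below builds a `CreateSchema` request with a small Avro record; the registry and schema names are illustrative assumptions.

```python
import json

# Sketch: registering an Apache Avro schema with the Glue Schema Registry.
# Registry and schema names are illustrative assumptions.

avro_schema = {
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}

create_schema_request = {
    "RegistryId": {"RegistryName": "streaming-registry"},
    "SchemaName": "click-event",
    "DataFormat": "AVRO",
    # BACKWARD compatibility rejects producer changes that would break
    # existing consumers (for example, removing a field with no default).
    "Compatibility": "BACKWARD",
    "SchemaDefinition": json.dumps(avro_schema),
}

# import boto3
# boto3.client("glue").create_schema(**create_schema_request)
```

Once registered, the Apache-licensed serializers validate each produced record against the current schema version, and new versions are accepted only if they pass the compatibility check.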

Automatically scale based on workload

Amazon Glue Autoscaling, a serverless feature in Amazon Glue, dynamically scales resources up and down based on workload. With Autoscaling, your job is assigned workers only when needed. As the job progresses through more demanding transforms, Amazon Glue adds and removes resources depending on how far the workload can be parallelized. You no longer need to over-provision resources, spend time tuning the number of workers, or pay for idle resources.
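In practice, autoscaling is enabled per job. The sketch below builds a `CreateJob` request with the `--enable-auto-scaling` job argument (available from Glue 3.0); the job name, role ARN, and script location are illustrative assumptions.

```python
# Sketch: creating a job with autoscaling enabled (Glue 3.0+).
# Job name, role ARN, and script location are illustrative assumptions.
# With "--enable-auto-scaling", NumberOfWorkers acts as an upper bound
# and Glue adds or removes workers as the workload allows.

create_job_request = {
    "Name": "orders-etl",
    "Role": "arn:aws-cn:iam::123456789012:role/GlueJobRole",  # hypothetical
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
    },
    "GlueVersion": "3.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,  # maximum; the actual count scales with the workload
    "DefaultArguments": {"--enable-auto-scaling": "true"},
}

# import boto3
# boto3.client("glue").create_job(**create_job_request)
```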

Data Transformation

Visually transform data with a drag-and-drop interface

Amazon Glue Studio helps you author scalable ETL jobs for distributed processing without becoming an Apache Spark expert. Define your ETL process in the drag-and-drop job editor, and Amazon Glue generates the code to extract, transform, and load your data. The code is generated in Scala or Python and written for Apache Spark.
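The generated scripts follow a consistent read-map-write shape. The sketch below shows roughly what a generated PySpark job looks like; the catalog names, mappings, and output path are illustrative assumptions, and the `awsglue`/`pyspark` imports are deferred into the function because they exist only inside the Glue job runtime.

```python
# Field mappings of the form (source, source_type, target, target_type),
# as produced by a Glue Studio ApplyMapping node. Names are illustrative.
MAPPINGS = [
    ("order_id", "string", "order_id", "string"),
    ("amount", "string", "amount", "double"),
]

def run_job():
    # awsglue and pyspark are available only inside the Glue runtime,
    # so the imports are deferred into this function.
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"  # hypothetical catalog entries
    )
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/clean/orders/"},
        format="parquet",
    )
```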

Build complex ETL pipelines with simple job scheduling

Amazon Glue jobs can be invoked on a schedule, on demand, or in response to an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. Amazon Glue is designed to handle all inter-job dependencies, filter bad data, and retry jobs if they fail. Logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.
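A simple two-stage pipeline can be wired with one scheduled trigger and one conditional trigger. The sketch below builds both `CreateTrigger` requests; the job and trigger names are illustrative assumptions.

```python
# Sketch: a two-stage pipeline wired with Glue triggers.
# Job and trigger names are illustrative assumptions.

# Stage 1 runs on a schedule.
scheduled_trigger = {
    "Name": "nightly-extract",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 1 * * ? *)",  # 01:00 UTC daily
    "Actions": [{"JobName": "extract-orders"}],
    "StartOnCreation": True,
}

# Stage 2 runs only after stage 1 succeeds.
conditional_trigger = {
    "Name": "transform-after-extract",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "extract-orders",
                "State": "SUCCEEDED",
            }
        ]
    },
    "Actions": [{"JobName": "transform-orders"}],
    "StartOnCreation": True,
}

# import boto3
# glue = boto3.client("glue")
# glue.create_trigger(**scheduled_trigger)
# glue.create_trigger(**conditional_trigger)
```

Chaining more conditional triggers in the same way builds out arbitrarily deep job graphs, with retries and failure states surfaced through CloudWatch.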

Clean and transform streaming data in-flight

Serverless streaming ETL jobs in Amazon Glue are designed to continuously consume data from streaming sources including Amazon Kinesis and Amazon MSK, clean and transform it in-flight, and make it available for analysis in your target data store. You can use this feature to process event data like IoT event streams, clickstreams, and network logs. Amazon Glue streaming ETL jobs can help you enrich and aggregate data, join batch and streaming sources, and run a variety of complex analytics and machine learning operations.
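Inside a streaming job script, the source is described by a set of connection options. The sketch below shows options for reading an Amazon Kinesis stream; the stream ARN is an illustrative assumption, and within the job these options would be passed to `glue_context.create_data_frame.from_options(connection_type="kinesis", ...)`.

```python
# Sketch: connection options for a Glue streaming ETL job reading from
# an Amazon Kinesis data stream. The stream ARN is an illustrative
# assumption; option names follow the Glue Kinesis connector.

kinesis_options = {
    "streamARN": "arn:aws-cn:kinesis:cn-north-1:123456789012:stream/clicks",
    "startingPosition": "TRIM_HORIZON",  # read from the oldest available record
    "classification": "json",
}
```

Each micro-batch read with these options can then be cleaned, enriched, or joined with batch sources before being written to the target data store.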

Integrate

Simplify data integration job development

Amazon Glue Interactive Sessions, a serverless feature of Amazon Glue, simplifies the development of data integration jobs. With Interactive Sessions, data engineers can explore, experiment on, and process data interactively using the IDE or notebook of their choice.

Built-in Job Notebooks

Amazon Glue Studio Job Notebooks provides serverless notebooks with minimal setup, so developers can get started quickly. Job Notebooks provides a built-in interface for Amazon Glue Interactive Sessions that lets users save and schedule their notebook code as Amazon Glue jobs.

Data Preparation

Deduplicate and cleanse data with built-in machine learning

Amazon Glue can help you clean and prepare your data for analysis without requiring you to become a machine learning expert. Its FindMatches feature is designed to deduplicate and find records that are imperfect matches of each other. For example, you can use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main”. FindMatches only asks you to label sets of records as either “matching” or “not matching.” The system is designed to learn your criteria for calling a pair of records a “match” and to build an ETL job that you can use to find duplicate records within a database or matching records across two databases.
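A FindMatches transform is created over a catalog table before labeling begins. The sketch below builds a `CreateMLTransform` request for the restaurants example; the names, role ARN, and tradeoff value are illustrative assumptions.

```python
# Sketch: creating a FindMatches ML transform over a catalog table of
# restaurants. Names, role ARN, and the tradeoff value are illustrative.

create_ml_transform_request = {
    "Name": "dedupe-restaurants",
    "Role": "arn:aws-cn:iam::123456789012:role/GlueMLRole",  # hypothetical
    "InputRecordTables": [
        {"DatabaseName": "sales_db", "TableName": "restaurants"}
    ],
    "Parameters": {
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "restaurant_id",
            # Bias the model toward precision (fewer false "matches")
            # versus recall (fewer missed duplicates).
            "PrecisionRecallTradeoff": 0.9,
        },
    },
}

# import boto3
# boto3.client("glue").create_ml_transform(**create_ml_transform_request)
```

After creation, you download a labeling file, mark record sets as matching or not matching, upload the labels, and train; the trained transform can then be applied inside an ETL job.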

Edit, debug, and test ETL code with developer endpoints

If you choose to interactively develop your ETL code, Amazon Glue provides development endpoints for you to edit, debug, and test the code it generates for you. You can use your favorite IDE or notebook. You can write custom readers, writers, or transformations and import them into your Amazon Glue ETL jobs as custom libraries. You can also use and share code with other developers in our GitHub repository.

Normalize data without code using a visual interface

Amazon Glue DataBrew provides an interactive, point-and-click visual interface to help users such as data analysts and data scientists clean and normalize data without writing code. You can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. You can choose from over 250 built-in transformations to combine, pivot, and transpose data, and automate data preparation tasks by applying saved transformations directly to new incoming data.
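The saved transformations are expressed as recipe steps, which can also be managed through the DataBrew API. The sketch below builds a `CreateRecipe` request; the recipe name, columns, and in particular the operation names are illustrative assumptions, since DataBrew defines its own recipe-action vocabulary.

```python
# Sketch: a DataBrew recipe defined through the CreateRecipe API.
# Recipe name, column names, and operation names are illustrative
# assumptions; DataBrew documents the exact recipe-action vocabulary.

create_recipe_request = {
    "Name": "clean-orders",
    "Steps": [
        {
            "Action": {
                "Operation": "UPPER_CASE",         # assumed action name
                "Parameters": {"sourceColumn": "country"},
            }
        },
        {
            "Action": {
                "Operation": "REMOVE_DUPLICATES",  # assumed action name
                "Parameters": {"sourceColumn": "order_id"},
            }
        },
    ],
}

# import boto3
# boto3.client("databrew").create_recipe(**create_recipe_request)
```

Publishing the recipe and attaching it to a DataBrew job applies the same steps automatically to new incoming data.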

Define, detect, and remediate sensitive data

Amazon Glue Sensitive Data Detection lets you define, identify, and process sensitive data in your data pipeline and data lake. Once identified, you can remediate sensitive data by redacting, replacing, or reporting on personally identifiable information (PII) and other types of data deemed sensitive. Amazon Glue Sensitive Data Detection simplifies the identification and masking of sensitive data, including PII such as names, Social Security numbers, addresses, emails, and driver’s licenses.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.amazonaws.cn/en_us/. This additional information does not form part of the Documentation for purposes of the Sinnet Customer Agreement for Amazon Web Services (Beijing Region), Western Cloud Data Customer Agreement for Amazon Web Services (Ningxia Region) or other agreement between you and Sinnet or NWCD governing your use of services of Amazon Web Services China Regions.