General
Q: What is Amazon Glue?
Amazon Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Amazon Glue provides all of the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. Amazon Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the Amazon Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can create and run ETL workflows. Data analysts and data scientists can use Amazon Glue DataBrew to visually enrich, clean, and normalize data without writing code.
Q: What are the main components of Amazon Glue?
Amazon Glue consists of a Data Catalog, which is a central metadata repository; an ETL engine that can automatically generate Scala or Python code; a flexible scheduler that handles dependency resolution, job monitoring, and retries; and Amazon Glue DataBrew for cleaning and normalizing data with a visual interface. Together, these automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.
Q: When should I use Amazon Glue?
You should use Amazon Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, your data warehouse in Amazon Redshift, and various databases running on Amazon Web Services. It provides a unified view of your data via the Glue Data Catalog that is available for ETL, querying, and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Glue automatically generates Scala or Python code for your ETL jobs that you can further customize using tools you are already familiar with. You can use Amazon Glue DataBrew to visually clean up and normalize data without writing code. Amazon Glue is serverless, so there are no compute resources to configure and manage.
Q: How does Amazon Glue relate to Amazon Lake Formation?
Lake Formation leverages a shared infrastructure with Amazon Glue, including console controls, ETL code creation and job monitoring, a common data catalog, and a serverless architecture. While Amazon Glue is still focused on these types of functions, Lake Formation encompasses Amazon Glue features and provides additional capabilities designed to help build, secure, and manage a data lake. See the Amazon Lake Formation pages for more details.
Amazon Glue Data Catalog
Q: What is the Amazon Glue Data Catalog?
The Amazon Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.
The Amazon Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR. For more information on setting up your EMR cluster to use Amazon Glue Data Catalog as an Apache Hive Metastore, click here.
The Amazon Glue Data Catalog also provides out-of-the-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL and readily available for querying in Amazon EMR and Amazon Redshift Spectrum, so that you have a common view of your data between these services.
Q: How do I get my metadata into the Amazon Glue Data Catalog?
Amazon Glue provides a number of ways to populate metadata into the Amazon Glue Data Catalog. Glue crawlers scan various data stores you own to automatically infer schemas and partition structure and populate the Glue Data Catalog with corresponding table definitions and statistics. You can also schedule crawlers to run periodically so that your metadata is always up to date and in sync with the underlying data. Alternatively, you can add and update table details manually by using the Amazon Glue Console or by calling the API. You can also run Hive DDL statements via a Hive client on an Amazon EMR cluster. Finally, if you already have a persistent Apache Hive Metastore, you can perform a bulk import of that metadata into the Amazon Glue Data Catalog by using our import script.
Q: What are Amazon Glue crawlers?
An Amazon Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
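As a rough illustration of a scheduled crawler, the sketch below assembles the arguments for a `create_crawler` call in the Python SDK (boto3). The crawler name, IAM role ARN, database name, S3 path, and cron expression are placeholder assumptions, not values from this FAQ. The API call itself is wrapped in a function that is not invoked here, because it needs credentials and an existing role.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Optional cron expression so the crawler runs periodically.
        request["Schedule"] = schedule
    return request


def create_crawler(request):
    """Submit the request; requires boto3, credentials, and an existing role."""
    import boto3  # imported lazily so the builder works without the SDK

    return boto3.client("glue").create_crawler(**request)


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
```

Keeping the request builder separate from the API call makes the configuration easy to inspect or unit test before anything is created in your account.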
Q: How do I import data from my existing Apache Hive Metastore to the Amazon Glue Data Catalog?
You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the Amazon Glue Data Catalog.
Q: Do I need to maintain my Apache Hive Metastore if I am storing my metadata in the Amazon Glue Data Catalog?
No. Amazon Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use Amazon Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.
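For orientation, the cluster setting involved is a `hive-site` configuration classification that swaps in the Glue Data Catalog client factory. The sketch below expresses it as a Python structure and prints it as JSON; treat it as an illustration based on our understanding of the Amazon EMR documentation, and check the documentation for your EMR release before relying on it.

```python
import json

# EMR "hive-site" classification that points Hive at the Glue Data Catalog.
GLUE_METASTORE_CONFIG = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": (
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
            ),
        },
    }
]

# Print the JSON form, which is how cluster configurations are usually supplied.
print(json.dumps(GLUE_METASTORE_CONFIG, indent=2))
```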
Amazon Glue Schema Registry
Q: What is the Amazon Glue Schema Registry?
Amazon Glue Schema Registry, a serverless feature of Amazon Glue, enables you to validate and control the evolution of streaming data using schemas registered in Apache Avro and JSON Schema data formats, at no additional charge. Through Apache-licensed serializers and deserializers, the Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and Amazon Lambda. When data streaming applications are integrated with the Schema Registry, you can improve data quality and safeguard against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update Amazon Glue tables and partitions using Apache Avro schemas stored within the registry.
Q: Why should I use Amazon Glue Schema Registry?
With the Amazon Glue Schema Registry, you can:
- Validate schemas. When data streaming applications are integrated with Amazon Glue Schema Registry, schemas used for data production are validated against schemas within a central registry, allowing you to centrally control data quality.
- Safeguard schema evolution. You can set rules on how schemas can and cannot evolve using one of eight compatibility modes.
- Improve data quality. Serializers validate schemas used by data producers against those stored in the registry, improving data quality when it originates and reducing downstream issues from unexpected schema drift.
- Save costs. Serializers convert data into a binary format and can compress it before it is delivered, reducing data transfer and storage costs.
- Improve processing efficiency. In many cases, a data stream contains records of different schemas. The Schema Registry enables applications that read from data streams to selectively process each record based on the schema without having to parse its contents, which increases processing efficiency.
Q: What data format, client language, and integrations are supported by Amazon Glue Schema Registry?
The Schema Registry supports Apache Avro and JSON Schema data formats and Java client applications. We plan to continue expanding support for other data formats and non-Java clients. The Schema Registry integrates with applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and Amazon Lambda.
Q: What kinds of evolution rules does Amazon Glue Schema Registry support?
The following compatibility modes are available for you to manage your schema evolution: Backward, Backward All, Forward, Forward All, Full, Full All, None, and Disabled. Visit the Schema Registry user documentation to learn more about compatibility rules.
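To give a feel for what a compatibility rule checks, here is a deliberately simplified sketch of a backward-compatibility test between two record schemas. It is not the Schema Registry's actual algorithm: it only applies the common rule that a newer schema can still read older data if every field it adds carries a default, and it ignores type changes. The field dictionaries and the upper-case mode constants are illustrative assumptions.

```python
# The eight mode names listed above, written as upper-case constants.
COMPATIBILITY_MODES = {
    "BACKWARD", "BACKWARD_ALL", "FORWARD", "FORWARD_ALL",
    "FULL", "FULL_ALL", "NONE", "DISABLED",
}


def is_backward_compatible(old_fields, new_fields):
    """Simplified check: every field added by the new schema has a default.

    Both arguments map field names to small dicts such as
    {"type": "string", "default": ""}. Removed fields are allowed,
    and type changes are not examined.
    """
    added = set(new_fields) - set(old_fields)
    return all("default" in new_fields[name] for name in added)


# Hypothetical schemas, for illustration only.
v1 = {"id": {"type": "long"}}
v2_ok = {"id": {"type": "long"}, "email": {"type": "string", "default": ""}}
v2_bad = {"id": {"type": "long"}, "email": {"type": "string"}}

ok_result = is_backward_compatible(v1, v2_ok)
bad_result = is_backward_compatible(v1, v2_bad)
```

In this sketch `v2_ok` passes because the added `email` field has a default, while `v2_bad` fails because a reader using it would have no value for `email` in older records.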
Q: How does Amazon Glue Schema Registry maintain high availability for my applications?
The Schema Registry storage and control plane is designed for high availability and is backed by the Amazon Glue SLA, and the serializers and deserializers leverage best-practice caching techniques to maximize schema availability within clients.
Q: Is Amazon Glue Schema Registry open-source?
Amazon Glue Schema Registry storage is an Amazon Web Services service, while the serializers and deserializers are Apache-licensed open-source components.
Q: Does Amazon Glue Schema Registry provide encryption at rest and in transit?
Yes. Your clients communicate with the Schema Registry via API calls, which encrypt data in transit using TLS encryption over HTTPS. Schemas stored in the Schema Registry are always encrypted at rest using a service-managed KMS key.
Q: How can I privately connect to Amazon Glue Schema Registry?
You can use Amazon PrivateLink to connect your data producer’s VPC to Amazon Glue by defining an interface VPC endpoint for Amazon Glue. When you use a VPC interface endpoint, communication between your VPC and Amazon Glue is conducted entirely within the Amazon Web Services network. For more information, please visit the user documentation.
Q: How can I monitor my Amazon Glue Schema Registry usage?
Amazon CloudWatch metrics are available as part of CloudWatch’s free tier. You can access these metrics in the CloudWatch Console. Visit the Amazon Glue Schema Registry user documentation for more information.
Q: Does Amazon Glue Schema Registry provide tools to manage user authorization?
Yes, the Schema Registry supports both resource-level permissions and identity-based IAM policies.
Q: How do I migrate from an existing schema registry to the Amazon Glue Schema Registry?
Steps to migrate from a third-party schema registry to Amazon Glue Schema Registry are available in the user documentation.
Extract, transform, and load (ETL)
Q: What programming language can I use to write my ETL code for Amazon Glue?
You can use either Scala or Python.
Q: How can I customize the ETL code generated by Amazon Glue?
Amazon Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using Amazon Glue’s custom library or write arbitrary code in Scala or Python, either by editing inline in the Amazon Glue Console script editor or by downloading the auto-generated code and editing it in your own IDE. You can also start with one of the many samples hosted in our GitHub repository and customize that code.
Q: Can I import custom libraries as part of my ETL script?
Yes. You can import custom Python libraries and Jar files into your Amazon Glue ETL job. For more details, please check our documentation here.
Q: Can I bring my own code?
Yes. You can write your own code using Amazon Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.
Q: How can I develop my ETL code using my own IDE?
You can create and connect to development endpoints that offer ways to connect your notebooks and IDEs.
Q: How can I build an end-to-end ETL workflow using multiple jobs in Amazon Glue?
In addition to the ETL library and code generation, Amazon Glue provides a robust set of orchestration features that allow you to manage dependencies between multiple jobs to build end-to-end ETL workflows. Amazon Glue ETL jobs can either be triggered on a schedule or on a job completion event. Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event. You can also trigger one or more Glue jobs from an external source such as an Amazon Lambda function.
Q: How does Amazon Glue monitor dependencies?
Amazon Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs. You can either have a scheduled trigger that invokes jobs periodically, an on-demand trigger, or a job completion trigger.
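The job completion trigger described above can be sketched as a request for the `create_trigger` call in the Python SDK (boto3). The trigger and job names are hypothetical, and the API call is wrapped in a function that is not invoked here, since it needs credentials and existing jobs.

```python
def build_completion_trigger(name, watched_job, started_jobs):
    """Assemble create_trigger arguments: start jobs after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": watched_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        # Every job listed here is started when the condition is met.
        "Actions": [{"JobName": job} for job in started_jobs],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Submit the request; requires boto3, credentials, and existing jobs."""
    import boto3  # imported lazily so the builder works without the SDK

    return boto3.client("glue").create_trigger(**request)


# Hypothetical job names, for illustration only.
trigger = build_completion_trigger(
    name="after-ingest",
    watched_job="ingest-raw",
    started_jobs=["clean-orders", "clean-customers"],
)
```

Listing two jobs under `Actions` corresponds to running them in parallel once the watched job completes; chaining several such triggers gives sequential execution.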
Q: How does Amazon Glue handle errors?
Amazon Glue monitors job event metrics and errors, and pushes all notifications to Amazon CloudWatch. With Amazon CloudWatch, you can configure a host of actions that can be triggered based on specific notifications from Amazon Glue. For example, if you get an error or a success notification from Glue, you can trigger an Amazon Lambda function. Glue also provides default retry behavior that will retry all failures three times before sending out an error notification.
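The retry behavior can be pictured with a small, service-independent sketch. It assumes that "retry three times" means one initial attempt plus up to three retries, and that a notification is sent only after the final failure; the function and parameter names are illustrative, not part of Amazon Glue.

```python
def run_with_retries(task, max_retries=3, notify=print):
    """Run task; on failure retry up to max_retries times, then notify and re-raise."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except Exception as error:
            if attempts > max_retries:
                # All retries are used up: report the error and give up.
                notify(f"job failed after {attempts} attempts: {error}")
                raise


# A hypothetical job that fails twice before succeeding.
calls = []


def flaky_job():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_job)
```

Here the job succeeds on its third attempt, so no notification is sent. A job that never succeeds would be attempted four times in total before the error is reported.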
Q: Can I run my existing ETL jobs with Amazon Glue?
Yes. You can run your existing Scala or Python code on Amazon Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
Q: How can I use Amazon Glue to ETL streaming data?
Amazon Glue ETL is batch oriented, and you can schedule your ETL jobs at intervals of no less than 5 minutes. While it can process micro-batches, it does not handle streaming data. If your use case requires you to ETL data while you stream it in, you can perform the first leg of your ETL using Amazon Kinesis Data Firehose, store the data in either Amazon S3 or Amazon Redshift, and then trigger a Glue ETL job to pick up that dataset and continue applying additional transformations to it.
Q: Do I have to use both Amazon Glue Data Catalog and Glue ETL to use the service?
No. While we do believe that using both the Amazon Glue Data Catalog and ETL provides an end-to-end ETL experience, you can use either one of them independently without using the other.
Amazon Glue DataBrew
Q: What is Amazon Glue DataBrew?
Amazon Glue DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to prepare data with an interactive, point-and-click visual interface without writing code. With Glue DataBrew, you can easily visualize, clean, and normalize terabytes, and even petabytes, of data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. Amazon Glue DataBrew is generally available in the regions where Amazon Glue is available.
Q: Who can use Amazon Glue DataBrew?
Amazon Glue DataBrew is built for users who need to clean and normalize data for analytics and machine learning. Data analysts and data scientists are the primary users. For data analysts, examples of job functions are business intelligence analysts, operations analysts, market intelligence analysts, legal analysts, financial analysts, economists, quants, or accountants. For data scientists, examples of job functions are materials scientists, bioanalytical scientists, and scientific researchers.
Q: What types of transformations are supported in Amazon Glue DataBrew?
You can choose from over 250 built-in transformations to combine, pivot, and transpose the data without writing code. Amazon Glue DataBrew also automatically recommends transformations such as filtering anomalies, correcting invalid, incorrectly classified, or duplicate data, normalizing data to standard date and time values, or generating aggregates for analyses. For complex transformations, such as converting words to a common base or root word, Glue DataBrew provides transformations that use advanced machine learning techniques such as Natural Language Processing (NLP). You can group multiple transformations together, save them as recipes, and apply the recipes directly to the new incoming data.
Q: What file formats does Amazon Glue DataBrew support?
For input data, Amazon Glue DataBrew supports commonly used file formats, such as comma-separated values (.csv), JSON and nested JSON, Apache Parquet and nested Apache Parquet, and Excel sheets. For output data, Amazon Glue DataBrew supports comma-separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC and XML.
Q: Can I try Amazon Glue DataBrew for free?
Yes. Sign up for an Amazon Free Tier account, then visit the Amazon Glue DataBrew Management Console, and get started instantly for free. If you are a first-time user of Glue DataBrew, the first 40 interactive sessions are free. Visit the Amazon Glue Pricing page to learn more.
Q: Do I need to use Amazon Glue Data Catalog or Amazon Lake Formation to use Amazon Glue DataBrew?
No. You can use Amazon Glue DataBrew without using either Amazon Glue Data Catalog or Amazon Lake Formation. If you use Glue Data Catalog to store schema and metadata, Glue DataBrew automatically infers schema from the Glue Data Catalog. If your data is centralized and secured in Amazon Lake Formation, DataBrew users can use all data sets available to them from its centralized data catalog.
Q: Can I retain a record of all changes made to my data?
Yes. You can visually track all the changes made to your data in the Amazon Glue DataBrew Management Console. The visual view makes it easy to trace the changes and relationships made to the datasets, projects, and recipes, and all other associated jobs. In addition, Glue DataBrew keeps all account activities as logs in Amazon CloudTrail.
Amazon Web Services Product Integrations
Q: When should I use Amazon Glue vs. Amazon EMR?
Amazon Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. Amazon Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Q: When should I use Amazon Glue vs. Amazon Database Migration Service?
Amazon Database Migration Service (DMS) helps you migrate databases to Amazon Web Services easily and securely. For use cases which require a database migration from on-premises to Amazon Web Services or database replication between on-premises sources and sources on Amazon Web Services, we recommend you use Amazon DMS. Once your data is in Amazon Web Services, you can use Amazon Glue to move and transform data from your data source into another database or data warehouse, such as Amazon Redshift.
Pricing and billing
Q: How am I charged for Amazon Glue?
You will pay a simple monthly fee for storing and accessing the metadata in the Amazon Glue Data Catalog. Additionally, you will pay an hourly rate, billed per second, for each ETL job and crawler run, with a 10-minute minimum for each. If you choose to use a development endpoint to interactively develop your ETL code, you will pay an hourly rate, billed per second, for the time your development endpoint is provisioned, with a 10-minute minimum. For more details, please refer to our pricing page.
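The per-second billing with a 10-minute minimum can be worked through with a short calculation. This sketch assumes the hourly rate applies per DPU, and the rate of 0.50 used below is a made-up figure for the arithmetic only; see the pricing page for actual rates.

```python
def billable_cost(runtime_seconds, dpus, rate_per_dpu_hour, minimum_seconds=600):
    """Cost of one run: per-second billing with a 10-minute (600 s) minimum."""
    billable_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billable_seconds / 3600) * rate_per_dpu_hour


# Hypothetical rate of 0.50 per DPU-hour, for illustration only.
short_run = billable_cost(runtime_seconds=300, dpus=10, rate_per_dpu_hour=0.50)
long_run = billable_cost(runtime_seconds=3600, dpus=2, rate_per_dpu_hour=0.50)
```

The 5-minute run is billed as if it had lasted 10 minutes, while the one-hour run is billed for exactly the time it took.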
Q: When does billing for my Amazon Glue jobs begin and end?
Billing commences as soon as the job is scheduled for execution and continues until the entire job completes. With Amazon Glue, you only pay for the time for which your job runs and not for the environment provisioning or shutdown time.
Security and availability
Q: How does Amazon Glue keep my data secure?
We provide server side encryption for data at rest and SSL for data in motion.
Q: What are the service limits associated with Amazon Glue?
Please refer to our documentation to learn more about service limits.
Q: How many DPUs (Data Processing Units) are allocated to the development endpoint?
A development endpoint is provisioned with 5 DPUs by default. You can configure a development endpoint with a minimum of 2 DPUs and a maximum of 5 DPUs.
Q: How do I scale the size and performance of my Amazon Glue ETL jobs?
You can simply specify the number of DPUs (Data Processing Units) you want to allocate to your ETL job. A Glue ETL job requires a minimum of 2 DPUs. By default, Amazon Glue allocates 10 DPUs to each ETL job.
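The DPU defaults and limits stated in the two answers above can be captured in a small validation helper. The constant and function names are illustrative and not part of any Amazon Glue API; the numbers are taken from the text.

```python
JOB_DEFAULT_DPUS = 10
JOB_MIN_DPUS = 2
DEV_ENDPOINT_DEFAULT_DPUS = 5
DEV_ENDPOINT_MIN_DPUS = 2
DEV_ENDPOINT_MAX_DPUS = 5


def resolve_job_dpus(requested=None):
    """Return the DPU count for an ETL job, applying the default and minimum."""
    if requested is None:
        return JOB_DEFAULT_DPUS
    if requested < JOB_MIN_DPUS:
        raise ValueError(f"an ETL job needs at least {JOB_MIN_DPUS} DPUs")
    return requested


def resolve_dev_endpoint_dpus(requested=None):
    """Return the DPU count for a development endpoint, within its allowed range."""
    if requested is None:
        return DEV_ENDPOINT_DEFAULT_DPUS
    if not DEV_ENDPOINT_MIN_DPUS <= requested <= DEV_ENDPOINT_MAX_DPUS:
        raise ValueError(
            f"a development endpoint takes {DEV_ENDPOINT_MIN_DPUS} "
            f"to {DEV_ENDPOINT_MAX_DPUS} DPUs"
        )
    return requested


job_dpus = resolve_job_dpus()             # falls back to the default
endpoint_dpus = resolve_dev_endpoint_dpus(3)
```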
Q: How do I monitor the execution of my Amazon Glue jobs?
Amazon Glue provides the status of each job and pushes all notifications to Amazon CloudWatch. You can set up SNS notifications via CloudWatch actions to be informed of job failures or completions.
Service Level Agreement
Q: What does the Amazon Glue SLA guarantee?
Our Amazon Glue SLA guarantees a Monthly Uptime Percentage of at least 99.9% for Amazon Glue.
Q: How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA credit for Amazon Glue under the Amazon Glue SLA if more than one Availability Zone in which you are running a task, within the same region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.