General
Q: What is a data lake?
A: A data lake is a scalable central repository of large quantities and varieties of data, both structured and unstructured. Data lakes enable you to manage the full lifecycle of your data. The first step of building a data lake is ingesting and cataloging data from a variety of sources. The data is then enriched, combined, and cleaned before analysis. This makes it easy to discover and analyze the data with direct queries, visualization, and machine learning. Data lakes complement traditional data warehouses, providing more flexibility, cost-effectiveness, and scalability for ingestion, storage, transformation, and analysis of your data. They also help you overcome the traditional challenges of building and maintaining data warehouses, as well as the limits they place on the types of analysis you can run.
Q: What is Amazon Lake Formation?
A: Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning. Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon S3 data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from Amazon analytic and machine learning services. Lake Formation automatically manages access to the registered data in Amazon S3 via services including Amazon Glue, Amazon Athena, Amazon Redshift, and (in beta) Amazon EMR Notebooks and Zeppelin notebooks with Apache Spark, to ensure compliance with your defined policies. If you’ve set up transformation jobs spanning Amazon services, Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the execution of your jobs. With Lake Formation, you can configure and manage your data lake without manually integrating multiple underlying Amazon Web Services services.
Q: Why should I use Lake Formation to build my data lake?
A: Lake Formation makes it easy to build, secure, and manage your Amazon Web Services data lake. Lake Formation integrates with underlying Amazon Web Services security, storage, analysis, and machine learning services and automatically configures them to comply with your centrally defined access policies. It also gives you a single console to monitor your jobs and your data transformation and analytic workflows.
Lake Formation can manage data ingestion via Amazon Glue. Data is automatically classified and relevant data definitions, schema, and metadata are stored in the central data catalog. Amazon Glue also converts your data to your choice of open data formats to be stored in S3 and cleans your data to remove duplicates and link records across data sets. Once your data is in your S3 data lake, you can define access policies, including table and column level access controls, and enforce encryption for data at rest. You can then use a wide variety of Amazon analytic and machine learning services to access your data lake. All access is secured, governed, and auditable.
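For example, column-level grants can be issued programmatically with the Lake Formation API. The following is a minimal sketch using boto3 (the AWS SDK for Python); the role ARN, region, database, table, and column names are placeholders.

```python
import boto3

# Lake Formation client; the region is a placeholder.
lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant an analyst role SELECT on just two columns of a table, so
# Lake Formation enforces column-level access across analytic services.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```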
Q: What kind of problems does the FindMatches ML Transform solve?
A: FindMatches generally solves Record Linkage and Data Deduplication problems. Deduplication is what you have to do when you are trying to identify records in a database which are conceptually “the same”, but for which you have separate records. This problem is trivial if duplicate records can be identified by a unique key (for instance if products can be uniquely identified by a UPC Code), but becomes very challenging when you have to do a “fuzzy match”.
Record linkage is basically the same problem as data deduplication under the hood, but this term usually means that you are doing a “fuzzy join” of two databases that do not share a unique key rather than deduplicating a single database. As an example, consider the problem of matching a large database of customers to a small database of known fraudsters. FindMatches can be used on both record linkage and deduplication problems.
For instance, Lake Formation's FindMatches ML Transform can help you with the following problems:
- Linking patient records between hospitals so that doctors have more background information and are better able to treat patients by using FindMatches on separate databases that both contain common fields such as name, birthday, home address, phone number, etc.
- Deduplicating a database of movies containing columns like “title”, “plot synopsis”, “year of release”, “run time”, and “cast”. For instance, the same movie might be variously identified as “Star Wars”, “Star Wars: A New Hope”, and “Star Wars: Episode IV—A New Hope (Special Edition)”.
- Automatically grouping all related products together in your storefront by identifying equivalent items in an apparel product catalog, where you want to define “equivalent” to mean that they are the same ignoring differences in size and color. Hence “Levi 501 Blue Jeans, size 34x34” is defined to be the same as “Levi 501 Jeans--black, Size 32x31”.
Q: How does Lake Formation deduplicate my data?
A: Lake Formation's FindMatches ML Transform makes it easy to find and link records that refer to the same entity but don’t share a reliable identifier. Before FindMatches, developers would commonly solve data-matching problems deterministically, by writing huge numbers of hand-tuned rules. FindMatches uses machine learning algorithms behind the scenes to learn how to match records according to each developer's own business criteria. FindMatches first identifies candidate records for the customer to label as matching or not matching, and then uses machine learning to create an ML Transform. Customers can then execute this transform on their database to find matching records, or they can ask FindMatches for additional records to label to push their ML Transform to higher levels of accuracy.
Q: What are ML Transforms?
A: ML Transforms provide a way to create and manage machine-learned transforms. Once created and trained, these ML Transforms can be executed in standard Amazon Glue scripts. Customers select a particular algorithm (for example, the FindMatches ML Transform) and provide input datasets, training examples, and the tuning parameters needed by that algorithm. Amazon Lake Formation uses those inputs to build an ML Transform that can be incorporated into a normal ETL job workflow.
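As an illustration, an ML Transform can also be created programmatically. The sketch below uses the boto3 Glue API with hypothetical database, table, column, and role names; tune the trade-off parameters to your own business criteria.

```python
import boto3

glue = boto3.client("glue")

# Create a FindMatches ML Transform over a catalogued source table.
response = glue.create_ml_transform(
    Name="dedupe-customers",
    Role="arn:aws:iam::111122223333:role/GlueServiceRole",
    MaxCapacity=10.0,
    InputRecordTables=[{"DatabaseName": "crm_db", "TableName": "customers"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            # Values between 0.0 and 1.0 that balance precision vs. recall
            # and accuracy vs. run cost.
            "PrecisionRecallTradeoff": 0.5,
            "AccuracyCostTradeoff": 0.5,
        },
    },
)
print(response["TransformId"])
```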
Q: How do ML Transforms work?
A: Lake Formation includes specialized ML-based dataset transformation algorithms customers can use to create their own ML Transforms. These include record de-duplication and match finding.
Customers start by navigating to the ML Transforms tab in the Lake Formation console (or by using the ML Transforms service endpoints or the CLI) to create their first ML Transform model. The ML Transforms tab provides a user-friendly view for managing these transforms. ML Transforms have workflow requirements distinct from other transforms, including the need for separate training, parameter-tuning, and execution workflows; the need to estimate the quality metrics of generated transformations; and the need to manage and collect additional truth labels for training and active learning.
To create an ML transform via the console, customers first select the transform type (such as Record Deduplication or Record Matching) and provide the appropriate data sources previously discovered in Data Catalog. Depending on the transform, customers may then be asked to provide ground truth label data for training or additional parameters. Customers can monitor the status of their training jobs and view quality metrics for each transform. (Quality metrics are reported using a hold-out set of the customer-provided label data.)
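For illustration, the labeling and quality-estimation steps above correspond roughly to the following Glue API calls; this is a hedged sketch, and the transform ID and S3 paths are placeholders.

```python
import boto3

glue = boto3.client("glue")
transform_id = "tfm-0123456789abcdef"  # hypothetical transform ID

# 1. Have the service generate a labeling set for you to annotate.
glue.start_ml_labeling_set_generation_task_run(
    TransformId=transform_id,
    OutputS3Path="s3://my-bucket/labeling-sets/",
)

# 2. After filling in the label column, import the ground-truth labels.
glue.start_import_labels_task_run(
    TransformId=transform_id,
    InputS3Path="s3://my-bucket/labels/labeled.csv",
    ReplaceAllLabels=False,
)

# 3. Kick off a quality-estimation run, then read the reported metrics.
glue.start_ml_evaluation_task_run(TransformId=transform_id)
metrics = glue.get_ml_transform(TransformId=transform_id).get("EvaluationMetrics")
print(metrics)
```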
Once satisfied with the performance, customers can promote ML Transforms models for use in production. ML Transforms can then be used during ETL workflows, both in code autogenerated by the service and in user-defined scripts submitted with other jobs, similar to pre-built transforms offered in Amazon Glue libraries.
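The sketch below shows what using a trained FindMatches transform inside a Glue ETL (PySpark) script can look like, following the shape of the scripts the console auto-generates; the transform ID, database, table, and output path are placeholders, and the awsglue/awsglueml modules are only available inside the Glue job environment.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the catalogued source table.
source = glue_context.create_dynamic_frame.from_catalog(
    database="crm_db", table_name="customers"
)

# Apply the trained transform; matching records share a generated match_id.
matched = FindMatches.apply(
    frame=source,
    transformId="tfm-0123456789abcdef",
    transformation_ctx="findmatches",
)

# Write the matched output back to the data lake.
glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/matched-customers/"},
    format="parquet",
)
```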
Q: How does Lake Formation relate to other Amazon Web Services services?
A: Lake Formation manages data access for registered data that is stored in S3, and manages query access from Amazon Glue, Athena, Redshift, and (in beta) EMR Notebooks and Zeppelin notebooks for EMR with Apache Spark through a unified security model and permissions. Lake Formation can ingest data from S3, Amazon RDS databases, and Amazon CloudTrail logs, understand their formats, and make data clean and queryable. Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the execution of your jobs.
Q: How does Lake Formation relate to Amazon Glue?
A: Lake Formation leverages a shared infrastructure with Amazon Glue, including console controls, ETL code creation and job monitoring, blueprints to create workflows for data ingest, the same data catalog, and a serverless architecture. While Amazon Glue focuses on these types of functions, Lake Formation encompasses all Amazon Glue features and provides additional capabilities designed to help build, secure, and manage a data lake.
ETL and catalog
Q: How does Lake Formation help me discover the data I can move into my data lake?
A: Lake Formation automatically discovers all Amazon Web Services data sources to which it is provided access by your Amazon IAM policies. It crawls S3, RDS, and CloudTrail sources and, through blueprints, identifies them to you as data that can be ingested into your data lake. No data is ever moved or made accessible to analytic services without your permission. You can also use Amazon Glue to ingest data from other sources, including S3 and DynamoDB.
You can also define JDBC connections to allow Lake Formation to access your Amazon Web Services databases and on-premises databases including Oracle, MySQL, Postgres, SQL Server, and MariaDB.
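For example, a JDBC connection and a crawler that covers both an S3 prefix and an on-premises MySQL source can be defined with the Glue API. This is a sketch; the connection URL, credentials, role, database, and paths are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Define a JDBC connection to an on-premises MySQL database.
glue.create_connection(
    ConnectionInput={
        "Name": "onprem-mysql",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://db.example.internal:3306/sales",
            "USERNAME": "crawler_user",
            "PASSWORD": "********",
        },
    }
)

# Crawl both an S3 prefix and the JDBC source into one catalog database.
glue.create_crawler(
    Name="sales-sources-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={
        "S3Targets": [{"Path": "s3://my-bucket/raw/sales/"}],
        "JdbcTargets": [{"ConnectionName": "onprem-mysql", "Path": "sales/%"}],
    },
)
glue.start_crawler(Name="sales-sources-crawler")
```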
Lake Formation ensures that all your data is described in a central data catalog, giving you one location to browse the data that you have permission to view and query. The permissions are defined in your data access policy and can be set at the table and column level.
In addition to the properties automatically populated by the crawlers, you can add labels, including business attributes such as data sensitivity, at the table or column level, and add field-level comments.
Q: How does Lake Formation organize my data in a data lake?
A: You can use one of the blueprints available in Lake Formation to ingest data into your data lake. Lake Formation creates Glue workflows that crawl source tables, extract the data, and load it to S3. In S3, Lake Formation organizes the data for you, setting up partitions and data formats for optimized performance and cost. For data already in Amazon S3, you can register those buckets with Lake Formation to manage them.
Lake Formation also crawls your data lake to maintain a data catalog and provides an intuitive user interface for you to search entities (by type, classification, attribute, or free-form text).
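Registering an existing S3 location can also be done with a single API call. The sketch below uses boto3 with a placeholder bucket path.

```python
import boto3

lf = boto3.client("lakeformation")

# Register an existing S3 prefix so Lake Formation manages access to it.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-existing-data-bucket/curated/",
    UseServiceLinkedRole=True,  # use the Lake Formation service-linked role
)
```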
Q: How does Lake Formation use machine learning to clean my data?
A: Lake Formation provides jobs that run machine learning algorithms to perform de-duplication and link matching records. Creating ML Transforms is as easy as selecting your source, selecting a desired transform, and providing training data for the changes you would like performed. Once trained to your satisfaction, the ML Transforms can be run as part of your regular data movement workflows, with no machine learning expertise required.
Q: What are other ways I can ingest data to Amazon Web Services for use with Lake Formation?
A: Customers can move petabytes to exabytes of data from their datacenters to Amazon Web Services using physical appliances with Amazon Snowball, Amazon Snowball Edge, and Amazon Snowmobile or connect their on-premises applications directly to Amazon Web Services with Amazon Storage Gateway. Customers can accelerate data transfer using a dedicated network connection between a customer’s network and Amazon Web Services with Amazon Direct Connect or boost long distance global data transfers using Amazon’s globally distributed edge locations with Amazon S3 Transfer Acceleration. Amazon Kinesis also provides a useful way to load streaming data to S3. Lake Formation Data Importers can be set up to perform ongoing ETL jobs and prepare ingested data for analysis.
Q: Can I use my existing data catalog or Hive Metastore with Lake Formation?
A: Lake Formation provides a way for you to import your existing catalog and metastore into the Data Catalog. However, Lake Formation requires your metadata to reside in the Data Catalog to ensure governed access to your data.
Security and governance
Q: How does Lake Formation protect my data?
A: Lake Formation protects your data by giving you a central location where you can configure granular data access policies that protect your data, regardless of which services are used to access it.
To centralize data access policy controls using Lake Formation, first shut down direct access to your buckets in S3 so all data access is managed by Lake Formation. Next, configure data protection and access policies using Lake Formation, which enforces those policies across all the Amazon Web Services services accessing data in your lake. You can configure users and roles and define the data these roles can access, down to the table and column level.
Lake Formation currently supports server-side encryption on S3 (SSE-S3, AES-256). Lake Formation also supports private endpoints in your VPC and records all activity in Amazon CloudTrail, so you have network isolation and auditability.
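As a sketch of centralizing control, you can designate data lake administrators and remove the default permissions on newly created catalog objects so access comes only from explicit Lake Formation grants; the admin role ARN below is a placeholder.

```python
import boto3

lf = boto3.client("lakeformation")

lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/LakeAdmin"}
        ],
        # Empty lists mean no implicit permissions are granted on newly
        # created databases and tables.
        "CreateDatabaseDefaultPermissions": [],
        "CreateTableDefaultPermissions": [],
    }
)
```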
Q: How does Lake Formation work with Amazon IAM?
A: Lake Formation integrates with IAM so authenticated users and roles can be automatically mapped to data protection policies that are stored in the Data Catalog. The IAM integration also enables you to use Microsoft Active Directory or LDAP to federate into IAM using SAML.
Enabling data access
Q: How does Lake Formation help an analyst or data scientist discover what data they can access?
A: Lake Formation ensures that all your data is described in the Data Catalog, giving you a central location to browse the data that you have permission to view and query. The permissions are defined in your data access policy and can be set at the table and column level.
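An analyst can also browse the catalog programmatically. The sketch below uses the Glue SearchTables API via boto3, which returns only tables the caller has permission to see; the search text is a placeholder.

```python
import boto3

glue = boto3.client("glue")

# Free-text search over the Data Catalog, scoped to the caller's permissions.
resp = glue.search_tables(SearchText="orders", MaxResults=25)
for table in resp["TableList"]:
    print(table["DatabaseName"], table["Name"])
```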
Q: Can I use third-party business intelligence tools with Lake Formation?
A: Yes, you can use your third-party business applications, like Tableau and Looker, to connect to your Amazon Web Services data sources through services like Athena or Redshift. Access to data is managed by the underlying Data Catalog, so regardless of which application you use, you are assured that access to your data is governed and controlled.
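Under the hood, those tools issue queries through services like Athena. The sketch below shows the equivalent call from boto3; the database, table, and results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Run a query against a governed table; results land in the given S3 path.
run = athena.start_query_execution(
    QueryString="SELECT order_id, order_date FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(run["QueryExecutionId"])
```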
Q: Does Lake Formation provide APIs or a CLI?
A: Yes, Lake Formation provides APIs and a CLI to integrate Lake Formation functionality into your custom applications. Java and C++ SDKs are also available to enable you to integrate your own data engines with Lake Formation.
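For example, here is a brief sketch of calling the Lake Formation API from Python to list the explicit grants on a table; the database and table names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# List the principals and permissions explicitly granted on one table.
resp = lf.list_permissions(
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}}
)
for grant in resp["PrincipalResourcePermissions"]:
    print(grant["Principal"]["DataLakePrincipalIdentifier"], grant["Permissions"])
```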