Presto is an open-source distributed SQL query engine optimized for low-latency, ad hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3. Presto has two community projects – PrestoDB and PrestoSQL. Amazon EMR supports both projects.
You can quickly create managed Presto clusters from the Amazon Web Services Management Console, the Amazon Web Services CLI, or the Amazon EMR API. You can also take advantage of other Amazon EMR features, including fast Amazon S3 connectivity, integration with Amazon EC2 Spot instances, a wide choice of Amazon EC2 instance types (including memory-optimized instances), and resize commands that add or remove instances from your cluster.
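As a rough sketch of the API route, the request for a small Presto cluster might be assembled as follows. The helper name `build_presto_cluster_request`, the cluster name, the release label, and the instance types and counts are illustrative assumptions, and the role names assume the default EMR roles already exist in the account. The dictionary is shaped for boto3's `run_job_flow` call, which is left commented out so the snippet runs without credentials.

```python
def build_presto_cluster_request(name="presto-adhoc",
                                 release_label="emr-5.36.0",
                                 core_nodes=2):
    """Return keyword arguments shaped for the EMR RunJobFlow API.

    All default values here are assumptions chosen for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Ask EMR to install Presto on the cluster.
        "Applications": [{"Name": "Presto"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "r5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "r5.xlarge", "InstanceCount": core_nodes},
            ],
            # Keep the cluster running for interactive queries.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_presto_cluster_request()

# With boto3 installed and credentials configured, the request could be sent as:
# import boto3
# cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Building the request separately from the call makes it easy to inspect or adjust before any resources are created.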
Features and benefits
Interactive query performance
Presto uses a custom query execution engine with operators designed to support SQL semantics. Unlike Hive on MapReduce, Presto executes queries in memory and pipelines data across the network between stages, which avoids unnecessary I/O. The pipelined execution model runs multiple stages in parallel and streams data from one stage to the next as it becomes available.
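The streaming idea can be illustrated with a toy Python pipeline. This is only an analogy, not Presto's engine: it runs in a single process, so it shows rows flowing from one stage to the next as they become available, but not parallel execution or network transfer. The stage names and sample rows are made up for the example.

```python
from collections import defaultdict


def scan(rows, log):
    """Source stage: emit rows one at a time."""
    for row in rows:
        log.append(("scan", row["id"]))
        yield row


def filter_stage(rows, log, min_amount):
    """Pass a row downstream as soon as it arrives and qualifies."""
    for row in rows:
        if row["amount"] >= min_amount:
            log.append(("filter", row["id"]))
            yield row


def aggregate(rows):
    """Final stage: sum amounts per region while consuming the stream."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


ROWS = [
    {"id": 1, "region": "eu", "amount": 40},
    {"id": 2, "region": "us", "amount": 5},
    {"id": 3, "region": "eu", "amount": 25},
    {"id": 4, "region": "us", "amount": 30},
]

log = []
totals = aggregate(filter_stage(scan(ROWS, log), log, min_amount=10))
print(totals)   # {'eu': 65, 'us': 30}
print(log[:2])  # [('scan', 1), ('filter', 1)]
```

The log shows row 1 reaching the filter before row 2 is scanned, so no stage waits for the previous one to finish and no intermediate result is written out in full.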
Ease of use
You can launch an Amazon EMR cluster running Presto in minutes. You don’t need to worry about node provisioning, cluster setup, configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis. You can also use tools such as Airpal, a web-based query execution tool open-sourced by Airbnb. Airpal’s user interface simplifies data exploration and ad hoc analysis, with features such as syntax highlighting, exporting results to CSV, saving queries for later use, and browsing tables to view their schemas.
Integration with Amazon EMR feature set
Run interactive queries that directly access data in Amazon S3, save costs using Amazon EC2 Spot instance capacity, use EMR Managed Scaling to dynamically add and remove capacity, and launch long-running or ephemeral clusters to match your workload. You can also add other Hadoop ecosystem applications to your cluster.
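To make the Spot and Managed Scaling options concrete, here is a hedged sketch of the two request fragments involved. The helper names, instance type, and capacity limits are assumptions for illustration. The dictionaries follow the shapes boto3's EMR client accepts for an instance group and for `put_managed_scaling_policy`, and no AWS call is made.

```python
def spot_task_group(instance_type="r5.xlarge", count=2):
    """Instance group definition that requests Spot capacity for task nodes."""
    return {
        "Name": "Task (Spot)",
        "InstanceRole": "TASK",
        "Market": "SPOT",  # "ON_DEMAND" is the alternative market value
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def managed_scaling_policy(min_units=2, max_units=10):
    """Managed scaling limits, expressed as a number of instances."""
    if min_units > max_units:
        raise ValueError("min_units must not exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


task_group = spot_task_group()
policy = managed_scaling_policy()

# With boto3 and an existing cluster ID (hypothetical variable cluster_id):
# import boto3
# boto3.client("emr").put_managed_scaling_policy(
#     ClusterId=cluster_id, ManagedScalingPolicy=policy)
```

The task group would be added to the `InstanceGroups` list of a cluster request, while the scaling policy can be attached to a cluster that is already running.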
ANSI SQL support
Presto supports the ANSI SQL standard, which makes it easy for data analysts and developers to query both structured and unstructured data at scale. Currently, Presto supports a wide variety of SQL functionality, including complex queries, aggregations, joins, and window functions.
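The following sketch shows the kind of query this covers: a join combined with a window aggregate. So that it runs locally, it uses Python's built-in `sqlite3` module as a stand-in for Presto (window functions need SQLite 3.25 or later). The `orders` and `customers` tables are hypothetical. The statement sticks to standard SQL, so it should be portable to a Presto cluster, though that is not verified here.

```python
import sqlite3

# Join orders to customers, then compute a per-region total with a window function.
QUERY = """
SELECT c.region,
       o.order_id,
       o.amount,
       SUM(o.amount) OVER (PARTITION BY c.region) AS region_total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
ORDER BY c.region, o.order_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'eu'), (2, 'us');
INSERT INTO orders VALUES (10, 1, 40), (11, 2, 30), (12, 1, 25);
""")
rows = conn.execute(QUERY).fetchall()
conn.close()

for row in rows:
    print(row)
# ('eu', 10, 40, 65)
# ('eu', 12, 25, 65)
# ('us', 11, 30, 30)
```

Each row keeps its own order amount while also carrying the total for its region, which is what distinguishes a window function from a plain `GROUP BY` aggregation.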