Posted On: May 30, 2023
Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning that enables data scientists and developers to perform every step of the machine learning workflow, from preparing data to building, training, tuning, and deploying models. SageMaker Studio comes with built-in integration with Amazon EMR so that data scientists can interactively prepare data at petabyte scale using open source frameworks such as Apache Spark, Hive, and Presto right from within Studio notebooks. Very often data is stored in data lakes managed by Amazon Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism. We’re excited to announce that SageMaker Studio now supports applying this fine-grained data access control with Amazon Lake Formation when accessing data through Amazon EMR.
Until now, when you ran multiple data processing jobs on an EMR cluster, all the jobs used the same Amazon Identity and Access Management (IAM) role, namely the cluster’s EC2 Instance Profile, to access data. Therefore, to run jobs that needed access to different data sources, such as different S3 buckets, you had to configure the EC2 Instance Profile with policies that allowed access to the union of all such data sources. Additionally, to give groups of users different levels of access to data, you had to create a separate cluster for each group, which added operational overhead. Separately, jobs submitted to EMR from Studio notebooks were unable to apply fine-grained data access control with Amazon Lake Formation.
Starting today, when you connect to an Amazon EMR cluster from SageMaker Studio notebooks, you can visually browse and choose an IAM role on the fly, called the runtime IAM role. Subsequently, all your Apache Spark, Apache Hive, or Presto jobs created from Studio notebooks will access only the data and resources permitted by the policies attached to the runtime role. Also, when data is accessed from data lakes managed with Amazon Lake Formation, you can enforce table-level and column-level access using policies attached to the runtime role. With this new capability, multiple SageMaker Studio users can connect to the same EMR cluster, each using a runtime IAM role scoped with permissions matching their individual level of access to data. Their user sessions are also completely isolated from one another on the shared cluster. With this ability to control fine-grained access to data on a single shared cluster, customers can simplify provisioning of EMR clusters, reducing operational overhead and saving costs.
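To illustrate what a runtime role "scoped with permissions matching their individual level of access" might look like, the following is a minimal sketch that builds a read-only IAM policy document limited to one S3 prefix. The function name, bucket name, and prefix are hypothetical, and the sketch only produces the policy JSON; it does not create any role or call any Amazon Web Services API. The `aws-cn` partition is used because the feature is announced for the China Regions.

```python
import json


def runtime_role_policy(bucket: str, prefix: str, partition: str = "aws-cn") -> dict:
    """Build an IAM policy document granting read-only access to one S3 prefix.

    Hypothetical helper for illustration; bucket and prefix are placeholders.
    """
    bucket_arn = f"arn:{partition}:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the given prefix.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
            {
                # Allow reading objects under the given prefix.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Two user groups sharing one cluster, each with a differently scoped role.
    for team in ("marketing", "finance"):
        print(json.dumps(runtime_role_policy("example-data-lake", team), indent=2))
```

Under these assumptions, a policy like this would be attached to each group's runtime role, so the shared cluster's EC2 Instance Profile no longer needs access to the union of all data sources.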
This feature is generally available in Amazon SageMaker Studio when connecting to Amazon EMR 6.9 in both the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. To learn more about SageMaker Studio, visit the SageMaker user guide.