AWS re:Invent 2018: [REPEAT 1] A Deep Dive into What's New with Amazon EMR (ANT340-R1)
Amazon EMR is one of the largest Spark and Hadoop service providers in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, using notebooks, and other architectural best practices. We discuss lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. We showcase key improvements made to the service in 2018. We cover improvements to the Amazon EMR API, best practices for using Spot Instances on their own and together with Auto Scaling, improvements to Amazon S3 performance on Amazon EMR, and security, including authorization and authentication. We pair each of these with a demo or customer use case to illustrate the benefits. If you are an existing Amazon EMR user, you will walk away with a thorough understanding of the improvements made in 2018 and how they benefit you. If you are a new Amazon EMR user, you will gain an understanding of common use cases and how other customers are using Amazon EMR.