Apache Spark Clusters for Everyone: Easy Access to Amazon EMR Spark Clusters Using R and Python

Published on Jul 08, 2016

Using systems like Apache Spark, big data analysis is becoming more accessible from high-level languages like R and Python. However, many analysts are unprepared for the challenges of setting up a big data analytical environment. In this talk, we outline a process that allows anyone in an organization to quickly spin up elastic Spark clusters and then analyze data through RStudio and SparkR, or through Jupyter Notebooks and pySpark. The resulting system is affordable, powerful, and incredibly accessible: with just two clicks and a 15-minute wait, each analyst can have their own cluster.

This session will cover the following: the Amazon EMR bootstrap process for installing high-level language support on top of Spark (specifically SparkR and pySpark); dynamic port forwarding with SSH and FoxyProxy for browser access to RStudio and Jupyter; convenient data loading from Amazon S3 to EMR without leaving RStudio or Jupyter; and automating the startup process for nontechnical data analysts and researchers. Finally, we will share some short studies that demonstrate the power of Spark on EMR to interactively analyze massive datasets and discern public policy insights.
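To give a flavor of the automated, "two-click" cluster launch described above, here is a minimal sketch of starting an EMR cluster with Spark and a custom bootstrap action using the boto3 library. The cluster name, EMR release, instance types, key pair, and bootstrap script path are placeholders for illustration, not the talk's actual configuration.

    # Sketch: launch an EMR cluster with Spark installed and a custom bootstrap
    # action that sets up RStudio/Jupyter on the master node. All names, sizes,
    # and S3 paths below are illustrative placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="analyst-spark-cluster",
        ReleaseLabel="emr-4.7.1",              # an EMR release current in mid-2016
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 5,
            "Ec2KeyName": "my-key-pair",       # placeholder EC2 key pair
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive work
        },
        BootstrapActions=[
            {
                "Name": "Install RStudio and Jupyter",
                "ScriptBootstrapAction": {
                    "Path": "s3://my-bucket/bootstrap/install-rstudio-jupyter.sh",  # placeholder script
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    print(response["JobFlowId"])

Setting KeepJobFlowAliveWhenNoSteps keeps the cluster running for interactive sessions rather than terminating it after a batch step; wrapping a call like this in a small script or form is one way a launch can be reduced to a couple of clicks for nontechnical users.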
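The browser-access step relies on an SSH tunnel with dynamic (SOCKS) port forwarding, which FoxyProxy then directs the browser through. A minimal sketch follows, wrapped in Python for consistency with the other examples; the key file, master public DNS name, and local port are placeholders (8157 is the port commonly shown in the EMR documentation for this setup).

    # Sketch: open a SOCKS proxy to the EMR master node via SSH dynamic port
    # forwarding, so a browser configured with FoxyProxy can reach RStudio and
    # Jupyter running on the cluster. This call blocks while the tunnel is open.
    import subprocess

    subprocess.call([
        "ssh",
        "-i", "my-key-pair.pem",  # the EC2 key pair used to launch the cluster
        "-N",                     # forward ports only; do not run a remote command
        "-D", "8157",             # dynamic (SOCKS) forwarding on local port 8157
        "hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com",  # placeholder master public DNS
    ])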
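Loading data from Amazon S3 without leaving the notebook looks roughly like the following in pySpark, using the Spark 1.6-era API shipped with EMR 4.x at the time. The bucket, prefix, and column name are placeholders, and in a notebook on the cluster the contexts are typically created for you; they are created explicitly here so the snippet stands alone.

    # Minimal pySpark sketch: read JSON data directly from S3 via the cluster's
    # S3 filesystem support, then run a simple aggregation.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="s3-loading-example")
    sqlContext = SQLContext(sc)

    # Placeholder bucket, prefix, and column name, not a real dataset.
    df = sqlContext.read.json("s3://my-bucket/events/")
    df.groupBy("event_type").count().show()

The equivalent workflow in RStudio uses SparkR's data frame reader against the same s3:// paths, so analysts never need to copy data to the cluster by hand.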