Hadoop has made the world of Big Data possible by providing a framework for distributed computing on inexpensive commodity hardware. Hadoop implements a distributed file system (HDFS) and a computing framework (YARN) that allow distributed applications to harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce, a programming model borrowed from functional programming that lends itself well to designing distributed applications but carries significant overhead, largely because intermediate results are written to disk between stages.
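
To make the MapReduce paradigm concrete, here is a minimal single-process sketch of a word count in plain Python. It is only an illustration of the map, shuffle, and reduce phases: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are made up for this example and are not part of any Hadoop API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical sample input; on a real cluster each line could live on a
# different machine.
counts = word_count([
    "spark runs on hadoop",
    "hadoop stores data in hdfs",
    "spark reads data from hdfs",
])
```

Even in this toy version, each phase produces its complete output before the next one starts. On a cluster those intermediate results are written to disk and moved across the network, which accounts for much of the overhead.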

That's where Spark comes in. Spark is an open source Apache project that can run on a Hadoop cluster, using YARN for resource management and reading from HDFS and other Hadoop data stores, while allowing developers to break out of the MapReduce paradigm and write faster, more expressive distributed applications. That flexibility means Spark can be used to implement more complex algorithms and applications that were impractical to express as MapReduce jobs. Spark can keep working data in memory, which can make it up to 100 times faster than disk-based MapReduce for some workloads, and it provides programming APIs in Scala, Java, and Python.

What You Will Learn

In this course, we will introduce Spark at a high level and then dive into several examples that highlight its features using PySpark, Spark's Python API. Writing Spark applications is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. Participants will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then write Spark applications that perform large cluster-based analyses, including SQL-like aggregations, machine learning, and graph algorithms and search.

Course Outline

The course will cover the following topics:

  • Interacting with Spark via the Spark Shell

  • Interacting with RDDs and other distributed data

  • Creating Spark applications in Python

  • Submitting Spark applications to the cluster

  • Aggregations and Queries using Spark SQL

  • Machine Learning with Spark MLlib

  • Graph computing with Spark GraphX

After this course, you will understand how to build distributed applications using Python and Spark, particularly for conducting analyses. You will be introduced to Spark applications and be able to run Spark SQL queries on a distributed database, conduct machine learning with Spark MLlib, and execute graph algorithms with Spark GraphX.

Course Requirements

Attendees should be familiar with Python and with the command line before participating in this course. They should also have the required software installed and operational on their computers.


Interested in Having Us Teach This Course at Your Organization?