Fast Data Analytics with
Spark & Python Workshop



Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, which together allow distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce - a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
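To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. This is purely illustrative - Hadoop runs these phases across many machines, and the `mapper`, `reducer`, and `map_reduce` names are our own, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for a single word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort: group intermediate pairs by key, as Hadoop does
    # between the map and reduce phases.
    pairs = sorted(p for line in lines for p in mapper(line))
    return [reducer(key, (count for _, count in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Every computation must be decomposed into stateless map and reduce steps like these; the overhead comes from materializing the intermediate pairs to disk between each pair of phases.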

Many excellent analytical applications and algorithms have been written in MapReduce, creating an ecosystem that has helped Hadoop continue to grow as an effective tool. However, more complex algorithms, especially machine learning algorithms, often require extremely complex chains of jobs to conform to the MapReduce functional paradigm. Enter Spark, an open source Apache project that uses Hadoop's cluster infrastructure (particularly HDFS and other Hadoop data stores) but allows developers to break out of the MapReduce paradigm and write distributed applications that are much faster.

Spark also distributes applications to a cluster by using distributed executor processes. Spark developers write applications that are intended to work on local data; however, unlike MapReduce, these executors are in communication with each other and can share data via an external store. Spark is intended to work with Hadoop data stores but can also run in a standalone mode, and if you already have a Hadoop 2.0 cluster, Spark can be run with YARN. The flexibility that Spark provides means that it can be used to implement more complex algorithms and applications previously unavailable to MapReduce patterns.

Spark can run in memory, making it hundreds of times faster than disk-based MapReduce, and provides programming APIs in Scala, Java, and Python - making it more accessible to developers. Spark has an interactive command line interface for quickly interacting with data on the cluster, a module for writing SQL-like queries (Spark SQL), and a fairly complete machine learning library (MLlib). Importantly, it can also execute graph algorithms that were previously impractical to port to MapReduce frameworks.

What You Will Learn

In this one-day workshop, we will introduce Spark in a high-level context. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large, cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms and search. The course will be conducted in Python using PySpark.

Course Outline

The workshop will cover the following topics:

  • Interacting with Spark via the Spark Shell
  • Interacting with RDDs and other distributed data
  • Creating Spark applications in Python
  • Submitting Spark applications to the cluster
  • Aggregations and Queries using Spark SQL
  • Machine Learning with Spark MLlib
  • Graph computing with Spark GraphX

After this course, you should understand how to build distributed applications using Python and Spark, particularly for conducting analyses. You will be introduced to Spark applications and be able to run Spark SQL queries on a distributed database, conduct machine learning with Spark MLlib, and execute graph algorithms with Spark GraphX.


You must be familiar with Python before participating in this course and have familiarity with the command line on Linux systems. You must also prepare a single-node Spark environment, either running on top of a pseudo-distributed Hadoop node or as a standalone Spark machine. Note that Spark can run on many operating systems, but it is recommended that you run Spark on top of a Linux distribution. Before attending the course, ensure you have the following set up:

  1. Install Python and a POSIX compliant Terminal
  2. Install Spark on a Single Node (pseudo-distributed)
  3. Download the example datasets

If you need to refresh your Python or command line skills, you can review these topics:

  1. Python programming:
  2. Using the terminal:

Instructor: Keegan Hines

Keegan Hines is a Data Scientist with IronNet Cybersecurity, focusing on large-scale machine learning applications in cyber defense.

He received a PhD from the University of Texas with a focus on computational statistics and neuroscience during which time he taught multiple seminars on statistical methods and R. He is interested in challenging problems in machine learning and distributed computing.

SATURDAY, Sept 10th 2016


(EXPIRES 8/20/2016)

Buy a course bundle and save!

Two Workshop Bundle - Save 25%


Bundle Price: $450
($225 per workshop)


Attend any two workshops and save 25% off the regular price!
Perfect for those looking to skill-up in a couple data science topics.


To purchase this bundle, go to our course bundle registration page.

Three Workshop Bundle - Save 33%


Bundle Price: $600
($200 per workshop)


Attend any three workshops and save 33% off the regular price!
Perfect for those who need a little more exposure to data science.


To purchase this bundle, go to our course bundle registration page.

Four Workshop Bundle - Save 42%


Bundle Price: $700
($175 per workshop)


Attend any four workshops and save 42% off the regular price!
Perfect for those looking to gain exposure to several topics.


To purchase this bundle, go to our course bundle registration page.