Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. As data scientists, however, we have a richer view of natural language: unstructured data that by its very nature carries important latent information for humans. NLP practitioners have benefited from machine learning techniques that unlock meaning from large corpora, and in this class we’ll explore how to do that with Python, the Natural Language Toolkit (NLTK), and, to a lesser extent, the Gensim library.

NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.

What You Will Learn

In this course we will begin by exploring NLTK through the corpora that come with the library, which will give us a feel for the features and functionality that NLTK has to offer. However, since most NLP practitioners want to work on their own corpora, the second half of the course will focus on building a language-aware data product from a specific corpus: HTML articles ingested using the Baleen RSS ingestion tool. We will begin by using unsupervised methods and topic modeling to see if we can discover document similarities. We will then use this as a base to build a document classification system. Both techniques are essential to almost every NLP task, and mastering them will open up possibilities for a wide variety of language-aware applications.

Course Outline

The course is made up of the following one-hour modules.

Part One: Using NLTK

  • Introduction to NLTK: code + resources (magic)
  • The counting of things: concordances, frequency distributions, tokenization
  • Tagging and parsing: part-of-speech (PoS) tagging, syntactic parsing
  • Key term extraction and object identification: named entity recognition and classification (NERC)
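
As a rough illustration of what "the counting of things" means, here is a minimal sketch of tokenization, a frequency distribution, and a concordance. It deliberately uses only the Python standard library so it runs without NLTK or its data; the function names `tokenize` and `concordance` and the sample sentence are made up for this example and are not part of the course materials. In the workshop itself, NLTK's own tokenizers and its `FreqDist` class (a subclass of `collections.Counter`) would likely fill these roles.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, target, width=2):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The dog sat on the log."

tokens = tokenize(sample)          # list of lowercase word tokens
freq = Counter(tokens)             # frequency distribution over tokens
top = freq.most_common(1)          # most frequent token with its count
hits = concordance(tokens, "sat")  # every "sat" shown in context

print(top)
for line in hits:
    print(line)
```

The regular-expression tokenizer here is a simplification: it drops punctuation and numbers, which a real tokenizer would handle more carefully.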

Part Two: Building an NLP Data Product

  • Using the NLTK API to wrap a custom corpus
  • Creating vector models of text using bag-of-words
  • Document classification using Naive Bayes and Maximum Entropy
  • Adding in Scikit-Learn for other classifiers
  • Unsupervised methods: matrix factorization
  • Using Gensim for LDA topic modeling
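
To make the bag-of-words and Naive Bayes modules a little more concrete, the sketch below builds word-count vectors and trains a tiny multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is a from-scratch illustration using only the standard library, not the NLTK or Scikit-Learn implementation used in the course; the helper names (`bag_of_words`, `train`, `classify`), the four training sentences, and the labels are all assumptions invented for this example.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a document as a mapping of word -> count, ignoring word order."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Collect per-label document counts, per-label word counts, and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        bow = bag_of_words(text)
        label_counts[label] += 1
        word_counts[label].update(bow)
        vocab.update(bow)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in bag_of_words(text).items():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (label, word) pairs from zeroing the score.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
training = [
    ("the team won the game", "sports"),
    ("the player scored a goal", "sports"),
    ("the senate passed the bill", "politics"),
    ("the president signed the law", "politics"),
]

model = train(training)
prediction = classify("the player won", model)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once documents are longer than a few words.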

Notably not covered: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question answering systems, summarization, etc.

After taking this workshop, students will be able to create a Python module that wraps their own corpora and begin to use NLTK tools against it. They will also understand the features and functionality of NLTK and have a working knowledge of how to architect applications that use NLP. Finally, students who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.


This course is an intermediate Python course as well as an intermediate Data Science course. Students will be expected to have more than beginner-level knowledge of both Python and software development, as well as of the analytical and mathematical techniques used in Data Science. In particular, students will need the following knowledge and preparation before the course:

  • Python installed on their system
  • Knowledge of how to write and execute Python programs
  • Understanding of how to use the command line
  • NLTK installed along with all corpora and NLTK Data
  • Knowledge of English grammar and parts of speech (adjectives, verbs, nouns, etc.)
  • Basic probability and statistical knowledge


  • Install Gensim
  • Install Scikit-Learn

Instructor: Benjamin Bengfort


Benjamin is an experienced Data Scientist and Python developer who has worked in the military, industry, and academia for the past eight years. He is currently pursuing his PhD in Computer Science at The University of Maryland, College Park, doing research in Machine Learning and Distributed Systems. He holds a Master's degree from North Dakota State University, where he taught undergraduate Computer Science courses. He is also adjunct faculty at Georgetown University, where he teaches Data Science and Analytics. He has built many language-aware data products, including classifier systems, language models (both sequential and connectionist), and semantic recognition systems.

SATURDAY, April 9th 2016


