Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the world of natural language: unstructured data that by its very nature holds important latent information for humans. NLP practitioners have benefitted from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python, the Natural Language Toolkit (NLTK), and, to a lesser extent, the Gensim library.

NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.

What You Will Learn

In this course we will begin by exploring NLTK through the corpora that come with the library, and in this way we will get a feel for the various features and functionality that NLTK has to offer. However, as most NLP practitioners want to work on their own corpora, during the second half of the course we will focus on building a language-aware data product from a specific corpus. We will begin by using unsupervised methods and topic modeling to see if we can discover document similarities. We will then use this as a base to build a document classification system. Both techniques are essential to almost every NLP task, and their mastery will open up possibilities for a wide variety of language-aware applications.

Course Outline

The course is made up of the following modules.

Part One: Using NLTK

  • Introduction to NLTK: code + resources (magic)
  • The counting of things: concordances, frequency distributions, tokenization
  • Tagging and parsing: part-of-speech (PoS) tagging, syntactic parsing
  • Key term extraction and object identification: named entity recognition and classification (NERC)
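As a rough illustration of the "counting of things" module, the following minimal sketch tokenizes a short string and builds a frequency distribution. The sample text is made up for this example, and `wordpunct_tokenize` is used only because it is regex-based and needs no NLTK Data download; the course itself may well use other tokenizers.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text, invented for this sketch.
text = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks, and the fox runs away."
)

# Regex-based tokenization; keep alphabetic tokens only and lowercase them.
tokens = [tok.lower() for tok in wordpunct_tokenize(text) if tok.isalpha()]

# FreqDist maps each distinct token to the number of times it occurs.
fdist = FreqDist(tokens)
top_word, top_count = fdist.most_common(1)[0]

print(f"{fdist.N()} tokens, {fdist.B()} distinct")
print(f"most common: {top_word!r} ({top_count} occurrences)")
```

Lowercasing before counting is a design choice here: it merges "The" and "the" into one entry, which is usually what you want for frequency analysis but not for tasks such as named entity recognition.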

Part Two: Building an NLP Data Product

  • Using the NLTK API to wrap a custom corpus
  • Creating vector models of text using bag-of-words
  • Document classification using Naive Bayes and Maximum Entropy
  • Adding in Scikit-Learn for other classifiers
  • Unsupervised methods: matrix factorization
  • Using Gensim for LDA topic modeling
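To give a sense of how the bag-of-words and Naive Bayes modules fit together, here is a minimal sketch using NLTK's `NaiveBayesClassifier`. The `bag_of_words` helper, the two labels, and the six training sentences are all hypothetical and far too small for real use; they only show the shape of the workflow.

```python
from nltk import NaiveBayesClassifier
from nltk.tokenize import wordpunct_tokenize


def bag_of_words(text):
    """Hypothetical helper: map a document to binary 'contains(word)' features."""
    return {
        f"contains({tok.lower()})": True
        for tok in wordpunct_tokenize(text)
        if tok.isalpha()
    }


# Toy labeled corpus, invented for this sketch.
train_docs = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored a goal in the final match", "sports"),
    ("the coach praised the team after the game", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("the president signed the bill after the vote", "politics"),
    ("voters elected a new senate majority", "politics"),
]

# NLTK classifiers train on (featureset, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in train_docs]
classifier = NaiveBayesClassifier.train(train_set)

sports_guess = classifier.classify(bag_of_words("a late goal decided the match"))
politics_guess = classifier.classify(bag_of_words("the senate vote on the bill"))
print(sports_guess, politics_guess)
```

The bag-of-words representation discards word order entirely, which is why the same feature dictionaries can later be handed to other classifiers with little change.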

Notably not covered: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question answering systems, summarization, etc.

After taking this course, attendees will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against it. They will also have an understanding of the features and functionality of NLTK, and a working knowledge of how to architect applications that use NLP. Finally, attendees who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.
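One possible starting point for wrapping your own corpus is NLTK's `PlaintextCorpusReader`. The sketch below is only an assumption about how such a wrapper might begin: the temporary directory, file names, and document contents are invented so that the example is self-contained.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Build a throwaway corpus directory (hypothetical file names and contents).
corpus_root = tempfile.mkdtemp()
docs = {
    "doc1.txt": "NLTK ships with many corpora.",
    "doc2.txt": "Wrap your own corpus with a corpus reader.",
}
for name, body in docs.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as fh:
        fh.write(body)

# The second argument is a regular expression selecting the corpus files.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = reader.fileids()
words_doc1 = list(reader.words("doc1.txt"))

print(file_ids)
print(words_doc1)
```

Once a reader like this exists, the same `words()` and `raw()` accessors used with NLTK's bundled corpora are available for your own documents.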


This is an intermediate course in both Python and data science. Students are expected to have more than a beginner's knowledge of Python and software development, as well as of the analytical and mathematical techniques used in data science. In particular, students should come to the course with the following knowledge and preparation:

  • Python installed on their system
  • Knowledge of how to write and execute Python programs
  • Understanding of how to use the command line
  • NLTK installed along with all corpora and NLTK Data
  • Knowledge of the English language (adjectives, verbs, nouns, etc.)
  • Basic probability and statistical knowledge


  • Install Gensim
  • Install Scikit-Learn
