Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of natural language: unstructured data that, by its very nature, carries important latent information for humans. NLP practitioners have benefited from machine learning techniques that unlock meaning from large corpora. In this class, we’ll explore how to capitalize on these benefits with Python, the Natural Language Toolkit (NLTK), and, to a lesser extent, the Gensim library.

NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly and to iterate quickly on hypotheses. Gensim provides vector-based topic modeling, which NLTK does not currently offer. Combining Python and NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.

What You Will Learn

In this course, we will begin by exploring NLTK through the corpora that come with the library, and in this way get a feel for the features and functionality that NLTK has to offer. However, as most NLP practitioners want to work on their own corpora, during the second half of the course we will focus on building a language-aware data product from a specific corpus. We will begin by using unsupervised methods and topic modeling to see if we can discover document similarities. We will then use this as a base to build a document classification system. Both techniques are essential to almost every NLP task, and mastering them will unlock possibilities for a wide variety of language-aware applications.

Course Outline

The course includes the following modules.

Part One: Using NLTK

  • Introduction to NLTK: code + resources (magic)

  • The counting of things: concordances, frequency distributions, tokenization

  • Tagging and parsing: part-of-speech (PoS) tagging, syntactic parsing

  • Key term extraction and object identification: named entity recognition and classification (NERC)
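
To give a feel for "the counting of things" in Part One, here is a minimal sketch of tokenization, a frequency distribution, and a concordance. It deliberately uses only the Python standard library so that it runs without installing anything; the function names (tokenize, frequency_distribution, concordance) and the sample sentence are illustrative assumptions, not NLTK's API. NLTK's own FreqDist builds on the same collections.Counter idea shown here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_distribution(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, target, width=2):
    """Return every occurrence of `target` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


# Hypothetical sample text, invented for this illustration.
sample = "The whale surfaced. The crew saw the whale and the whale saw the crew."

tokens = tokenize(sample)
freq = frequency_distribution(tokens)

print(freq.most_common(2))           # the two most frequent tokens
print(concordance(tokens, "whale"))  # each "whale" shown in context
```

The regular-expression tokenizer is the crudest possible choice; it drops punctuation and numbers, which is one reason the course works with NLTK's tokenizers rather than a hand-rolled one.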

Part Two: Building an NLP Data Product

  • Using the NLTK API to wrap a custom corpus

  • Creating vector models of text using bag-of-words

  • Document classification using Naive Bayes and Maximum Entropy

  • Adding in Scikit-Learn for other classifiers

  • Unsupervised methods: matrix factorization

  • Using Gensim for LDA topic modeling
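
Two of the Part Two ideas, bag-of-words vectors and Naive Bayes classification, can be sketched together in a few lines. The following is an illustrative, standard-library-only multinomial Naive Bayes with add-one smoothing; it is not NLTK's NaiveBayesClassifier or Scikit-Learn's implementation, and the class name, method names, and the tiny training set are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            bag = bag_of_words(text)
            self.word_counts[label].update(bag)
            self.label_counts[label] += 1
            self.vocabulary.update(bag)

    def classify(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word, count in bag.items():
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                likelihood = (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                score += count * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training documents, invented for this illustration.
training = [
    ("the match ended with a late goal", "sports"),
    ("the striker scored a goal in the final", "sports"),
    ("the senate passed the budget vote", "politics"),
    ("the minister lost the election vote", "politics"),
]

classifier = NaiveBayes()
classifier.train(training)
print(classifier.classify("a late goal in the final"))
print(classifier.classify("the budget vote"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps a single unseen word from zeroing out an entire class.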

After taking this course, attendees will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against them. They will also develop an understanding of the features and functionality of NLTK, and a working knowledge of architecting applications that use NLP. Finally, attendees who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.

Course Requirements

Attendees should be familiar with Python and with the command line before participating in this course. They should also have the required software installed and operational on their computers.

