Entity Resolution
With Python Workshop



Entity resolution is the task of disambiguating manifestations of real-world entities through linking and grouping, and it is often an essential part of the data wrangling process.

There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization. Each serves to improve data quality: deduplication reduces irrelevant or repeated data, record linkage joins information from disparate records, and canonicalization provides a single source of information to perform analytics upon.

However, because of data quality issues (misspellings or incorrect data), schema variations across sources, or simply different representations of the same entity, entity resolution is not a straightforward process, and most entity resolution techniques rely on machine learning and other stochastic approaches.

What You Will Learn

After this workshop you should understand the basic theoretical foundation for entity resolution techniques, how to implement pairwise similarity scoring, and how to use clustering for deduplication.
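As a rough illustration of the deduplication goal, the sketch below groups records into clusters once some pairs have been judged to match. It assumes a simple connected-components approach using union-find, which is only one of several possible clustering strategies; the function name `cluster_matches` and the sample pairs are hypothetical and are not taken from the workshop material.

```python
def cluster_matches(n_records, matched_pairs):
    """Group record indices into clusters of transitively matched records."""
    parent = list(range(n_records))  # each record starts in its own cluster

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Merge the two clusters, keeping the smaller index as the root.
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical pairwise match decisions over six records (indices 0..5).
pairs = [(0, 1), (1, 3), (2, 4)]
clusters = cluster_matches(6, pairs)
print(clusters)  # [[0, 1, 3], [2, 4], [5]]
```

Each resulting cluster stands for one real-world entity; record 5 matched nothing and stays on its own.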

Course Outline

This workshop will be an approximately two-hour online webinar focused on demonstration-led topics and techniques. The session will be recorded for the benefit of the participants, and all code will be made available after the workshop.

The workshop will cover the following topics:

  • Introduction to entity resolution theory and tasks
  • Similarity scores and similarity vectors
  • Pairwise matching with the Fellegi-Sunter algorithm
  • Clustering approaches to deduplication 
  • Notes on record linkage, constraints, and canonicalization  
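To give a feel for the similarity and pairwise matching topics above, here is a minimal sketch using only the Python standard library. It builds a similarity vector (one score per field) for a pair of records and then computes a Fellegi-Sunter style match weight from it. The records, field names, the 0.8 agreement threshold, and the m- and u-probabilities are all made-up assumptions for illustration; in practice these probabilities are estimated from data, and the workshop's own code may differ.

```python
import math
from difflib import SequenceMatcher

# Hypothetical records; field names and values are invented for illustration.
a = {"name": "Jon Smith", "city": "Washington", "zip": "20001"}
b = {"name": "John Smith", "city": "Washington DC", "zip": "20001"}
c = {"name": "Mary Jones", "city": "Fargo", "zip": "58102"}

FIELDS = ("name", "city", "zip")

# Assumed probabilities that a field agrees given a true match (M)
# and given a true non-match (U). Real values would be estimated from data.
M = {"name": 0.95, "city": 0.90, "zip": 0.98}
U = {"name": 0.01, "city": 0.10, "zip": 0.02}


def similarity_vector(r1, r2, fields=FIELDS):
    """Return one string similarity score in [0, 1] per field."""
    return [
        SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in fields
    ]


def match_weight(vector, fields=FIELDS, threshold=0.8):
    """Sum log-likelihood ratios: agreement adds weight, disagreement removes it."""
    weight = 0.0
    for f, score in zip(fields, vector):
        if score >= threshold:  # treat the field as agreeing
            weight += math.log2(M[f] / U[f])
        else:  # treat the field as disagreeing
            weight += math.log2((1 - M[f]) / (1 - U[f]))
    return weight


vec_ab = similarity_vector(a, b)
vec_ac = similarity_vector(a, c)
print(vec_ab, match_weight(vec_ab))
print(vec_ac, match_weight(vec_ac))
```

With these assumed values, the pair (a, b) receives a positive weight and the pair (a, c) a negative one; a decision rule would compare each weight against upper and lower thresholds to label the pair a match, a non-match, or a candidate for manual review.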


We will use Python as the primary programming language and Scikit-Learn to demonstrate the machine learning tasks. Familiarity with a programming language and with machine learning is recommended, though not necessarily Python and Scikit-Learn specifically.

Instructor: Benjamin Bengfort


Benjamin is an experienced Data Scientist and Python developer who has worked in the military, industry, and academia for the past eight years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, doing research in question-and-answer reasoning systems. He holds a Master's degree from North Dakota State University, where he taught undergraduate Computer Science courses. He is also adjunct faculty at Georgetown University, where he teaches Data Science and Analytics.

A recording of this online workshop is available for purchase.
