USING DATA SCIENCE TO SOLVE A REAL-WORLD PROBLEM
Lots of people write or publish data science tutorials. Yet most of these tutorials -- including those written by folks who hold the title of data scientist -- fail to provide examples where machine learning solves an everyday problem. Newcomers entering the field will struggle initially because tutorials and education rarely align with the problems (or data) you face in the workplace.
This real-world data science tutorial hopes to address this issue by creating a machine learning pipeline to solve a problem you would face as a data scientist in a real work environment. We will cover, in detail, the steps to train a model that labels content as "relevant" or "non-relevant" based on the requirements of a customer. In the process, you will learn a few tricks on how to pick and tune the best model to achieve optimum performance. And, most importantly, we close out by creating a web service that delivers our machine learning goodness to customers. This is a crucial step because data scientists need the skills to deliver data products to both technical and non-technical audiences. By following the steps in this tutorial, you will complete the entire cycle of the data science pipeline.
With all of the recent news about data breaches and tech giants pledging to #DeleteFacebook, something unexpected is happening: people are actually starting to care about online privacy. At the same time, awareness has been building around the real and potential harms of algorithmic decision-making systems. Scholarly organizations like FAT* and AI Now have been getting more attention, and concern about algorithmic harms has been filtering into pop culture. At least, among people who like nerdy webcomics. I knew we were entering a new era when I couldn’t even escape the “AI run amok” meme while watching cartoons with my 5-year-old.
I’ve recently been experimenting with the Yellowbrick library in Python, which is designed to make the machine learning process more visual. It offers a series of objects called Visualizers, which use a syntax that should be familiar to scikit-learn users. In this post, I want to explore whether we can use the tools in Yellowbrick to “audit” a black-box algorithm and assess claims about fairness and bias. At the same time, I prefer R for most visualization tasks. Fortunately, the new reticulate package has allowed Python part-timers, like me, to get something close to the best of both worlds.
Just about everyone on the Internet these days has been flooded with buzz words like "data science" and "artificial intelligence." Some of you may have been attracted to the field as a result of data scientist being called the "sexiest job of the 21st century," and that may have motivated you to learn how to become one.
In this tutorial, I'm going to use a very simple Titanic passenger survival dataset to show you how to start and finish a simple data science project using Python and Yellowbrick: from exploratory data analysis, to feature selection and feature engineering, to model building and evaluation.
AN ITERATIVE PROCESS
by Prema Roman
Entity resolution is a field that aims to find records in data sets that refer to the same entity by grouping and linking. Entity resolution is also called deduplication, merge purge, patient matching, etc. depending on the application. In voter registration, it is a useful technology to make sure voter rolls are up to date and can be used to see if a voter is registered in multiple states.
There are many challenges to applying entity resolution. Different data sources have varying schema, collection standards and methodologies. Even the task of standardizing the data to link entities from these sources can be a cumbersome task. To further complicate matters, there can be spelling errors, transposed characters, missing values, and other anomalies.
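To make the matching problem concrete, here is a minimal sketch of how two records with spelling errors or transposed characters might be compared. It uses only Python's standard-library `difflib`; the helper names (`normalize`, `similarity`, `is_probable_match`) and the 0.85 threshold are assumptions chosen for illustration, not a method taken from the post.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(a, b, threshold=0.85):
    """Flag two names as likely the same entity (threshold is illustrative)."""
    return similarity(a, b) >= threshold


# A spelling variant and a transposition both score above the threshold.
print(is_probable_match("Jonathan Smith", "Jonathon Smith"))
print(is_probable_match("Jonh Smith", "John Smith"))
print(is_probable_match("Jonathan Smith", "Maria Garcia"))
```

Real entity resolution usually compares several fields (name, address, date of birth) and blocks candidate pairs first to avoid comparing every record with every other one; this sketch only shows the pairwise comparison step.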
APPLICATIONS IN EDUCATION, CYBERSECURITY, MANUFACTURING, FINANCE, AND MEDIA
By Tony Ojeda
This month, we continue our Applied Data Science & AI Round-Up series where we highlight some of the interesting applications of data science and machine learning we’ve read about. In February, we saw several interesting examples of applications in the education, cybersecurity, manufacturing, and finance industries. Check them out below!
For 2018, we are starting a new monthly round-up series on the DDL blog that highlights examples of data science and artificial intelligence being applied in different fields, industries, and domains. We’ve noticed that people who don’t work closely with data, or haven’t worked with data scientists, often have trouble envisioning how these methods and technologies are applicable to them and can impact the work they do. We hope this series will help further the conversation about how these technologies can be applied and spark some ideas for how you can apply them in your own business or domain.
Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.
Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.
When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.
PART 1: SKIP-GRAM FEEDFORWARD
by Laura Lorenz
Let's continue our treatment of the Skip-gram model by walking through a single example of feeding forward through a Skip-gram neural network: from an input target word, through a projection layer, to an output context vector representing the target word's nearest neighbors. Before we get into our example, though, let's revisit some fundamentals on neural networks.
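As a rough illustration of that forward pass, the sketch below pushes one target word through a projection layer and a softmax output layer. The three-word vocabulary and the weight values are made up for the example, and the function names are hypothetical; a trained model would learn the weights.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward(word_index, w_in, w_out):
    """One forward pass: one-hot input -> projection layer -> softmax output.

    w_in is a V x N matrix (one row per vocabulary word) and w_out is N x V.
    Multiplying a one-hot vector by w_in simply selects one row of w_in.
    """
    hidden = w_in[word_index]
    scores = [sum(h * w for h, w in zip(hidden, column)) for column in zip(*w_out)]
    return softmax(scores)


# Toy weights for a 3-word vocabulary and a 2-unit projection layer.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = feed_forward(0, w_in, w_out)
print(probs)
```

The output has one probability per vocabulary word, which is how the network scores candidate context words for the given target.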
NEW YEAR'S RESOLUTIONS FOR THE INTERMEDIATE DATA SCIENTIST
2016 marked a zenith in the data science renaissance. In the wake of a series of articles and editorials declaiming the shortage of data analysts, the internet responded in force, exploding with blog posts, tutorials, and listicles aimed at launching the beginner into the world of data science. And yet, in spite of all the claims that this language or that library make up the essential know-how of a "real" data scientist, if 2016 has taught us anything it's that the only essential skill is a willingness to keep learning.
Exploratory data analysis (EDA) is an important pillar of data science, a critical step required to complete every project regardless of the domain or the type of data you are working with. It is exploratory analysis that gives us a sense of what additional work should be performed to quantify and extract insights from our data. It also informs us as to what the end product of our analytical process should be. Yet, in the decade that I've been working in analytics and data science, I've often seen people grasping at straws when it comes to exploring their data and trying to find insights.
Machine learning models benefit from an increased number of features --- "more data beats better algorithms". In the financial and social domains, macroeconomic indicators are routinely added to models, particularly those that contain a discrete time or date. For example, loan or credit analyses that predict the likelihood of default can benefit from unemployment indicators, and a model that attempts to quantify pay gaps between genders can benefit from demographic employment statistics.
ANTICIPATING THE EXTENT OF ARTICLE VIRALITY
by Josh Erb, Walter Tyrna, and Anurag Khaitan
Producing online content that goes viral continues to be more art than science. Often, the virality of content depends heavily on cultural context, relevance to current events, and the mercurial interest of the target audience. In today's dynamic world of constantly shifting tastes and interests, reliance on the experience and intuition of the editing staff is no longer sufficient to generate high-engagement digital content.
Exceptions are a crucial part of higher level languages, and although exceptions might be frustrating when they occur, they are your friend. The alternative to an exception is a panic --- an error in execution that at best simply makes the program die and at worst can cause a blue screen of death. Exceptions, on the other hand, are tools of communication; they allow the program to tell you what, why, and how something went wrong and then they gracefully terminate your program without destruction. Learning to understand exceptions and throw some of your own is a crucial next step in programming mastery, particularly in Python.
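A small sketch of what "throwing some of your own" can look like in Python follows. The `InvalidRecordError` class and the two parsing helpers are hypothetical names invented for this example.

```python
class InvalidRecordError(ValueError):
    """Raised when a record lacks a required field (hypothetical example class)."""

    def __init__(self, field, record):
        self.field = field
        self.record = record
        super().__init__(f"record is missing required field {field!r}: {record!r}")


def parse_age(record):
    """Return the record's age as an int, raising a descriptive error if absent."""
    if "age" not in record:
        raise InvalidRecordError("age", record)
    return int(record["age"])


def safe_parse_age(record, default=None):
    """Handle only the error we know how to recover from; let others propagate."""
    try:
        return parse_age(record)
    except InvalidRecordError as err:
        print(f"skipping record: {err}")
        return default


print(safe_parse_age({"name": "Ada", "age": "36"}))
print(safe_parse_age({"name": "Grace"}, default=-1))
```

Catching the specific exception class, rather than a bare `except`, keeps unexpected failures visible while still handling the case the caller anticipated.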
A PRIMER FOR DATA SCIENTISTS
The phrase "open source" evokes an egalitarian, welcoming niche where programmers can work together towards a common purpose --- creating software to be freely available to the public in a community that sees contribution as its own reward. But for data scientists who are just entering into the open source milieu, it can sometimes feel like an intimidating place. Even experienced, established open source developers like Jon Schlinkert have found the community to be less than welcoming at times. If the author of more than a thousand projects, someone whose scripts are downloaded millions of times every month, has to remind himself to stay positive, you might question whether the open source community is really the developer Shangri-la it would appear to be!
The amount of data generated each day from sources such as scientific experiments, cell phones, and smartwatches has been growing exponentially over the last several years. Not only is the number of data sources increasing, but the data itself is also growing richer as the number of features in the data increases. Datasets with a large number of features are called high-dimensional datasets.
Chances are, if you've been working in Natural Language Processing (NLP) or machine learning, you've heard of the class of approaches called Word2Vec. Word2Vec is an implementation of the Skip-Gram and Continuous Bag of Words (CBOW) neural network architectures. At its core, the skip-gram approach is an attempt to characterize a word, phrase, or sentence based on what other words, phrases, or sentences appear around it. In this post, I will provide a conceptual understanding of the inputs and outputs of the skip-gram architecture.
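To illustrate the idea of characterizing a word by what appears around it, the sketch below builds the (target, context) training pairs that a skip-gram model consumes. The function name `skipgram_pairs` and the window size are assumptions for this example rather than anything defined in the post.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each target word with every word within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


print(skipgram_pairs(["the", "quick", "fox"], window=1))
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'fox'), ('fox', 'quick')]
```

Each pair becomes one training example: the target word is the input and the context word is the expected output.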
HOW I LEARNED TO STOP WORRYING AND LOVE WORD EMBEDDINGS
This post is about Distributed Representations, a concept that is foundational not only to the understanding of data processing in machine learning, but also to the understanding of information processing and storage in the brain. Distributed representations of data are the de facto approach for many state-of-the-art deep learning techniques, notably in the area of Natural Language Processing, which will be the focus of this blog post.
VISUALIZING TEXT WITH PYTHON
by Lisa Combs and Prema Roman
In this article, we explore two extremely powerful ways to visualize text: word bubbles and word networks. These two visualizations are replacing word clouds as the de facto text visualization of choice because they are simple to create, understandable, and provide deep and valuable at-a-glance insights. In this post, we will examine how to construct these visualizations from a non-trivial corpus of news and blog RSS feeds. We begin by investigating the importance of text visualization. Next, we discuss our corpus and how to wrangle it. Finally, we will present word bubbles and word networks.
Last week, a group of us from District Data Labs flew to Portland, Oregon to attend PyCon, the largest annual gathering for the Python community. We had a talk, a tutorial, and two posters accepted to the conference, and we also hosted development sprints for several open source projects. With this blog post, we are putting everything together in one place to share with those that couldn't be with us at the conference.
VISUAL EVALUATION AND PARAMETER TUNING
Welcome back! In this final installment of Visual Diagnostics for More Informed Machine Learning, we'll close the loop on visualization tools for navigating the different phases of the machine learning workflow. Recall that we are framing the workflow in terms of the 'model selection triple' — this includes analyzing and selecting features, experimenting with different model forms, and evaluating and tuning fitted models. So far, we've covered methods for visual feature analysis in Part 1 and methods for model family and form exploration in Part 2. This post will cover evaluation and tuning, so we'll begin with two questions:
- How do you know when your machine learning model is working?
- How can you make it work better?
PYCON 2016 TUTORIAL ON SUNDAY MAY 29, 2016 AT 9AM
In this tutorial, we will explore the features of the NLTK library for text processing in order to build language-aware data products with machine learning. In particular, we will use a corpus of RSS feeds that have been collected since March to create supervised document classifiers as well as unsupervised topic models and document clusters. To do this, we will need to use language analysis to preprocess and vectorize our documents into an acceptable format for applying machine learning techniques. We will use the classifiers included with NLTK, Naive Bayes and Maximum Entropy, for our document classification, and use K-means clustering and LDA in Gensim for unsupervised topic modeling.
DEMYSTIFYING MODEL SELECTION
When it comes to machine learning, ultimately the most important picture to have is the big picture. Discussions of (i.e. arguments about) machine learning are usually about which model is the best. Whether it's logistic regression, random forests, Bayesian methods, support vector machines, or neural nets, everyone seems to have their favorite! Unfortunately, these discussions tend to truncate the challenges of machine learning into a single problem, which is a particularly problematic misrepresentation for people who are just getting started with machine learning. Sure, picking a good model is important, but it's certainly not enough (and it's debatable whether a model can actually be 'good' devoid of the context of the domain, the hypothesis, the shape of the data, and the intended application, but we'll leave that to another post).
Python and high level libraries like Scikit-learn, TensorFlow, NLTK, PyBrain, Theano, and MLPY have made machine learning accessible to a broad programming community that might never have found it otherwise. With the democratization of these tools, there is now a large, and growing, population of machine learning practitioners who are primarily self-taught. At the same time, the stakes of machine learning have never been higher; predictive tools are driving decision-making in every sector, from business, art, and engineering to education, law, and defense.
COMBINING NERCS TO IMPROVE ENTITY EXTRACTION
The overwhelming amount of unstructured text data available today from traditional media sources as well as newer ones, like social media, provides a rich source of information if the data can be structured. Named Entity Extraction forms a core subtask to build knowledge from semi-structured and unstructured text sources. Some of the first researchers working to extract information from unstructured texts recognized the importance of “units of information” like names (such as person, organization, and location names) and numeric expressions (such as time, date, money, and percent expressions). They coined the term “Named Entity” in 1996 to represent these.
In one of the machine learning workshops given in the Georgetown Data Science Certificate, students build a classification, regression, or clustering model using one of the UCI Machine Learning Repository datasets. The idea behind the workshop is to ingest data from a website, perform some initial analyses to get a sense for what's in the data, then structure the data to fit a Scikit-Learn model and evaluate the results. Although the repository does give advice as to what types of machine learning might be applied, this workshop still poses a challenge, especially in terms of data wrangling.
The analysis of interconnection structures of entities connected through relationships has proven to be of immense value in understanding the inner workings of networks in a variety of different data domains including finance, health care, business, computer science, etc. These analyses have emerged in the form of Graph Analytics -- the analysis of the characteristics in these graph structures through various graph algorithms. Some examples of insights offered by graph analytics include finding clusters of entities closely connected to each other, calculating optimal paths between entities (the definition of optimal depending on the dataset and use case), understanding the hierarchy of entities within an organization, and figuring out the impact each entity has inside the network.
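As one small example of a graph algorithm of this kind, the sketch below finds a path with the fewest hops between two entities using breadth-first search. Here "optimal" is assumed to mean fewest edges, and the adjacency-list graph and function name are made up for illustration.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return a fewest-hops path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is unreachable from start


# Directed adjacency list: each key points to the entities it links to.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_path(graph, "a", "e"))
# → ['a', 'b', 'd', 'e']
```

For weighted relationships, where "optimal" means lowest total cost rather than fewest hops, an algorithm such as Dijkstra's would be the usual substitute.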
HOW NOT TO LOSE FRIENDS AND ALIENATE PEOPLE
In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to illustrate examples of how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson, because only that can provide for deep and meaningful exploration. Unfortunately, non-trivial datasets can be hard to find for a few reasons, one of which is that many contain personally identifying information (PII).
The impulse to ingest more data is our first and most powerful instinct. Born with billions of neurons, as babies we begin developing complex synaptic networks by taking in massive amounts of data - sounds, smells, tastes, textures, pictures. It's not always graceful, but it is an effective way to learn.
This post will cover a few things needed to quickly implement a fast, principled method for machine learning model parameter tuning. There are two common methods of parameter tuning: grid search and random search. Each has its pros and cons. Grid search is slow but effective at searching the whole search space, while random search is fast but could miss important points in the search space. Luckily, a third option exists: Bayesian optimization. In this post, we will focus on one implementation of Bayesian optimization, a Python module called hyperopt.
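The trade-off between the two common methods can be sketched with the standard library alone. The example below is not hyperopt and does not perform Bayesian optimization; it only contrasts grid search and random search on a made-up objective function that stands in for a model's validation error.

```python
import itertools
import random


def objective(params):
    """Toy loss with its minimum at x=0.3, y=0.7 (stand-in for validation error)."""
    return (params["x"] - 0.3) ** 2 + (params["y"] - 0.7) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best."""
    names = list(grid)
    candidates = (dict(zip(names, combo)) for combo in itertools.product(*grid.values()))
    return min(candidates, key=objective)


def random_search(bounds, n_trials, seed=0):
    """Evaluate n_trials points drawn uniformly from the bounds and keep the best."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        for _ in range(n_trials)
    ]
    return min(candidates, key=objective)


steps = [0.0, 0.25, 0.5, 0.75, 1.0]
best_grid = grid_search({"x": steps, "y": steps})  # 25 evaluations
best_random = random_search({"x": (0.0, 1.0), "y": (0.0, 1.0)}, n_trials=25)
print(best_grid, objective(best_grid))
print(best_random, objective(best_random))
```

Grid search can only ever return a point on its lattice, and its cost multiplies with every added parameter; random search spends the same budget on points anywhere in the space but gives no guarantee of coverage. Bayesian optimization instead uses the results of earlier trials to decide where to sample next.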
Discrete events pervade our daily lives. These include phone calls, online transactions, and heartbeats. Despite the simplicity of discrete event data, it’s hard to visualize many events over a long time period without hiding details about shorter timescales.
We are living through an information revolution. Like any economic revolution, it has had a transformative effect on society, academia, and business. The present revolution, driven as it is by networked communication systems and the Internet, is unique in that it has created a surplus of a valuable new material - data - and transformed us all into both consumers and producers. The sheer amount of data being generated is tremendous. Data increasingly affects every aspect of our lives, from the food we eat, to our social interactions, to the way we work and play. In turn, we have developed a reasonable expectation for products and services that are highly personalized and finely tuned to our bodies, our lives, and our businesses, creating a market for a new information technology - the data product.
As data scientists, it's easy to get bogged down in the details. We're busy implementing Python and R code to extract valuable insights from data, train effective machine learning models, or put a distributed computation system together. Many of these tasks, especially those relating to data ingestion or wrangling, are time-consuming but are the bread and butter of the data scientist's daily grind. What we often forget, however, is that we must not only be data engineers, but also contributors to the data science corpus of knowledge.
Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.
Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).
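The MapReduce idea can be sketched in a few lines of single-machine Python. This is only an illustration of the programming model using the classic word count, not Hadoop's API; the function names are assumptions, and a real cluster would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big cluster"]))
# → {'big': 2, 'data': 1, 'cluster': 1}
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be split across machines without the workers needing to coordinate.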
Hadoop developers usually test their scripts and code on a pseudo-distributed environment (also known as a single node setup), which is a virtual machine that runs all of the Hadoop daemons simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote cluster or pay the expense of EC2. If you're learning Hadoop, you'll probably also want to set up a pseudo-distributed environment to facilitate your understanding of the various Hadoop daemons.
EFFICIENT PROCESSING, SCHEMAS, AND SERIALIZATION
I wanted to write a quick post today about a task that most of us do routinely but often think very little about - loading CSV (comma-separated value) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I have to jump through hoops to deal with this type of data, when it feels like it should be as easy as JSON or other serialization formats.
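One of the hoops is that CSV carries no type information: every value arrives as a string. The sketch below reads rows lazily with the standard-library `csv` module and applies a caller-supplied schema of converters. The `read_csv` helper, the schema format, and the sample data are all assumptions made for this example.

```python
import csv
import io
from collections import namedtuple


def read_csv(stream, schema):
    """Yield one typed row at a time from a CSV stream with a header line.

    `schema` maps each column name to a function that converts its string value.
    """
    Row = namedtuple("Row", list(schema))
    for record in csv.DictReader(stream):
        yield Row(**{name: convert(record[name]) for name, convert in schema.items()})


# In practice `stream` would be an open file; StringIO keeps the example self-contained.
sample = io.StringIO("name,age,score\nAda,36,9.5\nGrace,45,8.0\n")
rows = list(read_csv(sample, {"name": str, "age": int, "score": float}))
print(rows[0])
# → Row(name='Ada', age=36, score=9.5)
```

Because `read_csv` is a generator, only one row is held in memory at a time, which matters once files grow beyond what fits comfortably in RAM.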
In addition to regular probability, we often want to figure out how probability is affected by observing some event. For example, the NFL season is rife with possibilities. From the beginning of each season, fans start trying to figure out how likely it is that their favorite team will make the playoffs. After every game the team plays, these probabilities change based on whether they won or lost. This post won't speak to how these probabilities are updated. That's the subject for a future post on Bayesian statistics. What we will explore is the concept of conditional probability, which is the probability of seeing some event knowing that some other event has actually occurred.
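A short worked example may help fix the definition, P(A | B) = P(A and B) / P(B). The sketch below uses two fair dice rather than football, and assumes every outcome is equally likely so that probabilities reduce to counting; the function name is hypothetical.

```python
from fractions import Fraction


def conditional_probability(outcomes, event_a, event_b):
    """P(A | B) over equally likely outcomes: count A within the outcomes where B holds."""
    given = [o for o in outcomes if event_b(o)]
    if not given:
        raise ValueError("the conditioning event never occurs")
    both = [o for o in given if event_a(o)]
    return Fraction(len(both), len(given))


# All 36 equally likely rolls of two fair dice.
rolls = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]


def sums_to_eight(roll):
    return sum(roll) == 8


def first_die_at_least_four(roll):
    return roll[0] >= 4


p_unconditional = Fraction(sum(1 for r in rolls if sums_to_eight(r)), len(rolls))
p_conditional = conditional_probability(rolls, sums_to_eight, first_die_at_least_four)
print(p_unconditional)  # → 5/36
print(p_conditional)    # → 1/6
```

Knowing that the first die shows at least four raises the probability of a total of eight from 5/36 to 1/6, because the observation rules out half of the possible rolls.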
By Benjamin Bengfort
Consumers rely on the collective intelligence of other consumers to protect themselves from coffee pots that break at the first sign of water, eating bad food at the wrong restaurant, and stunning flops at the theater. Although occasionally there are metrics like Rotten Tomatoes, we primarily prejudge products we would like to consume through a simple 5 star rating. This methodology is powerful, because not only does it provide a simple, easily understandable metric, but people are generally willing to reliably cast a vote by clicking a star rating without too much angst.
Workflows and Development Tools
By Benjamin Bengfort
An Intro to Probability with R
By Abhijit Dasgupta
The probability of an event represents the likelihood of the event to occur. For example, most of us would agree that the probability of getting a heads after flipping a fair coin is 0.5 or that the probability of getting a one on rolling a fair die is 1/6. However, there are many more places where we encounter probabilities in our lives.
During election season, we have pundits and polls speaking to the likelihood (probability) of winning for each candidate. Doctors will often state that a patient has low or high risk for heart attacks or breast cancer, reflecting either data or the doctor's belief of how likely a patient is to experience that outcome. Banks use customer data to figure out how likely a customer is to default on a loan - the credit risk. We see probability appear in many guises, being called likelihood, risk, odds, propensity, and other synonyms.
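The coin and die figures above can be checked by simulation: repeat the experiment many times and measure how often the event occurs. The sketch below is written in Python with a fixed seed; the helper name and the number of trials are assumptions for illustration.

```python
import random


def estimate_probability(trial, n_trials=100_000, seed=42):
    """Estimate an event's probability as its relative frequency over many trials."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if trial(rng))
    return hits / n_trials


# Fair coin: heads when a uniform draw falls below 0.5.
p_heads = estimate_probability(lambda rng: rng.random() < 0.5)
# Fair die: the event is rolling a one.
p_one = estimate_probability(lambda rng: rng.randint(1, 6) == 1)

print(p_heads)  # close to 0.5
print(p_one)    # close to 1/6
```

The estimates will not equal 0.5 and 1/6 exactly, but they settle closer to those values as the number of trials grows.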
AN INTRO TO R FOR MICROSOFT EXCEL USERS
by Tony Ojeda
In today's increasingly data-driven world, business people are constantly talking about how they want more powerful and flexible analytical tools, but are usually intimidated by the programming knowledge these tools require and the learning curve they must overcome just to be able to reproduce what they already know how to do in the programs they've become accustomed to using. For most business people, the go-to tool for doing anything analytical is Microsoft Excel.
If you're an Excel user and you're scared of diving into R, you're in luck. I'm here to slay those fears! With this post, I'll provide you with the resources and examples you need to get up to speed doing some of the basic things you're used to doing in Excel in R. I'm going to spare you the countless hours I spent researching how to do this stuff when I first started so that you feel comfortable enough to continue using R and learning about its more sophisticated capabilities.