How to Visualize a Data Science Project Pipeline with Yellowbrick

By Yara Guk

 

Whether you are a novice data scientist or a well-seasoned professional with years of experience, you have most likely faced the challenge of interpreting results generated somewhere along the many stages of the data science pipeline, be it data ingestion and wrangling, feature selection, or model evaluation. The issue becomes even more prominent when you need to present interim findings to stakeholders, clients, or others with a vested interest. How do you then deal with the long arrays of numbers, scientific notation, and formulas that tell the story of your dataset? That's when a visualization library like Yellowbrick becomes an essential tool for data scientists of every level. Yellowbrick provides interpretable, comprehensive visualizations for any stage of a project pipeline.

In this post, we explain how to integrate visualization steps into each stage of your project without having to create customized, time-consuming charts. Because the human eye perceives a graphical representation of information far better than it does raw bits and digits, we'll also see how much easier it becomes to draw the necessary insights from our data. The Yellowbrick machine learning visualization library serves exactly that purpose: to "create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models and assist in diagnosing problems throughout the machine learning workflow" ( http://www.scikit-yb.org/en/latest/about.html ).

For this exercise, we will be using the Absenteeism at Work dataset from the UCI Machine Learning Repository. This dataset contains a mix of continuous, binary, and hierarchical features, along with a continuous target representing how many hours an employee has been absent from work. Such a variety in the data makes for interesting wrangling, feature selection, and model evaluation tasks, the results of which we will visualize along the way.

To begin, we need to pip install and import the Yellowbrick Python library. To do that, simply run the following command from your command line:

$ pip install yellowbrick

Once that's done, let's import Yellowbrick along with the other essential packages and libraries, and set up our plotting preferences, in the Jupyter notebook.

import numpy as np
import pandas as pd
%matplotlib inline
from cycler import cycler
import matplotlib.style
import matplotlib as mpl
mpl.style.use('seaborn-white')
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from sklearn.cluster import KMeans
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, RandomTreesEmbedding, GradientBoostingClassifier
import warnings
warnings.filterwarnings("ignore")
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from yellowbrick.features import Rank1D
from yellowbrick.features import Rank2D
from yellowbrick.classifier import ClassBalance
from yellowbrick.model_selection import LearningCurve
from yellowbrick.model_selection import ValidationCurve
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ClassificationReport
from yellowbrick.features.importances import FeatureImportances

Data Ingestion and Wrangling

Now we are ready to proceed with downloading the zipped archive containing the dataset directly from the UCI Machine Learning Repository and extracting the data file. To perform this step, we will use the urllib.request module, which helps with opening URLs (mostly HTTP).

import urllib.request

print('Beginning file download...')

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip'  

urllib.request.urlretrieve(url, 'C:\\Users\\Yara\\Downloads\\Absenteeism_at_work_AAA.zip')

Next, unzip the archive and extract the CSV data file we will be using. The zipfile module handles this nicely.

import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Users\\Yara\\Downloads\\Absenteeism_at_work_AAA.zip')
fantasy_zip.extract('Absenteeism_at_work.csv', 'C:\\Users\\Yara\\Downloads')

fantasy_zip.close()

Now load the extracted CSV file into a pandas DataFrame.

dataset = pd.read_csv('C:\\Users\\Yara\\Downloads\\Absenteeism_at_work.csv', delimiter=';')

Let's take a look at a couple of randomly selected rows from the loaded data set.
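
One quick way to do that is with pandas' sample method, for example:

dataset.sample(5)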

[Figure: a random sample of rows from the dataset]
dataset.ID.count()

740

As we can see, the selected dataset contains 740 instances, each representing an employed individual. The features provided in the dataset are those considered to be related to the number of hours an employee was absent from work (the target). For the purpose of this exercise, we will subjectively group all instances into 3 categories, thus converting the continuous target into a categorical one. To identify appropriate bins for the target, let's look at its min, max, and mean values.

print(dataset.loc[:, 'Absenteeism time in hours'].mean())
print(dataset.loc[:, 'Absenteeism time in hours'].min())
print(dataset.loc[:, 'Absenteeism time in hours'].max())

6.924324324324324
0
120

Since approximately 7 hours of absence is the average across our dataset, it makes sense to group the records in the following manner:

  1. Low rate of absence (Low), if 'Absenteeism time in hours' value is < 6;

  2. Medium rate of absence (Medium), if 'Absenteeism time in hours' value is between 6 and 30;

  3. High rate of absence (High), if 'Absenteeism time in hours' value is > 30.

After grouping, we will continue exploring the data and selecting relevant features from the dataset in order to predict the absenteeism category for the instances in the test portion of the data.

dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'] < 6, 1, dataset['Absenteeism time in hours'])

dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'].between(6, 30), 2, dataset['Absenteeism time in hours'])

dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'] > 30, 3, dataset['Absenteeism time in hours'])

dataset.head()

[Figure: the first rows of the dataset after binning the target into categories]

Once the target is taken care of, it is time to look at the features. Features that store unique identifiers and/or data that might 'leak' information to the model should be dropped from the dataset. For instance, the 'Reason for absence' feature stores information 'from the future': it is highly correlated with the target, but it will not be available in a real-world business scenario when the model is run on new data.

dataset = dataset.drop(['ID', 'Reason for absence'], axis=1)

dataset.columns

Index(['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

We are now left with the set of features and a target to use in a machine learning model of our choice. So, let's separate features from the target, and split our dataset into a matrix of features (X) and an array of target values (y).

features = ['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index']

target = ['Absenteeism time in hours']

X = dataset.drop(['Absenteeism time in hours'], axis=1)
y = dataset.loc[:, 'Absenteeism time in hours']

Exploratory Analysis and Feature Selection

Whenever one deals with a categorical target, it is important to test the dataset for class imbalance issues. Machine learning models struggle to perform well on imbalanced data, where one class is overrepresented while another is underrepresented. While such datasets reflect real life (no company will have a majority, or even half, of its employees missing work on a massive scale), they need to be adjusted for machine learning purposes to improve an algorithm's ability to pick up the patterns present in the data.

To check for potential class imbalance in our data, we will use the ClassBalance visualizer from Yellowbrick.

# Calculating population breakdown by target category
Target = y.value_counts()
print('Low:', Target[1])
print('Medium:', Target[2])
print('High:', Target[3])

# Creating class labels
classes = ["Low", "Medium", "High"]

# Instantiate the classification model and visualizer
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['red', 'limegreen', 'yellow'])
forest = RandomForestClassifier()
fig, ax = plt.subplots(figsize=(10, 7))
visualizer = ClassBalance(forest, classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(axis='x')

visualizer.fit(X, y)  # Fit the data to the visualizer
visualizer.score(X, y)  # Evaluate the model on the same data
g = visualizer.poof()

Low:  468
Medium:  244
High:  28

[Figure: ClassBalance plot showing the support for the Low, Medium, and High classes]

There is an apparent class imbalance here. Therefore, we can expect the model to have difficulty learning the patterns for the Medium and High categories, unless we perform data resampling or apply a class weight parameter within the selected model, if the chosen algorithm supports it.
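
To make those two options concrete, here is a minimal sketch (for illustration only; in the rest of this post we simply rely on the class_weight parameter):

# Option 1: let the algorithm re-weight the classes internally
forest = RandomForestClassifier(class_weight='balanced')

# Option 2 (illustration only): naive random oversampling of every class up to the majority count
majority_size = y.value_counts().max()
oversampled = pd.concat([
    dataset[dataset['Absenteeism time in hours'] == label]
           .sample(majority_size, replace=True, random_state=42)
    for label in y.unique()
])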

Let's proceed with assessing feature importance and selecting the features that will be used further in a model of our choice. The Yellowbrick library provides many convenient visualizers for feature analysis; we will use a couple of them for demonstration purposes, as well as to check that consistent results are returned when different methods are applied.

The Rank1D visualizer utilizes the Shapiro-Wilk algorithm, which considers a single feature at a time and assesses the normality of the distribution of instances with respect to that feature. Let's see how it works!

fig, ax = plt.subplots(figsize=(10, 7))
visualizer = Rank1D(features=features, ax=ax, algorithm='shapiro')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()

[Figure: Rank1D Shapiro ranking of the features]

The Rank2D visualizer, in turn, utilizes a ranking algorithm that considers pairs of features at a time, and it lets the user select the ranking algorithm of their choice. We are going to experiment with covariance and Pearson correlation and compare the results.

figsize=(10, 7)
fig, ax = plt.subplots(figsize=figsize)
visualizer = Rank2D(features=features, ax=ax, algorithm='covariance', colormap='summer')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()

[Figure: Rank2D covariance ranking of feature pairs]
figsize=(10, 7)
fig, ax = plt.subplots(figsize=figsize)
visualizer = Rank2D(features=features, ax=ax, algorithm='pearson', colormap='winter')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()

[Figure: Rank2D Pearson correlation ranking of feature pairs]

A visual representation of feature correlation makes it much easier to spot pairs of features with high or low correlation coefficients. For instance, lighter colors on both plots indicate a strong correlation between such pairs of features as 'Body mass index' and 'Weight', or 'Seasons' and 'Month of absence'.

Another way of estimating feature importance relative to a model is to rank features by the model's feature importance (or coefficient) attribute once the data has been fitted. The Yellowbrick FeatureImportances visualizer uses this attribute to rank and plot the features' relative importances. Let's look at how this approach works with the Ridge, Lasso, and ElasticNet models.

# Ridge
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['red'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(311)
labels = features
viz = FeatureImportances(Ridge(alpha=0.1), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.poof()

# ElasticNet
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['salmon'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(312)
labels = features
viz = FeatureImportances(ElasticNet(alpha=0.01), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.poof()

# Lasso
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['purple'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(313)
labels = features
viz = FeatureImportances(Lasso(alpha=0.01), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.poof()

[Figures: feature importances for the Ridge, ElasticNet, and Lasso models]

Having analyzed the output of all utilized visualizations (Shapiro algorithm, Pearson Correlation Ranking, Covariance Ranking, Lasso, Ridge, and ElasticNet), we can now select a set of features which have meaningful coefficient values (positive or negative). These are the features to be kept in the model:

  • Disciplinary failure

  • Day of the week

  • Seasons

  • Distance from Residence to Work

  • Number of children (Son)

  • Social drinker

  • Social smoker

  • Height

  • Weight

  • BMI

  • Pet

  • Month of absence

Visualizing the feature coefficients calculated in a number of different ways significantly simplifies the feature selection process, since it provides an easy way to compare multiple values at a glance and keep only those that carry real weight in the model.

Now let's drop features which didn't make it and proceed with creating models.

X = X.drop(['Transportation expense', 'Age',
            'Service time', 'Hit target', 'Education',
            'Work load Average/day '], axis=1)

Some of the features to be used in the modeling stage might be of a hierarchical (categorical) type and require encoding. Let's look at the top couple of rows to see if we have any of those.

X.head()

[Figure: the first rows of the reduced feature matrix]

Looks like 'Month of absence', 'Day of the week' and 'Seasons' are not binary. Therefore, we'll use the pandas get_dummies function to encode them.

X = pd.get_dummies(data=X, columns=['Month of absence', 'Day of the week', 'Seasons'])

X.head()

[Figure: the first rows of the feature matrix after one-hot encoding]

Model Evaluation and Selection

Our matrix of features X is now ready to be fitted to a model, but first, we need to split the data into train and test portions for further model validation.

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.20, random_state=42)

For the purpose of model evaluation and selection, we will use Yellowbrick's Classification Report Visualizer, which displays the precision, recall, F1, and support scores for the model. In order to support easier interpretation and problem detection, the report integrates numerical scores with a color-coded heat map. All heat maps are normalized, i.e., in the range from 0 to 1, to facilitate easy comparison of classification models across different classification reports.

def visual_model_selection(X, y, estimator):
    visualizer = ClassificationReport(estimator, classes=['Low', 'Medium', 'High'], cmap='PRGn')
    visualizer.fit(X, y)  
    visualizer.score(X, y)
    visualizer.poof()  

visual_model_selection(X, y, BaggingClassifier())
visual_model_selection(X, y, LogisticRegression(class_weight='balanced'))
visual_model_selection(X, y, KNeighborsClassifier())
visual_model_selection(X, y, RandomForestClassifier(class_weight='balanced'))
visual_model_selection(X, y, ExtraTreesClassifier(class_weight='balanced'))

[Figures: classification report heat maps for the Bagging, Logistic Regression, K-Nearest Neighbors, Random Forest, and Extra Trees classifiers]

For the purposes of this exercise we will consider the F1 score when estimating the models' performance and making a selection. The classification reports above make it clear that the ensemble classifiers performed best. We need to pay special attention to the F1 scores for the underrepresented classes, "High" and "Medium", since they contain significantly fewer instances than the "Low" class. A high F1 score across all three classes therefore indicates very strong performance from the following models: Bagging Classifier, Random Forest Classifier, and Extra Trees Classifier.
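
If you also want the raw numbers behind the heat maps, the same per-class metrics can be computed directly with scikit-learn. As a quick sketch, here is one candidate model scored on the held-out test split (the figures will differ somewhat from the visualizer above, which was fitted and scored on the full dataset):

# Per-class F1 scores for one candidate model on the held-out test data
model = ExtraTreesClassifier(class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f1_score(y_test, y_pred, average=None, labels=[1, 2, 3]))  # F1 for Low, Medium, High
print(classification_report(y_test, y_pred, target_names=['Low', 'Medium', 'High']))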

We will also use the ClassPredictionError visualizer for these models to confirm their strong performance.

# Visualizing class prediction error for the Bagging Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['turquoise', 'cyan', 'teal', 'coral', 'blue', 'lime', 'lavender', 'lightblue', 'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(311)
visualizer = ClassPredictionError(BaggingClassifier(), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()

# Visualizing class prediction error for the Random Forest Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['coral', 'tan', 'darkred'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(312)
visualizer = ClassPredictionError(RandomForestClassifier(class_weight='balanced'), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()

# Visualizing class prediction error for the Extra Trees Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['limegreen', 'yellow', 'orange'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(313)
visualizer = ClassPredictionError(ExtraTreesClassifier(class_weight='balanced'), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()

[Figures: class prediction error plots for the Bagging, Random Forest, and Extra Trees classifiers]

Model Optimization

Now we can conclude that the ExtraTreesClassifier performs best, as it had no instances from the "High" class misclassified as "Low".

However, decision trees become more prone to overfitting the deeper they grow, because each additional level of the tree partitions an ever smaller subset of the data. One way to avoid overfitting is to limit the depth of the tree. Yellowbrick's ValidationCurve visualizer lets us explore the relationship between the max_depth parameter and a cross-validated score (here, accuracy with 3-fold cross-validation).

So let's proceed with hyperparameter tuning for our selected ExtraTreesClassifier model using the ValidationCurve visualizer!

# Validation Curve
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['purple', 'darkblue'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(411)
viz = ValidationCurve(ExtraTreesClassifier(class_weight='balanced'), ax=ax, param_name="max_depth", param_range=np.arange(1, 11), cv=3, scoring="accuracy")

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()

[Figure: validation curve for the Extra Trees classifier over max_depth]

We can observe in the chart above that even though the training score keeps rising, the cross-validation score stops improving and drops off at max_depth=7. Therefore, we will choose that value for our selected model to optimize its performance.

visual_model_selection(X, y, ExtraTreesClassifier(class_weight='balanced', max_depth=7))

[Figure: classification report for the tuned Extra Trees classifier]

Conclusion

As we demonstrated in this article, visualization is a valuable tool in the machine learning toolkit, and Yellowbrick provides a wide selection of visualizers to meet the needs of every step and stage of the data science project pipeline. From feature analysis and selection to model selection and optimization, Yellowbrick visualizers make it easy to decide which features to keep in the model, which model performs best, and how to tune a model's hyperparameters for optimal performance. Moreover, visualizing algorithmic output makes it easier to present insights to audiences and stakeholders, and contributes to the overall interpretability of machine learning results.

