How to Start Your First Data Science Project

a practical tutorial for beginners

By Juan L. Kehoe

Just about everyone on the Internet these days has been flooded with buzzwords like "data science" and "artificial intelligence." Some of you may have been attracted to the field because data scientist has been called the "sexiest job of the 21st century," and that may have motivated you to learn how to become one. You type "how to become a data scientist" into Google and find tons of blog posts and articles. You open some links from the first page of search results and read them. You find that they are generally pretty good, with details of what a data scientist is, what kind of background they need, etc. Then words like linear algebra, probability, statistics, Python, R, machine learning, and domain knowledge start to overwhelm you. You might be scared by how many things you need to learn before you become a data scientist, and that makes you want to walk away. To be fair, the requirements listed above are almost equivalent to the curriculum for a Ph.D. in data science. As a beginner, you are like an elementary school student trying to learn something about data science; you don't need that Ph.D.-like curriculum to scare you.

Here, I'm going to show you another way to look at it. Using a metaphor from the culinary world, I'm not going to list all the skills and background that you will need to become an Iron Chef, which might be your ultimate goal. But I'm going to show you a simple recipe, so you can mix some ingredients, do some easy preparation, put a dish in the oven, and have your first experience as a chef. After your first experience, you can decide if you enjoy the process or not. If you do, you can continue learning. After all, Rome was not built in one day: you can start small, and then have the chance to grow big.

In this tutorial, I'm going to use a very simple Titanic passenger survival dataset to show you how to start and finish a simple data science project using Python and Yellowbrick, from exploratory data analysis, to feature selection and feature engineering, to model building and evaluation.

Python and Yellowbrick installation

Although you don't need to know how to program yet, you will still need a programming language to finish the project. If you are a chef, programming languages are the pots, pans, and utensils you need to cook. As mentioned above, we are going to use Python and the Yellowbrick package.

Before you use Python and Yellowbrick, you will need to install them on your computer first. The following links will show you how to install them on different operating systems:

If you are curious as to why I've selected Python and Yellowbrick, check out this post.

Workflow of a data science project

Before we start, I want to introduce you to the workflow of how to start and finish a data science project. As shown in the image below, to start a data science project, you need to decide what the problem is. Ask yourself a question: what do you want to solve? It could be "who would survive on the Titanic?" like in the example we show below. Then you need to come up with an idea of what kind of data you will need to solve your problem, and how to collect them.


After data collection, next is exploratory data analysis (EDA). Then in the feature selection and feature engineering step, you might need to delete some features or add some new features. After you decide the features you want to feed into the model, you can start training your models.

Then you need to evaluate your models. If the models are good, you will be able to use them to make predictions on future data. Otherwise, you will have to go back to previous steps and make some improvements to see if you get better models. Depending on the diagnosis from your model evaluation step, you might go back to any of the previous steps and make some modifications to see if that helps your model performance.

You might need to iterate through the green part in the diagram several times before you get to a “good” model. It could also be the case that after you go through those steps numerous times, you still have not found a good model. That's when you need to start a new project or rethink your question.

Case study: Who would survive - Titanic dataset from Kaggle

I will show you how to start your first project via an easy example. In this project, we will use the data of the passengers and crew on the Titanic to predict who survived the tragedy. We will not go through all the steps mentioned above, since this is meant to be a simple start. We will skip the first three steps and go directly to the EDA part by using an existing dataset from Kaggle.com: the Titanic dataset. Please download the data from Kaggle.

You need to put the downloaded dataset into the same folder as your IPython Notebook file. Before you start doing anything, you can read some basic information of this dataset on Kaggle while you download the data.

I will keep my explanations minimal in this post, which means I will not explain why we do this and that in each step. It is just like following a recipe to bake a cake: you follow the steps without needing to figure out why the recipe calls for flour rather than rice.

However, I do want you to ask a lot of questions. You can send these questions to me via email at juanluo2008@gmail.com or try to figure them out yourself by searching for answers on Google. I will also attach some links in each step, so you can dig into the topics further if you are interested. That way you will learn a lot beyond what I cover here.

Brief peek into the dataset

Before we do EDA, we need to peek into the dataset a little to see what the data looks like and what features it has. The file you downloaded should be a .csv file. If you google "csv file", you will learn that it is an abbreviation for "comma-separated values file." When you read the .csv file with the pandas package in Python, it is read in as a table, with each line becoming a row and each comma-separated item going into its own column.

# import packages needed for the procedure
import pandas as pd

# read the file into a DataFrame named data
data = pd.read_csv("train.csv")

# check the dimension of the table
print("The dimension of the table is: ", data.shape)
The dimension of the table is:  (891, 12)

As we can see above, the dimension of the table is (891, 12), which means there are 891 rows and 12 columns in the table. Each row in the table represents a passenger or a member of the crew, and each column represents a piece of information about that person, which is also known as a variable.
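To make the link between lines of text and table rows concrete, here is a minimal sketch that reads a tiny, made-up CSV from memory instead of from train.csv. The three data lines are invented for illustration and are not taken from the Kaggle file.

```python
import io
import pandas as pd

# A tiny, made-up CSV: one header line followed by three data lines
csv_text = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column names come from the header line
```

The header line becomes the column names, so three data lines give three rows, not four.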

I copied the description of the features from Kaggle into the following data dictionary so that we can understand the information better:


Let's take a look at the first 5 rows of our data set. 

data.head()


By showing the first 5 rows of the data, we can see the data is a mixture of numerical (PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare), categorical (Sex, Cabin, Embarked), and text data (Name, Ticket). Technically, Survived and Pclass are categorical too, but they are represented in number form in this data. There are missing values too, which are represented as "NaN" as in the "Cabin" column.
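Looking at five rows only gives a hint about types and missing values. As a rough sketch, pandas can report both for every column. The toy table below is made up and stands in for the real data; on the real DataFrame you would call the same methods on `data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few Titanic columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# How pandas stores each column (numbers vs. text)
print(toy.dtypes)

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```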

The purpose of this project is to predict the "Survived" variable using the other variables. The "Survived" variable is also known as the "target," and the other variables are also known as "features."
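In code, separating the target from the features usually amounts to picking one column and dropping it from the rest. This is a minimal sketch on a made-up table; the names `X` and `y` are a common convention, not something the dataset requires.

```python
import pandas as pd

# Made-up stand-in for the Titanic table
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
})

# The target is the single column we want to predict
y = toy["Survived"]

# The features are every other column
X = toy.drop(columns=["Survived"])

print(list(X.columns))
```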

Exploratory data analysis (EDA)

EDA is used to explore the target and features so we know whether we need to transform or normalize some of the features based on their distributions, delete some because they might not give us any information for predicting future outcomes, or create some new features that might be useful for prediction. If you'd like to delve deeper into EDA, check out the Data Exploration with Python series on the District Data Labs blog.

It's always good to start your EDA process by asking lots of questions. Then you can generate figures and tables to answer these questions. For visualization, I will mainly use matplotlib and Yellowbrick.

To start with, I will ask some simple questions and then try to fill in the answers with some figures and tables. If you have more questions, feel free to write them down and try to figure out how to answer them or you can always send me an email. 

The questions I want to ask are:

  1. What do the variables look like? For example, are they numerical or categorical? If they are numerical, what are their distributions? If they are categorical, how many observations fall into each category?

  2. Are the numerical variables correlated?

  3. Are the distributions of the numerical variables the same or different between passengers who survived and those who did not? Is the survival rate different for different values? For example, were people more likely to survive if they were younger?

  4. Are there different survival rates in different categories? For example, did more women survive than men?

What do the variables look like?

To answer this question, first I'm going to check the summary of the variables, then make some histograms for the numerical variables, and some barplots for the categorical variables.

data.describe()


For all the numerical variables, you can see the average (mean), standard deviation (std), minimum value (min), maximum value (max), and several percentiles (25%, 50%, and 75%) of the data. The count row also tells us that some of the variables have missing values. For example, the "Age" variable has only 714 data points instead of 891.

data.describe(include=['O'])


For all the object variables (categorical and text), you can see how many categories are in each variable from the "unique" row. For example, the "Sex" variable has only 2 categories.

Summary of all the variables in tables like this can give you a very rough idea of how the variables look. However, to get more details, we will need to dive deeper and use additional visualization techniques.
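Before moving on to plots, the "unique" row can be cross-checked for a single variable. The sketch below uses a made-up column, so the counts are illustrative only; `nunique` and `value_counts` are standard pandas Series methods.

```python
import pandas as pd

# Made-up stand-in for the "Sex" column
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Number of distinct categories (matches the "unique" row of describe)
n_categories = toy["Sex"].nunique()

# Number of rows in each category
sex_counts = toy["Sex"].value_counts()

print(n_categories)
print(sex_counts)
```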

Histograms of the Numerical Variables

Histograms are a very good visualization technique for checking the distribution of numerical data. In our data set, "PassengerId" is just a unique number from 1 to 891 that labels each person, and "Survived" and "Pclass" are really categorical data, so I will not plot histograms for these variables.

# import visualization packages
import matplotlib.pyplot as plt

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)

# make subplots
fig, axes = plt.subplots(nrows = 2, ncols = 2)

# Specify the features of interest
num_features = ['Age', 'SibSp', 'Parch', 'Fare']
xaxes = num_features
yaxes = ['Counts', 'Counts', 'Counts', 'Counts']

# draw histograms
axes = axes.ravel()
for idx, ax in enumerate(axes):
    ax.hist(data[num_features[idx]].dropna(), bins=40)
    ax.set_xlabel(xaxes[idx], fontsize=20)
    ax.set_ylabel(yaxes[idx], fontsize=20)
    ax.tick_params(axis='both', labelsize=15)


From the histograms, we see that all the values of the variables seem to be in a plausible range. Most of the passengers are around 20 to 30 years old and don't have siblings or relatives with them. Most of the tickets sold cost less than $50. Very few tickets were sold with a fare over $500.

Barplot for the Categorical Data

Next we will create barplots for the categorical variables in the data set. Since "Ticket" and "Cabin" have too many levels (more than 100), I will not make the barplot for these.

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)

# make subplots
fig, axes = plt.subplots(nrows = 2, ncols = 2)

# make the data ready to feed into the bar plot
X_Survived = data.replace({'Survived': {1: 'yes', 0: 'no'}}).groupby('Survived').size().reset_index(name='Counts')['Survived']
Y_Survived = data.replace({'Survived': {1: 'yes', 0: 'no'}}).groupby('Survived').size().reset_index(name='Counts')['Counts']
# make the bar plot
axes[0, 0].bar(X_Survived, Y_Survived)
axes[0, 0].set_title('Survived', fontsize=25)
axes[0, 0].set_ylabel('Counts', fontsize=20)
axes[0, 0].tick_params(axis='both', labelsize=15)

# make the data ready to feed into the bar plot
X_Pclass = data.replace({'Pclass': {1: '1st', 2: '2nd', 3: '3rd'}}).groupby('Pclass').size().reset_index(name='Counts')['Pclass']
Y_Pclass = data.replace({'Pclass': {1: '1st', 2: '2nd', 3: '3rd'}}).groupby('Pclass').size().reset_index(name='Counts')['Counts']
# make the bar plot
axes[0, 1].bar(X_Pclass, Y_Pclass)
axes[0, 1].set_title('Pclass', fontsize=25)
axes[0, 1].set_ylabel('Counts', fontsize=20)
axes[0, 1].tick_params(axis='both', labelsize=15)

# make the data ready to feed into the bar plot
X_Sex = data.groupby('Sex').size().reset_index(name='Counts')['Sex']
Y_Sex = data.groupby('Sex').size().reset_index(name='Counts')['Counts']
# make the bar plot
axes[1, 0].bar(X_Sex, Y_Sex)
axes[1, 0].set_title('Sex', fontsize=25)
axes[1, 0].set_ylabel('Counts', fontsize=20)
axes[1, 0].tick_params(axis='both', labelsize=15)

# make the data ready to feed into the bar plot
X_Embarked = data.groupby('Embarked').size().reset_index(name='Counts')['Embarked']
Y_Embarked = data.groupby('Embarked').size().reset_index(name='Counts')['Counts']
# make the bar plot
axes[1, 1].bar(X_Embarked, Y_Embarked)
axes[1, 1].set_title('Embarked', fontsize=25)
axes[1, 1].set_ylabel('Counts', fontsize=20)
axes[1, 1].tick_params(axis='both', labelsize=15)


Are the numerical variables correlated?

In order to get a sense of whether the numerical variables in our data set are correlated, we will create a Pearson Ranking visualization. 

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 7)

# import the package for visualization of the correlation
from yellowbrick.features import Rank2D

# extract the numpy arrays from the data frame
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
X = data[num_features].to_numpy()

# instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(features=num_features, algorithm='pearson')
visualizer.fit(X)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.show()                   # Draw/show the data (poof() before Yellowbrick 1.0)


From the Pearson ranking figure above, we can see that the correlations between the variables are low (<0.5).
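If you want the numbers behind a figure like this, pandas can compute a Pearson correlation matrix directly. The sketch below is an illustration on made-up values, not the author's method; on the real data you would call `.corr()` on `data[num_features]`. Note that `DataFrame.corr` leaves out missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the four numerical features
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 2, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86],
})

# Pairwise Pearson correlation between all columns
corr = toy.corr(method="pearson")
print(corr.round(2))

# Each variable is perfectly correlated with itself,
# and the matrix is symmetric
print(np.allclose(np.diag(corr), 1.0))
print(np.allclose(corr, corr.T))
```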

Are the distributions of numerical variables the same or different among survived and not survived?

Next, we will compare the distributions of numerical variables between passengers that survived and those that did not survive to see if there are any significant differences. We can do this with a Parallel Coordinates visualization. 

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 7)
plt.rcParams['font.size'] = 50

# set up the colors for the Yellowbrick visualizer
from yellowbrick.style import set_palette
set_palette('sns_bright')

# import packages
from yellowbrick.features import ParallelCoordinates
# Specify the features of interest and the classes of the target
classes = ['Not-survived', 'Survived']
num_features = ['Age', 'SibSp', 'Parch', 'Fare']

# copy data to a new dataframe
data_norm = data.copy()
# mean-normalize each feature: subtract the mean and divide by the range (values fall between -1 and 1)
for feature in num_features:
    data_norm[feature] = (data[feature] - data[feature].mean(skipna=True)) / (data[feature].max(skipna=True) - data[feature].min(skipna=True))

# Extract the numpy arrays from the data frame
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
X = data_norm[num_features].to_numpy()
y = data.Survived.to_numpy()

# Instantiate the visualizer
visualizer = ParallelCoordinates(classes=classes, features=num_features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show the data (poof() before Yellowbrick 1.0)


We can see from the figure above that passengers with more siblings on the boat have a higher death rate. We can also see that passengers that paid a higher fare had a higher survival rate.

Are there different survival rates in different categories?

Speaking of survival rates, how did they differ across our categorical variables? We can get a sense of this by creating faceted stacked barplots for each variable. 

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)

# make subplots
fig, axes = plt.subplots(nrows = 2, ncols = 2)

# make the data ready to feed into the bar plot
Sex_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==1]['Sex'].value_counts()
Sex_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==0]['Sex'].value_counts()
Sex_not_survived = Sex_not_survived.reindex(index = Sex_survived.index)
# make the bar plot
p1 = axes[0, 0].bar(Sex_survived.index, Sex_survived.values)
p2 = axes[0, 0].bar(Sex_not_survived.index, Sex_not_survived.values, bottom=Sex_survived.values)
axes[0, 0].set_title('Sex', fontsize=25)
axes[0, 0].set_ylabel('Counts', fontsize=20)
axes[0, 0].tick_params(axis='both', labelsize=15)
axes[0, 0].legend((p1[0], p2[0]), ('Survived', 'Not-survived'), fontsize = 15)

# make the data ready to feed into the bar plot
Pclass_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}}).replace({'Pclass': {1: '1st', 2: '2nd', 3: '3rd'}})[data['Survived']==1]['Pclass'].value_counts()
Pclass_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}}).replace({'Pclass': {1: '1st', 2: '2nd', 3: '3rd'}})[data['Survived']==0]['Pclass'].value_counts()
Pclass_not_survived = Pclass_not_survived.reindex(index = Pclass_survived.index)
# make the bar plot
p3 = axes[0, 1].bar(Pclass_survived.index, Pclass_survived.values)
p4 = axes[0, 1].bar(Pclass_not_survived.index, Pclass_not_survived.values, bottom=Pclass_survived.values)
axes[0, 1].set_title('Pclass', fontsize=25)
axes[0, 1].set_ylabel('Counts', fontsize=20)
axes[0, 1].tick_params(axis='both', labelsize=15)
axes[0, 1].legend((p3[0], p4[0]), ('Survived', 'Not-survived'), fontsize = 15)

# make the data ready to feed into the bar plot
Embarked_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==1]['Embarked'].value_counts()
Embarked_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==0]['Embarked'].value_counts()
Embarked_not_survived = Embarked_not_survived.reindex(index = Embarked_survived.index)
# make the bar plot
p5 = axes[1, 0].bar(Embarked_survived.index, Embarked_survived.values)
p6 = axes[1, 0].bar(Embarked_not_survived.index, Embarked_not_survived.values, bottom=Embarked_survived.values)
axes[1, 0].set_title('Embarked', fontsize=25)
axes[1, 0].set_ylabel('Counts', fontsize=20)
axes[1, 0].tick_params(axis='both', labelsize=15)
axes[1, 0].legend((p5[0], p6[0]), ('Survived', 'Not-survived'), fontsize = 15)


As we can see from the figures above, more women survived than men, and the death rates for passengers in the 3rd ticket class and for those who embarked at Southampton are much higher than for the others.
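Stacked bars show counts, but a rate is often easier to compare. Because "Survived" is coded as 0 and 1, the mean of that column within each group is the survival rate of the group. Here is a minimal sketch on made-up rows; the numbers are not from the Titanic data.

```python
import pandas as pd

# Made-up rows: 4 women (3 survived) and 4 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column = share of 1s = survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern should work for "Pclass" and "Embarked" by changing the column name passed to `groupby`.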

Feature selection and feature engineering

In this step, we will do several things to our data: drop some features, fill in missing values, apply a log transformation, and use One Hot Encoding for the categorical features.

Dropping Features We Will Not Use

We will delete the features "PassengerId", "Name", "Ticket" and "Cabin" from our model. The reasons are as follows:

  1. "PassengerId": just a series of numbers from 1 - 891 which is used to label each person.

  2. "Name": the names of all the passengers, which might give some information, such as whether some people are related based on their last names. But to keep things simple at this stage, I will skip this feature.

  3. "Ticket" and "Cabin": too many levels with unknown information.
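Dropping the four features listed above can be sketched with `DataFrame.drop`. The one-row table below is a made-up stand-in with the same column names as the Kaggle file, and the name `toy_model` is hypothetical.

```python
import pandas as pd

# Made-up one-row stand-in with the same column names as the dataset
toy = pd.DataFrame({
    "PassengerId": [1],
    "Survived": [0],
    "Pclass": [3],
    "Name": ["Doe, Mr. John"],
    "Sex": ["male"],
    "Age": [22.0],
    "Ticket": ["A/5 00000"],
    "Cabin": [None],
})

# Columns we decided not to feed into the model
unused = ["PassengerId", "Name", "Ticket", "Cabin"]

# drop returns a new DataFrame and leaves the original untouched
toy_model = toy.drop(columns=unused)
print(list(toy_model.columns))
```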

Filling In Missing Values

From EDA, we know there are some missing values in the "Age", "Cabin" and "Embarked" variables. Since we are not going to use the "Cabin" feature, we will just fill in "Age" and "Embarked." I will fill the missing values in "Age" using the median age and fill the missing values in "Embarked" with "S", since only 2 values are missing and "S" is the most common value in the dataset. If you want to know more about missing data, here is an article I liked that is very easy to read.

From here on, I will use functions for data preprocessing so you can reuse them on new test data without the pain of going through the whole process again. These functions can also be combined into pipelines to make things even easier.

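# fill the missing age data with the median value
# (return a new Series and assign it back; calling fillna in place
# on a selected column may not update the DataFrame in recent pandas)
def fill_na_median(data):
    return data.fillna(data.median())

data['Age'] = fill_na_median(data['Age'])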

# check the result
data['Age'].describe()
count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

# fill with the most represented value
# (return a new Series and assign it back, as with the age column)
def fill_na_most(data):
    return data.fillna('S')

data['Embarked'] = fill_na_most(data['Embarked'])

# check the result
data['Embarked'].describe()
count     891
unique      3
top         S
freq      646
Name: Embarked, dtype: object
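One way to reuse preprocessing steps like these on new data is to write each step as a function that takes a whole DataFrame and returns a new one, then chain the steps with `DataFrame.pipe`. This is only a sketch under that assumption: the function names `fill_age_median` and `fill_embarked` are hypothetical, and the table is made up.

```python
import numpy as np
import pandas as pd

def fill_age_median(df):
    """Return a copy with missing ages replaced by the median age."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    return out

def fill_embarked(df, value="S"):
    """Return a copy with missing ports replaced by a fixed value."""
    out = df.copy()
    out["Embarked"] = out["Embarked"].fillna(value)
    return out

# Made-up rows with one missing value in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Embarked": ["S", "C", np.nan],
})

# Each pipe call passes the result of the previous step along
clean = toy.pipe(fill_age_median).pipe(fill_embarked, value="S")
print(clean)
```

Because each step returns a copy, the original table keeps its missing values, which makes it easy to rerun the chain on a separate test set.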

Log Transformation of the Fare

From the histograms, we can see that the distribution of "Fare" is highly right-skewed. One strategy for dealing with highly skewed positive data is a log transformation, which reduces the skewness. Since the minimum fare is 0 and the log of 0 is undefined, we will add 1 to the raw value before taking the log.

# import package
import numpy as np

# log-transformation
def log_transformation(data):
    return data.apply(np.log1p)

data['Fare_log1p'] = log_transformation(data['Fare'])

# check the data
data.describe()

# check the distribution using histogram
# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 5)

plt.hist(data['Fare_log1p'], bins=40)
plt.xlabel('Fare_log1p', fontsize=20)
plt.ylabel('Counts', fontsize=20)
plt.tick_params(axis='both', labelsize=15)

[Figure: histogram of Fare_log1p]

The figure above shows that the data is much less skewed after the log transformation.
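
To put a number on "less skewed", here is a small sketch in plain Python that computes the skewness of a handful of fares before and after the transformation. The fares are hypothetical values chosen to include a free ticket and one very expensive one, and `skewness` is a helper written for this example (population skewness), not a function from the tutorial.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# hypothetical fares, including a free ticket and one very expensive one
fares = [0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33]
fares_log1p = [math.log1p(f) for f in fares]  # log(1 + fare), defined at 0

print(f"skewness before: {skewness(fares):.2f}")
print(f"skewness after:  {skewness(fares_log1p):.2f}")
```

A value near 0 indicates a roughly symmetric distribution, so the second number should be much closer to 0 than the first.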

One Hot Encoding for Categorical Features

I will use One Hot Encoding on the categorical features to transform them into numbers. If you want to know more about One Hot Encoding, here's a Quora question you can follow.

# get the categorical data
cat_features = ['Pclass', 'Sex', "Embarked"]
data_cat = data[cat_features]
data_cat = data_cat.replace({'Pclass': {1: '1st', 2: '2nd', 3: '3rd'}})
# One Hot Encoding 
data_cat_dummies = pd.get_dummies(data_cat)
# check the data
data_cat_dummies.head()
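
To make the idea concrete, here is a minimal sketch of what one-hot encoding does, written in plain Python rather than pandas. The helper `one_hot` and the sample values are made up for illustration; `pd.get_dummies` names its columns in the same `prefix_value` pattern, and depending on the pandas version it fills them with 0/1 or True/False.

```python
def one_hot(values, prefix):
    """Turn a list of categories into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# hypothetical embarkation ports for four passengers
embarked = ['S', 'C', 'S', 'Q']
encoded = one_hot(embarked, prefix='Embarked')
for column, flags in encoded.items():
    print(column, flags)
```

Each passenger ends up with exactly one 1 across the new columns, which is where the name "one hot" comes from.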


Model training and model evaluation

I put model training and model evaluation in one part because Yellowbrick can wrap a model and create clear visualizations of its performance, which makes model evaluation much easier and more fun.

Prepare Data for Model Training and Evaluation

Before we train the model, we will need to split the data into 2 sets: training and validation. We will use the training dataset to train the model and use the validation dataset to evaluate the model.

If you want to know more about training and validation data, please check out this StackExchange question.

# create a combined feature dataset that can be split into training and validation data
# here we combine the numerical features and the dummy features
features_model = ['Age', 'SibSp', 'Parch', 'Fare_log1p']
data_model_X = pd.concat([data[features_model], data_cat_dummies], axis=1)

# create a whole target dataset that can be used for train and validation data splitting
data_model_y = data.replace({'Survived': {1: 'Survived', 0: 'Not_survived'}})['Survived']

# separate data into training and validation and check the details of the datasets
# import packages
from sklearn.model_selection import train_test_split

# split the data
X_train, X_val, y_train, y_val = train_test_split(data_model_X, data_model_y, test_size=0.3, random_state=11)

# number of samples in each set
print("No. of samples in training set: ", X_train.shape[0])
print("No. of samples in validation set:", X_val.shape[0])

# Survived and not-survived
print('\n')
print('No. of survived and not-survived in the training set:')
print(y_train.value_counts())

print('\n')
print('No. of survived and not-survived in the validation set:')
print(y_val.value_counts())
No. of samples in training set:  623
No. of samples in validation set: 268


No. of survived and not-survived in the training set:
Not_survived    373
Survived        250
Name: Survived, dtype: int64


No. of survived and not-survived in the validation set:
Not_survived    176
Survived         92
Name: Survived, dtype: int64
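
The sizes reported above (623 and 268) follow from holding out 30% of the 891 rows. As a rough sketch of what a shuffled split does, the plain-Python helper below shuffles row identifiers with a fixed seed and rounds the validation share up, which reproduces those counts. `shuffle_split` is a hypothetical helper, not scikit-learn's implementation, and the rows it selects will differ from those chosen by `train_test_split` because the random number generators are not the same.

```python
import math
import random

def shuffle_split(rows, test_size=0.3, seed=11):
    """Shuffle a copy of rows, then cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = math.ceil(test_size * len(shuffled))  # validation share, rounded up
    return shuffled[n_val:], shuffled[:n_val]

passenger_ids = range(1, 892)  # 891 rows, matching the count shown earlier
train_ids, val_ids = shuffle_split(passenger_ids)
print(len(train_ids), len(val_ids))
```

Fixing the seed, like `random_state=11` above, makes the split repeatable, so later experiments are compared on the same validation rows.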

Model training and evaluation visualization using Yellowbrick

Since the purpose of this project is to predict if a passenger has survived or not, it's a classification problem. There are lots of algorithms that can be used to do classification modeling. Here I will use logistic regression. This Medium post introduces some additional machine learning algorithms in a way that is very easy to understand. 

There are also lots of metrics you can use to evaluate a model for a classification problem. Here I will use the confusion matrix, precision, recall, F1 score, and the ROC curve. Below are some links about those evaluation metrics:


Confusion Matrix

With Yellowbrick, we can create a visual confusion matrix that will allow us to easily see how well we were able to predict and compute our accuracy score. 

from sklearn.linear_model import LogisticRegression

from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC

# Instantiate the classification model 
model = LogisticRegression()

#The ConfusionMatrix visualizer takes a model
classes = ['Not_survived','Survived']
cm = ConfusionMatrix(model, classes=classes, percent=False)

#Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)

#To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
#and then creates the confusion_matrix from scikit learn.
cm.score(X_val, y_val)

# change fontsize of the labels in the figure
for label in cm.ax.texts:
    label.set_size(20)

#How did we do?
cm.poof()  # poof() was renamed show() in Yellowbrick 1.0

[Figure: confusion matrix for the validation set]

As the confusion matrix shows, the accuracy of the model is (158+68) / (158+24+18+68) = 84.33%. That is much higher than random guessing, which gives 50.00%. Always guessing the majority class does better than random guessing, with an accuracy of 176/268 = 65.67%, but that is still well below our model. The model works reasonably well here, which suggests that a machine learning model is a sensible way to predict who survived in this instance.
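
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the two diagonal counts (158 and 68) come from the accuracy formula, while the placement of the 18 and 24 errors is inferred from the class totals in the validation set (176 not survived, 92 survived) rather than read from the figure.

```python
# (actual, predicted) -> count; off-diagonal placement is inferred from
# the class totals of the validation set, not read from the figure
confusion = {
    ('Not_survived', 'Not_survived'): 158,
    ('Not_survived', 'Survived'): 18,
    ('Survived', 'Not_survived'): 24,
    ('Survived', 'Survived'): 68,
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
model_accuracy = correct / total

# majority baseline: always predict the most common actual class
actual_totals = {}
for (actual, _predicted), n in confusion.items():
    actual_totals[actual] = actual_totals.get(actual, 0) + n
majority_accuracy = max(actual_totals.values()) / total

print(f"model accuracy:    {model_accuracy:.2%}")     # 84.33%
print(f"majority baseline: {majority_accuracy:.2%}")  # 65.67%
```

Comparing against the majority baseline rather than 50% is the fairer test here, because the two classes are not equally common.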

Precision, Recall and F1 Score

We can also visually generate a classification report, which includes precision, recall, and F1 score for our classification model. 

# set the size of the figure and the font size 
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 7)
plt.rcParams['font.size'] = 20

# Instantiate the visualizer
visualizer = ClassificationReport(model, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model on the training data
visualizer.score(X_val, y_val)    # Evaluate the model on the validation data
g = visualizer.poof()             # Draw the figure (poof() was renamed show() in Yellowbrick 1.0)

[Figure: classification report with precision, recall, and F1 score for each class]

The precision, recall, and F1 scores further confirm that the model is performing relatively well.
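
For readers who want to see where these three numbers come from, the sketch below computes them by hand from the confusion matrix counts quoted in the previous section. The values are derived from those counts, not read from the Yellowbrick figure, so treat them as a cross-check; the helper `precision_recall_f1` is written for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# counts inferred from the confusion matrix in the previous section:
# 158 and 68 correct predictions, 18 non-survivors predicted as survivors,
# 24 survivors predicted as non-survivors
per_class = {
    'Survived': precision_recall_f1(tp=68, fp=18, fn=24),
    'Not_survived': precision_recall_f1(tp=158, fp=24, fn=18),
}

for name, (precision, recall, f1) in per_class.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision asks how many of the predicted members of a class were right, recall asks how many of the actual members were found, and F1 is their harmonic mean.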

ROC Curve and AUC

In addition to the previous two methods, we can also evaluate our classification model using a ROC curve for each class. 

# Instantiate the visualizer
visualizer = ROCAUC(model)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model on the training data
visualizer.score(X_val, y_val)    # Evaluate the model on the validation data
g = visualizer.poof()             # Draw the figure (poof() was renamed show() in Yellowbrick 1.0)

[Figure: ROC curves for the logistic regression model]

In the ROC curve above, the dotted diagonal line shows the result expected from random guessing. A good model's curve should lie above that line.
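
The sketch below shows, in plain Python, how an ROC curve and its area (AUC) can be built by sweeping a decision threshold over predicted scores. The labels and scores are hypothetical, the helpers `roc_points` and `auc` are written for this example, and the code assumes binary labels with no tied scores; it is not how Yellowbrick or scikit-learn implement it.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# hypothetical predicted probabilities of survival for six passengers
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

model_auc = auc(roc_points(y_true, scores))
diagonal_auc = auc([(0.0, 0.0), (1.0, 1.0)])  # the random-guess line
print(f"model AUC: {model_auc:.3f}, random-guess AUC: {diagonal_auc:.3f}")
```

The diagonal has an area of 0.5, so an AUC well above 0.5 is another way of saying the curve lies above that line.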

Conclusion

In this post, I briefly introduced what the workflow of a data science project should be like and used the Titanic dataset to show you how to finish a simple project. But it's not over yet. Although the model we trained looked good already, there are still lots of opportunities to improve.

You can do a variety of things to make the results better. Here are some things you can try later to see if they improve the predictions:

  1. Go back to the features I discarded to see if you can dig out more information from there.

  2. Go to the logistic regression page on the scikit-learn website and try the other parameters listed there to see if the model improves.

  3. Try other classification algorithms to see if the performance gets better.
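
As a starting point for items 2 and 3, here is a hedged sketch that compares a few candidate models with cross-validation. Synthetic data from `make_classification` stands in for the Titanic features so the block runs on its own, and the choice of candidates, the values of `C`, and the number of folds are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the real feature matrix and target
X, y = make_classification(n_samples=300, n_features=8, random_state=11)

candidates = {
    "logistic regression, C=1.0": LogisticRegression(C=1.0, max_iter=1000),
    "logistic regression, C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=11),
}

# mean 5-fold cross-validated accuracy for each candidate
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

To use it on the tutorial's data, replace `X` and `y` with `data_model_X` and `data_model_y`, and keep the `results` dictionary as a record of what you tried.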

As I mentioned in the data science project workflow section of this tutorial, you can always go back to any of the steps before model evaluation, tweak something there, and see if the change helps your model. Just a reminder: no matter what change you make, write it down and record the performance. Also, if you have other ideas, don't be afraid to try them.

However, judging whether a model is good takes more than the evaluation metrics I mentioned above. I will discuss that in another post. If you want to know when my next post will be out, you can check my website or follow me on Twitter. I hope you enjoyed this walkthrough tutorial!

Additional Resources

Below are some of the learning materials I used, which are simple and easy for beginners to understand.

  • Quora: How to become a data scientist? - I highly recommend William Chen's answer on Quora. He lists everything you need to become a data scientist. For now, you don't have to panic and try to learn everything he lists from the beginning. You can treat it as a guide or toolkit, go back to it whenever you feel stuck on your project, and see what you can do to improve yourself as a data scientist.

  • Harvard CS109 Data Science Course - The CS109 data science course from Harvard University is a very good way to start building structured knowledge of data science. It also has labs that use Python to solve data science problems, which can improve both your Python and your data science skills.

  • Andrew Ng Machine Learning Basics - Personally, I think Andrew Ng's machine learning course on Coursera.org is the best introductory course if you want to learn some machine learning as a beginner.

  • Math Basics YouTube Videos - The YouTube videos from 3Blue1Brown explain lots of linear algebra, calculus, and other math concepts in a fun and easy-to-understand way. I really enjoyed watching them.

My friend, Adam Morris, who is also a volunteer in the Yellowbrick Research Labs (Spring 2018), made videos related to this tutorial. You can find these videos on his YouTube channel.




 
