Data Exploration with Python, Part 3
Embarking on an Insight-Finding Mission
by Tony Ojeda
Preparing yourself and your data, as we have done thus far in this series, is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.
We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:
- The higher-level categories we created via category aggregations.
- The quintiles we created by binning our continuous variables.
- The clusters we generated via k-means clustering based on numeric variables.
Now, without further ado, let's embark on our insight-finding mission!
Making Our Data Smaller: Filter + Aggregate
One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.
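To make the distinction concrete, here is a minimal sketch of both operations on a tiny, made-up data frame. The column names mirror our vehicles data set, but the values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year': [2015, 2016, 2016, 2016],
    'Vehicle Category': ['SUV', 'SUV', 'Small Car', 'Small Car'],
    'Combined MPG': [22, 24, 30, 34],
})

# Filtering: keep only the 2016 rows and the two columns we care about.
toy_2016 = toy.loc[toy['Year'] == 2016, ['Vehicle Category', 'Combined MPG']]

# Aggregating: group by category, then count the rows and average a numeric field.
summary = toy_2016.groupby('Vehicle Category')['Combined MPG'].agg(['count', 'mean'])
print(summary)
```

Filtering shrinks the data by dropping rows and columns, while aggregating shrinks it by collapsing each group into a single summary row.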
To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.
import pandas as pd

def agg_count(df, group_field):
    # Count the rows in each group and sort from most to least frequent.
    grouped = df.groupby(group_field).size().sort_values(ascending=False)
    # Convert the result to a data frame with readable column names.
    grouped = grouped.reset_index(name='Count')
    return grouped
Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.
vehicles_2016 = vehicles[vehicles['Year'] == 2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')
This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
ax.set_title('Vehicles Manufactured by Category (2016) \n')
Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).
vehicles_1985 = vehicles[vehicles['Year'] == 1985]
category_counts = agg_count(vehicles_1985, 'Vehicle Category')
ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
ax.set_title('Vehicles Manufactured by Category (1985) \n')
If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.
class_counts = agg_count(vehicles_2016, 'Vehicle Class')
ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
ax.set_title('Vehicles Manufactured by Class (2016) \n')
We could also look at vehicle counts by manufacturer.
make_counts = agg_count(vehicles_2016, 'Make')
ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
ax.set_title('Vehicles Manufactured by Make (2016) \n')
What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.
very_efficient = vehicles[vehicles['Fuel Efficiency'] == 'Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')
ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
ax.set_title('Very Fuel Efficient Vehicles by Make \n')
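Filters can also be combined. The sketch below, which again uses a small made-up data frame (the 'Low Efficiency' label is a placeholder of mine, not necessarily one of the quintile names from the last post), shows one way to filter on two fields at once and to match any of several values.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year': [1985, 2016, 2016, 2016],
    'Make': ['Toyota', 'Toyota', 'BMW', 'Honda'],
    'Fuel Efficiency': ['Very High Efficiency', 'Very High Efficiency',
                        'Low Efficiency', 'Very High Efficiency'],
})

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than ==.
mask = (toy['Year'] == 2016) & (toy['Fuel Efficiency'] == 'Very High Efficiency')
efficient_2016 = toy[mask]

# isin() keeps the rows whose value matches any item in a list.
selected_makes = toy[toy['Make'].isin(['Toyota', 'Honda'])]
```

Either filtered data frame could then be passed to agg_count exactly as before.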
What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.
def agg_avg(df, group_field, calc_field):
    # Average calc_field within each group.
    grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
    # Sort from the highest average to the lowest.
    grouped = grouped.sort_values(calc_field, ascending=False)
    grouped.columns = [group_field, 'Avg ' + str(calc_field)]
    return grouped
We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.
category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')
ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
ax.set_title('Average Combined MPG by Category (2016) \n')
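If you find yourself wanting sums, medians, or maximums as well, one option is to generalize the idea rather than write a new function for each calculation. The agg_calc function below is a hypothetical helper of my own (it is not part of the earlier posts) that takes the name of a pandas aggregation as a string; treat it as a sketch rather than the definitive approach.

```python
import pandas as pd

def agg_calc(df, group_field, calc_field, func='mean'):
    """Group by group_field and apply a named aggregation to calc_field.

    func is assumed to be a string that pandas understands,
    such as 'mean', 'sum', 'median', 'min', or 'max'.
    """
    result_name = f'{func} {calc_field}'
    grouped = (
        df.groupby(group_field)[calc_field]
          .agg(func)
          .sort_values(ascending=False)
          .reset_index(name=result_name)
    )
    return grouped

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'Vehicle Category': ['SUV', 'SUV', 'Small Car'],
    'Combined MPG': [20, 24, 30],
})
category_max_mpg = agg_calc(toy, 'Vehicle Category', 'Combined MPG', func='max')
print(category_max_mpg)
```

The trade-off is that the output column name now depends on the aggregation you choose, so any plotting code needs to reference that name.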
Pivoting the Data for More Detail
Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.
First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.
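import numpy as np

def pivot_count(df, rows, columns, calc_field):
    # Count the calc_field entries for each row/column combination,
    # then drop any rows that have no data at all.
    df_pivot = df.pivot_table(values=calc_field, index=rows, columns=columns,
                              aggfunc=np.size).dropna(axis=0, how='all')
    return df_pivot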
We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post as the rows, the Engine Size quintiles as the columns, and a count of the vehicles that had a Combined MPG value in each cell.
effic_size_pivot = pivot_count(vehicles_2016, 'Fuel Efficiency', 'Engine Size', 'Combined MPG')
This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g', ax=ax)
ax.set(xlabel='\n Engine Size')
ax.set_title('Fuel Efficiency vs. Engine Size (2016) \n')
Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.
effic_size_pivot = pivot_count(vehicles_1985, 'Fuel Efficiency', 'Engine Size', 'Combined MPG')
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g', ax=ax)
ax.set(xlabel='\n Engine Size')
ax.set_title('Fuel Efficiency vs. Engine Size (1985) \n')
With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.
effic_size_category = pivot_count(vehicles_2016, ['Engine Size', 'Fuel Efficiency'],
                                  'Vehicle Category', 'Combined MPG')
fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(effic_size_category, annot=True, fmt='g', ax=ax)
ax.set(xlabel='\n Vehicle Category')
ax.set_title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')
In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.
As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.
make_category_pivot = pivot_count(vehicles_2016, 'Make', 'Vehicle Category', 'Combined MPG')
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(make_category_pivot, annot=True, fmt='g', ax=ax)
ax.set(xlabel='\n Vehicle Category')
ax.set_title('Make vs. Vehicle Category (2016) \n')
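As a quick cross-check on the pivot counts in this section, pandas also offers pd.crosstab, which counts how many rows fall into each combination of two categorical fields. The sketch below uses made-up values and shortened labels of my own rather than our actual quintile names.

```python
import pandas as pd

# Hypothetical toy data; the labels and values are made up for illustration.
toy = pd.DataFrame({
    'Fuel Efficiency': ['High', 'High', 'Low', 'Low', 'Low'],
    'Engine Size': ['Small', 'Small', 'Large', 'Large', 'Small'],
})

# Rows come from the first argument, columns from the second.
counts = pd.crosstab(toy['Fuel Efficiency'], toy['Engine Size'])
print(counts)
```

One difference worth noting is that crosstab fills empty combinations with 0, whereas the pivot tables above leave them empty (NaN), which shows up as blank cells in the heatmaps.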
Visualizing Changes Over Time
So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.
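def multi_line(df, x, y):
    # Count the rows for each x/y combination and plot one line per y value.
    ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")
    # Return the axes so the caller can customize labels and titles.
    return ax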
Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.
multi_line(vehicles, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Vehicle Categories Over Time \n')
We can see from the chart that Small Cars have generally dominated across the board, with a small decline in the late 90s followed by a recovery in the early 2000s. We can also see the introduction and increase in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.
If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the largest number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon.
bmw = vehicles[vehicles['Make'] == 'BMW']
multi_line(bmw, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('BMW Vehicle Categories Over Time \n')
We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, which started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture fewer of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.
toyota = vehicles[vehicles['Make'] == 'Toyota']
multi_line(toyota, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Toyota Vehicle Categories Over Time \n')
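If you would like to see the numbers behind multiline charts like these, you can build the same year-by-category table without plotting it. Here is a small sketch on made-up data; note that I've chosen to fill missing combinations with 0, which is an assumption on my part and differs from the chart function above, where missing combinations are left empty.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year': [1985, 1985, 1985, 2016, 2016],
    'Vehicle Category': ['Small Car', 'Small Car', 'SUV', 'SUV', 'SUV'],
})

# Count rows per (Year, Vehicle Category) pair, then spread the
# categories across the columns so that each row is one year.
yearly_counts = (
    toy.groupby(['Year', 'Vehicle Category'])
       .size()
       .unstack('Vehicle Category', fill_value=0)
)
print(yearly_counts)
```

Each column of this table corresponds to one line on the chart, and each row to one point along the x-axis.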
Examining Relationships Between Variables
The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations is, what typical combinations of numerical values exist in our data, and which combinations are anomalies.
For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but we can also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.
select_columns = ['Engine Displacement', 'Cylinders', 'Fuel Barrels/Year',
                  'City MPG', 'Highway MPG', 'Combined MPG',
                  'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']
# In seaborn 0.9 and later, the size argument is named height.
sns.pairplot(vehicles[select_columns], hue='Cluster Name', height=3)
From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.
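If you want numbers to go with the visual impression, a correlation matrix is one way to get them. The sketch below uses made-up values in which fuel use is exactly proportional to 1 / MPG (the constant 330 is arbitrary), so the relationship is hyperbolic rather than linear. It compares the Pearson coefficient, which measures linear association, with the Spearman coefficient, which is based on ranks.

```python
import pandas as pd

# Hypothetical toy data: fuel use is proportional to 1 / MPG.
# The constant 330 is arbitrary and chosen only for illustration.
toy = pd.DataFrame({'Combined MPG': [10, 15, 20, 30, 40, 50]})
toy['Fuel Barrels/Year'] = 330 / toy['Combined MPG']

# Pearson measures how well a straight line describes the relationship.
pearson = toy.corr(method='pearson')

# Spearman only looks at rank order, so any strictly decreasing
# relationship scores -1 no matter how curved it is.
spearman = toy.corr(method='spearman')

print(pearson.loc['Combined MPG', 'Fuel Barrels/Year'])
print(spearman.loc['Combined MPG', 'Fuel Barrels/Year'])
```

In this toy example the Spearman value is -1 while the Pearson value is weaker, which is a hint that the relationship is monotonic but curved. The same two calls should work on the numeric columns of the vehicles data frame.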
Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.
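# Recent seaborn versions require x and y as keyword arguments and use height instead of size.
sns.lmplot(x='Engine Displacement', y='Combined MPG', data=vehicles,
           hue='Cluster Name', height=8, fit_reg=False)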
In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.
We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!
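As a starting point for that exercise, here is one way you might pull out the records at the extremes. The sketch uses a made-up data frame with placeholder makes (A through E); with the real data you would call the same methods on the vehicles data frame.

```python
import pandas as pd

# Hypothetical toy data; the makes and values are made up for illustration.
toy = pd.DataFrame({
    'Make': ['A', 'B', 'C', 'D', 'E'],
    'Engine Displacement': [1.0, 1.6, 2.5, 5.7, 8.0],
    'Combined MPG': [50, 35, 28, 15, 10],
})

# The two most fuel-efficient records (the top left of the scatter plot).
most_efficient = toy.nlargest(2, 'Combined MPG')

# The two least fuel-efficient records (the bottom right of the scatter plot).
least_efficient = toy.nsmallest(2, 'Combined MPG')

print(most_efficient)
print(least_efficient)
```

From there, you can inspect the other fields of those rows to get a sense of what kinds of vehicles sit at each extreme.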
We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!
Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics.