Rapid Text Visualization with Yellowbrick
Manipulate and Plot Data with a single package
Yellowbrick combines the power of scikit-learn with the beauty of matplotlib, providing us with an easier way to quickly visualize and understand our data.
Yellowbrick is a suite of visual diagnostic tools called Visualizers that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines scikit-learn with matplotlib in the best tradition of the scikit-learn documentation, but to produce visualizations for your model.
Imagine you are performing a supervised classification of text data. Let us consider the BBC article data which is classified into sports, tech, business, politics, and entertainment categories. Before we go ahead and build a model to classify the text, we would want to visualize this text data and see if we could observe patterns and maybe even decide on what model to use.
But this process can be difficult because we first need to convert the text data into a set of x and y coordinates. Yellowbrick can do this with a single function.
The first step in understanding text data is to convert it into a format consumable by algorithms (numbers). There are several ways to extract features from text and model it for algorithms to consume. Some of the popular ones include bag of words and TF-IDF.
The Bag of Words (BoW) model just counts the number of times a word appears in a document, and these counts allow us to compare documents and gauge their similarities. However, sometimes the most frequently occurring words in a document might be irrelevant (e.g. stopwords like the, is, etc.).
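To make the stopword problem concrete, here is a minimal bag-of-words sketch using only the Python standard library. The three sentences and the helper names (bag_of_words, shared_count) are made up for illustration and are not part of Yellowbrick or scikit-learn; a real pipeline would use a proper tokenizer such as the one inside CountVectorizer.

```python
from collections import Counter


def bag_of_words(document):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(document.lower().split())


def shared_count(bow_a, bow_b):
    """Number of token occurrences two bags have in common (multiset overlap)."""
    return sum((bow_a & bow_b).values())


# Hypothetical toy corpus: two sports-like sentences and one tech-like sentence.
docs = [
    "the striker scored and the crowd cheered",
    "the keeper saved and the crowd groaned",
    "the chip is faster and the phone is thinner",
]
bags = [bag_of_words(d) for d in docs]

# Overlap between the two sports sentences, and between sports and tech.
print(shared_count(bags[0], bags[1]))
print(shared_count(bags[0], bags[2]))
```

Most of the overlap between the first and third sentence comes from "the" and "and", which is exactly the kind of noise that raw counts cannot filter out.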
The TF-IDF model addresses this problem by measuring how relevant a word is rather than only how frequent it is. TF-IDF is composed of TF (term frequency: how often a term occurs in a document) and IDF (inverse document frequency: the log of the total number of documents divided by the number of documents that contain the term). The IDF factor diminishes the weight of words that occur in many documents and increases the weight of rare words. The TF-IDF of a word is the product of its TF and IDF.
For a term t in a document…
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
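The two formulas above can be sketched directly in plain Python. This is only an illustration on a made-up toy corpus, and the function names (tf, idf, tf_idf) are our own. It assumes the term occurs in at least one document. Note that scikit-learn's TfidfVectorizer uses a smoothed, normalized variant by default, so its numbers will differ from this textbook version.

```python
import math


def tf(term, doc_tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """Natural log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of TF and IDF for one term in one document."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the striker scored and the crowd cheered".split(),
    "the keeper saved and the crowd groaned".split(),
    "the chip is faster and the phone is thinner".split(),
]

# "the" occurs in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("the", corpus[0], corpus))
# "striker" occurs in only one document, so it gets a positive weight.
print(tf_idf("striker", corpus[0], corpus))
```

Even though "the" is the most frequent word in the first sentence, it ends up with no weight at all, while the rarer "striker" carries the signal.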
Now we have converted our text data into vectors, but to plot these documents on a graph we need to convert each n-dimensional vector into a two-dimensional vector using a dimensionality reduction technique.
Dimensionality reduction can be achieved in the following ways:
Feature Elimination: Remove unimportant features.
Feature Selection: Select only important features.
Feature Extraction: Combine existing features to create new ones.
We use several techniques to determine feature importance so that we can eliminate, select, or extract features. Two popular dimensionality reduction techniques for visualization are t-SNE and UMAP.
t-SNE minimizes the divergence between two distributions:
A distribution that measures pairwise similarities of the input objects.
A distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding.
Using this approach, it is able to find patterns in the data by identifying observed clusters based on similarity of data points with multiple features.
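As a rough illustration of the quantity t-SNE minimizes, the sketch below computes the Kullback-Leibler divergence between pairwise similarities in the original space and in a candidate 2-D layout. It is a simplification under stated assumptions: the Gaussian bandwidth is fixed at 1 (real t-SNE tunes it per point from the perplexity), the points are made up, and we only evaluate the objective instead of optimizing it by gradient descent. All function names here are our own.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_similarities(points, kernel):
    """Apply the kernel to every ordered pair i != j and normalize to sum to 1."""
    n = len(points)
    weights = {
        (i, j): kernel(sq_dist(points[i], points[j]))
        for i in range(n) for j in range(n) if i != j
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def gaussian(d2):
    """High-dimensional kernel with a fixed bandwidth of 1 (an assumption)."""
    return math.exp(-d2 / 2.0)


def student_t(d2):
    """Low-dimensional heavy-tailed kernel used for the embedding."""
    return 1.0 / (1.0 + d2)


def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0)


# Hypothetical data: two tight pairs of points in 3-D.
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (3.0, 3.0, 3.0), (3.1, 3.0, 2.9)]
good_2d = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]  # neighbors stay together
bad_2d = [(0.0, 0.0), (3.0, 3.0), (0.1, 0.0), (3.1, 3.0)]   # neighbors split apart

p = pairwise_similarities(high_dim, gaussian)
kl_good = kl_divergence(p, pairwise_similarities(good_2d, student_t))
kl_bad = kl_divergence(p, pairwise_similarities(bad_2d, student_t))
print(kl_good, kl_bad)
```

The layout that keeps neighbors together has the smaller divergence, which is the behavior t-SNE's optimizer is searching for.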
UMAP uses local manifold approximations and patches together their local fuzzy set representations to construct a topological representation of the given higher-dimensional data. It is competitive with t-SNE in terms of visualization quality and arguably preserves more of the global structure with superior runtime. Its topological foundation makes it feasible for larger data sets, and when we are handling text in several languages, UMAP can provide the higher visualization quality that we need.
Dimensionality reduction can lead to a loss of information that would be better represented in higher dimensions, but in higher dimensions it is very difficult to make sense of the data. Apart from UMAP and t-SNE, we could also use techniques like Principal Component Analysis, Linear Discriminant Analysis, and several other methods that use feature extraction to transform higher-dimensional data into lower dimensions.
But doing all this just to make sense of the data can get tedious, so we turn to Yellowbrick and write a simple function that could do this and help us rapidly compare different encoding and dimensionality reduction techniques.
Here is the simple function that will help us do that.
from yellowbrick.text import TSNEVisualizer, UMAPVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


def visualize(dim_reduction, encoding, corpus_data, corpus_target,
              labels=True, alpha=0.7, metric=None):
    # Choose the text vectorizer from the encoding name.
    if 'tfidf' in encoding.lower():
        encode = TfidfVectorizer()
    elif 'count' in encoding.lower():
        encode = CountVectorizer()
    else:
        raise ValueError("encoding must contain 'tfidf' or 'count'")
    docs = encode.fit_transform(corpus_data)

    # Color the points by target label only when labels is True.
    labels = corpus_target if labels is True else None

    # Choose the dimensionality reduction visualizer.
    if 'umap' in dim_reduction.lower():
        viz = UMAPVisualizer() if metric is None else UMAPVisualizer(metric=metric)
    elif 't-sne' in dim_reduction.lower():
        viz = TSNEVisualizer(alpha=alpha)
    else:
        raise ValueError("dim_reduction must contain 'umap' or 't-sne'")

    viz.fit(docs, labels)
    # poof() is deprecated in favor of show() since Yellowbrick 1.0.
    return viz.poof()
I will elaborate on the parameters defined in this function:
dim_reduction - Dimensionality reduction technique used (UMAP, t-SNE).
encoding - The text vectorization technique used ('tfidf' for TF-IDF, 'count' for bag of words).
corpus_data - A list of text data (A list of BBC articles).
corpus_target - The text labels (the type of each BBC article, such as sports).
labels - True differentiates the labels with colors.
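alpha - The opacity of the plotted points (used only for t-SNE in this function). This is not the t-SNE perplexity. Perplexity, which is related to the number of nearest neighbors that is used in other manifold learning algorithms and usually needs to be larger for larger datasets, can be passed to TSNEVisualizer as an additional keyword argument.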
metric - This controls how distance is computed in the ambient space of the input data.
Now let's visualize the BBC corpus and see what it tells us.
With t-SNE and the bag of words vectorizer, we don’t see much differentiation between the clusters because of the stop words that occur commonly in all kinds of articles.
Now let’s remove that stopword bias by using a TF-IDF vectorizer.
Since TF-IDF gives the important keywords more weight, we can see the clusters slowly starting to make sense. There are also lots of intersections between all these clusters. The business and politics topics are clustered close together on one side, tech and entertainment are clustered close as well, while sports is on the other end.
Now let’s try this with UMAP and see if it clusters them differently.
With a UMAP bag of words model, we get a result similar to t-SNE, but here the clusters are closer together and it also runs faster.
When using the TF-IDF vectorization, UMAP shows a much richer result when compared to t-SNE as it has fewer intersecting articles.
We can see the sports topic now clustered further away from the rest of the articles. This may be because of the presence of sports lingo that is used only in sports articles and not in any other kind of article.
We could also view these plots without the color differentiation by setting labels to False.
Now we can fine-tune the perplexity to arrive at a reasonable visualization, then cluster these articles and define a model that predicts an article's type based on the distance between an unlabelled article and the existing clusters.
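The distance-based prediction idea can be sketched as a nearest-centroid rule. This is a minimal illustration, not a recommended model: the 2-D coordinates and labels below are made up, the helper names are our own, and in practice the new article would have to be mapped into the same vector space as the clusters before any distance is meaningful.

```python
import math
from collections import defaultdict


def centroids(points, labels):
    """Average the 2-D coordinates of the points belonging to each label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(point, cluster_centroids):
    """Return the label whose centroid is closest to the point."""
    return min(cluster_centroids,
               key=lambda label: math.dist(point, cluster_centroids[label]))


# Hypothetical embedded articles and their topics.
points = [(0.0, 0.2), (0.4, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels = ["sport", "sport", "tech", "tech"]

cluster_centroids = centroids(points, labels)
print(predict((0.5, 0.5), cluster_centroids))
print(predict((4.0, 4.5), cluster_centroids))
```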
Check out Yellowbrick and try out some cool stuff yourself! :)