NLP Research Lab Part 2: Skip-Gram Architecture Overview

by Don Vetal

Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab. Make sure to check out NLP Research Lab Part 1: Distributed Representations.

Chances are, if you've been working in Natural Language Processing (NLP) or machine learning, you've heard of the class of approaches called Word2Vec. Word2Vec is an implementation of the Skip-Gram and Continuous Bag of Words (CBOW) neural network architectures. At its core, the skip-gram approach is an attempt to characterize a word, phrase, or sentence based on what other words, phrases, or sentences appear around it. In this post, I will provide a conceptual understanding of the inputs and outputs of the skip-gram architecture.

Skip-Gram's Purpose

The purpose of the Skip-Gram Architecture is to train a system to represent all the words in a corpus as vectors. Given a word, it aims to find the probability that the word will show up near another word. From this kind of representation, we can calculate similarities between words or even the correct response to an analogy test.

For example, a typical analogy test might consist of the following:

Tape : Sticky :: Oil : X

In this case, an appropriate response for the value of X might be "Slippery." The output for this model, which is described in detail below, results in a vector of the length of the vocabulary for each word. A practitioner should be able to calculate the cosine distance between two word vector representations to determine similarity.

Here is a simple Python example, where we assume the vocabulary size is 6 and are trying to compare the similarity between two words:

from scipy.spatial.distance import cosine

tape = [0.1, 0.2, 0.1, 0.2, 0.2, 0]
oil = [0.2, 0.1, 0, 0.2, 0.2, 0]

cosine_distance = cosine(tape,oil)

In this case, the cosine distance ends up being 0.1105008200066786.

Guessing the result of an analogy simply uses vector addition and subtraction and then determines the closest word to the resulting vector. For example, to calculate the vector in order to guess the result of an analogy, we might do the following:

import numpy as np

tape = np.array([0.1, 0.2, 0.1, 0.2, 0.2, 0.1])
oil = np.array([0.2, 0.1, 0, 0.2, 0.2, 0.1])
sticky = np.array([0.2, 0.0, 0.05, 0.02, 0.0, 0.0])

result_array = tape + oil - sticky

Then you can simply find the closest word (via cosine distance or other) to the result_array, and that would be your prediction. Both of these examples should give you a good intuition for why skip-gram is incredibly useful. So let's dig into some of the details.

Defining the Skip-Gram Structure

To make things easy to understand, we are going to take a look at another example.

Duct tape works anywhere. Duct tape is magic and should be worshiped.
— Andy Weir, The Martian

In the real world, a corpus that you want to train will be large; at least tens of thousands of words if not larger. If we trained the example above in the real world, it wouldn’t work because it isn’t large enough, but for the purposes of this post, it will do.

Preparing the Corpus for Training

Before you get to the meat of the algorithm, you should be doing some preparatory work with the content, just as you would do for most other NLP-oriented tasks. One might think immediately that they should remove stopwords, or words that are common and have little subject oriented meaning (ex. the, in, I). This is not the case in skip-gram, as the algorithm relies on understanding word distance in a paragraph to generate the right vectors. Imagine if we removed stop words from the sentence "I am the king of the world." The original distance between king and world is 3, but by removing stopwords, the distance between those two words changes to 1. We've fundamentally changed the shape of the sentence.

However, we do probably want to conduct stemming in order to get words down to their core root (stem). This is very helpful in ensuring that two words that have the same stem (ex. ‘run’ and ‘running’) end up being seen as the same word ('run') by the computer.

A simple example using NLTK in Python is provided below.

from nltk import stem

#Initialize an empty list
my_stemmed_words_list = []

#Start with a list of words.
word_list = [duct, tape, works, anywhere, magic, worshiped]

#Instantiate a stemmer object.

#Loop through the words in word_list
for word in word_list:

The result is as follows:

Unstemmed Stemmed (result)
duct duct
tape tape
works work
anywhere anywher
magic magic
worshiped worship

Building the Vocabulary

To build the final vocabulary that will be used for training, we generate a list of all the distinct words in the text after we have stemmed appropriately. To make this example easier to follow, we will sort our vocabulary alphabetically. Sorting the vocabulary in real life provides no benefit and, in fact, can just be a waste of time. In a scenario where we have a 1 billion word vocabulary, we can imagine the sorting taking a long time.

So without any further delay, our vocabulary ends up becoming the following:

[anywher, duct, magic, tape, work, worship]

The following stopwords would also be included: [is, and, should, be]. I'm leaving these out to keep this example simple and small, but in reality, those would be in there as well.

The Training Set

Just like with other statistical learning approaches, you’ll need to develop some methodology for splitting your data into training, validation, and testing sets. In our specific example, we’ll make 2/3 of the total vocabulary our training set through a random selection. So lets suppose our training set ends up being the following after a random selection:

T = [anywher, duct, magic, work]

Which means we have 4 training samples t1 through t4 (T={t1,t2,t3,t4}). The vectors used to feed the input layer of the network are as follows:


The Input

Suppose your only goal was to find the probability that "work" shows up near "tape." You can’t just throw one example at the Neural Network (NN) and expect to get a result that is meaningful. When these systems (NN) are trained, you will eventually be pushing the bulk of the vocabulary (your training set) into the input layer and training the system regardless of the specific question you may be asking.

Our input layer is a vector that is the length of the vocabulary (V) and we have four training samples, one for each word. So the total set of data pushed through the NN during training-time is of size VxT (6x4). During training time, one of the samples in T is input into the system at a time. It is then up to the practitioner to decide if they want to use online training or batch inputs before back-propagating. Back-propagation is discussed in our back-propagation blog post, which will be published soon. For now, don’t worry about those details. The point here is to conceptually grasp the approach.

The insertion into the input layer looks something like the following diagram:


Each sample of array length V (6) represents a single word in the vocabulary and its index location in the unique word vocabulary.

So let's review our objective here. The objective of the Skip-gram model, in the aggregate, is to develop an output array that describes the probability a word in the vocabulary will end up “near” the target word. “Near” defined by many practitioners in this approach is a zone of c words before and after the target word. This is referred to as the context area or context zone. So in the example below, if the context size is 2 (c=2) and our target word is "magic," the words in the context area are C shown below.

C(‘magic’ | c = 2) = [Duct, tape, worship]

The Output

To get to the point where we get a single array that represents the probability of finding any other word in the vocabulary in the context area of the target word, we need to understand exactly what the NN output looks like. To illustrate, I’ve provided the diagram below.


In this diagram, if we choose a context size of one, it means we care about words that appear only directly before and directly after the target word.The output layer includes two distinct sets of output values. If our context size was 5, we’d end up with 60 output values.

Each word in the vocabulary receives two values for our context size of one. To get the score for an individual word in the context area, we can simply sum up the values. So for example, the score for "duct" (v2) showing up within the context area is 0.22 (0.2 + 0.02).

You may have noticed we are calling the results scores instead of probabilities. That is because the raw output of the skip-gram architecture does not produce probabilities that add up to 1. To convert the scores to probabilities, you must conduct a softmax calculation to scale them. The purpose of this post isn’t to describe softmax, so we are just going to pretend the values in the diagram are probabilities.

The Error Calculation

At the end of forward propagation (stay tuned for the forward-propagation blog we have coming up next), you need to calculate an error in order to backpropagate. So how do you do that? It’s actually pretty easy. We already know what the actual probabilities are of finding a word in the context area of a target word based on our corpus. So for example, if we wanted to know the error in the probability of finding "duct" given "magic" as the target word, we would do the following.

SUM of v2 (duct index) = 0.2 + 0.02 => 0.22

In our corpus, the actual probability of finding "duct" in the context area around "magic" is 100% because "magic" is only used once and "duct" is within the context zone. So the absolute error in probability is 1 - 0.22 = 0.78, and the mean squared error (MSE) is 0.61. This error is used in backpropagation which re-calculates the input and output weight matrices.


What I have given you is a conceptual understanding of what the input vectors look like and what the output vectors look like, but there are other components of the algorithm that will be explained in upcoming blog posts.

  1. There is a weight matrix between inputs and the hidden layer.
  2. There is a weight matrix between the hidden layer to the outputs.

The input weight matrix (1) is the matrix that becomes the vectors for each word where each row is a word and the vector is of length H (H is the number of nodes in the hidden layer). The output vectors simply give the score for a word being in the context zone and are really not used for anything other than training and error calculation.

It is important to understand that a practitioner can choose any number of H nodes for the hidden layer; it is a hyperparameter in training. Generally, the more hidden layer nodes you have, the more expressive (but also the more computationally expensive) a vector is. The output weight matrices are not used outside of the context of training. To learn more about these details and what the process of forward propagation is, please check out our forward propagation blog post, which is coming up next.

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!


Subscribe to the DDL Blog

Did you enjoy this post? Don't miss the next one!


Learn data science at work!

On-site training for you and your co-workers on the latest data science, analytics, and machine learning methods and tools.

Need help with Natural Language Processing?

Businesses using skip-gram methods can greatly reduce the computational power required for their models. Click below to learn more about the services we offer and how we can help.

Our Books: