Word2Vec: NLP with Contextual Understanding
In the quest to make machines understand human language, traditional methods such as one-hot encoding have proven inadequate. These methods, which encode words as sparse vectors with a single ‘1’ and numerous ‘0s,’ fail to capture the inherent complexities and relationships within language. Words lose their contextual essence, and the sheer volume of data becomes overwhelming for models to process efficiently. The need for Word2Vec models and word embeddings arises from this limitation. In NLP, words are not isolated entities; they derive meaning from their context and relationships with other words. Traditional methods struggle to preserve these nuances, limiting the capability of models to understand human language.
To grasp the significance of embeddings, let’s consider two examples:
- He offered his friend a cup of coffee.
- He offered his friend a cup of tea.
These sentences are quite similar since “coffee” and “tea” are both beverages. However, if we represent the words as independent entities using one-hot vectors, the similarity between “coffee” and “tea” would be zero. This is the challenge that word embeddings aim to address: accurately capturing the meanings of words and their relationships.
Covered in this article:
- What are Word Embeddings?
- How to create Word Embeddings?
- Word2Vec Models
- Skip-Gram and the Negative Sampling method
- GloVe Model
What are Word Embeddings?
Word Embeddings represent a paradigm shift in how machines encode and understand words. Unlike traditional methods that treat words as isolated units, Word Embeddings capture the essence of words by assigning them dense vectors in a continuous space. The key idea is to embed words in a multi-dimensional space where their proximity reflects semantic similarity.
Word Embeddings capture not only the syntactic relationships but also the semantic relationships between words. For instance, in a well-trained Word2Vec model, the vectors representing ‘king’ and ‘queen’ would be closer to each other than those representing ‘king’ and ‘apple,’ reflecting the semantic relationship between royalty terms.
Advantages of Word Embeddings
The superiority of Word Embeddings lies in their ability to encapsulate rich semantic information and contextual nuances.
1. Semantic Similarity: Word Embeddings excel at capturing semantic relationships between words. Similar words are embedded closer together in the vector space, allowing models to understand and leverage semantic connections more effectively.
2. Contextual Information: Unlike one-hot encoding, which treats each word in isolation, Word Embeddings consider the context in which words appear. They capture the meaning of a word based on its surroundings, allowing models to grasp the contextual significance of words.
3. Generalization: Word Embeddings generalize well to unseen words or contexts. The model learns to infer similarities and relationships from the contexts in which words appear, enabling it to make educated guesses about words it has not encountered during training.
4. Dimensionality Reduction: Traditional methods often result in high-dimensional, sparse vectors. Word Embeddings, on the other hand, represent words in a lower-dimensional, dense space, reducing the computational complexity while retaining meaningful information.
Example of Word Embedding
Imagine we have a Word Embedding model that represents words in a three-dimensional space; in practice, word vectors usually have far more dimensions.
Suppose we have the following word vectors in this three-dimensional space:
car = [0.8, 0.8, 0.7]
bus = [0.75, 0.7, 0.8]
tree = [0.25, 0.01, 0.4]
Now, let’s interpret what the dimensions of these vectors could mean:
- Dimension 1: Represents size or scale.
- Dimension 2: Represents movement speed.
- Dimension 3: Represents color.
It’s important to note that the specific meanings of these dimensions are learned during the training of the word embeddings model and may not have explicit human-interpretable labels.
Now, let’s calculate the cosine similarity between pairs of these words:
- cosine_similarity(car, bus) ≈ 0.99
- cosine_similarity(car, tree) ≈ 0.78
- cosine_similarity(bus, tree) ≈ 0.84
The high cosine similarity (approximately 0.99) between “car” and “bus” indicates that these words are very similar in meaning, sharing characteristics associated with modes of transportation. The noticeably lower cosine similarities (approximately 0.78 and 0.84) between “tree” and each of “car” and “bus” show that “tree” is less similar to both vehicles, aligning with their disparate meanings.
In the traditional one-hot vector methods, the similarity score between “car” and “bus” would be zero, as such methods represent each word as an independent entity without capturing semantic relationships.
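To see these numbers come out of actual code, here is a minimal sketch using only NumPy; the toy vectors are the ones defined above.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

car  = np.array([0.80, 0.80, 0.70])
bus  = np.array([0.75, 0.70, 0.80])
tree = np.array([0.25, 0.01, 0.40])

print(f"car vs bus : {cosine_similarity(car, bus):.2f}")   # ~0.99
print(f"car vs tree: {cosine_similarity(car, tree):.2f}")  # ~0.78
print(f"bus vs tree: {cosine_similarity(bus, tree):.2f}")  # ~0.84
```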
Word2Vec Models
Word2Vec is a popular word embedding technique that transforms words into dense vectors, capturing semantic relationships between them. The two main architectures used by Word2Vec are Continuous Bag of Words (CBOW) and Skip-Gram.
Continuous Bag of Words (CBOW):
In the CBOW architecture, the model predicts the target word (central word) based on the context words (surrounding words). The input to the model is a context window of words, and the output is the target word. The objective is to maximize the probability of predicting the target word given its context.
Skip-Gram:
In the Skip-Gram architecture, the model predicts the context words given a target word. Unlike CBOW, the input is a target word, and the output is a set of context words. The objective is to maximize the probability of predicting context words given the target word.
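If you want to experiment with both architectures before diving into the details, the gensim library exposes them through a single class. This is only a quick sketch: the toy corpus and hyperparameter values are arbitrary, and the parameter names follow gensim 4.x, where `sg=0` selects CBOW and `sg=1` selects Skip-Gram.

```python
from gensim.models import Word2Vec

# A tiny toy corpus: each sentence is a list of tokens.
sentences = [
    ["he", "offered", "his", "friend", "a", "cup", "of", "coffee"],
    ["he", "offered", "his", "friend", "a", "cup", "of", "tea"],
    ["she", "drank", "a", "cup", "of", "tea", "in", "the", "morning"],
]

# sg=0 -> CBOW: predict the center word from its context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the context words from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["coffee"][:5])           # first five dimensions of the embedding
print(skipgram.wv.most_similar("coffee"))  # nearest neighbours in the toy space
```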
How Do Word2Vec Models Learn Word Embeddings?
Word2Vec models employ a clever strategy known as the “fake task trick” to learn distributed representations of words in a continuous vector space. In simpler terms, this technique involves training a neural network for a specific task, but the primary objective is not to perform that task; rather, the focus is on capturing semantic relationships between words through the learned weights in the hidden layer.
Word2Vec Skip-Gram Workflow
1. Context Extraction:
A fixed-size context window is moved through the training data, extracting context words around each target word.
- Context: Context refers to the words surrounding a target word. The context provides the model with information about the word’s meaning based on its usage in different contexts.
- Window Size Parameter: The window size determines the number of words considered as context on each side of the target word. A larger window captures a broader context, while a smaller window focuses on more immediate surroundings. Adjusting the window size impacts the level of granularity in semantic relationships captured by the embeddings.
2. Training Objective:
- The model is trained to predict the context words based on the target word.
- The objective is to maximize the probability of the true context words given the target word.
3. Output Layer:
- Represents the predicted probabilities of each word in the vocabulary being a context word.
- Utilizes a softmax activation function to convert raw scores into probabilities.
4. Word Embeddings:
- The weights in the hidden layer, which act as word embeddings, are extracted for each word in the vocabulary.
- The resulting word embeddings of the hidden layer capture semantic relationships, allowing words with similar meanings to have similar vector representations.
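To make step 1 of this workflow concrete, here is a small sketch in plain Python (the helper name `generate_pairs` is just for illustration) that slides a window over a tokenized sentence and emits the (target, context) pairs Skip-Gram trains on.

```python
def generate_pairs(tokens, window_size=2):
    """Slide a fixed-size window over the tokens and collect (target, context) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window_size` words on each side of the target word.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["he", "offered", "his", "friend", "a", "cup", "of", "tea"]
for target, context in generate_pairs(tokens, window_size=2)[:6]:
    print(target, "->", context)
```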
Word2Vec Skip-Gram Model Architecture:
The architecture comprises a shallow neural network with three layers:
| Layer | Description |
| --- | --- |
| Input Layer | Represents the target word as a one-hot vector. |
| Hidden Layer | Acts as an embedding lookup layer, transforming the input target word into a continuous vector representation. The weights learned in this layer capture semantic relationships between words. |
| Output Layer | Represents the predicted probabilities of each word in the vocabulary being a context word. |
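The table maps directly onto two weight matrices. The sketch below (NumPy, with a made-up vocabulary size and embedding dimension) shows the shapes involved and how the one-hot input reduces the hidden layer to a simple row lookup.

```python
import numpy as np

V, E = 10, 4                       # toy vocabulary size and embedding dimension
W = np.random.randn(V, E) * 0.01   # input -> hidden weights: one E-dimensional embedding per word
U = np.random.randn(E, V) * 0.01   # hidden -> output weights: one output vector per word

target_idx = 3
x = np.zeros(V)
x[target_idx] = 1.0                # one-hot input for the target word

h = x @ W                            # hidden layer: identical to W[target_idx], i.e. an embedding lookup
z = h @ U                            # output layer: one raw score per vocabulary word
y_hat = np.exp(z) / np.exp(z).sum()  # softmax turns the scores into probabilities

print(h.shape, z.shape, round(y_hat.sum(), 6))  # (4,) (10,) 1.0
```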
Learning Word Embeddings with Skip-Gram
Word2Vec is an iterative model: it refines its understanding of word relationships and meanings through a series of updates, progressively enhancing the quality of the word embeddings over multiple training iterations.
Training Steps:
1- The first step is tokenization and building the vocabulary. You can read more about tokenization techniques in the Tokenization article.
2- Then, generate the one-hot word vector for the input center word.
3- Next, we get the embedded word vector $v$ for the center word.
- Multiplying the one-hot vector by the weight matrix $W$ essentially acts as a lookup, retrieving the row of the matrix that corresponds to the position of the ‘1’ in the vector.
4- After getting our embedding, generate a score vector $z = Uv$ (where $U$ is the second weight matrix, containing an embedding for each word in our vocabulary).
5- Then, turn the scores into probabilities using softmax: $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$.
6- Finally, define the objective function and minimize the loss; we aim to make the resulting probabilities for the true context words as high as possible given the center word.
Objective function
For a target (center) word $w_t$, the Skip-Gram model defines the probability of a context word as:

$$P(c_{t,j} \mid w_t) = \frac{\exp\left(v_{c_{t,j}} \cdot v_{w_t}\right)}{\sum_{i=1}^{|V|} \exp\left(v_i \cdot v_{w_t}\right)} \qquad (1)$$

- $P(c_{t,j} \mid w_t)$: the conditional probability of observing a context word $c_{t,j}$ given the target word $w_t$. In the context of Skip-Gram, it measures the likelihood of encountering the context word $c_{t,j}$ around the target word $w_t$.
- $\exp\left(v_{c_{t,j}} \cdot v_{w_t}\right)$: the exponential of the dot product between the vector representation $v_{c_{t,j}}$ of the context word $c_{t,j}$ and the vector representation $v_{w_t}$ of the target word $w_t$. It represents the similarity or compatibility between the target word and the context word.
- $\sum_{i=1}^{|V|} \exp\left(v_i \cdot v_{w_t}\right)$: the denominator, which sums the exponentials of the dot products between the vector representations $v_i$ of all words in the vocabulary and the vector representation $v_{w_t}$ of the target word. It serves as a normalization factor, ensuring that the probabilities sum to 1 over all possible context words.
This iterative refinement is repeated for a specified number of iterations, each iteration contributing to the progressive enhancement of the Word2Vec model’s comprehension of word relationships and semantic nuances.
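Continuing the NumPy sketch from the architecture section, one iterative update for a single (center, context) pair boils down to a gradient step on the cross-entropy loss $-\log \hat{y}_{\text{context}}$; the context index and learning rate below are arbitrary.

```python
# Continues the forward-pass sketch above (W, U, x, h, z, y_hat, target_idx).
context_idx = 7
loss = -np.log(y_hat[context_idx])   # -log P(context | center), i.e. equation (1) as a loss

y_true = np.zeros(V)
y_true[context_idx] = 1.0
dz = y_hat - y_true                  # gradient of the loss with respect to the scores z

grad_U = np.outer(h, dz)             # shape (E, V): the full softmax touches every output vector
grad_h = U @ dz                      # gradient flowing back into the hidden layer

lr = 0.05
U -= lr * grad_U
W[target_idx] -= lr * grad_h         # only the center word's input embedding changes

print(f"loss for this pair: {loss:.3f}")
```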
Computational Challenges:
The Skip-Gram model involves a large vocabulary, making the computation of the softmax function infeasible due to its computational complexity. Specifically, the softmax function requires evaluating an exponential for each word in the vocabulary, leading to high time and memory requirements. This challenge becomes more pronounced as the vocabulary size increases.
Let’s consider an example:
Consider a scenario where the vocabulary size $V$ is 10,000 and the embedding dimension $E$ is 300. The output layer computes the dot product between the target word’s vector and the output vector of every word in the vocabulary, producing a similarity score for each.
A softmax function is applied during each training iteration to convert these similarity scores into a probability distribution; this process involves $V$ multiplications and $V - 1$ additions for each training instance. Additionally, the $10{,}000 \times 300 = 3{,}000{,}000$ values in the embedding matrix must be updated during backpropagation at every iteration.
This computational demand escalates rapidly with larger vocabulary sizes, higher embedding dimensions, and an increased number of iterations, emphasizing the scaling challenges associated with these factors.
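A quick back-of-the-envelope check of the numbers used above:

```python
V, E = 10_000, 300

softmax_products_per_instance = V   # one multiplication per vocabulary word when normalizing the scores
embedding_values_to_update = V * E  # parameters of the output embedding matrix touched by backpropagation

print(softmax_products_per_instance)  # 10000
print(embedding_values_to_update)     # 3000000
```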
To address the computational challenges, negative sampling was introduced as an alternative to the traditional softmax function. The idea is to simplify the objective function by transforming it into a binary classification problem.
Negative Sampling
Negative sampling transforms the objective function into a binary classification problem. Instead of predicting the correct context word from the entire vocabulary, the model learns to distinguish between true context words and randomly sampled negative words.
For each positive training instance, we randomly sample $K$ words from the vocabulary to act as negative examples. These negative examples are words not present in the context of the positive example.
Now, instead of updating the whole vocabulary at each iteration, only the positive sample and the $K$ negative samples are updated.
The objective to be maximized becomes:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] \qquad (2)$$

- $\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)$: computes the log-sigmoid of the dot product between the output vector representation $v'_{w_O}$ of the observed context word $w_O$ and the vector representation $v_{w_I}$ of the input word $w_I$. This term measures how well the model predicts the presence of the context word given the input word.
- $\sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$: a sum over the $K$ negative samples. For each negative sample $w_i$ drawn from the noise distribution $P_n(w)$, it computes the expected value of the log-sigmoid of the negated dot product $-{v'_{w_i}}^{\top} v_{w_I}$. This term measures how well the model predicts the absence of the negative samples given the input word.
- $P_n(w)$ represents the noise distribution used to sample negative words during training.
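Here is a minimal sketch of this objective in NumPy (the dimensions, the random vectors, and the convention of returning the negated objective as a loss are all illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_input, v_pos, v_negs):
    """Negated form of equation (2), so that lower values are better.
    v_input: embedding of the input (center) word
    v_pos:   output vector of the observed context word
    v_negs:  output vectors of the K sampled negative words, shape (K, E)"""
    positive_term = np.log(sigmoid(np.dot(v_pos, v_input)))
    negative_term = np.sum(np.log(sigmoid(-v_negs @ v_input)))
    return -(positive_term + negative_term)

E, K = 300, 5
rng = np.random.default_rng(0)
v_input = rng.normal(scale=0.1, size=E)
v_pos = rng.normal(scale=0.1, size=E)
v_negs = rng.normal(scale=0.1, size=(K, E))

print(f"loss: {negative_sampling_loss(v_input, v_pos, v_negs):.4f}")
```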
The noise distribution itself is defined as:

$$P_n(w) = \frac{U(w)^{3/4}}{Z} \qquad (3)$$

- $U(w)$ represents the unigram count of the word $w$, i.e., the number of times $w$ appears in the corpus.
- The exponent $3/4$ is used to smooth out the distribution.
- $Z$ is the normalization constant.
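Equation (3) is straightforward to implement; the sketch below builds the smoothed unigram distribution from raw counts and draws K negatives from it (the `counts` dictionary is a made-up stand-in for counts collected from a real corpus).

```python
import numpy as np

# Hypothetical unigram counts U(w) collected from a corpus.
counts = {"the": 5000, "cup": 120, "coffee": 80, "tea": 75, "offered": 40}
words = list(counts)

unigram = np.array([counts[w] for w in words], dtype=float)
smoothed = unigram ** 0.75               # U(w)^(3/4)
noise_dist = smoothed / smoothed.sum()   # divide by Z so the probabilities sum to 1

rng = np.random.default_rng(42)
K = 5
negatives = rng.choice(words, size=K, p=noise_dist)  # sample K negative words

print(dict(zip(words, noise_dist.round(3))))
print("sampled negatives:", list(negatives))
```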
Benefits of Negative Sampling:
- Computational Efficiency: Negative sampling reduces the computational cost by sampling a small number of negative examples rather than considering the entire vocabulary.
- Training Speed: The model converges faster because it focuses on a binary classification task, making each training iteration computationally lighter.
- Memory Efficiency: Negative sampling reduces memory requirements, allowing the model to scale well with larger vocabularies.
Transitioning from Skip-gram to GloVe marks a shift in the methodology of word embedding generation within the realm of Natural Language Processing (NLP). While Skip-gram, a popular neural network-based approach, focuses on predicting context words given a target word, GloVe (Global Vectors for Word Representation) takes a different path.
GloVe leverages co-occurrence statistics in a corpus to generate word embeddings. Unlike Skip-gram, which relies on training neural networks to predict word-context pairs, GloVe aims to directly capture the statistical relationships between words through the construction of a co-occurrence matrix. This matrix reflects the frequency of word co-occurrences across the corpus, offering a comprehensive snapshot of word interactions.
GloVe
GloVe, or Global Vectors for Word Representation, is a word embedding technique designed to capture the global statistical information of word co-occurrences within a corpus. Unlike some other methods that focus on local contexts, such as Skip-Gram or Continuous Bag of Words (CBOW), GloVe emphasizes the overall distributional patterns of words in a corpus.
It operates on the principle that words with similar meanings tend to co-occur frequently in various contexts. By encoding these global statistical relationships, GloVe aims to produce word vectors that reflect semantic similarities more accurately.
The key idea behind GloVe is to construct a co-occurrence matrix based on word pairs and their respective frequencies. The model is then trained to learn word vectors by minimizing the difference between the dot product of these vectors and the logarithm of the observed co-occurrence probabilities. This objective effectively captures the underlying semantic structure of words in the corpus.
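To make the co-occurrence idea concrete, the sketch below builds a small symmetric co-occurrence matrix with a fixed window; the tiny corpus and window size are arbitrary choices for illustration.

```python
import numpy as np

corpus = [
    ["he", "offered", "his", "friend", "a", "cup", "of", "coffee"],
    ["he", "offered", "his", "friend", "a", "cup", "of", "tea"],
]
window = 2

# Build the vocabulary and assign an index to each word.
vocab = sorted({w for sentence in corpus for w in sentence})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word i and word j appear within `window` positions of each other.
X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[idx[word], idx[sentence[j]]] += 1

print(X[idx["cup"], idx["coffee"]])  # how often "cup" and "coffee" co-occur
```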
Training of GloVe
1- Initialization:
- Randomly initialize the word vectors $v_i$ (along with the context vectors and bias terms) for each word in the vocabulary.
- The embedding dimension in GloVe can be varied based on the specific requirements and characteristics of the dataset.
2- Co-occurrence Matrix Construction:
- Based on the chosen window size, construct the co-occurrence matrix $X$.
- The co-occurrence matrix represents the observed frequency of word co-occurrences within a given context window.
- Each entry $X_{ij}$ in the matrix indicates how often word $i$ and word $j$ appear together within the chosen context window.
3- Objective Function:
The objective function aims to minimize the difference between the model’s predictions and the logarithm of the observed co-occurrence probabilities in the corpus.
$$J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( v_i^{\top} u_j + b_i + b_j - \log X_{ij} \right)^2 \qquad (4)$$

where:
- $V$ is the vocabulary size.
- $X_{ij}$ is the co-occurrence count for words $i$ and $j$.
- $f(X_{ij})$ is a weighting function applied to the co-occurrence count.
- $v_i$ and $u_j$ are the word vectors for words $i$ and $j$, respectively.
- $b_i$ and $b_j$ are bias terms for words $i$ and $j$.
The purpose of the weighting function is to assign little importance to rare co-occurrences and to cap the influence of very frequent ones.
4- Training Iterations:
- Iterate over the non-zero entries of the co-occurrence matrix.
- For each pair of words $i$ and $j$:
  - Calculate the components of the objective function $J_{ij}$.
  - Compute the gradients of the objective function with respect to the word vectors (and bias terms).
  - Update the word vectors using the gradients and a learning rate (a minimal training sketch follows this list).
5- Convergence Check:
- Convergence ensures that the model has learned meaningful representations of word co-occurrences in the corpus.
6- Final Word Vectors:
- The resulting word vectors $v_i$ represent the learned embeddings that capture semantic relationships between words based on their co-occurrence patterns.
- These word vectors can be used for downstream natural language processing tasks such as sentiment analysis, machine translation, and text generation.
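Putting the training steps together, here is a compact sketch of the update loop. It assumes the values $x_{\max} = 100$ and $\alpha = 0.75$ commonly used for the weighting function; everything else (the random co-occurrence counts, the learning rate, the number of epochs) is an illustrative placeholder rather than a faithful reimplementation of GloVe.

```python
import numpy as np

def weight(x, x_max=100, alpha=0.75):
    # Weighting function f(X_ij): small weight for rare pairs, capped at 1 for frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_epoch(X, v, u, b, c, lr=0.05):
    """One epoch of stochastic updates on equation (4). v, u: word/context vectors; b, c: biases."""
    total = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        diff = v[i] @ u[j] + b[i] + c[j] - np.log(X[i, j])
        w = weight(X[i, j])
        total += w * diff ** 2
        grad = 2.0 * w * diff
        vi, uj = v[i].copy(), u[j].copy()   # keep pre-update copies for symmetric gradients
        v[i] -= lr * grad * uj
        u[j] -= lr * grad * vi
        b[i] -= lr * grad
        c[j] -= lr * grad
    return total

V, E = 8, 10
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(V, V)).astype(float)   # stand-in co-occurrence counts
v = rng.normal(scale=0.1, size=(V, E))
u = rng.normal(scale=0.1, size=(V, E))
b, c = np.zeros(V), np.zeros(V)

for epoch in range(10):
    loss = glove_epoch(X, v, u, b, c)
print(f"final loss: {loss:.3f}")
```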
Applications of Word Embedding
Word embeddings, such as those generated by Word2Vec, GloVe, and similar techniques, have found wide-ranging applications across various domains in natural language processing (NLP) and machine learning.
1- Text Classification and Sentiment Analysis:
Word embeddings provide rich semantic representations of words, which can enhance the performance of text classification and sentiment analysis tasks. By converting text inputs into dense vectors, classifiers and sentiment analysis models can better capture the underlying semantics and context of the text, leading to improved accuracy and generalization.
2- Named Entity Recognition (NER):
NER systems aim to identify and classify entities such as person names, locations, organizations, and dates in text. Word embeddings help NER models by capturing contextual information about words, allowing them to recognize entities even when they appear in different contexts or have variations in spelling.
3- Machine Translation:
Word embeddings play a crucial role in machine translation systems, where they facilitate the mapping of words and phrases from one language to another. By learning embeddings in both the source and target languages, translation models can better capture semantic similarities and improve translation quality.
4- Information Retrieval and Document Similarity:
In tasks such as document retrieval and similarity measurement, word embeddings enable quantifying the semantic similarity between documents or passages. By representing documents as vectors of word embeddings, similarity measures like cosine similarity or Euclidean distance can be used to rank documents and retrieve relevant information.
5- Recommendation Systems:
Word embeddings are used in recommendation systems to model user preferences and item characteristics. By embedding user profiles and item descriptions into a common vector space, recommendation algorithms can identify relevant items based on their semantic similarity to the user’s preferences.