Word2Vec: NLP with Contextual Understanding
In the quest to make machines understand human language, traditional methods such as one-hot encoding have proven inadequate. These methods, which encode words as sparse vectors with a single ‘1’ and numerous ‘0s,’ fail to capture the inherent complexities and relationships within language. Words lose their contextual essence, and the sheer volume of data becomes overwhelming for models to process efficiently. The need for Word2Vec models and word embeddings arises from this limitation. In NLP, words are not isolated entities; they derive meaning from their context and relationships with other words. Traditional methods struggle to preserve these nuances, limiting the capability of models to understand human language.
To grasp the significance of embeddings, let’s consider two examples:
- He offered his friend a cup of coffee.
- He offered his friend a cup of tea.
These sentences are quite similar since “coffee” and “tea” are both beverages. However, if we represent the words as independent entities using one-hot vectors, the similarity between “coffee” and “tea” would be zero. This is the challenge that word embeddings aim to address: accurately capturing the meanings of words and their relationships.
Covered in this article:
- What are Word Embeddings?
- How to create Word Embeddings?
- Word2Vec Models
- Skip-Gram and the Negative Sampling method
- GloVe Model
What are Word Embeddings?
Word Embeddings represent a paradigm shift in how machines encode and understand words. Unlike traditional methods that treat words as isolated units, Word Embeddings capture the essence of words by assigning them dense vectors in a continuous space. The key idea is to embed words in a multi-dimensional space where their proximity reflects semantic similarity.
Word Embeddings capture not only the syntactic relationships but also the semantic relationships between words. For instance, in a well-trained Word2Vec model, the vectors representing ‘king’ and ‘queen’ would be closer to each other than those representing ‘king’ and ‘apple,’ reflecting the semantic relationship between royalty terms.
Advantages of Word Embeddings
The superiority of Word Embeddings lies in their ability to encapsulate rich semantic information and contextual nuances.
1. Semantic Similarity: Word Embeddings excel at capturing semantic relationships between words. Similar words are embedded closer together in the vector space, allowing models to understand and leverage semantic connections more effectively.
2. Contextual Information: Unlike one-hot encoding, which treats each word in isolation, Word Embeddings consider the context in which words appear. They capture the meaning of a word based on its surroundings, allowing models to grasp the contextual significance of words.
3. Generalization: Word Embeddings generalize well to unseen words or contexts. The model learns to infer similarities and relationships from the contexts in which words appear, enabling it to make educated guesses about words it has not encountered during training.
4. Dimensionality Reduction: Traditional methods often result in high-dimensional, sparse vectors. Word Embeddings, on the other hand, represent words in a lower-dimensional, dense space, reducing the computational complexity while retaining meaningful information.
Example of Word Embedding
Imagine we have a Word Embedding model that represents words in a three-dimensional space; in practice, word vectors usually have far more dimensions.
Suppose we have the following word vectors in this three-dimensional space:
car = [0.8, 0.8, 0.7]
bus = [0.75, 0.7, 0.8]
tree = [0.25, 0.01, 0.4]
Now, let’s interpret what the dimensions of these vectors could mean:
- Dimension 1: Represents size or scale.
- Dimension 2: Represents movement speed.
- Dimension 3: Represents color.
It’s important to note that the specific meanings of these dimensions are learned during the training of the word embeddings model and may not have explicit human-interpretable labels.
Now, let’s calculate the cosine similarity between pairs of these words:
- cosine_similarity(car, bus) ≈ 0.99
- cosine_similarity(car, tree) ≈ 0.78
- cosine_similarity(bus, tree) ≈ 0.84
The high cosine similarity (approximately 0.99) between “car” and “bus” indicates that these words are very similar in meaning, sharing characteristics associated with modes of transportation. The noticeably lower cosine similarities (approximately 0.78 and 0.84) between “tree” and each of “car” and “bus” show that “tree” is less similar to both vehicles, aligning with their disparate meanings.
In the traditional one-hot vector methods, the similarity score between “car” and “bus” would be zero, as such methods represent each word as an independent entity without capturing semantic relationships.
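To see these numbers come out of actual code, here is a minimal sketch using only NumPy; the toy vectors are the ones defined above.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

car  = np.array([0.80, 0.80, 0.70])
bus  = np.array([0.75, 0.70, 0.80])
tree = np.array([0.25, 0.01, 0.40])

print(f"car vs bus : {cosine_similarity(car, bus):.2f}")   # ~0.99
print(f"car vs tree: {cosine_similarity(car, tree):.2f}")  # ~0.78
print(f"bus vs tree: {cosine_similarity(bus, tree):.2f}")  # ~0.84
```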
Word2Vec Models
Word2Vec is a popular word embedding technique that transforms words into dense vectors, capturing semantic relationships between them. The two main architectures used by Word2Vec are Continuous Bag of Words (CBOW) and Skip-Gram.
Continuous Bag of Words (CBOW):
In the CBOW architecture, the model predicts the target word (central word) based on the context words (surrounding words). The input to the model is a context window of words, and the output is the target word. The objective is to maximize the probability of predicting the target word given its context.
Skip-Gram:
In the Skip-Gram architecture, the model predicts the context words given a target word. Unlike CBOW, the input is a target word, and the output is a set of context words. The objective is to maximize the probability of predicting context words given the target word.
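If you want to experiment with both architectures before diving into the details, the gensim library exposes them through a single class. This is only a quick sketch: the toy corpus and hyperparameter values are arbitrary, and the parameter names follow gensim 4.x, where `sg=0` selects CBOW and `sg=1` selects Skip-Gram.

```python
from gensim.models import Word2Vec

# A tiny toy corpus: each sentence is a list of tokens.
sentences = [
    ["he", "offered", "his", "friend", "a", "cup", "of", "coffee"],
    ["he", "offered", "his", "friend", "a", "cup", "of", "tea"],
    ["she", "drank", "a", "cup", "of", "tea", "in", "the", "morning"],
]

# sg=0 -> CBOW: predict the center word from its context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the context words from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["coffee"][:5])           # first five dimensions of the embedding
print(skipgram.wv.most_similar("coffee"))  # nearest neighbours in the toy space
```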
How Do Word2Vec Models Learn Word Embeddings?
Word2Vec models employ a clever strategy known as the “fake task trick” to learn distributed representations of words in a continuous vector space. In simpler terms, this technique involves training a neural network for a specific task, but the primary objective is not to perform that task; rather, the focus is on capturing semantic relationships between words through the learned weights in the hidden layer.
Word2Vec Skip-Gram Workflow
1. Context Extraction:
A fixed-size context window is moved through the training data, extracting context words around each target word.
- Context: Context refers to the words surrounding a target word. The context provides the model with information about the word’s meaning based on its usage in different contexts.
- Window Size Parameter: The window size determines the number of words considered as context on each side of the target word. A larger window captures a broader context, while a smaller window focuses on more immediate surroundings. Adjusting the window size impacts the level of granularity in semantic relationships captured by the embeddings.
2. Training Objective:
- The model is trained to predict the context words based on the target word.
- The objective is to maximize the probability of the true context words given the target word.
3. Output Layer:
- Represents the predicted probabilities of each word in the vocabulary being a context word.
- Utilizes a softmax activation function to convert raw scores into probabilities.
4. Word Embeddings:
- The weights in the hidden layer, which act as word embeddings, are extracted for each word in the vocabulary.
- The resulting word embeddings of the hidden layer capture semantic relationships, allowing words with similar meanings to have similar vector representations.
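To make step 1 of this workflow concrete, here is a small sketch in plain Python (the helper name `generate_pairs` is just for illustration) that slides a window over a tokenized sentence and emits the (target, context) pairs Skip-Gram trains on.

```python
def generate_pairs(tokens, window_size=2):
    """Slide a fixed-size window over the tokens and collect (target, context) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window_size` words on each side of the target word.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["he", "offered", "his", "friend", "a", "cup", "of", "tea"]
for target, context in generate_pairs(tokens, window_size=2)[:6]:
    print(target, "->", context)
```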
Word2Vec Skip-Gram Model Architecture:
The architecture comprises a shallow neural network with three layers:
| Layer | Description |
| --- | --- |
| Input Layer | Represents the target word as a one-hot vector. |
| Hidden Layer | Acts as an embedding lookup layer, transforming the input target word into a continuous vector representation. The weights learned in this layer capture semantic relationships between words. |
| Output Layer | Represents the predicted probabilities of each word in the vocabulary being a context word. |
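The table maps directly onto two weight matrices. The sketch below (NumPy, with a made-up vocabulary size and embedding dimension) shows the shapes involved and how the one-hot input reduces the hidden layer to a simple row lookup.

```python
import numpy as np

V, E = 10, 4                       # toy vocabulary size and embedding dimension
W = np.random.randn(V, E) * 0.01   # input -> hidden weights: one E-dimensional embedding per word
U = np.random.randn(E, V) * 0.01   # hidden -> output weights: one output vector per word

target_idx = 3
x = np.zeros(V)
x[target_idx] = 1.0                # one-hot input for the target word

h = x @ W                            # hidden layer: identical to W[target_idx], i.e. an embedding lookup
z = h @ U                            # output layer: one raw score per vocabulary word
y_hat = np.exp(z) / np.exp(z).sum()  # softmax turns the scores into probabilities

print(h.shape, z.shape, round(y_hat.sum(), 6))  # (4,) (10,) 1.0
```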
Learning Word Embeddings with Skip-Gram
Word2Vec is an iterative model: it refines its understanding of word relationships and meanings through a series of updates, progressively enhancing the quality of the word embeddings over multiple training iterations.
Training Steps:
1- The first step is tokenization and building the vocabulary. You can read more about tokenization techniques in the Tokenization article.
2- Then, generate the one-hot word vector for the input center word.
3- Next, we get the embedded word vector $v$ for the center word.
- Multiplying the one-hot vector by the weight matrix $W$ essentially acts as a lookup, retrieving the row of the matrix that corresponds to the position of the ‘1’ in the vector.
4- After getting our embedding, generate a score vector $z = Uv$ (where $U$ is the second weight matrix, containing an embedding for each word in our vocabulary).
5- Then, turn the scores into probabilities using softmax: $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$.
6- Finally, define the objective function and minimize the loss; we aim to make the resulting probabilities for the true context words as high as possible given the center word.
Objective function
For a target (center) word $w_t$, the Skip-Gram model defines the probability of a context word as:

$$P(c_{t,j} \mid w_t) = \frac{\exp\left(v_{c_{t,j}} \cdot v_{w_t}\right)}{\sum_{i=1}^{|V|} \exp\left(v_i \cdot v_{w_t}\right)} \qquad (1)$$

- $P(c_{t,j} \mid w_t)$: the conditional probability of observing a context word $c_{t,j}$ given the target word $w_t$. In the context of Skip-Gram, it measures the likelihood of encountering the context word $c_{t,j}$ around the target word $w_t$.
- $\exp\left(v_{c_{t,j}} \cdot v_{w_t}\right)$: the exponential of the dot product between the vector representation $v_{c_{t,j}}$ of the context word $c_{t,j}$ and the vector representation $v_{w_t}$ of the target word $w_t$. It represents the similarity or compatibility between the target word and the context word.
- $\sum_{i=1}^{|V|} \exp\left(v_i \cdot v_{w_t}\right)$: the denominator, which sums the exponentials of the dot products between the vector representations $v_i$ of all words in the vocabulary and the vector representation $v_{w_t}$ of the target word. It serves as a normalization factor, ensuring that the probabilities sum to 1 over all possible context words.
This iterative refinement is repeated for a specified number of iterations, each iteration contributing to the progressive enhancement of the Word2Vec model’s comprehension of word relationships and semantic nuances.
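Continuing the NumPy sketch from the architecture section, one iterative update for a single (center, context) pair boils down to a gradient step on the cross-entropy loss $-\log \hat{y}_{\text{context}}$; the context index and learning rate below are arbitrary.

```python
# Continues the forward-pass sketch above (W, U, x, h, z, y_hat, target_idx).
context_idx = 7
loss = -np.log(y_hat[context_idx])   # -log P(context | center), i.e. equation (1) as a loss

y_true = np.zeros(V)
y_true[context_idx] = 1.0
dz = y_hat - y_true                  # gradient of the loss with respect to the scores z

grad_U = np.outer(h, dz)             # shape (E, V): the full softmax touches every output vector
grad_h = U @ dz                      # gradient flowing back into the hidden layer

lr = 0.05
U -= lr * grad_U
W[target_idx] -= lr * grad_h         # only the center word's input embedding changes

print(f"loss for this pair: {loss:.3f}")
```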
Computational Challenges:
The Skip-Gram model involves a large vocabulary, making the computation of the softmax function infeasible due to its computational complexity. Specifically, the softmax function requires evaluating an exponential for each word in the vocabulary, leading to high time and memory requirements. This challenge becomes more pronounced as the vocabulary size increases.
Let’s consider an example:
Consider a scenario where the vocabulary size $V$ is 10,000 and the embedding dimension $E$ is 300. The output layer computes the dot product between the target word’s vector and the output vector of every word in the vocabulary, producing a similarity score for each.
A softmax function is applied during each training iteration to convert these similarity scores into a probability distribution; this process involves $V$ multiplications and $V - 1$ additions for each training instance. Additionally, the $10{,}000 \times 300 = 3{,}000{,}000$ values in the embedding matrix must be updated during backpropagation at every iteration.
This computational demand escalates rapidly with larger vocabulary sizes, higher embedding dimensions, and an increased number of iterations, emphasizing the scaling challenges associated with these factors.
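A quick back-of-the-envelope check of the numbers used above:

```python
V, E = 10_000, 300

softmax_products_per_instance = V   # one multiplication per vocabulary word when normalizing the scores
embedding_values_to_update = V * E  # parameters of the output embedding matrix touched by backpropagation

print(softmax_products_per_instance)  # 10000
print(embedding_values_to_update)     # 3000000
```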
To address the computational challenges, negative sampling was introduced as an alternative to the traditional softmax function. The idea is to simplify the objective function by transforming it into a binary classification problem.
Negative Sampling
Negative sampling transforms the objective function into a binary classification problem. Instead of predicting the correct context word from the entire vocabulary, the model learns to distinguish between true context words and randomly sampled negative words.
For each positive training instance, we randomly sample $K$ words from the vocabulary to act as negative examples. These negative examples are words not present in the context of the positive example.
Now, instead of updating the whole vocabulary at each iteration, only the positive sample and the $K$ negative samples are updated.
The objective to be maximized becomes:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] \qquad (2)$$

- $\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)$: computes the log-sigmoid of the dot product between the output vector representation $v'_{w_O}$ of the observed context word $w_O$ and the vector representation $v_{w_I}$ of the input word $w_I$. This term measures how well the model predicts the presence of the context word given the input word.
- $\sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$: a sum over the $K$ negative samples. For each negative sample $w_i$ drawn from the noise distribution $P_n(w)$, it computes the expected value of the log-sigmoid of the negated dot product $-{v'_{w_i}}^{\top} v_{w_I}$. This term measures how well the model predicts the absence of the negative samples given the input word.
- $P_n(w)$ represents the noise distribution used to sample negative words during training.
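Here is a minimal sketch of this objective in NumPy (the dimensions, the random vectors, and the convention of returning the negated objective as a loss are all illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_input, v_pos, v_negs):
    """Negated form of equation (2), so that lower values are better.
    v_input: embedding of the input (center) word
    v_pos:   output vector of the observed context word
    v_negs:  output vectors of the K sampled negative words, shape (K, E)"""
    positive_term = np.log(sigmoid(np.dot(v_pos, v_input)))
    negative_term = np.sum(np.log(sigmoid(-v_negs @ v_input)))
    return -(positive_term + negative_term)

E, K = 300, 5
rng = np.random.default_rng(0)
v_input = rng.normal(scale=0.1, size=E)
v_pos = rng.normal(scale=0.1, size=E)
v_negs = rng.normal(scale=0.1, size=(K, E))

print(f"loss: {negative_sampling_loss(v_input, v_pos, v_negs):.4f}")
```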
The noise distribution itself is defined as:

$$P_n(w) = \frac{U(w)^{3/4}}{Z} \qquad (3)$$

- $U(w)$ represents the unigram count of the word $w$, i.e., the number of times $w$ appears in the corpus.
- The exponent $3/4$ is used to smooth out the distribution.
- $Z$ is the normalization constant.
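Equation (3) is straightforward to implement; the sketch below builds the smoothed unigram distribution from raw counts and draws K negatives from it (the `counts` dictionary is a made-up stand-in for counts collected from a real corpus).

```python
import numpy as np

# Hypothetical unigram counts U(w) collected from a corpus.
counts = {"the": 5000, "cup": 120, "coffee": 80, "tea": 75, "offered": 40}
words = list(counts)

unigram = np.array([counts[w] for w in words], dtype=float)
smoothed = unigram ** 0.75               # U(w)^(3/4)
noise_dist = smoothed / smoothed.sum()   # divide by Z so the probabilities sum to 1

rng = np.random.default_rng(42)
K = 5
negatives = rng.choice(words, size=K, p=noise_dist)  # sample K negative words

print(dict(zip(words, noise_dist.round(3))))
print("sampled negatives:", list(negatives))
```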
Benefits of Negative Sampling:
- Computational Efficiency: Negative sampling reduces the computational cost by sampling a small number of negative examples rather than considering the entire vocabulary.
- Training Speed: The model converges faster because it focuses on a binary classification task, making each training iteration computationally lighter.
- Memory Efficiency: Negative sampling reduces memory requirements, allowing the model to scale well with larger vocabularies.
Transitioning from Skip-gram to GloVe marks a shift in the methodology of word embedding generation within the realm of Natural Language Processing (NLP). While Skip-gram, a popular neural network-based approach, focuses on predicting context words given a target word, GloVe (Global Vectors for Word Representation) takes a different path.
GloVe leverages co-occurrence statistics in a corpus to generate word embeddings. Unlike Skip-gram, which relies on training neural networks to predict word-context pairs, GloVe aims to directly capture the statistical relationships between words through the construction of a co-occurrence matrix. This matrix reflects the frequency of word co-occurrences across the corpus, offering a comprehensive snapshot of word interactions.
GloVe
GloVe, or Global Vectors for Word Representation, is a word embedding technique designed to capture the global statistical information of word co-occurrences within a corpus. Unlike some other methods that focus on local contexts, such as Skip-Gram or Continuous Bag of Words (CBOW), GloVe emphasizes the overall distributional patterns of words in a corpus.
It operates on the principle that words with similar meanings tend to co-occur frequently in various contexts. By encoding these global statistical relationships, GloVe aims to produce word vectors that reflect semantic similarities more accurately.
The key idea behind GloVe is to construct a co-occurrence matrix based on word pairs and their respective frequencies. The model is then trained to learn word vectors by minimizing the difference between the dot product of these vectors and the logarithm of the observed co-occurrence probabilities. This objective effectively captures the underlying semantic structure of words in the corpus.
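To make the co-occurrence idea concrete, the sketch below builds a small symmetric co-occurrence matrix with a fixed window; the tiny corpus and window size are arbitrary choices for illustration.

```python
import numpy as np

corpus = [
    ["he", "offered", "his", "friend", "a", "cup", "of", "coffee"],
    ["he", "offered", "his", "friend", "a", "cup", "of", "tea"],
]
window = 2

# Build the vocabulary and assign an index to each word.
vocab = sorted({w for sentence in corpus for w in sentence})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word i and word j appear within `window` positions of each other.
X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[idx[word], idx[sentence[j]]] += 1

print(X[idx["cup"], idx["coffee"]])  # how often "cup" and "coffee" co-occur
```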
Training of GloVe
1- Initialization:
- Randomly initialize the word vectors $v_i$ (along with the context vectors and bias terms) for each word in the vocabulary.
- The embedding dimension in GloVe can be varied based on the specific requirements and characteristics of the dataset.
2- Co-occurrence Matrix Construction:
- Based on the chosen window size, construct the co-occurrence matrix $X$.
- The co-occurrence matrix represents the observed frequency of word co-occurrences within a given context window.
- Each entry $X_{ij}$ in the matrix indicates how often word $i$ and word $j$ appear together within the chosen context window.
3- Objective Function:
The objective function aims to minimize the difference between the model’s predictions and the logarithm of the observed co-occurrence probabilities in the corpus.
$$J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( v_i^{\top} u_j + b_i + b_j - \log X_{ij} \right)^2 \qquad (4)$$

where:
- $V$ is the vocabulary size.
- $X_{ij}$ is the co-occurrence count for words $i$ and $j$.
- $f(X_{ij})$ is a weighting function applied to the co-occurrence count.
- $v_i$ and $u_j$ are the word vectors for words $i$ and $j$, respectively.
- $b_i$ and $b_j$ are bias terms for words $i$ and $j$.
The purpose of the weighting function is to assign little importance to rare co-occurrences and to cap the influence of very frequent ones.
4- Training Iterations:
- Iterate over the non-zero entries of the co-occurrence matrix.
- For each pair of words $i$ and $j$:
  - Calculate the components of the objective function $J_{ij}$.
  - Compute the gradients of the objective function with respect to the word vectors (and bias terms).
  - Update the word vectors using the gradients and a learning rate (a minimal training sketch follows this list).
5- Convergence Check:
- Convergence ensures that the model has learned meaningful representations of word co-occurrences in the corpus.
6- Final Word Vectors:
- The resulting word vectors $v_i$ represent the learned embeddings that capture semantic relationships between words based on their co-occurrence patterns.
- These word vectors can be used for downstream natural language processing tasks such as sentiment analysis, machine translation, and text generation.
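Putting the training steps together, here is a compact sketch of the update loop. It assumes the values $x_{\max} = 100$ and $\alpha = 0.75$ commonly used for the weighting function; everything else (the random co-occurrence counts, the learning rate, the number of epochs) is an illustrative placeholder rather than a faithful reimplementation of GloVe.

```python
import numpy as np

def weight(x, x_max=100, alpha=0.75):
    # Weighting function f(X_ij): small weight for rare pairs, capped at 1 for frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_epoch(X, v, u, b, c, lr=0.05):
    """One epoch of stochastic updates on equation (4). v, u: word/context vectors; b, c: biases."""
    total = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        diff = v[i] @ u[j] + b[i] + c[j] - np.log(X[i, j])
        w = weight(X[i, j])
        total += w * diff ** 2
        grad = 2.0 * w * diff
        vi, uj = v[i].copy(), u[j].copy()   # keep pre-update copies for symmetric gradients
        v[i] -= lr * grad * uj
        u[j] -= lr * grad * vi
        b[i] -= lr * grad
        c[j] -= lr * grad
    return total

V, E = 8, 10
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(V, V)).astype(float)   # stand-in co-occurrence counts
v = rng.normal(scale=0.1, size=(V, E))
u = rng.normal(scale=0.1, size=(V, E))
b, c = np.zeros(V), np.zeros(V)

for epoch in range(10):
    loss = glove_epoch(X, v, u, b, c)
print(f"final loss: {loss:.3f}")
```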
Applications of Word Embedding
Word embeddings, such as those generated by Word2Vec, GloVe, and similar techniques, have found wide-ranging applications across various domains in natural language processing (NLP) and machine learning.
1- Text Classification and Sentiment Analysis:
Word embeddings provide rich semantic representations of words, which can enhance the performance of text classification and sentiment analysis tasks. By converting text inputs into dense vectors, classifiers and sentiment analysis models can better capture the underlying semantics and context of the text, leading to improved accuracy and generalization.
2- Named Entity Recognition (NER):
NER systems aim to identify and classify entities such as person names, locations, organizations, and dates in text. Word embeddings help NER models by capturing contextual information about words, allowing them to recognize entities even when they appear in different contexts or have variations in spelling.
3- Machine Translation:
Word embeddings play a crucial role in machine translation systems, where they facilitate the mapping of words and phrases from one language to another. By learning embeddings in both the source and target languages, translation models can better capture semantic similarities and improve translation quality.
4- Information Retrieval and Document Similarity:
In tasks such as document retrieval and similarity measurement, word embeddings enable quantifying the semantic similarity between documents or passages. By representing documents as vectors of word embeddings, similarity measures like cosine similarity or Euclidean distance can be used to rank documents and retrieve relevant information.
5- Recommendation Systems:
Word embeddings are used in recommendation systems to model user preferences and item characteristics. By embedding user profiles and item descriptions into a common vector space, recommendation algorithms can identify relevant items based on their semantic similarity to the user’s preferences.