Understanding Bag of Words Models
In the context of natural language processing (NLP) and neural networks, the initial challenge was to find reliable ways to represent words as input for these models. One of the earliest approaches was using one-hot vectors, where each word in the vocabulary is represented as a vector of zeros, except for the element corresponding to the word’s index, which is set to 1. Other approaches, like Bag of Words (BoW) models, treat each document in the corpus as a single sparse vector whose length equals the vocabulary size and whose entries indicate the presence of a word in the document and its frequency.
However, these approaches had significant limitations as they treated each word as an independent entity, overlooking the rich semantic relationships between words.
In this article, we’ll go over:
- Understanding BoW models
- Workflow and the Core Mechanisms of BoW
- The importance of cleaning and preprocessing our data
- The BoW model vs. CBoW
- Decoding the Essence of Word Representations
- Workflow and the Core Mechanisms of CBoW
- Application and Limitations of CBoW
Bag of Words (BoW)
The Bag of Words (BoW) model is a fundamental and simplistic representation technique in Natural Language Processing (NLP). The BoW model represents a document as an unordered set or “bag” of its words, disregarding grammar and word order but considering the frequency of each word.
The resulting representation is a vector where each element corresponds to a unique word in the vocabulary, and the value in each element reflects the frequency of that word in the document.
The basic idea behind Bag of Words is to simplify the complexity of language by converting a document into an unordered set of words, removing any information about the order in which the words appear and the grammatical structure of the sentences.
Convert text into a Bag of Words (BoW)
The Bag-of-Words (BoW) model transforms text into a numerical vector by representing each document or piece of text as a fixed-length vector that captures the frequency of words in a predefined vocabulary. You can read more about tokenization and vocabulary building in the Tokenization article.
Step 1: Tokenization
The first step is to break down the text into individual words or tokens.
Sentence: “The quick brown fox jumps over the lazy dog.”
Resulting Tokens: [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog.”]
Step 2: Vocabulary Construction
Build a vocabulary by collecting unique words from the entire corpus. Each unique word becomes a feature in the BoW vector and has its unique index.
Step 3: Vectorization
Create a numerical vector based on the vocabulary for each document (sentence). The vector elements represent the frequency of each word in the document.
Example:
Sentence: “The brown fox jumps over the lazy dog”
Vocabulary: {“the”: 0, “brown”: 1, “fox”: 2, “jumps”: 3, “over”: 4, “lazy”: 5, “dog”: 6, “oov”: 7}
Vector: [2, 1, 1, 1, 1, 1, 1, 0]
Explanation:
| Word | Appearance count | Index |
|---|---|---|
| the | 2 | 0 |
| brown | 1 | 1 |
| fox | 1 | 2 |
| jumps | 1 | 3 |
| over | 1 | 4 |
| lazy | 1 | 5 |
| dog | 1 | 6 |
| oov | 0 | 7 |
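To make the vectorization step concrete, here is a minimal, dependency-free sketch that reproduces the example vector above. The vocab dictionary (including the "oov" entry) is the hypothetical one from the example; in practice a library class such as scikit-learn’s CountVectorizer does the same job.

from collections import Counter

# Hypothetical vocabulary from the example above; "oov" is a catch-all index
# for out-of-vocabulary words.
vocab = {"the": 0, "brown": 1, "fox": 2, "jumps": 3,
         "over": 4, "lazy": 5, "dog": 6, "oov": 7}

def bow_vector(sentence, vocab):
    # naive whitespace tokenization plus lowercasing (more on this below)
    tokens = sentence.lower().split()
    counts = Counter(vocab.get(token, vocab["oov"]) for token in tokens)
    return [counts.get(i, 0) for i in range(len(vocab))]

print(bow_vector("The brown fox jumps over the lazy dog", vocab))
# [2, 1, 1, 1, 1, 1, 1, 0]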
But why are the token “The” (with a capital T) and the token “the” (with a lowercase t) mapped to the same entry in the vocabulary?
Text Preprocessing Importance
Text preprocessing is a critical step in the Bag-of-Words (BoW) model, contributing significantly to the quality of the representation and the overall performance of natural language processing (NLP) tasks. To understand the importance of text preprocessing in BoW models, let’s look at a few key aspects:
1. Noise Reduction:
Description: Text data often contains noise, such as irrelevant characters, symbols, and special characters, which can negatively impact the model’s accuracy.
Solution: Removing punctuation and other non-alphanumeric characters helps eliminate noise, ensuring that the model focuses on meaningful words.
2. Stop Word Removal:
Description: Stop words are common words (e.g., “the,” “is,” “and”) that add little semantic value. Including them in the BoW representation enlarges the vector without adding much meaningful signal.
Solution: Removing stop words reduces dimensionality, allowing the model to concentrate on terms that are more informative and improving the efficiency of the BoW representation.
3. Lowercasing:
Description: Text data may contain words in different cases, and treating them as distinct can lead to redundancy and a larger vocabulary size.
Solution: Converting all words to lowercase ensures consistency, preventing the model from treating “Word” and “word” as separate entities and improving the efficiency of the BoW model.
4. Stemming and Lemmatization:
Description: Words in different grammatical forms (e.g., “run,” “running,” “ran”) convey similar meanings. Treating them as separate words can result in a larger vocabulary and increased sparsity.
Stemming: Reducing words to their root or base form (e.g., “running” to “run”) helps consolidate related terms, reducing dimensionality. The process involves stripping words of their affixes, such as prefixes or suffixes, to create a common representation for variations of a word.
Lemmatization: Similar to stemming but involves mapping words to their dictionary form, considering grammatical differences (e.g., “ran” to “run”). This offers more accurate representations but is computationally more intensive. Unlike stemming, lemmatization ensures that the resulting forms are valid words found in a dictionary.
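The sketch below pulls these four preprocessing steps together. It assumes NLTK is available (its English stop word list, PorterStemmer, and WordNetLemmatizer); any comparable library or hand-written rules would work just as well.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                          # 3. lowercasing
    text = re.sub(r'[^a-z0-9\s]', ' ', text)                     # 1. noise reduction: drop punctuation/symbols
    tokens = [t for t in text.split() if t not in stop_words]    # 2. stop word removal
    stems = [stemmer.stem(t) for t in tokens]                    # 4a. stemming
    lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]  # 4b. lemmatization (treating tokens as verbs)
    return stems, lemmas

print(preprocess("The quick brown fox is running over the lazy dogs!"))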
Sparse Representation in Bag of Words (BoW):
In a typical corpus, only a small subset of the entire vocabulary appears in any given document. As a result, the BoW vector for a document is predominantly filled with zeros, and only a few dimensions have non-zero values, representing the frequency of the words that occur in that specific document.
Advantages of sparsity
Computational Complexity:
- Sparse representations reduce the computational complexity of vector operations. Since most elements in a BoW vector are zero, operations involving these zeros can often be skipped, leading to computational efficiency.
- This is particularly advantageous when dealing with large datasets or high-dimensional feature spaces, as calculations can be expedited by focusing only on the non-zero elements.
Memory Usage:
- Sparse representations are memory-efficient. Storing dense vectors for every document in a large corpus can be resource-intensive, but sparse representations store only the non-zero values, significantly reducing memory requirements.
- This is crucial for applications where memory constraints are a concern, such as when processing large text datasets.
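As a rough illustration (the corpus sizes here are made up), storing a document-term matrix with SciPy’s csr_matrix keeps only the non-zero entries, so the sparse form takes a small fraction of the dense form’s memory:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical corpus: 1,000 documents over a 20,000-word vocabulary,
# where each document contains only ~50 distinct words.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 20000), dtype=np.float32)
for row in dense:
    row[rng.choice(20000, size=50, replace=False)] = 1.0

sparse = csr_matrix(dense)

print("dense :", dense.nbytes, "bytes")   # every cell stored, mostly zeros
print("sparse:", sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes, "bytes")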
Challenges of Bag of Words:
1- Loss of Semantic Meaning:
BoW models do not consider the order of words or their semantic relationships, leading to a loss of nuanced meaning. Phrases and idioms may not be captured accurately.
2- Inability to Capture Word Order:
Since BoW models treat documents as unordered sets of words, they cannot capture the sequential or contextual information inherent in language.
3- Impact of Document Length:
BoW models can be sensitive to document length variations. Longer documents may have higher word counts, potentially influencing the model’s representation and making comparisons challenging.
4- Curse of Dimensionality:
In high-dimensional spaces, where the vocabulary is extensive, the sparsity of BoW vectors can lead to the “curse of dimensionality,” affecting the efficiency of certain algorithms.
Mitigations and Considerations:
1- N-grams and Phrases:
Using n-grams (sequences of adjacent words) or considering phrases can partially address the issue of word order and dimensionality.
2- TF-IDF (Term Frequency-Inverse Document Frequency):
Incorporating TF-IDF weighting can help mitigate the impact of common words, put more emphasis on discriminative terms, and reduce the effect of variable document lengths (see the sketch after this list).
3- Word Embeddings: word2vec models
To address this limitation, more sophisticated word representation models like Skip-Gram and Continuous Bag of Words (CBOW) were introduced. These models aim to capture the semantic information and relationships among words, providing a more nuanced representation.
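Here is a brief scikit-learn sketch of the first two mitigations; the two-sentence corpus is made up for illustration. TfidfVectorizer applies TF-IDF weighting, and ngram_range=(1, 2) adds bigrams alongside single words.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
]

# unigrams + bigrams, weighted by TF-IDF instead of raw counts
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))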
Similar but different BoW models
While the terms “Bag of Words” and “Continuous Bag of Words” share the “Bag of Words” part, they refer to different concepts and models within NLP. The use of “Bag of Words” in both names may reflect a common theme in NLP, where words are treated as fundamental units of analysis, but the addition of “Continuous” in CBOW highlights the specific approach it takes concerning context and dense vector representation (embedding) for words.
BoW is a simple, frequency-based model that treats a document as an unordered set of words, discarding word order and context. It is often used for tasks where word frequency is important but the word order is not critical.
- BoW represents a document as an unordered set of words, disregarding grammar and word order.
- BoW creates a vector where each dimension corresponds to a unique word in the vocabulary, and the value represents the frequency of that word in the document.
- BoW does not capture the context or sequence of words within a document.
- BoW treats each word in isolation and independently of its neighboring words.
CBOW is a word embedding model in natural language processing that predicts a target word based on its context words. It operates by summing or averaging the word vectors of context words to generate the target word’s embedding. CBOW is known for its simplicity and efficiency in capturing semantic relationships within a given context.
- CBOW is a neural network-based model that learns to predict a target word based on its context words.
- CBoW considers a fixed-size context window around the target word and uses the surrounding words to predict the target word.
- CBoW captures the context and surrounding words.
- CBoW is designed to understand the distributional semantics of words and their relationships within a local context.
CBoW Word Embedding
Word2Vec models employ a clever strategy known as the “fake task trick” to learn distributed representations of words in a continuous vector space. In simpler terms, this technique involves training a neural network for a specific task, but the primary objective is not to perform that task; rather, the focus is on capturing semantic relationships between words through the learned weights in the hidden layer.
Read more about Word2Vec models in Word2Vec Article
Continuous Bag of Words (CBOW) utilizes the fake task trick to achieve word embedding, where each word is represented by a dense vector capturing its semantic meaning. In the case of CBOW, the fake task involves predicting the center word of a context window based on the surrounding words. The true purpose lies in extracting the weights learned during this task to represent words effectively.
How it works
- Context Extraction:
- A fixed-size context window is moved through the training data, extracting context words around each target word.
- Training Objective:
- The model is trained to predict the target word based on its context.
- The objective is to maximize the probability of the true target word given its context.
- Word Embeddings:
- The weights in the hidden layer, which act as word embeddings, are extracted for each word in the vocabulary.
- Semantic Relationships:
- The resulting word embeddings capture semantic relationships, allowing words with similar meanings to have similar vector representations.
By implementing the fake task trick and focusing on the contextual prediction of words, CBOW efficiently learns word embeddings that encapsulate semantic relationships.
CBOW Architecture
The Continuous Bag of Words (CBOW) model architecture is designed to learn contextual word embeddings by predicting a target word from its surrounding context words (the fake task). The architecture comprises the following components:
Input Layer:
- Represents the context words within a fixed-size window around the target word.
Hidden Layer:
- Acts as an embedding lookup layer, transforming the input context into a continuous vector representation.
- The weights learned in this layer capture semantic relationships between words.
Output Layer:
- Represents the predicted probabilities of each word in the vocabulary being the target word.
- Utilizes a softmax activation function to convert raw scores into probabilities.
Workflow of CBOW
1- We generate our one-hot word vectors for the input context tokens.
2- We get our embedded word vectors for the context words.
Multiplying a one-hot vector by a weight matrix essentially acts as a lookup, retrieving the row of the matrix that corresponds to the position of the ‘1’ in the vector.
3- Average these vectors to get an embedding vector “v” representing context tokens.
4- Generate a score vector z = Uv, where U is the second weight matrix. Because the dot product of similar vectors is higher, training pushes similar words closer together so that they achieve higher scores.
5- Turn the scores into probabilities using Softmax. y^ = Softmax(z)
6- Define the objective function and minimize the loss
(1)  J = −log P(w_c ∣ v) = −u_c⊤ v + log Σ_{j=1}^{∣V∣} exp(u_j⊤ v)

where:
- u_c represents the vector representation of the target word c (the row of U corresponding to the target word).
- v denotes the average vector representation of the context words.
- ∣V∣ is the vocabulary size.
- u_j represents the vector representation of the j-th word in the vocabulary.
- The summation term computes the exponential of the dot product between the vector representation of each word in the vocabulary and the average context vector.
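A minimal NumPy sketch of this forward pass, using made-up dimensions and random weights, ties the steps above together:

import numpy as np

# Toy sizes: |V| = 8 words, embedding dimension d = 4 (made-up numbers)
V, d = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))    # input embedding matrix (hidden-layer weights)
U = rng.normal(size=(V, d))    # output weight matrix

context_ids = [1, 2, 4, 5]     # indices of the context words
target_id = 3                  # index of the center (target) word

# steps 1-3: look up the context embeddings and average them into v
v = W[context_ids].mean(axis=0)

# step 4: score vector z = Uv
z = U @ v

# step 5: softmax turns the scores into probabilities y_hat
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# step 6: cross-entropy loss for the true target word (Equation 1)
loss = -np.log(y_hat[target_id])
print(y_hat.round(3), loss)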
Applications of CBoW
Text Classification
The semantic information encoded in the word vectors helps improve the performance of classifiers by capturing contextual and syntactic information.
Sentiment Analysis
Sentiment analysis tasks benefit from CBOW embeddings as they capture sentiment-related context and semantic relationships between words. We can then use these embeddings as features for sentiment classification models.
Information Retrieval
CBOW embeddings can enhance information retrieval systems by capturing semantic relationships between words. This is useful for improving the relevance of search results based on the meaning of words in a query.
Question Answering
In question-answering systems, CBOW embeddings contribute to understanding the semantics of both questions and answers. This aids in matching relevant information and improving the accuracy of question-answering models.
Limitations of CBoW
Loss of Word Order Information
CBOW operates on the assumption that the meaning of a word can be adequately captured by considering its surrounding context words, regardless of their order.
In CBOW, the vectors representing context words are averaged to create a single vector that serves as the input to the model for predicting the target word.
This averaging process discards the sequential information and treats all context words equally, assuming that their combined influence is representative of the word’s meaning.
This means that the model might struggle to capture certain semantics that depend on the order of words in a sentence.
Dimensionality curse with large vocabularies
The size of the word embeddings produced by CBOW is a hyperparameter chosen before training. For example, if you decide to have 300-dimensional word embeddings, each word in the vocabulary will be represented by a vector of length 300.
With a large vocabulary, the number of parameters in the model increases significantly. If the vocabulary size is very large, the model needs to learn a high-dimensional vector for each word, resulting in a computationally expensive process. For example, if our vocabulary size is 10,000 words and our embedding dimension is 300, the embedding matrix alone holds 3 million weights, and those weights need to be updated on each iteration, which requires a lot of computational resources.
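A quick back-of-the-envelope calculation (assuming 300-dimensional embeddings, as in the example above) shows how a single embedding matrix grows with the vocabulary:

# Number of weights in one embedding matrix for different vocabulary sizes
embedding_dim = 300
for vocab_size in (10_000, 100_000, 1_000_000):
    weights = vocab_size * embedding_dim   # one weight per (word, dimension) pair
    print(f"vocab = {vocab_size:>9,} -> {weights:>13,} weights")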
Implementation of CBoW with Tensorflow
1. Import required libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
2. Tokenize data and build vocabulary
- The Tokenizer class tokenizes the text and creates a vocabulary.
- The <OOV> token helps the model deal with words that are not present in the vocabulary.
# Sample corpus
corpus = [
    "the quick brown fox jumps",
    "over the lazy dog",
]
# Tokenize and create vocabulary
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
3. Preprocess data to generate our targets and context words in a specified window
- The texts_to_sequences method of the tokenizer converts the input sentences in the corpus into sequences of integers.
– First, each unique word in the corpus gets a unique integer index.
– Then, the resulting sequences variable is a list of lists, where each inner list represents the sequence of word indices.
- Next, for each target word, a context window selects the words within a certain range around the target word.
– The left_window and right_window variables determine the boundaries of the context window.
– Then, we extract context words from the document by slicing it based on these boundaries.
context_window = 2

def generate_data(corpus, window_size, tokenizer):
    sequences = tokenizer.texts_to_sequences(corpus)
    contexts, targets = [], []
    for doc in sequences:
        current_index = 0
        doc_len = len(doc)
        # grab each center word and its context words
        while current_index < doc_len:
            # target word
            target_word = doc[current_index]
            # context words within window_size on each side of the target
            left_window = max(0, current_index - window_size)
            right_window = min(current_index + window_size + 1, doc_len)  # +1 because the slice end is exclusive
            context_words = doc[left_window:current_index] + doc[current_index + 1:right_window]
            # add context and target to our training data
            contexts.append(context_words)
            targets.append(target_word)
            current_index += 1
    # pad shorter contexts so every row has length 2 * window_size
    contexts = pad_sequences(contexts, maxlen=window_size * 2)
    return np.array(contexts), np.array(targets)

X_train, y_train = generate_data(corpus, context_window, tokenizer)
X_train, y_train = generate_data(corpus, context_window, tokenizer)
4. Define our model
- First, Embedding Layer:
– This layer is responsible for creating word embeddings.
– The Embedding layer takes the integer word indices (conceptually, one-hot encoded words) as input and converts them into dense vectors of fixed size (embedding_dim).
– The input_dim will be vocab_size, which is the size of the vocabulary.
– The input_length equals context_window*2 to accommodate the left and right windows.
- Second, GlobalAveragePooling1D Layer:
– This layer calculates the average of all the embeddings along the sequence dimension.
– In addition, it helps reduce the dimensionality of the data before passing it to the next layer.
- Finally, Dense Layer:
– This is the output layer with a number of units equal to the vocabulary size (vocab_size).
– First, it computes a score for each vocabulary word as the dot product between the averaged context vector and that word’s output weight vector (plus a bias).
– We then apply the ‘softmax’ activation function to the resulting scores.
– Finally, softmax converts the scores into a probability distribution over the vocabulary, indicating the likelihood of each word being the target word.
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=context_window*2),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()
5. Training the model
epochs = 50
batch_size = 16
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)
6. Test our Learned weights of the words
- First, the get_word_vector function retrieves the learned embedding vector for a given word from the learned_embeddings matrix.
- Then, once we have the target word vector, we compute a similarity score between the target word vector and all other word vectors in the embedding matrix.
– We do this by taking the dot product of the target vector with the learned_embeddings matrix (an unnormalized cosine similarity, since the vectors are not length-normalized).
- Next, we identify the indices of the words with the highest similarity scores.
– The np.argsort function returns the indices that would sort the similarity array in ascending order.
– By taking the last top_n elements, we get the indices of the top_n words with the highest similarity scores.
- Finally, we use the indices to retrieve the actual words from the index_to_word dictionary, excluding the padding token (index 0).
– The result is a list of words that are most similar to the input word, based on the learned word embeddings.
# learned embeddings
learned_embeddings = model.layers[0].get_weights()[0]
index_to_word = {i: w for w, i in word_index.items()}

def get_word_vector(word):
    # fall back to the <OOV> index for words that are not in the vocabulary
    index = word_index.get(word, word_index['<OOV>'])
    return learned_embeddings[index]

word = 'fox'
# Example: Get the word vector for the word 'fox'
word_vector = get_word_vector(word)

# Find similar words to a given word
def find_similar_words(word, top_n=2):
    target_vector = get_word_vector(word)
    # dot-product similarity against every word embedding
    distances = learned_embeddings @ target_vector
    closest_indices = np.argsort(distances)[-top_n:]
    similar_words = [index_to_word[index] for index in closest_indices if index != 0]
    return similar_words

# Example: Find similar words to 'fox'
similar_words = find_similar_words(word, 3)
print(f"Similar words to {word}:", similar_words)