Tokenization: The Cornerstone for NLP Tasks

Tokenization stands at the heart of Natural Language Processing (NLP), serving as a critical bridge that narrows the gap between human communication and machine understanding, enabling computers to grasp the intricacies of language.

One of the primary challenges in NLP lies in transforming the rich semantics of human language into a format that machine learning models can digest. A sentence, composed of words and meanings, needs to be translated into a series of discrete units to become eligible for computational analysis.

What will be covered in this article:

  • Constructing the linguistic bridge between human language and machines
  • Building a vocabulary
  • Breakdown of tokenization techniques
  • Text processing
  • The importance of tokenization
  • Tokenization challenges
  • Index assignment methods

Vocabulary Building: A Fundamental Bridge in Natural Language Processing

Building a vocabulary is a pivotal step in natural language processing (NLP). A vocabulary acts as a bridge between the diverse expressions of human language and the numerical requirements of machine learning models. Building one means creating a structured mapping between unique tokens, such as words or subword units, and integer indices.

Building Vocabulary Steps:

  1. Tokenization
  2. Text processing
  3. Index assignment
  4. Vocabulary Representation

What is Tokenization?

Tokenization plays a pivotal role in addressing the challenge of representing text in a format that machines can effectively process and understand. Tokenization serves as a transitional bridge between the richness of human language and the structured, numerical requirements of computational models.

The tokenization process breaks down a sentence into smaller units, called tokens. These tokens can be individual words or even parts of words. Each token represents a meaningful unit of language. This technique is fundamental in natural language processing (NLP) and computational linguistics.

Splitting sentence into separate tokens
Figure 1: Splitting Text into tokens.

The choice of tokenization method depends on the nature of the text data and the specific requirements of the NLP task.

Tokenization Methods

1. Whitespace Tokenization:

Whitespace tokenization involves splitting a text into tokens based on whitespace (spaces, tabs, and line breaks).

sentence = "Learning Tokenization with ML Archive"
tokens = sentence.split()
print(f"Sentence: {sentence} \nTokens: {tokens}")

2. Word Tokenization:

Word tokenization breaks a text down into individual words, with punctuation marks treated as separate tokens.

import re

# Sample sentence
sentence = "Word tokenization, where Punctuation marks are treated as separate tokens."

# Tokenize the sentence into words based on spaces and punctuation
tokens = re.findall(r'\b\w+\b|[.,;!?]', sentence)

# Print the result
print("Original Sentence:", sentence)
print("Word Tokens:", tokens)

3. Subword Tokenization:

Subword Tokenization breaks down words into smaller units, such as subwords or characters. This is useful for handling languages with complex word structures or for creating smaller vocabulary sizes.

Splitting text into subword units
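To make the idea concrete, here is a minimal sketch of greedy longest-match subword splitting, in the spirit of WordPiece. The tiny vocabulary, the "##" continuation prefix, and the function name are assumptions chosen for illustration; real subword tokenizers learn their vocabulary from a corpus.

```python
# Toy subword vocabulary (hypothetical, for illustration only).
# Pieces that continue a word carry the "##" prefix.
SUBWORD_VOCAB = {"token", "##ization", "##s", "learn", "##ing", "un", "##known"}


def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


for word in ["tokenization", "learning", "tokens", "xyz"]:
    print(word, "->", subword_tokenize(word, SUBWORD_VOCAB))
```

With only seven pieces the sketch covers several words it has never seen as a whole, which is how subword methods keep vocabularies small.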

Tokenization Libraries

Tokenization can be accomplished with a variety of libraries. The choice often depends on the specific requirements of the natural language processing (NLP) task at hand.

1. NLTK Tokenizer

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

import nltk
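from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)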

sentence = "NLTK is a powerful library for natural language processing."
tokens_nltk = word_tokenize(sentence)
print("NLTK Tokens:", tokens_nltk)

2. spaCy Tokenizer

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems or to pre-process text for deep learning.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy provides advanced tokenization capabilities.")
tokens_spacy = [token.text for token in doc]
print("spaCy Tokens:", tokens_spacy)

3. Keras Tokenizer

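The open-source Keras library is a widely used deep learning framework. Its text_to_word_sequence helper lowercases the text, strips punctuation, and splits on whitespace by default.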

# Legacy import path: recent Keras releases may no longer ship keras.preprocessing.text
from keras.preprocessing.text import text_to_word_sequence

sentence = "Keras is a high-level neural networks API."
tokens_keras = text_to_word_sequence(sentence)
print("Keras Tokens:", tokens_keras)

4. Hugging Face Tokenizer

The Hugging Face Hub is an online platform hosting over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together. The AutoTokenizer class from the transformers library loads the tokenizer that matches a given model.

from transformers import AutoTokenizer

# Replace 'bert-base-uncased' with the desired model name
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example sentence
sentence = "Hugging Face Transformers simplify NLP workflows."

# Tokenize the sentence
tokens_hugging_face = tokenizer(sentence)
print("Hugging Face Tokens:", tokenizer.convert_ids_to_tokens(tokens_hugging_face['input_ids']))

Text processing

Once we have our tokens, we often need to clean up the text. This stage is about stripping away the extraneous: punctuation and common words that cloud meaning rather than clarify it. Converting all words to lowercase also standardizes the text, which the later analysis steps rely on.

Text processing is not a one-size-fits-all procedure. Which steps to apply, and in what order, depends on the task. The goal is to turn raw language into a consistent, manageable format that is ready for deeper analysis.

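import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Fetch the NLTK resources used below (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)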

# Sample sentence
sentence = "Text processing after tokenization involves various tasks."

# Tokenization
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

1. Lowercasing:

Convert all tokens to lowercase. This ensures consistency, so that words are treated identically regardless of their original case.

# Lowercasing
tokens_lower = [token.lower() for token in tokens]
print("Lowercasing:", tokens_lower)

2. Stop words Removal:

Remove common words (stopwords) that do not contribute much to the meaning of the text (e.g., “the”, “and”, “is”).

# Stopword removal (applied to the lowercased tokens, because NLTK's stop word list is lowercase)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens_lower if token not in stop_words]
print("filtered_tokens:", filtered_tokens)

3. Lemmatization and Stemming:

Lemmatization and Stemming reduce words to their base or root form.

Stemming: Stemming reduces words to a base or root form by stripping affixes, mostly suffixes. The resulting stem is a common base form that may not be a valid word. Stemming is the more aggressive, heuristic-based approach.

Lemmatization: Lemmatization, on the other hand, uses a dictionary and the word's role in the sentence (its part of speech) to reduce it to its base or dictionary form, known as the lemma. The result is a valid dictionary word. Lemmatization is more context-aware and linguistically grounded than stemming.

Stemming vs lemmatization
Figure 2: Stemming vs Lemmatization.
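# Stemming (rule-based suffix stripping; the output may not be a dictionary word)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("stemmed_tokens:", stemmed_tokens)

# Lemmatization (without a pos argument, WordNetLemmatizer treats every word as a noun)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("lemmatized_tokens:", lemmatized_tokens)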

4. Removing Punctuation and Special Characters:

Eliminate punctuation marks and other special characters that may not be essential for the analysis.

# Removing punctuation (drop tokens made up entirely of punctuation characters)
cleaned_tokens = [token for token in lemmatized_tokens
                  if not all(char in string.punctuation for char in token)]
print("cleaned_tokens:", cleaned_tokens)

5. Handling Numeric Tokens:

Decide whether to keep, replace, or remove numerical tokens based on the specific task. For some applications, numbers may be relevant; for others, they can be treated as noise.

# Removing Numeric Tokens
cleaned_tokens = [token for token in cleaned_tokens if not token.isdigit()]
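
When numbers carry signal but their exact values do not, replacing them with a shared placeholder is a common alternative to removing them. The sketch below is one possible way to do that; the "<NUM>" placeholder, the function name, and the sample tokens are assumptions made for illustration.

```python
def replace_numeric_tokens(tokens, placeholder="<NUM>"):
    """Swap tokens that look like numbers for a shared placeholder."""
    replaced = []
    for token in tokens:
        # Ignore thousands separators and one decimal point before the digit check
        stripped = token.replace(",", "").replace(".", "", 1)
        replaced.append(placeholder if stripped.isdigit() else token)
    return replaced


sample_tokens = ["the", "price", "rose", "3.5", "percent", "to", "1,200", "in", "2024"]
print(replace_numeric_tokens(sample_tokens))
```

The model still sees that a number occurred at each position, while the vocabulary no longer grows with every distinct value.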

6. Removing HTML Tags or Markup:

If working with web data, remove any HTML tags or markup language to focus on the text content.

from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Use BeautifulSoup to remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    cleaned_text = soup.get_text(separator=' ')
    return cleaned_text

# Sample HTML text
html_text = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"

# Remove HTML tags
cleaned_text = remove_html_tags(html_text)

print("Original HTML Text:", html_text)
print("Cleaned Text (without HTML tags):", cleaned_text)

7. Tailored Tokenization: Custom Token Filtering

Add domain-specific token filters to meet the precise needs of your task, for example by dropping terms that are uninformative in your domain or by protecting identifiers that generic rules would mangle. Tailoring the filters in this way makes the resulting tokens more relevant to the analysis at hand.
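
As a rough illustration of domain-specific filtering, the sketch below imagines a support-ticket corpus. The stop word set, the issue-key pattern, and the function name are all hypothetical; they only show the shape such a filter might take.

```python
import re

# Hypothetical domain rules for a support-ticket corpus
DOMAIN_STOPWORDS = {"ticket", "customer", "regards"}
# Protected tokens, e.g. issue keys such as "NET-404"
KEEP_PATTERNS = [re.compile(r"^[A-Z]{2,5}-\d+$")]


def custom_filter(tokens, min_length=2):
    """Keep protected tokens as-is; lowercase the rest and drop domain noise."""
    kept = []
    for token in tokens:
        if any(pattern.match(token) for pattern in KEEP_PATTERNS):
            kept.append(token)  # protected: keep the original casing
            continue
        lowered = token.lower()
        if len(lowered) < min_length or lowered in DOMAIN_STOPWORDS:
            continue
        kept.append(lowered)
    return kept


tokens = ["Customer", "reported", "NET-404", "on", "a", "VPN", "ticket"]
print(custom_filter(tokens))
```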

Text Processing Importance

Importance of text processing.
Figure 3: Text Processing Importance

Effective Dimensionality Reduction:
Text processing contributes to dimensionality reduction by eliminating unnecessary features, handling synonyms, and reducing the overall complexity of the data. This results in more efficient and focused models.

Enhanced Data Quality:
Text processing improves data quality by normalizing, cleaning, and standardizing textual content. This ensures consistency and reduces noise, providing a more reliable foundation for analysis.

Improved Model Performance:
Cleaned and pre-processed text serves as optimal input for machine learning models. This, in turn, enhances the performance of models by reducing overfitting, speeding up training times, and improving generalization to unseen data.

Enhanced Interpretability:
Processed and refined data provides clearer insights, making it easier to interpret model predictions and outputs. By reducing noise and enhancing features, text processing helps create models that are more interpretable and explainable.

Better Generalization:
Consistent, de-noised input keeps a model from fitting surface variation such as casing, punctuation, or markup, so what it learns carries over better to unseen data.

Text processing Challenges

Text processing challenges
Figure 4: Challenges of Text Processing

Ambiguity: Words with multiple meanings or ambiguous structures make it hard to decide on the appropriate tokenization.
Example: “Bass” can refer to a fish or a musical instrument.

Contractions and Informal Language: Contractions and informal language, common in everyday text, may not follow standard rules, which makes them difficult to tokenize accurately.
Example: “Can’t” might be split into “can” and “’t”, or into “ca” and “n’t”, depending on the tokenizer.

Named Entities: Tokenizing named entities and compound words correctly is crucial, especially in languages where such constructs are prevalent.
Example: “New York” should be treated as a single token, not two separate words.

Handling Special Characters and Punctuation: Deciding whether to keep, remove, or tokenize special characters and punctuation can change how the text is interpreted.
Example: “C++” or “e-mail” may require special consideration.

Context-Dependent Tokenization: The same word may need different tokenization based on its context within a sentence or document.
Example: “He fishes in the river” vs. “Bass fishing is his hobby.”

Efficiency in Large Datasets: Efficient tokenization becomes crucial when dealing with large datasets, requiring optimizations to avoid performance bottlenecks.
Example: Processing millions of documents quickly without sacrificing accuracy.
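
A few of these challenges can be handled with simple rules. The sketch below is a minimal illustration, not a production approach: the contraction table, the multi-word table, and the regular expression are hypothetical choices made for this example.

```python
import re

# Hypothetical lookup tables; real projects would curate much larger ones
CONTRACTIONS = {"can't": ["can", "not"], "won't": ["will", "not"], "it's": ["it", "is"]}
MULTIWORD = {("new", "york"): "new_york", ("machine", "learning"): "machine_learning"}


def tokenize_with_rules(text):
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    # Keep internal hyphens/apostrophes and trailing "+" so "e-mail" and "c++" survive
    raw = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*\+*", text)

    # Expand known contractions into their full forms
    expanded = []
    for token in raw:
        expanded.extend(CONTRACTIONS.get(token, [token]))

    # Merge known two-word expressions into a single token
    merged, i = [], 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in MULTIWORD:
            merged.append(MULTIWORD[pair])
            i += 2
        else:
            merged.append(expanded[i])
            i += 1
    return merged


print(tokenize_with_rules("I can't e-mail the C++ team in New York."))
```

Rule tables like these do not resolve ambiguity such as “bass”; that usually requires a model that looks at the surrounding context.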

Tokens Index Assignment

Once our text is split into tokens, we assign a unique integer index to each token. Commonly, index 0 is reserved for a special token (e.g., a padding or unknown token), and the remaining indices are assigned using one of the approaches below.

1. Frequency-Based Approaches:

1.1 Count-Based Vocabulary:

Assign indices based on the frequency of occurrence. More frequent tokens are usually assigned lower indices.
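
A count-based vocabulary can be sketched with the standard library alone. The special token names ("<pad>", "<unk>"), the alphabetical tie-break, and the toy documents are assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs, specials=("<pad>", "<unk>")):
    """Map tokens to indices, most frequent first, after the special tokens."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocab = {token: index for index, token in enumerate(specials)}
    # Sort by descending count, then alphabetically so ties are deterministic
    for token, _ in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        vocab[token] = len(vocab)
    return vocab


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(docs)
print(vocab)
```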

1.2 TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic used in natural language processing and information retrieval to evaluate the importance of a term within a document relative to a collection of documents (corpus). It highlights terms that are frequent within a specific document but rare across the corpus as a whole. Strictly speaking, TF-IDF produces a weight for each term rather than an index; those weights can then be used to rank tokens when deciding which ones to keep in the vocabulary.

Calculation process of TF-IDF:
1.2.1. Term Frequency (TF):
Measures how often a term appears in a document. Calculated as the number of times a term (word) appears in a document divided by the total number of terms in that document. It helps to identify the importance of a term within a specific document.

(1)   \begin{equation*}\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}\end{equation*}

1.2.2. Inverse Document Frequency (IDF):
Measures how unique or rare a term is across the entire corpus. Calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the term. It helps to identify the importance of a term in the broader context of the corpus.

(2)   \begin{equation*}\text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in the corpus } D}{\text{Number of documents containing term } t}\right)\end{equation*}

1.2.3. TF-IDF Score:
Combines the TF and IDF components into a single score for each term in a document. The higher the TF-IDF score, the more characteristic the term is of that document relative to the rest of the corpus. A term that appears in every document scores zero, because its IDF is log(1) = 0.

(3)   \begin{equation*}\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)\end{equation*}
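
The three formulas above translate almost line for line into code. This sketch assumes documents are lists of tokens, uses the natural logarithm (the formulas do not fix a base), and relies on a small made-up corpus.

```python
import math


def tf(term, doc):
    """Equation (1): occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Equation (2): log of total documents over documents containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    # Assumes the term occurs in at least one document (otherwise this divides by zero)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    """Equation (3): product of TF and IDF."""
    return tf(term, doc) * idf(term, corpus)


corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]
for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus “the” appears in every document, so its score is zero, while “mat” appears in only one document and receives the highest score of the three.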

2. Top-N Tokens:

Select the N most frequent tokens in the corpus and assign indices to them only. This keeps the vocabulary compact and discards rare or noisy tokens, which are typically mapped to a single unknown-token index.
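
A top-N vocabulary might look like the following sketch. The "<unk>" token name, reserving index 0 for it, and the helper names are assumptions for illustration.

```python
from collections import Counter


def build_top_n_vocab(tokens, n, unk_token="<unk>"):
    """Keep only the n most frequent tokens; index 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for token, _ in Counter(tokens).most_common(n):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Map each token to its index, falling back to the unknown index."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


corpus_tokens = "to be or not to be that is the question to be".split()
top_vocab = build_top_n_vocab(corpus_tokens, n=2)
print(top_vocab)
print(encode(["to", "be", "or", "not"], top_vocab))
```

Choosing N is a trade-off: a small N keeps the model compact but sends more tokens to the unknown index.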

3. Custom Approaches:

Domain-Specific Vocabulary: For specialized domains, manually curate a vocabulary to include terms relevant to the specific context. This ensures that the vocabulary is tailored to the characteristics of the data.

Vocabulary Representation:

Vocabulary with words as key and their index as value
Figure 5: Vocabulary Dictionary

Create a Vocabulary Dictionary where tokens are keys, and their corresponding indices are values. This dictionary represents the vocabulary.


To prepare the text for model input, the text is tokenized and each token is converted into a binary vector whose length equals the size of the vocabulary. The vector is filled with zeros except for the position corresponding to the token's index, which is set to one. Each one-hot vector identifies a single token; combining the vectors of a sentence (for example, with an element-wise maximum) encodes the presence or absence of each vocabulary token in that sentence.
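
The encoding can be sketched in a few lines of plain Python. The small vocabulary, the "<unk>" fallback, and the sample tokens below are assumptions chosen for illustration.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single one at the given index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


vocab = {"<unk>": 0, "learning": 1, "tokenization": 2, "with": 3, "ml": 4}
sentence_tokens = ["learning", "tokenization", "with", "python"]

# Tokens missing from the vocabulary fall back to the "<unk>" index
encoded = [one_hot(vocab.get(token, vocab["<unk>"]), len(vocab)) for token in sentence_tokens]
for token, vector in zip(sentence_tokens, encoded):
    print(f"{token:>12} -> {vector}")

# Element-wise maximum over the token vectors marks which vocabulary entries occur
presence = [max(column) for column in zip(*encoded)]
print("presence:", presence)
```

One-hot vectors grow with the vocabulary, so in practice they are often replaced by the integer indices themselves, which a model then maps to dense embeddings.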