The Ultimate Guide To Natural Language Processing (NLP)
We all hear about natural language processing and the impact it has on state-of-the-art technologies. We have also heard about OpenAI's GPT-3 and ChatGPT, and how astonishing their performance is, as if there were an alien behind the scenes that understands our questions and gives intelligent answers. But what exactly is NLP, and how can we learn it? Here's the ultimate guide to Natural Language Processing (NLP).
After reading this post you should know the following:
- What is Natural Language Processing (NLP)?
- Why is NLP considered a hard problem?
- NLP use cases and projects
- Tokenization
- Resources & References
What is Natural Language Processing (NLP)?
NLP is a field of linguistics and machine learning focused on understanding everything related to human language. Some people think it deals only with text datasets, but that is not fully correct: it covers any form of communication between humans, such as:
- Visual – QA or OCR (Optical character recognition)
- Textual
- Verbal – Speech
Why is Natural Language Processing considered a hard problem?
NLP is not easy. Several factors make the problem hard. For example, there are hundreds of natural languages, each with its own syntax rules, and words can be ambiguous, with meanings that depend on their context.
Humans also use words and phrases differently, speak with different accents, and use idioms, metaphors, homophones, and many other complexities of language.
All of this makes the development process time-consuming, so using pre-existing NLP technologies can reduce product-building time.
Natural Language Processing Use Cases and Projects
The aim of NLP tasks is not only to understand single words individually, but also to understand the context in which those words appear.
NLP leverages machine-learning techniques to derive meaningful insights from text, so that organizations can analyze and extract information about their customers, locations, and more. Here are some use cases and projects for NLP.
Sentiment Analysis
This is one of the most popular NLP projects. It is widely used by companies to monitor reviews of their products through customer feedback. If the reviews are positive, the company is on the right track; otherwise, the product needs improvement. Here is a quick code example:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("The football match was great!.")
output --> [{'label': 'POSITIVE', 'score': 0.9998636245727539}]
Translation
We can also use NLP to translate from one language to another while keeping the context correct! Here is a quick example translating English to German:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
translator("i want to get into conventional conversation and contribute to the conference")
output --> [{'translation_text': 'Ich möchte in ein konventionelles Gespräch kommen und zur Konferenz beitragen'}]
Topic classification
The task is to take a document and use relevant algorithms to label it with an appropriate topic. Let's say you want to classify paragraphs by topic in order to group them together. Here is a quick code example:
classifier = pipeline("zero-shot-classification")
classifier(
"Global economy is going down",
candidate_labels=["education", "politics", "business", "comedy"],
)
output --> {'sequence': 'Global economy is going down',
'labels': ['business', 'politics', 'education', 'comedy'],
'scores': [0.9435223937034607,
0.01945255510509014,
0.01884406805038452,
0.018181029707193375]}
Summarization
This is also a very useful solution if you want to extract only the important points of an article instead of reading it as a whole. Here is a quick code example:
summarizer = pipeline("summarization")
summarizer(
"""
There are a few exceptions, however:
Certain determiners, such as all, both, half, double, precede the definite article when used in combination (all the team, both the girls, half the time, double the amount).
Such and what precede the indefinite article (such an idiot, what a day!).
Adjectives qualified by too, so, as and how generally precede the indefinite article: too great a loss, so hard a problem, as delicious an apple as I have ever tasted, I know how pretty a girl she is.
When adjectives are qualified by quite (particularly when it means “fairly”), the word quite (but not the adjective itself) often precedes the indefinite article: quite a long letter.
"""
)
output --> [{'summary_text': ' Certain determiners, such as all, both, half, double, precede the definite article when used in combination (all the team, both the girls, half the time, double the amount). Such and what precede such an idiot, what a day! Adjectives qualified by too, so, as and how generally precede the indefinite article .'}]
There are many other tasks, but before going through them all, we need to understand the building blocks of NLP that will help us implement and perform these tasks.
Tokenization
The most important building block of NLP is the tokenizer. Tokenization is the process of splitting a body of text into ‘tokens’; in the simplest case, that means splitting a sentence into a list of words. For example, “I have an appointment tomorrow” becomes [“I”, “have”, “an”, “appointment”, “tomorrow”].
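The example above can be reproduced with a naive whitespace split (a minimal sketch; real tokenizers also handle punctuation, casing, and unknown words):

```python
# A naive whitespace tokenizer: split a sentence into word tokens.
sentence = "I have an appointment tomorrow"
tokens = sentence.split()
print(tokens)  # → ['I', 'have', 'an', 'appointment', 'tomorrow']
```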
There are many types of tokenizers such as:
- Word-based (Split the raw text into words and find a numerical representation for each of them)
- Character-based (Split the text into characters, rather than words)
- Sub-word (Rely on the principle that frequently used words should not be split into smaller sub-words, but rare words should be decomposed into meaningful sub-words.)
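The difference between these strategies can be sketched in plain Python (the sub-word split shown is purely illustrative, not the output of any real tokenizer):

```python
text = "unbelievable"

# Word-based: the whole word is one token.
word_tokens = [text]

# Character-based: every character is a token.
char_tokens = list(text)

# Sub-word (hypothetical WordPiece-style split): a rare word is
# decomposed into meaningful pieces such as prefix, stem, and suffix.
subword_tokens = ["un", "##believ", "##able"]

print(word_tokens)     # → ['unbelievable']
print(char_tokens)     # → ['u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'e']
print(subword_tokens)  # → ['un', '##believ', '##able']
```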
To use a tokenizer, you can load one of the Hugging Face models and use it as in the following example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Hello, I am learning new things on Machine Learning Archive! ")
output -->
{
'input_ids': [101, 8667, 117, 146, 1821, 3776, 1207, 1614, 1113, 7792, 9681, 15041, 106, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
The input_ids are the tokens that we are looking for: a numerical representation of each sub-word.
token_type_ids show that these tokens all belong to one sentence (they distinguish the two sentences when a sentence pair is passed in).
attention_mask shows which tokens the transformer should attend to; positions marked 0 (such as padding) are ignored.
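To see where a 0 in the attention_mask comes from, here is a toy padding sketch (plain Python, not the Hugging Face API): shorter sequences in a batch are padded to the longest length, and the padded positions get mask value 0.

```python
# Toy padding sketch: pad token id 0; the mask marks real tokens (1) vs padding (0).
batch = [[101, 8667, 102], [101, 8667, 117, 146, 102]]

max_len = max(len(seq) for seq in batch)
input_ids = [seq + [0] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # → [[101, 8667, 102, 0, 0], [101, 8667, 117, 146, 102]]
print(attention_mask)  # → [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```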
These are the outputs of the tokenizer, and they are also the inputs to the transformer, as explained. Let's see how to pass them to a transformer and perform a classification task: positive vs. negative sentences!
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["Hello, I am learning new things on Machine Learning Archive!"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
output.logits
output -->
tensor([[-3.8988, 4.1792]], grad_fn=<AddmmBackward0>)
Taking the argmax gives us index 1, which corresponds to the POSITIVE class.
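The logits printed above can be turned into probabilities with a softmax and into a class index with an argmax; here is a sketch in plain Python on those two values (in practice you would use torch.softmax and torch.argmax, and this checkpoint maps index 0 to NEGATIVE and index 1 to POSITIVE):

```python
import math

logits = [-3.8988, 4.1792]          # the values printed above
labels = ["NEGATIVE", "POSITIVE"]   # label order for this checkpoint

# Softmax: exponentiate and normalize to get probabilities.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Argmax: index of the highest probability.
pred = max(range(len(probs)), key=probs.__getitem__)
print(labels[pred], round(probs[pred], 4))  # → POSITIVE 0.9997
```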
We will discuss more on the NLP topic in the upcoming posts!
Resources & References
- Full Source code for the post
- HuggingFace Models and Tutorials
- Read more about Transformers