BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model released by Google in 2018 for natural language processing tasks such as sentiment analysis, question answering, language translation, and fake news detection. With the rise of social media and online news sources, it has become increasingly difficult to distinguish real news from fake news. This is where BERT comes in: it has been pre-trained on a large corpus of text and can be fine-tuned to detect fake news.
By fine-tuning BERT on a dataset of news articles labeled as real or fake, we can perform fake news detection. BERT can pick up patterns and inconsistencies in the language of an article that suggest it may be fake. In this way, BERT can help to combat the spread of misinformation and help people reach accurate and trustworthy information. After reading this post you should know the following:
- What is BERT? And what is it used for?
- Types of BERT Models
- Implementation steps using BERT Base (Code)
- How does the BERT model work?
What is BERT?
BERT is a transformer-based model that uses a bidirectional approach: it learns the context of a word by considering both the words to its left and the words to its right. This allows BERT to better capture the relationships between words in a sentence and the nuances of language. It is pre-trained on large amounts of text, namely English Wikipedia and the BooksCorpus dataset, using two objectives: a masked language model and a next sentence prediction task. In the masked language model, some of the tokens in a sentence are randomly masked and the model is trained to predict them from the surrounding context. In the next sentence prediction task, BERT learns to predict whether the second of two sentences actually follows the first in the original text.
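As a rough illustration of the masking step described above, here is a minimal pure-Python sketch. The 15% masking rate and the 80/10/10 split between [MASK], a random token, and the unchanged token follow the original BERT paper. The function name `mask_tokens`, the whitespace tokenization, and the tiny vocabulary are assumptions made for this example and do not come from any library. The demo call uses a higher masking rate only so that a seven-word sentence shows some masking.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)     # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)      # 10%: keep the token unchanged
        else:
            masked.append(tok)
            labels.append(None)         # no loss at this position
    return masked, labels

# Toy example: whitespace "tokenizer" and a vocabulary built from the sentence.
tokens = "the senator denied the report on monday".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at the positions where `labels` is not None.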
After pre-training, the BERT model can be fine-tuned on a specific task, such as sentiment analysis. Fine-tuning involves taking the pre-trained model and training it on a smaller dataset specific to the task at hand. During fine-tuning, the weights of the pre-trained model are adjusted to better fit the new data, allowing the model to perform well on the specific task.
The most common sizes of BERT are BERT BASE and BERT LARGE. The LARGE model achieved state-of-the-art results when it was released, whereas the BASE model was chosen to be the same size as OpenAI GPT so that the two architectures could be compared directly.
Semi-supervised learning was a key factor in BERT’s success across numerous NLP tasks. The model is first pre-trained on unlabeled text, which lets it learn general linguistic patterns. Once pre-trained, BERT’s language representations can be used to strengthen other models that we build and train with supervised learning. Moreover, BERT uses the same model architecture for all tasks, be it natural language inference, classification, or question answering, with minimal changes such as adding an output layer for classification.
BERT’s Special Tokens
BERT uses the special tokens [CLS] and [SEP] to structure its input. A [SEP] token is inserted at the end of each input sentence. [CLS] is a special classification token placed at the start of the input, and the last hidden state of BERT corresponding to this token (h[CLS]) is used for classification tasks. BERT represents input tokens with WordPiece embeddings. Along with token embeddings, BERT uses positional embeddings and segment embeddings for each token. Positional embeddings carry information about the position of each token in the sequence. Segment embeddings tell the model which sentence a token belongs to when the input is a sentence pair (Fig 2).
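The input layout described above can be sketched in a few lines of Python. This is only an illustration: `build_bert_input` is a hypothetical helper, and splitting on whitespace stands in for real WordPiece tokenization, which would break rare words into subword units.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Lay out one sentence (or a sentence pair) the way BERT expects.

    Returns the token list plus the segment and position ids that index
    the segment and positional embedding tables.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)             # first sentence: segment 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence: segment 1
    position_ids = list(range(len(tokens)))     # 0, 1, 2, ... per token
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    "is this real".split(), "yes it is".split()
)
print(tokens)        # ['[CLS]', 'is', 'this', 'real', '[SEP]', 'yes', 'it', 'is', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The final input vector for each token is the sum of its token, segment, and positional embeddings.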
Overall, BERT has revolutionized natural language processing by achieving state-of-the-art results on various tasks and allowing for faster and more accurate analysis of text data.
BERT Applications and Use Cases
Some of the NLP applications that can utilize the BERT model are:
- Question Answering
- Search Query Classification
- Fake news detection
- Matching and Retrieving text
- Highlighting paragraphs
- Sentiment Analysis
- Language Translation
BERT Model Types
There are several types of BERT models, which differ in their size, complexity, and pre-training objectives. Here are some of the main types of BERT models:
BERT Base: The BERT Base model has 12 transformer layers, a hidden size of 768, and 110 million parameters. It is the most commonly used BERT model and provides a good balance between performance and efficiency.
BERT Large: The BERT Large model has 24 transformer layers, a hidden size of 1024, and 340 million parameters. It is a larger and more complex model than BERT Base, and can achieve higher performance on certain NLP tasks, but also requires more computational resources and training data.
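The parameter counts quoted above can be roughly checked by hand. The sketch below is a back-of-the-envelope count, not an official formula. It assumes the configuration of the uncased English checkpoints (a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward size of 4 times the hidden size) and includes the pooler layer. The function name `bert_param_count` is made up for this example.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate number of trainable parameters in a BERT encoder."""
    ffn = 4 * hidden
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block (weights + biases).
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per transformer layer (scale + shift each).
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Dense layer applied to the [CLS] state to produce pooled_output.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(f"BERT Base:  {bert_param_count(12, 768):,}")    # 109,482,240
print(f"BERT Large: {bert_param_count(24, 1024):,}")   # 335,141,888
```

Under these assumptions the totals come to roughly 109 million and 335 million, in line with the rounded figures of 110 million and 340 million.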
Different BERT versions:
BERT Multilingual: The BERT Multilingual model is trained on text data from multiple languages, and can be fine-tuned for various multilingual NLP tasks such as machine translation and cross-lingual information retrieval. It uses a single vocabulary shared across all languages and covers 104 languages.
BERTweet: The BERTweet model is a variant of BERT that is specifically trained on Twitter data, and can be fine-tuned for sentiment analysis, topic modeling, and other Twitter-specific NLP tasks. It has a larger vocabulary than BERT, with additional tokens and subword units specific to Twitter.
DistilBERT: The DistilBERT model is a smaller and faster variant of BERT that uses knowledge distillation to compress the BERT model while retaining most of its performance. It has 6 transformer layers, a hidden size of 768, and about 66 million parameters, fewer than BERT Base.
RoBERTa: The RoBERTa model is a variant of BERT that keeps the architecture of BERT Base or BERT Large but changes the pre-training recipe, and can achieve higher performance on certain NLP tasks than BERT. It drops the next sentence prediction objective and is trained on more data with larger batch sizes, longer sequences, and dynamic masking.
There are also many other variants and extensions of BERT that have been proposed in recent years, such as ALBERT and ELECTRA, as well as related transformer models like T5, each with their own unique features and applications.
Implementation steps using BERT Base
Here are the general steps (Fig 3) you can follow to implement a BERT model for fake news detection:
- Preprocess your data: Start by gathering a dataset of news articles and their associated real/fake labels. Preprocess your data by tokenizing the text, padding the sequences, and splitting the data into training and testing sets.
- Fine-tune a pre-trained BERT model: Next, fine-tune a pre-trained BERT model on your dataset using a framework like PyTorch or TensorFlow. You can use a pre-trained BERT model like “bert-base-uncased” and add a classification layer on top for real/fake classification.
- Train and evaluate your model: Train your BERT model on the training set, and evaluate its performance on the testing set. You can use metrics like accuracy, precision, recall, and F1-score to evaluate the performance of your model.
Importing important libraries
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the BERT preprocessing model
import seaborn as sn
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
Reading our fake news detection dataset from Kaggle
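The loading step is not shown in the post, so the snippet below is only a sketch of how it might look with pandas. The helper `load_news` and the file name `train.csv` are assumptions. The column names (title, author, text, label) are the ones used by the code later in this post, and the two in-memory rows are made-up stand-ins so the example runs without the real Kaggle file.

```python
import io

import pandas as pd

def load_news(source):
    """Read the labeled news CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)

# With the real data this would be something like:
#   train_df = load_news("train.csv")   # hypothetical path to the Kaggle file
# Here a tiny in-memory CSV stands in for it.
sample_csv = io.StringIO(
    "title,author,text,label\n"
    "Headline one,Jane Doe,Body of the first article,0\n"
    "Headline two,,Body of the second article,1\n"
)
train_df = load_news(sample_csv)
print(train_df.head())
print(train_df.isna().sum())  # missing values per column
```

Checking for missing values at this point is useful because empty fields, such as the missing author above, are read as NaN.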
Let’s take a look at our data (Fig 4).
Simple cleaning and preprocessing steps:
# Fill NaN values with a space (' ')
train_df.fillna(' ', inplace=True)
# Combine title, author and text into a single 'summary' column
train_df['summary'] = train_df['title'] + ' ' + train_df['author'] + ' ' + train_df['text']
Train and test split
x = train_df['summary']
y = train_df['label']
X_train, X_test, y_train, y_test = train_test_split(x, y, stratify=y)
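Passing stratify=y keeps the proportion of real and fake articles the same in both splits, and when no size is given scikit-learn holds out 25% of the rows for testing. To make the idea concrete, here is a small pure-Python sketch of a stratified split. `stratified_split` is a hypothetical helper written for illustration and is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so that each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80 real (0) and 20 fake (1) articles.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))  # share of fake in test
```

Both splits keep the original 20% share of fake articles, which matters when the classes are imbalanced.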
BERT models typically come pre-trained and are available through TensorFlow Hub, a repository of ready-to-use machine learning models.
Two models will be downloaded, one for preprocessing and the other for encoding. Below are the links for the models.
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
Let’s define the “get_sentence_embedding” function, which runs the BERT preprocessing model followed by the encoder:
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']
Let’s start initializing the BERT layers now. Then visualize our model:
# BERT layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)
# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=METRICS)
Let’s now start training the model with our data:
model.fit(X_train, y_train, epochs=10)
Now we’ll compute the confusion matrix to compare the predicted labels with the actual labels on the test set:
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
# Threshold the sigmoid outputs at 0.5 to get 0/1 class labels
y_predicted = np.where(y_predicted > 0.5, 1, 0)
cm = confusion_matrix(y_test, y_predicted)
Then plot the confusion matrix given the true and predicted labels (Fig 5):
from matplotlib import pyplot as plt
import seaborn as sn

sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
Here’s the model performance on our held-out set: the model reached an F1 score of 86% (Fig 6).
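For readers who want to see how the confusion matrix relates to precision, recall, and F1, the following pure-Python sketch works through the arithmetic. The labels and probabilities are made-up toy values, not results from the model above, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Threshold the probabilities and count TP, FP, FN, TN for class 1."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: true labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]

tp, fp, fn, tn = confusion_counts(y_true, y_prob)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)          # 3 1 1 3
print(precision, recall, f1)   # 0.75 0.75 0.75
```

In practice, the classification_report function already imported from scikit-learn reports the same quantities per class.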
How does the model above work?
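The TensorFlow model uses the pre-trained BERT language model from TensorFlow Hub to embed sentences as fixed-length vectors, and then trains a binary classification model on top of those embeddings. Because hub.KerasLayer defaults to trainable=False, the BERT weights stay frozen in the code above and only the output layer is trained.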
The get_sentence_embedding function takes a list of sentences and returns their embeddings, computed by first applying the BERT pre-processing module (bert_preprocess), and then passing the pre-processed text through the BERT encoder module (bert_encoder). The ‘pooled_output’ key in the output dictionary from the BERT encoder corresponds to the fixed-length sentence embeddings.
The main model definition has three parts. First, we define an input layer for the raw text input. Then, we apply the BERT pre-processing and encoding layers to the input to compute the sentence embeddings. Finally, we apply a dropout layer and a dense output layer with sigmoid activation to the embeddings to perform binary classification.
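To show what the dropout and sigmoid output layers compute, here is a small pure-Python sketch of the classification head and of the binary cross-entropy loss used in model.compile. The four-dimensional embedding, the weights, and the bias are made-up numbers standing in for the 768-dimensional pooled_output and the learned parameters. The function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout(vector, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]

def classify(embedding, weights, bias, rate=0.1, training=False, rng=None):
    """Dropout (training only) followed by a single sigmoid unit."""
    if training:
        embedding = dropout(embedding, rate, rng or random.Random(0))
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(logit)

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

embedding = [0.2, -0.1, 0.4, 0.0]   # stand-in for the pooled_output vector
weights = [0.5, -0.3, 0.8, 0.1]     # stand-in for the Dense layer weights
bias = -0.2

prob = classify(embedding, weights, bias)        # inference: no dropout
print(f"probability of class 1: {prob:.3f}")
print(f"loss if the true label is 1: {binary_cross_entropy(1, prob):.3f}")
```

Dropout is active only during training. At prediction time the embedding passes straight to the sigmoid unit, and the output is thresholded at 0.5 to obtain the class label.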
The code constructs a Keras model with the tf.keras.Model API, taking the raw text input layer as its input and the sigmoid dense layer as its output, with the BERT pre-processing and encoding layers in between. The encoder loaded from TensorFlow Hub is the BERT Base model, with 12 transformer layers, a hidden size of 768, and 110 million parameters. You can then compile and train the model on a labeled dataset using the standard Keras training APIs.