Understanding Long Short-Term Memory (LSTM) Networks
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that can capture long-term dependencies in sequential data. LSTMs are able to process and analyze sequential data such as time series, text, and speech. They use a memory cell and gates to control the flow of information, allowing them to selectively retain or discard information as needed and thus avoid the vanishing gradient problem that plagues traditional RNNs. LSTMs are widely used in applications such as natural language processing, speech recognition, and time series forecasting. After reading this post, you should know the following:
- What is Long Short-Term Memory?
- Advantages and disadvantages of using LSTM
- How Does Long Short-Term Memory Work?
- Data Loading
- Create Training / Test Data
- Perform Preprocessing
- Train a Model
- Measure Model Performance
What is Long Short-Term Memory?
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) architecture that is designed to process sequential data and has the ability to remember long-term dependencies. It was introduced by Hochreiter and Schmidhuber in 1997 as a solution to the problem of vanishing gradients in traditional RNNs.
In an LSTM network, each recurrent unit contains a cell state and three types of gates: input, forget, and output gates. The input gate controls the flow of new information into the cell state, while the forget gate controls the flow of information that is no longer relevant. The output gate controls the flow of information from the cell state to the output of the unit.
The cell state is updated at each time step using a combination of the input, forget, and output gates, as well as the previous cell state. This allows the LSTM network to selectively remember or forget information over long periods of time, making it well-suited for tasks such as speech recognition, language translation, and stock price prediction.
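In the standard formulation, writing $x_t$ for the input, $h_t$ for the hidden state, and $c_t$ for the cell state at time step $t$, with $\sigma$ the sigmoid function and $\odot$ element-wise multiplication, these updates are:

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
\]

The cell state update is where the long-term memory lives: $f_t$ decides how much of the previous state to keep, and $i_t$ how much of the new candidate to add.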
Overall, LSTMs have become a popular and effective tool in the field of deep learning, and have been used in a wide range of applications across various industries (Figure 0).
Advantages and Disadvantages of Using LSTM
There are several advantages and disadvantages to using Long Short-Term Memory (LSTM) networks in machine learning and deep learning applications. Here are some of the key advantages and disadvantages:
Advantages:
- Ability to process sequential data: LSTMs are designed to work with sequential data, such as time series data or natural language text. This makes them well-suited for a wide range of applications, including speech recognition, language translation, and sentiment analysis.
- Ability to handle long-term dependencies: LSTMs are specifically designed to address the problem of vanishing gradients, which can occur in traditional RNNs when trying to process long sequences. This makes them well-suited for tasks that require processing long-term dependencies, such as predicting stock prices or weather patterns.
- Memory cell: The memory cell in an LSTM allows the network to selectively remember or forget information over long periods of time, making it more effective at handling complex tasks than other types of RNNs.
Disadvantages:
- Training complexity: LSTMs are more complex than traditional RNNs, which can make them more difficult to train. This complexity can also make it harder to interpret and debug an LSTM network.
- Overfitting: LSTMs are prone to overfitting, especially when working with small datasets. This can lead to poor performance on new, unseen data.
- Computational cost: LSTMs require more computational resources than traditional RNNs, which can make them slower and more expensive to train.
- Lack of transparency: Like other deep learning models, LSTMs can be difficult to interpret and explain. This can make it harder to understand how the model arrived at its predictions, which can be a concern in some applications.
In summary, LSTMs are a powerful tool for processing sequential data and handling long-term dependencies, but they can be more complex to train and may require more computational resources than other types of RNNs. They are best suited for applications where the benefits of their memory cell and ability to handle long-term dependencies outweigh the potential drawbacks.
How Does Long Short-Term Memory Work?
Long Short-Term Memory (LSTM) networks work by processing sequential data through a series of recurrent units, each of which contains a memory cell and three types of gates: input, forget, and output gates.
At each time step, the input gate of the LSTM unit determines which information from the current input should be stored in the memory cell. The forget gate determines which information from the previous memory cell should be discarded, and the output gate controls which information from the current input and the memory cell should be passed to the output of the unit.
The memory cell in the LSTM unit is responsible for maintaining long-term information about the input sequence. It does this by selectively updating its contents using the input and forget gates. The output gate then determines which information from the memory cell should be passed to the next LSTM unit or output layer.
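To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM step. It is an illustrative toy, not the code used later in this post; the weight dictionaries W, U, b and the toy dimensions are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'o', 'c'."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    c = f * c_prev + i * c_tilde  # keep part of the old state, add new info
    h = o * np.tanh(c)            # expose a filtered view of the cell state
    return h, c

# toy usage: input size 4, hidden size 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in 'fioc'}
U = {k: rng.normal(size=(3, 3)) for k in 'fioc'}
b = {k: np.zeros(3) for k in 'fioc'}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)

In practice you would not hand-roll this; frameworks such as TensorFlow provide optimized LSTM layers, which is what we use in the implementation below.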
During training, the parameters of the LSTM network are learned by minimizing a loss function using backpropagation through time (BPTT). This involves computing the gradients of the loss with respect to the parameters at each time step and then propagating them backwards through the network to update the parameters.
Once the LSTM network has been trained, it can be used for a variety of tasks, such as predicting future values in a time series or classifying text. During inference, the input sequence is fed through the network, and the output is generated by the final output layer.
Overall, LSTMs are a powerful tool for processing sequential data and handling long-term dependencies, making them well-suited for a wide range of applications in machine learning and deep learning (Figure 1).
Implementation Steps of LSTMs
In this section, we will use NLP to determine whether a news story is real or fake. Fake news has become a common problem: even respected media organizations have been known to propagate it and lose credibility as a result, which makes it hard to know whether a given news story is real or fake.
First, we import the needed libraries:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
import nltk
nltk.download('stopwords')
# NLTK's PorterStemmer is used for stemming the text, and the stopwords
# corpus helps remove stopwords from it;
# re (regular expressions) is used to keep only alphabetic words in the
# text and ignore everything else
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
ps = PorterStemmer()
from sklearn.metrics import classification_report
Load the fake news data from FakeData and inspect the features of the data (Figure 2):
train_df=pd.read_csv(PATH_TO_YOUR_FILE)
# print the first five rows of the training dataset
train_df.head()
Data Cleaning and Pre-Processing
We combine the “title”, “author”, and “text” columns into a new column called “summary”, and fill any missing values in the data frame with a space:
# fill NaN values with a space (' ')
train_df.fillna(' ', inplace=True)
# combine title, author, and text to form the summary column
train_df['summary'] = train_df['title'] + ' ' + train_df['author'] + ' ' + train_df['text']
x=train_df['summary']
y=train_df['label']
We remove non-alphabetic characters, convert the text to lowercase, tokenize it into words, remove stopwords, and stem the remaining words with the Porter stemming algorithm. Finally, we join the preprocessed words back into a string and add it to the “corpus” list.
# build the corpus for the training dataset
# (a set makes the stopword lookup fast inside the loop)
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(0, len(train_df)):
    review = re.sub('[^a-zA-Z]', ' ', x[i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
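As a quick sanity check (an optional illustration, not part of the original pipeline), you can push a single made-up sentence through the same steps; the exact stemmed forms depend on the NLTK version:

sample = "The Markets Are Falling Rapidly!"
demo = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
demo = [ps.stem(w) for w in demo if w not in stop_words]
print(' '.join(demo))  # prints something like: market fall rapidli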
Next, we prepare the text data for use in the deep learning model.
- First, we set the vocabulary size to 10000; this is the range of integer indexes that words will be mapped into.
- Next, we use the `one_hot` function to convert each document in the corpus into a sequence of integer indexes up to `voc_size`. Despite its name, Keras' `one_hot` does not produce one-hot vectors: it hashes each word to an integer index, so different words can occasionally collide on the same index. This is a common way to represent text data in deep learning models.
- Then, we specify a sentence length of 500, which means that all sentences in the corpus will be padded or truncated to a length of 500. This is necessary because deep learning models generally expect input data to have a fixed size.
- Finally, we use the `pad_sequences` function to pad the integer sequences to the specified length of `sent_length`, using the "pre" padding mode, which means that any padding is added to the beginning of the sequence.
- Note: if you want to use a word embedding technique, you can replace the `one_hot` function with a more sophisticated method such as Word2Vec, GloVe, or FastText (see the sketch after the code below).
# vocabulary size
voc_size = 10000
# Keras' one_hot hashes each word in a document to an integer index
one_hot_reps1 = [one_hot(sentence, voc_size) for sentence in corpus]
# specify a sentence length so that every sequence in the corpus has the same length
sent_length = 500
# pad every sequence to an equal-size vector; padding can be 'pre' or 'post'
embedded_docs1 = pad_sequences(one_hot_reps1, padding='pre', maxlen=sent_length)
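As one way to act on the note above, here is a hedged sketch of an alternative encoding. `Tokenizer`, unlike `one_hot`, assigns each word a unique index (no hash collisions), and a pretrained embedding matrix can then be handed to the `Embedding` layer. `pretrained_vectors` is a hypothetical stand-in for a loaded Word2Vec/GloVe/FastText model, and the `weights=[...]` argument assumes TF/Keras 2.x.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=voc_size)
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
embedded_docs_alt = pad_sequences(sequences, padding='pre', maxlen=sent_length)

# With pretrained vectors, build a weight matrix for the Embedding layer.
# `pretrained_vectors` is hypothetical: a dict mapping word -> 300-dim vector.
# embedding_matrix = np.zeros((voc_size, 300))
# for word, idx in tokenizer.word_index.items():
#     if idx < voc_size and word in pretrained_vectors:
#         embedding_matrix[idx] = pretrained_vectors[word]
# Embedding(voc_size, 300, weights=[embedding_matrix], trainable=False)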
We convert the preprocessed text data and labels into NumPy arrays using the `np.array` function:
x=np.array(embedded_docs1)
# labels should be encoded as 0/1 for binary classification
y=np.array(y)
Building the Model
# create the model
from tensorflow.keras.layers import Dropout
import warnings
warnings.filterwarnings('ignore')
embedded_feature_vector = 300
nn = Sequential([
    # learn a 300-dimensional embedding for each word index
    Embedding(voc_size, embedded_feature_vector, input_length=sent_length),
    Dropout(0.5),
    # LSTM layer with 199 units processes the embedded sequence
    LSTM(199),
    Dropout(0.4),
    Dense(399, activation='relu'),
    Dense(43, activation='relu'),
    # sigmoid output for binary (real/fake) classification
    Dense(1, activation='sigmoid')])
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
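After compiling, it is useful to print the architecture to verify layer shapes and parameter counts:

nn.summary()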
Splitting and Training
# here we are splitting the data for training and testing the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Train the model on the training data with validation split
nn.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=64)
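Because LSTMs are prone to overfitting (as noted in the disadvantages above), an optional refinement, sketched here with an assumed patience of 3, is to stop training when the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping

# stop when val_loss has not improved for 3 epochs; restore the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
nn.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=64,
       callbacks=[early_stop])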
Evaluation Step
We predict on the test data, then show the classification report:
y_pred = nn.predict(X_test)
# threshold the predicted probabilities to binary labels (cutoff at 0.5)
y_pred = (y_pred > 0.5).astype(int)
y_pred = y_pred.reshape(-1,)
y_test = np.array(y_test)
print(classification_report(y_test, y_pred))
We then plot the confusion matrix given the true and predicted labels (Figure 3):
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from matplotlib import pyplot as plt
import seaborn as sn
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.show()
Resources:
Full source code on GitHub