Sentiment Analysis with Random Forests
Sentiment Analysis is a computational technique that uses Natural Language Processing (NLP) and Machine Learning to determine the emotional tone or sentiment expressed in a piece of text, such as a review, tweet, or customer feedback. It aims to classify the sentiment as positive, negative, or neutral, providing insight into public opinion and user sentiment. Sentiment Analysis with Random Forest applies the Random Forest algorithm, an ensemble learning method, to sentiment classification, which makes it a practical approach for extracting sentiment information from large textual datasets. After reading this post you should know the following:
- What is Sentiment analysis?
- Challenges of Sentiment Analysis
- Sentiment Analysis Use Cases
- [Code] Data Loading – Drug review dataset
- [Code] Feature Extraction
- [Code] Random Forest model training
- [Code] Results
What is Sentiment analysis?
Text Sentiment Classification is the process of determining whether a given block of text exhibits a positive, negative, or neutral sentiment. In a generic use case, sentiment analysis involves contextually analyzing words to reveal the social sentiment towards a brand, enabling businesses to assess the market potential of their products. The primary goal of sentiment analysis is to examine public sentiment to support corporate growth, focusing on emotions and polarity such as happiness, sadness, and anger. Various Natural Language Processing techniques, including Automated, Hybrid, and Rule-based approaches, are employed to achieve accurate sentiment classification.
For instance, suppose we want to determine whether a product is meeting customer demands, or whether there is real demand for it in the market. We can track the reviews for that product using sentiment analysis. When there is a large amount of unstructured data and we wish to classify it by tagging it automatically, sentiment analysis is an effective method to apply (Figure 1).
Challenges of Sentiment Analysis
- Ambiguity and context: Words can have different meanings based on the context they are used in, making it challenging to accurately determine sentiment.
- Sarcasm and irony: Our model may struggle to identify sarcasm or irony, as the intended sentiment may be opposite to the literal meaning of the words.
- Negation handling: Negation words like “not” can completely change the sentiment of a sentence, posing difficulties for our model.
- Domain-specific language: Models trained on one domain may not generalize well to other domains due to variations in language and vocabulary.
- Handling emoticons and emojis: Emoticons and emojis can convey emotions, but their interpretation can be complex for sentiment analysis algorithms and models.
- Data imbalance: Imbalanced datasets, where one sentiment class has significantly more instances than others, can affect the model’s performance.
- Language variation: Different languages and regional dialects can express sentiments differently, making sentiment analysis across multiple languages a challenge.
- Subjectivity and individual differences: Sentiments can be subjective, varying among individuals and cultural backgrounds.
- Data noise: Noisy or irrelevant data in text, such as typos, abbreviations, or grammatical errors, can impact the accuracy of our model.
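To make one of these challenges concrete, negation is often handled with a simple heuristic: every token that follows a negation word is marked until the next punctuation mark, so that "help" and "not help" become different features. Here is a minimal sketch of that idea. The `NEGATIONS` list, the `_NEG` suffix, and the `mark_negation` helper are illustrative assumptions and are not used in the rest of this post.

```python
import re

# Illustrative negation cues; a real list would be longer.
NEGATIONS = {"not", "no", "never", "cannot"}
# Punctuation marks that end the scope of a negation.
PUNCTUATION = set(".,;:!?")

def mark_negation(text):
    """Append _NEG to tokens between a negation cue and the next punctuation mark."""
    # Lowercase word tokens (keeping contractions such as "didn't") and clause punctuation.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]", text.lower())
    marked, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negated = True           # start marking the tokens that follow
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if negated else tok)
    return marked

print(mark_negation("This drug did not help me, but the side effects were mild."))
```

The marked tokens can then be fed to any vectorizer in place of the raw text. Whether this helps on a given dataset is something to verify empirically.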
Sentiment Analysis Use Cases
As we just saw, corporations can gain insights from sentiment analysis that help them make data-driven decisions. Let’s look at some other use cases:
- Social Media Monitoring: Brands can use sentiment analysis to determine how the public perceives them on social media. For instance, a business can collect all Tweets that mention or tag it and run sentiment analysis on them to find out how the public perceives the business.
- Product/Service Analysis: Brands and organizations can run sentiment analysis on customer reviews to determine how well a product or service is performing in the market and then adjust their plans going forward.
- Stock Price Prediction: It’s critical for investors to forecast whether a company’s shares will increase or decrease. Running sentiment analysis on news headlines that contain the company’s name can provide a signal for this. The working assumption is that positive headlines tend to be followed by a rising stock price and negative headlines by a falling one.
How does Sentiment Analysis work?
- Rule-based approach: This approach relies on hand-written rules built on tokenization, parsing, and a sentiment lexicon. The method counts the number of positive and negative phrases in the sample. The sentiment is positive if there are more positive phrases than negative ones, and negative in the opposite case.
- Automated Approach: This strategy relies on machine learning. The text is first converted into numerical features, a classifier is trained on labeled examples, and the trained model then predicts the sentiment of new text. Several algorithms, including Naïve Bayes, Logistic Regression, Random Forest, and Deep Learning, can be used as the classifier.
- Hybrid Approach: This approach combines the rule-based and automated approaches described above. Its benefit is that it can achieve higher accuracy than either approach on its own.
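The rule-based approach can be sketched in a few lines. The example below counts lexicon hits and compares the totals. The two word lists and the `rule_based_sentiment` function are tiny hypothetical stand-ins for a curated sentiment lexicon, and the sketch returns "neutral" on a tie, which is one possible design choice among several.

```python
import re

# Tiny illustrative lexicons; real systems use curated resources with many more entries.
POSITIVE = {"good", "great", "effective", "helped", "love"}
NEGATIVE = {"bad", "terrible", "useless", "worse", "hate"}

def rule_based_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple lexicon counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)   # number of positive words
    neg = sum(tok in NEGATIVE for tok in tokens)   # number of negative words
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(rule_based_sentiment("Great drug, it helped a lot"))
print(rule_based_sentiment("Terrible, I feel worse"))
```

A counter like this ignores context, negation, and sarcasm, which is the main reason the automated approach is used in the implementation that follows.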
Implementation Steps with Automated Approach on Drug Reviews dataset
First, we import the needed libraries and load the drug reviews data from the UCI Machine Learning Repository.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(PATH_TO_YOUR_FILE_1)
test = pd.read_csv(PATH_TO_YOUR_FILE_2)
# Both datasets contain the same columns, so we can combine them for analysis.
# ignore_index=True gives the combined frame a fresh, non-duplicated index.
data = pd.concat([df, test], ignore_index=True)
Let’s look at the features of the data (Figure 2).
data.head()
Data Understanding
In order to understand the data and apply the necessary data pre-processing techniques, we need to visualize and understand the most important signals in our dataset.
Let’s plot the 30 most frequently reported problems (conditions):
top_30_problems = data.condition.value_counts()[:30]
plt.figure(figsize = (15,7))
top_30_problems.plot(kind = 'bar');
plt.title('Top 30 Problems',fontsize = 20);
Then let’s visualize the top 30 drugs by count:
top_30_drugs = data.drugName.value_counts()[:30]
plt.figure(figsize = (15,7))
top_30_drugs.plot(kind = 'bar');
plt.title('Top 30 Drugs by Count',fontsize = 20);
Data Pre-processing
The following pre-processing steps are typically applied to a text feature for text classification in order to handle missing data, condense the vocabulary, and reduce the size of the training data. Text cleaning, or text pre-processing, is a mandatory step when we are working with text.
Let’s see the implementation by dealing with missing values and removing punctuation.
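import re
import string
# Fill missing reviews first so the cleaned column contains no NaN values
data = data.fillna({'review': ''})
# Remove punctuation; re.escape protects characters such as ']' and '\' inside the
# character class, and regex=True is required because recent pandas versions default to literal matching
data['review_clean'] = data['review'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)
data.head()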
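Scraped review text often contains HTML entities (for example `&#039;` in place of an apostrophe), and stripping punctuation alone would leave the digits behind. The sketch below shows one way to bundle the cleaning steps into a single function. It assumes the reviews may contain such entities, and `clean_review` is a hypothetical helper name, not part of the original code.

```python
import html
import re
import string

# Precompiled pattern matching any single ASCII punctuation character.
PUNCT_RE = re.compile("[{}]".format(re.escape(string.punctuation)))

def clean_review(text):
    """Decode HTML entities, drop punctuation, lowercase, and collapse whitespace."""
    text = html.unescape(text)        # "&#039;" becomes "'"
    text = PUNCT_RE.sub("", text)     # remove punctuation characters
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_review('"It didn&#039;t work,   at all!"'))
```

If this variant is preferred, it could be applied with `data['review'].apply(clean_review)` after the missing values have been filled.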
Let’s make a sentiment label column: ratings above 5 are labeled positive (+1) and all other ratings negative (-1).
data['sentiment'] = data['rating'].apply(lambda rating : +1 if rating > 5 else -1)
data.head()
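Since data imbalance was listed among the challenges, it is worth checking how the two labels are distributed before training. The sketch below uses a handful of hypothetical ratings, not the real dataset, and computes the "balanced" weighting heuristic n_samples / (n_classes * class_count). This is the same formula scikit-learn applies when `class_weight='balanced'` is passed to a classifier.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical ratings, converted with the same rule as above (rating > 5 is positive).
ratings = [10, 9, 8, 7, 9, 10, 1, 3]
labels = [1 if r > 5 else -1 for r in ratings]

print(Counter(labels))                  # how many samples per class
print(balanced_class_weights(labels))   # rarer class gets the larger weight
```

On the real data, `data['sentiment'].value_counts()` gives the same kind of overview.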
Create a train and test split
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(data,test_size = 0.20)
print('Size of train_data is :', train_data.shape)
print('Size of test_data is :', test_data.shape)
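The split above is random, so it changes on every run and the class ratio may differ slightly between the two parts. A hedged variant, shown here on hypothetical toy data, fixes the seed with `random_state` and preserves the class ratio with `stratify`. Both are standard `train_test_split` parameters, and the value 42 is an arbitrary choice.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 positive and 4 negative labels.
texts = ["review {}".format(i) for i in range(20)]
labels = [1] * 16 + [-1] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.20,
    stratify=labels,    # keep the class ratio similar in both parts
    random_state=42,    # fixed seed so the split is reproducible
)

print(Counter(y_train), Counter(y_test))
```

For the drug reviews, the equivalent call would pass `stratify=data['sentiment']`.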
Feature Extraction
As machine learning and deep learning models only comprehend numbers, each preprocessed document must be encoded as a vector of numbers (a process known as vectorization).
Both CountVectorizer and HashingVectorizer turn text into a matrix of token occurrences. We will use HashingVectorizer on our data, in which a hash function maps each token directly to a column position in the matrix, so no vocabulary has to be stored (Figure 5).
Let’s see the implementation. We will now compute a hashed token-frequency vector for each review with HashingVectorizer.
import gc
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer()
train_matrix = vectorizer.transform(train_data['review_clean'].values.astype('U'))
test_matrix = vectorizer.transform(test_data['review_clean'].values.astype('U'))
gc.collect()
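To see what "each token maps to a column position" means, here is a simplified pure-Python sketch of the hashing trick. It is not how HashingVectorizer is implemented: scikit-learn uses a signed MurmurHash3, 2**20 columns by default, and L2-normalizes each row, whereas this sketch uses MD5 as a stand-in hash, 16 columns, and raw counts. The `hashed_counts` function is a hypothetical helper for illustration.

```python
import hashlib
import re
from collections import Counter

def hashed_counts(text, n_features=16):
    """Map each token to a column index with a hash and count occurrences per column."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_features   # column index in [0, n_features)
        counts[column] += 1
    return dict(counts)

row = hashed_counts("good drug good results")
print(row)   # sparse row: {column index: count}
```

Because the column is computed from the token itself, no fitting step is needed, which is why the code above calls `transform` directly. The trade-off is that different tokens can collide in the same column and the columns cannot be mapped back to words.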
Modelling
Load the classifier and fit the training dataset
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
rf = clf.fit(train_matrix,train_data['sentiment'])
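The classifier above uses the default settings. As a hedged alternative, the vectorizer and the forest can be chained in a scikit-learn pipeline so that raw text goes in and predictions come out. The toy reviews and the parameter values below (`n_features=2**10`, `n_estimators=50`, `random_state=0`) are illustrative assumptions, not tuned settings from this post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; labels follow the +1 / -1 convention used above.
texts = [
    "great drug it really helped",
    "works well no side effects",
    "terrible made me feel worse",
    "useless and bad side effects",
]
labels = [1, 1, -1, -1]

model = make_pipeline(
    HashingVectorizer(n_features=2**10),   # smaller hash space for a toy example
    RandomForestClassifier(
        n_estimators=50,    # number of trees in the forest
        random_state=0,     # reproducible trees
    ),
)
model.fit(texts, labels)

print(model.predict(["great drug, helped a lot", "terrible and useless"]))
```

Keeping both steps in one object helps ensure that exactly the same vectorization is applied at training and prediction time.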
Measure Model Performance
Predict on the test set first
y_pred = rf.predict(test_matrix)
from sklearn.metrics import f1_score
# scikit-learn metrics expect the true labels first, then the predictions
f1_score(test_data.sentiment, y_pred)
Then use metrics to check the model performance on predicted values and actual values
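from sklearn.metrics import confusion_matrix, classification_report
# Rows are true labels, columns are predicted labels
cm = confusion_matrix(test_data.sentiment, y_pred)
print(cm)
# Per-class precision, recall and F1
print(classification_report(test_data.sentiment, y_pred))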
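It helps to see how the F1 score relates to the confusion matrix. With the labels sorted as (-1, +1), scikit-learn lays out the binary matrix as [[tn, fp], [fn, tp]]. The counts below are hypothetical and are not the results of our model. The `binary_metrics` helper is an illustrative name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts, laid out like scikit-learn's binary confusion matrix:
# [[tn, fp],
#  [fn, tp]]
cm_example = [[50, 10],
              [5, 35]]
(tn, fp), (fn, tp) = cm_example
precision, recall, f1 = binary_metrics(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here precision is 35/45, recall is 35/40, and F1 is their harmonic mean, 70/85.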
Then plot the confusion matrix given the true and predicted labels (Figure 4)
import matplotlib.pyplot as plt
import seaborn as sns
# Annotate each cell with its integer count
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.show()
Here’s the confusion matrix for our Random Forest model (Figure 6):
Resources:
Full source code on GitHub