Guide to Sentiment Analysis using Natural Language Processing

Nikhil 30 Jul, 2024

15 min read

Introduction

Sentiment analysis has become crucial in today’s digital age, enabling businesses to glean insights from vast amounts of textual data, including customer reviews, social media comments, and news articles. By utilizing natural language processing (NLP) techniques, sentiment analysis using NLP categorizes opinions as positive, negative, or neutral, providing valuable feedback on products, services, or brands. This analysis is powered by various algorithms such as Naive Bayes, Support Vector Machines (SVM), and Recurrent Neural Networks (RNN), which help in understanding the overall sentiment and emotional tone conveyed in the text, making it an indispensable tool for business intelligence and decision-making.

Learning Outcomes

Gain comprehensive knowledge about various sentiment analysis tools and their applications in analyzing textual data from different sources.
Develop the ability to classify sentiments into categories such as positive, negative, and neutral using advanced sentiment classification techniques.
Learn how to calculate sentiment scores and interpret their significance in understanding the overall sentiment conveyed in a given text.
Acquire the skills to analyze sentiment in tweets, recognizing the unique challenges and opportunities presented by social media data.
Explore how artificial intelligence (AI) enhances the accuracy and efficiency of sentiment analysis, and understand its role in automating the sentiment analysis using NLP process.
Develop the ability to assess customer sentiment from various textual data sources, providing insights into customer opinions and improving customer experience strategies.
Equip data scientists with the necessary tools and techniques to effectively conduct NLP for sentiment analysis and derive actionable insights from textual data.

This article was published as a part of the Data Science Blogathon.

What is Sentiment Analysis?

sentiment analysis using NLP is a method that identifies the emotional state or sentiment behind a situation, often using NLP to analyze text data. Language serves as a mediator for human communication, and each statement carries a sentiment, which can be positive, negative, or neutral.

Suppose there is a fast-food chain company selling a variety of food items like burgers, pizza, sandwiches, and milkshakes. They have created a website where customers can order food and provide reviews.

For instance,

User Review 1: “I love this cheese sandwich, it’s so delicious,” is a positive review, indicating customer satisfaction.
User Review 2: “This chicken burger has a very bad taste,” is negative, highlighting an issue with the burger.
User Review 3: “I ordered this pizza today,” is neutral, not indicating the customer’s satisfaction.

By analyzing these reviews, the company can conclude that they need to focus on promoting their sandwiches and improving their burger quality to increase overall sales.

Guide to Understand and Implement Natural Language Processing

But, now a problem arises, that there will be hundreds and thousands of user reviews for their products and after a point of time it will become nearly impossible to scan through each user review and come to a conclusion.

A Sentiment Analysis Model is crucial for identifying patterns in user reviews, as initial customer preferences may lead to a skewed perception of positive feedback. By processing a large corpus of user reviews, the model provides substantial evidence, allowing for more accurate conclusions than assumptions from a small sample of data.

We will explore the workings of a basic Sentiment Analysis model using NLP later in this article. Furthermore, principal sentiments like “positive” and “negative” can be broken down into more nuanced sub-sentiments such as “Happy,” “Love,” “Surprise,” “Sad,” “Fear,” and “Angry,” depending on specific business requirements.

Real-World Example

Historical Perspective: Initially, social media services like Facebook had only two reactions: “like” or no reaction (dislike).
Granular Sentiments: Over time, reactions evolved into more granular sentiments such as “like,” “love,” “sad,” and “angry.”
Enhanced Customer Experience: Companies promoting products on Facebook now receive more specific feedback, improving customer experience.
Targeted Customer Handling: Granular feedback allows companies to address customers with different sentiments (e.g., “sad” vs. “angry”) more effectively.
Need for Advanced Sentiment Analysis: Modern business requires more than the bare minimum; targeted strategies based on specific sentiments are essential.
Challenges of Natural Language: Human communication in natural language is complex and messy, making it challenging for machines to interpret.
Role of Natural Language Processing (NLP): NLP is needed to help computers understand human language, which includes various styles and sentiments.
Sentiment Analysis as a Sub-field of NLP: Sentiment Analysis uses machine learning techniques to identify and extract insights from textual data.

Also Read: Sentiment Analysis Using Python

Types of Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that includes deciding and concentrating on the emotional data in an info text. This can be an assessment, an evaluation, or an inclination about a specific point or item. Here are the fundamental sorts of feeling examination:

Fine-grained Sentiment Analysis: This goes beyond just positive, negative, or neutral. It involves very specific ratings, like a 5-star rating, for example.
Emotion detection: This aims to detect emotions like happiness, frustration, anger, sadness, etc. The biggest challenge here is being able to accurately identify these emotions in text.
Aspect-based Sentiment Analysis: This is generally used to understand specific aspects of a certain product or service. For example, in a review like “The battery life of this phone is great, but the screen is not very clear”, the sentiment towards the battery life is positive, but it’s negative towards the screen.
Multilingual sentiment analysis: This can be particularly challenging because the same word can convey different sentiments in different languages.
Intent Analysis: This goes a step further to understand the user’s intention behind a certain statement. For example, a statement like “I would need a car” might indicate a purchasing intent.

Sentiment analysis using NLP is a mind boggling task because of the innate vagueness of human language. Mockery, for example, is especially difficult to identify. Subsequently, the precision of opinion investigation generally relies upon the intricacy of the errand and the framework’s capacity to gain from a lot of information.

Also Read: Theory Behind the Basics of NLP

Why Is Sentiment Analysis Important?

NLP for sentiment analysis is important for several reasons:

Business Intelligence: It helps businesses understand how their customers feel about their products or services. This can guide improvements, address customer concerns, and enhance overall customer satisfaction.
Market Research: By analyzing public sentiment towards products, services, or brand mentions on social media, companies can gain insights into market trends and competitors.
Customer Service: Sentiment analysis can help identify negative reviews or feedback in real-time, allowing for quicker responses and problem resolution.
Product Analytics: It can be used to understand user feedback on various aspects of a product, helping drive product strategy and development.
Public Relations: Sentiment analysis can help monitor public sentiment towards a company or individual, enabling proactive management of public relations.
Politics and Public Policy: In politics, sentiment analysis is used to gauge public opinion towards policies or political entities, which can inform strategy and messaging.

Keep in mind, the objective of sentiment analysis using NLP isn’t simply to grasp opinion however to utilize that comprehension to accomplish explicit targets. It’s a useful asset, yet like any device, its worth comes from how it’s utilized.

Sentiment Analysis Challenges

Sentiment analysis, while powerful, comes with its own set of challenges:

Sarcasm and Irony: These linguistic features can completely reverse the sentiment of a statement. Detecting sarcasm and irony is a complex task even for humans, and it’s even more challenging for AI systems.
Contextual Understanding: The sentiment of certain words can change based on the context in which they’re used. For example, the word “sick” can have a negative connotation in a health-related context (“I’m feeling sick”) but can be positive in a different context (“That’s a sick beat!”).
Negations and Double Negatives: Phrases like “not bad” or “not unimpressive” can be difficult to interpret correctly because they require understanding of double negatives and other linguistic nuances.
Emojis and Slang: Text data, especially from social media, often contains emojis and slang. The sentiment of these can be hard to determine as their meanings can be subjective and vary across different cultures and communities.
Multilingual Sentiment Analysis: Sentiment analysis becomes significantly more difficult when applied to multiple languages. Direct translation might not carry the same sentiment, and cultural differences can further complicate the analysis.
Aspect-Based Sentiment Analysis: Determining sentiment towards specific aspects within a text can be challenging. For instance, a restaurant review might have a positive sentiment towards the food, but a negative sentiment towards the service.

These challenges highlight the complexity of human language and communication. Overcoming them requires advanced NLP techniques, deep learning models, and a large amount of diverse and well-labelled training data. Despite these challenges, sentiment analysis continues to be a rapidly evolving field with vast potential.

Applications of Sentiment Analysis

Sentiment Analysis has a wide range of applications across various domains. Here are some key applications:

Customer Feedback: Businesses use sentiment analysis to process customer feedback and reviews. This helps them understand customer satisfaction and preferences, and make data-driven decisions.
Social Media Monitoring: Brands monitor social media platforms to understand public sentiment about their products or services. This can help in reputation management and in identifying potential crises before they escalate.
Market Research: Sentiment analysis can be used to understand public opinion about a product or a political event. This can provide valuable insights for market research.
Product Analytics: Companies use sentiment analysis to gather insights from product reviews. This can guide product enhancements and innovations.
Healthcare: In healthcare, sentiment analysis can be used to understand patient experiences and feedback about treatments, doctors, or hospitals.
Finance: In the financial sector, sentiment analysis is used to gauge market sentiment. Traders and investors use this information to make informed decisions.
Politics: In politics, NLP for sentiment analysis is used to understand public opinion about certain policies or politicians. This can guide political campaigns and strategies.
Human Resources: HR departments use sentiment analysis using NLP to understand employee feedback and improve workplace culture.

Remember, these are just a few examples. The potential applications of sentiment analysis are vast and continue to grow with advancements in AI and machine learning technologies.

Step by Step procedure to Implement Sentiment Analysis

_{First, let’s import all the python libraries that we will use throughout the program.}

Step1: Basic Python Libraries

Pandas – library for data analysis and data manipulation
Matplotlib – library used for data visualization
Seaborn – a library based on matplotlib and it provides a high-level interface for data visualization
WordCloud – library to visualize text data
re – provides functions to pre-process the strings as per the given regular expression

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re

Step2: Natural Language Processing

nltk – Natural Language Toolkit is a collection of libraries for natural language processing
stopwords – a collection of words that don’t provide any meaning to a sentence
WordNetLemmatizer – used to convert different forms of words into a single item but still keeping the context intact.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Step3: Scikit-Learn (Machine Learning Library for Python)

CountVectorizer – transform text to vectors
GridSearchCV – for hyperparameter tuning
RandomForestClassifier – machine learning algorithm for classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Step4: Evaluation Metrics

Accuracy Score: no. of correctly classified instances/total no. of instances
Precision Score: the ratio of correctly predicted instances over total positive instances
Recall Score: the ratio of correctly predicted instances over total instances in that class
Roc Curve: a plot of true positive rate against false positive rate
Classification Report: report of precision, recall and f1 score
Confusion Matrix: a table used to describe the classification models

from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
from scikitplot.metrics import plot_confusion_matrix

Step5: Evaluate Dataset

We will use the dataset which is available on Kaggle for sentiment analysis using NLP, which consists of a sentence and its respective sentiment as a target variable. This dataset contains 3 separate files named train.txt, test.txt and val.txt.

You can find the dataset here.

Now, we will read the training data and validation data. As the data is in text format, separated by semicolons and without column names, we will create the data frame with read_csv() and parameters as “delimiter” and “names”.

df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])

Now, we will concatenate these two data frames, as we will be using cross-validation and we have a separate test dataset, so we don’t need a separate validation set of data. And, then we will reset the index to avoid duplicate indexes.

df = pd.concat([df_train,df_val])
df.reset_index(inplace=True,drop=True)

We can view a sample of the contents of the dataset using the “sample” method of pandas, and check the no. of records and features using the “shape” method.

import pandas as pd
df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])
df = pd.concat([df_train,df_val])
df.reset_index(inplace=True,drop=True)
print("Shape of the DataFrame:",df.shape)
print(df.sample(5))

Now, we will check for the various target labels in our dataset using seaborn.

As we can see that, we have 6 labels or targets in the dataset. We can make a multi-class classifier for Sentiment Analysis using NLP. But, for the sake of simplicity, we will merge these labels into two classes, i.e. Positive and Negative sentiment.

Positive Sentiment – “joy”,”love”,”surprise”
Negative Sentiment – “anger”,”sadness”,”fear”

Now, we will create a custom encoder to convert categorical target labels to numerical form, i.e. (0 and 1)

def custom_encoder(df):
    df.replace(to_replace ="surprise", value =1, inplace=True)
    df.replace(to_replace ="love", value =1, inplace=True)
    df.replace(to_replace ="joy", value =1, inplace=True)
    df.replace(to_replace ="fear", value =0, inplace=True)
    df.replace(to_replace ="anger", value =0, inplace=True)
    df.replace(to_replace ="sadness", value =0, inplace=True)

custom_encoder(df['label'])

Now, we can see that our target has changed to 0 and 1,i.e. 0 for Negative and 1 for Positive, and the data is more or less in a balanced state.

Step6: Data Pre-processing

Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.

We will create a function for pre-processing of data.

First, we will iterate through each record, and using a regular expression, we will get rid of any characters apart from alphabets.
Then, we will convert the string to lowercase as, the word “Good” is different from the word “good”.
Because, without converting to lowercase, it will cause an issue when we will create vectors of these words, as two different vectors will be created for the same word which we don’t want to.
Then we will check for stopwords in the data and get rid of them. Stopwords are commonly used words in a sentence such as “the”, “an”, “to” etc. which do not add much value.
Then, we will perform lemmatization on each word,i.e. change the different forms of a word into a single item called a lemma.

A lemma is a base form of a word. For example, “run”, “running” and “runs” are all forms of the same lexeme, where the “run” is the lemma. Hence, we are converting all occurrences of the same lexeme to their respective lemma. And, then return a corpus of processed data.

But first, we will create an object of WordNetLemmatizer and then we will perform the transformation.

#object of WordNetLemmatizer
lm = WordNetLemmatizer()

def text_transformation(df_col):
    corpus = []
    for item in df_col:
        new_item = re.sub('[^a-zA-Z]',' ',str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        new_item = [lm.lemmatize(word) for word in new_item if word not in set(stopwords.words('english'))]
        corpus.append(' '.join(str(x) for x in new_item))
    return corpus

corpus = text_transformation(df['text'])

Now, we will create a Word Cloud. It is a data visualization technique used to depict text in such a way that, the more frequent words appear enlarged as compared to less frequent words. This gives us a little insight into, how the data looks after being processed through all the steps until now.

rcParams['figure.figsize'] = 20,8
word_cloud = ""
for row in corpus:
    for word in row:
        word_cloud+=" ".join(word)
wordcloud = WordCloud(width = 1000, height = 500,background_color ='white',min_font_size = 10).generate(word_cloud)
plt.imshow(wordcloud)

Output:

Step7: Bag of Words

Now, we will use the Bag of Words Model(BOW), which is used to represent the text in the form of a bag of words ,i.e. the grammar and the order of words in a sentence are not given any importance, instead, multiplicity, i.e. (the number of times a word occurs in a document) is the main point of concern.

Basically, it describes the total occurrence of words within a document.

Scikit-Learn provides a neat way of performing the bag of words technique using CountVectorizer.

Now, we will convert the text data into vectors, by fitting and transforming the corpus that we have created.

cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df.label

We will take ngram_range as (1,2) which signifies a bigram.

Ngram is a sequence of ‘n’ of words in a row or sentence. ‘ngram_range’ is a parameter, which we use to give importance to the combination of words, such as, “social media” has a different meaning than “social” and “media” separately.

We can experiment with the value of the ngram_range parameter and select the option which gives better results.

Now comes the machine learning model creation part and in this project, I’m going to use Random Forest Classifier, and we will tune the hyperparameters using GridSearchCV.

GridSearchCV() will take the following parameters.

Estimator or model: RandomForestClassifier in our case
parameter: dictionary of hyperparameter names and their values
cv: signifies cross-validation folds
return_train_score: returns the training scores of the various models
n_jobs: no. of jobs to run parallelly (“-1” signifies that all CPU cores will be used which reduces the training time drastically)

First, We will create a dictionary, “parameters” which will contain the values of different hyperparameters.

We will pass this as a parameter to GridSearchCV to train our random forest classifier model using all possible combinations of these parameters to find the best model.

parameters = {'max_features': ('auto','sqrt'),
             'n_estimators': [500, 1000, 1500],
             'max_depth': [5, 10, None],
             'min_samples_split': [5, 10, 15],
             'min_samples_leaf': [1, 2, 5, 10],
             'bootstrap': [True, False]}

Now, we will fit the data into the grid search and view the best parameter using the “best_params_” attribute of GridSearchCV.

grid_search = GridSearchCV(RandomForestClassifier(),parameters,cv=5,return_train_score=True,n_jobs=-1)
grid_search.fit(X,y)
grid_search.best_params_

Output:

And then, we can view all the models and their respective parameters, mean test score and rank as GridSearchCV stores all the results in the cv_results_ attribute.

for i in range(432):
    print('Parameters: ',grid_search.cv_results_['params'][i])
    print('Mean Test Score: ',grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ',grid_search.cv_results_['rank_test_score'][i])

Output: (a sample of the output)

Now, we will choose the best parameters obtained from GridSearchCV and create a final random forest classifier model and then train our new model.

rfc = RandomForestClassifier(max_features=grid_search.best_params_['max_features'],                                  max_depth=grid_search.best_params_['max_depth'],
                                  n_estimators=grid_search.best_params_['n_estimators'],                                      min_samples_split=grid_search.best_params_['min_samples_split'],                                    min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                                    bootstrap=grid_search.best_params_['bootstrap'])
rfc.fit(X,y)

Step8: Test Data Transformation

Now, we will read the test data and perform the same transformations we did on training data and finally evaluate the model on its predictions.

test_df = pd.read_csv('test.txt',delimiter=';',names=['text','label'])

X_test,y_test = test_df.text,test_df.label
#encode the labels into two classes , 0 and 1
test_df = custom_encoder(y_test)
#pre-processing of text
test_corpus = text_transformation(X_test)
#convert text data into vectors
testdata = cv.transform(test_corpus)
#predict the target
predictions = rfc.predict(testdata)

Step9: Model Evaluation

We will evaluate our model using various metrics such as Accuracy Score, Precision Score, Recall Score, Confusion Matrix and create a roc curve to visualize how our model performed.

rcParams['figure.figsize'] = 10,5
plot_confusion_matrix(y_test,predictions)
acc_score = accuracy_score(y_test,predictions)
pre_score = precision_score(y_test,predictions)
rec_score = recall_score(y_test,predictions)
print('Accuracy_score: ',acc_score)
print('Precision_score: ',pre_score)
print('Recall_score: ',rec_score)
print("-"*50)
cr = classification_report(y_test,predictions)
print(cr)

Output:

Confusion Matrix:

Step10: Roc Curve

We will find the probability of the class using the predict_proba() method of Random Forest Classifier and then we will plot the roc curve.

predictions_probability = rfc.predict_proba(testdata)
fpr,tpr,thresholds = roc_curve(y_test,predictions_probability[:,1])
plt.plot(fpr,tpr)
plt.plot([0,1])
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

As we can see that our model performed very well in classifying the sentiments, with an Accuracy score, Precision and Recall of approx 96%. And the roc curve and confusion matrix are great as well which means that our model is able to classify the labels accurately, with fewer chances of error.

Now, we will check for custom input as well and let our model identify the sentiment of the input statement.

Predict for Custom Input:

def expression_check(prediction_input):
    if prediction_input == 0:
        print("Input statement has Negative Sentiment.")
    elif prediction_input == 1:
        print("Input statement has Positive Sentiment.")
    else:
        print("Invalid Statement.")

# function to take the input statement and perform the same transformations we did earlier
def sentiment_predictor(input):
    input = text_transformation(input)
    transformed_input = cv.transform(input)
    prediction = rfc.predict(transformed_input)
    expression_check(prediction)

input1 = ["Sometimes I just want to punch someone in the face."]
input2 = ["I bought a new phone and it's so good."]

sentiment_predictor(input1)
sentiment_predictor(input2)

Output:

Hurray, As we can see that our model accurately classified the sentiments behind the two sentences.

Conclusion

Sentiment analysis using NLP stands as a powerful tool in deciphering the complex landscape of human emotions embedded within textual data. By leveraging various techniques and methodologies such as text analysis and lexicon-based approaches, analysts can extract valuable insights, ranging from consumer preferences to political sentiment, thereby informing decision-making processes across diverse domains. The polarity of sentiments identified helps in evaluating brand reputation and other significant use cases. As we conclude this journey through sentiment analysis, it becomes evident that its significance transcends industries, offering a lens through which we can better comprehend and navigate the digital realm.

Key Takeaways

Sentiment analysis provides valuable insights in various applications, such as customer support and survey responses.
Accurate identification and processing of negative words and neutral sentiment are crucial for precise sentiment classification.
Open source tools facilitate the development and implementation of efficient sentiment analysis models, ensuring accessibility and adaptability.
Regression models effectively predict sentiment scores, while understanding semantic context is critical for accurately interpreting sentiment from unstructured data.

Frequently Asked Questions

Q1. What is Sentiment Analysis in Simple Words?

A. Sentiment analysis is a technique used to determine whether a piece of text (like a review or a tweet) expresses a positive, negative, or neutral sentiment. It helps in understanding people’s opinions and feelings from written language.

Q2. What Are the Three Types of Sentiment Analysis?

A. Fine-grained Sentiment Analysis: This involves classifying sentiments into categories like very positive, positive, neutral, negative, and very negative.
Aspect-based Sentiment Analysis: This focuses on identifying sentiments about specific aspects or features of a product or service, like the taste of food or the speed of service in a restaurant.
Emotion Detection: This type categorizes text into different emotions such as happiness, anger, sadness, etc.

Q3. What Is the Objective of Sentiment Analysis?

A. The objective of sentiment analysis is to automatically identify and extract subjective information from text. It helps businesses and organizations understand public opinion, monitor brand reputation, improve customer service, and gain insights into market trends.

Q4. What Is Sentiment Analysis in Python?

A. Sentiment analysis in Python involves using libraries and tools to analyze text data and determine its sentiment. Commonly used libraries include:
1. NLTK (Natural Language Toolkit): For text processing and classification.
2. TextBlob: For simple sentiment analysis and text processing.
3. VADER (Valence Aware Dictionary and sEntiment Reasoner): For analyzing social media texts.
4. Transformers (Hugging Face): For using pre-trained models to perform sentiment analysis.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Nikhil 30 Jul, 2024

Data Scientist with 6 years of experience in analysing large datasets and delivering valuable insights via advanced data-driven methods. Proficient in Time Series Forecasting, Natural Language Processing and with a demonstrated history of working in the Telecom, Healthcare and Retail Supply Chain industries.

Advanced Classification NLP Project Python