Over time, the data have increased exponentially. Digitalization in all areas generates a large number of data per second. Today, this large amount of data is an unstructured type, i.e. there are no predefined formats. This includes the images, videos, audio, text, etc. Among these, the text is the most generated type of unstructured data. Whether it’s a machine’s log file or a customer’s product review, everything is based mainly on text. Companies and enterprises are using these data to make decisions. These decisions may include the modification of existing policies, the creation of new products, the formulation of new offers to the community, etc. Since the generated text is human-readable, it cannot be understood by machines. Therefore, to make machine-friendly, the concept of natural language processing or NLP has been introduced.
In this article, we will learn how to analyze the text in order to find its sentiment score. Since we want to calculate sentiment scores, we would look at text data generated by humans, such as product reviews, restaurants, etc., or tweets from famous personalities. With the help of these scores, we can understand the emotion behind the text and classify them for further decision-making.
1. Need of Sentiment Scores
2. Method 1: Using Positive and Negative Word Count – With Normalization
3. Method 2: Using Positive and Negative Word Counts – With Semi Normalization
4. Method 3: Using VADER SentimentIntensityAnalyser
5. Conclusions
Today, text data is widely used to understand the behavior of customers and users towards services. For example, companies use product reviews to understand customers’ opinions on products. Based on similar analyses, customers can be classified into groups and make specific decisions for customers who are unhappy. The use of texts is endless and data scientists around the world have helped analyze the trends of texts. You can also use star ratings to understand the sentiment of users’ feedback on products or services. However, in order to understand the actual reaction, analyzing text-based feedback is a better solution.
In order to understand the emotion behind a text, we can analyze the text using NLP techniques to find patterns. However, if the same process has to be done in a large number of texts, it will take a lot of time, and new texts will be added until then. For these cases, the calculation of sentiment scores is useful. Thus, a positive score shows a positive text and a negative score shows a negative text. The automation of this sentiment score calculator will help to analyze the text and make quick decisions. For calculating the sentiment scores, we would be using our methods as well using some pre-defined rule-based methods for calculating these scores in a jiffy.
In this method, we will calculate the Sentiment Scores by classifying and counting the Negative and Positive words from the given text and taking the ratio of the difference of Positive and Negative Word Counts and Total Word Count. We will be using the Amazon Cell Phone Reviews dataset from Kaggle.
To calculate the sentiment, here we will use the formula:
Let’s first import the Libraries
import pandas as pd from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.stem import WordNetLemmatizer import re
Now, we will import the dataset. Out of all the columns present in this dataset, we will use the ‘body’ column which contains the cell phone reviews.
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
Next, we will create an instance of WordNetLemmatizer, which we will be using in the next step. Also, we will call the Stop Words from the NLTK library.
lemma = WordNetLemmatizer() stop_words = stopwords.words('english')
Next, we will define a function to preprocess the text. In this function, first, we converted the input string to lowercase. Then, we extracted only the letters from this string. This task is performed using Regular Expressions. Next, we tokenized the string. After tokenizing, we filtered the text of stop words. At last, we lemmatized the remaining words and returned these words.
def text_prep(x): corp = str(x).lower() corp = re.sub('[^a-zA-Z]+',' ', corp).strip() tokens = word_tokenize(corp) words = [t for t in tokens if t not in stop_words] lemmatize = [lemma.lemmatize(w) for w in words] return lemmatize
Now, we will apply this function to every text in the ‘body‘ column.
preprocess_tag = [text_prep(i) for i in df['body']] df["preprocess_txt"] = preprocess_tag
Next, we will calculate the number of resultant words for each text. This will be helpful later when we calculate the Sentiment Score.
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))
In the next step, we defined the generic Negative Words and Positive Words list. For this, we have taken the positive-text.txt and negative-text.txt files or generally known as Opinion Lexicon by the respective authors(s). The same can be downloaded from here.
file = open('negative-words.txt', 'r') neg_words = file.read().split() file = open('positive-words.txt', 'r') pos_words = file.read().split()
Next, we will make a count of positive and negative words for the text. For this, we will count all the words in preprocess_txt that are present in positive words list pos_words, and similarly will count all the words in preprocess_txt that are present in the negative words list neg_words. We will add these values into the main Pandas DataFrame.
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words])) df['pos_count'] = num_pos num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words])) df['neg_count'] = num_neg
Finally, we will calculate the sentiment. We will create a ‘sentiment‘ column and perform the calculation for each text.
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
Putting it all together:
import pandas as pd from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.stem import WordNetLemmatizer import re df = pd.read_csv('20191226-reviews.csv', usecols=['body']) lemma = WordNetLemmatizer() stop_words = stopwords.words('english') def text_prep(x: str) -> list: corp = str(x).lower() corp = re.sub('[^a-zA-Z]+',' ', corp).strip() tokens = word_tokenize(corp) words = [t for t in tokens if t not in stop_words] lemmatize = [lemma.lemmatize(w) for w in words] return lemmatize preprocess_tag = [text_prep(i) for i in df['body']] df["preprocess_txt"] = preprocess_tag df['total_len'] = df['preprocess_txt'].map(lambda x: len(x)) file = open('negative-words.txt', 'r') neg_words = file.read().split() file = open('positive-words.txt', 'r') pos_words = file.read().split() num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words])) df['pos_count'] = num_pos num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words])) df['neg_count'] = num_neg df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2) df.head()
On executing this code, we get:
Image Source – Personal Computer
In this method, we calculate the sentiment score by evaluating the ratio of Count of Positive Words and Count of Negative Words + 1. Since there is no difference of values involved, the sentiment value will always be more than 0. Also, adding 1 in the denominator would save from Zero Division Error. Let’s start with the implementation. The implementation code will remain the same till the Negative and Positive Word Count from the previous part with a difference that this time we don’t need the total word count, thus will be omitting that part.
import pandas as pd from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.stem import WordNetLemmatizer import re df = pd.read_csv('20191226-reviews.csv', usecols=['body']) lemma = WordNetLemmatizer() stop_words = stopwords.words('english') def text_prep(x: str) -> list: corp = str(x).lower() corp = re.sub('[^a-zA-Z]+',' ', corp).strip() tokens = word_tokenize(corp) words = [t for t in tokens if t not in stop_words] lemmatize = [lemma.lemmatize(w) for w in words] return lemmatize preprocess_tag = [text_prep(i) for i in df['body']] df["preprocess_txt"] = preprocess_tag file = open('negative-words.txt', 'r') neg_words = file.read().split() file = open('positive-words.txt', 'r') pos_words = file.read().split() num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words])) df['pos_count'] = num_pos num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words])) df['neg_count'] = num_neg
Next, we will apply the formula and add the formulated value into the main DataFrame.
df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
Putting it all together.
import pandas as pd from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.stem import WordNetLemmatizer import re df = pd.read_csv('20191226-reviews.csv', usecols=['body']) lemma = WordNetLemmatizer() stop_words = stopwords.words('english') def text_prep(x: str) -> list: corp = str(x).lower() corp = re.sub('[^a-zA-Z]+',' ', corp).strip() tokens = word_tokenize(corp) words = [t for t in tokens if t not in stop_words] lemmatize = [lemma.lemmatize(w) for w in words] return lemmatize preprocess_tag = [text_prep(i) for i in df['body']] df["preprocess_txt"] = preprocess_tag file = open('negative-words.txt', 'r') neg_words = file.read().split() file = open('positive-words.txt', 'r') pos_words = file.read().split() num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words])) df['pos_count'] = num_pos num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words])) df['neg_count'] = num_neg df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2) df.head()
On executing this code, we get:
Image Source – Personal Computer
In this method, we will use the Sentiment Intensity Analyser which uses the VADER Lexicon. VADER is a long-form for Valence Aware and sEntiment Reasoner, a rule-based sentiment analysis tool. VADER calculates text emotions and determines whether the text is positive, neutral or, negative. This analyzer calculates text sentiment and produces four different classes of output scores: positive, negative, neutral, and compound.
Here, we will make use of the Compound Score. A compound score is the aggregate of the score of a word, or precisely, the sum of all words in the lexicon, normalized between -1 and 1. It is not recommended to use general text preprocessing techniques prior to calculation because this may affect VADER results. Using these VADER results, you can also classify each text into categories according to your needs.
Now, we will use VADER to implement the calculation of the sentiment score. Since this process is different from the previous two methods, we will be implementing it from scratch.
First, we will import the libraries. We will use the NLTK library for importing the SentimentIntensityAnalyzer.
import pandas as pd from nltk.sentiment.vader import SentimentIntensityAnalyzer
Next, we will create the instance of SentimentIntensityAnalyzer.
sent = SentimentIntensityAnalyzer()
Next, we will read the Cell Phone Review Data, which we have been using previously.
df = pd.read_csv('', usecols = ['body'])
Finally, we will calculate the Compound Poalrtity Score and add it into the Main DataFrame for each of the texts.
polarity = [round(sent.polarity_scores(i)['compound'], 2) for i in df['body']] df['sentiment_score'] = polarity
Putting it all together:
import pandas as pd from nltk.sentiment.vader import SentimentIntensityAnalyzer sent = SentimentIntensityAnalyzer() df = pd.read_csv('', usecols = ['body']) polarity = [round(sent.polarity_scores(i)['compound'], 2) for i in df['body']] df['sentiment_score'] = polarity df.head()
On executing this code, we get:
In this article, we learned about three different ways of calculating the Sentiment Scores for each of the given Cell Phone Reviews. The first two
methods are self-calculated while the third method was rule-based and the score is evaluated by itself. Each of these methods has its own pros and cons. There are numerous ways by which one can calculate the sentiment score. One can even define his own method of evaluating the scores. VADER is more robust and precise in terms of Internet slang, which is widely used today. One can try building a new and stronger method of evaluation that might cover a wider range of words.
Connect with me on LinkedIn.
For any suggestions or article requests, you can email me here.
Check out my other Articles Here and on Medium
You can provide your valuable feedback to me on LinkedIn.
Thanks for giving your time!
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion
Lorem ipsum dolor sit amet, consectetur adipiscing elit,