Natural Language Processing (NLP)

Text Classification for Sentiment Analysis

using a Naive Bayes Classifier

In Natural Language Processing there is a concept known as Sentiment Analysis.

Given a movie review or a tweet, we can automatically classify it into categories. These categories can be user defined (e.g. positive and negative) or whatever classes you need.

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible), I'll focus on two possible sentiment classifications: positive and negative.

Prerequisites

Basic knowledge of Python is assumed

URL : goo.gl/AwKCn4

This codelab is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK).

NLTK includes extensive software, data and documentation: text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, all free at http://www.nltk.org/.

It also contains free texts for analysis: movie reviews, Twitter data, the works of Shakespeare, etc.

Steps to install NLTK and its data:

Install pip: run in a terminal:

sudo easy_install pip

Install NLTK: run in a terminal:

sudo pip install -U nltk

Download NLTK data: run a Python shell (in the terminal) and enter the following code:

In [1]:
import nltk
nltk.download()  # opens the NLTK downloader so you can fetch the corpora and data used below

Let's start with the basics:

AFINN-111 - a list of English words rated for valence with an integer between -5 (most negative) and +5 (most positive).

In [2]:
# Build a word -> sentiment score dictionary from the AFINN-111 file
sentiment_dictionary = {}

for line in open('Desktop/NLP&Sentiment Analysis/AFINN-111.txt'):
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)

Tokenization

is the process of breaking down a stream of text into words, phrases, or symbols known as tokens.

In [4]:
from nltk.tokenize import word_tokenize
words = word_tokenize('I hate this novel!'.lower())
print(words)
['i', 'hate', 'this', 'novel', '!']
In [5]:
words[0]
Out[5]:
'i'
In [6]:
sum(sentiment_dictionary.get(word, 0) for word in words)
Out[6]:
-1

What if the text is really long?

In [7]:
#split into sentences
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize('''I love this book! Though I hate the beginning. It would be great for you.''')
for s in sentences:
    print(s)
I love this book!
Though I hate the beginning.
It would be great for you.
In [8]:
# compute the score for each sentence
for sentence in sentences:
    words = word_tokenize(sentence)
    print(sum(sentiment_dictionary.get(word, 0) for word in words))
3
-3
3

What about new words or domain-specific terms?

Naive Bayes Algorithm

This is a classification algorithm that uses Bayes' theorem of probability to predict the class of an unknown instance. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
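
As a rough sketch of the underlying idea (using made-up probabilities, not numbers from the movie corpus), the classifier multiplies the class prior by the per-word likelihoods and picks the class with the larger product:

# Toy Naive Bayes calculation (illustrative probabilities only)
priors = {'positive': 0.5, 'negative': 0.5}
likelihoods = {
    'positive': {'great': 0.6, 'boring': 0.1},
    'negative': {'great': 0.1, 'boring': 0.5},
}

def naive_bayes_score(words, label):
    # P(label | words) is proportional to P(label) * product of P(word | label)
    score = priors[label]
    for w in words:
        score *= likelihoods[label].get(w, 0.01)  # small default for unseen words
    return score

review = ['great', 'boring']
for label in priors:
    print(label, naive_bayes_score(review, label))
# positive: 0.5 * 0.6 * 0.1 = 0.030, negative: 0.5 * 0.1 * 0.5 = 0.025,
# so this toy review leans 'positive'.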

In [9]:
import nltk.classify.util  # provides the accuracy function
from nltk.classify import NaiveBayesClassifier  # the Naive Bayes classifier
from nltk.corpus import movie_reviews  # movie reviews corpus from nltk
from nltk.corpus import stopwords  # stopword lists from nltk
from nltk.corpus import wordnet  # WordNet (a lexical database for the English language) from nltk
In [10]:
#import movie_reviews
from nltk.corpus import movie_reviews

#see words in the review
movie_reviews.words()
Out[10]:
[u'plot', u':', u'two', u'teen', u'couples', u'go', ...]
In [11]:
movie_reviews.categories()
Out[11]:
[u'neg', u'pos']
In [12]:
#frequency distribution of words in movie review
all_words = movie_reviews.words()
freq_dist = nltk.FreqDist(all_words)
freq_dist.most_common(10)
Out[12]:
[(u',', 77717),
 (u'the', 76529),
 (u'.', 65876),
 (u'a', 38106),
 (u'and', 35576),
 (u'of', 34123),
 (u'to', 31937),
 (u"'", 30585),
 (u'is', 25195),
 (u'in', 21822)]

Stopwords

These are words that carry little or no meaning in a sentence but are very common (high-frequency words), e.g. a, I, is, the, etc.

When doing language processing, we need to get rid of these words since they take up a large part of any sentence without adding any context or information.

In [13]:
#inbuilt list of stopwords in nltk
stopwords.words('english')[:16]
Out[13]:
[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your',
 u'yours',
 u'yourself',
 u'yourselves',
 u'he',
 u'him',
 u'his']

How do we remove stopwords?

In [14]:
sent = "the program was open to all women between the ages of 17 and 35, in good health, who had graduated from an accredited high school"
In [15]:
#a token is a word or entity in a text
words = word_tokenize(sent)
useful_words = [word for word in words if word not in stopwords.words('english')]
print(useful_words)
['program', 'open', 'women', 'ages', '17', '35', ',', 'good', 'health', ',', 'graduated', 'accredited', 'high', 'school']
In [16]:
# This is how the Naive Bayes classifier expects the input
def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

# For each useful word, we add an entry word: True to a dictionary. Why a dictionary? So that words are not repeated:
# if a word already exists as a key, adding it again has no effect.
In [17]:
create_word_features(["the", "quick", "brown", "quick", "a", "fox"])
Out[17]:
{'brown': True, 'fox': True, 'quick': True}
In [18]:
neg_reviews = []  # create an empty list

# loop over all the files in the neg folder and apply create_word_features to each review
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append((create_word_features(words), "negative"))
    
print(len(neg_reviews))
1000
In [19]:
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append((create_word_features(words), "positive"))
    
#print(pos_reviews[0])    
print(len(pos_reviews))
 
1000

Training and Testing the Naive Bayes Classifier

The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects a list of instances of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'positive' or 'negative'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.

In [20]:
train_set = neg_reviews[:750] + pos_reviews[:750]
test_set =  neg_reviews[750:] + pos_reviews[750:]
print(len(train_set),  len(test_set))
(1500, 500)
In [21]:
#create the NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
In [22]:
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print(accuracy * 100)
72.4
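
If you want to inspect which word features carried the most weight, NLTK's Naive Bayes classifier exposes show_most_informative_features (the exact words and ratios will depend on your training split):

# print the features that best distinguish 'positive' from 'negative'
classifier.show_most_informative_features(10)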
In [28]:
review_moana = '''
This engaging adventure triumphs because of its empowering storyline, which pays tribute to Polynesian culture, and because of its feel-good music, courtesy of Hamilton creator Lin-Manuel Miranda.
'''
In [29]:
words = word_tokenize(review_moana)
words = create_word_features(words)
classifier.classify(words)
 
Out[29]:
'positive'
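
To sanity-check the other class, you can run the same steps on a clearly negative snippet (the review below is invented for illustration); with this training set you would normally expect the result to be 'negative', though the prediction depends on the trained model:

review_negative = '''
The plot was boring and predictable, the acting was terrible, and I regretted wasting two hours on this mess.
'''
words = word_tokenize(review_negative)
words = create_word_features(words)
classifier.classify(words)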