In Natural Language Processing there is a concept known as Sentiment Analysis.
Given a movie review or a tweet, the text can be automatically classified into categories. These categories can be user-defined (positive, negative) or whichever classes you want.
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible), I'll focus on two possible sentiment classifications: positive and negative.
This codelab is based on the Python programming language together with an open source library called NLTK (Natural Language Toolkit).
NLTK includes extensive software, data and documentation: text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, all free at http://www.nltk.org/
It also contains free texts for analysis, such as movie reviews, Twitter data, the works of Shakespeare, etc.
Install Pip: run in terminal:
sudo easy_install pip
Install NLTK: run in terminal :
sudo pip install -U nltk
Download NLTK data: run python shell (in terminal) and write the following code:
import nltk
nltk.download()
AFINN-111 - A list of English words rated for valence with an integer between -5 and +5.
#Build dictionary in Python
sentiment_dictionary = {}
for line in open('Desktop/NLP&Sentiment Analysis/AFINN-111.txt'):
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)
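The AFINN file is plain tab-separated text, one "word TAB score" pair per line. A minimal sketch of the same parsing logic on a few inline sample lines (the sample words and scores here are illustrative, not copied from the real file), so you can see the resulting dictionary without downloading anything:

```python
# Parse AFINN-style "word<TAB>score" lines into a dict.
# The sample lines below are illustrative stand-ins for the real file.
sample_lines = ["hate\t-3\n", "love\t3\n", "wonderful\t4\n"]

sentiment_dictionary = {}
for line in sample_lines:
    word, score = line.split('\t')   # int() tolerates the trailing newline
    sentiment_dictionary[word] = int(score)

print(sentiment_dictionary['love'])  # 3
```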
Tokenization is the process of breaking down a stream of text into words, phrases or symbols known as tokens.
from nltk.tokenize import word_tokenize
words = word_tokenize('I hate this novel!'.lower())
print(words)
['i', 'hate', 'this', 'novel', '!']
words[0]
'i'
sum(sentiment_dictionary.get(word, 0) for word in words)
-1
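That one-liner generalizes naturally into a small scoring helper. A hedged sketch (the function name and toy dictionary are my own, and it uses a plain whitespace `.split()` instead of NLTK's `word_tokenize`, so punctuation handling is cruder):

```python
# Hypothetical helper: sum AFINN-style scores over naively split tokens.
# A toy dictionary stands in for the full AFINN-111 lexicon.
toy_scores = {'hate': -3, 'love': 3, 'great': 3}

def score_text(text, scores):
    # lowercase and split on whitespace; real code would use word_tokenize
    return sum(scores.get(word, 0) for word in text.lower().split())

print(score_text('I love this book', toy_scores))   # 3
print(score_text('I hate this novel', toy_scores))  # -3
```

A positive total suggests positive sentiment, a negative total the opposite; zero means no scored words were found.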
#split into sentences
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize('''I love this book! Though I hate the beginning. It would be great for you.''')
for s in sentences:
    print(s)
I love this book!
Though I hate the beginning.
It would be great for you.
#compute score for each sentence
for sentence in sentences:
    words = word_tokenize(sentence)
    print(sum(sentiment_dictionary.get(word, 0) for word in words))
3
-3
3
Naive Bayes is a classification algorithm that works on Bayes' theorem of probability to predict the class of an unknown outcome. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
import nltk.classify.util #calculates accuracy
from nltk.classify import NaiveBayesClassifier #imports the classifier Naive Bayes
from nltk.corpus import movie_reviews #imports movie reviews from nltk
from nltk.corpus import stopwords #imports stopwords from nltk
from nltk.corpus import wordnet #imports wordnet (lexical database for the English language) from nltk
#see words in the review
movie_reviews.words()
[u'plot', u':', u'two', u'teen', u'couples', u'go', ...]
movie_reviews.categories()
[u'neg', u'pos']
#frequency distribution of words in movie review
all_words = movie_reviews.words()
freq_dist = nltk.FreqDist(all_words)
freq_dist.most_common(10)
[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822)]
Stopwords are words that carry little or no meaning in a sentence but are really common (high-frequency words), e.g. a, I, is, the.
When doing language processing, we need to get rid of these words since they take up a large part of any sentence without adding any context or information.
#inbuilt list of stopwords in nltk
stopwords.words('english')[:16]
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his']
sent = "the program was open to all women between the ages of 17 and 35, in good health, who had graduated from an accredited high school"
#a token is a word or entity in a text
words = word_tokenize(sent)
useful_words = [word for word in words if word not in stopwords.words('english')]
print(useful_words)
['program', 'open', 'women', 'ages', '17', '35', ',', 'good', 'health', ',', 'graduated', 'accredited', 'high', 'school']
# This is how the Naive Bayes classifier expects the input
def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict
#For each useful word, we create a dictionary entry mapping the word to True. Why a dictionary? So that words are not repeated:
#if a word already exists as a key, adding it again simply overwrites the same entry.
create_word_features(["the", "quick", "brown", "quick", "a", "fox"])
{'brown': True, 'fox': True, 'quick': True}
neg_reviews = []  #We create an empty list
#loop over all the files in the neg folder and apply create_word_features
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append((create_word_features(words), "negative"))
print(len(neg_reviews))
1000
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append((create_word_features(words), "positive"))
#print(pos_reviews[0])
print(len(pos_reviews))
1000
The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set and the rest as the test set, which gives us 1500 training instances and 500 test instances. The classifier's training method expects a list of pairs of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'positive' or 'negative'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
train_set = neg_reviews[:750] + pos_reviews[:750]
test_set = neg_reviews[750:] + pos_reviews[750:]
print(len(train_set), len(test_set))
(1500, 500)
#create the NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print(accuracy * 100)
72.4
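Beyond an accuracy number, NLTK's classifier can report which features it found most discriminative via show_most_informative_features(). A sketch on a tiny handcrafted training set in the same [(feature_dict, label)] shape as train_set above (with the real classifier you would simply call classifier.show_most_informative_features()); the toy words and labels here are my own:

```python
from nltk.classify import NaiveBayesClassifier

# Toy training set in the same [(feature_dict, label)] shape as train_set
toy_train = [
    ({'great': True}, 'positive'), ({'superb': True}, 'positive'),
    ({'awful': True}, 'negative'), ({'boring': True}, 'negative'),
]
toy_classifier = NaiveBayesClassifier.train(toy_train)

# Show the features whose presence most strongly signals one label
toy_classifier.show_most_informative_features(4)

print(toy_classifier.classify({'awful': True}))  # negative
```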
review_emoji_movie = '''
This engaging adventure triumphs because of its empowering storyline, which pays tribute to Polynesian culture, and because of its feel-good music, courtesy of Hamilton creator Lin-Manuel Miranda.
'''
words = word_tokenize(review_emoji_movie)
words = create_word_features(words)
classifier.classify(words)
'positive'