nltk also provides access to a dataset of tweets from Twitter; it includes a set of tweets already classified as negative or positive.
In this exercise notebook we would like to replicate, on this dataset, the sentiment analysis classification we performed on the movie reviews corpus.
First we want to download the dataset and inspect it:
import nltk
# DO NOT MODIFY
nltk.download("twitter_samples")
from nltk.corpus import twitter_samples
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/altintas/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
First, let's check the common fileids method of nltk corpora:
twitter_samples.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
The twitter_samples object has a tokenized() method that returns all tweets from a fileid, already individually tokenized. Read its documentation and use it to find the number of positive and negative tweets.
number_of_positive_tweets = None
### BEGIN SOLUTION
### END SOLUTION
number_of_negative_tweets = None
### BEGIN SOLUTION
### END SOLUTION
# DO NOT MODIFY
assert number_of_positive_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"
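To see why the assert message warns about counting tweets rather than words: tokenized() returns a list of token lists, one per tweet. Here is a minimal sketch with a hypothetical hand-written sample in place of the real corpus:

```python
# Toy stand-in for twitter_samples.tokenized(fileid): a list of
# token lists, one inner list per tweet (hypothetical data).
tokenized_tweets = [
    ["good", "morning", ":)"],
    ["love", "this", "song"],
    ["what", "a", "day"],
]

# The number of tweets is the length of the outer list...
number_of_tweets = len(tokenized_tweets)
# ...while summing the inner lengths would count tokens instead.
number_of_tokens = sum(len(tweet) for tweet in tokenized_tweets)

print(number_of_tweets, number_of_tokens)  # 3 9
```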
As in the lecture, we can build a bag-of-words model to train our machine learning algorithm.
import string
As a first step, we define a list of words that we want to filter out of our dataset:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
def build_bag_of_words_features_filtered(words):
"""Build a bag of words model"""
### BEGIN SOLUTION
### END SOLUTION
assert len(build_bag_of_words_features_filtered(["what", "the", "?", ","]))==0, "Make sure we are filtering out both stopwords and punctuation"
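One way such a filtering bag-of-words function could look, sketched with a short hand-written stopword list standing in for nltk.corpus.stopwords.words("english") (which requires a download):

```python
import string

# Hypothetical short stopword list in place of NLTK's English stopwords.
stopwords = ["what", "the", "a", "is"]
useless_words = stopwords + list(string.punctuation)

def build_bag_of_words_features_filtered(words):
    """Map every word that is not a stopword or punctuation to 1."""
    return {word: 1 for word in words if word not in useless_words}

# Stopwords and punctuation are all filtered out, leaving an empty dict.
print(build_bag_of_words_features_filtered(["what", "the", "?", ","]))  # {}
```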
Before performing sentiment analysis, let's first inspect the dataset a little bit more by creating a list of all words.
words = []
for dataset in ["positive_tweets.json", "negative_tweets.json"]:
for tweet in twitter_samples.tokenized(dataset):
words.extend(tweet)
Study the code above: it is a case of a nested loop; for each dataset, we loop through each of its tweets. Also notice that we use extend; how does it differ from append? Try it on a simple case, read the documentation, or search for it online!
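A quick side-by-side of the two methods on a simple case:

```python
a = [1, 2]
a.append([3, 4])   # the whole list becomes a single new element
b = [1, 2]
b.extend([3, 4])   # each element of the argument is added individually

print(a)  # [1, 2, [3, 4]]
print(b)  # [1, 2, 3, 4]
```

This is why extend is the right choice above: we want one flat list of words, not a list of tweet lists.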
Now let's filter out punctuation and stopwords:
filtered_words = None
### BEGIN SOLUTION
### END SOLUTION
We want to filter out useless_words as defined in the previous section; this will reduce the length of the dataset by more than a factor of 2:
# DO NOT MODIFY
assert len(filtered_words) == 85637, "Make sure that the filtering is applied correctly"
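The filtering can be sketched as a list comprehension over a hypothetical toy token list (the real words and useless_words come from the corpus and NLTK's stopwords):

```python
import string

# Hypothetical stand-ins for the real token list and stopword list.
words = ["i", "love", "this", "!", "but", "the", "rain", "is", "sad", ":("]
useless_words = ["i", "this", "but", "the", "is"] + list(string.punctuation)

# Keep only tokens that are neither stopwords nor punctuation.
# Note that ":(" survives: it is two characters, so it is not in
# string.punctuation, which contains single characters only.
filtered = [word for word in words if word not in useless_words]
print(filtered)  # ['love', 'rain', 'sad', ':(']
```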
The collections package of the standard library contains a Counter class that is handy for counting frequencies of words in our list:
# DO NOT MODIFY
from collections import Counter
counter = Counter(filtered_words)
It also has a most_common() method to access the words with the highest counts:
most_common_words = None
### BEGIN SOLUTION
### END SOLUTION
assert most_common_words[0][0] == ":(", "The most common word should be :("
assert len(most_common_words) == 10, "Make sure you are only getting the first 10"
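Counter and most_common() can be demonstrated on a toy token list (hypothetical data, chosen so that ":(" dominates as in the real corpus):

```python
from collections import Counter

# Toy token list standing in for filtered_words.
tokens = [":(", "sad", ":(", "rain", ":(", "sad"]
counter = Counter(tokens)

# most_common(n) returns (word, count) pairs, most frequent first.
most_common_words = counter.most_common(2)
print(most_common_words)  # [(':(', 3), ('sad', 2)]
```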
Using our build_bag_of_words_features_filtered function we can build the negative and positive features separately.
The format of the positive features should be:
[
( { "here":1, "some":1, "words":1 }, "pos" ),
( { "another":1, "tweet":1}, "pos" )
]
It is a list of tuples: the first element of each tuple is a dictionary mapping each word that appears in the tweet to 1; the second is the "pos" or "neg" string.
negative_features = None
### BEGIN SOLUTION
### END SOLUTION
positive_features = None
### BEGIN SOLUTION
### END SOLUTION
positive_features[0][0]
{'#FollowFriday': 1, ':)': 1, '@France_Inte': 1, '@Milipol_Paris': 1, '@PKuchly57': 1, 'community': 1, 'engaged': 1, 'members': 1, 'top': 1, 'week': 1}
assert positive_features[0][1] == "pos", "Make sure the feature is a list of tuples whose second element is pos or neg"
assert positive_features[0][0]["engaged"] == 1, "Make sure that the first element of each tuple is a dictionary of words"
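The shape of this structure can be sketched with hypothetical toy tweets in place of the real corpus (here the bag-of-words dictionary is built inline, without filtering):

```python
# Hypothetical tokenized tweets standing in for the corpus data.
positive_tweets = [["love", "it"], ["great", "day"]]

# Pair each tweet's bag-of-words dictionary with its label.
positive_features = [
    ({word: 1 for word in tweet}, "pos")
    for tweet in positive_tweets
]
print(positive_features[0])  # ({'love': 1, 'it': 1}, 'pos')
```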
from nltk.classify import NaiveBayesClassifier
Let's use 80% of the data for training and the rest for validation:
split = int(len(positive_features) * 0.8)
split
4000
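The slicing used in the next cell works the same way on any list; a toy example with 10 items instead of 5000:

```python
# Toy list standing in for the 5000 feature tuples.
features = list(range(10))
split = int(len(features) * 0.8)

# Everything before the split index trains; the rest validates.
training, testing = features[:split], features[split:]
print(split, len(training), len(testing))  # 8 8 2
```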
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])
Let's check the accuracy on the training and test sets; make sure to turn those into percent values.
training_accuracy = None
### BEGIN SOLUTION
### END SOLUTION
test_accuracy = None
### BEGIN SOLUTION
### END SOLUTION
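The percent conversion itself is just a multiplication; sketched here with a hypothetical accuracy value, since nltk's accuracy helper returns a fraction between 0 and 1:

```python
# Hypothetical fraction such as nltk.classify.util.accuracy would return.
accuracy_fraction = 0.75

# Multiply by 100 to express it as a percent.
accuracy_percent = accuracy_fraction * 100
print(accuracy_percent)  # 75.0
```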
It looks like the accuracy on the test set is very high compared to the movie review dataset; check the most informative features below to understand why:
classifier.show_most_informative_features()
Most Informative Features
                  :( = 1                 neg : pos    =   2362.3 : 1.0
                  :) = 1                 pos : neg    =   1139.0 : 1.0
                 See = 1                 pos : neg    =     37.7 : 1.0
                 TOO = 1                 neg : pos    =     36.3 : 1.0
              THANKS = 1                 neg : pos    =     35.0 : 1.0
                THAT = 1                 neg : pos    =     27.7 : 1.0
                miss = 1                 neg : pos    =     26.4 : 1.0
                 sad = 1                 neg : pos    =     25.0 : 1.0
                 x15 = 1                 neg : pos    =     23.7 : 1.0
               Thank = 1                 pos : neg    =     22.3 : 1.0