nltk also provides access to a dataset of tweets from Twitter; it includes a set of tweets already classified as negative or positive.
In this exercise notebook we would like to replicate, on this dataset, the sentiment analysis classification we performed on the movie reviews corpus.
First we want to download the dataset and inspect it:
import nltk
# DO NOT MODIFY
nltk.download("twitter_samples")
from nltk.corpus import twitter_samples
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/altintas/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
First, let's check the common fileids method of nltk corpora:
twitter_samples.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
The twitter_samples object has a tokenized() method that returns all tweets from a fileid, already individually tokenized. Read its documentation and use it to find the number of positive and negative tweets.
number_of_positive_tweets = None
### BEGIN SOLUTION
### END SOLUTION
number_of_negative_tweets = None
### BEGIN SOLUTION
### END SOLUTION
# DO NOT MODIFY
assert number_of_positive_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"
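To see why the assert message warns about counting tweets rather than words: tokenized() returns a list of token lists, one per tweet. Here is a minimal sketch with a hypothetical hand-written sample in place of the real corpus:

```python
# Toy stand-in for twitter_samples.tokenized(fileid): a list of
# token lists, one inner list per tweet (hypothetical data).
tokenized_tweets = [
    ["good", "morning", ":)"],
    ["love", "this", "song"],
    ["what", "a", "day"],
]

# The number of tweets is the length of the outer list...
number_of_tweets = len(tokenized_tweets)
# ...while summing the inner lengths would count tokens instead.
number_of_tokens = sum(len(tweet) for tweet in tokenized_tweets)

print(number_of_tweets, number_of_tokens)  # 3 9
```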
As in the lecture, we can build a bag-of-words model to train our machine learning algorithm.
import string
As a first step, we define a list of words that we want to filter out of our dataset:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
def build_bag_of_words_features_filtered(words):
"""Build a bag of words model"""
### BEGIN SOLUTION
### END SOLUTION
assert len(build_bag_of_words_features_filtered(["what", "the", "?", ","]))==0, "Make sure we are filtering out both stopwords and punctuation"
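One way such a filtering bag-of-words function could look, sketched with a short hand-written stopword list standing in for nltk.corpus.stopwords.words("english") (which requires a download):

```python
import string

# Hypothetical short stopword list in place of NLTK's English stopwords.
stopwords = ["what", "the", "a", "is"]
useless_words = stopwords + list(string.punctuation)

def build_bag_of_words_features_filtered(words):
    """Map every word that is not a stopword or punctuation to 1."""
    return {word: 1 for word in words if word not in useless_words}

# Stopwords and punctuation are all filtered out, leaving an empty dict.
print(build_bag_of_words_features_filtered(["what", "the", "?", ","]))  # {}
```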
Before performing sentiment analysis, let's first inspect the dataset a little bit more by creating a list of all words.
words = []
for dataset in ["positive_tweets.json", "negative_tweets.json"]:
for tweet in twitter_samples.tokenized(dataset):
words.extend(tweet)
Study the code above: it is a case of a nested loop; for each dataset, we loop through each of its tweets. Also notice that we use extend; how does it differ from append? Try it on a simple case, read the documentation, or search for it online!
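A quick side-by-side of the two methods on a simple case:

```python
a = [1, 2]
a.append([3, 4])   # the whole list becomes a single new element
b = [1, 2]
b.extend([3, 4])   # each element of the argument is added individually

print(a)  # [1, 2, [3, 4]]
print(b)  # [1, 2, 3, 4]
```

This is why extend is the right choice above: we want one flat list of words, not a list of tweet lists.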
Now let's filter out punctuation and stopwords:
filtered_words = None
### BEGIN SOLUTION
### END SOLUTION
We want to filter out useless_words as defined in the previous section; this will reduce the length of the dataset by more than a factor of 2:
# DO NOT MODIFY
assert len(filtered_words) == 85637, "Make sure that the filtering is applied correctly"
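The filtering can be sketched as a list comprehension over a hypothetical toy token list (the real words and useless_words come from the corpus and NLTK's stopwords):

```python
import string

# Hypothetical stand-ins for the real token list and stopword list.
words = ["i", "love", "this", "!", "but", "the", "rain", "is", "sad", ":("]
useless_words = ["i", "this", "but", "the", "is"] + list(string.punctuation)

# Keep only tokens that are neither stopwords nor punctuation.
# Note that ":(" survives: it is two characters, so it is not in
# string.punctuation, which contains single characters only.
filtered = [word for word in words if word not in useless_words]
print(filtered)  # ['love', 'rain', 'sad', ':(']
```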
The collections package of the standard library contains a Counter class that is handy for counting frequencies of words in our list:
# DO NOT MODIFY
from collections import Counter
counter = Counter(filtered_words)
It also has a most_common() method to access the words with the highest counts:
most_common_words = None
### BEGIN SOLUTION
### END SOLUTION
assert most_common_words[0][0] == ":(", "The most common word should be :("
assert len(most_common_words) == 10, "Make sure you are only getting the first 10"
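Counter and most_common() can be demonstrated on a toy token list (hypothetical data, chosen so that ":(" dominates as in the real corpus):

```python
from collections import Counter

# Toy token list standing in for filtered_words.
tokens = [":(", "sad", ":(", "rain", ":(", "sad"]
counter = Counter(tokens)

# most_common(n) returns (word, count) pairs, most frequent first.
most_common_words = counter.most_common(2)
print(most_common_words)  # [(':(', 3), ('sad', 2)]
```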
Using our build_bag_of_words_features_filtered function we can build the negative and positive features separately.
The format of the positive features should be:
[
( { "here":1, "some":1, "words":1 }, "pos" ),
( { "another":1, "tweet":1}, "pos" )
]
It is a list of tuples: the first element of each tuple is a dictionary mapping each word that appears in the tweet to 1; the second is the "pos" or "neg" string.
negative_features = None
### BEGIN SOLUTION
### END SOLUTION
positive_features = None
### BEGIN SOLUTION
### END SOLUTION
positive_features[0][0]
{'#FollowFriday': 1, ':)': 1, '@France_Inte': 1, '@Milipol_Paris': 1, '@PKuchly57': 1, 'community': 1, 'engaged': 1, 'members': 1, 'top': 1, 'week': 1}
assert positive_features[0][1] == "pos", "Make sure the feature is a list of tuples whose second element is pos or neg"
assert positive_features[0][0]["engaged"] == 1, "Make sure that the first element of each tuple is a dictionary of words"
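The shape of this structure can be sketched with hypothetical toy tweets in place of the real corpus (here the bag-of-words dictionary is built inline, without filtering):

```python
# Hypothetical tokenized tweets standing in for the corpus data.
positive_tweets = [["love", "it"], ["great", "day"]]

# Pair each tweet's bag-of-words dictionary with its label.
positive_features = [
    ({word: 1 for word in tweet}, "pos")
    for tweet in positive_tweets
]
print(positive_features[0])  # ({'love': 1, 'it': 1}, 'pos')
```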
from nltk.classify import NaiveBayesClassifier
Let's use 80% of the data for training and the rest for validation:
split = int(len(positive_features) * 0.8)
split
4000
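The slicing used in the next cell works the same way on any list; a toy example with 10 items instead of 5000:

```python
# Toy list standing in for the 5000 feature tuples.
features = list(range(10))
split = int(len(features) * 0.8)

# Everything before the split index trains; the rest validates.
training, testing = features[:split], features[split:]
print(split, len(training), len(testing))  # 8 8 2
```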
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])
Let's check the accuracy on the training and test sets; make sure to turn those into percent values.
training_accuracy = None
### BEGIN SOLUTION
### END SOLUTION
test_accuracy = None
### BEGIN SOLUTION
### END SOLUTION
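The percent conversion itself is just a multiplication; sketched here with a hypothetical accuracy value, since nltk's accuracy helper returns a fraction between 0 and 1:

```python
# Hypothetical fraction such as nltk.classify.util.accuracy would return.
accuracy_fraction = 0.75

# Multiply by 100 to express it as a percent.
accuracy_percent = accuracy_fraction * 100
print(accuracy_percent)  # 75.0
```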
It looks like the accuracy on the test set is very high compared to the movie review dataset; check the most informative features below to understand why:
classifier.show_most_informative_features()
Most Informative Features
                  :( = 1                 neg : pos    =   2362.3 : 1.0
                  :) = 1                 pos : neg    =   1139.0 : 1.0
                 See = 1                 pos : neg    =     37.7 : 1.0
                 TOO = 1                 neg : pos    =     36.3 : 1.0
              THANKS = 1                 neg : pos    =     35.0 : 1.0
                THAT = 1                 neg : pos    =     27.7 : 1.0
                miss = 1                 neg : pos    =     26.4 : 1.0
                 sad = 1                 neg : pos    =     25.0 : 1.0
                 x15 = 1                 neg : pos    =     23.7 : 1.0
               Thank = 1                 pos : neg    =     22.3 : 1.0