According to Wikipedia:
Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. [...] Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).
Another, more business-oriented, definition is:
[The goal of sentiment analysis is to] understand the social sentiment of your brand, product or service while monitoring online conversations. Sentiment Analysis is contextual mining of text which identifies and extracts subjective information in source material.
In this project we will perform a kind of "reverse sentiment analysis" on a dataset consisting of movie reviews from Rotten Tomatoes. The dataset already contains the classification, which can be positive or negative, and the task at hand is to identify which words appear more frequently in reviews from each of the classes.
More specifically, the Bernoulli Naive Bayes algorithm will be used. From Wikipedia:
In the multivariate Bernoulli event model, features are independent binary variables describing inputs.
Furthermore,
If $x_i$ is a boolean expressing the occurrence or absence of the $i$-th term from the vocabulary, then the likelihood of a document given a class $C_k$ is:

$$ p(x_1, \ldots, x_n \mid C_k) = \prod\limits_{i = 1}^n p_{ki}^{x_i}\,(1 - p_{ki})^{(1 - x_i)} $$

where $p_{ki}$ is the probability that a review belonging to class $C_k$ contains the term $x_i$. The class label is either 0 or 1 (rotten or fresh). In other words, the Bernoulli NB will tell us which words are more likely to appear given that the review is "fresh" versus given that it is "rotten".
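As a quick numeric sketch of this likelihood (using made-up per-class term probabilities, not values estimated from this dataset):

```python
import numpy as np

# Hypothetical per-class term probabilities p_ki for a 4-term vocabulary.
p_k = np.array([0.8, 0.1, 0.5, 0.3])
# Binary document vector x: 1 if the term occurs in the document, 0 otherwise.
x = np.array([1, 0, 0, 1])

# Likelihood of the document given the class:
# product over i of p_ki^x_i * (1 - p_ki)^(1 - x_i)
likelihood = np.prod(p_k**x * (1 - p_k)**(1 - x))
# 0.8 * 0.9 * 0.5 * 0.3 = 0.108
```

Present terms contribute $p_{ki}$ to the product and absent terms contribute $1 - p_{ki}$, which is what makes this the "multivariate Bernoulli" event model.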
import pandas as pd
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # so we can see the value of multiple statements at once.
rotten = pd.read_csv('rt_critics.csv')
rotten.head()
  | critic | fresh | imdb | publication | quote | review_date | rtid | title
---|---|---|---|---|---|---|---|---|
0 | Derek Adams | fresh | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
1 | Richard Corliss | fresh | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
2 | David Ansen | fresh | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
3 | Leonard Klady | fresh | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
4 | Jonathan Rosenbaum | fresh | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |
The column fresh contains three classes, namely "fresh", "rotten" and "none". The third one needs to be removed, which can be done using the pandas method isin(), which returns a boolean mask showing whether each element is contained in the given list of values.
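As a toy illustration (with made-up values, not the actual dataset), isin produces a boolean mask that can then be used to filter rows:

```python
import pandas as pd

s = pd.Series(['fresh', 'none', 'rotten'])
mask = s.isin(['fresh', 'rotten'])   # boolean mask: True, False, True
filtered = s[mask]                   # keeps only 'fresh' and 'rotten'
```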
rotten['fresh'].value_counts()
fresh 8613
rotten 5436
none 23
Name: fresh, dtype: int64
rotten = rotten[rotten['fresh'].isin(['fresh','rotten'])]
rotten.head()
  | critic | fresh | imdb | publication | quote | review_date | rtid | title
---|---|---|---|---|---|---|---|---|
0 | Derek Adams | fresh | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
1 | Richard Corliss | fresh | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
2 | David Ansen | fresh | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
3 | Leonard Klady | fresh | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
4 | Jonathan Rosenbaum | fresh | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |
rotten['fresh'].value_counts()
fresh 8613
rotten 5436
Name: fresh, dtype: int64
We now turn the fresh column into 0s and 1s using .map().
rotten['fresh'] = rotten['fresh'].map(lambda x: 1 if x == 'fresh' else 0)
rotten.head()
  | critic | fresh | imdb | publication | quote | review_date | rtid | title
---|---|---|---|---|---|---|---|---|
0 | Derek Adams | 1 | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
1 | Richard Corliss | 1 | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
2 | David Ansen | 1 | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
3 | Leonard Klady | 1 | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
4 | Jonathan Rosenbaum | 1 | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |
We need numbers to run our model, i.e., our predictor matrix of words must be numerical. For that we will use CountVectorizer. From the sklearn documentation, CountVectorizer
Converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
We also have to choose a value for ngram_range, which the documentation describes as:
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
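To make this concrete, here is a toy example (made-up sentences) showing that ngram_range=(1, 2) extracts both single words and two-word sequences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great film", "bad film"]
cv_demo = CountVectorizer(ngram_range=(1, 2), binary=True)
cv_demo.fit(docs)

# The learned vocabulary contains both the unigrams and the bigrams:
vocab = sorted(cv_demo.vocabulary_)
# ['bad', 'bad film', 'film', 'great', 'great film']
```

With ngram_range=(1, 1) only the three single words would be kept.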
from sklearn.feature_extraction.text import CountVectorizer
ngram_range = (1,2)
max_features = 2000
cv = CountVectorizer(ngram_range=ngram_range, max_features=max_features, binary=True, stop_words='english')
The next step is to "learn the vocabulary dictionary and return term-document matrix" using cv.fit_transform.
words = cv.fit_transform(rotten.quote)
The dataframe corresponding to this term-document matrix will be called df_words. This is our predictor matrix.
P.S.: The method todense() returns a dense matrix representation of the sparse matrix words.
words.todense()
matrix([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
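Calling todense() materializes every entry, which is only feasible because this dataset is small; the sparse csr_matrix stores just the nonzero entries. A minimal sketch with made-up data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix, like a term-document matrix in miniature.
dense = np.zeros((3, 4), dtype=np.int64)
dense[0, 1] = 1

sparse = csr_matrix(dense)
stored = sparse.nnz       # only 1 value is actually stored
cells = dense.size        # versus 12 cells in the dense version
```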
df_words = pd.DataFrame(words.todense(),
                        columns=cv.get_feature_names_out())
df_words.head()
  | 10 | 100 | 20 | 50s | 90s | ability | able | absolutely | absorbing | accomplished | ... | wry | yarn | year | year old | years | years ago | yes | york | young | younger
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2000 columns
In this dataframe a value of 1 means the term appears in the review. For example, the first two reviews contain 7 and 3 of the 2000 vocabulary terms, respectively:
df_words.iloc[0,:].value_counts()
0 1993
1 7
Name: 0, dtype: int64
df_words.iloc[1,:].value_counts()
0 1997
1 3
Name: 1, dtype: int64
We proceed as usual with a train/test split:
X_train, X_test, y_train, y_test = train_test_split(df_words.values, rotten.fresh.values, test_size=0.25)
We will now use BernoulliNB()
on the training data to build a model to predict if the class is "fresh" or "rotten" based on the word appearances:
nb = BernoulliNB()
nb.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
Using cross-validation to compute the score:
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=5)
round(np.mean(nb_scores),3)
0.734
The log probabilities of the features given a class are obtained using nb.feature_log_prob_. We then exponentiate the result to get the actual probabilities. To organize our results we build a DataFrame which includes a new column showing the difference in probabilities:
feat_lp = nb.feature_log_prob_
fresh_p = np.exp(feat_lp[1])
rotten_p = np.exp(feat_lp[0])
print(fresh_p[0:7])
print(rotten_p[0:7])
df_new = pd.DataFrame({'fresh_probability':fresh_p,
'rotten_probability':rotten_p,
'feature':df_words.columns.values})
df_new['probability_diff'] = df_new['fresh_probability'] - df_new['rotten_probability']
df_new.head()
[0.0026418 0.0010878 0.0027972 0.0012432 0.0013986 0.0026418 0.0024864]
[0.00487211 0.00073082 0.00146163 0.00170524 0.00243605 0.00292326 0.00292326]
  | feature | fresh_probability | rotten_probability | probability_diff
---|---|---|---|---|
0 | 10 | 0.002642 | 0.004872 | -0.002230 |
1 | 100 | 0.001088 | 0.000731 | 0.000357 |
2 | 20 | 0.002797 | 0.001462 | 0.001336 |
3 | 50s | 0.001243 | 0.001705 | -0.000462 |
4 | 90s | 0.001399 | 0.002436 | -0.001037 |
E.g., given that a review is "fresh", the probability that the word "ability" is present is about 0.26% (and about 0.29% given that it is "rotten").
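With the default smoothing parameter alpha=1.0, these probabilities are simply Laplace-smoothed class-conditional frequencies: (number of reviews in the class containing the term + 1) / (number of reviews in the class + 2). A minimal sketch verifying this on a tiny made-up binary matrix:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up binary term matrix: 3 class-1 rows, 2 class-0 rows.
X = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0])

model = BernoulliNB(alpha=1.0).fit(X, y)
probs = np.exp(model.feature_log_prob_)

# Smoothed estimate for class 1, term 0: (2 occurrences + 1) / (3 rows + 2)
manual = (X[y == 1][:, 0].sum() + 1) / ((y == 1).sum() + 2)
# probs[1, 0] == manual == 0.6
```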
The accuracy on the test set is well above the baseline of always predicting the majority class "fresh" (i.e., the mean of y_test):
nb.score(X_test, y_test)
np.mean(y_test)
0.7272986051807572
0.6205522345573584
Sorting by probability_diff gives the words most strongly associated with each class:
df_fresh = df_new.sort_values('probability_diff', ascending=False)
df_rotten = df_new.sort_values('probability_diff', ascending=True)
df_fresh.head()
df_rotten.head()
  | feature | fresh_probability | rotten_probability | probability_diff
---|---|---|---|---|
641 | film | 0.160839 | 0.117905 | 0.042934 |
137 | best | 0.042424 | 0.019488 | 0.022936 |
753 | great | 0.029060 | 0.009501 | 0.019559 |
531 | entertaining | 0.023465 | 0.005603 | 0.017863 |
1256 | performance | 0.021756 | 0.006334 | 0.015422 |
  | feature | fresh_probability | rotten_probability | probability_diff
---|---|---|---|---|
993 | like | 0.043667 | 0.067479 | -0.023811 |
111 | bad | 0.006993 | 0.025335 | -0.018342 |
1398 | really | 0.006682 | 0.022899 | -0.016217 |
1139 | movie | 0.127894 | 0.142266 | -0.014371 |
910 | isn | 0.011655 | 0.025335 | -0.013680 |
print('Words more likely to be found in "fresh":')
df_fresh['feature'].tolist()[0:5]
print('Words more likely to be found in "rotten":')
df_rotten['feature'].tolist()[0:5]
Words more likely to be found in "fresh":
['film', 'best', 'great', 'entertaining', 'performance']
Words more likely to be found in "rotten":
['like', 'bad', 'really', 'movie', 'isn']
Finally, we can find the reviews most likely to be fresh or rotten; we need the other columns of the original table for that. Defining the target and predictors and fitting the model to all the data, we obtain:
X = df_words.values
y = rotten['fresh']
model = BernoulliNB().fit(X,y)
df_full = pd.DataFrame({
'probability_fresh':model.predict_proba(X)[:,1],
'movie':rotten.title,
'quote':rotten.quote
})
df_fresh = df_full.sort_values('probability_fresh',ascending=False)
df_rotten = df_full.sort_values('probability_fresh',ascending=True)
print('5 Movies most likely to be fresh:')
df_fresh.head()
print('5 Movies most likely to be rotten:')
df_rotten.head()
5 Movies most likely to be fresh:
  | movie | probability_fresh | quote
---|---|---|---|
7549 | Kundun | 0.999990 | Stunning, odd, glorious, calm and sensationall... |
7352 | Witness | 0.999989 | Powerful, assured, full of beautiful imagery a... |
7188 | Mrs Brown | 0.999986 | Centering on a lesser-known chapter in the rei... |
5610 | Diva | 0.999978 | The most exciting debut in years, it is unifie... |
4735 | Sophie's Choice | 0.999977 | Though it's far from a flawless movie, Sophie'... |
5 Movies most likely to be rotten:
  | movie | probability_fresh | quote
---|---|---|---|
12567 | Pokémon: The First Movie | 0.000012 | With intentionally stilted animation, uninspir... |
3546 | Joe's Apartment | 0.000013 | There's not enough story here for something ha... |
2112 | The Beverly Hillbillies | 0.000062 | Imagine the dumbest half-hour sitcom you've ev... |
3521 | Kazaam | 0.000097 | As fairy tale, buddy comedy, family drama, thr... |
6837 | Batman & Robin | 0.000138 | Pointless, plodding plotting; asinine action; ... |