Sentiment Symposium - Build a Sentiment Predictor in 5 Minutes in Python

Łukasz Augustyniak

Piotr Bródka

e-Mail: [email protected]
Twitter: @luk_augustyniak
LinkedIn: Łukasz Augustyniak
GitHub: laugustyniak
IPython Notebook view: SAS2015 Notebook

European research centre of Network intelliGence for INnovation Enhancement


Purpose of the presentation:

  • Learning by practice
  • Real example implementation
  • Using trained model for production

Why Python?

  • code readability
  • its syntax allows programmers to express concepts in fewer lines of code than C++ or Java
  • really strong open source community
  • ideal for fast prototyping and building models (research)

Why IPython Notebook?

The IPython Notebook is an interactive computational environment in which you can combine code execution, rich text, mathematics, plots, and rich media - just like what you see now :)

In [28]:
sas2015 = 'Welcome to Sentiment Symposium'
In [29]:
print sas2015
Welcome to Sentiment Symposium
In [30]:
sas2015 + ' 2015'
Out[30]:
'Welcome to Sentiment Symposium 2015'

How can I install Python and IPython Notebook?

Python's distribution - Anaconda

Anaconda - a Python interpreter with pre-installed libraries - is a completely free Python distribution (including for commercial use and redistribution). It includes over 195 of the most popular Python packages for science, math, engineering, and data analysis.

Python's libraries

Scikit-Learn & Pandas

scikit-learn - Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

pandas - Python Data Analysis Library

  • pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • great for loading and transforming data
In [31]:
import pandas as pd

Where is NLTK?

All of the preprocessing and model creation here can be done with the scikit-learn library alone.

Load dataset with Pandas

Path for dataset

In [32]:
from os import path
notebook_path = 'C:/Users/Dell/Documents/GitHub/Presentations/sas2015/'

SemEval 2014 dataset - http://alt.qcri.org/semeval2014/

approximately 6,000 tweets annotated as negative/neutral/positive (encoded here as 1/2/3)

Load data into Data Frame structure

A tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It is the primary pandas data structure.

A lovely one-liner for data loading :)

In [33]:
data = pd.read_csv(path.join(notebook_path, 'data', 'SemEval-2014.csv'), index_col=0)
In [34]:
data
Out[34]:
sentiment document
0 3 Gas by my house hit $3.39!!!! I'm going to Cha...
1 1 Theo Walcott is still shit, watch Rafa and Joh...
2 1 its not that I'm a GSP fan, i just hate Nick D...
3 1 Iranian general says Israel's Iron Dome can't ...
4 3 with J Davlar 11th. Main rivals are team Polan...
5 1 Talking about ACT's && SAT's, deciding where I...
6 2 Why is "Happy Valentines Day" trending? It's o...
7 1 They may have a SuperBowl in Dallas, but Dalla...
8 2 Im bringing the monster load of candy tomorrow...
9 2 Apple software, retail chiefs out in overhaul:...
10 2 #Livewire Nadal confirmed for Mexican Open in ...
11 2 #Iran US delisting MKO from global terrorists ...
12 2 Expect light-moderate rains over E. Visayas; C...
13 3 One ticket left for the @49ers game tomorrow! ...
14 2 Game 1 of the NLCS and a rematch of the NFC Ch...
15 3 Never start working on your dreams and goals t...
16 2 BLACK FRIDAY Huge Saving Aerial View of a City...
17 3 YES we all know INDIO vs CV is tomorrow the BE...
18 2 Mohamed Morsi, Egypt's Muslim Brotherhood pres...
19 2 C'mon Avila! You just got tagged out by a guy ...
20 2 At the first Grammy Awards, held on 4 May 1959...
21 3 Good morning Thursday. "Life is fragile. We're...
22 3 #Twitition Mcfly come back to Argentina but th...
23 1 My teachers call themselves givng us candy.......
24 2 #Broncos Peyton Manning named AFC Offensive Pl...
25 2 @TooZany is bringing out Kendrick Lamar the 6t...
26 2 Andre's Wigan Warning - #COYS Official Site Wi...
27 2 When my professor passes out candy and says "a...
28 2 How are they going to act in new york with the...
29 1 Homegrown talent missing on Signing Day: Throu...
... ... ...
6235 3 Yay !!!!RT @kellymonaco1: Excited to interview...
6236 3 @TomFelton Hope you win tonight Tom! Your US a...
6237 3 If I'm reading the Twitter Trend list correctl...
6238 3 Colts game tonight! Yay!
6239 2 Is this on tv again? "@kugrlover: Most of Reds...
6240 3 What will have a better TV rating...#Cardinals...
6241 3 Yes. I'm ready for HS, college & pro. Bring it...
6242 3 Trying to leave, I'm only 10 minutes late (so ...
6243 1 I fail to see why the Rams are playing TWICE o...
6244 2 On the night Hank Williams came to town.
6245 3 There won't be just a Party in the USA tonight...
6246 3 After that, I'll start plugging mine and @joes...
6247 3 Man was that Jets and Cowboys game awesome or ...
6248 3 MNF tonight! Let's go Sexy Rexy!
6249 2 Just checking in to see if I had a nightmare l...
6250 1 Lmao RT @HeatherNoel13: Curtis painter looks l...
6251 2 Monday Night Football #TeamTexans all day & to...
6252 2 @PierreGarcon85 come out and watch THOSE GUYS ...
6253 3 Huge thanks to those of you who came out to my...
6254 1 #Londonriots is trending 3rd worldwide ..... T...
6255 3 I had a fun day on terra nova. Followed by a h...
6256 1 Today, we found out that Rob Henry tore his AC...
6257 3 Monday Night Football - Gary Neville did well ...
6258 3 Happy birthday, Hank Williams. In honor if the...
6259 3 New cast of DWTS tba at 8pm tonight!! So excit...
6260 1 @stoney16 @JeffMossDSR I'd recommend just turn...
6261 3 RT @MNFootNg It's monday and Monday Night Foot...
6262 3 All I know is the road for that Lomardi start ...
6263 2 All Blue and White fam, we r meeting at Golden...
6264 1 I'm pisseeedddd that I missed Kid Cudi's show ...

6265 rows × 2 columns

In [35]:
%matplotlib inline

data.sentiment.hist()
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d9f278>

Assign the documents and labels to more intuitive names

In [36]:
docs = data['document'] 
y = data['sentiment']  # standard name for the labels/classes variable
In [37]:
docs[0]
Out[37]:
"Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)"
In [38]:
y[0]
Out[38]:
3

Build Bag of Word model with Scikit-Learn

Convert a collection of text documents to a matrix of token counts

'I like new Note IV.' -> [0, 1, 1, 1, 1, 0, 0]
'I was disappointed by new Samsung phone.' -> [1, 0, 0, 1, 0, 1, 1]
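
A runnable version of this toy example - only a sketch, since the exact vocabulary and ordering depend on the tokenizer (scikit-learn's default, for instance, drops single-character tokens such as 'I'):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['I like new Note IV.',
            'I was disappointed by new Samsung phone.']
toy_vect = CountVectorizer()  # toy vectorizer with default settings
toy_X = toy_vect.fit_transform(toy_docs)
print toy_vect.get_feature_names()  # the learned vocabulary
print toy_X.toarray()  # one row of token counts per document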

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english')
X = count_vect.fit_transform(docs)

print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=74011 for #documents=6265

Cross-Validation

Good practice for research

In [40]:
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression

def sentiment_classification(X, y, n_folds=10, classifier=None):
    """
    Sentiment classification with cross-validation - supervised method
    :type X: ndarray feature matrix for classification
    :type y: list or ndarray of classes
    :type n_folds: int # of folds for CV
    :type classifier: classifier which we train and use to predict sentiment
    :return: measures: accuracy, precision, recall, f1
    """
    results = {'acc': [], 'prec': [], 'rec': [], 'f1': [], 'cm': []}
    kf = cross_validation.StratifiedKFold(y, n_folds=n_folds, shuffle=True)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        ######################## Most important part ##########################
        clf = classifier.fit(X_train, y_train)  # train the classifier
        predicted = clf.predict(X_test)  # predict on the test set
        #######################################################################

        results['acc'].append(metrics.accuracy_score(y_test, predicted))
        results['prec'].append(metrics.precision_score(y_test, predicted, average='weighted'))
        results['rec'].append(metrics.recall_score(y_test, predicted, average='weighted'))
        results['f1'].append(metrics.f1_score(y_test, predicted, average='weighted'))
        results['cm'].append(metrics.confusion_matrix(y_test, predicted))

    return results

Run sentiment classification

In [41]:
results = sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
In [42]:
import numpy as np
print 'Accuracy: %s' % np.mean(results['acc'])
print 'F1-measure: %s' % np.mean(results['f1'])
Accuracy: 0.67119115848
F1-measure: 0.649526332256
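
The function above also stores a confusion matrix per fold in results['cm']; summing them shows which classes get confused with which. A small sketch - valid here because stratification guarantees every fold contains all three classes, so the matrices share one shape:

cm_total = np.sum(results['cm'], axis=0)  # element-wise sum over the folds
print cm_total  # rows: true class, columns: predicted class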

Saving the trained classifier

Great for production purposes!

In Python, pickle is the standard mechanism for object serialization; pickling is the common term among Python programmers for serialization (and unpickling for deserialization).
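
A minimal sketch of the mechanism itself, on a toy object rather than the trained model:

import pickle

blob = pickle.dumps({'model': 'LogisticRegression', 'folds': 4})  # serialize to a byte string
print pickle.loads(blob)  # deserialize back into an equal object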

In [43]:
classifier = LogisticRegression()
clf = classifier.fit(X, y) # trained
clf
Out[43]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
In [44]:
from sklearn.externals import joblib
fn_clf = 'sentiment-classifier.pkl'
joblib.dump(clf, fn_clf)
Out[44]:
['sentiment-classifier.pkl',
 'sentiment-classifier.pkl_01.npy',
 'sentiment-classifier.pkl_02.npy',
 'sentiment-classifier.pkl_03.npy']
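
(joblib stores the classifier's large NumPy arrays in separate .npy files next to the main pickle - keep all of these files together, since every one of them is needed to reload the model.)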
In [45]:
clf_loaded = joblib.load(fn_clf)

print 'predictions => %s' % clf_loaded.predict(X)
print 'classifier: %s' % clf_loaded
predictions => [3 1 1 ..., 3 2 1]
classifier: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
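
One production caveat: the loaded classifier expects the exact feature space it was trained on, so the fitted CountVectorizer has to be persisted and reused as well. A minimal sketch (the file name and example tweet are illustrative):

joblib.dump(count_vect, 'sentiment-vectorizer.pkl')  # persist the fitted vectorizer too

vect_loaded = joblib.load('sentiment-vectorizer.pkl')
new_docs = ['I love my new phone!']
X_new = vect_loaded.transform(new_docs)  # transform only - never re-fit on new data
print clf_loaded.predict(X_new)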

The whole code

In [46]:
# load data
data = pd.read_csv('C:/Users/Dell/Documents/GitHub/Presentations/sas2015/data/SemEval-2014.csv', index_col=0)
docs, y = data.document, data.sentiment
# build the Bag-of-Words feature matrix
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english')
X = count_vect.fit_transform(docs)
# evaluate with cross-validation
results = sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
# train on the whole dataset and save the classifier
clf = LogisticRegression().fit(X, y)
joblib.dump(clf, 'sentiment-classifier.pkl')
Out[46]:
['sentiment-classifier.pkl',
 'sentiment-classifier.pkl_01.npy',
 'sentiment-classifier.pkl_02.npy',
 'sentiment-classifier.pkl_03.npy']
In [47]:
print 'Accuracy: %s' % np.mean(results['acc'])
print 'F1-measure: %s' % np.mean(results['f1'])
Accuracy: 0.667995763517
F1-measure: 0.645418740216

What are we doing now?

- Hybrid methods: lexicon-based + supervised learning

- Sentiment lexicon generation (English and Polish) for various product domains

- Sentiment analysis for Polish

API for Polish text analysis (especially sentiment) - coming soon in ~2-3 months; check http://streamlytics.io/

Additional tips

Vectorizer parameters

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency (DF) strictly lower than the given threshold. This value is also called cut-off in the literature.

If float, the parameter represents a proportion of documents; if int, absolute counts.
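
For instance, a sketch of the float variant (the 1% cut-off is purely illustrative):

# keep only terms appearing in at least 1% of all documents
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True,
                             stop_words='english', min_df=0.01)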

In [48]:
min_df=2
In [49]:
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', min_df=min_df)
X = count_vect.fit_transform(docs)
print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=11871 for #documents=6265

max_features : int or None, default=None

Build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

In [50]:
max_features=1000
In [51]:
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', max_features=max_features)
X = count_vect.fit_transform(docs)
print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=1000 for #documents=6265

Check different minimum thresholds (the minimum number of times a term must appear in the dataset)

In [52]:
min_words = [1, 2, 5, 10, 100, 1000]
In [53]:
features_counts = []

for m in min_words:
    docs_fitted = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', min_df=m).fit_transform(docs)
    print '#features=%s for #documents=%s (min_df=%s)' % (docs_fitted.shape[1], docs_fitted.shape[0], m)
    features_counts.append((m, docs_fitted.shape[1]))
#features=74011 for #documents=6265 (min_df=1)
#features=11871 for #documents=6265 (min_df=2)
#features=3735 for #documents=6265 (min_df=5)
#features=1630 for #documents=6265 (min_df=10)
#features=69 for #documents=6265 (min_df=100)
#features=2 for #documents=6265 (min_df=1000)
In [54]:
import matplotlib.pyplot as plt

plt.bar(range(len(features_counts)), [x[1] for x in features_counts], align='center')
plt.xticks(range(len(features_counts)), [x[0] for x in features_counts])
plt.xlabel('min_df')
plt.ylabel('#features')

plt.show()

Use sparse matrices! Why? Time and memory complexity...

In [55]:
X
Out[55]:
<6265x1000 sparse matrix of type '<type 'numpy.int64'>'
	with 44424 stored elements in Compressed Sparse Row format>

The whole dense matrix will be stored in memory - do not do that!
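
To see the memory side of the argument, compare the footprints directly (a rough sketch; the dense array stores every zero explicitly):

X_dense = X.toarray()  # materializes all 6265 x 1000 entries
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print 'sparse: %s bytes' % sparse_bytes  # only the 44424 stored elements plus indices
print 'dense:  %s bytes' % X_dense.nbytes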

In [56]:
%timeit sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
1 loops, best of 3: 904 ms per loop
In [57]:
X_array = X.toarray()
%timeit sentiment_classification(X_array, y, n_folds=4, classifier=LogisticRegression())
1 loops, best of 3: 904 ms per loop