# Sentiment Analysis Based on Machine Learning


# Classify emotion

Emotion classification assigns a text to one of several emotion types: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed with different algorithms, e.g., a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti's emotions lexicon.

# Classify polarity

Polarity classification labels a text as positive or negative. In this case, the classification can be done with a naive Bayes algorithm trained on Janyce Wiebe's subjectivity lexicon.

# NLTK

NLTK is a third-party package that ships with Anaconda (installed by default). http://www.nltk.org/

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In [18]:
import nltk

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

In [19]:
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    # lowercase each word and drop words shorter than three characters
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))
tweets[:2]

Out[19]:
[(['love', 'this', 'car'], 'positive'),
(['this', 'view', 'amazing'], 'positive')]
In [21]:
test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]


# Extracting Features

Next, we collect the list of unique words in the tweets, which serve as the features for classification.

In [23]:
# get the word lists of tweets
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

# get the unique words from the word list
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_tweets(tweets))
' '.join(word_features)

Out[23]:
'amazing about concert excited friend best horrible this enemy great love morning not feel tired forward car looking view the like'

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor.

In [24]:
def extract_features(document):
    document_words = set(document)
    features = {}
    # one boolean feature per vocabulary word: is it in the document?
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
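
To see what such an extractor produces, here is a small self-contained sketch. The three-word `vocabulary` is a made-up stand-in for `word_features`, chosen only to keep the output short, and `extract_features_demo` is a hypothetical name.

```python
# Hypothetical, shortened vocabulary standing in for word_features
vocabulary = ['love', 'car', 'horrible']

def extract_features_demo(document, vocabulary):
    """Map a tokenized document to boolean 'contains(word)' features."""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in vocabulary}

features = extract_features_demo(['love', 'this', 'car'], vocabulary)
print(features)
```

Every document is mapped to the same set of feature names, so words outside the vocabulary (here `this`) leave no trace in the featureset.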

In [30]:
help(nltk.classify.util.apply_features)

Help on function apply_features in module nltk.classify.util:

apply_features(feature_func, toks, labeled=None)
Use the LazyMap class to construct a lazy list-like
object that is analogous to map(feature_func, toks).  In
particular, if labeled=False, then the returned list-like
object's values are equal to::

[feature_func(tok) for tok in toks]

If labeled=True, then the returned list-like object's values
are equal to::

[(feature_func(tok), label) for (tok, label) in toks]

The primary purpose of this function is to avoid the memory
overhead involved in storing all the featuresets for every token
in a corpus.  Instead, these featuresets are constructed lazily,
as-needed.  The reduction in memory overhead can be especially
significant when the underlying list of tokens is itself lazy (as
is the case with many corpus readers).

:param feature_func: The function that will be applied to each
token.  It should return a featureset -- i.e., a dict
mapping feature names to feature values.
:param toks: The list of tokens to which feature_func should be
applied.  If labeled=True, then the list elements will be
passed directly to feature_func().  If labeled=False,
then the list elements should be tuples (tok,label), and
tok will be passed to feature_func().
:param labeled: If true, then toks contains labeled tokens --
i.e., tuples of the form (tok, label).  (Default:
auto-detect based on types.)
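
The two list comprehensions in the help text can be mimicked in plain Python. The sketch below uses generators to get a similar on-demand evaluation; it is not NLTK's LazyMap (which, unlike a generator, also supports indexing), and the helper names are invented for illustration.

```python
def lazy_apply_features(feature_func, toks, labeled):
    """Yield featuresets one at a time instead of building a full list."""
    if labeled:
        # toks holds (tok, label) pairs; keep the label next to the features
        return ((feature_func(tok), label) for (tok, label) in toks)
    # toks holds bare tokens
    return (feature_func(tok) for tok in toks)

def word_count_feature(words):
    # Toy feature function: a single feature holding the number of words
    return {'n_words': len(words)}

labeled_toks = [(['love', 'this', 'car'], 'positive'),
                (['this', 'view', 'horrible'], 'negative')]

labeled_result = list(lazy_apply_features(word_count_feature, labeled_toks, labeled=True))
unlabeled_result = list(lazy_apply_features(word_count_feature, [['great']], labeled=False))
print(labeled_result)
print(unlabeled_result)
```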


In [36]:
training_set[0]

Out[36]:
({'contains(about)': False,
'contains(amazing)': False,
'contains(best)': False,
'contains(car)': True,
'contains(concert)': False,
'contains(enemy)': False,
'contains(excited)': False,
'contains(feel)': False,
'contains(forward)': False,
'contains(friend)': False,
'contains(great)': False,
'contains(horrible)': False,
'contains(like)': False,
'contains(looking)': False,
'contains(love)': True,
'contains(morning)': False,
'contains(not)': False,
'contains(the)': False,
'contains(this)': True,
'contains(tired)': False,
'contains(view)': False},
'positive')
In [25]:
training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [26]:
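# A simplified outline of how the 'train' method is defined in NLTK:
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Count how often each label occurs in the training data
    label_freqdist = nltk.FreqDist(label for (_, label) in labeled_featuresets)
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    # (left empty in this outline; NLTK fills it from per-feature counts)
    feature_probdist = {}
    model = nltk.NaiveBayesClassifier(label_probdist, feature_probdist)
    return model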
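
To make the two distributions in the outline above concrete, here is a minimal from-scratch sketch of a naive Bayes classifier. It assumes that the expected likelihood estimate adds 0.5 to every count; all function names and the toy data are invented for illustration, and this is not NLTK's actual code.

```python
import math
from collections import Counter, defaultdict

def ele_prob(count, total, bins):
    """Expected likelihood estimate: add 0.5 to every count."""
    return (count + 0.5) / (total + 0.5 * bins)

def train_nb(labeled_featuresets):
    label_counts = Counter()
    feature_counts = defaultdict(Counter)   # (label, fname) -> Counter of fval
    feature_values = defaultdict(set)       # fname -> set of observed values
    for featureset, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in featureset.items():
            feature_counts[label, fname][fval] += 1
            feature_values[fname].add(fval)
    return label_counts, feature_counts, feature_values

def classify_nb(model, featureset):
    label_counts, feature_counts, feature_values = model
    n = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # log P(label)
        score = math.log(ele_prob(count, n, len(label_counts)))
        for fname, fval in featureset.items():
            if fname not in feature_values:
                continue  # feature never seen in training: ignore it
            counts = feature_counts[label, fname]
            # log P(fval | label, fname)
            score += math.log(ele_prob(counts[fval], sum(counts.values()),
                                       len(feature_values[fname])))
        scores[label] = score
    # the label with the highest log-probability wins
    return max(scores, key=scores.get)

toy_data = [({'contains(love)': True,  'contains(hate)': False}, 'positive'),
            ({'contains(love)': True,  'contains(hate)': False}, 'positive'),
            ({'contains(love)': False, 'contains(hate)': True},  'negative'),
            ({'contains(love)': False, 'contains(hate)': True},  'negative')]
toy_model = train_nb(toy_data)
prediction = classify_nb(toy_model, {'contains(love)': True, 'contains(hate)': False})
print(prediction)
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.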

In [8]:
tweet_positive = 'Larry is my friend'
classifier.classify(extract_features(tweet_positive.split()))

Out[8]:
'positive'
In [9]:
tweet_negative = 'Larry is not my friend'
classifier.classify(extract_features(tweet_negative.split()))

Out[9]:
'negative'
In [10]:
# Don’t be too positive, let’s try another example:
tweet_negative2 = 'Your song is annoying'
classifier.classify(extract_features(tweet_negative2.split()))

Out[10]:
'positive'
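
One plausible reason for this misclassification is that none of the words in the tweet occur in the training vocabulary, so every contains(...) feature is False and the classifier has nothing tweet-specific to go on. The sketch below checks this; the vocabulary is copied from the Out[23] listing, and `known_tokens` is a hypothetical helper.

```python
# Vocabulary copied from the Out[23] listing
vocabulary = ('amazing about concert excited friend best horrible this enemy '
              'great love morning not feel tired forward car looking view '
              'the like').split()

def known_tokens(text, vocabulary):
    """Return the tokens of `text` that appear in the training vocabulary."""
    vocab = set(vocabulary)
    return [tok for tok in text.split() if tok in vocab]

unknown_case = known_tokens('Your song is annoying', vocabulary)
known_case = known_tokens('Larry is not my friend', vocabulary)
print(unknown_case)
print(known_case)
```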
In [27]:
def classify_tweet(tweet):
    # tweet is an already tokenized list of words;
    # raw text could be tokenized first, e.g. with nltk.word_tokenize(tweet)
    return classifier.classify(extract_features(tweet))

total = accuracy = float(len(test_tweets))

# subtract one for every misclassified test tweet
for tweet in test_tweets:
    if classify_tweet(tweet[0]) != tweet[1]:
        accuracy -= 1

print('Total accuracy: %f%% (%d/%d).' % (accuracy / total * 100, accuracy, total))

Total accuracy: 80.000000% (4/5).


# Using sklearn classifiers

In [12]:
# Which classifiers does nltk provide?
nltk_classifiers = dir(nltk)
for i in nltk_classifiers:
    if 'Classifier' in i:
        print(i)

ClassifierBasedPOSTagger
ClassifierBasedTagger
ClassifierI
ConditionalExponentialClassifier
DecisionTreeClassifier
MaxentClassifier
MultiClassifierI
NaiveBayesClassifier
PositiveNaiveBayesClassifier
SklearnClassifier
WekaClassifier

In [28]:
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
classif = SklearnClassifier(LinearSVC())
svm_classifier = classif.train(training_set)

In [29]:
# Don’t be too positive, let’s try another example:
tweet_negative2 = 'Your song is annoying'
svm_classifier.classify(extract_features(tweet_negative2.split()))

Out[29]:
'negative'

# Recommended reading:

Sentiment analysis with machine learning in R http://chengjun.github.io/en/2014/04/sentiment-analysis-with-machine-learning-in-R/

# Installing textblob

https://github.com/sloria/TextBlob

pip install -U textblob

In [17]:
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
#  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
#            'ultimate movie monster',
#            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...'

0.06000000000000001
-0.34166666666666673

Out[17]:
TextBlob("La amenaza principal de The Blob siempre me ha parecido la mejor película
monstruo: una masa insaciablemente hambrienta, similar a una ameba capaz de penetrar
prácticamente cualquier salvaguardia, capaz de - como un doctor condenado escalofriante
lo describe - "asimilando carne en contacto.
Las malditas comparaciones con la gelatina pueden ser condenadas, es un concepto con la mayor cantidad de
devastador de posibles consecuencias, a diferencia del escenario gris goo
propuesto por teóricos tecnológicos temerosos de
la inteligencia artificial corre desenfrenada.")
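
The polarity values above are floats between -1 and 1. One common way to arrive at such a score, shown here only as a rough sketch and not as TextBlob's actual implementation, is to average per-word scores taken from a sentiment lexicon. The scores in `toy_lexicon` are invented for illustration.

```python
# Invented polarity scores in [-1, 1]; not taken from any real lexicon
toy_lexicon = {'great': 0.8, 'amazing': 0.6, 'horrible': -1.0, 'tired': -0.4}

def lexicon_polarity(sentence, lexicon):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

positive_score = lexicon_polarity('This view is amazing', toy_lexicon)
negative_score = lexicon_polarity('I feel tired and the view is horrible', toy_lexicon)
neutral_score = lexicon_polarity('He is my friend', toy_lexicon)
print(positive_score, negative_score, neutral_score)
```

A plain average like this ignores negation and intensifiers ("not great", "very great"), which is one reason why trained classifiers are explored in the rest of this notebook.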

# Sentiment Analysis Using GraphLab

In this notebook, I will explain how to develop sentiment analysis classifiers that are based on a bag-of-words model. Then, I will demonstrate how these classifiers can be utilized to solve Kaggle's "When Bag of Words Meets Bags of Popcorn" challenge.

## Code Recipe: Creating Sentiment Classifier

With GraphLab it is easy and straightforward to create a sentiment classifier based on the bag-of-words model. Given a dataset stored as a CSV file, you can construct your sentiment classifier using the following code:

In [ ]:
import graphlab as gl
# traindata_path is the path to the labeled TSV file (set in a later cell)
train_data = gl.SFrame.read_csv(traindata_path, header=True,
                                delimiter='\t', quote_char='"',
                                column_type_hints={'id': str,
                                                   'sentiment': int,
                                                   'review': str})
# word (1-gram) and word-pair (2-gram) counts for every review
train_data['1grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'], 1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'], 2)
cls = gl.classifier.create(train_data, target='sentiment',
                           features=['1grams features',
                                     '2grams features'])


In the rest of this notebook, we will explain this code recipe in detail by demonstrating how it can be used to create an IMDB movie reviews sentiment classifier.

## Set up

Before we begin constructing the classifiers, we need to import some Python libraries: graphlab (gl), and IPython display utilities. We also set IPython notebook and GraphLab Canvas to produce plots directly in this notebook.

In [2]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


# Bag of Words Meets Bags of Popcorn

Throughout this notebook, I will use Kaggle's IMDB movie reviews dataset, which is available for download from the following link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data. I downloaded the labeledTrainData.tsv and testData.tsv files and unzipped them to the following local files.

### DeepLearningMovies

Kaggle's competition for using Google's word2vec package for sentiment analysis

https://github.com/wendykan/DeepLearningMovies

In [3]:
traindata_path = "/Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv"
testdata_path = "/Users/datalab/bigdata/kaggle_popcorn_data/testData.tsv"


We will load the IMDB movie reviews into an SFrame using the SFrame.read_csv function.

In [10]:
movies_reviews_data = gl.SFrame.read_csv(traindata_path, header=True,
                                         delimiter='\t', quote_char='"',
                                         column_type_hints={'id': str,
                                                            'sentiment': str,
                                                            'review': str})

Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.331166 secs.
Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 0.687084 secs.

By using the SFrame show function, we can visualize the data and see that the train dataset consists of 12,500 positive and 12,500 negative reviews, with 24,932 unique reviews overall.

In [11]:
movies_reviews_data

Out[11]:
| id | sentiment | review |
|---|---|---|
| 5814_8 | 1 | With all this stuff going down at the moment with ... |
| 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hines ... |
| 7759_3 | 0 | The film starts with a manager (Nicholas Bell) ... |
| 3630_4 | 0 | It must be assumed that those who praised this ... |
| 9495_8 | 1 | Superbly trashy and wondrously unpretentious ... |
| 8196_8 | 1 | I dont know why people think this is such a bad ... |
| 7166_2 | 0 | This movie could have been very good, but c ... |
| 10633_1 | 0 | I watched this video at a |
| 319_1 | 0 | A friend of mine bought this film for £1, and ... |
| 8713_10 | 1 | <br /><br />This movie is full of references. Like ... |
[25000 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

## Constructing Bag-of-Words Classifier

One common technique for document classification (and review classification) is the bag-of-words model, in which the frequency of each word in a document is used as a feature for training a classifier. GraphLab's text analytics toolkit makes these frequencies easy to compute: calling the count_ngrams function with n=1 returns the frequency of each word in each review, as in the following command:

In [12]:
movies_reviews_data['1grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'], 1)


The last command created a new column in the movies_reviews_data SFrame. Each value in this column is a dictionary whose keys are the distinct words that appear in the corresponding review and whose values are the frequencies of those words. We can view the values of this new column using the following command.

In [13]:
movies_reviews_data#[['review','1grams features']]

Out[13]:
| id | sentiment | review | 1grams features |
|---|---|---|---|
| 5814_8 | 1 | With all this stuff going down at the moment with ... | {'all': 4, 'moonwalker': 2, 'just': 3, 'dance' ... |
| 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hines ... | {'being': 2, 'looks': 1, 'cruise': 1, 'its': 1, ... |
| 7759_3 | 0 | The film starts with a manager (Nicholas Bell) ... | {'rating': 1, 'hickox': 1, 'moments': 1, 'john': ... |
| 3630_4 | 0 | It must be assumed that those who praised this ... | {'allowed': 1, 'text': 2, 'altogether': 1, ... |
| 9495_8 | 1 | Superbly trashy and wondrously unpretentious ... | {'impression': 1, 'all': 2, 'just': 1, 'less': 1, ... |
| 8196_8 | 1 | I dont know why people think this is such a bad ... | {'and': 3, 'liked': 2, 'dont': 1, 'gratuitous': ... |
| 7166_2 | 0 | This movie could have been very good, but c ... | {'and': 3, 'this': 4, 'would': 2, 'just': 1, ... |
| 10633_1 | 0 | I watched this video at a | {'rocket': 1, 'money': 1, 'over': 1, 'astronauts': ... |
| 319_1 | 0 | A friend of mine bought this film for £1, and ... | {'all': 1, 'overpriced': 1, 'just': 1, ... |
| 8713_10 | 1 | <br /><br />This movie is full of references. Like ... | {'and': 1, 'one"': 1, 'we\xc2\xb4ll': 1, ... |
[25000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
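
The word-frequency dictionaries in the '1grams features' column can be approximated in a few lines of plain Python. This is only a sketch of the idea: it assumes lowercasing and whitespace tokenization, which may differ from how GraphLab's count_ngrams handles case and punctuation, and `count_ngrams_sketch` is a hypothetical name.

```python
from collections import Counter

def count_ngrams_sketch(text, n=1):
    """Count n-grams of whitespace-separated, lowercased words."""
    words = text.lower().split()
    # slide a window of n words over the text and join each window
    ngrams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return dict(Counter(ngrams))

unigrams = count_ngrams_sketch('the movie was good and the cast was good')
print(unigrams)
```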

We are now ready to construct and evaluate the movie reviews sentiment classifier using the features calculated above. But first, to be able to evaluate the classifier quickly, we need labeled train and test datasets. We create them by randomly splitting the labeled train dataset into two parts: the first contains about 80% of the labeled reviews and is used for training, while the second contains the remaining 20% or so and is used for testing. The following command creates the two datasets:

In [14]:
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
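
A seeded random split can be sketched in plain Python as follows. The sketch assumes that each row is assigned independently with the given probability, so the resulting sizes are only approximately 80/20; that would be consistent with the test set used later in this notebook holding 4,918 rows rather than exactly 5,000, but it remains an assumption about random_split. The helper name is hypothetical.

```python
import random

def random_split_sketch(rows, fraction, seed):
    """Assign each row to the first part with probability `fraction`."""
    rng = random.Random(seed)   # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))
train_part, test_part = random_split_sketch(rows, 0.8, seed=5)
print(len(train_part), len(test_part))
```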


We are now ready to create a classifier using the following command:

In [15]:
model_1 = gl.classifier.create(train_set, target='sentiment',
                               features=['1grams features'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set validation_set=None to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.

WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 19102
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 79525
Number of coefficients    : 79526
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 1.253623     | 0.951052          | 0.843878            |
| 2         | 5        | 1.000000  | 1.440817     | 0.974348          | 0.861224            |
| 3         | 6        | 1.000000  | 1.568855     | 0.992776          | 0.881633            |
| 4         | 7        | 1.000000  | 1.701091     | 0.994608          | 0.879592            |
| 5         | 8        | 1.000000  | 1.829955     | 0.997225          | 0.870408            |
| 6         | 9        | 1.000000  | 1.980522     | 0.998063          | 0.871429            |
| 10        | 13       | 1.000000  | 2.517222     | 1.000000          | 0.859184            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 19102
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 79525
Number of coefficients    : 79526
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 0.232552     | 0.951052          | 0.843878            |
| 2         | 5        | 1.000000  | 0.426794     | 0.979636          | 0.866327            |
| 3         | 6        | 1.000000  | 0.543997     | 0.991467          | 0.868367            |
| 4         | 7        | 1.000000  | 0.660582     | 0.994398          | 0.870408            |
| 5         | 8        | 1.000000  | 0.786014     | 0.996911          | 0.870408            |
| 6         | 9        | 1.000000  | 0.926245     | 0.998429          | 0.872449            |
| 10        | 13       | 1.000000  | 1.426972     | 0.999738          | 0.874490            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.859184
PROGRESS: SVMClassifier                   : 0.87449
PROGRESS: ---------------------------------------------
PROGRESS: Selecting SVMClassifier based on validation set performance.


We can measure the performance of the classifier by evaluating it on the test dataset.

In [16]:
result1 = model_1.evaluate(test_set)


To get an easy view of the classifier's prediction results, we define and use the following function.

In [17]:
def print_statistics(result):
    # single-argument print() calls behave the same in Python 2 and Python 3
    print("*" * 30)
    print("Accuracy        :  %s" % result["accuracy"])
    print("Confusion Matrix:\n%s" % result["confusion_matrix"])

print_statistics(result1)

******************************
Accuracy        :  0.873322488817
Confusion Matrix:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        1        |  378  |
|      1       |        0        |  245  |
|      1       |        1        |  2148 |
|      0       |        0        |  2147 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
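
The reported accuracy can be recomputed from the confusion matrix: it is the share of rows whose predicted label equals the target label. The sketch below copies the four counts from the table above; precision and recall for the positive class are added as an illustration and are not part of the original output.

```python
# Counts copied from the confusion matrix above: (target, predicted) -> count
confusion = {(0, 1): 378, (1, 0): 245, (1, 1): 2148, (0, 0): 2147}

total = sum(confusion.values())
correct = sum(count for (target, predicted), count in confusion.items()
              if target == predicted)
accuracy = float(correct) / total

# Precision and recall for the positive class (label 1)
precision = float(confusion[1, 1]) / (confusion[1, 1] + confusion[0, 1])
recall = float(confusion[1, 1]) / (confusion[1, 1] + confusion[1, 0])

print(total, correct)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```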



As can be seen in the results above, in just a few relatively straightforward lines of code, we have developed a sentiment classifier with an accuracy of about 0.87. Next, we demonstrate how the classifier's accuracy can be improved further.

## Improving The Classifier

One way to improve the movie reviews sentiment classifier is to extract more meaningful features from the reviews. One potentially useful set of additional features is the frequency of every pair of consecutive words in each review. To calculate it we again use GraphLab's count_ngrams function, only this time with n=2, and store the result in a new column named '2grams features'.

In [18]:
movies_reviews_data['2grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'],2)
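
Counting pairs of consecutive words keeps some word-order information that single-word counts lose, for example the difference between "not good" and "good". A minimal plain-Python sketch follows; it assumes lowercasing and whitespace tokenization, which may differ from GraphLab's own tokenization, and the function name is hypothetical.

```python
from collections import Counter

def count_bigrams_sketch(text):
    """Count pairs of consecutive whitespace-separated, lowercased words."""
    words = text.lower().split()
    pairs = zip(words, words[1:])   # (w0, w1), (w1, w2), ...
    return dict(Counter(' '.join(pair) for pair in pairs))

bigrams = count_bigrams_sketch('not good not bad not good')
print(bigrams)
```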

In [19]:
movies_reviews_data

Out[19]:
| id | sentiment | review | 1grams features | 2grams features |
|---|---|---|---|---|
| 5814_8 | 1 | With all this stuff going down at the moment with ... | {'all': 4, 'moonwalker': 2, 'just': 3, 'dance' ... | is': 1, 'started ... |
| 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hines ... | {'being': 2, 'looks': 1, 'cruise': 1, 'its': 1, ... | {'to great': 1, 'different things': 1, ... |
| 7759_3 | 0 | The film starts with a manager (Nicholas Bell) ... | {'rating': 1, 'hickox': 1, 'moments': 1, 'john': ... | {'tourists and': 1, 'the security': 1, 'dangerous ... |
| 3630_4 | 0 | It must be assumed that those who praised this ... | {'allowed': 1, 'text': 2, 'altogether': 1, ... | {'somewhere either': 1, 'an aural': 1, 'also ... |
| 9495_8 | 1 | Superbly trashy and wondrously unpretentious ... | {'impression': 1, 'all': 2, 'just': 1, 'less': 1, ... | {'somewhat give': 1, 'few things': 1, 'hooray t ... |
| 8196_8 | 1 | I dont know why people think this is such a bad ... | {'and': 3, 'liked': 2, 'dont': 1, 'gratuitous': ... | {'4 5': 1, 'action and': 1, 'its a': 1, 'movie ... |
| 7166_2 | 0 | This movie could have been very good, but c ... | {'and': 3, 'this': 4, 'would': 2, 'just': 1, ... | {'movie i': 1, 'this woman': 1, 'it would' ... |
| 10633_1 | 0 | I watched this video at a | {'rocket': 1, 'money': 1, 'over': 1, 'astronauts': ... | {'s house': 1, 'clips of': 1, 'own voice': 1, ... |
| 319_1 | 0 | A friend of mine bought this film for £1, and ... | {'all': 1, 'overpriced': 1, 'just': 1, ... | incredibly': 1, 'and ... |
| 8713_10 | 1 | <br /><br />This movie is full of references. Like ... | {'and': 1, 'one"': 1, 'we\xc2\xb4ll': 1, ... | {'others the': 1, 'wild |
[25000 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

As before, we will construct and evaluate a movie reviews sentiment classifier. This time, however, we will use both the '1grams features' and the '2grams features' columns.

In [20]:
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
model_2 = gl.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'])
result2 = model_2.evaluate(test_set)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set validation_set=None to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.

WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 19029
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1248277
Number of coefficients    : 1248278
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000053  | 0.935864     | 0.999474          | 0.882241            |
| 2         | 5        | 1.000000  | 1.593506     | 0.999947          | 0.881292            |
| 3         | 6        | 1.000000  | 2.058826     | 1.000000          | 0.882241            |
| 4         | 7        | 1.000000  | 2.568989     | 1.000000          | 0.882241            |
| 5         | 8        | 1.000000  | 3.071948     | 1.000000          | 0.883191            |
| 6         | 9        | 1.000000  | 3.509510     | 1.000000          | 0.881292            |
| 10        | 13       | 1.000000  | 5.384164     | 1.000000          | 0.882241            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 19029
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1248277
Number of coefficients    : 1248278
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000053  | 0.885426     | 0.999474          | 0.882241            |
| 2         | 5        | 1.000000  | 1.592887     | 1.000000          | 0.881292            |
| 3         | 6        | 1.000000  | 1.997824     | 1.000000          | 0.881292            |
| 4         | 7        | 1.000000  | 2.394116     | 1.000000          | 0.881292            |
| 5         | 8        | 1.000000  | 2.762796     | 0.004046          | 0.141500            |
| 6         | 10       | 1.000000  | 3.348769     | 1.000000          | 0.881292            |
| 10        | 15       | 1.000000  | 5.180340     | 1.000000          | 0.881292            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.882241
PROGRESS: SVMClassifier                   : 0.881292
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.

In [21]:
print_statistics(result2)

******************************
Accuracy        :  0.877592517283
Confusion Matrix:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        1        |  386  |
|      0       |        0        |  2139 |
|      1       |        1        |  2177 |
|      1       |        0        |  216  |
+--------------+-----------------+-------+
[4 rows x 3 columns]



Indeed, the newly constructed classifier is slightly more accurate, with an accuracy of about 0.878 compared with about 0.873 for the classifier trained on single-word features only.

## Unlabeled Test File

To test how well the presented method works, we will use all the 25,000 labeled IMDB movie reviews in the train dataset to construct a classifier. Afterwards, we will utilize the constructed classifier to predict sentiment for each review in the unlabeled dataset. Lastly, we will create a submission file according to Kaggle's guidelines and submit it.

In [22]:
traindata_path = "/Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv"
testdata_path = "/Users/datalab/bigdata/kaggle_popcorn_data/testData.tsv"
# creating classifier using all 25,000 reviews
train_data = gl.SFrame.read_csv(traindata_path, header=True, delimiter='\t', quote_char='"',
                                column_type_hints={'id': str, 'sentiment': int, 'review': str})
train_data['1grams features'] = gl.text_analytics.count_ngrams(train_data['review'], 1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(train_data['review'], 2)

cls = gl.classifier.create(train_data, target='sentiment', features=['1grams features', '2grams features'])
# creating the test dataset
test_data = gl.SFrame.read_csv(testdata_path, header=True, delimiter='\t', quote_char='"',
                               column_type_hints={'id': str, 'review': str})
test_data['1grams features'] = gl.text_analytics.count_ngrams(test_data['review'], 1)
test_data['2grams features'] = gl.text_analytics.count_ngrams(test_data['review'], 2)

# predicting the sentiment of each review in the test dataset
test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)

# saving the prediction to a CSV for submission
test_data[['id', 'sentiment']].save("/Users/datalab/bigdata/kaggle_popcorn_data/predictions.csv", format="csv")

Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.322891 secs.
Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 0.685283 secs.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set validation_set=None to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.

WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 23821
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1458421
Number of coefficients    : 1458422
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 1.131870     | 0.999118          | 0.887193            |
| 2         | 5        | 1.000000  | 1.900255     | 0.999916          | 0.888041            |
| 3         | 6        | 1.000000  | 2.396344     | 0.999958          | 0.888041            |
| 4         | 7        | 1.000000  | 2.875342     | 0.999958          | 0.885496            |
| 5         | 8        | 1.000000  | 3.352443     | 1.000000          | 0.885496            |
| 6         | 9        | 1.000000  | 3.834838     | 1.000000          | 0.885496            |
| 10        | 13       | 1.000000  | 5.783333     | 1.000000          | 0.887193            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 23821
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1458421
Number of coefficients    : 1458422
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 1.143198     | 0.999118          | 0.887193            |
| 2         | 5        | 1.000000  | 1.981716     | 0.999916          | 0.887193            |
| 3         | 6        | 1.000000  | 2.471168     | 0.999958          | 0.887193            |
| 4         | 7        | 1.000000  | 2.982287     | 0.999958          | 0.888041            |
| 5         | 8        | 1.000000  | 3.534177     | 0.478359          | 0.533503            |
| 6         | 10       | 1.000000  | 4.354360     | 0.999916          | 0.888041            |
| 10        | 15       | 1.000000  | 6.709526     | 0.983460          | 0.860899            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing max_iterations.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.887193
PROGRESS: SVMClassifier                   : 0.860899
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.

Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/testData.tsv
Parsing completed. Parsed 100 lines in 0.400391 secs.
Finished parsing file /Users/datalab/bigdata/kaggle_popcorn_data/testData.tsv
Parsing completed. Parsed 25000 lines in 0.722877 secs.
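
The final saving step produces a two-column file with an id and a sentiment column. For readers without GraphLab, the same layout can be sketched with Python's standard csv module. The review ids and labels below are invented, the helper name is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_submission(rows, handle):
    """Write (id, sentiment) pairs as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(['id', 'sentiment'])
    writer.writerows(rows)

# Hypothetical predictions: review id and predicted label (1 = positive)
predictions = [('review_a', 1), ('review_b', 0)]

buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```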

We then submitted the predictions.csv file to the Kaggle challenge website and scored an AUC of about 0.88.