基于机器学习的情感分析



王成军

[email protected]

计算传播网 http://computational-communication.com

情感分析(sentiment analysis)和意见挖掘(opinion mining)虽然相关,但是从社会科学的角度而言,二者截然不同。这里主要是讲情感分析(sentiment or emotion),而非意见挖掘(opinion, 后者通过机器学习效果更可信)。

classify emotion

Different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using different algorithms: e.g., naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.

classify polarity

To classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon.

Preparing the data

NLTK

Anaconda自带的(默认安装的)第三方包。http://www.nltk.org/

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In [1]:
import nltk

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

neg_tweets = [('I do not like this car', 'negative'),
    ('This view is horrible', 'negative'),
    ('I feel tired this morning', 'negative'),
    ('I am not looking forward to the concert', 'negative'),
    ('He is my enemy', 'negative')]
In [2]:
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))
tweets[:2]
Out[2]:
[(['love', 'this', 'car'], 'positive'),
 (['this', 'view', 'amazing'], 'positive')]
In [3]:
test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]

Extracting Features

Then we need to get the unique word list as the features for classification.

In [5]:
# get the word lists of tweets
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

# get the unique word from the word list	
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_tweets(tweets))
' '.join(word_features)
Out[5]:
'forward great like love concert tired this car about morning looking feel amazing friend horrible not the enemy excited best view'

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor.

In [6]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
In [7]:
training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)
In [9]:
# You may want to know how to define the ‘train’ method in NLTK here:
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    model = NaiveBayesClassifier(label_probdist, feature_probdist)
    return model
In [11]:
tweet_positive = 'Larry is my friend'
print classifier.classify(extract_features(tweet_positive.split()))
positive
In [12]:
tweet_negative = 'Larry is not my friend'
print classifier.classify(extract_features(tweet_negative.split()))
negative
In [13]:
# Don’t be too positive, let’s try another example:
tweet_negative2 = 'Your song is annoying'
print classifier.classify(extract_features(tweet_negative2.split()))
positive
In [15]:
def classify_tweet(tweet):
    return classifier.classify(extract_features(tweet)) 
    # nltk.word_tokenize(tweet)

total = accuracy = float(len(test_tweets))

for tweet in test_tweets:
    if classify_tweet(tweet[0]) != tweet[1]:
        accuracy -= 1

print('Total accuracy: %f%% (%d/20).' % (accuracy / total * 100, accuracy))
Total accuracy: 80.000000% (4/20).

使用sklearn的分类器

In [18]:
# nltk有哪些分类器呢?
nltk_classifiers = dir(nltk)
for i in nltk_classifiers:
    if 'Classifier' in i:
        print i
ClassifierBasedPOSTagger
ClassifierBasedTagger
ClassifierI
ConditionalExponentialClassifier
DecisionTreeClassifier
MaxentClassifier
MultiClassifierI
NaiveBayesClassifier
PositiveNaiveBayesClassifier
SklearnClassifier
WekaClassifier
In [22]:
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
classif = SklearnClassifier(LinearSVC())
svm_classifier = classif.train(training_set)
In [23]:
# Don’t be too positive, let’s try another example:
tweet_negative2 = 'Your song is annoying'
print svm_classifier.classify(extract_features(tweet_negative2.split()))
negative

作业1:

使用另外一种sklearn的分类器来对tweet_negative2进行情感分析

作业2:

使用https://github.com/victorneo/Twitter-Sentimental-Analysis 所提供的推特数据进行情感分析,可以使用其代码 https://github.com/victorneo/Twitter-Sentimental-Analysis/blob/master/classification.py

Sentiment Analysis using TextBlob

安装textblob

https://github.com/sloria/TextBlob

pip install -U textblob

python -m textblob.download_corpora

In [ ]:
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...'

Sentiment Analysis Using GraphLab

In this notebook, I will explain how to develop sentiment analysis classifiers that are based on a bag-of-words model. Then, I will demonstrate how these classifiers can be utilized to solve Kaggle's "When Bag of Words Meets Bags of Popcorn" challenge.

Code Recipe: Creating Sentiment Classifier

Using GraphLab it is very easy and straight foward to create a sentiment classifier based on bag-of-words model. Given a dataset stored as a CSV file, you can construct your sentiment classifier using the following code:

In [ ]:
import graphlab as gl
train_data = gl.SFrame.read_csv(traindata_path,header=True, 
                                delimiter='\t',quote_char='"', 
                                column_type_hints = {'id':str, 
                                                     'sentiment' : int, 
                                                     'review':str } )
train_data['1grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'],1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'],2)
cls = gl.classifier.create(train_data, target='sentiment', 
                           features=['1grams features','2grams features'])

In the rest of this notebook, we will explain this code recipe in details, by demonstrating how this recipe can used to create IMDB movie reviews sentiment classifier.

Set up

Before we begin constructing the classifiers, we need to import some Python libraries: graphlab (gl), and IPython display utilities. We also set IPython notebook and GraphLab Canvas to produce plots directly in this notebook.

In [3]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')

IMDB movies reviews Dataset

Bag of Words Meets Bags of Popcorn

Throughout this notebook, I will use Kaggle's IMDB movies reviews datasets that is available to download from the following link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data. I downloaded labeledTrainData.tsv and testData.tsv files, and unzipped them to the following local files.

DeepLearningMovies

Kaggle's competition for using Google's word2vec package for sentiment analysis

https://github.com/wendykan/DeepLearningMovies

In [8]:
traindata_path = "/Users/chengjun/bigdata/kaggle_popcorn_data/labeledTrainData.tsv"
testdata_path = "/Users/chengjun/bigdata/kaggle_popcorn_data/testData.tsv"

Loading Data

We will load the data with IMDB movie reviews to an SFrame using SFrame.read_csv function.

In [6]:
movies_reviews_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', 
                                         column_type_hints = {'id':str, 'sentiment' : str, 'review':str } )
Finished parsing file /Users/chengjun/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.361192 secs.
Finished parsing file /Users/chengjun/bigdata/kaggle_popcorn_data/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 0.694849 secs.

By using the SFrame show function, we can visualize the data and notice that the train dataset consists of 12,500 positive and 12,500 negative, and overall 24,932 unique reviews.

In [7]:
movies_reviews_data.show()