Text Classification with Python

  • nltk, TextBlob, scikit-learn, etc.
  • [Ref]: Perkins 2010: chapter 7

Big issues:

  1. How can we identify patterns/features of linguistic data that are salient for classifying it?
  2. How can we build predictive models to perform NLP tasks automatically?
  3. What can we learn about language from these computational scenarios?
  • Text classification is the NLP task of categorizing a text by deciding which class label to assign to it.

    • A binary classifier decides between two labels (such as positive or negative in sentiment analysis), whereas a multi-label classifier can assign one or more labels to a piece of text.
  • How does it work?

    • The classifier learns from labeled feature sets (i.e., training data) so that it can later classify an unlabeled feature set.
    • The labels can be either predefined (out of linguists' hands) or automatically extracted.
    • A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True.

      Feature extractors are built through a process of trial and error, guided by intuitions about what information is relevant to the problem.

Bag-of-words Approach

  • The bag-of-words model is the simplest: it constructs a word-presence feature set.
  • Feature extraction is the process of transforming a list of words into a feature set that is usable by a classifier.
  • NLTK classifiers expect dict-style feature sets, so we need to transform the text into a dict.
In [4]:
# First, write a feature extractor (the following is taken from the nltk-trainer package)
# download featx.py (written by Perkins)

import math
from nltk import probability

def bag_of_words(words):
    # map every word to True: a simple word-presence feature set
    return dict([(word, True) for word in words])

def bag_of_words_in_set(words, wordset):
    # keep only the words that also appear in wordset
    return bag_of_words(set(words) & wordset)

def word_counts(words):
    # map every word to its frequency instead of True
    return dict(probability.FreqDist(w for w in words))

def word_counts_in_set(words, wordset):
    # count only the words that appear in wordset
    return word_counts(w for w in words if w in wordset)

def train_test_feats(label, instances, featx=bag_of_words, fraction=0.75):
    # label each instance's feature set, then split into train/test at `fraction`
    labeled_instances = [(featx(i), label) for i in instances]

    if fraction != 1.0:
        cutoff = int(math.ceil(len(labeled_instances) * fraction))
        return labeled_instances[:cutoff], labeled_instances[cutoff:]
    else:
        return labeled_instances, labeled_instances
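
The counting and splitting helpers are not used again in this notebook; here is a quick sanity check of what they return (a sketch using the functions above, with expected values as comments):

word_counts(['to', 'be', 'or', 'not', 'to', 'be'])
# {'to': 2, 'be': 2, 'not': 1, 'or': 1}  (key order may vary)

train, test = train_test_feats('pos', [['good'], ['great'], ['loved'], ['fine']])
len(train), len(test)
# (3, 1) -- with fraction=0.75, the first three instances go to training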
In [57]:
import nltk
bag_of_words(['this', 'is', 'awesome'])
Out[57]:
{'awesome': True, 'is': True, 'this': True}
In [21]:
def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

bag_of_words_not_in_set(['this','is','awesome'],['this'])
Out[21]:
{'awesome': True, 'is': True}
In [22]:
from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)

bag_of_non_stopwords(['this','is','awesome'])
Out[22]:
{'awesome': True}

Bigrams

  • It is sometimes useful to add significant bigrams to the bag-of-words model (a sketch of the extractor follows the example below).
In [58]:
from featx import bag_of_bigrams_words
bag_of_bigrams_words(['this','is','an','incredible','place'])
Out[58]:
{'an': True,
 'incredible': True,
 'is': True,
 'place': True,
 'this': True,
 ('an', 'incredible'): True,
 ('incredible', 'place'): True,
 ('is', 'an'): True,
 ('this', 'is'): True}
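
bag_of_bigrams_words() also comes from Perkins' featx.py. It scores candidate bigrams with a collocation measure, keeps the top n, and adds them to the unigram features; roughly (a sketch of that function, with the defaults assumed):

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # find the n highest-scoring bigrams, then treat each bigram tuple as a feature
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)

This is why the output above contains both single words and ('an', 'incredible')-style tuples as keys.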

Training a (naive Bayes) Classifier

  • Once we have extracted features from text, we can train a classifier.
  • The easiest classifier to get started with is NaiveBayesClassifier.
    • It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. Recall the formula (a toy numeric illustration follows this list):
      P(label|features) = P(label) * P(features|label) / P(features)
  • Corpus: movie reviews corpus
    • each file in the corpus is a single movie review, labeled either positive or negative.
    • let's try sentiment analysis.
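
As a toy illustration of the formula with made-up numbers (not real corpus statistics):

# made-up probabilities for a single feature 'great' (illustration only)
p_pos = 0.5                  # P(pos): half the reviews are positive
p_great_given_pos = 0.10     # P('great' | pos)
p_great_given_neg = 0.02     # P('great' | neg)

# P('great') by total probability, then Bayes' theorem
p_great = p_pos * p_great_given_pos + (1 - p_pos) * p_great_given_neg
print(p_pos * p_great_given_pos / p_great)   # P(pos | 'great') = 0.8333...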
In [44]:
from nltk.corpus import movie_reviews
from featx import label_feats_from_corpus, split_label_feats
movie_reviews.categories()
Out[44]:
['neg', 'pos']
  • the label_feats_from_corpus() function takes a corpus and a feature_detector function, which is bag_of_words() by default; a sketch of it follows.
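
featx.py is external to this notebook; Perkins' label_feats_from_corpus() is roughly the following (a sketch; details may differ from the distributed file):

import collections

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
    # map each corpus category to a list of per-file feature sets
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats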
In [60]:
lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()
Out[60]:
['neg', 'pos']
  • Once we get a mapping of label:feature sets, as shown in lfeats:
    defaultdict(<type 'list'>, {'neg': [{'all': True, 'concept': True, 'skip': True, 'go': True, 'seemed': True, 'suits': True, 'presents': True, 'to': True, 'sitting': True, 'very': True, 'horror': True, 'continues': True, 'every': True, 'exact': True, 'cool': True, 'entire': True, 'did': True, 'dig': True, 'flick': True, 'neighborhood': True, 'crow': True, 'street': True, 'video': True, 'further': True,.............
    we need to split the data into a training set and a testing set (a sketch of the splitting function is shown below).
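
Perkins' split_label_feats() keeps the first 75% of each label's feature sets for training, roughly (again a sketch; details may differ):

def split_label_feats(lfeats, split=0.75):
    # the first `split` fraction of each label's feature sets goes to training
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats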
In [61]:
# (split = 0.75) by default

train_feats, test_feats = split_label_feats(lfeats)
len(train_feats)
Out[61]:
1500
In [47]:
len(test_feats)
Out[47]:
500

So there are 1,000 pos files, 1,000 neg files, and we end up with 1,500 labeled training instances and 500 labeled testing instances. Now we can train a NaiveBayesClassifier using its train() method.

In [49]:
from nltk.classify import NaiveBayesClassifier
nb_classifier = NaiveBayesClassifier.train(train_feats)
nb_classifier.labels()
Out[49]:
['neg', 'pos']

Once trained, let's test the classifier on some made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the bag_of_words() feature detector on a made-up list of words to get the feature set.

In [50]:
negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
nb_classifier.classify(negfeat)
Out[50]:
'neg'
In [51]:
posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
nb_classifier.classify(posfeat)
Out[51]:
'pos'
  • We can test the accuracy of the classifier
In [52]:
from nltk.classify.util import accuracy
accuracy(nb_classifier, test_feats)
Out[52]:
0.728
  • To get the classification probability of each label, you can use the prob_classify() method.
In [53]:
probs = nb_classifier.prob_classify(test_feats[0][0])
probs.samples()
# ['neg','pos']
probs.max()
#'pos'
probs.prob('pos')
#0.99999996464309127
probs.prob('neg')
#3.5356889692409258e-08
Out[53]:
3.535688969240926e-08
  • The most_informative_features() method returns a list of the form [(feature name, feature value)], ordered from most informative to least. In this case, though, the feature value will always be True.
In [54]:
nb_classifier.most_informative_features(n=5)
Out[54]:
[('magnificent', True),
 ('outstanding', True),
 ('insulting', True),
 ('vulnerable', True),
 ('ludicrous', True)]
  • The show_most_informative_features() method will print out the results and include the probability of a feature pair belonging to each label.
In [62]:
nb_classifier.show_most_informative_features(n=5)
Most Informative Features
             outstanding = True              pos : neg    =     12.0 : 1.0
             magnificent = True              pos : neg    =     11.5 : 1.0
               ludicrous = True              neg : pos    =     10.0 : 1.0
               insulting = True              neg : pos    =     10.0 : 1.0
              vulnerable = True              pos : neg    =      9.5 : 1.0
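  • We can also retrain with a different probability estimator. LaplaceProbDist applies add-one (Laplace) smoothing; here it slightly lowers accuracy: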
In [55]:
from nltk.probability import LaplaceProbDist
nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist)
accuracy(nb_classifier, test_feats)
Out[55]:
0.716

Text Classification with TextBlob

  • how to use TextBlob to create your own text classification systems.
    • Source: http://www.stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/
  • assumes that you have TextBlob >= 0.6.0 and nltk >= 2.0 installed.

    • pip install -U textblob nltk
  • download the necessary NLTK corpora with one command:

    • curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python

A Tweet Sentiment Analyzer

Our first classifier will be a simple sentiment analyzer trained on a small dataset of fake tweets. To begin, we’ll import the text.classifiers module and create some training and test data.

In [23]:
from text.classifiers import NaiveBayesClassifier
In [24]:
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

We create a new classifier by passing training data into the constructor for a NaiveBayesClassifier.

In [25]:
cl = NaiveBayesClassifier(train)

We can now classify arbitrary text using the NaiveBayesClassifier.classify(text) method.

In [26]:
cl.classify("Their burgers are amazing")  # "pos"
cl.classify("I don't like their pizza.")  # "neg"
Out[26]:
'neg'
  • Another way to classify strings of text is to use TextBlob objects. You can pass classifiers into the constructor of a TextBlob.
In [27]:
from text.blob import TextBlob
blob = TextBlob("The beer was amazing. "
                "But the hangover was horrible. My boss was not happy.", 
                classifier=cl)

You can then call the classify() method on the blob.

In [28]:
blob.classify()  # "neg"
Out[28]:
'neg'

You can also take advantage of TextBlob’s sentence tokenization and classify each sentence individually.

In [29]:
for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())
# "pos", "neg", "neg"
The beer was amazing.
pos
But the hangover was horrible.
neg
My boss was not happy.
neg

Let’s check the accuracy on the test set.

In [30]:
cl.accuracy(test)  
Out[30]:
0.8333333333333334

We can also find the most informative features, which indicate that tweets containing the word “my” but not containing the word “place” tend to be negative.

In [31]:
cl.show_informative_features(5)
Most Informative Features
            contains(my) = True              neg : pos    =      1.7 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
            contains(my) = False             pos : neg    =      1.3 : 1.0
         contains(place) = False             neg : pos    =      1.2 : 1.0
            contains(of) = False             pos : neg    =      1.2 : 1.0
  • We can improve our classifier by adding more training and test data. Here we’ll add data from the movie review corpus which was downloaded with NLTK.
In [12]:
import random
from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]

# note: without random.shuffle(reviews), these slices contain only 'neg' reviews
# (the corpus lists all 'neg' files first); the full script below shuffles first
new_train, new_test = reviews[0:100], reviews[101:200]

Let’s see what one of these documents looks like.

In [13]:
print(new_train[0])
(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of', 'making', 'all', 'types', 'of', 'films', ',', 'and', 'these', 'folks', 'just', 'didn', "'", 't', 'snag', 'this', 'one', 'correctly', '.', 'they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.', 'so', 'what', 'are', 'the', 'problems', 'with', 'the', 'movie', '?', 'well', ',', 'its', 'main', 'problem', 'is', 'that', 'it', "'", 's', 'simply', 'too', 'jumbled', '.', 'it', 'starts', 'off', '"', 'normal', '"', 'but', 'then', 'downshifts', 'into', 'this', '"', 'fantasy', '"', 'world', 'in', 'which', 'you', ',', 'as', 'an', 'audience', 'member', ',', 'have', 'no', 'idea', 'what', "'", 's', 'going', 'on', '.', 'there', 'are', 'dreams', ',', 'there', 'are', 'characters', 'coming', 'back', 'from', 'the', 'dead', ',', 'there', 'are', 'others', 'who', 'look', 'like', 'the', 'dead', ',', 'there', 'are', 'strange', 'apparitions', ',', 'there', 'are', 'disappearances', ',', 'there', 'are', 'a', 'looooot', 'of', 'chase', 'scenes', ',', 'there', 'are', 'tons', 'of', 'weird', 'things', 'that', 'happen', ',', 'and', 'most', 'of', 'it', 'is', 'simply', 'not', 'explained', '.', 'now', 'i', 'personally', 'don', "'", 't', 'mind', 'trying', 'to', 'unravel', 'a', 'film', 'every', 'now', 'and', 'then', ',', 'but', 'when', 'all', 'it', 'does', 'is', 'give', 'me', 'the', 'same', 'clue', 'over', 'and', 'over', 'again', ',', 'i', 'get', 'kind', 'of', 'fed', 'up', 'after', 'a', 'while', ',', 'which', 'is', 'this', 'film', "'", 's', 'biggest', 'problem', '.', 'it', "'", 's', 'obviously', 'got', 'this', 'big', 'secret', 'to', 'hide', ',', 'but', 'it', 'seems', 'to', 'want', 'to', 'hide', 'it', 'completely', 'until', 'its', 'final', 'five', 'minutes', '.', 'and', 'do', 'they', 'make', 'things', 'entertaining', ',', 'thrilling', 'or', 'even', 'engaging', ',', 'in', 'the', 'meantime', '?', 'not', 'really', '.', 'the', 'sad', 'part', 'is', 'that', 'the', 'arrow', 'and', 'i', 'both', 'dig', 'on', 'flicks', 'like', 'this', ',', 'so', 'we', 'actually', 'figured', 'most', 'of', 'it', 'out', 'by', 'the', 'half', '-', 'way', 'point', ',', 'so', 'all', 'of', 'the', 'strangeness', 'after', 'that', 'did', 'start', 'to', 'make', 'a', 'little', 'bit', 'of', 'sense', ',', 'but', 'it', 'still', 'didn', "'", 't', 'the', 'make', 'the', 'film', 'all', 'that', 'more', 'entertaining', '.', 'i', 'guess', 'the', 'bottom', 'line', 'with', 'movies', 'like', 'this', 'is', 'that', 'you', 'should', 'always', 'make', 'sure', 'that', 'the', 'audience', 'is', '"', 'into', 'it', '"', 
'even', 'before', 'they', 'are', 'given', 'the', 'secret', 'password', 'to', 'enter', 'your', 'world', 'of', 'understanding', '.', 'i', 'mean', ',', 'showing', 'melissa', 'sagemiller', 'running', 'away', 'from', 'visions', 'for', 'about', '20', 'minutes', 'throughout', 'the', 'movie', 'is', 'just', 'plain', 'lazy', '!', '!', 'okay', ',', 'we', 'get', 'it', '.', '.', '.', 'there', 'are', 'people', 'chasing', 'her', 'and', 'we', 'don', "'", 't', 'know', 'who', 'they', 'are', '.', 'do', 'we', 'really', 'need', 'to', 'see', 'it', 'over', 'and', 'over', 'again', '?', 'how', 'about', 'giving', 'us', 'different', 'scenes', 'offering', 'further', 'insight', 'into', 'all', 'of', 'the', 'strangeness', 'going', 'down', 'in', 'the', 'movie', '?', 'apparently', ',', 'the', 'studio', 'took', 'this', 'film', 'away', 'from', 'its', 'director', 'and', 'chopped', 'it', 'up', 'themselves', ',', 'and', 'it', 'shows', '.', 'there', 'might', "'", 've', 'been', 'a', 'pretty', 'decent', 'teen', 'mind', '-', 'fuck', 'movie', 'in', 'here', 'somewhere', ',', 'but', 'i', 'guess', '"', 'the', 'suits', '"', 'decided', 'that', 'turning', 'it', 'into', 'a', 'music', 'video', 'with', 'little', 'edge', ',', 'would', 'make', 'more', 'sense', '.', 'the', 'actors', 'are', 'pretty', 'good', 'for', 'the', 'most', 'part', ',', 'although', 'wes', 'bentley', 'just', 'seemed', 'to', 'be', 'playing', 'the', 'exact', 'same', 'character', 'that', 'he', 'did', 'in', 'american', 'beauty', ',', 'only', 'in', 'a', 'new', 'neighborhood', '.', 'but', 'my', 'biggest', 'kudos', 'go', 'out', 'to', 'sagemiller', ',', 'who', 'holds', 'her', 'own', 'throughout', 'the', 'entire', 'film', ',', 'and', 'actually', 'has', 'you', 'feeling', 'her', 'character', "'", 's', 'unraveling', '.', 'overall', ',', 'the', 'film', 'doesn', "'", 't', 'stick', 'because', 'it', 'doesn', "'", 't', 'entertain', ',', 'it', "'", 's', 'confusing', ',', 'it', 'rarely', 'excites', 'and', 'it', 'feels', 'pretty', 'redundant', 'for', 'most', 'of', 'its', 'runtime', ',', 'despite', 'a', 'pretty', 'cool', 'ending', 'and', 'explanation', 'to', 'all', 'of', 'the', 'craziness', 'that', 'came', 'before', 'it', '.', 'oh', ',', 'and', 'by', 'the', 'way', ',', 'this', 'is', 'not', 'a', 'horror', 'or', 'teen', 'slasher', 'flick', '.', '.', '.', 'it', "'", 's', 'just', 'packaged', 'to', 'look', 'that', 'way', 'because', 'someone', 'is', 'apparently', 'assuming', 'that', 'the', 'genre', 'is', 'still', 'hot', 'with', 'the', 'kids', '.', 'it', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', 'where', "'", 's', 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7', '/', '10', ')', '-', 'blair', 'witch', '2', '(', '7', '/', '10', ')', '-', 'the', 'crow', '(', '9', '/', '10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4', '/', '10', ')', '-', 'lost', 'highway', '(', '10', '/', '10', ')', '-', 'memento', '(', '10', '/', '10', ')', '-', 'the', 'others', '(', '9', '/', '10', ')', '-', 'stir', 'of', 'echoes', '(', '8', '/', '10', ')'], 'neg')

Notice that unlike the earlier tweet data, the text comes as a list of words instead of a single string. TextBlob is smart about this; it will treat both forms of data as expected.

  • We can now update our classifier with the new training data using the update(new_data) method, as well as test it using the larger test dataset.
In [14]:
cl.update(new_train)
accuracy = cl.accuracy(test + new_test)    
  • Here’s the full, updated script:
In [16]:
import random
from nltk.corpus import movie_reviews
from text.classifiers import NaiveBayesClassifier
random.seed(1)
 
train = [
('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')
]
test = [
('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
 
cl = NaiveBayesClassifier(train)
 
# Grab some movie review data
reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)
new_train, new_test = reviews[0:100], reviews[101:200]
 
# Update the classifier with the new training data
cl.update(new_train)
 
# Compute accuracy
accuracy = cl.accuracy(test + new_test)
print("Accuracy: {0}".format(accuracy))
 
# Show 5 most informative features
cl.show_informative_features(5)
Accuracy: 0.742857142857
Most Informative Features
          contains(none) = True              neg : pos    =      7.4 : 1.0
       contains(attempt) = True              neg : pos    =      6.8 : 1.0
          contains(read) = True              neg : pos    =      6.8 : 1.0
        contains(career) = True              neg : pos    =      6.1 : 1.0
          contains(cool) = True              neg : pos    =      6.1 : 1.0

Advanced:

  • Using Pandas, scikit-learn (a minimal scikit-learn sketch follows)
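
For instance, a minimal scikit-learn sketch of the same bag-of-words Naive Bayes idea (an illustration only, not part of the course code; CountVectorizer(binary=True) plays the role of bag_of_words()):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ['I love this sandwich.', 'My boss is horrible.']
train_labels = ['pos', 'neg']

# binary word-presence features + multinomial Naive Bayes
clf = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(['I love this view']))  # likely ['pos'], given the word overlap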

Regarding the term project, you may want to check this
