Genre Classification with a Bag-of-Words Model

This notebook demonstrates how the genre of a text can be automatically predicted from its word counts.

The texts are 19th-century English novels from Project Gutenberg. The selection is based on the successful novels (those with high download counts) used in Ashok et al. (2013), Success with Style.

First, some preliminaries: import the external libraries we will use. Most importantly, we rely on scikit-learn for the machine learning; see http://scikit-learn.org/

In [1]:
import io, os, itertools
import numpy, pandas
import matplotlib.pyplot as plt
from sklearn import feature_extraction, preprocessing, decomposition, cross_validation, svm, metrics

# Tweak how tables are displayed
pandas.set_option('display.precision', 4)
pandas.set_option('display.max_colwidth', 30)
pandas.set_option('display.colheader_justify', 'left')

Load the texts and convert to a bag-of-words table

The bag-of-words (BOW) table will contain, for each text, a count for every word.

All letters are converted to lower case, and we keep only words consisting of 3 or more alphabetic characters or hyphens (no digits or other punctuation).
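
To see what this tokenization keeps, the vectorizer's token pattern (the same one used in the code below) can be tried out directly with Python's re module:

import re

# Words of 3+ letters (or hyphens) are kept; digits and punctuation are not.
print(re.findall(r'\b[-A-Za-z]{3,}\b', 'It was 1831, a dark and stormy night!'))
# ['was', 'dark', 'and', 'stormy', 'night']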

Important parameters are max_features, min_df, and max_df.

  • max_features limits the model to the top n most frequent words. More features often give better results, but take longer to compute.
  • min_df and max_df restrict the words to those that occur in a certain proportion of texts.

For example, setting these to 0.2 and 0.8, respectively, restricts the model to words that occur in at least 20% and at most 80% of the texts (the code below uses 0.1 and 0.5). This removes rare words on the one hand, and ignores highly frequent words such as function words on the other.

  • use_idf=True turns on tf-idf weighting; this incorporates not only the total frequency of a word, but also the number of texts in which it appears. This means that words which are frequent in one text but not in others will get a high score, while words that are frequent in all documents receive a lower score.
  • sublinear_tf=True is another variation: instead of the raw frequency tf, it uses 1 + log(tf), which dampens the influence of highly frequent words.
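
As a rough illustration of the last two options, the weight of a single word is computed as follows; a minimal sketch with made-up counts, assuming scikit-learn's default smoothed idf. Note that TfidfVectorizer additionally normalizes each document vector to unit length.

import math

n_docs = 100   # made-up: total number of texts in the collection
df = 20        # made-up: number of texts containing the word
tf = 7         # made-up: raw count of the word in one text

idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0   # use_idf=True (smoothed)
tf_weight = 1.0 + math.log(tf)                      # sublinear_tf=True
print(tf_weight * idf)   # weight before per-document L2 normalization
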
In [2]:
# Set up a simple BOW model
vectorizer = feature_extraction.text.TfidfVectorizer(
        lowercase=True, token_pattern=r'\b[-A-Za-z]{3,}\b',
        min_df=0.1, max_df=0.5, max_features=5000,
        use_idf=True, sublinear_tf=True)

# Get a list of all filenames in the 'train/' folder,
# and add their text to the BOW table 'X'.
filenames = os.listdir('train/')
X = vectorizer.fit_transform((io.open('train/' + a, encoding='utf8').read() for a in filenames))
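
As a quick sanity check, X is a sparse matrix with one row per text and one column for each word that the vectorizer kept (at most max_features columns):

print(X.shape)                               # (number of texts, number of words)
print(len(vectorizer.get_feature_names()))   # size of the vocabulary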

Inspect the vector representation

To see what the vector representation looks like, we transform a simple example sentence into its BOW representation.

Notice that, because of the parameters chosen above, many words are ignored.

In [3]:
vec = vectorizer.transform(['It was a dark and stormy night; '
                            'the rain fell in torrents — except at occasional intervals, '
                            'when it was checked by a violent gust of wind which swept '
                            'up the streets (for it is in London that our scene lies), '
                            'rattling along the housetops, and fiercely agitating the scanty '
                            'flame of the lamps that struggled against the darkness.'])

# we get back a large vector with a value for each possible word.
# show a table with only the non-zero items in the vector:
feature_names = vectorizer.get_feature_names()
pandas.DataFrame([(feature_names[b], (a, b), vec[a, b])
                        for a, b in zip(*vec.nonzero())],
                       columns=['word', 'index', 'weight'])
Out[3]:
word index weight
0 stormy (0, 4269) 0.440
1 scanty (0, 3877) 0.468
2 rattling (0, 3539) 0.418
3 occasional (0, 2995) 0.364
4 lamps (0, 2510) 0.391
5 checked (0, 714) 0.357

Get genre of each text

The genre of each text is specified in a separate metadata file.

Note that in the original data, a text may have multiple genres. We choose an arbitrary one as the "true" genre.

In a more careful study, the single most appropriate genre would have to be selected by hand.

In [4]:
# Print the first 5 lines to see what the metadata looks like:
print(''.join(io.open('metadata.csv', encoding='utf8').readlines()[:5]))
Dataset,Fold,Success,FileName,Title,Author,Language,DownloadCount
Adventure,1,SUCCESS,103.txt,around the world in 80 days,"verne, jules, 1828-1905",en,3260
Adventure,1,SUCCESS,1145.txt,rupert of hentzau,"hope, anthony, 1863-1933",en,141
Adventure,1,SUCCESS,1947.txt,scaramouche,"sabatini, rafael, 1875-1950",en,434
Adventure,1,SUCCESS,18857.txt,a journey to the center of the earth,"verne, jules, 1828-1905",en,336

In [5]:
# Load the data; metadata.index will be the filename.
metadata = pandas.read_csv('metadata.csv', index_col=3, encoding='utf8')
genres = dict(zip(metadata.index, metadata['Dataset']))

# convert the genre labels to integers
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform([genres[a] for a in filenames])

# Create an abbreviated label "Author_Title" for each text
authors = dict(zip(metadata.index, metadata['Author']))
titles = dict(zip(metadata.index, metadata['Title']))
labels = ['%s_%s' % (authors[a].split(',')[0].title(),
        titles[a][:15].title()) for a in filenames]
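
The mapping from genre names to integers can be inspected directly: encoder.classes_ lists the genre names in sorted order, and the integer label of a genre is its position in that array:

# Show which integer in y corresponds to which genre.
print(list(enumerate(encoder.classes_)))
# e.g. [(0, 'Adventure'), (1, 'Detective'), (2, 'Fiction'), ...]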

Visualize how texts relate according to the bag-of-words model

By applying dimensionality reduction, it is possible to summarize the word counts of each text in two dimensions. Books that are close together are similar. The method used is Latent Semantic Analysis (LSA), i.e., a truncated SVD of the tf-idf matrix.

Note that for the purposes of visualization, we only show 2 dimensions; the classification model will use more. Books that are close together in this visualization may still be distinguished when more dimensions are taken into account.

In [6]:
# Reduce the BOW model to 2 dimensions
dec = decomposition.TruncatedSVD(n_components=2)
X_r = dec.fit_transform(X)
print('Explained variance:', dec.explained_variance_ratio_)

# Make a scatter plot with the author/title of each text as label
plt.figure(figsize=(12, 8))
for c, (i, target_name) in zip('rbmkycg', enumerate(encoder.classes_)):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
    for n, xpos, ypos in zip(
            (y == i).nonzero()[0], X_r[y == i, 0], X_r[y == i, 1]):
        plt.annotate(labels[n], xy=(xpos, ypos), xytext=(5, 5),
            textcoords='offset points', color=c,
            fontsize='small', ha='left', va='top')

plt.legend()
plt.title('%s of dataset' % dec.__class__.__name__)
plt.show()
Explained variance: [ 0.04966917  0.04049138]

Train a classifier

You can try different values for the parameter C, which controls the amount of regularization: higher values mean less regularization, so the model tries harder to accommodate edge cases (data points close to data points of other classes). This gives better scores on data that is similar to the training data, but if the training data is not representative, it may result in more errors on new data.
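
One way to choose C empirically is a small grid search with cross-validation on the training data; a sketch using the same cross_validation module as the rest of this notebook, with arbitrary candidate values:

# Hypothetical grid search over C; prints the mean cross-validated accuracy.
for C in (0.01, 0.1, 1.0, 10.0):
    scores = cross_validation.cross_val_score(
            svm.LinearSVC(C=C, random_state=42), X, y, cv=5)
    print(C, scores.mean())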

In [7]:
# Randomly select 80% as training set, 20% as test/validation set,
# but make sure that each genre is well-represented.
train, test = next(iter(cross_validation.StratifiedShuffleSplit(
        y, test_size=0.2, n_iter=1, random_state=42)))

# Train an SVM classifier and predict the genre of the items in the test set.
clf = svm.LinearSVC(C=1.0, random_state=42)
clf.fit(X[train], y[train])
pred = clf.predict(X[test])

Evaluate the classifier

The breakdown shows that not all genres are predicted equally well; the f-score column is the most important one.

In the confusion matrix we can see which genres were confused most often. The columns hold the number of times the model predicted each genre, while the rows correspond to the true genres.
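
The f-score is the harmonic mean of precision and recall. For example, the Adventure row in the report below follows from its precision and recall:

# Worked example for the Adventure row:
precision, recall = 0.40, 0.25
f1 = 2 * precision * recall / (precision + recall)
print(f1)   # ~0.31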

In [8]:
print('Overall accuracy:\t%4.1f %%\n' % (100 * metrics.accuracy_score(y[test], pred)))
print(metrics.classification_report(y[test], pred, target_names=encoder.classes_))
pandas.DataFrame(metrics.confusion_matrix(y[test], pred),
                       index=sorted(encoder.classes_),
                       columns=sorted(encoder.classes_))
Overall accuracy:	68.3 %

             precision    recall  f1-score   support

  Adventure       0.40      0.25      0.31         8
  Detective       1.00      0.89      0.94         9
    Fiction       0.38      0.56      0.45         9
 Historical       0.90      0.90      0.90        10
     Poetry       0.77      1.00      0.87        10
     Sci-Fi       0.70      1.00      0.82         7
      Short       0.50      0.20      0.29        10

avg / total       0.67      0.68      0.66        63

Out[8]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
Adventure 2 0 3 0 2 0 1
Detective 0 8 1 0 0 0 0
Fiction 1 0 5 1 1 0 1
Historical 0 0 1 9 0 0 0
Poetry 0 0 0 0 10 0 0
Sci-Fi 0 0 0 0 0 7 0
Short 2 0 3 0 0 3 2

Which books were the hardest to classify?

The following table lists the 10 books that had the least similarity to any of the genres. Each number in the table is the classifier's decision value for a text and a genre, where a negative value indicates dissimilarity and a positive value indicates similarity. Values close to 0 indicate uncertainty and are therefore the most difficult cases.
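
Since LinearSVC trains one binary classifier per genre (one-vs-rest), the predicted genre is simply the one with the highest decision value; the following illustrative check should confirm this:

# The class with the highest decision value is the predicted class.
scores = clf.decision_function(X[test])
assert (scores.argmax(axis=1) == clf.predict(X[test])).all()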

In [9]:
data = sorted(zip(test, clf.decision_function(X[test])), key=lambda x: max(x[1]))[:10]
pandas.DataFrame([a for _, a in data],
                       index=[labels[n] for n, _ in data],
                       columns=encoder.classes_)
Out[9]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
Hartley_The Gentlemen'S -0.776 -0.885 -0.553 -0.563 -0.630 -1.237 -0.634
Twain_A Double Barrel -0.803 -0.613 -0.551 -0.905 -0.971 -0.741 -0.598
Buchan_Huntingtower -0.817 -0.558 -0.430 -0.903 -0.713 -1.058 -0.896
Conrad_Tales Of Unrest -0.426 -0.869 -0.475 -0.645 -0.996 -1.249 -0.614
Hawthorne_Mosses From An -0.789 -1.006 -0.756 -0.606 -0.656 -1.200 -0.391
James_The Pupil -0.933 -0.579 -0.383 -0.875 -0.789 -0.789 -0.651
Smith_A Gentleman Vag -0.969 -0.571 -0.371 -0.968 -1.006 -0.978 -0.411
Conrad_The Arrow Of Go -0.774 -0.446 -0.366 -0.567 -1.134 -1.301 -0.647
Nesbit_The Wouldbegood -0.603 -0.755 -0.697 -0.903 -0.897 -1.124 -0.363
Farmer_They Twinkled L -0.920 -0.829 -0.787 -0.935 -0.854 -0.363 -0.379

Which words are most strongly associated with each genre?

For each genre, the 10 words most strongly associated with it are shown.

These words are not necessarily frequent, but when they do occur, they strongly point to the given genre.
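
The weights come from clf.coef_, which holds one row of feature weights per genre and one column per word in the BOW table:

print(clf.coef_.shape)   # (number of genres, number of words); here 7 rows
                         # and at most max_features=5000 columns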

In [10]:
# For each genre, sort the classifier's weights and take the 10 largest
data = []
for n, target in enumerate(encoder.classes_):
    data.append([(feature_names[m], clf.coef_[n][m])
                 for m in numpy.argsort(clf.coef_[n])[-10:][::-1]])
    
pandas.DataFrame([itertools.chain(*a) for a in zip(*data)],
                       columns=list(itertools.chain(*((target, '') for target in encoder.classes_))),
                       index=range(1, 11))
Out[10]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
1 buck 0.460 detective 0.803 billy 0.453 virginia 0.491 poems 0.789 corrected 0.667 cats 0.595
2 engineer 0.444 police 0.559 chairman 0.434 sergeant 0.480 thy 0.505 errors 0.626 bob 0.501
3 moat 0.421 revolver 0.508 helen 0.420 lieutenant 0.404 thee 0.467 mars 0.562 model 0.481
4 savages 0.382 inspector 0.497 anne 0.420 wagons 0.403 morn 0.460 car 0.487 dollars 0.434
5 adventures 0.354 criminal 0.456 guy 0.397 fort 0.368 poem 0.416 spelling 0.470 wizard 0.432
6 tent 0.345 murderer 0.445 daddy 0.391 galloped 0.360 tis 0.407 publication 0.435 chicago 0.432
7 attacking 0.342 detectives 0.437 stove 0.377 treason 0.354 skies 0.400 button 0.434 magician 0.424
8 wounds 0.342 drawer 0.426 saloon 0.375 nicholas 0.353 sings 0.373 buildings 0.426 mamma 0.416
9 compass 0.331 card 0.360 cent 0.373 france 0.350 muse 0.371 onto 0.417 ruler 0.415
10 camels 0.329 collection 0.354 nothin 0.359 veteran 0.349 doth 0.369 section 0.406 tonight 0.406

Can the model predict the genre of new texts?

Finally, we load new texts that the model has never seen before, and see what it predicts.

In [11]:
# Since we now evaluate on an external test set, we can use everything
# as training data
clf.fit(X, y)

# Transform the new files to the format of the existing BOW table
newfiles = os.listdir('test/')
X1 = vectorizer.transform((io.open('test/' + a, encoding='utf8').read() for a in newfiles))
predictions = encoder.inverse_transform(clf.predict(X1))

pandas.DataFrame([
            (authors[a].title(), titles[a].title(),
            genres[a], b)
        for a, b in zip(newfiles, predictions)],
        index=newfiles,
        columns=['Author', 'Title', 'actual', 'predicted'])
Out[11]:
Author Title actual predicted
1027.txt Grey, Zane, 1872-1939 The Lone Star Ranger, A Ro... Fiction Fiction
10067.txt Stevenson, Burton Egbert, ... The Mystery Of The Boule C... Detective Detective
12843.txt Emerson, Ralph Waldo, 1803... Poems Household Edition Poetry Poetry
10150.txt Stoker, Bram, 1847-1912 Dracula'S Guest Short Short
103.txt Verne, Jules, 1828-1905 Around The World In 80 Days Adventure Detective
18109.txt Piper, H. Beam, 1904-1964 Graveyard Of Dreams Sci-Fi Sci-Fi
11228.txt Chesnutt, Charles W. (Char... The Marrow Of Tradition Historical Historical