Genre Classification with a Bag-of-Words Model

This notebook demonstrates how the genre of a text can be automatically predicted from its word counts.

The texts are 19th-century English novels from Project Gutenberg. The selection is based on the successful novels (those with high download counts) used in Ashok et al. (2013), Success with Style.

First, some preliminaries: import the external libraries we will use. Most importantly, we rely on scikit-learn for the machine learning; see http://scikit-learn.org/

In [1]:
import io, os, itertools
import numpy, pandas
import matplotlib.pyplot as plt
from sklearn import feature_extraction, preprocessing, decomposition, cross_validation, svm, metrics

# Tweak how tables are displayed
pandas.set_option('display.precision', 4)
pandas.set_option('display.max_colwidth', 30)
pandas.set_option('display.colheader_justify', 'left')

Load the texts and convert to a bag-of-words table

The bag-of-words (BOW) table will contain, for each text, a count for every word.

All letters are converted to lower case, and we keep only words consisting of 3 or more alphabetic characters or hyphens (no digits or other punctuation).
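
To see what this tokenization keeps, the vectorizer's token pattern (the same one used in the code below) can be tried out directly with Python's re module:

import re

# Words of 3+ letters (or hyphens) are kept; digits and punctuation are not.
print(re.findall(r'\b[-A-Za-z]{3,}\b', 'It was 1831, a dark and stormy night!'))
# ['was', 'dark', 'and', 'stormy', 'night']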

Important parameters are max_features, min_df, and max_df.

  • max_features limits the model to the top n most frequent words. More features often give better results, but take longer to compute.
  • min_df and max_df restrict the words to those that occur in a certain proportion of texts.

For example, setting these to 0.2 and 0.8, respectively, restricts the model to words that occur in at least 20% and at most 80% of the texts (the code below uses 0.1 and 0.5). This removes rare words on the one hand, and ignores highly frequent words such as function words on the other.

  • use_idf=True turns on tf-idf weighting; this incorporates not only the total frequency of a word, but also the number of texts in which it appears. This means that words which are frequent in one text but not in others will get a high score, while words that are frequent in all documents receive a lower score.
  • sublinear_tf=True is another variation: instead of the raw frequency tf, it uses 1 + log(tf), which dampens the influence of highly frequent words.
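
As a rough illustration of the last two options, the weight of a single word is computed as follows; a minimal sketch with made-up counts, assuming scikit-learn's default smoothed idf. Note that TfidfVectorizer additionally normalizes each document vector to unit length.

import math

n_docs = 100   # made-up: total number of texts in the collection
df = 20        # made-up: number of texts containing the word
tf = 7         # made-up: raw count of the word in one text

idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0   # use_idf=True (smoothed)
tf_weight = 1.0 + math.log(tf)                      # sublinear_tf=True
print(tf_weight * idf)   # weight before per-document L2 normalization
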
In [2]:
# Set up a simple BOW model
vectorizer = feature_extraction.text.TfidfVectorizer(
        lowercase=True, token_pattern=r'\b[-A-Za-z]{3,}\b',
        min_df=0.1, max_df=0.5, max_features=5000,
        use_idf=True, sublinear_tf=True)

# Get a list of all filenames in the 'train/' folder,
# and add their text to the BOW table 'X'.
filenames = os.listdir('train/')
X = vectorizer.fit_transform((io.open('train/' + a, encoding='utf8').read() for a in filenames))
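
As a quick sanity check, X is a sparse matrix with one row per text and one column for each word that the vectorizer kept (at most max_features columns):

print(X.shape)                               # (number of texts, number of words)
print(len(vectorizer.get_feature_names()))   # size of the vocabulary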

Inspect the vector representation

To see what the vector representation looks like, we transform a simple example sentence into its BOW representation.

Notice that, because of the parameters chosen above, many words are ignored.

In [3]:
vec = vectorizer.transform(['It was a dark and stormy night; '
                            'the rain fell in torrents — except at occasional intervals, '
                            'when it was checked by a violent gust of wind which swept '
                            'up the streets (for it is in London that our scene lies), '
                            'rattling along the housetops, and fiercely agitating the scanty '
                            'flame of the lamps that struggled against the darkness.'])

# we get back a large vector with a value for each possible word.
# show a table with only the non-zero items in the vector:
feature_names = vectorizer.get_feature_names()
pandas.DataFrame([(feature_names[b], (a, b), vec[a, b])
                        for a, b in zip(*vec.nonzero())],
                       columns=['word', 'index', 'weight'])
Out[3]:
word index weight
0 stormy (0, 4269) 0.440
1 scanty (0, 3877) 0.468
2 rattling (0, 3539) 0.418
3 occasional (0, 2995) 0.364
4 lamps (0, 2510) 0.391
5 checked (0, 714) 0.357

Get genre of each text

The genre of each text is specified in a separate metadata file.

Note that in the original data, a text may have multiple genres. We choose an arbitrary one as the "true" genre.

In a more careful study, the single most appropriate genre would have to be selected by hand.

In [4]:
# Print the first 5 lines to see what the metadata looks like:
print(''.join(io.open('metadata.csv', encoding='utf8').readlines()[:5]))
Dataset,Fold,Success,FileName,Title,Author,Language,DownloadCount
Adventure,1,SUCCESS,103.txt,around the world in 80 days,"verne, jules, 1828-1905",en,3260
Adventure,1,SUCCESS,1145.txt,rupert of hentzau,"hope, anthony, 1863-1933",en,141
Adventure,1,SUCCESS,1947.txt,scaramouche,"sabatini, rafael, 1875-1950",en,434
Adventure,1,SUCCESS,18857.txt,a journey to the center of the earth,"verne, jules, 1828-1905",en,336

In [5]:
# Load the data; metadata.index will be the filename.
metadata = pandas.read_csv('metadata.csv', index_col=3, encoding='utf8')
genres = dict(zip(metadata.index, metadata['Dataset']))

# convert the genre labels to integers
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform([genres[a] for a in filenames])

# Create an abbreviated label "Author_Title" for each text
authors = dict(zip(metadata.index, metadata['Author']))
titles = dict(zip(metadata.index, metadata['Title']))
labels = ['%s_%s' % (authors[a].split(',')[0].title(),
        titles[a][:15].title()) for a in filenames]
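
The mapping from genre names to integers can be inspected directly: encoder.classes_ lists the genre names in sorted order, and the integer label of a genre is its position in that array:

# Show which integer in y corresponds to which genre.
print(list(enumerate(encoder.classes_)))
# e.g. [(0, 'Adventure'), (1, 'Detective'), (2, 'Fiction'), ...]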

Visualize how texts relate according to the bag-of-words model

By applying dimensionality reduction, it is possible to summarize the word counts of each text in two dimensions. Books that are close together are similar. The method used is Latent Semantic Analysis (LSA), i.e., a truncated SVD of the tf-idf matrix.

Note that for the purposes of visualization, we only show 2 dimensions; the classification model will use more. Books that are close together in this visualization may still be distinguished when more dimensions are taken into account.

In [6]:
# Reduce the BOW model to 2 dimensions
dec = decomposition.TruncatedSVD(n_components=2)
X_r = dec.fit_transform(X)
print('Explained variance:', dec.explained_variance_ratio_)

# Make a scatter plot with the author/title of each text as label
plt.figure(figsize=(12, 8))
for c, (i, target_name) in zip('rbmkycg', enumerate(encoder.classes_)):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
    for n, xpos, ypos in zip(
            (y == i).nonzero()[0], X_r[y == i, 0], X_r[y == i, 1]):
        plt.annotate(labels[n], xy=(xpos, ypos), xytext=(5, 5),
            textcoords='offset points', color=c,
            fontsize='small', ha='left', va='top')

plt.legend()
plt.title('%s of dataset' % dec.__class__.__name__)
plt.show()
Explained variance: [ 0.04966917  0.04049138]

Train a classifier

You can try different values for the parameter C, which controls the amount of regularization: higher values mean less regularization, so the model tries harder to accommodate edge cases (data points close to data points of other classes). This gives better scores on data that is similar to the training data, but if the training data is not representative, it may result in more errors on new data.
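
One way to choose C empirically is a small grid search with cross-validation on the training data; a sketch using the same cross_validation module as the rest of this notebook, with arbitrary candidate values:

# Hypothetical grid search over C; prints the mean cross-validated accuracy.
for C in (0.01, 0.1, 1.0, 10.0):
    scores = cross_validation.cross_val_score(
            svm.LinearSVC(C=C, random_state=42), X, y, cv=5)
    print(C, scores.mean())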

In [7]:
# Randomly select 80% as training set, 20% as test/validation set,
# but make sure that each genre is well-represented.
train, test = next(iter(cross_validation.StratifiedShuffleSplit(
        y, test_size=0.2, n_iter=1, random_state=42)))

# Train an SVM classifier and predict the genre of the items in the test set.
clf = svm.LinearSVC(C=1.0, random_state=42)
clf.fit(X[train], y[train])
pred = clf.predict(X[test])

Evaluate the classifier

The breakdown shows that not all genres are predicted equally well; the f-score column is the most important one.

In the confusion matrix we can see which genres were confused most often. The columns hold the number of times the model predicted each genre, while the rows correspond to the true genres.
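
The f-score is the harmonic mean of precision and recall. For example, the Adventure row in the report below follows from its precision and recall:

# Worked example for the Adventure row:
precision, recall = 0.40, 0.25
f1 = 2 * precision * recall / (precision + recall)
print(f1)   # ~0.31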

In [8]:
print('Overall accuracy:\t%4.1f %%\n' % (100 * metrics.accuracy_score(y[test], pred)))
print(metrics.classification_report(y[test], pred, target_names=encoder.classes_))
pandas.DataFrame(metrics.confusion_matrix(y[test], pred),
                       index=sorted(encoder.classes_),
                       columns=sorted(encoder.classes_))
Overall accuracy:	68.3 %

             precision    recall  f1-score   support

  Adventure       0.40      0.25      0.31         8
  Detective       1.00      0.89      0.94         9
    Fiction       0.38      0.56      0.45         9
 Historical       0.90      0.90      0.90        10
     Poetry       0.77      1.00      0.87        10
     Sci-Fi       0.70      1.00      0.82         7
      Short       0.50      0.20      0.29        10

avg / total       0.67      0.68      0.66        63

Out[8]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
Adventure 2 0 3 0 2 0 1
Detective 0 8 1 0 0 0 0
Fiction 1 0 5 1 1 0 1
Historical 0 0 1 9 0 0 0
Poetry 0 0 0 0 10 0 0
Sci-Fi 0 0 0 0 0 7 0
Short 2 0 3 0 0 3 2

Which books were the hardest to classify?

The following table lists the 10 books that had the least similarity to any of the genres. Each number in the table is the classifier's decision value for a text and a genre, where a negative value indicates dissimilarity and a positive value indicates similarity. Values close to 0 indicate uncertainty and are therefore the most difficult cases.
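
Since LinearSVC trains one binary classifier per genre (one-vs-rest), the predicted genre is simply the one with the highest decision value; the following illustrative check should confirm this:

# The class with the highest decision value is the predicted class.
scores = clf.decision_function(X[test])
assert (scores.argmax(axis=1) == clf.predict(X[test])).all()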

In [9]:
data = sorted(zip(test, clf.decision_function(X[test])), key=lambda x: max(x[1]))[:10]
pandas.DataFrame([a for _, a in data],
                       index=[labels[n] for n, _ in data],
                       columns=encoder.classes_)
Out[9]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
Hartley_The Gentlemen'S -0.776 -0.885 -0.553 -0.563 -0.630 -1.237 -0.634
Twain_A Double Barrel -0.803 -0.613 -0.551 -0.905 -0.971 -0.741 -0.598
Buchan_Huntingtower -0.817 -0.558 -0.430 -0.903 -0.713 -1.058 -0.896
Conrad_Tales Of Unrest -0.426 -0.869 -0.475 -0.645 -0.996 -1.249 -0.614
Hawthorne_Mosses From An -0.789 -1.006 -0.756 -0.606 -0.656 -1.200 -0.391
James_The Pupil -0.933 -0.579 -0.383 -0.875 -0.789 -0.789 -0.651
Smith_A Gentleman Vag -0.969 -0.571 -0.371 -0.968 -1.006 -0.978 -0.411
Conrad_The Arrow Of Go -0.774 -0.446 -0.366 -0.567 -1.134 -1.301 -0.647
Nesbit_The Wouldbegood -0.603 -0.755 -0.697 -0.903 -0.897 -1.124 -0.363
Farmer_They Twinkled L -0.920 -0.829 -0.787 -0.935 -0.854 -0.363 -0.379

Which words are most strongly associated with each genre?

For each genre, the 10 words most strongly associated with it are shown.

These words are not necessarily frequent, but when they do occur, they strongly point to the given genre.
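
The weights come from clf.coef_, which holds one row of feature weights per genre and one column per word in the BOW table:

print(clf.coef_.shape)   # (number of genres, number of words); here 7 rows
                         # and at most max_features=5000 columns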

In [10]:
# For each genre, sort the classifier's weights and take the 10 largest
data = []
for n, target in enumerate(encoder.classes_):
    data.append([(feature_names[m], clf.coef_[n][m])
                 for m in numpy.argsort(clf.coef_[n])[-10:][::-1]])
    
pandas.DataFrame([itertools.chain(*a) for a in zip(*data)],
                       columns=list(itertools.chain(*((target, '') for target in encoder.classes_))),
                       index=range(1, 11))
Out[10]:
Adventure Detective Fiction Historical Poetry Sci-Fi Short
1 buck 0.460 detective 0.803 billy 0.453 virginia 0.491 poems 0.789 corrected 0.667 cats 0.595
2 engineer 0.444 police 0.559 chairman 0.434 sergeant 0.480 thy 0.505 errors 0.626 bob 0.501
3 moat 0.421 revolver 0.508 helen 0.420 lieutenant 0.404 thee 0.467 mars 0.562 model 0.481
4 savages 0.382 inspector 0.497 anne 0.420 wagons 0.403 morn 0.460 car 0.487 dollars 0.434
5 adventures 0.354 criminal 0.456 guy 0.397 fort 0.368 poem 0.416 spelling 0.470 wizard 0.432
6 tent 0.345 murderer 0.445 daddy 0.391 galloped 0.360 tis 0.407 publication 0.435 chicago 0.432
7 attacking 0.342 detectives 0.437 stove 0.377 treason 0.354 skies 0.400 button 0.434 magician 0.424
8 wounds 0.342 drawer 0.426 saloon 0.375 nicholas 0.353 sings 0.373 buildings 0.426 mamma 0.416
9 compass 0.331 card 0.360 cent 0.373 france 0.350 muse 0.371 onto 0.417 ruler 0.415
10 camels 0.329 collection 0.354 nothin 0.359 veteran 0.349 doth 0.369 section 0.406 tonight 0.406

Can the model predict the genre of new texts?

Finally, we load new texts that the model has never seen before, and see what it predicts.

In [11]:
# Since we now evaluate on an external test set, we can use everything
# as training data
clf.fit(X, y)

# Transform the new files to the format of the existing BOW table
newfiles = os.listdir('test/')
X1 = vectorizer.transform((io.open('test/' + a, encoding='utf8').read() for a in newfiles))
predictions = encoder.inverse_transform(clf.predict(X1))

pandas.DataFrame([
            (authors[a].title(), titles[a].title(),
            genres[a], b)
        for a, b in zip(newfiles, predictions)],
        index=newfiles,
        columns=['Author', 'Title', 'actual', 'predicted'])
Out[11]:
Author Title actual predicted
1027.txt Grey, Zane, 1872-1939 The Lone Star Ranger, A Ro... Fiction Fiction
10067.txt Stevenson, Burton Egbert, ... The Mystery Of The Boule C... Detective Detective
12843.txt Emerson, Ralph Waldo, 1803... Poems Household Edition Poetry Poetry
10150.txt Stoker, Bram, 1847-1912 Dracula'S Guest Short Short
103.txt Verne, Jules, 1828-1905 Around The World In 80 Days Adventure Detective
18109.txt Piper, H. Beam, 1904-1964 Graveyard Of Dreams Sci-Fi Sci-Fi
11228.txt Chesnutt, Charles W. (Char... The Marrow Of Tradition Historical Historical