This notebook demonstrates how the genre of a text can be automatically predicted from its word counts.
The texts are 19th-century English novels from Project Gutenberg. The selection is based on the successful novels (high download counts) used in Ashok et al. (2013), Success with style.
First some preliminaries. Import the external libraries we will use. Most importantly, we rely on Scikit-Learn to do machine learning. See http://scikit-learn.org/
import io, os, itertools
import numpy, pandas
import matplotlib.pyplot as plt
from sklearn import feature_extraction, preprocessing, decomposition, cross_validation, svm, metrics
# Tweak how tables are displayed
pandas.set_option('display.precision', 4)
pandas.set_option('display.max_colwidth', 30)
pandas.set_option('display.colheader_justify', 'left')
The bag-of-words (BOW) table will contain a count for each text and every word. All letters are converted to lower case, and the tokens are filtered so that only words of 3 or more letters remain (hyphens are allowed; digits and other punctuation are excluded).

Important parameters are `max_features`, `min_df`, and `max_df`. `max_features` limits the model to the top n most frequent words; more is often better, but may be slow to compute. `min_df` and `max_df` restrict the words to those that occur in a certain proportion of texts. For example, setting these to 0.2 and 0.8, respectively, restricts the model to words that appear in at least 20% and at most 80% of the texts. This removes rare words on the one hand, and ignores highly frequent words such as function words on the other.

`use_idf=True` turns on tf-idf weighting: this incorporates not only the total frequency of a word, but also the number of texts in which it appears. Words that are frequent in one text but not in others get a high score, while words that are frequent in all documents receive a lower score. `sublinear_tf=True` is a further variation that scales the word frequencies logarithmically.

# Set up a simple BOW model
vectorizer = feature_extraction.text.TfidfVectorizer(
lowercase=True, token_pattern=r'\b[-A-Za-z]{3,}\b',
min_df=0.1, max_df=0.5, max_features=5000,
use_idf=True, sublinear_tf=True)
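To see how `min_df` and `max_df` prune the vocabulary, here is a sketch on a toy corpus (made-up sentences, not part of the dataset), using an integer `min_df` (a minimum document count) for clarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" appears in 3 of 4 documents, "whale" in only one.
docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'the cat and the dog played',
        'a whale appeared offshore']

# min_df=2: keep only words occurring in at least 2 documents.
# max_df=0.5: drop words occurring in more than half the documents ("the").
vec = TfidfVectorizer(lowercase=True, min_df=2, max_df=0.5,
                      use_idf=True, sublinear_tf=True)
tfidf = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # both rare and ubiquitous words are filtered out
print(tfidf.shape)              # (4 documents, size of remaining vocabulary)
```

Note that the last toy document loses all of its words; with realistic corpus sizes and thresholds such as 0.1/0.5 this is unlikely to happen.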
# Get a list of all filenames in the 'train/' folder,
# and add their text to the BOW table 'X'.
filenames = os.listdir('train/')
X = vectorizer.fit_transform((io.open('train/' + a, encoding='utf8').read() for a in filenames))
vec = vectorizer.transform(['It was a dark and stormy night; '
'the rain fell in torrents — except at occasional intervals, '
'when it was checked by a violent gust of wind which swept '
'up the streets (for it is in London that our scene lies), '
'rattling along the housetops, and fiercely agitating the scanty '
'flame of the lamps that struggled against the darkness.'])
# We get back a large vector with a value for each possible word.
# Show a table with only the non-zero items in the vector:
feature_names = vectorizer.get_feature_names()
pandas.DataFrame([(feature_names[b], (a, b), vec[a, b])
for a, b in zip(*vec.nonzero())],
columns=['word', 'index', 'weight'])
| | word | index | weight |
|---|---|---|---|
| 0 | stormy | (0, 4269) | 0.440 |
| 1 | scanty | (0, 3877) | 0.468 |
| 2 | rattling | (0, 3539) | 0.418 |
| 3 | occasional | (0, 2995) | 0.364 |
| 4 | lamps | (0, 2510) | 0.391 |
| 5 | checked | (0, 714) | 0.357 |
The genre of each text is specified in a separate metadata file.
Note that in the original data, a text may have multiple genres; we pick an arbitrary one as the "true" genre.
In a more careful study, the single most appropriate genre would have to be selected by hand.
# Print the first 5 lines to see what the metadata looks like:
print(''.join(io.open('metadata.csv', encoding='utf8').readlines()[:5]))
Dataset,Fold,Success,FileName,Title,Author,Language,DownloadCount
Adventure,1,SUCCESS,103.txt,around the world in 80 days,"verne, jules, 1828-1905",en,3260
Adventure,1,SUCCESS,1145.txt,rupert of hentzau,"hope, anthony, 1863-1933",en,141
Adventure,1,SUCCESS,1947.txt,scaramouche,"sabatini, rafael, 1875-1950",en,434
Adventure,1,SUCCESS,18857.txt,a journey to the center of the earth,"verne, jules, 1828-1905",en,336
# Load the data; metadata.index will be the filename.
metadata = pandas.read_csv('metadata.csv', index_col=3, encoding='utf8')
genres = dict(zip(metadata.index, metadata['Dataset']))
# convert the genre labels to integers
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform([genres[a] for a in filenames])
# Create an abbreviated label "Author_Title" for each text
authors = dict(zip(metadata.index, metadata['Author']))
titles = dict(zip(metadata.index, metadata['Title']))
labels = ['%s_%s' % (authors[a].split(',')[0].title(),
titles[a][:15].title()) for a in filenames]
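As a sketch of what `LabelEncoder` does (with toy genre labels, not read from the metadata):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes are sorted alphabetically, then mapped to 0, 1, 2, ...
codes = enc.fit_transform(['Poetry', 'Sci-Fi', 'Poetry', 'Adventure'])
print(codes)                         # [1 2 1 0]
print(enc.classes_)                  # ['Adventure' 'Poetry' 'Sci-Fi']
print(enc.inverse_transform(codes))  # round-trips to the original labels
```

The same `inverse_transform` is used at the end of the notebook to turn predicted integers back into genre names.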
By applying dimensionality reduction, it is possible to summarize the word counts for each text in two dimensions; books that are close together are similar. The method used is Latent Semantic Analysis (a truncated SVD of the tf-idf matrix).

Note that for the purposes of visualization, we only show 2 dimensions; the classification model will use more. Books that are close together in this visualization may still be distinguishable when more dimensions are taken into account.
# Reduce the BOW model to 2 dimensions
dec = decomposition.TruncatedSVD(n_components=2)
X_r = dec.fit_transform(X)
print('Explained variance:', dec.explained_variance_ratio_)
# Make a scatter plot with the author/title of each text as label
plt.figure(figsize=(12, 8))
for c, (i, target_name) in zip('rbmkycg', enumerate(encoder.classes_)):
plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
for n, xpos, ypos in zip(
(y == i).nonzero()[0], X_r[y == i, 0], X_r[y == i, 1]):
plt.annotate(labels[n], xy=(xpos, ypos), xytext=(5, 5),
textcoords='offset points', color=c,
fontsize='small', ha='left', va='top')
plt.legend()
plt.title('%s of dataset' % dec.__class__.__name__)
plt.show()
Explained variance: [ 0.04966917  0.04049138]
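The two plotted components explain only about 9% of the variance. When deciding how many components to keep for a downstream model, the cumulative explained variance is a useful guide; a sketch on random data (a stand-in for the actual BOW matrix):

```python
import numpy
from sklearn.decomposition import TruncatedSVD

rng = numpy.random.RandomState(42)
M = rng.rand(100, 50)  # stand-in for the tf-idf matrix X

dec = TruncatedSVD(n_components=10, random_state=42)
dec.fit(M)

# Cumulative proportion of variance explained by the first k components
cumulative = numpy.cumsum(dec.explained_variance_ratio_)
print(cumulative)
```

One would keep enough components to reach some target proportion, or simply pick a round number (e.g. 100) and verify that classification accuracy no longer improves.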
You can try different values for the parameter `C`. This parameter controls the level of regularization; with higher values, the model will take more edge cases (datapoints close to datapoints of other classes) into account. This gives better scores on data that is similar to the training data, but if the training data is not representative, it may result in more errors.
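A sketch of the effect of `C` on synthetic two-class data (not the novels); with less regularization (higher `C`), training accuracy typically rises, at the risk of overfitting:

```python
import numpy
from sklearn.svm import LinearSVC

rng = numpy.random.RandomState(0)
# Two overlapping Gaussian blobs, 50 points per class
X = numpy.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
y = numpy.array([0] * 50 + [1] * 50)

scores = []
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, random_state=0, max_iter=10000)
    clf.fit(X, y)
    scores.append(clf.score(X, y))  # accuracy on the training data itself
    print(C, scores[-1])
```

In practice `C` should be chosen on held-out data (e.g. with cross-validation), not on the training score shown here.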
# Randomly select 80% as training set, 20% as test/validation set,
# but make sure that each genre is well-represented.
train, test = next(iter(cross_validation.StratifiedShuffleSplit(
y, test_size=0.2, n_iter=1, random_state=42)))
# Train an SVM classifier and predict the genre of the items in the test set.
clf = svm.LinearSVC(C=1.0, random_state=42)
clf.fit(X[train], y[train])
pred = clf.predict(X[test])
The breakdown shows that not all genres are predicted equally well; the f-score column is the most important.
In the confusion matrix we can see which genres were mistaken most often. The columns hold the number of times the model predicted a genre, while the rows show the true genres.
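The row/column convention of `confusion_matrix` can be checked on a toy example (made-up labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
# Rows are true classes, columns are predicted classes:
# one true 0 predicted as 0, one true 0 mistaken for 1, both 1s correct.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [0 2]]
```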
print('Overall accuracy:\t%4.1f %%\n' % (100 * metrics.accuracy_score(y[test], pred)))
print(metrics.classification_report(y[test], pred, target_names=encoder.classes_))
pandas.DataFrame(metrics.confusion_matrix(y[test], pred),
index=sorted(encoder.classes_),
columns=sorted(encoder.classes_))
Overall accuracy:	68.3 %

             precision    recall  f1-score   support

  Adventure       0.40      0.25      0.31         8
  Detective       1.00      0.89      0.94         9
    Fiction       0.38      0.56      0.45         9
 Historical       0.90      0.90      0.90        10
     Poetry       0.77      1.00      0.87        10
     Sci-Fi       0.70      1.00      0.82         7
      Short       0.50      0.20      0.29        10

avg / total       0.67      0.68      0.66        63
| | Adventure | Detective | Fiction | Historical | Poetry | Sci-Fi | Short |
|---|---|---|---|---|---|---|---|
Adventure | 2 | 0 | 3 | 0 | 2 | 0 | 1 |
Detective | 0 | 8 | 1 | 0 | 0 | 0 | 0 |
Fiction | 1 | 0 | 5 | 1 | 1 | 0 | 1 |
Historical | 0 | 0 | 1 | 9 | 0 | 0 | 0 |
Poetry | 0 | 0 | 0 | 0 | 10 | 0 | 0 |
Sci-Fi | 0 | 0 | 0 | 0 | 0 | 7 | 0 |
Short | 2 | 0 | 3 | 0 | 0 | 3 | 2 |
The following table lists the 10 books which had the least similarity to any of the genres. In the table, the numbers represent the distance of a text to a genre, where a negative value represents dissimilarity, and a positive value represents similarity. Values close to 0 represent uncertainty and are therefore the more difficult cases.
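For a one-vs-rest `LinearSVC`, `decision_function` returns one score per class, and `predict` selects the class with the highest score; a sketch on random toy data:

```python
import numpy
from sklearn.svm import LinearSVC

rng = numpy.random.RandomState(1)
X = rng.randn(60, 4)
y = numpy.repeat([0, 1, 2], 20)  # three toy classes

clf = LinearSVC(random_state=1, max_iter=10000).fit(X, y)
scores = clf.decision_function(X)
print(scores.shape)  # (n_samples, n_classes)

# The predicted class is the one with the highest decision score
assert (scores.argmax(axis=1) == clf.predict(X)).all()
```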
data = sorted(zip(test, clf.decision_function(X[test])), key=lambda x: max(x[1]))[:10]
pandas.DataFrame([a for _, a in data],
index=[labels[n] for n, _ in data],
columns=encoder.classes_)
| | Adventure | Detective | Fiction | Historical | Poetry | Sci-Fi | Short |
|---|---|---|---|---|---|---|---|
Hartley_The Gentlemen'S | -0.776 | -0.885 | -0.553 | -0.563 | -0.630 | -1.237 | -0.634 |
Twain_A Double Barrel | -0.803 | -0.613 | -0.551 | -0.905 | -0.971 | -0.741 | -0.598 |
Buchan_Huntingtower | -0.817 | -0.558 | -0.430 | -0.903 | -0.713 | -1.058 | -0.896 |
Conrad_Tales Of Unrest | -0.426 | -0.869 | -0.475 | -0.645 | -0.996 | -1.249 | -0.614 |
Hawthorne_Mosses From An | -0.789 | -1.006 | -0.756 | -0.606 | -0.656 | -1.200 | -0.391 |
James_The Pupil | -0.933 | -0.579 | -0.383 | -0.875 | -0.789 | -0.789 | -0.651 |
Smith_A Gentleman Vag | -0.969 | -0.571 | -0.371 | -0.968 | -1.006 | -0.978 | -0.411 |
Conrad_The Arrow Of Go | -0.774 | -0.446 | -0.366 | -0.567 | -1.134 | -1.301 | -0.647 |
Nesbit_The Wouldbegood | -0.603 | -0.755 | -0.697 | -0.903 | -0.897 | -1.124 | -0.363 |
Farmer_They Twinkled L | -0.920 | -0.829 | -0.787 | -0.935 | -0.854 | -0.363 | -0.379 |
For each genre, the 10 words most strongly associated with it are shown.
These words are not necessarily frequent, but when they do occur, they are a strong indicator of the genre.
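The extraction below relies on `numpy.argsort`, which sorts ascending; taking the last k indices and reversing them yields the top k, largest first. A tiny example with toy weights:

```python
import numpy

weights = numpy.array([0.1, 0.9, 0.3, 0.7])
# argsort is ascending, so take the last 2 indices and reverse them
top2 = numpy.argsort(weights)[-2:][::-1]
print(top2)  # [1 3]: the indices of 0.9 and 0.7, largest first
```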
# Sort the weights of the classifier and take last 10 items
data = []
for n, target in enumerate(encoder.classes_):
data.append([(feature_names[m], clf.coef_[n][m])
for m in numpy.argsort(clf.coef_[n])[-10:][::-1]])
pandas.DataFrame([list(itertools.chain(*a)) for a in zip(*data)],
columns=list(itertools.chain(*((target, '') for target in encoder.classes_))),
index=range(1, 11))
| | Adventure | | Detective | | Fiction | | Historical | | Poetry | | Sci-Fi | | Short | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | buck | 0.460 | detective | 0.803 | billy | 0.453 | virginia | 0.491 | poems | 0.789 | corrected | 0.667 | cats | 0.595 |
2 | engineer | 0.444 | police | 0.559 | chairman | 0.434 | sergeant | 0.480 | thy | 0.505 | errors | 0.626 | bob | 0.501 |
3 | moat | 0.421 | revolver | 0.508 | helen | 0.420 | lieutenant | 0.404 | thee | 0.467 | mars | 0.562 | model | 0.481 |
4 | savages | 0.382 | inspector | 0.497 | anne | 0.420 | wagons | 0.403 | morn | 0.460 | car | 0.487 | dollars | 0.434 |
5 | adventures | 0.354 | criminal | 0.456 | guy | 0.397 | fort | 0.368 | poem | 0.416 | spelling | 0.470 | wizard | 0.432 |
6 | tent | 0.345 | murderer | 0.445 | daddy | 0.391 | galloped | 0.360 | tis | 0.407 | publication | 0.435 | chicago | 0.432 |
7 | attacking | 0.342 | detectives | 0.437 | stove | 0.377 | treason | 0.354 | skies | 0.400 | button | 0.434 | magician | 0.424 |
8 | wounds | 0.342 | drawer | 0.426 | saloon | 0.375 | nicholas | 0.353 | sings | 0.373 | buildings | 0.426 | mamma | 0.416 |
9 | compass | 0.331 | card | 0.360 | cent | 0.373 | france | 0.350 | muse | 0.371 | onto | 0.417 | ruler | 0.415 |
10 | camels | 0.329 | collection | 0.354 | nothin | 0.359 | veteran | 0.349 | doth | 0.369 | section | 0.406 | tonight | 0.406 |
Finally, we load new texts that the model has never seen before, and see what it predicts.
# Since we now evaluate on an external test set, we can use everything
# as training data
clf.fit(X, y)
# Transform the new files to the format of the existing BOW table
newfiles = os.listdir('test/')
X1 = vectorizer.transform((io.open('test/' + a, encoding='utf8').read() for a in newfiles))
predictions = encoder.inverse_transform(clf.predict(X1))
pandas.DataFrame([
(authors[a].title(), titles[a].title(),
genres[a], b)
for a, b in zip(newfiles, predictions)],
index=newfiles,
columns=['Author', 'Title', 'actual', 'predicted'])
| | Author | Title | actual | predicted |
|---|---|---|---|---|
1027.txt | Grey, Zane, 1872-1939 | The Lone Star Ranger, A Ro... | Fiction | Fiction |
10067.txt | Stevenson, Burton Egbert, ... | The Mystery Of The Boule C... | Detective | Detective |
12843.txt | Emerson, Ralph Waldo, 1803... | Poems Household Edition | Poetry | Poetry |
10150.txt | Stoker, Bram, 1847-1912 | Dracula'S Guest | Short | Short |
103.txt | Verne, Jules, 1828-1905 | Around The World In 80 Days | Adventure | Detective |
18109.txt | Piper, H. Beam, 1904-1964 | Graveyard Of Dreams | Sci-Fi | Sci-Fi |
11228.txt | Chesnutt, Charles W. (Char... | The Marrow Of Tradition | Historical | Historical |