Author: Mattias Östmar
Date: 2019-03-14
Contact: mattiasostmar at gmail dot com
Thanks to Mikael Huss for being a good speaking partner.
In this notebook we're going to use the Python version of fastText, based on Facebook's fastText tool, to try to predict the Jungian cognitive function of an author from the writing style that appears in their blog posts.
import csv
import requests
import pandas as pd
from sklearn.model_selection import train_test_split
import fasttext
Download the annotated dataset as a semicolon-separated CSV from https://osf.io/zvw5g/download (66.1 MB file size).
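The requests import above can also fetch the file programmatically; a minimal sketch, assuming the OSF link serves the CSV directly:

# Download the dataset (~66 MB) if it isn't already on disk
r = requests.get("https://osf.io/zvw5g/download")
r.raise_for_status()
with open("blog_texts_and_cognitive_function.csv", "wb") as f:
    f.write(r.content)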
df = pd.read_csv("blog_texts_and_cognitive_function.csv", sep=";", index_col=0)
df.head(3)
| | text | base_function | directed_function |
|---|---|---|---|
| 1 | ❀*a drop of colour*❀ 1/39 next→ home ask past ... | f | fi |
| 2 | Neko cool kids can't die home family daveblog ... | t | ti |
| 3 | Anything... Anything Mass Effect-related Music... | f | fe |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22588 entries, 1 to 25437
Data columns (total 3 columns):
text                 22588 non-null object
base_function        22588 non-null object
directed_function    22588 non-null object
dtypes: object(3)
memory usage: 705.9+ KB
df.base_function.value_counts()
n    9380
f    6063
t    4502
s    2643
Name: base_function, dtype: int64
Let's see, crudely, whether blog writers in a given class write longer or shorter texts on average.
# Count the whitespace-separated tokens in each text; assigning via apply keeps
# the values aligned with the dataframe's own (non-contiguous) index
df["text_len"] = df.text.apply(lambda x: len(x.split()))
df.groupby("base_function").mean()
| base_function | text_len |
|---|---|
| f | 476.125869 |
| n | 489.926113 |
| s | 488.566448 |
| t | 508.435853 |
Let's try to predict the four base cognitive functions. We need to prepare the labels to suit fastText's formatting.
dataset = df[["base_function","text"]].copy()  # copy so we don't modify a slice of df
dataset["label"] = df.base_function.apply(lambda x: "__label__" + x)
dataset.drop("base_function", axis=1, inplace=True)
dataset = dataset[["label","text"]]
dataset.head(3)
| | label | text |
|---|---|---|
| 1 | __label__f | ❀*a drop of colour*❀ 1/39 next→ home ask past ... |
| 2 | __label__t | Neko cool kids can't die home family daveblog ... |
| 3 | __label__f | Anything... Anything Mass Effect-related Music... |
dataset.tail(3)
| | label | text |
|---|---|---|
| 25435 | __label__t | Living in Lit Home Hi there Ask Archive Mobile... |
| 25436 | __label__f | Love is Art Love is Art message · about ... |
| 25437 | __label__n | (Source: taeyeohn , via ninakask ) Posted at 0... |
Now let's split the dataset into 80 per cent for training and 20 per cent for evaluation.
train, test = train_test_split(dataset, test_size=0.2)
print("Rows in training data: {}".format(len(train)))
print("Rows in test data: {}".format(len(test)))
Rows in training data: 18070
Rows in test data: 4518
Now we write the training and evaluation sets to two separate text files, with each row containing the label and the text according to fastText's formatting conventions.
train.to_csv(r'jung_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
test.to_csv(r'jung_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
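Each line in these files should now start with the label token followed by the raw text, roughly `__label__f some blog text ...`; a quick sanity check:

# Print the start of the first training line to verify the fastText input format
with open("jung_training.txt", encoding="utf-8") as f:
    print(f.readline()[:80])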
Now we can train our model with the default settings and no text preprocessing to get an initial baseline.
classifier1 = fasttext.supervised("jung_training.txt","model_jung_default")
Then we can evaluate the model using our test data.
result = classifier1.test("jung_evaluation.txt")
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
P@1: 0.42297476759628155
R@1: 0.42297476759628155
Number of examples: 4518
The results are only slightly better than the majority-class baseline of 0.415 (always predicting the largest class, n). Let's see if we can improve the model with some crude preprocessing of the texts: removing non-alphanumeric characters and lowercasing all words.
processed = dataset.copy()
processed["text"] = processed.text.str.replace(r"[\W ]", " ", regex=True)  # replace non-word characters (anything but letters, digits, underscore) with a space
processed["text"] = processed.text.str.lower()                             # lowercase all characters
processed["text"] = processed.text.str.replace(r" +", " ", regex=True)     # collapse multiple spaces
processed["text"] = processed.text.str.replace(r"^ +", "", regex=True)     # strip the resulting leading spaces
processed.head(3)
| | label | text |
|---|---|---|
| 1 | __label__f | a drop of colour 1 39 next home ask past a dro... |
| 2 | __label__t | neko cool kids can t die home family daveblog ... |
| 3 | __label__f | anything anything mass effect related music fu... |
Then we create training and evaluation sets from the processed dataframe and store them in two new files with the prefix "processed_".
train, test = train_test_split(processed, test_size=0.2)
print("Rows in training data: {}".format(len(train)))
print("Rows in test data: {}".format(len(test)))
train.to_csv(r'processed_jung_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
test.to_csv(r'processed_jung_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
Rows in training data: 18070
Rows in test data: 4518
And re-run the training and evaluation.
classifier2 = fasttext.supervised("processed_jung_training.txt","model_jung_preprocessed")
result = classifier2.test("processed_jung_evaluation.txt")
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
P@1: 0.42076139884904823
R@1: 0.42076139884904823
Number of examples: 4518
Slightly worse results. It seems capital letters and special characters carry signal that helps distinguish between the labels, so let's keep the original training data for further training and tuning.
What happens if we increase the number of epochs from the default 5 to 25?
classifier3 = fasttext.supervised("jung_training.txt", "model_jung_default_25epochs", epoch=25)
result = classifier3.test("jung_evaluation.txt")
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
P@1: 0.35568835768038953
R@1: 0.35568835768038953
Number of examples: 4518
The results actually deteriorate, from 0.423 to 0.356.
What happens if we increase the learning rate from the default 0.05 to 1?
classifier4 = fasttext.supervised("jung_training.txt", "model_jung_default_lr0.5", lr=1)
result = classifier4.test("jung_evaluation.txt")
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
P@1: 0.42363877822045154
R@1: 0.42363877822045154
Number of examples: 4518
A minuscule improvement, from 0.423 to 0.424.
What happens if we use word_ngrams of 2?
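The cell for this run isn't shown here; a minimal sketch with the same wrapper could look like this (the model file name is made up):

# Sketch: default settings plus bigram features; n-grams are hashed into `bucket` buckets
classifier5 = fasttext.supervised("jung_training.txt", "model_jung_default_ngrams2",
                                  word_ngrams=2, bucket=2000000)
result = classifier5.test("jung_evaluation.txt")
print('P@1:', result.precision)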
With word_ngrams set to 2 we get about the same result, 0.423, as when we increased the learning rate to 1.
What if we use pre-trained vectors when building the classifier? They can be downloaded from fasttext.cc.
First we use the smallest vector file. Note that we have to increase the number of dimensions used in training from the default 100 to 300 to match the vector file.
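A sketch of that run, assuming the smallest English .vec file from fasttext.cc (here wiki-news-300d-1M.vec, an assumption) has been downloaded into the working directory:

# Sketch: pre-trained 300-dimensional word vectors; dim must match the vector file
classifier6 = fasttext.supervised("jung_training.txt", "model_jung_pretrained",
                                  dim=300, pretrained_vectors="wiki-news-300d-1M.vec")
result = classifier6.test("jung_evaluation.txt")
print('P@1:', result.precision)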
Then we try the largest vector file, which also includes subword information.
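Under the same assumptions, with the subword-enriched file (assumed here to be crawl-300d-2M-subword.vec):

# Sketch: same call, but with the larger subword-enriched vector file
classifier7 = fasttext.supervised("jung_training.txt", "model_jung_pretrained_subword",
                                  dim=300, pretrained_vectors="crawl-300d-2M-subword.vec")
result = classifier7.test("jung_evaluation.txt")
print('P@1:', result.precision)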
The results improve by a mere 0.002.
Just in case, we also train on the preprocessed texts again using the largest pre-trained vectors.
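Sketched the same way, just pointing at the preprocessed files (same assumed vector file as above):

# Sketch: preprocessed texts combined with the large pre-trained vectors
classifier8 = fasttext.supervised("processed_jung_training.txt", "model_jung_processed_pretrained",
                                  dim=300, pretrained_vectors="crawl-300d-2M-subword.vec")
result = classifier8.test("processed_jung_evaluation.txt")
print('P@1:', result.precision)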
Now we get the best results so far: 0.425 precision, against a majority-class baseline of 0.415 (the largest class, n, accounts for 9380 of the 22588 examples in the original dataset). But that is only about 2.4 % better than the baseline, so it doesn't say very much about how predictable a blog author's Jungian cognitive function is from their writing style.