Hypothesis: Part of Speech (POS) tagging and syntactic dependency parsing provides valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.
VERB
, ADJ
, or PUNCT
; and POSTAG (tag_) refers to "fine-grained part-of-speech" like VB
, JJ
, or .
.DOBJ
(direct object), PREP
(prepositional modifier), or POBJ
(object of preposition).The imperative mood centers around actions, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:
FeatureName.VERB
: Does the phrase contain VERB
(s) of the tag form VB*
?FeatureName.FOLLOWING_POS
: Are the words following the VERB
(s) of certain parts of speech?FeatureName.FOLLOWING_POSTAG
: Are the words following the VERB
(s) of certain POS tags?FeatureName.CHILD_DEP
: Are the VERB
(s) parents of certain syntactic dependencies?FeatureName.PARENT_DEP
: Are the VERB
(s) children of certain syntactic dependencies?FeatureName.CHILD_POS
: Are the syntactic dependencies that the VERB
(s) are children of of certain parts of speech?FeatureName.CHILD_POSTAG
: Are the syntactic dependencies that the VERB
(s) are children of of certain POS tags?FeatureName.PARENT_POS
: Are the syntactic dependencies that the VERB
(s) parent of certain parts of speech?FeatureName.PARENT_POSTAG
: Are the syntactic dependencies that the VERB
(s) parent of certain POS tags?Notes:
True
; if False
, phrase vectorization will result in all zeroes._*
(e.g., FeatureName.FOLLOWING_POSTAG_WRB
).I wrote and ran epicurious_recipes.py
* to scrape Epicurious.com for recipe instructions and descriptions. I then performed some manual cleanup of the script results. Output is in epicurious-pos.txt
and epicurious-neg.txt
.
* script (very) loosely based off of https://github.com/benosment/hrecipe-parse
Note that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.
To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.
This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available. Further, most positive examples are recipe instructions - also a specific (and not necessarily related to the main "task" category) category of text.
Ultimately though, this recipe corpus is a stopgap/proof of concept for a corpus more relevant to tasks later on, so I won't worry further about this for now.
import os
from pandas import read_csv
from numpy import random
BASE_DIR = os.getcwd()
data_path = BASE_DIR + '/data.tsv'
df = read_csv(data_path, sep='\t', header=None, names=['Text', 'Label'])
df.head()
Text | Label | |
---|---|---|
0 | Be kind | pos |
1 | Get out of here | pos |
2 | Look this over | pos |
3 | Paul, do your homework now | pos |
4 | Do not clean soot off the window | pos |
pos_data_split = list(df.loc[df.Label == 'pos'].Text)
neg_data_split = list(df.loc[df.Label == 'neg'].Text)
num_pos = len(pos_data_split)
num_neg = len(neg_data_split)
# 50/50 split between the number of positive and negative samples
num_per_class = num_pos if num_pos < num_neg else num_neg
# shuffle samples
random.shuffle(pos_data_split)
random.shuffle(neg_data_split)
lines = []
for l in pos_data_split[:num_per_class]:
lines.append((l, 'pos'))
for l in neg_data_split[:num_per_class]:
lines.append((l, 'neg'))
# Features as defined in the introduction
from enum import Enum, auto
class FeatureName(Enum):
VERB = auto()
FOLLOWING_POS = auto()
FOLLOWING_POSTAG = auto()
CHILD_DEP = auto()
PARENT_DEP = auto()
CHILD_POS = auto()
CHILD_POSTAG = auto()
PARENT_POS = auto()
PARENT_POSTAG = auto()
Because Stanford CoreNLP is hard to install for Python
Found Spacy through an article on "Training a Classifier for Relation Extraction from Medical Literature" (GitHub)
#!conda config --add channels conda-forge
#!conda install spacy
#!python -m spacy download en
import spacy
# slow
nlp = spacy.load('en')
Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\n' will start a new Spacy Doc.
TODO: Improvement to Doc.sents? "To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse."
def create_spacy_docs(ll):
dd = [(nlp(l[0]), l[1]) for l in ll]
# collapse noun phrases into single compounds
for d in dd:
for np in d[0].noun_chunks:
np.merge(tag=np.root.tag_, ent_type=np.root.ent_type_, lemma=np.root.lemma_)
return dd
# slower
docs = create_spacy_docs(lines)
Tokenization, POS tagging, and dependency parsing happened automatically with the nlp(line)
calls above! So let's look at the outputs.
https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward
for doc in docs[:10]:
print(list(doc[0].sents))
[Toss cherry tomatoes with 1 Tbsp., oil in a small bowl; season with salt.] [Transfer to a platter and top with remaining 1/2 tsp., lime zest; season with salt and pepper.] [Serve with a crunchy green salad.] [Immediately pour entire contents of pot into a large colander to drain, then spread out corn, potatoes, and shrimp on a large rimmed baking sheet or sheets of newspaper; discard lemon halves.] [Cook wings, moving to a cooler section of grill or reducing heat if they start to burn, until cooked through, an instant-read thermometer inserted into the flesh but not touching the bone registers 165°F, and skin is crisp and lightly charred, 5–10 minutes.] [Line 24 muffin cups with paper liners.] [Cut into thin slices crosswise.] [Place 2 small plates in freezer to chill.] [Bring mixture to boil, stirring often, over medium-high heat.] [Don’t omit the milk, however, as this will change the balance of liquid to dry ingredients in the recipe.]
for doc in docs[:10]:
print(list(doc[0].noun_chunks))
[Toss cherry tomatoes, 1 Tbsp, oil, a small bowl, season, salt] [a platter, top, remaining 1/2 tsp, lime zest, season, salt, pepper] [a crunchy green salad] [entire contents, pot, a large colander, corn, potatoes, a large rimmed baking sheet, sheets, newspaper, discard lemon halves] [Cook wings, a cooler section, grill, heat, they, an instant-read thermometer, the flesh, 165°F, skin] [Line 24 muffin cups, paper liners] [thin slices] [Place 2 small plates, freezer] [mixture, medium-high heat] [the milk, the balance, liquid, ingredients, the recipe]
for doc in docs[:5]:
for token in doc[0]:
print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))
Toss cherry tomatoes ROOT tomato NOUN NNS Toss cherry tomatoes [with, .] with prep with ADP IN Toss cherry tomatoes [1 Tbsp] 1 Tbsp pobj tbsp PROPN NNP with [] . punct . PUNCT . Toss cherry tomatoes [] oil ROOT oil NOUN NN oil [in, ;, season, .] in prep in ADP IN oil [a small bowl] a small bowl pobj bowl NOUN NN in [] ; punct ; PUNCT : oil [] season conj season NOUN NN oil [with] with prep with ADP IN season [salt] salt pobj salt NOUN NN with [] . punct . PUNCT . oil [] Transfer ROOT transfer VERB VB Transfer [to, with, .] to prep to ADP IN Transfer [a platter] a platter pobj platter NOUN NN to [and, top] and cc and CCONJ CC a platter [] top conj top NOUN NN a platter [] with prep with ADP IN Transfer [remaining 1/2 tsp] remaining 1/2 tsp pobj tsp NOUN NN with [] . punct . PUNCT . Transfer [] lime zest ROOT zest NOUN NN lime zest [;, season, .] ; punct ; PUNCT : lime zest [] season appos season NOUN NN lime zest [with] with prep with ADP IN season [salt] salt pobj salt NOUN NN with [and, pepper] and cc and CCONJ CC salt [] pepper conj pepper NOUN NN salt [] . punct . PUNCT . lime zest [] Serve ROOT serve VERB VB Serve [with, .] with prep with ADP IN Serve [a crunchy green salad] a crunchy green salad pobj salad NOUN NN with [] . punct . PUNCT . Serve [] Immediately advmod immediately ADV RB pour [] pour ROOT pour VERB VBP pour [Immediately, entire contents, into, ,, spread, .] entire contents dobj content NOUN NNS pour [of] of prep of ADP IN entire contents [pot] pot pobj pot NOUN NN of [] into prep into ADP IN pour [a large colander] a large colander pobj colander NOUN NN into [drain] to aux to PART TO drain [] drain relcl drain VERB VB a large colander [to] , punct , PUNCT , pour [] then advmod then ADV RB spread [] spread dep spread VERB VB pour [then, out, corn] out prt out PART RP spread [] corn dobj corn NOUN NN spread [,, potatoes, ;, discard lemon halves] , punct , PUNCT , corn [] potatoes conj potato NOUN NNS corn [,, and, shrimp] , punct , PUNCT , potatoes [] and cc and CCONJ CC potatoes [] shrimp conj shrimp VERB VB potatoes [on] on prep on ADP IN shrimp [a large rimmed baking sheet] a large rimmed baking sheet pobj sheet NOUN NN on [or, sheets] or cc or CCONJ CC a large rimmed baking sheet [] sheets conj sheet NOUN NNS a large rimmed baking sheet [of] of prep of ADP IN sheets [newspaper] newspaper pobj newspaper NOUN NN of [] ; punct ; PUNCT : corn [] discard lemon halves appos half NOUN NNS corn [] . punct . PUNCT . pour [] Cook wings nsubj wing NOUN NNS inserted [] , punct , PUNCT , inserted [] moving advcl move VERB VBG inserted [to, or, reducing] to prep to ADP IN moving [a cooler section] a cooler section pobj section NOUN NN to [of] of prep of ADP IN a cooler section [grill] grill pobj grill NOUN NN of [] or cc or CCONJ CC moving [] reducing conj reduce VERB VBG moving [heat, start] heat dobj heat NOUN NN reducing [] if mark if ADP IN start [] they nsubj -PRON- PRON PRP start [] start advcl start VERB VBP reducing [if, they, burn] to aux to PART TO burn [] burn xcomp burn VERB VB start [to] , punct , PUNCT , inserted [] until mark until ADP IN cooked [] cooked advcl cook VERB VBN inserted [until, through] through prt through PART RP cooked [] , punct , PUNCT , inserted [] an instant-read thermometer nsubj thermometer NOUN NN inserted [] inserted ROOT insert VERB VBN inserted [Cook wings, ,, moving, ,, cooked, ,, an instant-read thermometer, into, but, touching, ,, and, is] into prep into ADP IN inserted [the flesh] the flesh pobj flesh NOUN NN into [] but cc but CCONJ CC inserted [] not neg not ADV RB touching [] touching conj touch VERB VBG inserted [not, registers, 165°F] the det the DET DT registers [] bone compound bone NOUN NN registers [] registers dobj register VERB VBZ touching [the, bone] 165°F dobj f PROPN NNP touching [] , punct , PUNCT , inserted [] and cc and CCONJ CC inserted [] skin nsubj skin NOUN NN is [] is conj be VERB VBZ inserted [skin, crisp, minutes, .] crisp acomp crisp ADJ JJ is [and, charred, ,] and cc and CCONJ CC crisp [] lightly advmod lightly ADV RB charred [] charred conj char VERB VBN crisp [lightly] , punct , PUNCT , crisp [] 5–10 nummod 5–10 NUM CD minutes [] minutes npadvmod minute NOUN NNS is [5–10] . punct . PUNCT . is []
import re
from collections import defaultdict
def featurize(d):
s_features = defaultdict(int)
for idx, token in enumerate(d):
if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags
s_features[FeatureName.VERB.name] += 1
# FOLLOWING_POS
# FOLLOWING_POSTAG
next_idx = idx + 1;
if next_idx < len(d):
s_features[f'{FeatureName.FOLLOWING_POS.name}_{d[next_idx].pos_}'] += 1
s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{d[next_idx].tag_}'] += 1
# PARENT_DEP
# PARENT_POS
# PARENT_POSTAG
'''
"Because the syntactic relations form a tree, every word has exactly one head.
You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
https://spacy.io/docs/usage/dependency-parse#navigating
'''
if (token.head is not token):
s_features[f'{FeatureName.PARENT_DEP.name}_{token.head.dep_.upper()}'] += 1
s_features[f'{FeatureName.PARENT_POS.name}_{token.head.pos_}'] += 1
s_features[f'{FeatureName.PARENT_POSTAG.name}_{token.head.tag_}'] += 1
# CHILD_DEP
# CHILD_POS
# CHILD_POSTAG
for child in token.children:
s_features[f'{FeatureName.CHILD_DEP.name}_{child.dep_.upper()}'] += 1
s_features[f'{FeatureName.CHILD_POS.name}_{child.pos_}'] += 1
s_features[f'{FeatureName.CHILD_POSTAG.name}_{child.tag_}'] += 1
return dict(s_features)
featuresets = [(doc[0], (featurize(doc[0]), doc[1])) for doc in docs]
from statistics import mean, median, mode, stdev
f_lengths = [len(fs[1][0]) for fs in featuresets]
print('Stats on number of features per example:')
print(f'mean: {mean(f_lengths)}')
print(f'stdev: {stdev(f_lengths)}')
print(f'median: {median(f_lengths)}')
print(f'mode: {mode(f_lengths)}')
print(f'max: {max(f_lengths)}')
print(f'min: {min(f_lengths)}')
Stats on number of features per example: mean: 24.562363715656346 stdev: 14.281121014936279 median: 24.0 mode: 0 max: 73 min: 0
featuresets[:2]
[(Toss cherry tomatoes with 1 Tbsp. oil in a small bowl; season with salt., ({}, 'pos')), (Transfer to a platter and top with remaining 1/2 tsp. lime zest; season with salt and pepper., ({'CHILD_DEP_PREP': 2, 'CHILD_DEP_PUNCT': 1, 'CHILD_POSTAG_.': 1, 'CHILD_POSTAG_IN': 2, 'CHILD_POS_ADP': 2, 'CHILD_POS_PUNCT': 1, 'FOLLOWING_POSTAG_IN': 1, 'FOLLOWING_POS_ADP': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POSTAG_VB': 1, 'PARENT_POS_VERB': 1, 'VERB': 1}, 'pos'))]
On one run, the above line printed the following featureset:
(Gather foil loosely on top and bake for 1 1/2 hours., ({}, 'pos'))
This is because the Spacy.io POS tagger provided this:
Gather/NNP foil/NN loosely/RB on/IN top/NN and/CC bake/NN for/IN 1 1/2 hours./NNS
...with no VERBs tagged, which is incorrect.
"Voting - POS taggers and classifiers" in the Next Steps/Improvements section below is meant to improve on this.
Compare to Stanford CoreNLP POS tagger:
Gather/VB foil/NN loosely/RB on/IN top/JJ and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.
And Stanford Parser:
Gather/NNP foil/VB loosely/RB on/IN top/NN and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.
random.shuffle(featuresets)
num_classes = 2
split_num = round(num_per_class*num_classes / 5)
# train and test sets
testing_set = [fs[1] for i, fs in enumerate(featuresets[:split_num])]
training_set = [fs[1] for i, fs in enumerate(featuresets[split_num:])]
print(f'# training samples: {len(training_set)}')
print(f'# test samples: {len(testing_set)}')
# training samples: 3669 # test samples: 917
# decoupling the functionality of nltk.classify.accuracy
def predict(classifier, gold, prob=True):
if (prob is True):
predictions = classifier.prob_classify_many([fs for (fs, ll) in gold])
else:
predictions = classifier.classify_many([fs for (fs, ll) in gold])
return list(zip(predictions, [ll for (fs, ll) in gold]))
def accuracy(predicts, prob=True):
if (prob is True):
correct = [label == prediction.max() for (prediction, label) in predicts]
else:
correct = [label == prediction for (prediction, label) in predicts]
if correct:
return sum(correct) / len(correct)
else:
return 0
Note below the use of DummyClassifier
to provide a simple sanity check, a baseline of random predictions. stratified
means it "generates random predictions by respecting the training set class distribution." (http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc…
If a classifier can beat the DummyClassifier
, it is at least learning something valuable! How valuable is another question...
from nltk import NaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
dummy = SklearnClassifier(DummyClassifier(strategy='stratified', random_state=0))
dummy.train(training_set)
dummy_predict = predict(dummy, testing_set)
dummy_accuracy = accuracy(dummy_predict)
print("Dummy classifier accuracy percent:", dummy_accuracy*100)
nb = NaiveBayesClassifier.train(training_set)
nb_predict = predict(nb, testing_set)
nb_accuracy = accuracy(nb_predict)
print("NaiveBayes classifier accuracy percent:", nb_accuracy*100)
multinomial_nb = SklearnClassifier(MultinomialNB())
multinomial_nb.train(training_set)
mnb_predict = predict(multinomial_nb, testing_set)
mnb_accuracy = accuracy(mnb_predict)
print("MultinomialNB classifier accuracy percent:", mnb_accuracy*100)
bernoulli_nb = SklearnClassifier(BernoulliNB())
bernoulli_nb.train(training_set)
bnb_predict = predict(bernoulli_nb, testing_set)
bnb_accuracy = accuracy(bnb_predict)
print("BernoulliNB classifier accuracy percent:", bnb_accuracy*100)
# ??logistic_regression._clf
# sklearn.svm.LinearSVC : learns SVM models using the same algorithm.
logistic_regression = SklearnClassifier(LogisticRegressionCV())
logistic_regression.train(training_set)
lr_predict = predict(logistic_regression, testing_set)
lr_accuracy = accuracy(lr_predict)
print("LogisticRegressionCV classifier accuracy percent:", lr_accuracy*100)
# ??sgd._clf
# The 'log' loss gives logistic regression, a probabilistic classifier.
# ??linear_svc._clf
# can optimize the same cost function as LinearSVC
# by adjusting the penalty and loss parameters. In addition it requires
# less memory, allows incremental (online) learning, and implements
# various loss functions and regularization regimes.
sgd = SklearnClassifier(SGDClassifier(loss='log'))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print("SGD classifier accuracy percent:", sgd_accuracy*100)
# slow
# using libsvm with kernel 'rbf' (radial basis function)
svc = SklearnClassifier(SVC(probability=True))
svc.train(training_set)
svc_predict = predict(svc, testing_set)
svc_accuracy = accuracy(svc_predict)
print("SVC classifier accuracy percent:", svc_accuracy*100)
# ??linear_svc._clf
# Similar to SVC with parameter kernel='linear', but implemented in terms of
# liblinear rather than libsvm, so it has more flexibility in the choice of
# penalties and loss functions and should scale better to large numbers of
# samples.
# Prefer dual=False when n_samples > n_features.
# Using CalibratedClassifierCV as wrapper to get predict probabilities (https://stackoverflow.com/a/39712590)
linear_svc = SklearnClassifier(CalibratedClassifierCV(LinearSVC(dual=False)))
linear_svc.train(training_set)
linear_svc_predict = predict(linear_svc, testing_set)
linear_svc_accuracy = accuracy(linear_svc_predict)
print("LinearSVC classifier accuracy percent:", linear_svc_accuracy*100)
# slower
dt = DecisionTreeClassifier.train(training_set)
dt_predict = predict(dt, testing_set, False)
dt_accuracy = accuracy(dt_predict, False)
print("DecisionTree classifier accuracy percent:", dt_accuracy*100)
random_forest = SklearnClassifier(RandomForestClassifier(n_estimators = 100))
random_forest.train(training_set)
rf_predict = predict(random_forest, testing_set)
rf_accuracy = accuracy(rf_predict)
print("RandomForest classifier accuracy percent:", rf_accuracy*100)
Dummy classifier accuracy percent: 49.836423118865866 NaiveBayes classifier accuracy percent: 68.92039258451473 MultinomialNB classifier accuracy percent: 79.17121046892039 BernoulliNB classifier accuracy percent: 78.08069792802618 LogisticRegressionCV classifier accuracy percent: 83.31515812431843 SGD classifier accuracy percent: 80.91603053435115 SVC classifier accuracy percent: 82.11559432933478 LinearSVC classifier accuracy percent: 83.20610687022901 DecisionTree classifier accuracy percent: 77.42639040348965 RandomForest classifier accuracy percent: 83.53326063249727
sgd
classifiers improves with epochs. ??sgd._clf
tells us that the default number of epochs n_iter
is 5. So let's run more epochs. Also not that the training_set shuffle is True
by default.
num_epochs = 1000
sgd = SklearnClassifier(SGDClassifier(loss='log', n_iter=num_epochs))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print(f"SGDClassifier classifier accuracy percent (epochs: {num_epochs}):", sgd_accuracy*100)
SGDClassifier classifier accuracy percent (epochs: 1000): 83.96946564885496
Fortunately, 1000 epochs run very quickly! And SGDClassifier
performance has improved with more iterations.
Also note that we can set warm_start
to True
if we want to take advantage of online learning and reuse the solution of the previous call.
Next we perform 1) grid search to find optimal hyperparameters, and 2) cross-validation to evaluate performance over multiple folds of the data (to avoid overfitting).
http://scikit-learn.org/stable/modules/grid_search.html#grid-search
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
# https://stackoverflow.com/a/16388804
from sklearn.model_selection import KFold
from sklearn.base import clone
from numpy import zeros
def cross_val(name, model, debug=True):
num_splits = 3
original_clf = clone(model._clf)
cvidx = KFold(n_splits=num_splits, shuffle=True).split(training_set)
nested_acc = zeros(num_splits)
i=0
for trainidx, testidx in cvidx:
model._clf = clone(original_clf) # we clone the estimator to make sure that all the folds are independent
classifier = model.train(training_set[trainidx[0]:trainidx[len(trainidx)-1]])
pred = predict(classifier, training_set[testidx[0]:testidx[len(testidx)-1]])
nested_acc[i] = accuracy(pred)
i += 1
if debug == True:
print(f"{name} CV accuracies:", nested_acc)
return nested_acc.mean()
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# Set the parameters by cross-validation
model_parameters = [{'LinearSVC': [{
'loss': ['hinge'],
'dual': [True],
'penalty': ['l2'],
'tol': [1e-3, 1e-4, 1e-5],
'max_iter': [1000, 10000],
'C': [100.0, 1.0, 0.01]
},
{
'loss': ['squared_hinge'],
'dual': [False, True],
'penalty': ['l2'],
'tol': [1e-3, 1e-4, 1e-5],
'max_iter': [1000, 10000],
'C': [100.0, 1.0, 0.01]
}]},
{'LogisticRegression': [{
'penalty': ['l1'],
'dual': [False],
'C': [100.0, 1.0, 0.01],
'solver': ['liblinear']
},
{
'penalty': ['l2'],
'dual': [False, True],
'C': [100.0, 1.0, 0.01],
'max_iter': [100, 1000],
'solver': ['liblinear'],
'tol': [1e-3, 1e-4, 1e-5]
},
{
'penalty': ['l2'],
'dual': [False],
'C': [100.0, 1.0, 0.01],
'max_iter': [100, 1000],
'solver': ['newton-cg', 'lbfgs', 'sag'],
'tol': [1e-3, 1e-4, 1e-5]
}]},
{'SGD': [{
'penalty': ['l1', 'l2', 'elasticnet'],
'alpha': [1e-3, 1e-4, 1e-5],
'average': [True, False],
'n_iter': [100, 1000, 10000]
}]},
{'RandomForest': [{
'n_estimators': [10, 100, 1000],
'criterion': ['gini', 'entropy'],
'max_features': ['auto', 'log2', None],
'oob_score': [True, False]
}]}]
score = 'roc_auc'
for i, model_param in enumerate(model_parameters):
model = [key for i, key in enumerate(model_param)][0]
print(f"# {model}: Tuning hyper-parameters for {score}")
print()
if model == 'LinearSVC':
clf = LinearSVC()
elif model == 'LogisticRegression':
clf = LogisticRegression()
elif model == 'SGD':
clf = SGDClassifier(loss='log')
elif model == 'RandomForest':
clf = RandomForestClassifier()
else:
raise Exception('%s model needs to be added to the if-block' % model)
grid = SklearnClassifier(GridSearchCV(clf, model_param[model], cv=5,
scoring=score, n_jobs=-1))
grid.train(training_set)
print("Best parameters set found on development set:")
print()
print(grid._clf.best_params_)
mean = grid._clf.cv_results_['mean_test_score'][grid._clf.best_index_]
std = grid._clf.cv_results_['std_test_score'][grid._clf.best_index_]
print("roc_auc: %0.3f (+/-%0.03f)" % (mean, std * 2))
print()
if model == 'LinearSVC':
# Wrapping LinearSVC in CalibratedClassifierCV to add support for probability prediction
# Note that there is a difference in accuracies between raw GridSearchCV and calibrated GridSearchCV
# However, I'm willing to sacrifice the potential 'best' result from raw in order to output probabilities
grid_calibrated = SklearnClassifier(CalibratedClassifierCV(grid._clf.best_estimator_, cv=None))
grid_calibrated.train(training_set)
gridc_predict = predict(grid_calibrated, testing_set)
gridc_accuracy = accuracy(gridc_predict)
print(f"{model} (calibrated) classifier accuracy percent:", gridc_accuracy*100)
grid_predict = predict(grid, testing_set, False)
grid_accuracy = accuracy(grid_predict, False)
print(f"{model} (raw) classifier accuracy percent:", grid_accuracy*100)
# CV after parameter optimization
cv_acc = cross_val(model, grid_calibrated)
print(f"{model} (calibrated) CV classifier avg accuracy percent:", cv_acc*100)
linear_svc_opt = grid_calibrated
linear_svc_predict = gridc_predict
else:
grid_predict = predict(grid, testing_set)
grid_accuracy = accuracy(grid_predict)
print(f"{model} classifier accuracy percent:", grid_accuracy*100)
# CV after parameter optimization
cv_acc = cross_val(model, grid)
print(f"{model} CV classifier avg accuracy percent:", cv_acc*100)
if model == 'LogisticRegression':
logistic_regression_opt = grid
lr_predict = grid_predict
elif model == 'SGD':
sgd_opt = grid
sgd_predict = grid_predict
elif model == 'RandomForest':
random_forest_opt = grid
rf_predict = grid_predict
else:
raise Exception('%s model was not run through Grid Search' % model)
print()
# LinearSVC: Tuning hyper-parameters for roc_auc Best parameters set found on development set: {'C': 0.01, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000, 'penalty': 'l2', 'tol': 0.001} roc_auc: 0.934 (+/-0.019) LinearSVC (calibrated) classifier accuracy percent: 83.20610687022901 LinearSVC (raw) classifier accuracy percent: 82.87895310796074 LinearSVC CV accuracies: [ 0.85550082 0.85566166 0.85496183] LinearSVC (calibrated) CV classifier avg accuracy percent: 85.5374772491 # LogisticRegression: Tuning hyper-parameters for roc_auc Best parameters set found on development set: {'C': 1.0, 'dual': False, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.001} roc_auc: 0.937 (+/-0.019) LogisticRegression classifier accuracy percent: 83.53326063249727
C:\Users\narho_000\Anaconda3\lib\site-packages\sklearn\linear_model\sag.py:286: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge "the coef_ did not converge", ConvergenceWarning)
LogisticRegression CV accuracies: [ 0.86619334 0.86637578 0.86630286] LogisticRegression CV classifier avg accuracy percent: 86.6290661978 # SGD: Tuning hyper-parameters for roc_auc Best parameters set found on development set: {'alpha': 0.0001, 'average': True, 'n_iter': 100, 'penalty': 'l2'} roc_auc: 0.937 (+/-0.018) SGD classifier accuracy percent: 82.76990185387132 SGD CV accuracies: [ 0.86724939 0.86743044 0.8672301 ] SGD CV classifier avg accuracy percent: 86.7303308486 # RandomForest: Tuning hyper-parameters for roc_auc Best parameters set found on development set: {'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 1000, 'oob_score': True} roc_auc: 0.936 (+/-0.016) RandomForest classifier accuracy percent: 83.86041439476554 RandomForest CV accuracies: [ 0.94050218 0.94011485 0.94055086] RandomForest CV classifier avg accuracy percent: 94.0389296885
We're going to create an ensemble classifier by letting our top-performing classifiers, which consistently perform with >80% accuracy — LogisticRegression
, LinearSVC
, SGD
, and RandomForest
(excluding SVC
due to its slowness) — vote on each prediction.
from sklearn.ensemble import VotingClassifier
voting = SklearnClassifier(VotingClassifier(estimators=[
('lr', logistic_regression_opt._clf),
('linear_svc', linear_svc_opt._clf),
('sgd', sgd_opt._clf),
('rf', random_forest_opt._clf)
], voting='soft', weights=[1,1,1,3], n_jobs=-1))
voting.train(training_set)
voting_predict = predict(voting, testing_set)
voting_accuracy = accuracy(voting_predict)
print("Soft voting classifier accuracy percent:", voting_accuracy*100)
# CV after parameter optimization
voting_acc = cross_val("Soft voting", voting)
print(f"Soft voting CV classifier accuracy percent:", voting_acc*100)
Soft voting classifier accuracy percent: 83.96946564885496 Soft voting CV accuracies: [ 0.90834697 0.90854491 0.90673575] Soft voting CV classifier accuracy percent: 90.7875877339
Similarly to the voting model, we're also going to scope analysis down to our top-performing classifiers. We'll include the Voting
model itself, and then Dummy
as a baseline.
# https://stackoverflow.com/a/11140887
def show_most_informative_features(vectorizer, clf, n=20):
feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
top = zip(coefs_with_fns[:round(n/2)], coefs_with_fns[:-(round(n/2) + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
print('SGD')
show_most_informative_features(sgd._vectorizer, sgd._clf, 15)
print()
print('Logistic Regression')
show_most_informative_features(logistic_regression._vectorizer, logistic_regression._clf, 15)
SGD -3.3820 FOLLOWING_POSTAG_WRB 2.9712 CHILD_POSTAG_-LRB- -2.1901 CHILD_DEP_AGENT 2.5766 FOLLOWING_POSTAG_JJS -2.0411 CHILD_DEP_DET 2.1984 FOLLOWING_POSTAG_VBZ -2.0337 FOLLOWING_POSTAG_JJR 1.8884 CHILD_DEP_NPADVMOD -2.0008 CHILD_DEP_NEG 1.8787 PARENT_DEP_PARATAXIS -1.7942 CHILD_DEP_RELCL 1.7963 CHILD_DEP_APPOS -1.7393 FOLLOWING_POSTAG_VB 1.7699 FOLLOWING_POSTAG_RB -1.7151 CHILD_POSTAG_HYPH 1.7124 CHILD_POSTAG_PDT Logistic Regression -1.4486 CHILD_DEP_NEG 1.4545 CHILD_POSTAG_-LRB- -1.4322 CHILD_DEP_AGENT 1.3721 CHILD_DEP_NPADVMOD -1.0650 PARENT_POSTAG_VBZ 1.1594 PARENT_POSTAG_VB -1.0267 CHILD_POSTAG_WDT 1.0527 CHILD_POSTAG_-RRB- -1.0153 CHILD_DEP_NSUBJ 1.0491 CHILD_DEP_MARK -0.9563 PARENT_DEP_XCOMP 0.9183 CHILD_POSTAG_VB -0.9126 CHILD_DEP_DET 0.9078 CHILD_POS_PROPN -0.8719 CHILD_DEP_AUX 0.9077 CHILD_POSTAG_NNP
Note: Because CalibratedClassifierCV
has no attribute coef_
, we cannot show the most informative features for LinearSVC
while it's wrapped. Random Forest
and Voting
also lack coef_
.
spacy.explain("JJS")
'adjective, superlative'
Negative coefficients:
AGENT
: "used for agents of passive verbs" - interpreting this to mean that existence of passive verbs (i.e., the opposite of active verbs) means negative correlation with it being imperativeWRB
: "wh-adverb" (where, when)AMOD
: "any adjective or adjectival phrase that serves to modify the meaning" of the verbPositive coefficients:
-RRB-
: "right round bracket"PROPN
: "proper noun"NNP
: "noun, proper singular"http://scikit-learn.org/stable/modules/model_evaluation.html
from sklearn import metrics
def classification_report(predict, prob=True):
predictions, labels = zip(*predict)
if prob is True:
return metrics.classification_report(labels, [p.max() for p in predictions])
else:
return metrics.classification_report(labels, predictions)
def confusion_matrix(predict, prob=True, print_layout=False):
predictions, labels = zip(*predict)
if print_layout is True:
print('Layout\n[[tn fp]\n [fn tp]]\n')
if prob is True:
return metrics.confusion_matrix(labels, [p.max() for p in predictions])
else:
return metrics.confusion_matrix(labels, predictions)
def log_loss(predict):
predictions, labels = zip(*predict)
return metrics.log_loss(labels, [p.prob('pos') for p in predictions])
def roc_auc_score(predict):
predictions, labels = zip(*predict)
# need to convert labels to binary classification of 0 or 1
return metrics.roc_auc_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions], average='weighted')
def precision_recall_curve(predict):
predictions, labels = zip(*predict)
return metrics.precision_recall_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')
def average_precision_score(predict):
predictions, labels = zip(*predict)
return metrics.average_precision_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions])
def roc_curve(predict):
predictions, labels = zip(*predict)
return metrics.roc_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')
print('SGD')
print(classification_report(sgd_predict))
print()
print('Logistic Regression')
print(classification_report(lr_predict))
print()
print('LinearSVC')
print(classification_report(linear_svc_predict))
print()
print('Random Forest')
print(classification_report(rf_predict))
SGD precision recall f1-score support neg 0.87 0.79 0.83 480 pos 0.79 0.87 0.83 437 avg / total 0.83 0.83 0.83 917 Logistic Regression precision recall f1-score support neg 0.88 0.79 0.83 480 pos 0.79 0.89 0.84 437 avg / total 0.84 0.84 0.84 917 LinearSVC precision recall f1-score support neg 0.88 0.78 0.83 480 pos 0.79 0.89 0.83 437 avg / total 0.84 0.83 0.83 917 Random Forest precision recall f1-score support neg 0.88 0.80 0.84 480 pos 0.80 0.89 0.84 437 avg / total 0.84 0.84 0.84 917
print('Voting')
print(classification_report(voting_predict))
Voting precision recall f1-score support neg 0.88 0.80 0.84 480 pos 0.80 0.88 0.84 437 avg / total 0.84 0.84 0.84 917
print('Layout\n[[tn fp]\n [fn tp]]\n')
print('SGD')
print(confusion_matrix(sgd_predict))
print()
print('Logistic Regression')
print(confusion_matrix(lr_predict))
print()
print('LinearSVC')
print(confusion_matrix(linear_svc_predict))
print()
print('Random Forest')
print(confusion_matrix(rf_predict))
Layout [[tn fp] [fn tp]] SGD [[379 101] [ 57 380]] Logistic Regression [[379 101] [ 50 387]] LinearSVC [[375 105] [ 49 388]] Random Forest [[382 98] [ 50 387]]
print('Voting')
print(confusion_matrix(voting_predict))
Voting [[384 96] [ 51 386]]
The lower the better for log_loss
...
print(f'SGD: {log_loss(sgd_predict)}')
print(f'Logistic Regression: {log_loss(lr_predict)}')
print(f'LinearSVC: {log_loss(linear_svc_predict)}')
print(f'Random Forest: {log_loss(rf_predict)}')
SGD: 0.5408885923854805 Logistic Regression: 0.3269831271366439 LinearSVC: 0.34151481635171227 Random Forest: 0.3557548099256462
print(f'Voting: {log_loss(voting_predict)}')
Voting: 0.3058827183576675
The higher the better for roc_auc_score
...
print(f'SGD: {roc_auc_score(sgd_predict)}')
print(f'Logistic Regression: {roc_auc_score(lr_predict)}')
print(f'LinearSVC: {roc_auc_score(linear_svc_predict)}')
print(f'Random Forest: {roc_auc_score(rf_predict)}')
SGD: 0.9333333333333333 Logistic Regression: 0.9354405034324942 LinearSVC: 0.9284610983981694 Random Forest: 0.9428751906941265
print(f'Voting: {roc_auc_score(voting_predict)}')
Voting: 0.9454185736079329
sample_tasks = ["Mow lawn", "Mow the lawn", "Buy new shoes", "Feed the dog", "Send report to Kyle", "Send the report to Kyle", "Peel the potatoes"]
features = [featurize(nlp(task)) for task in sample_tasks]
tasks_dummy = [(l, p.prob('pos')*1.0) for l, p in zip(dummy.classify_many(features), dummy.prob_classify_many(features))]
tasks_logistic = [(l, p.prob('pos')) for l,p in zip(logistic_regression_opt.classify_many(features), logistic_regression_opt.prob_classify_many(features))]
tasks_linear_svc = [(l, p.prob('pos')) for l,p in zip(linear_svc_opt.classify_many(features), linear_svc_opt.prob_classify_many(features))]
tasks_sgd = [(l, p.prob('pos')) for l,p in zip(sgd_opt.classify_many(features), sgd_opt.prob_classify_many(features))]
tasks_rf = [(l, p.prob('pos')) for l,p in zip(random_forest_opt.classify_many(features), random_forest_opt.prob_classify_many(features))]
tasks_voting = [(l, p.prob('pos')) for l,p in zip(voting.classify_many(features), voting.prob_classify_many(features))]
print(f'Dummy: {tasks_dummy}')
print(f'LogisticRegression: {tasks_logistic}')
print(f'LinearSVC: {tasks_linear_svc}')
print(f'SGD: {tasks_sgd}')
print(f'Random Forest: {tasks_rf}')
print()
print(f'Voting: {tasks_voting}')
Dummy: [('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('pos', 1.0), ('neg', 0.0), ('pos', 1.0)] LogisticRegression: [('pos', 0.5177528042129399), ('pos', 0.5177528042129399), ('pos', 0.95041384668781259), ('pos', 0.9042204368988711), ('pos', 0.77613792093402312), ('pos', 0.74783727994957849), ('pos', 0.90181743756911736)] LinearSVC: [('pos', 0.56492618977808062), ('pos', 0.56492618977808062), ('pos', 0.92860840939215505), ('pos', 0.8805809957950842), ('pos', 0.82874640690443713), ('pos', 0.83449212505086656), ('pos', 0.8889786555784075)] SGD: [('pos', 0.53547937378473365), ('pos', 0.53547937378473365), ('pos', 0.99992355013586687), ('pos', 0.99902040085679145), ('pos', 0.99105878886231658), ('pos', 0.98526269258531807), ('pos', 0.99895146721659678)] Random Forest: [('pos', 0.50561516177492483), ('pos', 0.50561516177492483), ('pos', 0.96279172244353028), ('pos', 0.97799999999999998), ('pos', 0.95699999999999996), ('pos', 0.94999999999999996), ('pos', 0.96999999999999997)] Voting: [('pos', 0.52202910464797891), ('pos', 0.52202910464797891), ('pos', 0.95976101963659921), ('pos', 0.95395871789375353), ('pos', 0.9032303318255851), ('pos', 0.89501175259253529), ('pos', 0.95289944311959662)]
import matplotlib.pyplot as plt
precision, recall, prc_thresholds = precision_recall_curve(voting_predict)
average_precision = average_precision_score(voting_predict)
plt.figure()
plt.step(recall, precision, color='b', alpha=0.2,
where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
average_precision))
plt.show()
fpr, tpr, roc_thresholds = roc_curve(voting_predict)
area = roc_auc_score(voting_predict)
plt.figure()
plt.step(fpr, tpr, color='b', alpha=0.2,
where='post')
plt.fill_between(fpr, tpr, step='post', alpha=0.2,
color='b')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Receiving Operating Characteristic: area={0:0.2f}'.format(
area))
plt.show()
Considered a bad idea to actually adjust predictions based on optimal Threshold
from holdout test data curves - it's a form of overfitting on the test set: https://stackoverflow.com/questions/32627926/scikit-changing-the-threshold-to-create-multiple-confusion-matrixes (although using ROC to do this might be ok? or on cross-validated training data? https://stackoverflow.com/a/35300649)
import pickle
print ("Exporting the voting model to model.v2.pkl")
with open('model.v2.pkl', 'wb') as f:
pickle.dump(voting, f)
Exporting the voting model to model.v2.pkl
# load the model back into memory
print("Importing the model from model.v2.pkl")
with open('model.v2.pkl', 'rb') as f:
loaded_clf = pickle.load(f)
# predict on a new sample
task_new = 'Buy ice cream'
print ('New sample: {}'.format(task_new))
# score on the new sample
features = featurize(nlp(task_new));
predict = [(l, p.prob('pos')) for l,p in zip(loaded_clf.classify_many(features), loaded_clf.prob_classify_many(features))]
print('Predicted class is {}'.format(predict[0]))
Importing the model from model.v2.pkl New sample: Buy ice cream Predicted class is ('pos', 0.96540602930541619)
I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the Stanford CoreNLP toolkit and its associated software to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.
I might return this way if I want to improve results/implement a voter system between the various linguistic and classification methods later.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentences = [s for l in lines for s in sent_tokenize(l)] # punkt
sentences
tagged_sentences = []
for s in sentences:
words = word_tokenize(s)
tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
tagged_sentences.append(tagged)
print(tagged_sentences)
Run down to the shop, will you, Peter
is parsed unexpectedly by nltk.pos_tag
:
[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]
Run
is tagged as a NNP (proper noun, singular)
I expected an output more like what the Stanford Parser provides:
Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP
Run
is tagged as a VGB (verb, gerund/present participle)
- still not quite the VB
I want, but at least it's a V*
MEANWHILE...
nltk.pos_tag
did better with:
[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]
Compared to Stanford CoreNLP (note that this is different than what Stanford Parser outputs):
(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))
Concern: clean as VB (verb, base form)
vs JJ (adjective)
IMPROVE POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.
Note what Spacy POS tagger did with Run down to the shop, will you Peter
:
Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP
where `Run` is the `VB` I expected from POS tagging (compared to `nltk.pos_tag` result of `NNP`). Also note that Spacy collapses `the shop` into a single unit, which should be helpful during featurization.
import re
from collections import defaultdict
featuresets = []
for ts in tagged_sentences:
s_features = defaultdict(int)
for idx, tup in enumerate(ts):
#print(tup)
pos = tup[1]
# FeatureName.VERB
is_verb = re.match(r'VB.?', pos) is not None
print(tup, is_verb)
if is_verb:
s_features[FeatureName.VERB] += 1
# FOLLOWING_POS
next_idx = idx + 1;
if next_idx < len(ts):
s_features[f'{FeatureName.FOLLOWING}_{ts[next_idx][1]}'] += 1
# VERB_MODIFIER
# VERB_MODIFYING
else:
s_features[FeatureName.VERB] = 0
featuresets.append(dict(s_features))
print()
print(featuresets)
Setup guide used: https://stackoverflow.com/a/34112695
# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer