Notebook

NLP for Task Classification¶

Hypothesis: Part of Speech (POS) tagging and syntactic dependency parsing provides valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.

Some Terminology¶

Imperative mood is "used principally for ordering, requesting or advising the listener to do (or not to do) something... also often used for giving instructions as to how to perform a task."
Part of speech (POS) is a way of categorizing a word based on its syntactic function.
- The POS tagger from Spacy.io that is used in this notebook differentiates between pos_ and tag_ - POS (pos_) refers to "coarse-grained part-of-speech" like VERB, ADJ, or PUNCT; and POSTAG (tag_) refers to "fine-grained part-of-speech" like VB, JJ, or ..
Syntactic dependency parsing is a way of connecting words based on syntactic relationships, such as DOBJ (direct object), PREP (prepositional modifier), or POBJ (object of preposition).
- Check out the dependency parse of the phrase "Send the report to Kyle by tomorrow" as an example.

Proposed Features¶

The imperative mood centers around actions, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:

FeatureName.VERB: Does the phrase contain VERB(s) of the tag form VB*?
FeatureName.FOLLOWING_POS: Are the words following the VERB(s) of certain parts of speech?
FeatureName.FOLLOWING_POSTAG: Are the words following the VERB(s) of certain POS tags?
FeatureName.CHILD_DEP: Are the VERB(s) parents of certain syntactic dependencies?
FeatureName.PARENT_DEP: Are the VERB(s) children of certain syntactic dependencies?
FeatureName.CHILD_POS: Are the syntactic dependencies that the VERB(s) are children of of certain parts of speech?
FeatureName.CHILD_POSTAG: Are the syntactic dependencies that the VERB(s) are children of of certain POS tags?
FeatureName.PARENT_POS: Are the syntactic dependencies that the VERB(s) parent of certain parts of speech?
FeatureName.PARENT_POSTAG: Are the syntactic dependencies that the VERB(s) parent of certain POS tags?

Notes:

Features 2-9 all depend on feature 1 between True; if False, phrase vectorization will result in all zeroes.
When features 2-9 are applied to actual phrases, they will append identifying informating about the feature in the form of _* (e.g., FeatureName.FOLLOWING_POSTAG_WRB).

Data and Setup¶

Building a recipe corpus¶

I wrote and ran epicurious_recipes.py* to scrape Epicurious.com for recipe instructions and descriptions. I then performed some manual cleanup of the script results. Output is in epicurious-pos.txt and epicurious-neg.txt.

* script (very) loosely based off of https://github.com/benosment/hrecipe-parse

Note that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.

To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.

This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available. Further, most positive examples are recipe instructions - also a specific (and not necessarily related to the main "task" category) category of text.

Ultimately though, this recipe corpus is a stopgap/proof of concept for a corpus more relevant to tasks later on, so I won't worry further about this for now.

In [1]:

import os
from pandas import read_csv
from numpy import random

In [2]:

BASE_DIR = os.getcwd()
data_path = BASE_DIR + '/data.tsv'

In [3]:

df = read_csv(data_path, sep='\t', header=None, names=['Text', 'Label'])
df.head()

Out[3]:

	Text	Label
0	Be kind	pos
1	Get out of here	pos
2	Look this over	pos
3	Paul, do your homework now	pos
4	Do not clean soot off the window	pos

In [4]:

pos_data_split = list(df.loc[df.Label == 'pos'].Text)
neg_data_split = list(df.loc[df.Label == 'neg'].Text)

num_pos = len(pos_data_split)
num_neg = len(neg_data_split)

# 50/50 split between the number of positive and negative samples
num_per_class = num_pos if num_pos < num_neg else num_neg

# shuffle samples
random.shuffle(pos_data_split)
random.shuffle(neg_data_split)

In [5]:

lines = []
for l in pos_data_split[:num_per_class]:
    lines.append((l, 'pos'))
for l in neg_data_split[:num_per_class]:
    lines.append((l, 'neg'))

In [6]:

# Features as defined in the introduction
from enum import Enum, auto
class FeatureName(Enum):
    VERB = auto()
    FOLLOWING_POS = auto()
    FOLLOWING_POSTAG = auto()
    CHILD_DEP = auto()
    PARENT_DEP = auto()
    CHILD_POS = auto()
    CHILD_POSTAG = auto()
    PARENT_POS = auto()
    PARENT_POSTAG = auto()

spaCy.io for NLP¶

Because Stanford CoreNLP is hard to install for Python

Found Spacy through an article on "Training a Classifier for Relation Extraction from Medical Literature" (GitHub)

NLTK library comparison chart https://spacy.io/docs/api/#comparison

In [7]:

#!conda config --add channels conda-forge
#!conda install spacy
#!python -m spacy download en

Using the Spacy Data Model for NLP¶

In [8]:

import spacy
# slow
nlp = spacy.load('en')

Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\n' will start a new Spacy Doc.

TODO: Improvement to Doc.sents? "To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse."

In [9]:

def create_spacy_docs(ll):
    dd = [(nlp(l[0]), l[1]) for l in ll]
    # collapse noun phrases into single compounds
    for d in dd:
        for np in d[0].noun_chunks:
            np.merge(tag=np.root.tag_, ent_type=np.root.ent_type_, lemma=np.root.lemma_)
    return dd

In [10]:

# slower
docs = create_spacy_docs(lines)

NLP output¶

Tokenization, POS tagging, and dependency parsing happened automatically with the nlp(line) calls above! So let's look at the outputs.

https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward

In [11]:

for doc in docs[:10]:
    print(list(doc[0].sents))

[Toss cherry tomatoes with 1 Tbsp., oil in a small bowl; season with salt.]
[Transfer to a platter and top with remaining 1/2 tsp., lime zest; season with salt and pepper.]
[Serve with a crunchy green salad.]
[Immediately pour entire contents of pot into a large colander to drain, then spread out corn, potatoes, and shrimp on a large rimmed baking sheet or sheets of newspaper; discard lemon halves.]
[Cook wings, moving to a cooler section of grill or reducing heat if they start to burn, until cooked through, an instant-read thermometer inserted into the flesh but not touching the bone registers 165°F, and skin is crisp and lightly charred, 5–10 minutes.]
[Line 24 muffin cups with paper liners.]
[Cut into thin slices crosswise.]
[Place 2 small plates in freezer to chill.]
[Bring mixture to boil, stirring often, over medium-high heat.]
[Don’t omit the milk, however, as this will change the balance of liquid to dry ingredients in the recipe.]

In [12]:

for doc in docs[:10]:
    print(list(doc[0].noun_chunks))

[Toss cherry tomatoes, 1 Tbsp, oil, a small bowl, season, salt]
[a platter, top, remaining 1/2 tsp, lime zest, season, salt, pepper]
[a crunchy green salad]
[entire contents, pot, a large colander, corn, potatoes, a large rimmed baking sheet, sheets, newspaper, discard lemon halves]
[Cook wings, a cooler section, grill, heat, they, an instant-read thermometer, the flesh, 165°F, skin]
[Line 24 muffin cups, paper liners]
[thin slices]
[Place 2 small plates, freezer]
[mixture, medium-high heat]
[the milk, the balance, liquid, ingredients, the recipe]

Spacy's dependency graph visualization

In [13]:

for doc in docs[:5]:
    for token in doc[0]:
        print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))

Toss cherry tomatoes ROOT tomato NOUN NNS Toss cherry tomatoes [with, .]
with prep with ADP IN Toss cherry tomatoes [1 Tbsp]
1 Tbsp pobj tbsp PROPN NNP with []
. punct . PUNCT . Toss cherry tomatoes []
oil ROOT oil NOUN NN oil [in, ;, season, .]
in prep in ADP IN oil [a small bowl]
a small bowl pobj bowl NOUN NN in []
; punct ; PUNCT : oil []
season conj season NOUN NN oil [with]
with prep with ADP IN season [salt]
salt pobj salt NOUN NN with []
. punct . PUNCT . oil []
Transfer ROOT transfer VERB VB Transfer [to, with, .]
to prep to ADP IN Transfer [a platter]
a platter pobj platter NOUN NN to [and, top]
and cc and CCONJ CC a platter []
top conj top NOUN NN a platter []
with prep with ADP IN Transfer [remaining 1/2 tsp]
remaining 1/2 tsp pobj tsp NOUN NN with []
. punct . PUNCT . Transfer []
lime zest ROOT zest NOUN NN lime zest [;, season, .]
; punct ; PUNCT : lime zest []
season appos season NOUN NN lime zest [with]
with prep with ADP IN season [salt]
salt pobj salt NOUN NN with [and, pepper]
and cc and CCONJ CC salt []
pepper conj pepper NOUN NN salt []
. punct . PUNCT . lime zest []
Serve ROOT serve VERB VB Serve [with, .]
with prep with ADP IN Serve [a crunchy green salad]
a crunchy green salad pobj salad NOUN NN with []
. punct . PUNCT . Serve []
Immediately advmod immediately ADV RB pour []
pour ROOT pour VERB VBP pour [Immediately, entire contents, into, ,, spread, .]
entire contents dobj content NOUN NNS pour [of]
of prep of ADP IN entire contents [pot]
pot pobj pot NOUN NN of []
into prep into ADP IN pour [a large colander]
a large colander pobj colander NOUN NN into [drain]
to aux to PART TO drain []
drain relcl drain VERB VB a large colander [to]
, punct , PUNCT , pour []
then advmod then ADV RB spread []
spread dep spread VERB VB pour [then, out, corn]
out prt out PART RP spread []
corn dobj corn NOUN NN spread [,, potatoes, ;, discard lemon halves]
, punct , PUNCT , corn []
potatoes conj potato NOUN NNS corn [,, and, shrimp]
, punct , PUNCT , potatoes []
and cc and CCONJ CC potatoes []
shrimp conj shrimp VERB VB potatoes [on]
on prep on ADP IN shrimp [a large rimmed baking sheet]
a large rimmed baking sheet pobj sheet NOUN NN on [or, sheets]
or cc or CCONJ CC a large rimmed baking sheet []
sheets conj sheet NOUN NNS a large rimmed baking sheet [of]
of prep of ADP IN sheets [newspaper]
newspaper pobj newspaper NOUN NN of []
; punct ; PUNCT : corn []
discard lemon halves appos half NOUN NNS corn []
. punct . PUNCT . pour []
Cook wings nsubj wing NOUN NNS inserted []
, punct , PUNCT , inserted []
moving advcl move VERB VBG inserted [to, or, reducing]
to prep to ADP IN moving [a cooler section]
a cooler section pobj section NOUN NN to [of]
of prep of ADP IN a cooler section [grill]
grill pobj grill NOUN NN of []
or cc or CCONJ CC moving []
reducing conj reduce VERB VBG moving [heat, start]
heat dobj heat NOUN NN reducing []
if mark if ADP IN start []
they nsubj -PRON- PRON PRP start []
start advcl start VERB VBP reducing [if, they, burn]
to aux to PART TO burn []
burn xcomp burn VERB VB start [to]
, punct , PUNCT , inserted []
until mark until ADP IN cooked []
cooked advcl cook VERB VBN inserted [until, through]
through prt through PART RP cooked []
, punct , PUNCT , inserted []
an instant-read thermometer nsubj thermometer NOUN NN inserted []
inserted ROOT insert VERB VBN inserted [Cook wings, ,, moving, ,, cooked, ,, an instant-read thermometer, into, but, touching, ,, and, is]
into prep into ADP IN inserted [the flesh]
the flesh pobj flesh NOUN NN into []
but cc but CCONJ CC inserted []
not neg not ADV RB touching []
touching conj touch VERB VBG inserted [not, registers, 165°F]
the det the DET DT registers []
bone compound bone NOUN NN registers []
registers dobj register VERB VBZ touching [the, bone]
165°F dobj f PROPN NNP touching []
, punct , PUNCT , inserted []
and cc and CCONJ CC inserted []
skin nsubj skin NOUN NN is []
is conj be VERB VBZ inserted [skin, crisp, minutes, .]
crisp acomp crisp ADJ JJ is [and, charred, ,]
and cc and CCONJ CC crisp []
lightly advmod lightly ADV RB charred []
charred conj char VERB VBN crisp [lightly]
, punct , PUNCT , crisp []
5–10 nummod 5–10 NUM CD minutes []
minutes npadvmod minute NOUN NNS is [5–10]
. punct . PUNCT . is []

Featurization¶

In [14]:

import re
from collections import defaultdict

def featurize(d):
    s_features = defaultdict(int)
    for idx, token in enumerate(d):
        if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags 
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POS
            # FOLLOWING_POSTAG
            next_idx = idx + 1;
            if next_idx < len(d):
                s_features[f'{FeatureName.FOLLOWING_POS.name}_{d[next_idx].pos_}'] += 1
                s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{d[next_idx].tag_}'] += 1
            # PARENT_DEP
            # PARENT_POS
            # PARENT_POSTAG
            '''
            "Because the syntactic relations form a tree, every word has exactly one head.
            You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
            https://spacy.io/docs/usage/dependency-parse#navigating
            '''
            if (token.head is not token):
                s_features[f'{FeatureName.PARENT_DEP.name}_{token.head.dep_.upper()}'] += 1
                s_features[f'{FeatureName.PARENT_POS.name}_{token.head.pos_}'] += 1
                s_features[f'{FeatureName.PARENT_POSTAG.name}_{token.head.tag_}'] += 1
            # CHILD_DEP
            # CHILD_POS
            # CHILD_POSTAG
            for child in token.children:
                s_features[f'{FeatureName.CHILD_DEP.name}_{child.dep_.upper()}'] += 1
                s_features[f'{FeatureName.CHILD_POS.name}_{child.pos_}'] += 1
                s_features[f'{FeatureName.CHILD_POSTAG.name}_{child.tag_}'] += 1
    return dict(s_features)

In [15]:

featuresets = [(doc[0], (featurize(doc[0]), doc[1])) for doc in docs]

In [16]:

from statistics import mean, median, mode, stdev
f_lengths = [len(fs[1][0]) for fs in featuresets]

print('Stats on number of features per example:')
print(f'mean: {mean(f_lengths)}')
print(f'stdev: {stdev(f_lengths)}')
print(f'median: {median(f_lengths)}')
print(f'mode: {mode(f_lengths)}')
print(f'max: {max(f_lengths)}')
print(f'min: {min(f_lengths)}')

Stats on number of features per example:
mean: 24.562363715656346
stdev: 14.281121014936279
median: 24.0
mode: 0
max: 73
min: 0

In [17]:

featuresets[:2]

Out[17]:

[(Toss cherry tomatoes with 1 Tbsp. oil in a small bowl; season with salt.,
  ({}, 'pos')),
 (Transfer to a platter and top with remaining 1/2 tsp. lime zest; season with salt and pepper.,
  ({'CHILD_DEP_PREP': 2,
    'CHILD_DEP_PUNCT': 1,
    'CHILD_POSTAG_.': 1,
    'CHILD_POSTAG_IN': 2,
    'CHILD_POS_ADP': 2,
    'CHILD_POS_PUNCT': 1,
    'FOLLOWING_POSTAG_IN': 1,
    'FOLLOWING_POS_ADP': 1,
    'PARENT_DEP_ROOT': 1,
    'PARENT_POSTAG_VB': 1,
    'PARENT_POS_VERB': 1,
    'VERB': 1},
   'pos'))]

On one run, the above line printed the following featureset: (Gather foil loosely on top and bake for 1 1/2 hours., ({}, 'pos'))

This is because the Spacy.io POS tagger provided this: Gather/NNP foil/NN loosely/RB on/IN top/NN and/CC bake/NN for/IN 1 1/2 hours./NNS

...with no VERBs tagged, which is incorrect.

"Voting - POS taggers and classifiers" in the Next Steps/Improvements section below is meant to improve on this.

Compare to Stanford CoreNLP POS tagger: Gather/VB foil/NN loosely/RB on/IN top/JJ and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.

And Stanford Parser: Gather/NNP foil/VB loosely/RB on/IN top/NN and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.

Classification¶

In [18]:

random.shuffle(featuresets)

num_classes = 2
split_num = round(num_per_class*num_classes / 5)

# train and test sets
testing_set = [fs[1] for i, fs in enumerate(featuresets[:split_num])]
training_set =  [fs[1] for i, fs in enumerate(featuresets[split_num:])]

print(f'# training samples: {len(training_set)}')
print(f'# test samples: {len(testing_set)}')

# training samples: 3669
# test samples: 917

In [19]:

# decoupling the functionality of nltk.classify.accuracy
def predict(classifier, gold, prob=True):
    if (prob is True):
        predictions = classifier.prob_classify_many([fs for (fs, ll) in gold])
    else:
        predictions = classifier.classify_many([fs for (fs, ll) in gold])
    return list(zip(predictions, [ll for (fs, ll) in gold]))

def accuracy(predicts, prob=True):
    if (prob is True):
        correct = [label == prediction.max() for (prediction, label) in predicts]
    else:
        correct = [label == prediction for (prediction, label) in predicts]
        
    if correct:
        return sum(correct) / len(correct)
    else:
        return 0

Note below the use of DummyClassifier to provide a simple sanity check, a baseline of random predictions. stratified means it "generates random predictions by respecting the training set class distribution." (http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)

More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc…

If a classifier can beat the DummyClassifier, it is at least learning something valuable! How valuable is another question...

In [20]:

from nltk import NaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

dummy = SklearnClassifier(DummyClassifier(strategy='stratified', random_state=0))
dummy.train(training_set)
dummy_predict = predict(dummy, testing_set)
dummy_accuracy = accuracy(dummy_predict)
print("Dummy classifier accuracy percent:", dummy_accuracy*100)

nb = NaiveBayesClassifier.train(training_set)
nb_predict = predict(nb, testing_set)
nb_accuracy = accuracy(nb_predict)
print("NaiveBayes classifier accuracy percent:", nb_accuracy*100)

multinomial_nb = SklearnClassifier(MultinomialNB())
multinomial_nb.train(training_set)
mnb_predict = predict(multinomial_nb, testing_set)
mnb_accuracy = accuracy(mnb_predict)
print("MultinomialNB classifier accuracy percent:", mnb_accuracy*100)

bernoulli_nb = SklearnClassifier(BernoulliNB())
bernoulli_nb.train(training_set)
bnb_predict = predict(bernoulli_nb, testing_set)
bnb_accuracy = accuracy(bnb_predict)
print("BernoulliNB classifier accuracy percent:", bnb_accuracy*100)

# ??logistic_regression._clf
#   sklearn.svm.LinearSVC : learns SVM models using the same algorithm.
logistic_regression = SklearnClassifier(LogisticRegressionCV())
logistic_regression.train(training_set)
lr_predict = predict(logistic_regression, testing_set)
lr_accuracy = accuracy(lr_predict)
print("LogisticRegressionCV classifier accuracy percent:", lr_accuracy*100)

# ??sgd._clf
#    The 'log' loss gives logistic regression, a probabilistic classifier.
# ??linear_svc._clf
#   can optimize the same cost function as LinearSVC
#   by adjusting the penalty and loss parameters. In addition it requires
#   less memory, allows incremental (online) learning, and implements
#   various loss functions and regularization regimes.
sgd = SklearnClassifier(SGDClassifier(loss='log'))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print("SGD classifier accuracy percent:", sgd_accuracy*100)

# slow
# using libsvm with kernel 'rbf' (radial basis function)
svc = SklearnClassifier(SVC(probability=True))
svc.train(training_set)
svc_predict = predict(svc, testing_set)
svc_accuracy = accuracy(svc_predict)
print("SVC classifier accuracy percent:", svc_accuracy*100)

# ??linear_svc._clf
#    Similar to SVC with parameter kernel='linear', but implemented in terms of
#    liblinear rather than libsvm, so it has more flexibility in the choice of
#    penalties and loss functions and should scale better to large numbers of
#    samples.
#    Prefer dual=False when n_samples > n_features.
# Using CalibratedClassifierCV as wrapper to get predict probabilities (https://stackoverflow.com/a/39712590)
linear_svc = SklearnClassifier(CalibratedClassifierCV(LinearSVC(dual=False)))
linear_svc.train(training_set)
linear_svc_predict = predict(linear_svc, testing_set)
linear_svc_accuracy = accuracy(linear_svc_predict)
print("LinearSVC classifier accuracy percent:", linear_svc_accuracy*100)

# slower
dt = DecisionTreeClassifier.train(training_set)
dt_predict = predict(dt, testing_set, False)
dt_accuracy = accuracy(dt_predict, False)
print("DecisionTree classifier accuracy percent:", dt_accuracy*100)

random_forest = SklearnClassifier(RandomForestClassifier(n_estimators = 100))
random_forest.train(training_set)
rf_predict = predict(random_forest, testing_set)
rf_accuracy = accuracy(rf_predict)
print("RandomForest classifier accuracy percent:", rf_accuracy*100)

Dummy classifier accuracy percent: 49.836423118865866
NaiveBayes classifier accuracy percent: 68.92039258451473
MultinomialNB classifier accuracy percent: 79.17121046892039
BernoulliNB classifier accuracy percent: 78.08069792802618
LogisticRegressionCV classifier accuracy percent: 83.31515812431843
SGD classifier accuracy percent: 80.91603053435115
SVC classifier accuracy percent: 82.11559432933478
LinearSVC classifier accuracy percent: 83.20610687022901
DecisionTree classifier accuracy percent: 77.42639040348965
RandomForest classifier accuracy percent: 83.53326063249727

SGD: Multiple Epochs¶

sgd classifiers improves with epochs. ??sgd._clf tells us that the default number of epochs n_iter is 5. So let's run more epochs. Also not that the training_set shuffle is True by default.

In [21]:

num_epochs = 1000
sgd = SklearnClassifier(SGDClassifier(loss='log', n_iter=num_epochs))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print(f"SGDClassifier classifier accuracy percent (epochs: {num_epochs}):", sgd_accuracy*100)

SGDClassifier classifier accuracy percent (epochs: 1000): 83.96946564885496

Fortunately, 1000 epochs run very quickly! And SGDClassifier performance has improved with more iterations.

Also note that we can set warm_start to True if we want to take advantage of online learning and reuse the solution of the previous call.

GridSearch and Cross-Validation¶

Next we perform 1) grid search to find optimal hyperparameters, and 2) cross-validation to evaluate performance over multiple folds of the data (to avoid overfitting).

http://scikit-learn.org/stable/modules/grid_search.html#grid-search

http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

In [22]:

from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

In [23]:

# https://stackoverflow.com/a/16388804
from sklearn.model_selection import KFold
from sklearn.base import clone
from numpy import zeros

def cross_val(name, model, debug=True):
    num_splits = 3
    original_clf = clone(model._clf)
    cvidx = KFold(n_splits=num_splits, shuffle=True).split(training_set)
    
    nested_acc = zeros(num_splits)
    i=0
    for trainidx, testidx in cvidx:
        model._clf = clone(original_clf)  # we clone the estimator to make sure that all the folds are independent
        classifier = model.train(training_set[trainidx[0]:trainidx[len(trainidx)-1]])
        pred = predict(classifier, training_set[testidx[0]:testidx[len(testidx)-1]])
        nested_acc[i] = accuracy(pred)
        i += 1
    
    if debug == True:
        print(f"{name} CV accuracies:", nested_acc)
    
    return nested_acc.mean()

In [24]:

# http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Set the parameters by cross-validation
model_parameters = [{'LinearSVC': [{
                        'loss': ['hinge'],
                        'dual': [True],
                        'penalty': ['l2'],
                        'tol': [1e-3, 1e-4, 1e-5],
                        'max_iter': [1000, 10000],
                        'C': [100.0, 1.0, 0.01]
                    },
                    {
                        'loss': ['squared_hinge'],
                        'dual': [False, True],
                        'penalty': ['l2'],
                        'tol': [1e-3, 1e-4, 1e-5],
                        'max_iter': [1000, 10000],
                        'C': [100.0, 1.0, 0.01]
                    }]},
                   {'LogisticRegression': [{
                        'penalty': ['l1'],
                        'dual': [False],
                        'C': [100.0, 1.0, 0.01],
                        'solver': ['liblinear']
                   },
                   {
                        'penalty': ['l2'],
                        'dual': [False, True],
                        'C': [100.0, 1.0, 0.01],
                        'max_iter': [100, 1000],
                        'solver': ['liblinear'],
                        'tol': [1e-3, 1e-4, 1e-5]
                   },
                   {
                        'penalty': ['l2'],
                        'dual': [False],
                        'C': [100.0, 1.0, 0.01],
                        'max_iter': [100, 1000],
                        'solver': ['newton-cg', 'lbfgs', 'sag'],
                        'tol': [1e-3, 1e-4, 1e-5]
                   }]},
                   {'SGD': [{
                        'penalty': ['l1', 'l2', 'elasticnet'],
                        'alpha': [1e-3, 1e-4, 1e-5],
                        'average': [True, False],
                        'n_iter': [100, 1000, 10000]
                   }]},
                   {'RandomForest': [{
                        'n_estimators': [10, 100, 1000],
                        'criterion': ['gini', 'entropy'],
                        'max_features': ['auto', 'log2', None],
                        'oob_score': [True, False]
                   }]}]

score = 'roc_auc'

for i, model_param in enumerate(model_parameters):
    model = [key for i, key in enumerate(model_param)][0]
    
    print(f"# {model}: Tuning hyper-parameters for {score}")
    print()
    
    if model == 'LinearSVC': 
        clf = LinearSVC()
    elif model == 'LogisticRegression':
        clf = LogisticRegression()
    elif model == 'SGD':
        clf = SGDClassifier(loss='log')
    elif model == 'RandomForest':
        clf = RandomForestClassifier()
    else:
        raise Exception('%s model needs to be added to the if-block' % model)

    grid = SklearnClassifier(GridSearchCV(clf, model_param[model], cv=5,
                       scoring=score, n_jobs=-1))
    grid.train(training_set)

    print("Best parameters set found on development set:")
    print()
    print(grid._clf.best_params_)
    mean = grid._clf.cv_results_['mean_test_score'][grid._clf.best_index_]
    std = grid._clf.cv_results_['std_test_score'][grid._clf.best_index_]
    print("roc_auc: %0.3f (+/-%0.03f)" % (mean, std * 2))
    print()
    
    if model == 'LinearSVC':
        # Wrapping LinearSVC in CalibratedClassifierCV to add support for probability prediction
        # Note that there is a difference in accuracies between raw GridSearchCV and calibrated GridSearchCV
        # However, I'm willing to sacrifice the potential 'best' result from raw in order to output probabilities
        grid_calibrated = SklearnClassifier(CalibratedClassifierCV(grid._clf.best_estimator_, cv=None))
        grid_calibrated.train(training_set)
        gridc_predict = predict(grid_calibrated, testing_set)
        gridc_accuracy = accuracy(gridc_predict)
        print(f"{model} (calibrated) classifier accuracy percent:", gridc_accuracy*100)
        
        grid_predict = predict(grid, testing_set, False)
        grid_accuracy = accuracy(grid_predict, False)
        print(f"{model} (raw) classifier accuracy percent:", grid_accuracy*100)
        
        # CV after parameter optimization
        cv_acc = cross_val(model, grid_calibrated)
        print(f"{model} (calibrated) CV classifier avg accuracy percent:", cv_acc*100)
        
        linear_svc_opt = grid_calibrated
        linear_svc_predict = gridc_predict
    else:
        grid_predict = predict(grid, testing_set)
        grid_accuracy = accuracy(grid_predict)
        print(f"{model} classifier accuracy percent:", grid_accuracy*100)
        
        # CV after parameter optimization
        cv_acc = cross_val(model, grid)
        print(f"{model} CV classifier avg accuracy percent:", cv_acc*100)
        
        if model == 'LogisticRegression':
            logistic_regression_opt = grid
            lr_predict = grid_predict
        elif model == 'SGD':
            sgd_opt = grid
            sgd_predict = grid_predict
        elif model == 'RandomForest':
            random_forest_opt = grid
            rf_predict = grid_predict
        else:
            raise Exception('%s model was not run through Grid Search' % model)
    
    print()

# LinearSVC: Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'C': 0.01, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000, 'penalty': 'l2', 'tol': 0.001}
roc_auc: 0.934 (+/-0.019)

LinearSVC (calibrated) classifier accuracy percent: 83.20610687022901
LinearSVC (raw) classifier accuracy percent: 82.87895310796074
LinearSVC CV accuracies: [ 0.85550082  0.85566166  0.85496183]
LinearSVC (calibrated) CV classifier avg accuracy percent: 85.5374772491

# LogisticRegression: Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'C': 1.0, 'dual': False, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.001}
roc_auc: 0.937 (+/-0.019)

LogisticRegression classifier accuracy percent: 83.53326063249727

C:\Users\narho_000\Anaconda3\lib\site-packages\sklearn\linear_model\sag.py:286: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)

LogisticRegression CV accuracies: [ 0.86619334  0.86637578  0.86630286]
LogisticRegression CV classifier avg accuracy percent: 86.6290661978

# SGD: Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'alpha': 0.0001, 'average': True, 'n_iter': 100, 'penalty': 'l2'}
roc_auc: 0.937 (+/-0.018)

SGD classifier accuracy percent: 82.76990185387132
SGD CV accuracies: [ 0.86724939  0.86743044  0.8672301 ]
SGD CV classifier avg accuracy percent: 86.7303308486

# RandomForest: Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 1000, 'oob_score': True}
roc_auc: 0.936 (+/-0.016)

RandomForest classifier accuracy percent: 83.86041439476554
RandomForest CV accuracies: [ 0.94050218  0.94011485  0.94055086]
RandomForest CV classifier avg accuracy percent: 94.0389296885

VotingClassifier¶

We're going to create an ensemble classifier by letting our top-performing classifiers, which consistently perform with >80% accuracy — LogisticRegression, LinearSVC, SGD, and RandomForest (excluding SVC due to its slowness) — vote on each prediction.

In [25]:

from sklearn.ensemble import VotingClassifier

voting = SklearnClassifier(VotingClassifier(estimators=[
    ('lr', logistic_regression_opt._clf),
    ('linear_svc', linear_svc_opt._clf),
    ('sgd', sgd_opt._clf),
    ('rf', random_forest_opt._clf)
], voting='soft', weights=[1,1,1,3], n_jobs=-1))
voting.train(training_set)

voting_predict = predict(voting, testing_set)
voting_accuracy = accuracy(voting_predict)
print("Soft voting classifier accuracy percent:", voting_accuracy*100)

# CV after parameter optimization
voting_acc = cross_val("Soft voting", voting)
print(f"Soft voting CV classifier accuracy percent:", voting_acc*100)

Soft voting classifier accuracy percent: 83.96946564885496
Soft voting CV accuracies: [ 0.90834697  0.90854491  0.90673575]
Soft voting CV classifier accuracy percent: 90.7875877339

Analysis¶

Similarly to the voting model, we're also going to scope analysis down to our top-performing classifiers. We'll include the Voting model itself, and then Dummy as a baseline.

Most Informative Features¶

In [26]:

# https://stackoverflow.com/a/11140887
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:round(n/2)], coefs_with_fns[:-(round(n/2) + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In [27]:

print('SGD')
show_most_informative_features(sgd._vectorizer, sgd._clf, 15)
print()
print('Logistic Regression')
show_most_informative_features(logistic_regression._vectorizer, logistic_regression._clf, 15)

SGD
	-3.3820	FOLLOWING_POSTAG_WRB		2.9712	CHILD_POSTAG_-LRB-
	-2.1901	CHILD_DEP_AGENT		2.5766	FOLLOWING_POSTAG_JJS
	-2.0411	CHILD_DEP_DET  		2.1984	FOLLOWING_POSTAG_VBZ
	-2.0337	FOLLOWING_POSTAG_JJR		1.8884	CHILD_DEP_NPADVMOD
	-2.0008	CHILD_DEP_NEG  		1.8787	PARENT_DEP_PARATAXIS
	-1.7942	CHILD_DEP_RELCL		1.7963	CHILD_DEP_APPOS
	-1.7393	FOLLOWING_POSTAG_VB		1.7699	FOLLOWING_POSTAG_RB
	-1.7151	CHILD_POSTAG_HYPH		1.7124	CHILD_POSTAG_PDT

Logistic Regression
	-1.4486	CHILD_DEP_NEG  		1.4545	CHILD_POSTAG_-LRB-
	-1.4322	CHILD_DEP_AGENT		1.3721	CHILD_DEP_NPADVMOD
	-1.0650	PARENT_POSTAG_VBZ		1.1594	PARENT_POSTAG_VB
	-1.0267	CHILD_POSTAG_WDT		1.0527	CHILD_POSTAG_-RRB-
	-1.0153	CHILD_DEP_NSUBJ		1.0491	CHILD_DEP_MARK 
	-0.9563	PARENT_DEP_XCOMP		0.9183	CHILD_POSTAG_VB
	-0.9126	CHILD_DEP_DET  		0.9078	CHILD_POS_PROPN
	-0.8719	CHILD_DEP_AUX  		0.9077	CHILD_POSTAG_NNP

Note: Because CalibratedClassifierCV has no attribute coef_, we cannot show the most informative features for LinearSVC while it's wrapped. Random Forest and Voting also lack coef_.

In [28]:

spacy.explain("JJS")

Out[28]:

'adjective, superlative'

Negative coefficients:

VERB parents AGENT: "used for agents of passive verbs" - interpreting this to mean that existence of passive verbs (i.e., the opposite of active verbs) means negative correlation with it being imperative
VERB followed by a WRB: "wh-adverb" (where, when)
VERB is a child of AMOD: "any adjective or adjectival phrase that serves to modify the meaning" of the verb

Positive coefficients:

VERB parents a -RRB-: "right round bracket"
VERB is a child of PROPN: "proper noun"
VERB is a child of NNP: "noun, proper singular"

Scikit Learn metrics: Confusion matrix, Classification report, F1 score, Log loss¶

http://scikit-learn.org/stable/modules/model_evaluation.html

In [29]:

from sklearn import metrics

def classification_report(predict, prob=True):
    predictions, labels = zip(*predict)
    if prob is True:
        return metrics.classification_report(labels, [p.max() for p in predictions])
    else:
        return metrics.classification_report(labels, predictions)

def confusion_matrix(predict, prob=True, print_layout=False):
    predictions, labels = zip(*predict)
    if print_layout is True:
        print('Layout\n[[tn   fp]\n [fn   tp]]\n')
    if prob is True:
        return metrics.confusion_matrix(labels, [p.max() for p in predictions])
    else:
        return metrics.confusion_matrix(labels, predictions)
    
def log_loss(predict):
    predictions, labels = zip(*predict)
    return metrics.log_loss(labels, [p.prob('pos') for p in predictions])

def roc_auc_score(predict):
    predictions, labels = zip(*predict)
    # need to convert labels to binary classification of 0 or 1
    return metrics.roc_auc_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions], average='weighted')

def precision_recall_curve(predict):
    predictions, labels = zip(*predict)
    return metrics.precision_recall_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')

def average_precision_score(predict):
    predictions, labels = zip(*predict)
    return metrics.average_precision_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions])

def roc_curve(predict):
    predictions, labels = zip(*predict)
    return metrics.roc_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')

In [30]:

print('SGD')
print(classification_report(sgd_predict))
print()
print('Logistic Regression')
print(classification_report(lr_predict))
print()
print('LinearSVC')
print(classification_report(linear_svc_predict))
print()
print('Random Forest')
print(classification_report(rf_predict))

SGD
             precision    recall  f1-score   support

        neg       0.87      0.79      0.83       480
        pos       0.79      0.87      0.83       437

avg / total       0.83      0.83      0.83       917


Logistic Regression
             precision    recall  f1-score   support

        neg       0.88      0.79      0.83       480
        pos       0.79      0.89      0.84       437

avg / total       0.84      0.84      0.84       917


LinearSVC
             precision    recall  f1-score   support

        neg       0.88      0.78      0.83       480
        pos       0.79      0.89      0.83       437

avg / total       0.84      0.83      0.83       917


Random Forest
             precision    recall  f1-score   support

        neg       0.88      0.80      0.84       480
        pos       0.80      0.89      0.84       437

avg / total       0.84      0.84      0.84       917

In [31]:

print('Voting')
print(classification_report(voting_predict))

Voting
             precision    recall  f1-score   support

        neg       0.88      0.80      0.84       480
        pos       0.80      0.88      0.84       437

avg / total       0.84      0.84      0.84       917

In [32]:

print('Layout\n[[tn   fp]\n [fn   tp]]\n')

print('SGD')
print(confusion_matrix(sgd_predict))
print()
print('Logistic Regression')
print(confusion_matrix(lr_predict))
print()
print('LinearSVC')
print(confusion_matrix(linear_svc_predict))
print()
print('Random Forest')
print(confusion_matrix(rf_predict))

Layout
[[tn   fp]
 [fn   tp]]

SGD
[[379 101]
 [ 57 380]]

Logistic Regression
[[379 101]
 [ 50 387]]

LinearSVC
[[375 105]
 [ 49 388]]

Random Forest
[[382  98]
 [ 50 387]]

In [33]:

print('Voting')
print(confusion_matrix(voting_predict))

Voting
[[384  96]
 [ 51 386]]

The lower the better for log_loss...

In [34]:

print(f'SGD: {log_loss(sgd_predict)}')
print(f'Logistic Regression: {log_loss(lr_predict)}')
print(f'LinearSVC: {log_loss(linear_svc_predict)}')
print(f'Random Forest: {log_loss(rf_predict)}')

SGD: 0.5408885923854805
Logistic Regression: 0.3269831271366439
LinearSVC: 0.34151481635171227
Random Forest: 0.3557548099256462

In [35]:

print(f'Voting: {log_loss(voting_predict)}')

Voting: 0.3058827183576675

The higher the better for roc_auc_score...

In [36]:

print(f'SGD: {roc_auc_score(sgd_predict)}')
print(f'Logistic Regression: {roc_auc_score(lr_predict)}')
print(f'LinearSVC: {roc_auc_score(linear_svc_predict)}')
print(f'Random Forest: {roc_auc_score(rf_predict)}')

SGD: 0.9333333333333333
Logistic Regression: 0.9354405034324942
LinearSVC: 0.9284610983981694
Random Forest: 0.9428751906941265

In [37]:

print(f'Voting: {roc_auc_score(voting_predict)}')

Voting: 0.9454185736079329

Performance on sample tasks¶

In [38]:

sample_tasks = ["Mow lawn", "Mow the lawn", "Buy new shoes", "Feed the dog", "Send report to Kyle", "Send the report to Kyle", "Peel the potatoes"]
features = [featurize(nlp(task)) for task in sample_tasks]

tasks_dummy = [(l, p.prob('pos')*1.0) for l, p in zip(dummy.classify_many(features), dummy.prob_classify_many(features))]
tasks_logistic = [(l, p.prob('pos')) for l,p in zip(logistic_regression_opt.classify_many(features), logistic_regression_opt.prob_classify_many(features))]
tasks_linear_svc = [(l, p.prob('pos')) for l,p in zip(linear_svc_opt.classify_many(features), linear_svc_opt.prob_classify_many(features))]
tasks_sgd = [(l, p.prob('pos')) for l,p in zip(sgd_opt.classify_many(features), sgd_opt.prob_classify_many(features))]
tasks_rf = [(l, p.prob('pos')) for l,p in zip(random_forest_opt.classify_many(features), random_forest_opt.prob_classify_many(features))]
tasks_voting = [(l, p.prob('pos')) for l,p in zip(voting.classify_many(features), voting.prob_classify_many(features))]

print(f'Dummy: {tasks_dummy}')
print(f'LogisticRegression: {tasks_logistic}')
print(f'LinearSVC: {tasks_linear_svc}')
print(f'SGD: {tasks_sgd}')
print(f'Random Forest: {tasks_rf}')
print()
print(f'Voting: {tasks_voting}')

Dummy: [('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('pos', 1.0), ('neg', 0.0), ('pos', 1.0)]
LogisticRegression: [('pos', 0.5177528042129399), ('pos', 0.5177528042129399), ('pos', 0.95041384668781259), ('pos', 0.9042204368988711), ('pos', 0.77613792093402312), ('pos', 0.74783727994957849), ('pos', 0.90181743756911736)]
LinearSVC: [('pos', 0.56492618977808062), ('pos', 0.56492618977808062), ('pos', 0.92860840939215505), ('pos', 0.8805809957950842), ('pos', 0.82874640690443713), ('pos', 0.83449212505086656), ('pos', 0.8889786555784075)]
SGD: [('pos', 0.53547937378473365), ('pos', 0.53547937378473365), ('pos', 0.99992355013586687), ('pos', 0.99902040085679145), ('pos', 0.99105878886231658), ('pos', 0.98526269258531807), ('pos', 0.99895146721659678)]
Random Forest: [('pos', 0.50561516177492483), ('pos', 0.50561516177492483), ('pos', 0.96279172244353028), ('pos', 0.97799999999999998), ('pos', 0.95699999999999996), ('pos', 0.94999999999999996), ('pos', 0.96999999999999997)]

Voting: [('pos', 0.52202910464797891), ('pos', 0.52202910464797891), ('pos', 0.95976101963659921), ('pos', 0.95395871789375353), ('pos', 0.9032303318255851), ('pos', 0.89501175259253529), ('pos', 0.95289944311959662)]

Voting Model: Curves¶

In [39]:

import matplotlib.pyplot as plt

precision, recall, prc_thresholds = precision_recall_curve(voting_predict)
average_precision = average_precision_score(voting_predict)

plt.figure()
plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='b')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
          average_precision))
plt.show()

In [40]:

fpr, tpr, roc_thresholds = roc_curve(voting_predict)
area = roc_auc_score(voting_predict)

plt.figure()
plt.step(fpr, tpr, color='b', alpha=0.2,
         where='post')
plt.fill_between(fpr, tpr, step='post', alpha=0.2,
                 color='b')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Receiving Operating Characteristic: area={0:0.2f}'.format(
          area))
plt.show()

Considered a bad idea to actually adjust predictions based on optimal Threshold from holdout test data curves - it's a form of overfitting on the test set: https://stackoverflow.com/questions/32627926/scikit-changing-the-threshold-to-create-multiple-confusion-matrixes (although using ROC to do this might be ok? or on cross-validated training data? https://stackoverflow.com/a/35300649)

Pickling the Voting Model¶

In [41]:

import pickle

print ("Exporting the voting model to model.v2.pkl")
with open('model.v2.pkl', 'wb') as f:
    pickle.dump(voting, f)

Exporting the voting model to model.v2.pkl

In [42]:

# load the model back into memory
print("Importing the model from model.v2.pkl")
with open('model.v2.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

# predict on a new sample
task_new = 'Buy ice cream'
print ('New sample: {}'.format(task_new))

# score on the new sample
features = featurize(nlp(task_new));
predict = [(l, p.prob('pos')) for l,p in zip(loaded_clf.classify_many(features), loaded_clf.prob_classify_many(features))]
print('Predicted class is {}'.format(predict[0]))

Importing the model from model.v2.pkl
New sample: Buy ice cream
Predicted class is ('pos', 0.96540602930541619)

Next Steps and Improvements¶

Training set may be too specific/not relevant enough (recipe instructions for positive dataset, recipe descriptions+short movie reviews for negative dataset)
Throwing features into a blender - need to understand value of each
- What feature "classes" tend to perform the best/worst?
- PCA: Reducing dimensionality using most informative feature information
Phrase vectorizations of all 0s - how problematic is this?
Varying feature vector lengths - does this matter?
Voting - POS taggers
- SciKit Learn: Ensembles
- Kaggle Ensembling Guide
Combining verb phrases
Look at examples from different quadrants of the confusion matrix - is there something we can learn?
- Same idea with the classification report

Things abandoned¶

NLTK¶

I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the Stanford CoreNLP toolkit and its associated software to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.

I might return this way if I want to improve results/implement a voter system between the various linguistic and classification methods later.

In [ ]:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [ ]:

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Tokenization¶

In [ ]:

sentences = [s for l in lines for s in sent_tokenize(l)] # punkt
sentences

In [ ]:

tagged_sentences = []
for s in sentences:
    words = word_tokenize(s)
    tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
    tagged_sentences.append(tagged)
print(tagged_sentences)

Note: POS accuracy¶

Run down to the shop, will you, Peter is parsed unexpectedly by nltk.pos_tag:

[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]

Run is tagged as a NNP (proper noun, singular)

I expected an output more like what the Stanford Parser provides:

Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP

Run is tagged as a VGB (verb, gerund/present participle) - still not quite the VB I want, but at least it's a V*

MEANWHILE...

nltk.pos_tag did better with:

[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]

Compared to Stanford CoreNLP (note that this is different than what Stanford Parser outputs):

(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))

Concern: clean as VB (verb, base form) vs JJ (adjective)

IMPROVE POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.

Note what Spacy POS tagger did with Run down to the shop, will you Peter:

Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP

where `Run` is the `VB` I expected from POS tagging (compared to `nltk.pos_tag` result of `NNP`). Also note that Spacy collapses `the shop` into a single unit, which should be helpful during featurization.

Featurization¶

In [ ]:

import re
from collections import defaultdict

featuresets = []
for ts in tagged_sentences:
    s_features = defaultdict(int)
    for idx, tup in enumerate(ts):
        #print(tup)
        pos = tup[1]
        # FeatureName.VERB
        is_verb = re.match(r'VB.?', pos) is not None
        print(tup, is_verb)
        if is_verb:
            s_features[FeatureName.VERB] += 1
            # FOLLOWING_POS
            next_idx = idx + 1;
            if next_idx < len(ts):
                s_features[f'{FeatureName.FOLLOWING}_{ts[next_idx][1]}'] += 1
            # VERB_MODIFIER
            # VERB_MODIFYING
        else:
            s_features[FeatureName.VERB] = 0
    featuresets.append(dict(s_features))

print()
print(featuresets)

Stanford NLP ¶

Setup guide used: https://stackoverflow.com/a/34112695

In [ ]:

# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip

In [ ]:

from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer

NLP for Task Classification¶

Some Terminology¶

Proposed Features¶

Data and Setup¶

Building a recipe corpus¶

spaCy.io for NLP¶

Using the Spacy Data Model for NLP¶

NLP output¶

Featurization¶

Classification¶

SGD: Multiple Epochs¶

GridSearch and Cross-Validation¶

VotingClassifier¶

Analysis¶

Most Informative Features¶

Scikit Learn metrics: Confusion matrix, Classification report, F1 score, Log loss¶

Performance on sample tasks¶

Voting Model: Curves¶

Pickling the Voting Model¶

Next Steps and Improvements¶

Things abandoned¶

NLTK¶

Tokenization¶

Note: POS accuracy¶

Featurization¶

Stanford NLP¶

Stanford NLP ¶