__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
The focus of this notebook is building feature representations for use with (mostly linear) classifiers (though you're encouraged to try out some non-linear ones as well!).
The core characteristics of the feature functions we'll build here:
These classifiers tend to be highly competitive. We'll look at more powerful deep learning models in the next notebook, and it will immediately become apparent that it is very difficult to get them to measure up to well-built classifiers based in sparse feature representations.
See the previous notebook for set-up instructions.
from collections import Counter
from sklearn.linear_model import LogisticRegression
import scipy.stats
from sgd_classifier import BasicSGDClassifier
from tf_shallow_neural_classifier import TfShallowNeuralClassifier
import sst
import utils
Feature representation is arguably the most important step in any machine learning task. As you experiment with the SST, you'll come to appreciate this fact, since your choice of feature function will have a far greater impact on the effectiveness of your models than any other choice you make.
We will define our feature functions to return dicts mapping feature names (which can be any object that can be a dict key) to their values (which must be bool, int, or float).
To prepare for optimization, we will use sklearn's DictVectorizer class to turn these into matrices of features.
The dict-based approach gives us a lot of flexibility and frees us from having to worry about the underlying feature matrix.
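To see what this conversion looks like, here is a minimal sketch that runs DictVectorizer over two toy feature dicts. The example sentences are invented for illustration; they are not SST data.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# Two toy "examples", each featurized as a dict from feature names to counts.
feat_dicts = [
    Counter("a very good movie".split()),
    Counter("a very very bad movie".split()),
]

# sparse=False returns a dense array, which is easier to inspect here.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(feat_dicts)

# One column per feature name, in sorted order; one row per example.
print(vectorizer.feature_names_)
print(X)
```

Each row of X is one example, and features that an example lacks simply get the value 0.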
A typical baseline or default feature representation in NLP or NLU is built from unigrams. Here, those are the leaf nodes of the tree:
def unigrams_phi(tree):
    """The basis for a unigrams feature function.

    Parameters
    ----------
    tree : nltk.tree
        The tree to represent.

    Returns
    -------
    Counter
        A map from strings to their counts in `tree`. (Counter maps a
        list to a dict of counts of the elements in that list.)

    """
    return Counter(tree.leaves())
In the docstring for sst.sentiment_treebank_reader, I pointed out that the labels on the subtrees can be used in a way that feels like cheating. Here's the most dramatic instance of this: root_daughter_scores_phi uses just the labels on the daughters of the root to predict the root label. This will result in performance well north of 90% F1, but that's hardly worth reporting. (Interestingly, using the labels on the leaf nodes is much less powerful.) Anyway, don't use this function!
def root_daughter_scores_phi(tree):
    """The best way we've found to cheat without literally using the
    labels as part of the feature representations.

    Don't use this for any real experiments!

    """
    return Counter([child.label() for child in tree])
It's generally good design to write lots of atomic feature functions and then bring them together into a single function when running experiments. This will lead to reusable parts that you can assess independently and in sub-groups as part of development.
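Here is one way that design might look in practice. This is only a sketch: ToyTree is a hypothetical stand-in for an nltk tree (it provides nothing but leaves()), and bigrams_phi and combine_phis are illustrative helpers, not part of the course code.

```python
from collections import Counter


class ToyTree:
    """Hypothetical stand-in for an nltk tree: only `leaves()` is provided."""
    def __init__(self, tokens):
        self._tokens = list(tokens)

    def leaves(self):
        return self._tokens


def unigrams_phi(tree):
    return Counter(tree.leaves())


def bigrams_phi(tree):
    """Illustrative atomic feature function: counts of adjacent token pairs.
    Feature names are kept as strings so they sort alongside the unigrams."""
    toks = tree.leaves()
    return Counter("{} {}".format(w1, w2) for w1, w2 in zip(toks, toks[1:]))


def combine_phis(*phis):
    """Return a feature function that merges the outputs of several others."""
    def phi(tree):
        feats = Counter()
        for f in phis:
            feats.update(f(tree))  # Counter.update adds counts together
        return feats
    return phi


unigrams_bigrams_phi = combine_phis(unigrams_phi, bigrams_phi)
feats = unigrams_bigrams_phi(ToyTree(["not", "good", "not", "bad"]))
print(feats)
```

Because each atomic function stands alone, you can pass it to an experiment by itself or in any combination.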
The second major phase for our analysis is a kind of set-up phase. Ingredients:
* A reader like train_reader
* A feature function like unigrams_phi
* A class function like binary_class_func
The convenience function sst.build_dataset uses these to build a dataset for training and assessing a model. See its documentation for details on how it works. Much of this is about taking advantage of sklearn's many functions for model building.
train_dataset = sst.build_dataset(
    reader=sst.train_reader,
    phi=unigrams_phi,
    class_func=sst.binary_class_func,
    vectorizer=None)
print("Train dataset with unigram features has {:,} examples and {:,} features".format(
    *train_dataset['X'].shape))
Train dataset with unigram features has 6,920 examples and 16,282 features
Notice that sst.build_dataset has an optional argument vectorizer:
* If it is None, then a new vectorizer is used and returned as dataset['vectorizer']. This is the usual scenario when training.
* For evaluation, one wants to represent examples exactly as they were represented during training. To ensure that this happens, pass the training vectorizer to this function:
dev_dataset = sst.build_dataset(
    reader=sst.dev_reader,
    phi=unigrams_phi,
    class_func=sst.binary_class_func,
    vectorizer=train_dataset['vectorizer'])
print("Dev dataset with unigram features has {:,} examples and {:,} features".format(
    *dev_dataset['X'].shape))
Dev dataset with unigram features has 872 examples and 16,282 features
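The train and dev matrices have the same number of columns because the dev examples are mapped through the vectorizer fit on the training data. The toy sketch below shows that behavior with DictVectorizer directly. I am assuming that sst.build_dataset does something along these lines internally, and the feature dicts are invented.

```python
from sklearn.feature_extraction import DictVectorizer

train_feats = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}]
dev_feats = [{"good": 2, "unseen_word": 1}]

vectorizer = DictVectorizer(sparse=False)
# fit_transform learns the feature-to-column map from the training data.
X_train = vectorizer.fit_transform(train_feats)
# transform reuses that map; features never seen in training are ignored.
X_dev = vectorizer.transform(dev_feats)

print(X_train.shape, X_dev.shape)
print(X_dev)
```

Calling fit_transform on the dev data instead would create a different column layout, and the trained weights would no longer line up with the features.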
We're now in a position to begin training supervised models!
For the most part, in this course, we will not study the theoretical aspects of machine learning optimization, concentrating instead on how to optimize systems effectively in practice. That is, this isn't a theory course, but rather an experimental, project-oriented one.
Nonetheless, we do want to avoid treating our optimizers as black boxes that work their magic and give us some assessment figures for whatever we feed into them. That seems irresponsible from a scientific and engineering perspective, and it also sends the false signal that the optimization process is inherently mysterious. So we do want to take a minute to demystify it with some simple code.
The module sgd_classifier contains a complete optimization framework, as BasicSGDClassifier. Well, it's complete in the sense that it achieves our full task of supervised learning. It's incomplete in the sense that it is very basic. You probably wouldn't want to use it in experiments. Rather, we're going to encourage you to rely on sklearn for your experiments (see below). Still, this is a good basic picture of what's happening under the hood.
So what is BasicSGDClassifier doing? The heart of it is the fit function (reflecting the usual sklearn naming system). This method implements a hinge-loss stochastic sub-gradient descent optimization. Intuitively, it works as follows:
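1. Start by assuming that all the feature weights are 0.
2. Move through the dataset instance by instance, in random order.
3. For each instance, classify it using the current weights.
4. If the classification is incorrect, move the weights in the direction of the correct classification.
This process repeats for a user-specified number of iterations (default 10 below), and the weight movement is tempered by a learning-rate parameter eta (default 0.1). The output is a set of weights that can be used to make predictions about new (properly featurized) examples.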
In more technical terms, the objective function is
$$ \min_{\mathbf{w} \in \mathbb{R}^{d}} \sum_{(x,y)\in\mathcal{D}} \max_{y'\in\mathbf{Y}} \left[\mathbf{Score}_{\mathbf{w}, \phi}(x,y') + \mathbf{cost}(y,y')\right] - \mathbf{Score}_{\mathbf{w}, \phi}(x,y) $$
where $\mathbf{w}$ is the set of weights to be learned, $\mathcal{D}$ is the training set of example–label pairs, $\mathbf{Y}$ is the set of labels, $\mathbf{cost}(y,y') = 0$ if $y=y'$, else $1$, and $\mathbf{Score}_{\mathbf{w}, \phi}(x,y')$ is the inner product of the weights $\mathbf{w}$ and the example as featurized according to $\phi$.
The fit method is then calculating the sub-gradient of this objective. In succinct pseudo-code:
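As a runnable illustration, here is a NumPy sketch of that update for the objective above, with one weight vector per class. It is my own minimal version, not the code in sgd_classifier; the function names, the seed argument, and the toy data are assumptions made for the example.

```python
import numpy as np


def fit_hinge_sgd(X, y, max_iter=10, eta=0.1, seed=0):
    """Multiclass hinge-loss SGD sketch; returns (classes, weights)."""
    rng = np.random.RandomState(seed)
    classes = sorted(set(y))
    y_idx = np.array([classes.index(label) for label in y])
    W = np.zeros((len(classes), X.shape[1]))  # one weight vector per class
    for _ in range(max_iter):
        for i in rng.permutation(X.shape[0]):  # random order each iteration
            x, gold = X[i], y_idx[i]
            # Cost-augmented scores: Score(x, y') + cost(y, y'), where the
            # cost is 1 for every class except the gold one.
            scores = W.dot(x) + 1.0
            scores[gold] -= 1.0
            pred = int(np.argmax(scores))
            if pred != gold:
                # Sub-gradient step: toward the gold class and away from
                # the highest-scoring cost-augmented competitor.
                W[gold] += eta * x
                W[pred] -= eta * x
    return classes, W


def predict_hinge_sgd(classes, W, X):
    """Predict the class with the highest (unaugmented) score."""
    return [classes[i] for i in np.argmax(X.dot(W.T), axis=1)]


# Toy, trivially separable data (invented for illustration).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_toy = ["positive", "negative", "positive", "negative"]
classes, W = fit_hinge_sgd(X_toy, y_toy)
print(predict_hinge_sgd(classes, W, X_toy))
```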
This is very intuitive: push the weights in the direction of the positive cases. It doesn't require any probability theory. And such loss functions have proven highly effective in many settings. For a more powerful version of this classifier, see sklearn.linear_model.SGDClassifier. With loss='hinge', it should behave much like BasicSGDClassifier (but faster!).
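If you want to try the sklearn version, a wrapper might look like the sketch below. The name fit_sklearn_sgd_classifier and the particular settings (max_iter=10, tol=None, a fixed random_state) are my choices for illustration, picked to loosely mirror the defaults described above; the toy data is invented.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_sklearn_sgd_classifier(X, y):
    """Illustrative wrapper for `sklearn.linear_model.SGDClassifier`."""
    # tol=None makes the optimizer run for exactly max_iter epochs.
    mod = SGDClassifier(loss='hinge', max_iter=10, tol=None, random_state=42)
    mod.fit(X, y)
    return mod


# Toy data, only to show the calling convention.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_toy = ["positive", "negative", "positive", "negative"]
sgd_mod = fit_sklearn_sgd_classifier(X_toy, y_toy)
sgd_preds = sgd_mod.predict(X_toy)
print(sgd_preds)
```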
For the sake of our experimental framework, a simple wrapper for BasicSGDClassifier:
def fit_basic_sgd_classifier(X, y):
    """Wrapper for `BasicSGDClassifier`.

    Parameters
    ----------
    X : 2d np.array
        The matrix of features, one example per row.
    y : list
        The list of labels for rows in `X`.

    Returns
    -------
    BasicSGDClassifier
        A trained `BasicSGDClassifier` instance.

    """
    mod = BasicSGDClassifier()
    mod.fit(X, y)
    return mod
As I said above, we likely don't want to rely on BasicSGDClassifier (though it does a good job with SST!). Instead, we want to rely on sklearn. Here's a simple wrapper for sklearn.linear_model.LogisticRegression using our build_dataset paradigm.
def fit_maxent_classifier(X, y):
    """Wrapper for `sklearn.linear_model.LogisticRegression`. This is also
    called a Maximum Entropy (MaxEnt) Classifier, which is more fitting
    for the multiclass case.

    Parameters
    ----------
    X : 2d np.array
        The matrix of features, one example per row.
    y : list
        The list of labels for rows in `X`.

    Returns
    -------
    sklearn.linear_model.LogisticRegression
        A trained `LogisticRegression` instance.

    """
    mod = LogisticRegression(fit_intercept=True)
    mod.fit(X, y)
    return mod
The sklearn.linear_model package has a number of other classifier models that could be effective for SST.
The sklearn.ensemble package contains powerful classifiers as well. The theme that runs through all of them is that one can get better results by averaging the predictions of a bunch of more basic classifiers. A RandomForestClassifier will bring some of the power of deep learning models without the optimization challenges (though see this blog post on some limitations of the current sklearn implementation).
The sklearn.svm package contains variations on Support Vector Machines (SVMs).
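Wrappers for these other model families can follow the same pattern as fit_maxent_classifier. The two functions below are sketches; their names and hyperparameter settings are illustrative assumptions rather than recommendations, and the toy data only demonstrates the (X, y) calling convention.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC


def fit_random_forest_classifier(X, y):
    """Illustrative wrapper for `sklearn.ensemble.RandomForestClassifier`."""
    mod = RandomForestClassifier(n_estimators=100, random_state=42)
    mod.fit(X, y)
    return mod


def fit_linear_svc(X, y):
    """Illustrative wrapper for `sklearn.svm.LinearSVC`."""
    mod = LinearSVC(C=1.0)
    mod.fit(X, y)
    return mod


# Toy data, only to show that both wrappers consume (X, y) and return a model.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_toy = ["positive", "negative", "positive", "negative"]
rf_preds = fit_random_forest_classifier(X_toy, y_toy).predict(X_toy)
svc_preds = fit_linear_svc(X_toy, y_toy).predict(X_toy)
```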
We now have all the pieces needed to run experiments. And we're going to want to run a lot of experiments, trying out different feature functions, taking different perspectives on the data and labels, and using different models.
To make that process efficient and regimented, sst contains a function experiment. All it does is pull together these pieces and use them for training and assessment. It's complicated, but the flexibility will turn out to be an asset.
_ = sst.experiment(
    unigrams_phi,
    fit_maxent_classifier,
    train_reader=sst.train_reader,
    assess_reader=None,
    train_size=0.7,
    class_func=sst.ternary_class_func,
    score_func=utils.safe_macro_f1,
    verbose=True)
Accuracy: 0.617
             precision    recall  f1-score   support

   negative      0.634     0.696     0.664       997
    neutral      0.239     0.106     0.147       483
   positive      0.666     0.772     0.715      1084

avg / total      0.573     0.617     0.588      2564
A few notes on this function call:
* Since assess_reader=None, the function reports performance on a random train–test split. Give sst.dev_reader as the argument to assess against the dev set.
* unigrams_phi is the function we defined above. By changing/expanding this function, you can start to improve on the above baseline, perhaps periodically seeing how you do on the dev set.
* fit_maxent_classifier is the wrapper we defined above. To assess new models, simply define more functions like this one. Such functions just need to consume an (X, y) constituting a dataset and return a model.
_ = sst.experiment(
    unigrams_phi,
    fit_maxent_classifier,
    class_func=sst.ternary_class_func,
    assess_reader=sst.dev_reader)
Accuracy: 0.602
             precision    recall  f1-score   support

   negative      0.628     0.689     0.657       428
    neutral      0.343     0.153     0.211       229
   positive      0.629     0.750     0.684       444

avg / total      0.569     0.602     0.575      1101
_ = sst.experiment(
    unigrams_phi,
    fit_basic_sgd_classifier,
    class_func=sst.ternary_class_func,
    assess_reader=sst.dev_reader)
Accuracy: 0.572
             precision    recall  f1-score   support

   negative      0.624     0.589     0.606       428
    neutral      0.293     0.170     0.215       229
   positive      0.601     0.764     0.673       444

avg / total      0.546     0.572     0.552      1101
Where does our default set-up sit with regard to published baselines for the binary problem? (Compare Socher et al., Table 1.)
_ = sst.experiment(
    unigrams_phi,
    fit_maxent_classifier,
    class_func=sst.binary_class_func,
    assess_reader=sst.dev_reader)
Accuracy: 0.772
             precision    recall  f1-score   support

   negative      0.783     0.741     0.761       428
   positive      0.762     0.802     0.782       444

avg / total      0.772     0.772     0.772       872
While we're at it, we might as well see whether adding a hidden layer to our maxent classifier yields any benefits. Whereas LogisticRegression is, at its core, computing
$$y = \textbf{softmax}(xW_{xy} + b_{y})$$
the shallow neural network inserts a hidden layer with a non-linear activation applied to it:
$$\begin{align*} h &= \tanh(xW_{xh} + b_{h}) \\ y &= \textbf{softmax}(hW_{hy} + b_{y}) \end{align*}$$
def fit_nn_classifier(X, y):
    mod = TfShallowNeuralClassifier(hidden_dim=50, max_iter=100)
    mod.fit(X, y)
    return mod
_ = sst.experiment(
    unigrams_phi,
    fit_nn_classifier,
    class_func=sst.binary_class_func)
Iteration 100: loss: 3.178435742855072
Accuracy: 0.645
             precision    recall  f1-score   support

   negative      0.623     0.595     0.609       964
   positive      0.662     0.687     0.674      1112

avg / total      0.644     0.645     0.644      2076
At 100 iterations, this classifier is still well short of the baseline set by LogisticRegression (0.645 accuracy here). With more iterations (and perhaps some fiddling with the activation function and hidden dimensionality), it might meet or exceed that baseline.
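To make the two equations for the shallow network concrete, here is a NumPy sketch of the forward pass only. It does no training, the weights are random, and the dimensions are arbitrary values chosen for illustration; TfShallowNeuralClassifier itself is implemented differently (in TensorFlow).

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def shallow_forward(X, W_xh, b_h, W_hy, b_y):
    """h = tanh(x W_xh + b_h); y = softmax(h W_hy + b_y)."""
    h = np.tanh(X.dot(W_xh) + b_h)
    return softmax(h.dot(W_hy) + b_y)


rng = np.random.RandomState(0)
n_examples, n_features, hidden_dim, n_classes = 4, 6, 3, 2
X_toy = rng.rand(n_examples, n_features)
probs = shallow_forward(
    X_toy,
    rng.randn(n_features, hidden_dim), np.zeros(hidden_dim),
    rng.randn(hidden_dim, n_classes), np.zeros(n_classes))
print(probs)  # one probability distribution over classes per row
```

Dropping the tanh line and mapping X straight to the softmax gives the LogisticRegression computation.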
The training process learns parameters (the weights). There are typically lots of other parameters that need to be set. For instance, our BasicSGDClassifier has a learning rate parameter and a training iteration parameter. These are called hyperparameters. The more powerful sklearn classifiers often have many more such hyperparameters. These are outside of the explicitly stated objective, hence the "hyper" part.
So far, we have just set the hyperparameters by hand. However, their optimal values can vary widely between datasets, and choices here can dramatically impact performance, so we would like to set them as part of the overall experimental framework.
Luckily, sklearn provides a lot of functionality for setting hyperparameters via cross-validation. The function sst.fit_classifier_with_crossvalidation implements a basic framework for taking advantage of these options.
This method has the same basic shape as fit_maxent_classifier above: it takes a dataset as input and returns a trained model. However, to find its favored model, it explores a space of hyperparameters supplied by the user, seeking the optimal combination of settings.
Note: this kind of search seems not to have a large impact for SST as we're using it. However, it can matter a lot for other data sets, and it's also an important step to take when trying to publish, since reviewers are likely to want to check that your comparisons aren't based in part on opportunistic or ill-considered choices for the hyperparameters.
Here's a fairly full-featured use of the above for the LogisticRegression model family:
def fit_maxent_with_crossvalidation(X, y):
    """A MaxEnt model of dataset with hyperparameter
    cross-validation. Some notes:

    * 'fit_intercept': whether to include the class bias feature.
    * 'C': inverse of the regularization strength (smaller is more
      regularized).
    * 'penalty': type of regularization -- roughly, 'l1' encourages small
      sparse models, and 'l2' encourages the weights to conform to a
      gaussian prior distribution.

    Other arguments can be cross-validated; see
    http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

    Parameters
    ----------
    X : 2d np.array
        The matrix of features, one example per row.
    y : list
        The list of labels for rows in `X`.

    Returns
    -------
    sklearn.linear_model.LogisticRegression
        A trained model instance, the best model found.

    """
    # Note: 'l1' is only supported by some solvers (e.g., 'liblinear').
    basemod = LogisticRegression()
    cv = 5
    param_grid = {'fit_intercept': [True, False],
                  'C': [0.4, 0.6, 0.8, 1.0, 2.0, 3.0],
                  'penalty': ['l1', 'l2']}
    return sst.fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)
_ = sst.experiment(
    unigrams_phi,
    fit_maxent_with_crossvalidation,
    class_func=sst.binary_class_func)
Best params {'C': 2.0, 'fit_intercept': True, 'penalty': 'l2'}
Best score: 0.755
Accuracy: 0.772
             precision    recall  f1-score   support

   negative      0.762     0.742     0.752       966
   positive      0.781     0.798     0.789      1110

avg / total      0.772     0.772     0.772      2076
The models written for this course are also compatible with this framework. They "duck type" the sklearn models by having methods fit, predict, get_params, and set_params, and an attribute params.
def fit_basic_sgd_classifier_with_crossvalidation(X, y):
    basemod = BasicSGDClassifier()
    cv = 5
    param_grid = {'eta': [0.01, 0.1, 1.0], 'max_iter': [10]}
    return sst.fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)
_ = sst.experiment(
    unigrams_phi,
    fit_basic_sgd_classifier_with_crossvalidation,
    class_func=sst.binary_class_func)
Best params {'eta': 0.01, 'max_iter': 10}
Best score: 0.743
Accuracy: 0.752
             precision    recall  f1-score   support

   negative      0.717     0.787     0.750       980
   positive      0.791     0.722     0.755      1096

avg / total      0.756     0.752     0.753      2076
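For a sense of what such a cross-validation helper involves, here is a sketch built on sklearn's GridSearchCV. I am assuming that sst.fit_classifier_with_crossvalidation works roughly like this; the actual implementation may differ. The function name, the scoring argument, and the toy data are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def fit_with_grid_search(X, y, basemod, cv, param_grid, scoring='f1_macro'):
    """Try every combination in `param_grid` with `cv`-fold
    cross-validation and return the best model, refit on all of (X, y)."""
    searcher = GridSearchCV(basemod, param_grid, cv=cv, scoring=scoring)
    searcher.fit(X, y)
    print("Best params", searcher.best_params_)
    print("Best score: {0:0.03f}".format(searcher.best_score_))
    return searcher.best_estimator_


# Toy data (invented): the first feature separates the two classes.
rng = np.random.RandomState(0)
X_toy = rng.rand(20, 3)
X_toy[:10, 0] += 2.0
y_toy = ["positive"] * 10 + ["negative"] * 10

best_mod = fit_with_grid_search(
    X_toy, y_toy, LogisticRegression(), cv=2,
    param_grid={'C': [0.5, 1.0, 2.0]})
```

Any model that supports fit, predict, get_params, and set_params can be passed as basemod, which is why the duck-typed course models work here too.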
Suppose two classifiers differ according to an effectiveness measure like F1 or accuracy. Are they meaningfully different?
For very large datasets, the answer might be clear: if performance is very stable across different train/assess splits and the difference in terms of correct predictions has practical import, then you can clearly say yes.
With smaller datasets, or models whose performance is closer together, it can be harder to determine whether the two models are different. We can address this question in a basic way with repeated runs and basic null-hypothesis testing on the resulting score vectors.
The function sst.compare_models is designed for such testing. The default set-up uses the non-parametric Wilcoxon signed-rank test to make the comparisons, which is relatively conservative and recommended by Demšar 2006 for cases where one can afford to do multiple assessments.
Here's an example showing the default parameter values and comparing LogisticRegression and BasicSGDClassifier:
_ = sst.compare_models(
    unigrams_phi,
    fit_maxent_classifier,
    stats_test=scipy.stats.wilcoxon,
    trials=10,
    phi2=None,  # Defaults to same as first required argument.
    train_func2=fit_basic_sgd_classifier,  # Defaults to same as second required argument.
    reader=sst.train_reader,
    train_size=0.7,
    class_func=sst.ternary_class_func,
    score_func=utils.safe_macro_f1)
Model 1 mean: 0.515
Model 2 mean: 0.505
p = 0.074
In general, one wants to compare two feature functions against the same model, or one wants to compare two models with the same feature function used for both. If both are changed at the same time, then it will be hard to figure out what is causing any differences you see.
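The statistical test itself is a single call. The sketch below applies scipy.stats.wilcoxon to two paired score vectors, one score per trial for each model. The numbers are hypothetical values made up to show the mechanics; they are not results from any experiment in this notebook.

```python
import numpy as np
import scipy.stats

# Hypothetical macro-F1 scores from 10 paired trials (same splits for both).
scores1 = np.array([0.520, 0.511, 0.534, 0.498, 0.525,
                    0.507, 0.541, 0.503, 0.519, 0.512])
scores2 = np.array([0.508, 0.515, 0.513, 0.491, 0.528,
                    0.492, 0.532, 0.501, 0.525, 0.501])

# The signed-rank test works on the per-trial differences, so the two
# vectors must be aligned trial by trial.
statistic, p_value = scipy.stats.wilcoxon(scores1, scores2)

print("Model 1 mean: {0:0.03f}".format(scores1.mean()))
print("Model 2 mean: {0:0.03f}".format(scores2.mean()))
print("p = {0:0.03f}".format(p_value))
```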
In order to get a feel for the codebase and prepare for the in-class bake-off, we suggest some rounds of the basic development cycle for models based in hand-built feature functions:
1. Write a new feature function.
2. Use sst.experiment to evaluate your new feature function on the binary and ternary versions of SST, with at least fit_basic_sgd_classifier and fit_maxent_classifier.
3. Compare your feature function with unigrams_phi using compare_models.
4. Once you are satisfied, assess on the dev set.
Error analysis is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.
To bring error analysis into your development cycle, you could improve sst.experiment by adding a keyword argument view_errors with default value 0. Where the value is n, the function prints out a random selection of n errors: the underlying tree, the correct label, and the predicted label.
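The core of that extension could be a small helper like the one sketched here. The name sample_errors and its arguments are hypothetical; wiring it into sst.experiment (which would need to keep the assessment trees, gold labels, and predictions around) is left to you.

```python
import random


def sample_errors(examples, gold, predictions, n=0, seed=None):
    """Print and return a random selection of up to `n` errors, each as
    an (example, correct label, predicted label) triple."""
    errors = [(ex, g, p)
              for ex, g, p in zip(examples, gold, predictions)
              if g != p]
    rng = random.Random(seed)
    # Never ask for more errors than there are.
    sample = rng.sample(errors, min(n, len(errors)))
    for ex, g, p in sample:
        print("{}\n\tcorrect: {}\n\tpredicted: {}".format(ex, g, p))
    return sample


# Toy usage; strings stand in for the underlying trees.
toy_examples = ["great fun", "not good", "dull", "a mess"]
toy_gold = ["positive", "negative", "negative", "negative"]
toy_preds = ["positive", "positive", "negative", "positive"]
shown = sample_errors(toy_examples, toy_gold, toy_preds, n=1, seed=0)
```

With the default n=0, nothing is printed, so existing calls would behave as before.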