In [1]:

```
%pylab inline
from sklearn import datasets
from numpy import arange, array, dot, linspace, meshgrid, \
ones, sign, vectorize, vstack
```

This lecture will briefly motivate *discriminative* classifiers, and then discuss in great detail
the family of *linear classifiers*, conceptually simple yet **Unreasonably Effective** tools for natural language processing. Then, time permitting, we will review a number of extensions to the linear classifier framework, including:

- *large margin* classification
- strategies for *multiclass classification*
- strategies for *structured classification*
- linear functions for *ranking* competing hypotheses
- multiplicative linear classification using the *winnow*
- non-linear classification using the *kernel trick*

In natural language processing, we are often faced with the problem of making decisions in the face of competing cues of varying or unknown reliability. The traditional approach to this problem proceeds as follows:

- *Probability density estimation*: estimate the conditional probability $p(y~|~\phi)$, where $y$ is the decision and $\phi$ are the observations
- *Thresholding*: pick the outcome $\hat{y}$ which maximizes the conditional probability given observation $\phi$:

$$\hat{y}~=~\underset{y~\in~Y}{\operatorname{argmax}}~p(y~|~\phi)~.$$

This strategy underlies many machine learning methods. In *generative models* such as naïve Bayes, the joint probability $p(\phi,~y)$ is first estimated, and then converted to a conditional probability using Bayes' rule:

$$p(y~|~\phi)~=~\frac{p(\phi,~y)}{\sum_{y'~\in~Y} p(\phi,~y')}~.$$

Similarly, in *discriminative models* such as multinomial logistic regression (aka *maxent*), the conditional probability $p(y~|~\phi)$ is estimated directly. What both types of models assume is that the data is generated by a known stochastic process.
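As a small illustration of the estimate-then-threshold recipe, the sketch below converts a joint distribution into a conditional one and then picks the most probable label. The joint table is made up for this example and is not drawn from any real data.

```python
import numpy as np

# Hypothetical joint distribution p(phi, y): rows are three observation
# types, columns are two labels. The numbers are invented for illustration.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.25],
                  [0.15, 0.15]])

# p(y | phi) = p(phi, y) / p(phi), where p(phi) sums the joint over labels
marginal = joint.sum(axis=1, keepdims=True)
conditional = joint / marginal

# thresholding: choose the label with the highest conditional probability
# (argmax breaks ties in favor of the first label)
y_hat = conditional.argmax(axis=1)
print(conditional)
print(y_hat)
```

Each row of `conditional` sums to one, so the thresholding step only ever compares probabilities for the same observation.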

In the 1980s and 1990s, alternative approaches were proposed under the banner of *statistical learning theory*. In this framework, the goal is not to estimate a probability density function (per se) but rather to learn a *classification function* $f(\phi)$ which predicts $y$. This function is determined using a *probably approximately correct* (PAC) learning algorithm (Valiant 1984); that is, an algorithm where, if enough data is provided, it will (*probably*, i.e., with very high probability) predict $y$ with minimal error (i.e., it will be *approximately correct*). Learning this classification function is often easier than probability density estimation, and can be done even when the data-generating function is unknown.

One of the simplest types of classification functions is a *binary linear classifier*, which makes a binary classification decision based on the value of a linear combination (i.e., a weighted sum) of feature values, as follows.

- A *label* $y$ (i.e., the thing to be predicted) is a binary value in $\{-1, +1\}$
- A *feature vector* $\phi$ is a real-valued vector
- An *observation* is a $(y, \phi)$ tuple
- The classifier itself is defined by a real-valued *weight vector* $w$

Given a classifier with weight vector $w$, the *predicted label* for an observation is defined as

$$\hat{y} = \begin{cases} +1 & \text{if } w~\cdot~\phi > 0 \\ -1 & \text{otherwise}\end{cases}$$

where $w~\cdot~\phi$ is the dot product (here, a weighted sum) of the weights and features.

Without loss of generality, we assume that there is also a "bias" feature present in all feature vectors $\phi$ and that there is a corresponding weight in $w$ (call it $w_b$), which acts much like an *intercept* in linear regression.

Geometrically speaking, the *decision boundary* lies orthogonal to $w$.

In [2]:

```
# the famous iris data set (Fisher 1936), but throwing
# out the third species, for sake of simplicity
iris = datasets.load_iris()
Y = iris.target
mask = iris.target != 2
Y = iris.target[mask]
Phi = iris.data[mask, :2]
## a decision boundary, fit by hand
w = array([-3., 1.2, -1.1])
## plot decision boundary
# create prediction mesh
h = .02
x_min = Phi[:, 0].min() - 1
x_max = Phi[:, 0].max() + 1
y_min = Phi[:, 1].min() - 1
y_max = Phi[:, 1].max() + 1
(xx, yy) = meshgrid(arange(x_min, x_max, h),
arange(y_min, y_max, h))
Phi_mesh = vstack((ones(xx.size),
xx.ravel(), yy.ravel()))
Z = sign(dot(w, Phi_mesh)).reshape(xx.shape)
# plot mesh
contourf(xx, yy, Z, alpha=0.1)
## plot points
scatter(Phi[:, 0], Phi[:, 1], c=Y, cmap=cm.Paired)
xlabel("Sepal length")
ylabel("Sepal width")
xlim(xx.min(), xx.max())
ylim(yy.min(), yy.max())
## plot $w$
# a phi on the decision boundary
index = tuple(vstack(nonzero(Z == 0.))[:, -2])
arrow(xx[index], yy[index], w[1], w[2],
head_width=.1, color="k")
annotate("W", xy=(xx[index] + w[1] + .1, yy[index] + w[2] + .1))
```


But how do we learn the weight vector? Perhaps the simplest method is to initialize all weights $w_i \in w$ to zero, and to then apply the *perceptron update rule* (Rosenblatt 1958), as follows. We iterate over observations, classifying each example.

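The *loss function* for the perceptron update rule is defined as

$$\ell(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise}\end{cases}$$

That is, loss is 0 if the label and prediction match, and 1 otherwise. (For this reason, this loss function is sometimes called *0-1 loss*.) The *update* $\tau$ is given by

$$\tau = \frac{y - \hat{y}}{2}~.$$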

In prose, then, the update is defined as

- $\tau = +1$ if $y = +1$ but $\hat{y} = -1$
- $\tau = -1$ if $y = -1$ but $\hat{y} = +1$
- $\tau = 0$ otherwise

Of course, in the last case (i.e., if the observation is correctly classified) we do not bother to compute $\tau$. To apply the update, define the weight vector $w_t$ at time $t$ according to

$$w_t = w_{t - 1} + \alpha~\tau~\phi$$

where $\alpha$ is the *learning rate*, a positive real number. Without loss of generality, we can assume that $\alpha = 1$ and selectively omit it henceforth. In practice, it may be set to a smaller value or gradually tapered off as part of a *learning schedule*.

This update rule is a special case of *stochastic gradient descent*. The error function is

The inner loop here is simply $y_i~\hat{y_i}$ (that is, $1$ when data is correctly classified and $0$ otherwise), and the entire expression is simply $-1$ times the number of observations that are correctly classified. (Thanks to X. Song for helpful discussion on this matter.) In traditional gradient descent, we compute the sum of the gradients for all observations and then take a step in the direction of the negative gradient, repeating until convergence. In stochastic gradient descent, we take a step in the direction of the negative gradient after *every observation*. See the appendix below for more.

The perceptron update rule guarantees convergence in finite time when the data is *linearly separable* (i.e., when a perfect decision boundary exists), and good approximation bounds when the data is not (see Freund & Schapire 1999 for simple proofs).
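The update rule and the convergence claim can be tried out on a toy problem. The following is a minimal sketch, assuming dense NumPy feature vectors with a leading bias feature, labels in $\{-1, +1\}$, and a fixed presentation order; the function name `perceptron_fit` and the toy data (logical AND, which is linearly separable) are illustrative choices, not part of the lecture code.

```python
import numpy as np


def perceptron_fit(Y, Phi, epochs=100, alpha=1.0):
    """Dense perceptron; returns the weight vector after training."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for y, phi in zip(Y, Phi):
            y_hat = 1 if np.dot(w, phi) > 0 else -1
            if y_hat != y:
                # on a mistake tau equals y (+1 for a false negative,
                # -1 for a false positive)
                w += alpha * y * phi
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: converged
            break
    return w


# logical AND; the first column is the bias feature
Phi = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
Y = np.array([-1, -1, -1, 1])
w = perceptron_fit(Y, Phi)
Y_hat = np.where(Phi.dot(w) > 0, 1, -1)
print(w, Y_hat)
```

Because the data is separable, the loop stops as soon as one full pass produces no updates.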

In NLP, we generally adopt several simplifying assumptions. First, we assume that all features $\phi_i$ are binary-valued (i.e., $\{0, 1\}$). Second, we assume that the vast majority of the features for any observation are zero (i.e., feature values are sparse). We thus conceive of $\phi$ as a list of features which are "activated" for this observation. Third, we assume that most weights have a true zero value.

With these assumptions in place we can rewrite the decision rule and update rules as

$$\begin{align} \hat{y} &= \begin{cases} +1 & \text{if }\displaystyle\sum_{i~\in~\phi} w_i > 0 \\ -1 & \text{otherwise}\end{cases} \\ w_{i,t} &= w_{i, t - 1} + \alpha~\tau \end{align}$$

where $i~\in~\phi$ means that $\phi_i$ is "activated" for this observation, and $w_{i,t}$ is the $i$-th weight at time $t$. Now, let's put it all together.

In [3]:

```
# an abstract base class
from random import Random
from collections import defaultdict


class SGDLinearBinaryClassifier(object):
    """
    Abstract base class for stochastic gradient descent-based
    linear classification on sparse, hashable binary features
    """

    def __init__(self, seed=None, w_constructor=int):
        self.random = Random(seed)
        # weights default to zero for features not yet seen
        self.weights = defaultdict(w_constructor)

    def score(self, phi):
        return sum(self.weights[phi_i] for phi_i in phi)

    # NB: prediction is now boolean rather than {-1, +1}
    def predict(self, phi):
        return self.score(phi) > 0

    # to be explained
    def fit(self, Y, Phi, epochs):
        # we make `epochs` passes through the data, shuffling the
        # order of presentation each time
        data = list(zip(Y, Phi))  # which is a copy
        for _ in range(epochs):
            self.random.shuffle(data)
            for (y, phi) in data:
                self.fit_one(y, phi)

    def fit_one(self, y, phi):
        raise NotImplementedError
```

In [4]:

```
class BinaryPerceptronClassifier(SGDLinearBinaryClassifier):
    """
    Binary linear classifier using the perceptron update rule
    and stochastic gradient descent
    """

    def update(self, phi, tau, alpha=1):
        """
        Generic update function, where `tau` is the penalty to be
        applied and `alpha` is the learning rate
        """
        for phi_i in phi:
            self.weights[phi_i] += alpha * tau

    def fit_one(self, y, phi):
        yhat = self.predict(phi)
        if yhat and not y:    # false positive
            self.update(phi, -1)
        elif y and not yhat:  # false negative
            self.update(phi, +1)
        # else: loss, and tau, are 0


# sample usage:
#
# classifier = BinaryPerceptronClassifier()
# classifier.fit(Y, Phi, epochs=20)
# yhat = classifier.predict(phi)
```

In what follows, we describe widely-used variations on perceptron learning. For simplicity, we will use the "dense" notation rather than the "sparse" notation introduced immediately above.

One weakness of the "vanilla" perceptron (described above) is that it lacks *stability*; the very last example may greatly alter the weight vector, resulting in poor generalizability. One strategy to address this is to use the *pocket* trick (Gallant 1990), i.e., store a copy of the best $w$ so far ("in the pocket"), and use the pocket weights for inference once training is complete. Freund & Schapire (1999) propose another method, which they refer to as *voting*. Rather than modifying the weight vector $w$ in place, they keep a copy of every weight vector $w_0, w_1, \ldots, w_t$. Once training is complete, each weight vector is allowed to "vote" on the prediction, as in

$$\hat{y} = \operatorname{sgn}\left(\sum_{i = 0}^{t} \operatorname{sgn}(w_i~\cdot~\phi)\right)~.$$

Intuitively, this results in greater stability. However, a naïve implementation has much greater memory requirements than the vanilla perceptron, as well as greater time complexity at inference time. Freund & Schapire therefore suggest an alternative to voting, namely *averaging* of the weights at inference time, as in

$$\hat{y} = \operatorname{sgn}(\bar{w}~\cdot~\phi)$$

where $\bar{w}$ represents the averaged weight vector. The averaged perceptron has the same space and time complexity as the vanilla perceptron, but Freund & Schapire find that averaging performs just as well as voting. As a result, averaging is considered a "best practice" for most applications of linear classification.
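A minimal sketch of averaging, assuming dense NumPy weight vectors, labels in $\{-1, +1\}$, and $\alpha = 1$. The mean is maintained incrementally, so no copies of earlier weight vectors are needed; the `history` list exists only to check the running mean in this example and would be dropped in practice. The function name and toy data are illustrative.

```python
import numpy as np


def averaged_perceptron_fit(Y, Phi, epochs=10):
    """Returns the final weights, the averaged weights, and the history."""
    w = np.zeros(Phi.shape[1])
    w_bar = np.zeros(Phi.shape[1])
    history = []  # for checking only; not needed by the algorithm
    t = 0
    for _ in range(epochs):
        for y, phi in zip(Y, Phi):
            y_hat = 1 if np.dot(w, phi) > 0 else -1
            if y_hat != y:
                w = w + y * phi
            # the weight vector at every time step contributes to the
            # average, whether or not an update occurred
            t += 1
            w_bar += (w - w_bar) / t
            history.append(w.copy())
    return w, w_bar, history


Phi = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
Y = np.array([-1, -1, -1, 1])
w, w_bar, history = averaged_perceptron_fit(Y, Phi)
print(w, w_bar)
```

Inference with the averaged model uses `w_bar` in place of `w` in the usual decision rule.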

Beyond the stability it imparts, averaging can be thought of as the "poor man's $L_2$ regularization". The initial state of the weight vector is all zeros, and the early weight vectors act to reduce the magnitude of the final weights.

There are two ways to implement averaging. First, the classifier can preserve two weight vectors: $w_t$, the current weights, and $\sum_{i = 1}^{t} w_i$, the itemwise sum of all weights so far. However, the latter term may overflow when using fixed-size integers to represent weights. The second option is to employ a stable online averaging algorithm (Welford 1962):

$$\bar{w}_t = \bar{w}_{t - 1} + \frac{w_t - \bar{w}_{t - 1}}{t}$$

*Margin* is a measure of the degree to which a classification is correct. For a binary linear classifier, the margin for an observation $(y, \phi)$ is simply the "score" signed by the true label, that is

$$y~(w~\cdot~\phi)~.$$

What kind of margin does the perceptron update rule enforce? Since an update applies anytime misclassification occurs, and since misclassification occurs anytime the margin is negative, the perceptron update rule enforces a *positive margin*, i.e., any margin greater than zero.

Machine learning theorists have argued that, all else held equal, a classifier exhibits better stability when it enforces a *large margin*, even at the cost of misclassification. We accomplish this by specifying a new loss function. For instance, *hinge loss* enforces a *unit margin*, triggering an update any time the margin is less than one, that is

$$\max(0,~1 - y~(w~\cdot~\phi))~.$$

*Passive-aggressive classifiers* (Crammer et al. 2006) and linear-kernel *support vector machines* are common examples of *large margin classifiers*, which provide for a larger margin than the positive margin produced by the perceptron update rule.
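The difference between a positive margin and a unit margin can be made concrete with a short sketch. It assumes dense NumPy vectors and labels in $\{-1, +1\}$, and it applies a plain perceptron-style step (a subgradient step on the hinge loss) whenever the margin falls below one. This is not the passive-aggressive update nor an SVM solver, and the function names are illustrative.

```python
import numpy as np


def margin(w, y, phi):
    return y * np.dot(w, phi)


def hinge_loss(w, y, phi):
    return max(0.0, 1.0 - margin(w, y, phi))


def fit_one_unit_margin(w, y, phi, alpha=1.0):
    """Take a step whenever the margin is below one, even if the
    observation is already correctly classified."""
    if margin(w, y, phi) < 1.0:
        w = w + alpha * y * phi
    return w


w = np.array([0.5, 0.0])
phi = np.array([1.0, 0.0])
y = +1
loss_before = hinge_loss(w, y, phi)  # correctly classified, yet penalized
w_new = fit_one_unit_margin(w, y, phi)
loss_after = hinge_loss(w_new, y, phi)
print(loss_before, loss_after)
```

In this example the vanilla perceptron would leave `w` untouched, since the observation is already on the correct side of the decision boundary.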

Unlike many classifiers, the perceptron generalizes naturally to multiclass (i.e., more than two label) classification. In the binary formulation above, decisions were made based on the sign of the score $s = w~\cdot~\phi$. Imagine instead that observations were scored using two separate weight vectors,

$$\begin{align} s &= w~\cdot~\phi \\ s'&= w'~\cdot~\phi \end{align}$$

where $w$ and $w'$ are the weight vectors for $y = +1$ and $y = -1$, respectively. In this formulation, the decision rule is

$$\hat{y} = \begin{cases} +1 & \text{if } s > s' \\ -1 & \text{otherwise}\end{cases}$$

and the update rule is modified to penalize only the "wrong" weight vector and reward the "right" vector. As it turns out, this alternative formulation is, in the binary case, identical to our earlier definitions. The only reason we don't need to keep track of $w'$ in the binary case is that it's implicitly defined: it is simply the additive inverse of $w$.

This alternative formulation is useful, however, when attempting to classify more than two labels. Let $w(l)$ be the weight vector associated with label $l$, now conceived of as a nominal attribute rather than as the earlier $\{-1, +1\}$. Then, we can generalize the decision rule as

$$\hat{y} = \underset{l~\in~L}{\operatorname{argmax}}~w(l)~\cdot~\phi~.$$

The simplest update rule for the multiclass perceptron (the *basic update rule*) rewards the true label $y$ and penalizes the incorrect hypothesis $\hat{y}$:

$$\begin{align} w_t(y) &= w_{t - 1}(y) + \phi \\ w_t(\hat{y}) &= w_{t - 1}(\hat{y}) - \phi \end{align}$$

However, this penalizes only the *top-ranked* incorrect hypothesis, though there may be many incorrect hypotheses ranked above the true hypothesis. Crammer & Singer (2003) propose two alternative updating strategies for multiclass problems. The first is known as *uniform update*. In this strategy, the "penalty phase" (second half) of the above update is modified so that for all false hypotheses $\hat{y}$ which are ranked higher than the true hypothesis,

$$w_t(\hat{y}) = w_{t - 1}(\hat{y}) - \frac{\phi}{E}$$

where $E$ is the number of false hypotheses ranked above the true hypothesis $y$. A second alternative is *proportional update*, in which each false hypothesis ranked above the true hypothesis is penalized in proportion to the degree to which it is wrongly ranked. As pointed out by Crammer & Singer, both of these alternative strategies are naturally adapted to the case where there is more than one true hypothesis, by scaling the update during the "reward phase" as well.
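The basic and uniform updates can be sketched as follows, assuming dense NumPy weight vectors stored in a dictionary keyed by label and $\alpha = 1$. Treating a tie with the true label as "ranked above" is an assumption made here so that training can start from all-zero weights; the function names are illustrative.

```python
import numpy as np


def predict(W, phi):
    """W maps each label to its weight vector; returns the argmax label
    (ties go to the label that comes first in W)."""
    return max(W, key=lambda l: np.dot(W[l], phi))


def basic_update(W, y, phi):
    y_hat = predict(W, phi)
    if y_hat != y:
        W[y] = W[y] + phi          # reward the true label
        W[y_hat] = W[y_hat] - phi  # penalize the top-ranked hypothesis


def uniform_update(W, y, phi):
    true_score = np.dot(W[y], phi)
    # false hypotheses scoring at least as high as the true label
    # (counting ties is an assumption of this sketch)
    errors = [l for l in W if l != y and np.dot(W[l], phi) >= true_score]
    if errors:
        W[y] = W[y] + phi
        for l in errors:
            W[l] = W[l] - phi / len(errors)


phi = np.array([1.0, 1.0])
W_basic = {l: np.zeros(2) for l in "abc"}
basic_update(W_basic, "b", phi)
W_uniform = {l: np.zeros(2) for l in "abc"}
uniform_update(W_uniform, "a", phi)
print(predict(W_basic, phi), predict(W_uniform, phi))
```

Note that the uniform update spreads a total penalty of one $\phi$ across all offending labels, whereas the basic update spends it all on a single label.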

In addition to the "true" multiclass perceptron decision rule described above, there are two other decision rules applicable to multiclass problems.

In the *one-vs.-rest* strategy, a $|L|$-multiclass problem is reduced to $|L|$ binary classifiers. Each of these binary classifiers regards one label $l~\in~L$ as a positive example, and all other labels as negative examples; let $w(l, \neg~l)$ be the weight vector for this binary classifier. The decision rule selects the hypothesized label $l$ according to the "most confident" binary classifier, as follows:

$$\hat{y} = \underset{l~\in~L}{\operatorname{argmax}}~w(l, \neg~l)~\cdot~\phi~.$$

In the *one-vs.-one* strategy, a $|L|$-multiclass classification is performed with $|L| (|L| - 1)/2$ binary classifiers (one for each pair of labels), where $|L|$ is the number of unique labels. The decision rule selects the hypothesized label which receives the most "yes" ($+1$) votes.

Both of these are poorly-understood heuristics, however, and were developed primarily for use with classifiers that do not naturally support multiclass classification, so they are less commonly used with perceptron-style update rules.
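For concreteness, the two decision rules might be sketched like this, assuming that the binary classifiers have already been trained and are stored as dense NumPy weight vectors. The dictionary layouts, the hand-set weights, and the tie-breaking behavior are assumptions of this example.

```python
import numpy as np
from itertools import combinations


def one_vs_rest_predict(W, phi):
    """W[l] is the weight vector of the classifier 'l vs. all others'."""
    return max(W, key=lambda l: np.dot(W[l], phi))


def one_vs_one_predict(W, labels, phi):
    """W[(a, b)] is the weight vector of the classifier which votes
    for a when its score is positive and for b otherwise."""
    votes = {l: 0 for l in labels}
    for a, b in combinations(labels, 2):
        winner = a if np.dot(W[(a, b)], phi) > 0 else b
        votes[winner] += 1
    return max(labels, key=lambda l: votes[l])


labels = ["a", "b", "c"]
phi = np.array([1.0, 0.0])
# hand-set weights, purely for illustration
W_rest = {"a": np.array([1.0, 0.0]),
          "b": np.array([0.0, 1.0]),
          "c": np.array([-1.0, 0.0])}
W_pair = {("a", "b"): np.array([1.0, -1.0]),
          ("a", "c"): np.array([1.0, 0.0]),
          ("b", "c"): np.array([0.0, 1.0])}
print(one_vs_rest_predict(W_rest, phi))
print(one_vs_one_predict(W_pair, labels, phi))
```

With three labels, the one-vs.-one scheme needs three pairwise classifiers, matching the $|L| (|L| - 1)/2$ count.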

Linear classifiers can also be used as the "backend" for structured classification tasks. Generally speaking, structured linear classifiers offer the same convergence guarantees as their unstructured counterparts **as long as the search is exact**.

Linear classifiers are often used as scoring functions for hidden Markov models (HMMs) that arise in tagging tasks such as POS tagging or chunking (e.g., Collins 2002). These models classify the $t$-th example using a feature vector consisting both of attributes of $x_t$ (as well as $x_{t - 1}$, $x_{t + 1}$, etc.) and of hypotheses about the label $y_{t - 1}$. We will call the latter set of features the *transition features*. In the simplest (*greedy*) formulation, transition features at time $t$ are generated using the current best hypotheses for the labels for prior observations $\hat{y}_{t - 1}$, $\hat{y}_{t - 2}$, and so on. Exact search is possible using the Viterbi algorithm, with the lattice populated by linear classifier scores rather than emission and transition probabilities. You will implement an HMM linear classifier in MP4.
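A minimal sketch of exact first-order Viterbi search over linear scores is given below. The `score(t, prev, cur)` callback is a hypothetical stand-in for the linear classifier: it is assumed to return the score of label `cur` at position `t` given the previous label `prev` (`None` at the first position). The toy emission and transition scores are invented, and the sequence length is assumed to be at least one.

```python
def viterbi(n, labels, score):
    """Returns the highest-scoring label sequence of length n >= 1."""
    # best[t][cur] = (score of best path ending in cur at t, backpointer)
    best = [{cur: (score(0, None, cur), None) for cur in labels}]
    for t in range(1, n):
        column = {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[t - 1][p][0] + score(t, p, cur))
            column[cur] = (best[t - 1][prev][0] + score(t, prev, cur), prev)
        best.append(column)
    # follow backpointers from the best final label
    path = [max(labels, key=lambda l: best[-1][l][0])]
    for t in range(n - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path


# invented scores standing in for classifier output
labels = ["N", "V"]
emission = [{"N": 2.0, "V": 0.0}, {"N": 1.0, "V": 1.5}, {"N": 0.0, "V": 1.0}]
transition = {("N", "N"): -1.0, ("N", "V"): 1.0,
              ("V", "N"): 0.5, ("V", "V"): -2.0}


def score(t, prev, cur):
    s = emission[t][cur]
    if prev is not None:
        s += transition[(prev, cur)]
    return s


best_path = viterbi(3, labels, score)
print(best_path)
```

The greedy formulation would instead commit to the best label at each position before moving on, which is cheaper but forfeits the exactness that the convergence guarantees rely on.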

In some cases, the labels of the structure to be classified partially depend on something more elaborate than simply the previous few labels as above. In this case, one option is to perform a *beam search* using the perceptron to score each state in the search tree. A common use of this is for shift-reduce dependency parsing (e.g., Zhang & Clark 2011). Here, the classifier is used to decide whether to perform a shift or reduce operation. The following is a sketch of such an implementation. A *state* is a (possibly incomplete) dependency graph, with an associated score generated by the perceptron. Each state is also associated with a function which generates all *successor states* (which are defined by adding another shift or reduce operation to the current state's dependency graph). We initialize the search by adding an empty dependency graph to the *beam*. We then generate all successor states of all states in the beam, and use the linear classifier to *rank* all these successor states. The beam is then redefined to contain the top $k$ states, or all states with some score above a threshold $\theta$. This process is repeated until we reach a successor state which meets some pre-specified halting criterion; in the case of dependency parsing, this would be a complete dependency graph (i.e., one which has consumed all the input symbols).
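The procedure just described might be sketched generically as follows. All four callables (`successors`, `score`, `is_final`, and the toy definitions below them) are hypothetical hooks rather than a real parser: states here are plain strings of "S" (shift) and "R" (reduce) actions, and the score is an invented per-action weight standing in for the perceptron. This version keeps the top $k$ states and returns the best complete state as soon as one appears in the beam.

```python
import heapq


def beam_search(initial, successors, score, is_final, k=4):
    """Generic top-k beam search; returns a final state or None."""
    beam = [initial]
    while beam:
        finished = [s for s in beam if is_final(s)]
        if finished:
            return max(finished, key=score)
        candidates = [nxt for s in beam for nxt in successors(s)]
        beam = heapq.nlargest(k, candidates, key=score)
    return None


# toy problem: action sequences of length three
WEIGHT = {"S": 1.0, "R": 0.5}  # invented scores


def successors(state):
    return [state + action for action in "SR"]


def score(state):
    return sum(WEIGHT[action] for action in state)


def is_final(state):
    return len(state) == 3


result = beam_search("", successors, score, is_final, k=2)
print(result)
```

With a small `k` the search is inexact: a state pruned early can never be recovered, which is the price paid for not enumerating the whole search tree.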

See Daniel Connelly's notes about tree-style searches for implementational hints.

Above, we described the use of linear classifiers to rank incomplete hypotheses in structured classification. Ranking based on linear classifiers is often used to perform *global scoring* on a list of hypotheses produced by a local generative model. This can be done on ASR lattices, parse forests, or hypothesized machine translations. The simplest form of this technique consists of a linear model which is used to rank hypotheses; the highest-ranked hypothesis is then selected. This form of *ranking* is what Shen & Joshi (2005) refer to as *1-splitting* since the objective, informally speaking, is to separate the true hypothesis from all others. A more elaborate technique is *re-ranking*, in which a global linear model is used to modify the ranking provided by the local generative model.

The *winnow* (Littlestone 1988) is a **multiplicative** (rather than additive) variant of the perceptron update rule used to learn linear classifiers. In the binary case, all weights are non-negative values initialized to 1, and labels are boolean variables in $\{0, 1\}$. The decision function is given by

$$\hat{y} = \begin{cases} 1 & \text{if } w~\cdot~\phi > \theta \\ 0 & \text{otherwise}\end{cases}$$

where $\theta$ is a positive real number called the *threshold*. As with the perceptron update rule, updates are performed only when an observation is misclassified, but the update involves multiplication or division. Given a *learning rate* $\alpha$, where $\alpha > 1$:

- if $y = 1$ but $\hat{y} = 0$, all weights for that example are multiplied by $\alpha$
- if $y = 0$ but $\hat{y} = 1$, all weights for that example are divided by $\alpha$

This can be expressed more elegantly in log-space: the update is given by

$$\log w_{i,t} = \log w_{i,t - 1} + \operatorname{sgn}(y - \hat{y}) \log \alpha~.$$

General wisdom holds that the winnow is most effective when there is a very large number of features, most of which are irrelevant for classification.
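A minimal sketch of the binary winnow over sparse binary features, written in the style of the perceptron classes above but self-contained. The class name, the strict comparison against the threshold, the default $\alpha = 2$, and the toy data (in which only the feature `"x1"` matters) are assumptions of this example.

```python
from collections import defaultdict


class BinaryWinnowClassifier(object):
    """Winnow on sparse, hashable binary features; labels are booleans."""

    def __init__(self, theta, alpha=2.0):
        assert alpha > 1
        self.theta = theta
        self.alpha = alpha
        self.weights = defaultdict(lambda: 1.0)  # all weights start at 1

    def score(self, phi):
        return sum(self.weights[phi_i] for phi_i in phi)

    def predict(self, phi):
        return self.score(phi) > self.theta

    def fit_one(self, y, phi):
        yhat = self.predict(phi)
        if y and not yhat:    # false negative: promote active weights
            for phi_i in phi:
                self.weights[phi_i] *= self.alpha
        elif yhat and not y:  # false positive: demote active weights
            for phi_i in phi:
                self.weights[phi_i] /= self.alpha


# toy data: the label is True exactly when "x1" is active
data = [(True, ["x1", "x3"]), (False, ["x2", "x3"]), (True, ["x1"]),
        (False, ["x3", "x4"]), (False, ["x2", "x4"]), (True, ["x1", "x2"])]
classifier = BinaryWinnowClassifier(theta=2.0)
for _ in range(5):
    for (y, phi) in data:
        classifier.fit_one(y, phi)
print(dict(classifier.weights))
```

Because updates are multiplicative, weights stay positive, and the weight of the one relevant feature grows quickly relative to the irrelevant ones.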

One limitation of linear classifiers is that they are only guaranteed convergence in the case that the data is *linearly separable*, that is, only when there exists a hyperplane which separates the positive and negative examples. Minsky & Papert (1969) point out that many simple patterns are not linearly separable, their famous example being the exclusive-or (XOR) function. The *kernel trick* is just one of the many responses to this critique. To visualize the kernel trick, imagine that the $n$-dimensional dataset we observe (where each dimension corresponds to a feature) is merely a low-rank approximation (a "shadow on the cave") of the real, Platonic dataset, and crucially, the data is linearly separable in the real Platonic-space. So, all we need to do is to transform the observed data back to Platonic-space. We do this using a *kernel function* (call it $k$) applied to each $\phi$. The most common kernel in NLP is the *polynomial kernel*, defined for a pair of feature vectors $\phi$ and $\phi'$ by

$$k(\phi, \phi') = (\gamma~\phi~\cdot~\phi' + c)^d$$

where $\gamma$, $c$, and $d$ are hyperparameters. The *degree parameter* $d$ controls the degree of the approximation: when $d = 2$, the kernel augments $\phi$ with a new feature for each product $\phi_i~\phi_j$. As a result, the decision boundary will be quadratic rather than strictly linear. For visual examples of various kernels, see the `scikit-learn` docs. Many NLP researchers do not use the kernel trick directly; instead, they manually augment the feature set by creating composite features.
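The effect of a composite feature can be seen on the XOR example itself. The sketch below is not a kernelized perceptron; it assumes dense NumPy features with a leading bias column and simply appends the product $\phi_1~\phi_2$ by hand, which is the kind of degree-2 feature described above. The helper names are illustrative.

```python
import numpy as np


def perceptron_fit(Y, Phi, epochs=1000):
    """Dense perceptron with labels in {-1, +1} and alpha = 1."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for y, phi in zip(Y, Phi):
            if (1 if np.dot(w, phi) > 0 else -1) != y:
                w += y * phi
                mistakes += 1
        if mistakes == 0:
            break
    return w


def accuracy(w, Y, Phi):
    Y_hat = np.where(Phi.dot(w) > 0, 1, -1)
    return np.mean(Y_hat == Y)


# XOR, with a bias feature in the first column
Y = np.array([-1, 1, 1, -1])
Phi = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
# composite feature: the product of the two input features
Phi_aug = np.hstack([Phi, (Phi[:, 1] * Phi[:, 2])[:, None]])

acc_raw = accuracy(perceptron_fit(Y, Phi), Y, Phi)
acc_aug = accuracy(perceptron_fit(Y, Phi_aug), Y, Phi_aug)
print(acc_raw, acc_aug)
```

No weight vector can classify all four raw XOR points correctly, whereas the augmented data is linearly separable, so the perceptron converges on it.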

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In *EMNLP*, 1-8.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. *Machine Learning* 20(3): 273-297.

Koby Crammer and Yoram Singer. 2003. A family of additive online algorithms for category ranking. *Journal of Machine Learning Research* 3: 1025-1058.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. *Journal of Machine Learning Research* 7: 551-585.

Yoav Freund and Robert E. Schapire. 1999. Large margin classification using the perceptron algorithm. *Machine Learning* 37(3): 277-296.

Stephen I. Gallant. 1990. Perceptron-based learning algorithms. *IEEE Transactions on Neural Networks* 1(2): 179-191.

Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. *Machine Learning* 2(4): 285-318.

Marvin L. Minsky and Seymour A. Papert. 1969. *Perceptrons*. Cambridge: MIT Press.

Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. *Psychological Review* 65(6): 386-408.

Libin Shen and Aravind K. Joshi. 2005. Ranking and reranking with perceptron. *Machine Learning* 60(1): 73-96.

Vladimir N. Vapnik. 1998. *Statistical learning theory*. New York: Wiley-Interscience.

Leslie G. Valiant. 1984. A theory of the learnable. *Communications of the ACM* 27(11): 1134-1142.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. *Computational Linguistics* 37(1): 105-151.

Consider the quartic function $f(x) = 2x^4 - 2x^3 + 3x^2 - x + 3$, graphed below.

In [5]:

```
@vectorize
def f(x):
    return 2 * x * x * x * x - 2 * x * x * x + 3 * x * x - x + 3


x = linspace(-2, 2, 100)
y = f(x)
plot(x, y)
```


It looks like there is a local minimum of $f(x)$ somewhere around $x = .2$. Gradient descent provides a way to compute this local minimum with arbitrary precision. The basic insight is that at any point $x$, a differentiable, continuous function $f(x)$ decreases fastest in the direction of the negative gradient $-f'(x)$. It follows that $x' = x - \alpha f'(x)$ implies $f(x') \le f(x)$ for any sufficiently small value of $\alpha \in (0, 1)$. This provides a simple descent algorithm, outlined below. First, we compute the first derivative with respect to $x$; the reader can confirm that it is $f'(x) = 8x^3 - 6x^2 + 6x - 1$.

In [6]:

```
@vectorize
def fprime(x):
    return 8 * x * x * x - 6 * x * x + 6 * x - 1
```

Then, starting at 0 (a value chosen simply because it appears to be close to the local minimum), we take steps towards the negative gradient scaled by the learning rate $\alpha$, terminating once the last step size is less than some small positive value $\epsilon$.

In [7]:

```
ALPHA = .01      # learning rate
EPSILON = .0001  # detect convergence


def gd(fprime, x, alpha=ALPHA, epsilon=EPSILON):
    x_old = float("inf")
    # stop once the last step is smaller than epsilon
    while abs(x - x_old) > epsilon:
        x_old = x
        x = x_old - alpha * fprime(x_old)
    return x


print("Local minimum at x = {:.4f}".format(gd(fprime, 0)))
```