%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
import statnlpbook.sequence as seq
import pandas as pd
import matplotlib
import warnings
warnings.filterwarnings('ignore')
matplotlib.rcParams['figure.figsize'] = (8.0, 5.0)
%%HTML
<style>
.rendered_html td {
    font-size: x-large;
    text-align: left !important;
}
.rendered_html th {
    font-size: x-large;
    text-align: left !important;
}
</style>
Sequence labelling assigns labels to each element in a sequence. For example, part-of-speech (PoS) tagging assigns each token in a sentence its Part-of-Speech tag:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
---|---|---|---|---|---|---|---
I | predict | I | won't | win | a | single | game
O | V | O | V | V | D | A | N
Another sequence labelling task is Named Entity Recognition: label tokens as beginning (B), inside (I), or outside (O) a named entity:
Barack | Obama | was | born | in | |
---|---|---|---|---|---|
B-PER | I-PER | O | O | O | B-LOC |
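For illustration, converting entity spans to a BIO label sequence can be done with a small helper (hypothetical code; it assumes spans are given as (start, end, type) triples with exclusive end):

def spans_to_bio(tokens, spans):
    labels = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # entity-initial token
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # tokens inside the entity
    return labels

spans_to_bio(["Barack", "Obama", "was", "born", "in"], [(0, 2, "PER")])
# ['B-PER', 'I-PER', 'O', 'O', 'O']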
We model probability distributions over label sequences $\y$ conditioned on input sequences $\x$:

$$ s_{\params}(\x,\y) = \prob_\params(\y|\x) $$
Example data: PoS tagging for tweets from the Tweebank dataset.
def show_instance(x,y,begin=0,end=-1):
return pd.DataFrame([x[begin:end],y[begin:end]])
train = seq.load_tweebank("../data/oct27.splits/oct27.train")
dev = seq.load_tweebank("../data/oct27.splits/oct27.dev")
test = seq.load_tweebank("../data/oct27.splits/oct27.test")
show_instance(*train[0],0,12)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | I | predict | I | won't | win | a | single | game | I | bet | on | . |
1 | O | V | O | V | V | D | A | N | O | V | P | , |
Tags (such as "O", "V" and "^") are described in the Tweebank annotation guidelines.
# count how often each tag appears, and collect an example context per tag
from collections import defaultdict
import pandas as pd
examples = {}              # one example context per tag
counts = defaultdict(int)  # tag frequencies
words = defaultdict(set)   # distinct words observed with each tag
for x,y in train:
for i in range(0, len(x)):
if y[i] not in examples:
examples[y[i]] = [x[j] + "/" + y[j] if i == j else x[j] for j in range(max(i-2,0),min(i+2,len(x)-1))]
counts[y[i]] += 1
words[y[i]].add(x[i])
sorted_tags = sorted(counts.items(),key=lambda x:-x[1])
sorted_tags_with_examples = [(t,c,len(words[t])," ".join(examples[t])) for t,c in sorted_tags]
sorted_tags_table = pd.DataFrame(sorted_tags_with_examples, columns=['Tag','Count','Unique Words','Example'])
sorted_tags_table[:10]
 | Tag | Count | Unique Words | Example
---|---|---|---|---
0 | V | 2219 | 873 | I predict/V I |
1 | N | 2003 | 1377 | a single game/N I |
2 | , | 1715 | 84 | bet on ./, Got |
3 | P | 1252 | 126 | I bet on/P . |
4 | O | 1063 | 97 | I/O predict |
5 | ^ | 890 | 741 | . Got Cliff/^ Lee |
6 | D | 869 | 68 | won't win a/D single |
7 | A | 755 | 449 | win a single/A game |
8 | @ | 713 | 694 | me RT @e_one/@ : |
9 | R | 689 | 217 | but I still/R hate |
A fully factorised or local model:
$$ p_\params(\y|\x) = \prod_{i=1}^n p_\params(y_i|\x,i) $$

We use a log-linear classifier $p_\params(y|\x,i)$ to predict the class for sentence $\x$ and position $i$:

$$ p_\params(y|\x,i) = \frac{1}{Z_{\x,i}} \exp \langle \repr(\x,i),\params_y \rangle $$

What are good features $\repr(\x,i)$?
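Before designing features, here is a minimal sketch of how such a log-linear classifier turns a sparse feature dictionary into label probabilities (a toy illustration, not the implementation inside seq.LocalSequenceLabeler; the weight layout is an assumption):

import math

def local_log_linear(feats, weights, labels):
    # score each label y via the dot product <repr(x,i), theta_y>
    scores = {y: sum(v * weights[y].get(f, 0.0) for f, v in feats.items())
              for y in labels}
    # normalise with the partition function Z
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

toy_weights = {'O': {'bias': 0.5, 'word:I': 2.0}, 'V': {'bias': 0.1}}
local_log_linear({'bias': 1.0, 'word:I': 1.0}, toy_weights, ['O', 'V'])
# {'O': 0.917..., 'V': 0.083...}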
show_instance(*train[0],0,12)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | I | predict | I | won't | win | a | single | game | I | bet | on | . |
1 | O | V | O | V | V | D | A | N | O | V | P | , |
Bias: $$ \repr_0(\x,i) = 1 $$
Word at the token to be tagged: $$ \repr_w(\x,i) = \begin{cases}1 \text{ if }x_i=w \\\\ 0 \text{ else} \end{cases} $$
def feat_1(x,i):
return {
'bias':1.0,
'word:' + x[i]: 1.0,
}
local_1 = seq.LocalSequenceLabeler(feat_1, train, class_weight='balanced')
We can assess the accuracy of this model on the development set.
seq.accuracy(dev, local_1.predict(dev))
0.6964544889073191
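For reference, token-level accuracy is simply the fraction of tokens whose predicted label matches the gold label. A minimal re-implementation (a sketch, assuming this is what seq.accuracy computes) could look like:

def token_accuracy(data, guesses):
    # data: list of (x, y) pairs; guesses: list of predicted label sequences
    correct = sum(gold == guess
                  for (_, y), y_hat in zip(data, guesses)
                  for gold, guess in zip(y, y_hat))
    total = sum(len(y) for _, y in data)
    return correct / total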
Let us look at the confusion matrix:
seq.plot_confusion_matrix(dev, local_1.predict(dev))
The confusion matrix shows:

* "N" receives a lot of wrong counts
* "@" is a complete failure

Let us start with "@" ...
local_1.plot_lr_weights('@')
Features for specific users such as "word=@justinbieber" do not generalise well
How to address this?
def feat_2(x,i):
return {
**feat_1(x,i),
'first_at:' + str(x[i][0:1] == '@'): 1.0,
}
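To see what this feature function produces, we can apply it to a made-up token (the expected dictionary is shown as a comment):

feat_2(["@justinbieber", "rocks"], 0)
# {'bias': 1.0, 'word:@justinbieber': 1.0, 'first_at:True': 1.0}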
local_2 = seq.LocalSequenceLabeler(feat_2, train)
seq.accuracy(dev, local_2.predict(dev))
0.7484967862326353
To confirm that these results actually come from improved '@' prediction, let us look at the confusion matrix again:
seq.plot_confusion_matrix(dev, local_2.predict(dev))
Solved!
local_2.plot_lr_weights('@')
Other errors?
seq.plot_confusion_matrix(dev, local_2.predict(dev))
Look for errors with high frequency: for example, proper nouns ('^') frequently get mislabelled as common nouns ('N'). What do these errors look like?
util.Carousel(local_2.errors(dev[10:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='^'))
Senate | #ArtsGrades | are | in | ! |
^ | N | V | P | , |
N | N | V | P | , |
bias | first_at:False | word:Senate |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 1.67 |
-2.97 | 1.45 | 1.94 |
passed | and | who | made | the | Dirty | Dozen | . | #arts | http://t.co/BAh2iUL |
V | & | O | V | D | ^ | ^ | , | # | U |
N | & | O | V | D | N | N | , | N | N |
bias | first_at:False | word:Dirty |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
and | who | made | the | Dirty | Dozen | . | #arts | http://t.co/BAh2iUL | via |
& | O | V | D | ^ | ^ | , | # | U | P |
& | O | V | D | N | N | , | N | N | P |
bias | first_at:False | word:Dozen |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
29p | 11r | Pal | Gasol | went | da | fuck |
N | N | ^ | ^ | V | D | N |
N | N | N | N | V | D | V |
bias | first_at:False | word:Pal |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
29p | 11r | Pal | Gasol | went | da | fuck | off |
N | N | ^ | ^ | V | D | N | P |
N | N | N | N | V | D | V | T |
bias | first_at:False | word:Gasol |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
@comicsguy024 | I | don't | use | Chrome | due | to | the | lack |
@ | O | V | V | ^ | P | P | D | N |
@ | O | V | V | N | A | P | D | N |
bias | first_at:False | word:Chrome |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
So | who's | going | to | the | Ethernet | Expo | next | week | in |
P | L | V | P | D | ^ | ^ | A | N | P |
P | L | V | P | D | N | ^ | A | N | P |
bias | first_at:False | word:Ethernet |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
Ethernet | Expo | next | week | in | NYC | ? |
^ | ^ | A | N | P | ^ | , |
N | ^ | A | N | P | N | , |
bias | first_at:False | word:NYC |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
X | ) | RT | @DarCoxaj | : | TETRIS | ! | (: | " |
E | E | ~ | @ | ~ | ^ | , | E | , |
N | , | ~ | @ | ~ | N | , | E | , |
bias | first_at:False | word:TETRIS |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
tied | up | , | Physed | at | Forest | Green | - | have | a |
V | T | , | N | P | ^ | ^ | , | V | D |
N | T | , | N | P | N | A | , | V | D |
bias | first_at:False | word:Forest |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
to | go | for | Halloween | on | fri | and | sat | ... | Thinking |
P | V | P | ^ | P | ^ | & | ^ | , | V |
P | V | P | ^ | P | N | & | V | , | N |
bias | first_at:False | word:fri |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
fri | and | sat | ... | Thinking | pyramid | on | sat | ... |
^ | & | ^ | , | V | ^ | P | ^ | , |
N | & | V | , | N | N | P | V | , |
bias | first_at:False | word:pyramid |
1.0 | 1.0 | 1.0 |
-2.93 | 0.30 | 0.00 |
-2.97 | 1.45 | 0.00 |
Proper nouns tend to be capitalised!
def feat_3(x,i):
    return {
        **feat_2(x,i),
        # capitalisation feature: proper nouns tend not to be lower-case
        'is_lower:' + str(x[i].islower()): 1.0,
        # 'first_char:' + str(x[i][0:1]): 1.0
    }
local_3 = seq.LocalSequenceLabeler(feat_3, train)
seq.accuracy(dev, local_3.predict(dev))
0.771511507360564
This improvement indeed comes from being able to identify proper nouns when they are capitalised:
util.Carousel(local_3.errors(dev[10:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='^'))
# seq.find_contexts(train, lambda w: w == 'Senate')
Senate | #ArtsGrades | are | in | ! |
^ | N | V | P | , |
N | ^ | V | P | , |
bias | first_at:False | is_lower:False | word:Senate |
1.0 | 1.0 | 1.0 | 1.0 |
-2.45 | 0.85 | 0.01 | 0.78 |
-2.28 | 1.80 | -1.67 | 2.49 |
to | go | for | Halloween | on | fri | and | sat | ... | Thinking |
P | V | P | ^ | P | ^ | & | ^ | , | V |
P | V | P | ^ | P | N | & | V | , | ^ |
bias | first_at:False | is_lower:True | word:fri |
1.0 | 1.0 | 1.0 | 1.0 |
-2.45 | 0.85 | -2.46 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
fri | and | sat | ... | Thinking | pyramid | on | sat | ... |
^ | & | ^ | , | V | ^ | P | ^ | , |
N | & | V | , | ^ | N | P | V | , |
bias | first_at:False | is_lower:True | word:pyramid |
1.0 | 1.0 | 1.0 | 1.0 |
-2.45 | 0.85 | -2.46 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
Find more problems:
seq.plot_confusion_matrix(dev, local_3.predict(dev))
High-frequency error: verbs ('V') mislabelled as nouns ('N'). Inspect examples...
util.Carousel(local_3.errors(dev[:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='V'))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | N | N |
bias | first_at:False | is_lower:True | word:own |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | -0.78 |
-2.28 | 1.80 | -0.61 | 2.30 |
RT | @TheRealQuailman | : | Currently | laughing | at | Laker | haters | . |
~ | @ | ~ | R | V | P | ^ | N | , |
~ | @ | ~ | ^ | N | P | ^ | N | , |
bias | first_at:False | is_lower:True | word:laughing |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
@ShiversTheNinja | forgive | me | for | blowing | up |
@ | V | O | P | V | T |
@ | N | O | P | N | T |
bias | first_at:False | is_lower:True | word:forgive |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
@ShiversTheNinja | forgive | me | for | blowing | up | your | youtube | comment |
@ | V | O | P | V | T | D | ^ | N |
@ | N | O | P | N | T | D | N | N |
bias | first_at:False | is_lower:True | word:blowing |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
Question | : | How | CAN | you | mend | a | broken | heart | ? |
N | , | R | V | O | V | D | A | N | , |
^ | ~ | R | V | O | N | D | V | V | , |
bias | first_at:False | is_lower:True | word:mend |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
last | night | , | but | didn't | bother | calling | Shawn | because | I'd |
A | N | , | & | V | V | V | ^ | P | L |
A | N | , | & | V | N | V | ^ | P | L |
bias | first_at:False | is_lower:True | word:bother |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
are | in | ! | See | who | passed | and | who | made | the |
V | P | , | V | O | V | & | O | V | D |
V | P | , | V | O | N | & | O | V | D |
bias | first_at:False | is_lower:True | word:passed |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | T | P | D | A |
bias | first_at:False | is_lower:True | word:tune |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | -0.78 |
-2.28 | 1.80 | -0.61 | 2.30 |
that | , | regretfully | I | was | tied | up | , | Physed | at |
P | , | R | O | V | V | T | , | N | P |
P | , | N | O | V | N | T | , | ^ | P |
bias | first_at:False | is_lower:True | word:tied |
1.0 | 1.0 | 1.0 | 1.0 |
-2.49 | 1.37 | -0.57 | 0.00 |
-2.28 | 1.80 | -0.61 | 0.00 |
This suggests that the word has not appeared (or has not appeared as a verb) in the training set!

However, we can often tell from a word's ending (e.g. "-ing", "-ed") that it may be a verb. Let us incorporate suffixes as features!
def feat_4(x,i):
    return {
        **feat_3(x,i),
        # suffix features: endings such as "-ing" or "-ed" indicate verbs
        'last_3:' + x[i][-3:]: 1.0,
        'last_2:' + x[i][-2:]: 1.0,
    }
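Again we can inspect the features for a single made-up token:

feat_4(["laughing"], 0)
# {'bias': 1.0, 'word:laughing': 1.0, 'first_at:False': 1.0,
#  'is_lower:True': 1.0, 'last_3:ing': 1.0, 'last_2:ng': 1.0}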
local_4 = seq.LocalSequenceLabeler(feat_4, train)
seq.accuracy(dev, local_4.predict(dev))
0.7876840140991085
util.Carousel(local_4.errors(dev[:20],
filter_guess=lambda y: y=='N',
filter_gold=lambda y: y=='V' ))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | V | N |
bias | first_at:False | is_lower:True | last_2:wn | last_3:own | word:own |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-3.17 | 1.01 | -1.03 | -0.15 | 0.07 | -0.28 |
-2.96 | 1.82 | -1.02 | 1.01 | -0.45 | 2.63 |
Question | : | How | CAN | you | mend | a | broken | heart | ? |
N | , | R | V | O | V | D | A | N | , |
N | ~ | R | V | O | N | D | V | V | , |
bias | first_at:False | is_lower:True | last_2:nd | last_3:end | word:mend |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-3.17 | 1.01 | -1.03 | 0.39 | 2.05 | 0.00 |
-2.96 | 1.82 | -1.02 | 1.02 | 1.57 | 0.00 |
last | night | , | but | didn't | bother | calling | Shawn | because | I'd |
A | N | , | & | V | V | V | ^ | P | L |
A | N | , | & | V | N | V | N | P | L |
bias | first_at:False | is_lower:True | last_2:er | last_3:her | word:bother |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-3.17 | 1.01 | -1.03 | -0.47 | -1.53 | 0.00 |
-2.96 | 1.82 | -1.02 | 2.08 | -1.21 | 0.00 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | T | P | D | A |
bias | first_at:False | is_lower:True | last_2:ne | last_3:une | word:tune |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
-3.17 | 1.01 | -1.03 | 0.17 | -0.27 | -0.27 |
-2.96 | 1.82 | -1.02 | 0.95 | 1.48 | 1.48 |
We have dependencies between consecutive labels: after a non-possessive pronoun ("O") such as "I", a verb ("V") is more likely than a noun ("N"), as the quick check below shows. A fully local model cannot directly capture this.
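We can verify this dependency on the training data by counting label bigrams (a quick check over the (x, y) pairs in train):

from collections import Counter

# count how often each label follows each other label
bigrams = Counter((y[i - 1], y[i]) for _, y in train for i in range(1, len(y)))
# the labels that most often follow a non-possessive pronoun ("O")
sorted(((nxt, c) for (prev, nxt), c in bigrams.items() if prev == 'O'),
       key=lambda p: -p[1])[:3]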
A Maximum Entropy Markov Model (MEMM) addresses this: it is a product of local logistic regression (aka Maximum Entropy) classifiers $\prob_\params(y_i|\x,y_{i-1},i)$, each with access to the previous label.
Log-linear version with access to previous label:
$$ p_\params(y_i|\x,y_{i-1},i) = \frac{1}{Z_{\x,y_{i-1},i}} \exp \langle \repr(\x,y_{i-1},i),\params_{y_i} \rangle $$

where $Z_{\x,y_{i-1},i}=\sum_y \exp \langle \repr(\x,y_{i-1},i),\params_{y} \rangle$ is a local per-token normalisation factor.
Optimising the conditional likelihood

$$ \sum_{(\x,\y) \in \train} \log \prob_\params(\y|\x) $$

decomposes nicely:

$$ \sum_{(\x,\y) \in \train} \sum_{i=1}^{|\x|} \log \prob_\params(y_i|\x,y_{i-1},i) $$

This makes the model easy to train.
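Concretely, because the likelihood decomposes per token, training reduces to fitting a single classifier on one example per token, with the gold previous label available as an input. A sketch of how such training pairs could be assembled (a hypothetical helper, not the seq internals; feat is a feature function with the signature of memm_feat_1 below):

def make_memm_training_data(data, feat, start_label="PAD"):
    # one classification example per token: (features, gold label)
    examples = []
    for x, y in data:
        for i in range(len(x)):
            prev = y[i - 1] if i > 0 else start_label  # gold previous label
            examples.append((feat(x, i, [prev]), y[i]))
    return examples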
Let's specify a MEMM using our best local features together with the previous label:
def memm_feat_1(x,i,hist):
    return {
        **feat_4(x,i),
        'prev_y': hist[0],  # the previous label
    }
memm_1 = seq.MEMMSequenceLabeler(memm_feat_1, train, order=1, C=10)
To predict the best label sequence we find a $\y^*$ with maximal conditional probability:

$$ \y^* =\argmax_\y \prob_\params(\y|\x). $$

We cannot simply choose each label in isolation, because the decisions depend on each other.
A simple alternative is greedy decoding: predict one label at a time, from left to right:
memm_1.predict_next(["the","man"],0,[])
'D'
memm_1.predict_next(["the","man"],1,['D'])
'N'
def memm_greedy_predict(memm: seq.MEMMSequenceLabeler, data, use_gold_history=False):
    """Greedy left-to-right decoding; optionally condition on the gold label history."""
    result = []
    for x, y in data:
        y_guess = []
        for i in range(0, len(x)):
            # condition on our own previous predictions, or on the gold labels
            prediction = memm.predict_next(x, i, y_guess if not use_gold_history else y)
            y_guess.append(prediction)
        result.append(y_guess)
    return result
seq.accuracy(dev,memm_greedy_predict(memm_1, dev))
0.8100767157370931
Some Noun vs Verb errors fixed:
util.Carousel(seq.errors(dev[:20], memm_greedy_predict(memm_1, dev[:20]),
'V', 'N',model=memm_1))
the | players | and | his | wife | own | smash | burger |
D | N | & | D | N | V | ^ | ^ |
D | N | & | D | N | N | V | N |
bias | first_at:False | is_lower:True | last_2:wn | last_3:own | prev_y | word:own |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | N | 1.0 |
-3.36 | 0.90 | -1.11 | -0.16 | -0.02 | 0.00 | -0.01 |
-3.01 | 1.52 | -1.15 | 1.11 | -0.52 | 0.00 | 1.41 |
and | watch | the | news | and | tune | out | over | some | fresh |
& | V | D | N | & | V | T | P | D | A |
& | V | D | N | & | N | P | P | D | A |
bias | first_at:False | is_lower:True | last_2:ne | last_3:une | prev_y | word:tune |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | & | 1.0 |
-3.36 | 0.90 | -1.11 | -0.04 | -0.01 | 0.00 | -0.01 |
-3.01 | 1.52 | -1.15 | 0.59 | 0.78 | 0.00 | 0.78 |
For the case of verbs ('V') we observe a high weight for $f_{\text{prev\_y},\text{O}}$:
memm_1.plot_lr_weights('V',feat_filter=lambda s: s.startswith("prev_"))
Greedy decoding may lead to search errors: the returned $\y^*$ is not the highest-scoring global solution.
memm_1.predict_label_scores(["What","better","way"],0,[])[:3]
[('O', -0.21551259764203171), ('D', -2.0962312777396073), ('#', -3.7029947994500017)]
memm_1.predict_label_scores(["What","better","way"],1,['O'])[:3]
[('R', -0.5630160447367506), ('A', -1.0671842955591822), ('V', -2.6373919064817759)]
memm_1.predict_label_scores(["What","better","way"],2,['O','A'])[:3]
[('N', -0.036587380196569194), ('R', -3.8313002827728648), ('A', -5.9863085075420566)]
x = ["What","better","way"]
init_beam = []
for y, score in memm_1.predict_label_scores(x,1,['O']):
init_beam.append((('O',y),score))
beam_size = 3
beam = init_beam[:beam_size]
beam
[(('O', 'R'), -0.5630160447367506), (('O', 'A'), -1.0671842955591822), (('O', 'V'), -2.6373919064817759)]
new_beam = []
for prev_y, prev_s in beam:
for y,s in memm_1.predict_label_scores(x,2,prev_y):
new_beam.append((prev_y + (y,), prev_s + s))
sorted(new_beam, key=lambda p: -p[1])[:3]
[(('O', 'A', 'N'), -1.1037716757557514), (('O', 'R', 'N'), -1.1870333223939724), (('O', 'R', 'R'), -1.9702637950583792)]
def memm_beam_search(memm, x, width=2):
beam = [([],0.)]
history = [beam]
for i in range(0, len(x)):
        # collect all candidate extensions of the beam (a priority queue would be more efficient)
candidates = []
for (prev,score) in beam:
scores = memm.predict_scores(x, i, prev)
for label_index,label_score in enumerate(scores):
candidates.append((prev + [memm.labels()[label_index]], score + label_score))
beam = sorted(candidates, key=lambda x: -x[1])[:width]
history.append(beam)
return beam, history
def batch_predict(data, beam_predictor):
return [beam_predictor(x)[0][0][0] for x,y in data]
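As a quick smoke test on the first development instance (the best beam entry is a (label sequence, cumulative log-probability) pair):

beam, _ = memm_beam_search(memm_1, dev[0][0], width=3)
beam[0]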
Full example, using a beam of width 1 (equivalent to greedy decoding):
example = 56
beam, history = memm_beam_search(memm_1, dev[example][0],1)
seq.render_beam_history(history, dev[example], end=17)
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
0.00 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | -0.10 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | -0.73 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | -1.40 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | -1.41 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | -1.76 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | -1.77 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | -1.93 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | -2.49 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | -3.11 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | -3.12 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | -3.73 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | P | -3.78 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | P | V | -3.88 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | P | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | P | Z | -4.83 |
Does this help?
seq.accuracy(dev, batch_predict(dev, lambda x: memm_beam_search(memm_1, x, 10)))
0.8127721335268505
Beam search is wasteful for first-order models; instead we can use the Viterbi algorithm.
Consider a beam of size 2:
example = 56
beam, history = memm_beam_search(memm_1, dev[example][0],2)
seq.render_beam_history(history, dev[example], end=17)
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
0.00 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | -0.10 |
^ | -3.44 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | -0.73 |
A | A | -1.01 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | -1.18 |
A | N | N | -1.40 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | -1.19 |
A | N | N | P | -1.41 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | -1.54 |
A | N | N | P | N | -1.76 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | -1.54 |
A | N | N | P | N | , | -1.77 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | -1.70 |
A | N | N | P | N | , | O | -1.93 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | -2.26 |
A | N | N | P | N | , | O | R | -2.49 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | -2.89 |
A | N | N | P | N | , | O | R | N | -3.11 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | -2.89 |
A | N | N | P | N | , | O | R | N | P | -3.12 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | -3.51 |
A | N | N | P | N | , | O | R | N | P | N | -3.73 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | -3.56 |
A | N | N | P | N | , | O | R | N | P | N | P | -3.78 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | -3.65 |
A | N | N | P | N | , | O | R | N | P | N | P | V | -3.88 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | -3.84 |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | -3.85 |
A | N | N | P | N | , | O | R | N | P | N | P | V | P | P | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | Z | -4.61 |
A | A | N | P | N | , | O | R | N | P | N | P | V | P | P | L | -4.80 |
Histories differ in early positions, but does all the past matter?
memm_1.predict_label_scores(["What","better","way"],2,['O','R'])[:3]
[('N', -0.62401727765722192), ('R', -1.4072477503216287), ('V', -1.8299397831261677)]
The past only matters up to the previous label.
Viterbi Algorithm = Remember only the best history per last label
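Before implementing it, we can check the first-order property concretely (assuming predict_label_scores accepts any history prefix, as in the cells above): the local scores at position $i$ agree whenever the previous label agrees, regardless of earlier history.

x = ["What", "better", "way"]
# two histories that differ everywhere except in the previous label 'R'
scores_a = memm_1.predict_label_scores(x, 2, ['O', 'R'])
scores_b = memm_1.predict_label_scores(x, 2, ['D', 'R'])
scores_a[:3] == scores_b[:3]  # expected: True for a first-order MEMM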
from collections import defaultdict
import math
def memm_viterbi_search(memm, x, width=2):
labels = memm.labels()
# initialise
alpha = [{}]
beta = [{}]
for label_index, label_score in enumerate(memm.predict_scores_hist(x, 0, ["PAD"])):
label = labels[label_index]
alpha[0][label] = label_score
beta[0][label] = "PAD"
# prune
seq.prune_alpha_beta(alpha[0], beta[0], width)
# recursion
for i in range(1, len(x)):
alpha.append(defaultdict(lambda: -math.inf))
beta.append({})
for p in alpha[i-1].keys():
for label_index, label_score in enumerate(memm.predict_scores_hist(x, i, [p])):
label = labels[label_index]
new_score = alpha[i-1][p] + label_score
if new_score > alpha[i][label]:
alpha[i][label] = new_score
beta[i][label] = p
# prune
seq.prune_alpha_beta(alpha[i], beta[i], width)
# convert to beam history to be used in the same way beam search was used.
history = seq.convert_alpha_beta_to_history(x, alpha, beta)
return history[-1], history
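Note that with the width set to the full label set, the pruning step keeps every label, so the search amounts to exact first-order Viterbi decoding. A quick way to try this (using the helpers defined above):

# exact decoding: keep the best history for every possible previous label
full_width = len(memm_1.labels())
beam, history = memm_viterbi_search(memm_1, dev[example][0], width=full_width)
beam[0]  # best (label sequence, log-probability) pair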
In action:
beam, history = memm_viterbi_search(memm_1, dev[example][0],2)
seq.render_beam_history(history, dev[example], 17)
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
0.00 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | -0.10 |
^ | -3.44 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | N | -0.73 |
A | A | -1.01 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | -1.18 |
A | N | V | -1.98 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | -1.19 |
A | A | N | N | -7.50 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | -1.54 |
A | A | N | P | ^ | -2.57 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | -1.54 |
A | A | N | P | ^ | ^ | -5.17 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | -1.70 |
A | A | N | P | N | , | D | -4.21 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | R | -2.26 |
A | A | N | P | N | , | O | A | -2.77 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | -2.81 |
A | A | N | P | N | , | O | R | R | -3.67 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | -2.81 |
A | A | N | P | N | , | O | A | N | N | -9.36 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | -3.42 |
A | A | N | P | N | , | O | A | N | P | V | -4.07 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | -3.48 |
A | A | N | P | N | , | O | A | N | P | N | V | -6.93 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | -3.57 |
A | A | N | P | N | , | O | A | N | P | N | P | N | -6.18 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | -3.76 |
A | A | N | P | N | , | O | A | N | P | N | P | V | T | -5.41 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | -3.77 |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | ^ | -10.03 |
Happy | International | Year | of | Biodiversity | ! | What | better | way | to | celebrate | than | tuning | in | to | CropLife's | Biodiversity |
A | A | N | P | N | , | O | A | N | P | V | P | V | T | P | Z | ^ |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | Z | -4.52 |
A | A | N | P | N | , | O | A | N | P | N | P | V | P | P | L | -4.71 |
Now, does this help?
seq.accuracy(dev, batch_predict(dev, lambda x: memm_viterbi_search(memm_1, x, 2)))
0.8138088326767572
Check Models on Test Set:
pd.DataFrame([
["word", seq.accuracy(test, local_1.predict(test))],
["+ first @", seq.accuracy(test, local_2.predict(test))],
["+ cap", seq.accuracy(test, local_3.predict(test))],
["+ suffix", seq.accuracy(test, local_4.predict(test))],
["MEMM", seq.accuracy(test, memm_1.predict(test))],
])
 | 0 | 1
---|---|---
0 | word | 0.703440 |
1 | + first @ | 0.757271 |
2 | + cap | 0.776007 |
3 | + suffix | 0.796980 |
4 | MEMM | 0.811102 |