%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd ..
import sys
sys.path.append("..")
import numpy as np
import matplotlib.pyplot as plt
import statnlpbook.util as util
util.execute_notebook('structured_prediction.ipynb')
There is no emerging unified theory of NLP; most textbooks and courses explain NLP as
a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.
-- Hal Daumé III
but there is a recurring pattern ... the structured prediction recipe.
Good NLPers combine three skills in accordance with this recipe:
util.Table(train, column_names=["x","y"])
x | y |
---|---|
I ate an apple | Ich aß einen Apfel |
I ate a red apple | Ich aß einen roten Apfel |
util.Table(test)
x | y |
---|---|
Yesterday I ate a red apple | Gestern aß ich einen roten Apfel |
Yesterday I ate a red apply with a friend | Gestern aß ich einen roten Apfel mit einem Freund |
y_space
['Ich aß einen Apfel', 'Ich aß einen roten Apfel', 'Gestern aß ich einen roten Apfel', 'Gestern aß ich einen roten Apfel mit einem Freund']
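For reference, here is a minimal sketch of the data these cells rely on. The actual definitions live in structured_prediction.ipynb (executed above); this version is reconstructed from the outputs shown, so treat its exact form as an assumption.
# Toy parallel corpus, reconstructed from the tables above.
train = [
    ("I ate an apple", "Ich aß einen Apfel"),
    ("I ate a red apple", "Ich aß einen roten Apfel"),
]
test = [
    ("Yesterday I ate a red apple", "Gestern aß ich einen roten Apfel"),
    ("Yesterday I ate a red apply with a friend",
     "Gestern aß ich einen roten Apfel mit einem Freund"),
]
# Candidate output space: every German sentence seen in train or test.
y_space = [y for _, y in train] + [y for _, y in test]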
Note: $\param$ should capture the fact that the German sentences are a little longer than their English sources (at least here!).
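The model itself is also defined in the loaded notebook. A minimal sketch that reproduces the scores in the tables below, assuming both feature functions simply measure sentence length in characters:
def f(x):
    # Source feature: length of the English sentence in characters.
    return len(x)

def g(y):
    # Target feature: length of the German sentence in characters.
    return len(y)

def s(theta, x, y):
    # Score: the closer theta * f(x) is to g(y), the higher the score.
    return -abs(theta * f(x) - g(y))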
Let us inspect this model:
util.Table([(x, y, f(x), g(y), s(1.0, x, y)) for x, y in train],
column_names=["Source x","Target y","f(x)","g(y)","score"])
Source x | Target y | f(x) | g(y) | score |
---|---|---|---|---|
I ate an apple | Ich aß einen Apfel | 14 | 18 | -4.0 |
I ate a red apple | Ich aß einen roten Apfel | 17 | 24 | -7.0 |
Does this scoring function help to discriminate correct from incorrect translations?
util.Table([(train[1][0],y,"{:.2f}".format(s(1.3,train[1][0],y)))
for y in y_space],
column_names=["x","y","score"])
x | y | score |
---|---|---|
I ate a red apple | Ich aß einen Apfel | -4.10 |
I ate a red apple | Ich aß einen roten Apfel | -1.90 |
I ate a red apple | Gestern aß ich einen roten Apfel | -9.90 |
I ate a red apple | Gestern aß ich einen roten Apfel mit einem Freund | -26.90 |
How do we estimate $\param$? Let us define a loss function that counts the training errors:
$$ l(\param) = \sum_{(\x,\y) \in \mathit{train}} \mathbb{1}\left[\y \neq \y^*_\param(\x)\right] $$
where $\y^*_\param(\x)$ is the highest-scoring translation of $\x$ under $\param$.
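The loss used in the next cells also comes from the loaded notebook. A sketch consistent with the plot and the $\param^*$ found below, assuming a structured 0-1 loss over the candidate space y_space:
def loss(theta, data):
    # Count the training pairs whose gold translation is not the
    # highest-scoring candidate in y_space (structured 0-1 loss).
    total = 0.0
    for x, y_gold in data:
        y_guess = max(y_space, key=lambda y: s(theta, x, y))
        if y_guess != y_gold:
            total += 1.0
    return total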
A finite approximation of the parameter search space $[0,2]$ ...
thetas = np.linspace(0.0, 2.0, num=50)
plt.plot(thetas, [loss(theta,train) for theta in thetas])
(Plot: the loss $l(\param)$ over the grid of $\param$ values in $[0, 2]$.)
Learning is then as simple as choosing the parameter with the lowest loss:
$$ \param^* = \argmin_{\param \in [0,2]} l(\param) $$
theta_star = thetas[np.argmin([loss(theta,train) for theta in thetas])]
theta_star
1.2653061224489794
Prediction is the same thing, just in $\Ys$:
$$ \y^*_\param = \argmax_{\y \in \Ys} s_\param(\x,\y). $$
Seen this before? Yes: training often involves prediction in an inner loop.
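predict is likewise defined upstream; a minimal sketch, implementing the $\argmax$ above by exhaustive search over the finite candidate space y_space:
def predict(theta, x):
    # Prediction as search: return the highest-scoring candidate translation.
    return max(y_space, key=lambda y: s(theta, x, y))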
util.Table([(x,predict(theta_star, x)) for x,_ in test])
x | predict(theta_star, x) |
---|---|
Yesterday I ate a red apple | Gestern aß ich einen roten Apfel |
Yesterday I ate a red apply with a friend | Gestern aß ich einen roten Apfel mit einem Freund |
Feature representations and scoring functions are more elaborate.
The parameter space is usually multi-dimensional (millions of dimensions).
The output space is often exponentially sized (e.g. the set of all German sentences).