%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')
# import tikzmagic
%load_ext tikzmagic
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
from IPython.display import Image
import random
%%html
<script>
  function code_toggle() {
    if (code_shown) {
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code');
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code');
    }
    code_shown = !code_shown;
  }
  $(document).ready(function() {
    code_shown = false;
    $('div.input').hide();
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
Not quite yet!
Language models calculate the probability of seeing a sequence of words.
What is the most likely next word?
We're going to ...
How about now?
We're going to win ...
How likely is this sequence?
We're going to win bigly.
Is it more likely than this one?
We're going to win big league.
We will win by a mile
or
We will win bigly
Other applications of language models?
Image(url='../img/lm-params-2022.png'+'?'+str(random.random()), width=1000)
Image(url='../img/epoch.png'+'?'+str(random.random()))
... but first, the basics
Models the probability
$$\prob(w_1,\ldots,w_d)$$ of observing sequences of words $w_1,\ldots,w_d$.
Without loss of generality:
\begin{align} \prob(w_1,\ldots,w_d) &= \prob(w_1) \prob(w_2|w_1) \prob(w_3|w_1, w_2) \ldots \\ &= \prob(w_1) \prod_{i = 2}^d \prob(w_i|w_1,\ldots,w_{i-1}) \end{align}
It is impossible to estimate a sensible probability for each history
$$ \x=w_1,\ldots,w_{i-1} $$
Solution: truncate the history to the last $n-1$ words:
$$ \mathbf{f}(\x)=w_{i-(n-1)},\ldots,w_{i-1} $$
$\prob(\text{bigly}|\text{...,blah, blah, blah, we, will, win}) = \prob(\text{bigly}|\text{we, will, win})$
Set $n=1$: $$ \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i). $$
$\prob(\text{bigly}|\text{we, will, win}) = \prob(\text{bigly})$
Set $n=2$: $$ \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i|w_{i-1}). $$
$\prob(\text{bigly}|\text{we, will, win}) = \prob(\text{bigly}|\text{win})$
Set $n=0$:
Same probability for each word in the vocabulary $\vocab$:
$$ \prob(w_i|w_1,\ldots,w_{i-1}) = \frac{1}{|\vocab|}. $$
$\prob(\text{big}) = \prob(\text{bigly}) = \frac{1}{|\vocab|}$
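The `UniformLM` class used below is defined in the accompanying notebook; as a rough sketch (the `vocab`/`order`/`probability` interface is an assumption based on how it is used here), a uniform LM might look like this:
class UniformLMSketch:
    # Assigns probability 1/|vocab| to every word in the vocabulary (illustrative sketch)
    def __init__(self, vocab):
        self.vocab = vocab
        self.order = 1  # uses no history at all
    def probability(self, word, *history):
        # words outside the vocabulary receive probability 0
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0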
Let us look at a training set and create a uniform LM from it.
train[:9]
['Can', "'t", 'even', 'call', 'this', 'a', 'blues', 'song', 'It']
vocab = set(train)
baseline = UniformLM(vocab)
sum([baseline.probability(w) for w in vocab])
0.9999999999999635
What about words outside the vocabulary? What is their probability?
Sample incrementally, one word at a time
def sample_once(lm, history, words):
    # probability of each candidate word given the history, then draw one word accordingly
    probs = [lm.probability(word, *history) for word in words]
    return np.random.choice(words, p=probs)
sample_once(baseline, [], list(baseline.vocab))
'sunrise'
def sample(lm, initial_history, amount_to_sample):
    # sample `amount_to_sample` words from the LM, one word at a time
    words = list(lm.vocab)
    result = []
    result += initial_history
    for _ in range(0, amount_to_sample):
        # condition only on the last (order - 1) words
        history = result[-(lm.order - 1):]
        result.append(sample_once(lm, history, words))
    return result
sample(baseline, [], 10)
['dummies', 'find', 'being', 'bars', 'clap', 'rapping', 'droppin', 'fender', 'hated', 'Recognize']
Shannon Game: predict the next word; you win if your prediction matches the word in the actual corpus.
Our horrible trade agreements with [???]
The expected reward is the probability of the corpus.
Formalised by
\begin{align} \prob(w_1) \prob(w_2|w_1) \ldots \prob(w_T|w_1,\ldots,w_{T-1}) &= \prod_{i=1}^T \prob(w_i|w_1,\ldots,w_{i-1}) \end{align}
But then the longer the sequence, the lower the probability...
$\to$ normalise by the length
Given a test sequence $w_1,\ldots,w_T$, the perplexity $\perplexity$ is the geometric mean of the inverse probabilities or, put differently, the inverse probability of the test set, normalised by the number of words:
\begin{align} \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\frac{1}{\prob(w_1)} \frac{1}{\prob(w_2|w_1)} \ldots} \\ &= \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i|w_1,\ldots,w_{i-1})}} \end{align}
Perplexity for a bigram language model:
\begin{align} \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i|w_{i-1})}} \end{align}
Perplexity for a unigram language model:
\begin{align} \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\prod_{i=1}^T \frac{1}{\prob(w_i)}} \end{align}
Perplexity for a uniform language model:
\begin{align} \perplexity(w_1,\ldots,w_T) &= \sqrt[T]{\prod_{i=1}^T \frac{1}{1/|V|}} = |V| \end{align}
Consider LM where
Then
Perplexity of uniform LM on an unseen test set?
perplexity(baseline, test)
inf
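The `perplexity` helper called above is defined in the accompanying notebook. A minimal sketch of such a function, accumulating log probabilities to avoid numerical underflow (the `order`/`probability` interface is assumed from the usage above); it also shows why a single zero-probability word already drives perplexity to infinity:
import math
def perplexity_sketch(lm, data):
    # perplexity of `lm` on the token sequence `data` (illustrative sketch)
    history_size = lm.order - 1
    log_prob = 0.0
    num_predictions = 0
    for i in range(history_size, len(data)):
        p = lm.probability(data[i], *data[i - history_size:i])
        if p == 0.0:
            return float('inf')  # one zero-probability word makes perplexity infinite
        log_prob += math.log(p)
        num_predictions += 1
    return math.exp(-log_prob / num_predictions)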
Problem: model assigns zero probability to words not in the vocabulary.
[(w,baseline.probability(w)) for w in test if w not in vocab][:5]
[('does', 0.0), ('Ceremonies', 0.0), ('Masquerading', 0.0), ('also', 0.0), ('Creativity', 0.0)]
New words not specific to our corpus:
Let us plot word frequency ranks (x-axis) against frequency (y-axis)
plt.xscale('log')
plt.yscale('log')
plt.plot(ranks, sorted_counts)
On a log-log scale, such rank vs. frequency graphs are approximately linear.
Let $r_w$ be the rank of a word $w$, and $f_w$ its frequency:
$$ f_w \propto \frac{1}{r_w}. $$
There will virtually always be words with zero counts in your test set.
Why is this a problem?
Solutions:
OOV Injection Procedures
print(test[60:100])
# Replace every word not within the vocabulary with the `OOV` symbol
# [word if word in vocab else OOV for word in data]
print(replace_OOVs(baseline.vocab, test[60:100]))
['with', 'the', 'lyrics', 'of', 'the', 'year', 'Than', 'the', 'gimmick', 'with', 'the', 'gear', 'and', 'the', 'right', 'puppeteer', 'Now', 'you', 'can', 'be', 'the', 'next', 'rock', 'Shakespear', 'you', "'", 're', 'still', '10', 'steps', 'away', 'from', 'having', 'a', 'career', 'You', 'step', 'up', 'the', 'plate']
['with', 'the', 'lyrics', 'of', 'the', '[OOV]', '[OOV]', 'the', '[OOV]', 'with', 'the', '[OOV]', 'and', 'the', 'right', '[OOV]', 'Now', 'you', 'can', 'be', 'the', 'next', 'rock', '[OOV]', 'you', "'", 're', 'still', '10', 'steps', 'away', 'from', '[OOV]', 'a', 'career', 'You', '[OOV]', 'up', 'the', 'plate']
Consider the "words"
AA AA BB BB AA
Going left to right, how often do I see new words?
Inject `OOV` tokens to mark these "new word events"
inject_OOVs(["AA","AA","BB","BB","AA"])
['[OOV]', 'AA', '[OOV]', 'BB', 'AA']
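The `inject_OOVs` and `replace_OOVs` helpers are defined in the accompanying notebook; a minimal sketch consistent with the behaviour shown above (the `[OOV]` string is taken from the outputs):
OOV = '[OOV]'
def inject_OOVs_sketch(data):
    # replace the first occurrence of every word with the OOV token ("new word events")
    seen = set()
    result = []
    for word in data:
        if word in seen:
            result.append(word)
        else:
            result.append(OOV)
            seen.add(word)
    return result
def replace_OOVs_sketch(vocab, data):
    # replace every word outside `vocab` with the OOV token
    return [word if word in vocab else OOV for word in data]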
OOV Probability
What is the probability of seeing a word you haven't seen before?
Train on replaced data...
oov_train = inject_OOVs(train)
oov_vocab = set(oov_train)
oov_test = replace_OOVs(oov_vocab, test)
oov_baseline = UniformLM(oov_vocab)
perplexity(oov_baseline,oov_test)
1287.9999999984573
OOV and Perplexity
N-gram language models condition on a limited history:
$$ \prob(w_i|w_1,\ldots,w_{i-1}) = \prob(w_i|w_{i-(n-1)},\ldots,w_{i-1}). $$
What are its parameters (continuous values that control its behaviour)?
One parameter $\param_{w,h}$ for each pair of word $w$ and history $h=w_{i-(n-1)},\ldots,w_{i-1}$:
$$ \prob_\params(w|h) = \param_{w,h} $$
$\prob_\params(\text{bigly}|\text{win}) = \param_{\text{bigly, win}}$
Assume training set $\train=(w_1,\ldots,w_d)$
Find $\params$ that maximises the log-likelihood of $\train$:
$$ \params^* = \argmax_\params \log \prob_\params(\train) $$
where
$$ \prob_\params(\train) = \ldots \prob_\params(w_i|\ldots w_{i-1}) \prob_\params(w_{i+1}|\ldots w_{i}) \ldots $$
Structured Prediction: this is your continuous optimisation problem!
The maximum-likelihood estimate (MLE) can be calculated in closed form: $$ \prob_{\params^*}(w|h) = \param^*_{w,h} = \frac{\counts{\train}{h,w}}{\counts{\train}{h}} $$
where
$$ \counts{D}{e} = \text{Count of } e \text{ in } D $$
The event $h$ means seeing the history $h$, and the event $h,w$ means seeing the history $h$ followed by the word $w$.
Many LM variants exist; they mainly differ in how these counts are estimated.
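The `NGramLM` class trained below is defined in the accompanying notebook; a minimal sketch of a count-based MLE n-gram model along the lines of the closed-form estimate above (class name and interface are illustrative):
from collections import defaultdict
class CountLMSketch:
    # MLE n-gram LM: p(w|h) = count(h, w) / count(h)  (illustrative sketch)
    def __init__(self, train, order):
        self.order = order
        self.vocab = set(train)
        self.counts = defaultdict(float)  # counts of (history, word) events
        self.norm = defaultdict(float)    # counts of history events
        for i in range(order - 1, len(train)):
            history = tuple(train[i - order + 1:i])
            self.counts[(history, train[i])] += 1.0
            self.norm[history] += 1.0
    def probability(self, word, *history):
        history = tuple(history[-(self.order - 1):]) if self.order > 1 else ()
        if self.norm[history] == 0.0:
            return 0.0  # unseen history
        return self.counts[(history, word)] / self.norm[history]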
Let us train a unigram model...
What do you think the most probable words are?
Remember our training set looks like this ...
oov_train[1000:1010]
['bird', 'I', 'know', 'this', '[OOV]', '[OOV]', '[OOV]', 'is', 'this', '[OOV]']
unigram = NGramLM(oov_train,1)
plot_probabilities(unigram)
# sum([unigram.probability(w) for w in unigram.vocab])
The unigram LM has a substantially lower (and hence better) perplexity than the uniform LM:
perplexity(oov_baseline,oov_test), perplexity(unigram,oov_test)
(1287.9999999984573, 128.9093846843014)
Its samples look (a little) more reasonable:
print(sample(oov_baseline, [], 10), "\n")
print(sample(unigram, [], 10))
['hands', 'play', 'below', 'never', 'around', 'type', 'grows', 'about', 'debate', 'himself']
['the', '[OOV]', 'to', 'in', 'Singing', '[OOV]', '[OOV]', "'m", 'live', 'to']
We can do better by setting $n=2$
bigram = NGramLM(oov_train,2)
plot_probabilities(bigram, ("I",)) # bigrams starting with "I"
Samples should look (slightly) more fluent:
" ".join(sample(bigram, ['I'], 30)) # try: I, FIND, [OOV]
"I set em Yo [OOV] enemies [OOV] [OOV] is yours what the [OOV] it So it There 's mind [OOV] Recognize your [OOV] up [OOV] [OOV] wanna get up [OOV] and"
How about perplexity?
perplexity(bigram,oov_test)
inf
There are contexts in which the OOV word (and other words) have never been seen, hence they receive 0 probability...
bigram.probability("[OOV]","money")
0.0
There will virtually always be n-grams with zero counts in your test set.
Solutions:
Maximum likelihood estimation assigns zero probability to unseen events.
Solution: smooth the probabilities and move mass from seen to unseen events.
Add pseudo counts to each event in the dataset:
$$ \param^{\alpha}_{w,h} = \frac{\counts{\train}{h,w} + \alpha}{\counts{\train}{h} + \alpha \lvert V \rvert } $$
laplace_bigram = LaplaceLM(bigram, 0.1)
laplace_bigram.probability("[OOV]","money")
0.0007704160246533128
Perplexity should look better now:
perplexity(LaplaceLM(bigram, 0.001),oov_test)
255.11837473847797
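The `LaplaceLM` wrapper used above comes from the accompanying notebook; a minimal sketch of add-$\alpha$ smoothing on top of the count-based sketch from earlier (attribute names and interface are assumptions):
class LaplaceLMSketch:
    # Add-alpha smoothing: p(w|h) = (count(h, w) + alpha) / (count(h) + alpha * |V|)
    def __init__(self, base_lm, alpha):
        self.base_lm = base_lm
        self.alpha = alpha
        self.order = base_lm.order
        self.vocab = base_lm.vocab
    def probability(self, word, *history):
        history = tuple(history[-(self.order - 1):]) if self.order > 1 else ()
        count_hw = self.base_lm.counts[(history, word)]
        count_h = self.base_lm.norm[history]
        return (count_hw + self.alpha) / (count_h + self.alpha * len(self.vocab))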
Consider three events:
c = ["word", "train count", "MLE", "Laplace", "Same Denominator"]
r1 = ["smally", "0", "0/3", "1/6", "0.5/3"]
r2 = ["bigly", "1", "1/3", "2/6", "1/3"]
r3 = ["tremendously", "2", "2/3", "3/6", "1.5/3"]
util.Table([r1,r2,r3], column_names=c)
| word | train count | MLE | Laplace | Same Denominator |
|---|---|---|---|---|
| smally | 0 | 0/3 | 1/6 | 0.5/3 |
| bigly | 1 | 1/3 | 2/6 | 1/3 |
| tremendously | 2 | 2/3 | 3/6 | 1.5/3 |
laplace_bigram.probability('rhyme','man'), \
laplace_bigram.probability('of','man')
# also try: 'skies','skies' vs. '[/BAR]','skies'
(0.0005656108597285067, 0.0005656108597285067)
Problem: not all unseen words (in a context) are equal
With interpolation we can do better:
interpolated = InterpolatedLM(bigram,unigram,0.01)
interpolated.probability('rhyme','man'), \
interpolated.probability('of','man')
(0.0014514278429372768, 0.009276517083120857)
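The `InterpolatedLM` used above is defined in the accompanying notebook; a minimal sketch of linear interpolation, $\alpha \cdot \prob_{\text{main}} + (1-\alpha) \cdot \prob_{\text{backoff}}$ (the exact role of the $\alpha$ argument is an assumption):
class InterpolatedLMSketch:
    # Linear interpolation of a main (higher-order) and a backoff (lower-order) LM (illustrative sketch)
    def __init__(self, main, backoff, alpha):
        self.main = main
        self.backoff = backoff
        self.alpha = alpha
        self.order = main.order
        self.vocab = main.vocab
    def probability(self, word, *history):
        return (self.alpha * self.main.probability(word, *history) +
                (1.0 - self.alpha) * self.backoff.probability(word, *history))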
Can we find a good $\alpha$ parameter? Tune on some development set!
alphas = np.arange(0,1.1,0.1)
perplexities = [perplexity(InterpolatedLM(bigram,unigram,alpha),oov_test)
for alpha in alphas]
plt.plot(alphas,perplexities)
Let $w$ be a word and $h_{m}$ an n-gram of length $m$:
$$ \prob_{\mbox{Stupid}}(w|h_{m}) = \begin{cases} \frac{\counts{\train}{h_{m},w}}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\ \prob_{\mbox{Stupid}}(w|h_{m-1}) & \mbox{otherwise} \end{cases} $$
What is the problem with this model?
stupid = StupidBackoff(bigram, unigram, 0.1)
sum([stupid.probability(word, 'the') for word in stupid.vocab])
1.0684727180010114
The score is not a probability distribution (probabilities do not sum to 1). Sampling thus requires further normalisation.
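A minimal sketch of the stupid backoff score following the definition above; the third argument to `StupidBackoff` is assumed (not confirmed by the source) to act as the backoff multiplier:
class StupidBackoffSketch:
    # Use the higher-order estimate if available, otherwise a scaled lower-order one.
    # The result is a score, not a normalised probability (illustrative sketch).
    def __init__(self, main, backoff, alpha):
        self.main = main
        self.backoff = backoff
        self.alpha = alpha  # assumed backoff multiplier
        self.order = main.order
        self.vocab = main.vocab
    def probability(self, word, *history):
        p = self.main.probability(word, *history)
        return p if p > 0.0 else self.alpha * self.backoff.probability(word, *history)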
Recall that, on held-out data, a roughly constant amount of probability mass is taken away from each event with a non-zero training count. Can this be captured in a smoothing algorithm?
Yes: subtract a (tunable) constant $d$ from each non-zero count:
$$ \prob_{\mbox{Absolute}}(w|h_{m}) = \begin{cases} \frac{\counts{\train}{h_{m},w}-d}{\counts{\train}{h_{m}}} & \mbox{if }\counts{\train}{h_{m},w} > 0 \\\\ \alpha(h_{m-1})\cdot\prob_{\mbox{Absolute}}(w|h_{m-1}) & \mbox{otherwise} \end{cases} $$
$\alpha(h_{m-1})$ is a normaliser.
Assume, for example:
Then the final-backoff unigram model might assign a higher probability to
I can't see without my reading Def
than
I can't see without my reading glasses
because $\prob(\text{Def}) > \prob(\text{glasses})$
But Def never follows anything but Mos, and we can determine this by looking at the training data!
Absolute Discounting, but as the final backoff probability, use the probability that a word appears after (any) word in the training set:
$$ \prob_{\mbox{KN}}(w) = \frac{\left|\{w_{-1}:\counts{\train}{w_{-1},w} > 0\} \right|} {\sum_{w'}\left|\{w_{-1}:\counts{\train}{w_{-1},w'} > 0\} \right|} $$
This is the continuation probability.
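A minimal sketch of how this continuation probability could be computed from bigram type counts (function name and approach are illustrative):
from collections import defaultdict
def continuation_probability_sketch(train, word):
    # P_KN(w): fraction of distinct bigram types whose second word is `word` (illustrative sketch)
    preceding = defaultdict(set)  # word -> set of distinct words observed immediately before it
    for prev, curr in zip(train, train[1:]):
        preceding[curr].add(prev)
    total_bigram_types = sum(len(words) for words in preceding.values())
    return len(preceding[word]) / total_bigram_types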
Rather than using a single discount $d$, use three different discounts $d_1$, $d_2$, $d_3$ for n-grams with count 1, count 2, and count 3 or more.