Table of Contents

  1. Introduction - trying to understand it
  2. Corpora and Tokenization - finding food for it
  3. Coding the Model - having a go at it
  4. Learning from Reading - reading someone else's go at it
  5. Uses - Where is it?

Introduction

The n-gram model makes for a great introduction to natural-language processing because it is extremely simple, but still versatile and powerful enough to be used for real work. Once you intuitively understand it, the n-gram model is the simplest thing in the world. This introduction is my attempt to transfer an intuitive perception of the model to you, the reader.

If you understand things better by looking at Python code, you might want to skip to the next section: Corpora and Tokenization.

Let's start with a question. Why do we have expectations about the words we read? We definitely have such expectations -- that's how how you can read this sentence without noticing that there were two "how"s.

Our first thought is that our expectations are based on the structure of the language -- the rules of grammar, or syntax. But we can quickly think of many other things that give us expectations about the coming words: common idiom, the author's style, or even other knowledge we have that's external to the text can influence our expectations.

We want to be able to quantify and compute these expectations, but unfortunately in Python we can't access that useful context in your head that you use to generate expectations when you read. We can find some context, though: collections of the author's texts, and maybe larger collections of texts by many different authors, all lovingly digitized, give us a lot of relevant data. A human being with that kind of data could probably read for two years and emerge with some extremely good expectations. How can we do the same thing with a computer?

This is the point where I like to start thinking about everything as likelihoods. That's why I like the word "expectation" -- it's just a likelihood of an event, but we understand expectations. We know what an expectation feels like. It's a feeling like something is right for a spot -- it just ought to be, obviously. It fits. So carry that intuition over to the more technical word, likelihood. A likelihood is just an expectation of an event, quantified as a real number between $0$ and $1$. When we feel an expectation, we're unconsciously assigning a likelihood to an event -- "I think this future event is $X\%$ likely, given this stuff that's already known." (Authors note: Is this the actual thought process in our brain? Probably not. But it's something we can relate to, which is the important part).

But how do we compute this future event is $X\%$ likely given this known past stuff? In order to calculate a likelihood like that, we can see the "known past stuff" happen a bunch of times, and count how many times it's followed by the future event in question.

Of course, we are still left wondering how to make this known past stuff and future event explicit for our program. In a sense, this is the heart of the NLP problem. The n-gram model is just one way of thinking about future events and past stuff and their relationship, and it's based on the idea that language itself usually has some fairly obvious content delimeters. Typically, the contents of each of the delimited blobs are very tightly coupled syntactically, logically, and/or stylistically.

In English, the sentence is an obvious example. It's really hard to predict the first word of a sentence, compared to how easy it is to predict the further words, even if you know the words in the sentence beforehand. There's a less meaningful attachment between words in two separate sentences. So sentences are important: they are little, self-contained blobs of meaning. The other, purely practical, advantage of the sentence-sized blob is that we have a lot of them floating about. Paragraphs are blobby in a similar way, but they are much longer, which makes them harder to use. Paragraphs are too complicated to get good results.

Another great blob in the English language is the word. Although words are not necessarily the "smallest unit of meaning" (thanks to prefixes, suffixes, and compound words), they seem like as good a "unit of meaning" as anything we can find in natural language. There are some models, n-gram or otherwise, that try to consider sub-word-level meaning. Sometimes this can make the model better and sometimes it doesn't really help.

Since a word is a small, fairly predictable thing, and sentences provide such convenient content delimiters, the n-gram model is focused on estimating the likelihood of the next word in the sentence -- that's our future event -- given the sentence so far as our past data.

I tried to emphasize that this is a fairly arbitrary distinction. There is nothing essential about using words and sentences as the most meaningful units when analyzing natural language. It just seems to make the most sense to us right now.

Unfortunately, it turns out that even calculating the likelihood of the next word in the sentence, given the sentence so far is not very practical. Unusual sentences, like the long, run-on variety, might throw a wrench into your counting process (remember, that's how we're calculating the likelihoods for the next word -- by counting all previous times we've seen the sentence so far) by being utterly unique, which would make it impossible for you predict the next word -- although you would never encounter a run-on sentence like that in a notebook like this! So, it turns out we actually get better results if we only use part of the sentence.

This might seem counterintuitive, but it's because data scarcity makes prediction difficult. When we have a huge collection of text -- millions of words -- then although many sentences are still repeated only once or twice, 3-4 word segments almost always appear multiple times. And that's important, because we need many appearances for our model to make sane predictions. To see why, you need to remember that the model doesn't know about any word combinations it hasn't seen before.

So, for example, suppose you are trying to guess the next word in this sentence: "Three days later, details about the event were leaked to the..." what? You probably have an idea what it's being leaked to already, but bear with me. If our model had never seen that sentence before -- and it's not hard to imagine that even in a million words of English that sentence could be completely unique -- how would it guess the next word?

On the other hand, suppose we threw out the start of the sentence. Suppose we only want three words of context. "Leaked to the..." what? In this case, we would probably expect that the words "leaked to the <something>" would appear at least a couple times in a million words, perhaps in newspaper reports or Web articles. Maybe there was a sentence, "A month later his cover was blown, and the once-secret investigation was leaked to the media," and another sentence, "Wayne Manor was raided last night after it was leaked to the press that Bruce Wayne was Batman." Now the model has a sense that "leaked to the..." is usually followed by "media" or "press."

In the last example we used three words of context ("leaked to the") instead of the whole sentence. This would be called a quadrigram -- "quad" for the three words of context, plus the word we are guessing. If you use two words of context it's a trigram, etc. In the general case it's called an n-gram.

But even with only a few words of context, we sometimes run into n-grams that have never been seen before. If you purely follow the probabilities, and the three past words have never been seen before, you will predict that there is a 0% chance that any word will come next! That can be problematic (in the dividng-by-zero sense of the word), so we often have to take other steps to make sure that our model is able to at least try to come up with a prediction.

Imagine rolling a dice twice and it comes up 1 and 2. Now this is your data about the dice.

  1. 50%
  2. 50%
  3. 0%
  4. 0%
  5. 0%
  6. 0%

Imagine this is all the data you have about the dice. You don't even know what it's shaped like! So you just use the data for your prediction -- you have no choice, and really, no reason to expect anything other than a perfect prediction of future behavior, except for some vague misgivings about sample size. But, those percentages are completely wrong for a 6-sided dice, so you won't predict future behavior well at all! You need more data to make accurate guesses; two data points is not enough for the 6 possible outcomes.

Actually, a human probably wouldn't just take that table at face value. They would figure, "there's no way the others are 0%; we probably don't have enough data." So, they adjust the values before using their results, or they try to get more data. This kind of manual tinkering with probabilities -- putting our hands on it -- is quintessentially human. It turns that "fuzzing" the prediction can be helpful for computers as well, so we often add similar functionality to our n-gram models. You might think that having the model deliberately guess around or fiddle with the data would make it less accurate, but it turns out to be very important, because it gives us a path forward even if we have never seen our words-of-context before.

Well, so much for the intuitive approach. Maybe you're still a little lost, but that's alright. Let's get started with some concrete examples. First we need a big set of words, a corpus, for our program to obsessively count through. Let's feed our Count von Count.

Corpora and Tokenization

A corpus is just a big collection of documents which are analyzed by the model program, a process called training. Selection of a corpus is essential in real-world NLP. The corpus should be not too big, or the model may take too long to train, but it also can’t be too small because the model needs a lot of text in order to learn. The more complex and varied the subject matter, the larger the corpus required.

Some corpora (that's plural for corpus) have millions of words, pulled from books, magazines, or other material. Some corpora are organized by genre or media, and some rare gems even have all of their hundreds of thousands of words tagged with additional data like their parts of speech, presumably by underpaid grad students. Some contain chat logs or Twitter feeds. Notably, Google has a collection of goddamn huge private corpora, which is one of the reasons their natural language stuff is so good. They might have the biggest NLP corpora in the world.

For this program, I will be using the Project Gutenberg corpus from NLTK’s collection of corpora. NLTK (Natural Language Toolkit) is a vast Python library that has a lot of utilities for dealing with natural language. We'll be seeing it a lot in our code for this tutorial. Project Gutenberg is a website that contains the full text content of a bunch of public domain books. NLTK bundles some of Project Gutenberg's full-text extracts, which lets us use them easily in our program.

Note that one or two books is not a great corpus for a serious project. You will find the quality of your results to be not very good compared to a larger corpus like the well-known, free, million-plus-word Brown corpus, or one of the many other large free corpora available on the Web. But for a tutorial, a small corpus is nice because it's familiar and fast. And using individual authors can be interesting because the resulting model will end up anticipating the author's style.

Let's get the books.

In [2]:
import nltk
!python -m nltk.downloader gutenberg
[nltk_data] Downloading package gutenberg to /home/luke/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!

Check that it installed correctly:

In [3]:
from nltk.corpus import gutenberg
gutenberg.raw('carroll-alice.txt')[1:50]
Out[3]:
"Alice's Adventures in Wonderland by Lewis Carroll"

The book is represented as a long string of characters, but a raw-string representation of a book is not very meaningful to our model. The mathematics behind the n-gram model are described in terms of "tokens". For our purposes we can think of tokens as individual semantic units, like words or punctuation, that we want to consider in the model. We want to convert our raw string into a list of tokens so our model can understand it.

This is important: tokens are not meant to represent the syntax -- physical structure -- of the text. They instead provide a consistent way to represent the semantics -- the meaning -- behind the text, in a way that our NLP model can easily parse.

The process of breaking a character string into meaningful chunks is called tokenization in NLP circles, but many programmers also know it as lexical analysis, which is what it's called when you are talking about compilers. An NLP person might say, "we want to turn this string of characters into a list of tokens". A compiler guy might call them atoms. One vocabulary was created for describing natural language stuff, the other was made for computer language stuff, but they mean the same thing. I'll be using the NLP vocabulary, to help you get used to it.

NTLK includes a number of tokenizers for us to use. In your projects, you should probably use these. For example:

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize
word_tokenize(sent_tokenize(gutenberg.raw('carroll-alice.txt'))[1])[:10]
Out[4]:
['Down',
 'the',
 'Rabbit-Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired']

For this tutorial, though, we're going to make our own tokenizer: a nice, simple one that we can modify easily. Our tokenizer is going to be primarily powered by regular expressions. Yes, those horrid things, notoriously unable to parse anything defined in a recursive way, are going to be applied to the impossibly vague and horribly complex grammar of English text! Relax; we're not doing any complicated syntactic parsing, just some simple search-and-replace.

Since we will be using hardcoded regex substitutions as a tokenization strategy, we have to make choices about what changes we want to make to the source text. Remember, the choices we make here are important because of how they affect the n-gram model. Having said that, there are no wrong answers: different tokenization schemes will work better or worse depending on your corpus and your model and your goals.

To start with, we want to do some basic transformations:

  1. Standardize whitespace (replacing hyphens and newlines with spaces).
  2. Split the text on spaces so we have a list of tokens (i.e. words)
  3. CAPSLOCK all tokens (because Apple and apple have the same meaning -- they should use the same token)
In [5]:
import re

def tokenize(words):
    re.sub(r'[-\n]', " ", words) # standardize whitespace
    return [w.upper() for w in words.split(' ') if w != ""]

tokenize(gutenberg.raw('carroll-alice.txt')[1:50])
Out[5]:
["ALICE'S", 'ADVENTURES', 'IN', 'WONDERLAND', 'BY', 'LEWIS', 'CARROLL']

Great. What else should our tokenizer be doing?

For starters, there's a lot of punctuation we should deal with. Some punctuation is semantically important, like .,:;!? and some is not so useful, like ()[]{}#$%. We have to decide what punctuation we want to preserve as meaningful tokens, and what we can discard as simply distracting.

Obviously, sentence-breaking punctuation is important. Our model's every generated sentence is going to begin and end with a sentence break, so we need a consistent way to represent them. Although we could differentiate between !, ?, and ., we can get more meaningful results with such a small corpus by replacing all of these with a consistent [SENTENCE-BREAK] token.

Other common punctuation is also important if we want our model to capture any sense of grammar. At least we can preserve ,, ;, :, and .... Parenthesis are a little meaningful as well since an open-parenthesis indicates the start of a new clause in the sentence (a parenthetical clause). But for this simple tokenizer, I want to leave them out.

Symbols like \/#%$^*@ (just hold Shift and spam your number keys for additional examples) are also a tiny bit meaningful, but not enough to keep them. This is especially true because the symbols are fairly rare, so in such a small corpus, our model won't see enough repetions of the symbol to "understand" them statistically. Braces []{} are in the same boat, and although quotation marks " are quite common, their meaningfulness is questionable, and the n-gram model has no sense of "matching" quotes or any other elements that come in pairs, so we will get rid of them.

On the topic of symbolic rarity, numbers are notoriously annoying for n-gram models or any statistical models on text. Our model treats each token as having unique meaning, and while 122 is certainly different from 123, it's very hard to find useful statistical information in that distinction. If each number is treated as discrete, we can end up with tens or hundreds of numeric tokens that have each only been seen once or twice, which is bad news. It's pretty much impossible for an n-gram model to extract meaning from such sparse data.

On the other hand, the idea of numbers is useful -- the idea that, hey, a number should go in this spot. The exact number doesn't really matter, but there definitely should be a number here. We will represent this by replacing every number with a token, like [NUMBER], and assuming that the fact that there is a number is more important than the actual digits.

Let's put these ideas into code by expanding our previous tokenize function. Once again, remember that you can fiddle with these in your own project, or use a preexisting tokenizer.

In [6]:
from functools import reduce

def tokenize(words):
    # The list of regular expressions and replacements to be applied
    replacements = [
         ["[-\n]",                  " "] # Hyphens to whitespace
        ,[r'[][(){}#$%"]',          ""] # Strip unwanted characters
        ,[r'\s([./-]?\d+)+[./-]?\s'," [NUMBER] "] # Standardize numbers
        ,[r'\.{3,}',                " [ELLIPSIS] "]
        ,[r',',                     " [COMMA] "]
        ,[r';',                     " [SEMICOLON] "]
        ,[r':',                     " [COLON] "]
        ,[r'[.!?]',                 " [SENTENCE-BREAK] "]
    ]
    # This is a function that applies a single replacement from the list
    resub = lambda words, repls: re.sub(repls[0], repls[1], words)
    # Applies each replacement in order before splitting the text into tokens
    tokens = [w.upper() for w in 
              reduce(resub, replacements, words).split(' ') if w != '']
    return tokens + ["[SENTENCE-BREAK]"] # Add a sentence break in case the corpus cuts off mid-sentence

tokenize("This is 1 line of Text? How, does, it look; I wonder...")
Out[6]:
['THIS',
 'IS',
 '[NUMBER]',
 'LINE',
 'OF',
 'TEXT',
 '[SENTENCE-BREAK]',
 'HOW',
 '[COMMA]',
 'DOES',
 '[COMMA]',
 'IT',
 'LOOK',
 '[SEMICOLON]',
 'I',
 'WONDER',
 '[ELLIPSIS]',
 '[SENTENCE-BREAK]']

A few notes about this new code. It defines a table of [regex, replacement] pairs and then applies each one in turn to the passed-in text words. It's important that the application is in-order, or the sentence-break replacement might clobber the ellipsis replacement.

We do this with the reduce function, which is typically hard to grok at first. The way it's used here, it applies the resub function to each [regex, replacement] in replacements in turn. resub returns the modified text and the search begins again with the next regex. In the end, the reduce call returns the value returned by the last call of resub, which is the fully-modified text.

After the regexes are applied, the text is split on spaces and each token is capslocked.

Finally, we append [SENTENCE-BREAK] to the end. This will prevent our model from "walking off" the end of the corpus if it ends midsentence. (More concretely: if the corpus ends in some word $w_L$, and $w_L$ doesn't appear anywhere else in the corpus, then there won't be a word $w_{L+1}$ such that $P(w_{L+1}|w_L) \neq 0$. This is only acceptable for [SENTENCE-BREAK].

Coding the Model

Implementing n-gram model training in Python is fairly straightforward. I am going to use Python's defaultdict, because it helps the model's logic shine by automatically instantiating new dict keys.

At its heart, the model is a just conditional frequency distribution, modeled as a nested dictionary. The keys of the outer dictionaries are the priors, which in Python are represented as a tuple of length $n-1$ for an $n$-gram model. The keys of the inner dictionaries are single words, and their values are frequencies. For example, in a $3$-gram model,

model[("four", "score")]["and"] = $F(\text{and}|\text{four score})$

Note that we don't have to convert the frequency into a probability (i.e. normalize it). It would not affect the results because we would be dividing by a common factor. In fact, normalization can even be harmful, because it prevents us from precisely representing probabilities which are very close to zero (because of the inaccuracies of floating-point numbers, not because of a problem with the mathematical model itself).

In [5]:
from collections import defaultdict
from functools import partial
from itertools import dropwhile

def make_model(n, words):
    prior_n = n-1 # n-1 words in the prior tuple
    freq_dist = partial(defaultdict, int) # frequency distribution constructor
    model = defaultdict(freq_dist)

    for index in range(prior_n,len(words)):
        prior = words[index-prior_n:index]
        if "[SENTENCE-BREAK]" in prior:
            # Discard unneeded context
            prior = dropwhile(lambda x: x != "[SENTENCE-BREAK]", prior)
        # Note: tuples are hashable
        model[tuple(prior)][words[index]] += 1

    return model

Instead of using defaultdict, we could also have used the NLTK's built-in ConditionalFreqDist class (which inherits from defaultdict).

In [6]:
from nltk.probability import ConditionalFreqDist

def make_model(n, words):
    prior_n = n-1 # n-1 words in the prior tuple
    model = ConditionalFreqDist()

    for index in range(prior_n,len(words)):
        prior = words[index-prior_n:index]
        if "[SENTENCE-BREAK]" in prior:
            # Discard unneeded context
            prior = dropwhile(lambda x: x != "[SENTENCE-BREAK]", prior)
        # Note: tuples are hashable
        model[tuple(prior)].inc(words[index])

    return model

If we want to be really conise (and don't mind a slight change in semantics), we could go with this opaque but fun one-liner:

In [7]:
from nltk.probability import ConditionalFreqDist as cfr

def make_model(n, ws):
    return cfr((tuple(ws[i-n+1:i]), ws[i]) for i in range(n-1,len(ws)))

Of course we also have the option of using the more fully-featured n-gram model built in to NLTK.

In [8]:
from nltk.model.ngram import NgramModel

model = NgramModel(2, tokenize("This is a sentence. So is this."))
model.choose_random_word(["this"])
Out[8]:
'SENTENCE'

If you mess around with NgramModel, you'll notice some interesting behavior. How can it estimate the probabilities of words that it hasn't seen?

Learning from Reading

Let's take a look at the source for the NgramModel class. What does a mature implementation have that our ten-line routine lacks? Is there a more efficient way to implement the model in code? These are important questions; you can learn a lot by examining code in NLTK or other big projects.

Here's the code for creating the NgramModel class that NLTK uses:

In [ ]:
class NgramModel(ModelI):
    def __init__(self, n, train, pad_left=True, pad_right=False,
                 estimator=None, *estimator_args, **estimator_kwargs):
        self._n = n
        self._lpad = ('',) * (n - 1) if pad_left else ()
        self._rpad = ('',) * (n - 1) if pad_right else ()

        if estimator is None:
            estimator = _estimator

        cfd = ConditionalFreqDist()
        self._ngrams = set()


        # If given a list of strings instead of a list of lists, create enclosing list
        if (train is not None) and isinstance(train[0], compat.string_types):
            train = [train]

        for sent in train:
            for ngram in ngrams(chain(self._lpad, sent, self._rpad), n):
                self._ngrams.add(ngram)
                context = tuple(ngram[:-1])
                token = ngram[-1]
                cfd[context].inc(token)

        if not estimator_args and not estimator_kwargs:
            self._model = ConditionalProbDist(cfd, estimator, len(cfd))
        else:
            self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)

        # recursively construct the lower-order models
        if n > 1:
            self._backoff = NgramModel(n-1, train, pad_left, pad_right,
                                       estimator, *estimator_args, **estimator_kwargs)

Let's see what's going on.

The __init__ function populates two main data structures: cfd is the now-familiar ConditionalFrequencyDistribution, and _ngrams is a set of $n$-length tuples which is kept for easily testing if a given ngram tuple has been seen in the training set.

The estimator is a way of constructing probability distributions from frequency distributions. We'll get into its use later also.

This model is designed to work on list(list(string))s, which typically means a list of sentences which are lists of tokens or words. This is different from our model which uses a flat list of strings, but it can be convenient because you don't have to worry about "walking off" sentences or being strict with your [SENTENCE-BREAK] tokens.

If you pass in a list(string), it's wrapped in another list. It's not split into sentences, like you might expect, because to be able to split the input would mean that you are making an assumption about it. The model makes no assumptions about the meaning of its training data -- this same model could be used for French or Japanese as well as English. Tokenization is left completely to the calling code.

Finally, we reach the heart of the training process:

In [ ]:
for sent in train:
    for ngram in ngrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

Ah, now this is starting to look familiar, especially the last three lines.

Each sentence in the training set is padded up. The _lpad and _rpads are optional, but _lpad is enabled by default so the first word in the sentence can have a context, even if the context is just a bunch of placeholder strings.

Each ngram tuple $(w_{(i-n+1)}, w_{(i-n+2)}, \dots, w_i)$ is stored in _ngrams, and then split into the context (aka the prior) $(w_{(i-n+1)}, w_{(i-n+2)}, \dots, w_{(i-1)})$ and the current token $w_i$, and put into the cfd.

There's something important here that's easy to miss: we're outsourcing the actual ngram production to ngrams(), because it's useful in other areas. This function is defined in nltk.util. Let's look at it too. But before we do, let's think about what we expect this function to be doing. It's being passed a list(string) and an $n$, and whatever it outputs is being added straight to the _ngrams set. This tells us it must return an iterable over ngram tuples.

With that in mind, how would you implement ngrams()? I would probably do something like this.

In [19]:
def _ngrams(words, n):
    return (tuple(words[i-(n-1):i+1]) for i in range(n-1, len(words)))

list(_ngrams(tokenize("This is a sentence"), 3))
Out[19]:
[('THIS', 'IS', 'A'),
 ('IS', 'A', 'SENTENCE'),
 ('A', 'SENTENCE', '[SENTENCE-BREAK]')]

Alright, let's compare and contrast with the NTLK implementation, duped below. Once again, I stripped block comments.

In [ ]:
def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):

    sequence = iter(sequence)
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))

    history = []
    while n > 1:
        history.append(next(sequence))
        n -= 1
    for item in sequence:
        history.append(item)
        yield tuple(history)
        del history[0]

First, since the function is used elsewhere, there is redundant padding functionality. Our NgramModel passes in padded strings already, so that capacity is not used. But what about the rest of it?

There's actually something really interesting going on here. This is a generator function, which are one of my favorite Python features. If you're not familiar with it, read the wiki, because I'm having trouble explaining it briefly.

So, are you back? Now that we understand generator functions, we know that yield is a keyword that is similar to return -- it's part of the function's output. What are we yielding here? Well, the same list -- history -- over and over again. To see why, let's try running this function to generate some quadrigrams:

In [21]:
from nltk.util import ngrams
list(ngrams(tokenize("This is a slightly longer sentence"), 4))
Out[21]:
[('THIS', 'IS', 'A', 'SLIGHTLY'),
 ('IS', 'A', 'SLIGHTLY', 'LONGER'),
 ('A', 'SLIGHTLY', 'LONGER', 'SENTENCE'),
 ('SLIGHTLY', 'LONGER', 'SENTENCE', '[SENTENCE-BREAK]')]

There's a lot of repetition there. The ngram is something like a "moving window" into the text. It makes sense from a performance perspective to preserve the elements that are repeated between grams, dropping from the beginning and appending to the end.

So this function first produces a list of the first $n$ words, and then for each next word in the sequence, it dels the "oldest" word in history and appends the new one to produce the next ngram.

This might involve less memory allocation since it only uses a single list. But it does still have to create a new tuple for each ngram (you can only hash immutable data types), so I'm not sure if the savings are significant compared to our naive _ngrams implementation. Let's run a quick test:

In [34]:
token_poetry = tokenize(gutenberg.raw("austen-emma.txt"))
%timeit -n 5 list(ngrams(token_poetry, 4))
%memit list(ngrams(token_poetry, 4))
%timeit -n 5 list(_ngrams(token_poetry, 4))
%memit list(_ngrams(token_poetry, 4))
5 loops, best of 3: 115 ms per loop
peak memory: 79.31 MiB, increment: 11.40 MiB
5 loops, best of 3: 151 ms per loop
peak memory: 79.29 MiB, increment: 11.64 MiB

Seems using an inplace list is a little faster than the naive approach, but not much.

Note that generator functions are lazily evaluated, so we force evaluation by pulling the iterable into an actual list. (Our function _ngrams also returns an iterator because it uses a generator expression, so we have to give it the same treatment).

Anyway. Enough comparison.

Now we know how ngram() works, let's carry on with NgramModel.__init__(). The next lines are:

In [ ]:
if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)

# recursively construct the lower-order models
if n > 1:
    self._backoff = NgramModel(n-1, train, pad_left, pad_right,
                               estimator, *estimator_args, **estimator_kwargs)

OK, once again we have some estimator business.

Then, we are actually doing a recursive construction of all n-gram models with a lower $n$, and storing it in _backoff. This seems like a potential performance concern! Why do we need so many models?

Surprisingly, this is a pretty common tactic. As $n$ grows, the $n$-gram model retains more and more context. This can be beneficial -- certainly, it gets us closer to the truth, which is that each word is contextualized by the entire preceding text and arguably the succeeding text as well. But it can also be dangerous: as we add more and more context, the incidence of a specific context decreases. For instance, consider quintigrams: it's pretty unlikely that any but the biggest corpus is going to contain many quintigrams you throw at it, and sexigrams are even worse.

So what do you do if you can't match the provided context exactly? Just start throwing context out until you find a match -- which is what the $n-1$ model in _backoff is for.

Concretely: what if you are trying to find the probability of "steamed" following "I get all" $P(\text{steamed}|\text{I get all})$, but your corpus doesn't contain nursery rhymes? You can "back off" to looking for $P(\text{steamed}|\text{get all})$. And if you can't find that, back off to $P(\text{steamed}|\text{all})$, which maybe exists in your corpus because it was found in "They all steamed vegetables every morning" or "her clothes were all steamed and pressed". This "backing off" the context can give you an answer, but at the cost of discarding information.

Uses

In [37]:
from encodings import base64_codec
base64_codec.base64_decode(b"VGhlcmUncyBub3RoaW5nIGhlcmUuLi4=")[0]
Out[37]:
b"There's nothing here..."