1. Introduction - trying to understand it
2. Corpora and Tokenization - finding food for it
3. Coding the Model - having a go at it
4. Learning from Reading - reading someone else's go at it
5. Uses - Where is it?

# Introduction¶

The n-gram model makes for a great introduction to natural-language processing because it is extremely simple, but still versatile and powerful enough to be used for real work. Once you intuitively understand it, the n-gram model is the simplest thing in the world. This introduction is my attempt to transfer an intuitive perception of the model to you, the reader.

If you understand things better by looking at Python code, you might want to skip to the next section: Corpora and Tokenization.

Let's start with a question. Why do we have expectations about the words we read? We definitely have such expectations -- that's how how you can read this sentence without noticing that there were two "how"s.

Our first thought is that our expectations are based on the structure of the language -- the rules of grammar, or syntax. But we can quickly think of many other things that give us expectations about the coming words: common idiom, the author's style, or even other knowledge we have that's external to the text can influence our expectations.

We want to be able to quantify and compute these expectations, but unfortunately, our Python programs can't access the useful context in your head that you use to generate expectations when you read. We can find some context, though: collections of the author's texts, and maybe larger collections of texts by many different authors, all lovingly digitized, give us a lot of relevant data. A human being with that kind of data could probably read for two years and emerge with some extremely good expectations. How can we do the same thing with a computer?

This is the point where I like to start thinking about everything as likelihoods. That's why I like the word "expectation" -- it's just a likelihood of an event, but we understand expectations. We know what an expectation feels like. It's a feeling like something is right for a spot -- it just ought to be, obviously. It fits. So carry that intuition over to the more technical word, likelihood. A likelihood is just an expectation of an event, quantified as a real number between $0$ and $1$. When we feel an expectation, we're unconsciously assigning a likelihood to an event -- "I think this future event is $X\%$ likely, given this stuff that's already known." (Author's note: Is this the actual thought process in our brain? Probably not. But it's something we can relate to, which is the important part.)

But how do we compute that this future event is $X\%$ likely, given this known past stuff? To calculate a likelihood like that, we can watch the "known past stuff" happen a bunch of times, and count how many times it's followed by the future event in question.
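As a sketch of this counting idea, here is a tiny, invented token list (both the corpus and the context word "THE" are made up for illustration):

```python
from collections import Counter

# A toy list of already-tokenized words (invented for illustration).
tokens = ["THE", "CAT", "SAT", "ON", "THE", "MAT", "AND",
          "THE", "CAT", "SLEPT", "ON", "THE", "SOFA"]

# Count how often each word follows the context word "THE".
followers = Counter(tokens[i + 1] for i in range(len(tokens) - 1)
                    if tokens[i] == "THE")

# Turn the counts into likelihoods by dividing by the total.
total = sum(followers.values())
likelihoods = {word: count / total for word, count in followers.items()}
print(likelihoods)  # {'CAT': 0.5, 'MAT': 0.25, 'SOFA': 0.25}
```

Given "THE", the model expects "CAT" half the time, because that's exactly how often it saw "CAT" follow "THE" in its data.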

Of course, we are still left wondering how to make this known past stuff and future event explicit for our program. In a sense, this is the heart of the NLP problem. The n-gram model is just one way of thinking about future events and past stuff and their relationship, and it's based on the idea that language itself usually has some fairly obvious content delimiters. Typically, the contents of each of the delimited blobs are very tightly coupled syntactically, logically, and/or stylistically.

In English, the sentence is an obvious example. It's really hard to predict the first word of a sentence, compared to how easy it is to predict the later words once you know the words that came before them. There's much less meaningful attachment between words in two separate sentences. So sentences are important: they are little, self-contained blobs of meaning. The other, purely practical, advantage of the sentence-sized blob is that we have a lot of them floating about. Paragraphs are blobby in a similar way, but they are much longer, which makes them harder to use: they are too complicated to get good results from.

Another great blob in the English language is the word. Although words are not necessarily the "smallest unit of meaning" (thanks to prefixes, suffixes, and compound words), they seem like as good a "unit of meaning" as anything we can find in natural language. There are some models, n-gram or otherwise, that try to consider sub-word-level meaning. Sometimes this can make the model better and sometimes it doesn't really help.

Since a word is a small, fairly predictable thing, and sentences provide such convenient content delimiters, the n-gram model is focused on estimating the likelihood of the next word in the sentence -- that's our future event -- given the sentence so far as our past data.
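In symbols, using the same inline-math style as above, the quantity we want to estimate is

$$P(w_k \mid w_1, w_2, \ldots, w_{k-1})$$

that is, the likelihood that the $k$-th word of the sentence is $w_k$, given the $k-1$ words seen so far.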

It's worth emphasizing that this is a fairly arbitrary distinction. There is nothing essential about using words and sentences as the most meaningful units when analyzing natural language; it just seems to make the most sense to us right now.

Unfortunately, it turns out that even calculating the likelihood of the next word in the sentence, given the sentence so far, is not very practical. Unusual sentences, like the long, run-on variety, might throw a wrench into your counting process (remember, that's how we're calculating the likelihoods for the next word -- by counting all previous times we've seen the sentence so far) by being utterly unique, which would make it impossible for you to predict the next word -- although you would never encounter a run-on sentence like that in a notebook like this! So, it turns out we actually get better results if we only use part of the sentence.

This might seem counterintuitive, but it's because data scarcity makes prediction difficult. When we have a huge collection of text -- millions of words -- then although many sentences are still repeated only once or twice, 3-4 word segments almost always appear multiple times. And that's important, because we need many appearances for our model to make sane predictions. To see why, you need to remember that the model doesn't know about any word combinations it hasn't seen before.

So, for example, suppose you are trying to guess the next word in this sentence: "Three days later, details about the event were leaked to the..." what? You probably have an idea what it's being leaked to already, but bear with me. If our model had never seen that sentence before -- and it's not hard to imagine that even in a million words of English that sentence could be completely unique -- how would it guess the next word?

On the other hand, suppose we threw out the start of the sentence. Suppose we only want three words of context. "Leaked to the..." what? In this case, we would probably expect that the words "leaked to the <something>" would appear at least a couple times in a million words, perhaps in newspaper reports or Web articles. Maybe there was a sentence, "A month later his cover was blown, and the once-secret investigation was leaked to the media," and another sentence, "Wayne Manor was raided last night after it was leaked to the press that Bruce Wayne was Batman." Now the model has a sense that "leaked to the..." is usually followed by "media" or "press."

In the last example we used three words of context ("leaked to the") instead of the whole sentence. This would be called a quadrigram -- "quad" for four: the three words of context plus the word we are guessing. If you use two words of context it's a trigram, and so on. In the general case it's called an n-gram.
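A minimal sketch of slicing a token list into n-grams by sliding a window across it (the helper name `ngrams` is my own; NLTK also ships an `nltk.ngrams` utility):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["WAS", "LEAKED", "TO", "THE", "PRESS"]
print(ngrams(tokens, 4))
# [('WAS', 'LEAKED', 'TO', 'THE'), ('LEAKED', 'TO', 'THE', 'PRESS')]
# Each quadrigram is three words of context plus the word being guessed.
```

Training on a corpus amounts to counting how often each n-gram appears, exactly like the counting we did earlier.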

But even with only a few words of context, we sometimes run into n-grams that have never been seen before. If you purely follow the probabilities, and the three past words have never been seen before, you will predict that there is a 0% chance that any word will come next! That can be problematic (in the dividing-by-zero sense of the word), so we often have to take other steps to make sure that our model is able to at least try to come up with a prediction.

Imagine rolling a die twice, and it comes up 1 and 2. Now this is your data about the die:

1. 50%
2. 50%
3. 0%
4. 0%
5. 0%
6. 0%

Imagine this is all the data you have about the die. You don't even know what shape it is! So you just use the data for your prediction -- you have no choice, and really, no reason to expect anything other than a perfect prediction of future behavior, except for some vague misgivings about sample size. But those percentages are completely wrong for a six-sided die, so you won't predict future behavior well at all! You need more data to make accurate guesses; two data points is not enough for the six possible outcomes.

Actually, a human probably wouldn't just take that table at face value. They would figure, "there's no way the others are 0%; we probably don't have enough data." So, they adjust the values before using their results, or they try to get more data. This kind of manual tinkering with probabilities -- putting our hands on it -- is quintessentially human. It turns out that "fuzzing" the prediction can be helpful for computers as well, so we often add similar functionality to our n-gram models. You might think that having the model deliberately guess around or fiddle with the data would make it less accurate, but it turns out to be very important, because it gives us a path forward even if we have never seen our words-of-context before.
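One simple version of this "fuzzing" is add-one (Laplace) smoothing: pretend every possible outcome was seen one extra time, so nothing ever gets a hard 0%. A sketch using the die example:

```python
# Raw counts from our two rolls: a 1 and a 2.
counts = {face: 0 for face in range(1, 7)}
counts[1] += 1
counts[2] += 1

# Add-one (Laplace) smoothing: pretend every face was seen once more.
smoothed = {face: count + 1 for face, count in counts.items()}
total = sum(smoothed.values())  # 2 real rolls + 6 imaginary ones = 8
probs = {face: count / total for face, count in smoothed.items()}
print(probs)  # faces 1 and 2 get 0.25 each; the rest get 0.125, never 0
```

The smoothed table is "wrong" about what we actually saw, but it's far closer to the die's true behavior, and the same trick keeps an n-gram model from assigning zero probability to word combinations it has never seen.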

Well, so much for the intuitive approach. Maybe you're still a little lost, but that's alright. Let's get started with some concrete examples. First we need a big set of words, a corpus, for our program to obsessively count through. Let's feed our Count von Count.

# Corpora and Tokenization¶

A corpus is just a big collection of documents which are analyzed by the model program, a process called training. Selection of a corpus is essential in real-world NLP. The corpus shouldn't be too big, or the model may take too long to train, but it also can't be too small, because the model needs a lot of text in order to learn. The more complex and varied the subject matter, the larger the corpus required.

Some corpora (that's plural for corpus) have millions of words, pulled from books, magazines, or other material. Some corpora are organized by genre or media, and some rare gems even have all of their hundreds of thousands of words tagged with additional data like their parts of speech, presumably by underpaid grad students. Some contain chat logs or Twitter feeds. Notably, Google has a collection of goddamn huge private corpora, which is one of the reasons their natural language stuff is so good. They might have the biggest NLP corpora in the world.

For this program, I will be using the Project Gutenberg corpus from NLTK’s collection of corpora. NLTK (Natural Language Toolkit) is a vast Python library that has a lot of utilities for dealing with natural language. We'll be seeing it a lot in our code for this tutorial. Project Gutenberg is a website that contains the full text content of a bunch of public domain books. NLTK bundles some of Project Gutenberg's full-text extracts, which lets us use them easily in our program.

Note that one or two books is not a great corpus for a serious project. You will find the quality of your results to be not very good compared to a larger corpus like the well-known, free, million-plus-word Brown corpus, or one of the many other large free corpora available on the Web. But for a tutorial, a small corpus is nice because it's familiar and fast. And using individual authors can be interesting because the resulting model will end up anticipating the author's style.

Let's get the books.

In [2]:
import nltk
nltk.download('gutenberg')
[nltk_data]   Package gutenberg is already up-to-date!

Check that it installed correctly:

In [3]:
from nltk.corpus import gutenberg
gutenberg.raw('carroll-alice.txt')[1:50]
Out[3]:
"Alice's Adventures in Wonderland by Lewis Carroll"

The book is represented as a long string of characters, but a raw-string representation of a book is not very meaningful to our model. The mathematics behind the n-gram model are described in terms of "tokens". For our purposes we can think of tokens as individual semantic units, like words or punctuation, that we want to consider in the model. We want to convert our raw string into a list of tokens so our model can understand it.

This is important: tokens are not meant to represent the syntax -- physical structure -- of the text. They instead provide a consistent way to represent the semantics -- the meaning -- behind the text, in a way that our NLP model can easily parse.

The process of breaking a character string into meaningful chunks is called tokenization in NLP circles, but many programmers also know it as lexical analysis, which is what it's called when you are talking about compilers. An NLP person might say, "we want to turn this string of characters into a list of tokens". A compiler guy might call them atoms. One vocabulary was created for describing natural language stuff, the other was made for computer language stuff, but they mean the same thing. I'll be using the NLP vocabulary, to help you get used to it.

NLTK includes a number of tokenizers for us to use. In your projects, you should probably use these. For example:

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize
word_tokenize(sent_tokenize(gutenberg.raw('carroll-alice.txt'))[1])[:10]
Out[4]:
['Down',
'the',
'Rabbit-Hole',
'Alice',
'was',
'beginning',
'to',
'get',
'very',
'tired']

For this tutorial, though, we're going to make our own tokenizer: a nice, simple one that we can modify easily. Our tokenizer is going to be primarily powered by regular expressions. Yes, those horrid things, notoriously unable to parse anything defined in a recursive way, are going to be applied to the impossibly vague and horribly complex grammar of English text! Relax; we're not doing any complicated syntactic parsing, just some simple search-and-replace.

Since we will be using hardcoded regex substitutions as a tokenization strategy, we have to make choices about what changes we want to make to the source text. Remember, the choices we make here are important because of how they affect the n-gram model. Having said that, there are no wrong answers: different tokenization schemes will work better or worse depending on your corpus and your model and your goals.

To start with, we want to do some basic transformations:

1. Standardize whitespace (replacing hyphens and newlines with spaces).
2. Split the text on spaces so we have a list of tokens (i.e. words).
3. CAPSLOCK all tokens (because "Apple" and "apple" have the same meaning -- they should use the same token).
In [5]:
import re

def tokenize(words):
    words = re.sub(r'[-\n]', " ", words) # standardize whitespace
    return [w.upper() for w in words.split(' ') if w != ""]

tokenize(gutenberg.raw('carroll-alice.txt')[1:50])
Out[5]:
["ALICE'S", 'ADVENTURES', 'IN', 'WONDERLAND', 'BY', 'LEWIS', 'CARROLL']

Great. What else should our tokenizer be doing?

For starters, there's a lot of punctuation we should deal with. Some punctuation is semantically important, like .,:;!? and some is not so useful, like ()[]{}#$%. We have to decide what punctuation we want to preserve as meaningful tokens, and what we can discard as simply distracting. Obviously, sentence-breaking punctuation is important. Every sentence our model generates is going to begin and end with a sentence break, so we need a consistent way to represent them. Although we could differentiate between !, ?, and ., we can get more meaningful results with such a small corpus by replacing all of these with a single [SENTENCE-BREAK] token.

Other common punctuation is also important if we want our model to capture any sense of grammar. At the least, we can preserve ,, ;, :, and .... Parentheses are a little meaningful as well, since an open-parenthesis indicates the start of a new clause in the sentence (a parenthetical clause), but for this simple tokenizer, I want to leave them out. Symbols like \/#%$^*@ (just hold Shift and spam your number keys for additional examples) are also a tiny bit meaningful, but not enough to keep them. This is especially true because these symbols are fairly rare, so in such a small corpus, our model won't see enough repetitions of them to "understand" them statistically. Brackets and braces []{} are in the same boat, and although quotation marks " are quite common, their meaningfulness is questionable, and the n-gram model has no sense of "matching" quotes or any other elements that come in pairs, so we will get rid of them too.

On the topic of symbolic rarity, numbers are notoriously annoying for n-gram models or any statistical models on text. Our model treats each token as having unique meaning, and while 122 is certainly different from 123, it's very hard to find useful statistical information in that distinction. If each number is treated as discrete, we can end up with tens or hundreds of numeric tokens that have each only been seen once or twice, which is bad news. It's pretty much impossible for an n-gram model to extract meaning from such sparse data.

On the other hand, the idea of numbers is useful -- the idea that, hey, a number should go in this spot. The exact number doesn't really matter, but there definitely should be a number here. We will represent this by replacing every number with a token, like [NUMBER], and assuming that the fact that there is a number is more important than the actual digits.
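A quick sketch of that substitution (the example sentence is invented):

```python
import re

sentence = "The fire destroyed 122 homes and damaged 57 more."
# Collapse every run of digits into a single placeholder token.
print(re.sub(r"[0-9]+", "[NUMBER]", sentence))
# The fire destroyed [NUMBER] homes and damaged [NUMBER] more.
```

Now "122" and "57" feed the same statistics, instead of being two unique tokens the model has seen only once each.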

Let's put these ideas into code by expanding our previous tokenize function. Once again, remember that you can fiddle with these in your own project, or use a preexisting tokenizer.

In [6]:
from functools import reduce

def tokenize(words):
    # The list of regular expressions and replacements to be applied, in order
    replacements = [
        [r"[-\n]",           " "],                   # Hyphens to whitespace
        [r"\.\.\.",          " [ELLIPSIS] "],        # Keep ellipses (token name is my choosing)
        [r"[.!?]+",          " [SENTENCE-BREAK] "],  # Unify sentence breaks
        [r"[0-9]+",          " [NUMBER] "],          # Unify numbers
        [r"([,;:])",         r" \1 "],               # Split kept punctuation into tokens
        [r"[^\w\s\[\]',;:-]", " "],                  # Discard all other symbols
    ]
    text = reduce(lambda s, rep: re.sub(rep[0], rep[1], s), replacements, words)
    return [w.upper() for w in text.split() if w != ""]

# Uses¶

In [37]:
from encodings import base64_codec
base64_codec.base64_decode(b"VGhlcmUncyBub3RoaW5nIGhlcmUuLi4=")[0]
Out[37]:
b"There's nothing here..."