The concept of iterating over containers is very powerful. Python defines a protocol for how iteration is performed. To see it in action, let's get the ASCII letters and enumerate them.
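The protocol itself is small: iter() asks an object for an iterator, and next() pulls values from it until StopIteration is raised. A minimal sketch (the print() calls work in both Python 2 and 3; this notebook otherwise uses Python 2):

```python
# iter() obtains an iterator from any iterable; next() advances it.
letters = iter('abc')
print(next(letters))  # a
print(next(letters))  # b
print(next(letters))  # c
# One more next(letters) would raise StopIteration.

# A for loop does exactly this under the hood:
for char in 'abc':
    print(char)
```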

Press shift+enter to execute a cell.

In [1]:
from string import lowercase
In [2]:

enumerate() enumerates a sequence of things. Appending ? or ?? to a function, module, or object shows its help message.

In [3]:

Create an iterator

In [4]:
enumerated_letters = enumerate(lowercase)

next() returns the next value from an iterator, which in the case of enumerate() is an (index, value) pair.

In [5]:
next(enumerated_letters)
(0, 'a')
In [6]:
next(enumerated_letters)
(1, 'b')

One way of accessing the values in a tuple is to assign all of them to variables:

In [7]:
position, letter = next(enumerated_letters)

If we need to show a user something that is a mix of text and data, .format() is handy. Defining a string template with data placeholders and formatting it with values avoids string concatenation with +.

In [8]:
'{letter} is enumerated as {position}'.format(letter=letter, position=position)
'c is enumerated as 2'

If a sequence of strings has to be joined, use .join().

In [9]:
print ', '.join('{l} is enumerated as {p}'.format(l=l, p=p) for p, l in enumerated_letters)
d is enumerated as 3, e is enumerated as 4, f is enumerated as 5, g is enumerated as 6, h is enumerated as 7, i is enumerated as 8, j is enumerated as 9, k is enumerated as 10, l is enumerated as 11, m is enumerated as 12, n is enumerated as 13, o is enumerated as 14, p is enumerated as 15, q is enumerated as 16, r is enumerated as 17, s is enumerated as 18, t is enumerated as 19, u is enumerated as 20, v is enumerated as 21, w is enumerated as 22, x is enumerated as 23, y is enumerated as 24, z is enumerated as 25

Once an iterator is exhausted, a StopIteration exception is raised. However, it's possible to have infinite iterators.

In [10]:
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-10-6592de41862c> in <module>()
----> 1 next(enumerated_letters)

StopIteration:

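An infinite iterator, by contrast, simply never raises StopIteration. For example, itertools.count() counts forever, and islice() is one way to take a finite slice of it (a small sketch):

```python
from itertools import count, islice

# count(10) yields 10, 11, 12, ... forever; islice caps it at 5 items.
print(list(islice(count(10), 5)))  # [10, 11, 12, 13, 14]
```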


Given a sequence of words (or any iterable), bigrams() returns (yields) its bigrams. Here is another way of defining an iterator (strictly speaking, a generator):

In [11]:
from itertools import chain

def bigrams(words):
    """Bigrams generated from a sequence of words.

    :param iter words: the sequence of words
    :returns: adjacent word pairs
    """
    # In the very beginning, there is no previous word,
    # but we start with the bigram (None, words[0]) to
    # indicate that words[0] was seen at the beginning
    # of the text.
    previous = None
    # We also need to append None to the end of the text.
    # itertools.chain chains iterables together
    # (yes, a list is an iterable in this regard, as is a
    # string and many other things).
    for word in chain(words, [None]):
        # A bigram is the previous word and the current word.
        # yield pauses the generator, "returns" a value
        # to the caller, and later resumes execution here.
        yield previous, word
        previous = word

Check whether it works. Note that we don't actually care whether words is a sequence of strings; it might be a sequence of anything.

In [12]:
list(bigrams([1, 2, 3, 4]))
[(None, 1), (1, 2), (2, 3), (3, 4), (4, None)]
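The same pairs can also be produced without writing a generator by zipping a padded sequence against a shifted copy of itself. A sketch using itertools.tee (bigrams_zipped is our name for this variant, not part of the notebook):

```python
from itertools import chain, tee

def bigrams_zipped(words):
    """Yield (previous, current) pairs, padded with None on both ends."""
    padded = chain([None], words, [None])
    first, second = tee(padded)  # two independent iterators over the same data
    next(second, None)           # shift the second iterator by one element
    return zip(first, second)

print(list(bigrams_zipped([1, 2, 3, 4])))
# [(None, 1), (1, 2), (2, 3), (3, 4), (4, None)]
```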


It would be cool to have a general generator for n-grams.

In [13]:
from collections import deque

def ngrams(words, ngram_len=2):
    """Generate n-grams from a sequence of words.

    :param iter words: the sequence of words
    :param int ngram_len: the length of the n-grams to be generated
    :returns: n-grams
    """
    # collections.deque might be seen as a list of limited size.
    # If an element is appended when the deque is full, the element
    # at the other end of the deque is removed, keeping the
    # length of the deque constant.
    # Start with dummy tokens marking the beginning of the sequence.
    ngram = deque([None] * ngram_len, maxlen=ngram_len)
    for word in chain(words, [None] * (ngram_len - 1)):
        ngram.append(word)
        yield tuple(ngram)
In [14]:
list(ngrams('abcd', ngram_len=3))
[(None, None, 'a'),
 (None, 'a', 'b'),
 ('a', 'b', 'c'),
 ('b', 'c', 'd'),
 ('c', 'd', None),
 ('d', None, None)]

Read a file word by word

We have a generator that produces n-grams, but we are not yet able to get a sequence of words from a corpus, which so far is just a text file.

In [15]:
def read_words(f_name):
    """Read a file word by word."""
    with open(f_name) as f:
        for line in f:
            # Tokenization is a difficult task;
            # here a word is anything between two spaces.
            for word in line.split():
                yield word
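Because read_words() is a generator, the file is consumed lazily: nothing is read until a word is requested. The same idea can be demonstrated with an in-memory file (read_words_from is a hypothetical variant that takes any open file-like object instead of a file name):

```python
import io

def read_words_from(f):
    """Like read_words(), but takes an open file-like object."""
    for line in f:
        for word in line.split():
            yield word

text = io.StringIO(u"one two\nthree four\n")
words = read_words_from(text)
print(next(words))  # one  -- only as much of the "file" is read as needed
print(list(words))  # ['two', 'three', 'four']
```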

Let's use Alice's Adventures in Wonderland by Lewis Carroll as a corpus.

Some other books are:

We download it from the Internet ...

In [16]:
from urllib import urlretrieve

f_name, _ = urlretrieve('')

... and store it somewhere

In [17]:

Now we are ready to get a sequence of words from the corpus. It's too big to be shown here, so only 10 words are shown.

In [18]:
from itertools import islice

print ' '.join(islice(read_words(f_name), 10))
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll This

We can chain iterators as processing blocks in the information flow.

In [19]:
for w1, w2 in islice(bigrams(read_words(f_name)), 10):
    print w1, w2
None Project
Project Gutenberg's
Gutenberg's Alice's
Alice's Adventures
Adventures in
in Wonderland,
Wonderland, by
by Lewis
Lewis Carroll
Carroll This

Task: clean up the words

Even though read_words() tokenization is very simple, we can still filter out non-letters. Lowercasing words might be a good idea as well.

In [20]:
def clean_words(words):
    for word in words:
        yield word.lower()
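The version above only lowercases. A stricter variant that also strips non-letters could look like this (clean_words_strict is our name and the exact cleaning rules are an assumption, not the notebook's version):

```python
import re

def clean_words_strict(words):
    """Lowercase each word, strip everything that is not a letter,
    and drop words that end up empty (punctuation-only tokens, numbers)."""
    for word in words:
        cleaned = re.sub('[^a-z]', '', word.lower())
        if cleaned:
            yield cleaned

print(list(clean_words_strict(['Alice,', "won't", '--', '42'])))
# ['alice', 'wont']
```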

Counting ngrams

It would be cool if we managed to pump the bigrams into a database. Then counting elements would be a matter of seconds (given that you know SQL). Luckily, there is Pandas, which lets us process data frames (tables) the way we want.

In [21]:
import pandas as pd
In [22]:
In [23]:
bigram_frame = pd.DataFrame(ngrams(clean_words(read_words(f_name)), ngram_len=2), columns=('Word 1', 'Word 2'))
In [24]:
        Word 1       Word 2
0         None      project
1      project  gutenberg's
2  gutenberg's      alice's
3      alice's   adventures
4   adventures           in

Count unique bigrams

In [25]:
bigram_frame['count'] = 1  # Create a new column
In [26]:
bigram_counts = bigram_frame.groupby(('Word 1', 'Word 2')).sum()
In [27]:
bigram_counts.sort('count', ascending=False, inplace=True)
In [28]:
               count
Word 1 Word 2
said   the       207
of     the       154
in     a         101
       the        92
to     the        84
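The same counts can be obtained without pandas by using collections.Counter; most_common() then does the sorting for us. A small self-contained sketch (count_bigrams is our name, and the padding with None mirrors bigrams() above):

```python
from collections import Counter

def count_bigrams(words):
    """Count (previous, current) pairs, padded with None at the ends."""
    words = list(words)
    pairs = zip([None] + words, words + [None])
    return Counter(pairs)

counts = count_bigrams('the cat and the cat'.split())
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```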

Dictionary size

In [29]:
dictionary = pd.DataFrame(clean_words(read_words(f_name)), columns=('Word', ))
In [30]:
dictionary['count'] = 1
In [31]:
dictionary = dictionary.groupby('Word').sum().sort('count', ascending=False)
In [32]:
In [33]:
      count
Word
the    1777
and     833
to      782
a       670
of      610
In [44]:
dictionary.plot().loglog()  # Check out Seaborn to get nicer plots.

Count bigrams that were seen more than n times

In [35]:
len(bigram_counts[bigram_counts['count'] > 5])
In [36]:
len(bigram_counts[bigram_counts['count'] > 4])

Joint probability

In [37]:
bigram_counts['Joint probability'] = bigram_counts['count'] / bigram_counts['count'].sum()
In [38]:
               count  Joint probability
Word 1 Word 2
said   the       207           0.007026
of     the       154           0.005227
in     a         101           0.003428
       the        92           0.003123
to     the        84           0.002851
In [39]:
          count  Joint probability
Word 2
the          84           0.002851
be           49           0.001663
herself,     25           0.000849
see          23           0.000781
get          18           0.000611

Conditional probability

In [40]:
               count  Joint probability
Word 1 Word 2
said   the       207           0.007026
of     the       154           0.005227
in     a         101           0.003428
       the        92           0.003123
to     the        84           0.002851
In [41]:
bigram_counts['Word 1 count'] = dictionary.loc[bigram_counts.index.get_level_values('Word 1')]['count'].values
In [42]:
bigram_counts['Conditional probability'] = bigram_counts['count'] / bigram_counts['Word 1 count']
In [43]:
bigram_counts.loc['she'].sort('Conditional probability', ascending=False).head()
       count  Joint probability  Word 1 count  Conditional probability
Word 2
had       57           0.001935           518                 0.110039
was       51           0.001731           518                 0.098456
said      28           0.000950           518                 0.054054
went      28           0.000950           518                 0.054054
could     19           0.000645           518                 0.036680
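The same relation, P(w2 | w1) = count(w1, w2) / count(w1), can be checked on a toy example with plain Counters (the variable names below are ours, not the pandas frames above):

```python
from collections import Counter

# Conditional probability of the next word:
#   P(w2 | w1) = count(w1, w2) / count(w1)
words = 'she had she was she had'.split()
unigram_counts = Counter(words)
pair_counts = Counter(zip(words, words[1:]))

p_had_given_she = (pair_counts[('she', 'had')]
                   / float(unigram_counts['she']))
print(p_had_given_she)  # 2 / 3: "she" occurs 3 times, "she had" twice
```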


In [ ]:

Model evaluation

Get two corpora drawn from different domains (for example, Russian classics, and novels about Clarissa), and divide each into a training and a test set. Build a language model on the training data for each domain. Then calculate the cross-entropy figures for the test sets using both the language model trained on that domain and the other language model. How much do the cross-entropy estimates differ?

This is Exercise 6.8 from Manning, Christopher D. "Foundations of statistical natural language processing". Ed. Hinrich Schütze. MIT press, 1999.
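Cross-entropy on test data can be estimated as the average negative log2 probability per word. A minimal unigram-model sketch of the computation (function names are ours; a real experiment needs smoothing for test words unseen in training, and bigram context as in the models above):

```python
from collections import Counter
from math import log

def unigram_model(train_words):
    """Maximum-likelihood unigram probabilities. No smoothing, so the
    model assigns zero probability to unseen words."""
    counts = Counter(train_words)
    total = float(sum(counts.values()))
    return {word: count / total for word, count in counts.items()}

def cross_entropy(model, test_words):
    """Average negative log2 probability per test word."""
    test_words = list(test_words)
    return -sum(log(model[w], 2) for w in test_words) / len(test_words)

model = unigram_model('a b a a b c'.split())
print(cross_entropy(model, 'a b'.split()))  # about 1.29 bits per word
```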

In [ ]:

Experiment with different tokenisation methods. Try to implement the experiments in such a way that it is easy to

  • add new models
  • vary model parameters