Iterators

The concept of iterating over containers is very powerful. Python defines a protocol for how iteration is performed. To see it in action, let's get the lowercase ASCII letters and enumerate them.

Press shift+enter to execute a cell.

In [1]:
from string import lowercase
In [2]:
lowercase
Out[2]:
'abcdefghijklmnopqrstuvwxyz'
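
Under the hood, iteration follows the iterator protocol: iter() turns an iterable into an iterator, and next() pulls values from it until StopIteration is raised. A minimal sketch (the evaluated values are shown as comments):

letters = iter(lowercase)   # any iterable can be turned into an iterator
next(letters)               # 'a'
next(letters)               # 'b'
# A for loop does exactly this behind the scenes and stops at StopIteration.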

enumerate() enumerates a sequence of things. Appending ? or ?? to a function, module or object shows its help message.

In [3]:
enumerate?

Create an iterator

In [4]:
enumerated_letters = enumerate(lowercase)

next() returns the next value from an iterator, which in the case of enumerate() is an (index, value) pair.

In [5]:
next(enumerated_letters)
Out[5]:
(0, 'a')
In [6]:
next(enumerated_letters)
Out[6]:
(1, 'b')

One way of accessing the values in a tuple is to assign all of them to variables at once (tuple unpacking).

In [7]:
position, letter = next(enumerated_letters)

If we need to show a user something that mixes text and data, .format() is handy. Defining a string template with data placeholders and formatting it with values avoids string concatenation with +.

In [8]:
'{letter} is enumerated as {position}'.format(letter=letter, position=position)
Out[8]:
'c is enumerated as 2'

When a sequence of strings has to be joined, use .join().

In [9]:
print ', '.join('{l} is enumerated as {p}'.format(l=l, p=p) for p, l in enumerated_letters)
d is enumerated as 3, e is enumerated as 4, f is enumerated as 5, g is enumerated as 6, h is enumerated as 7, i is enumerated as 8, j is enumerated as 9, k is enumerated as 10, l is enumerated as 11, m is enumerated as 12, n is enumerated as 13, o is enumerated as 14, p is enumerated as 15, q is enumerated as 16, r is enumerated as 17, s is enumerated as 18, t is enumerated as 19, u is enumerated as 20, v is enumerated as 21, w is enumerated as 22, x is enumerated as 23, y is enumerated as 24, z is enumerated as 25

Once an iterator is exhausted, a StopIteration exception is raised. However, it's possible to have infinite iterators.

In [10]:
next(enumerated_letters)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-10-6592de41862c> in <module>()
----> 1 next(enumerated_letters)

StopIteration: 
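
For instance, itertools.count() yields numbers without end, so it can never be exhausted; take values from it with next() or islice() rather than list(). A small sketch:

from itertools import count

numbers = count(start=10)  # 10, 11, 12, ... forever
next(numbers)              # 10
next(numbers)              # 11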

Bigrams

Given a sequence of words (or any iterable), bigrams() returns (yields) its bigrams. Here is another way of defining an iterator (actually a generator):

In [11]:
from itertools import chain


def bigrams(words):
    """Bigram generated from a seuence of words.
    
    :param iter words: the sequence of words
    :returns: adjacent word pairs
    
    """
    # In the very beginning there is no previous word,
    # but we start with a bigram (None, words[0]) to
    # indicate that words[0] has been seen at the beginning
    # of the text.
    previous = None
    
    # We also need to append None to the end of the text.
    # itertools.chain chains iterables together
    # (a list is an iterable in this regard, as is a
    # string and many other things).
    for word in chain(words, [None]):
        # A bigram is the previous word and the current word.
        # yield pauses the generator, "returns" a value to the
        # caller, and execution resumes here on the next call.
        yield previous, word
        
        previous = word
        

Check whether it works. Note that we don't actually care whether words is a sequence of strings; it might be a sequence of anything.

In [12]:
list(bigrams([1, 2, 3, 4]))
Out[12]:
[(None, 1), (1, 2), (2, 3), (3, 4), (4, None)]
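
Note that calling bigrams() by itself does no work: the function body only runs as the result is iterated over, and wrapping it in list() here is what forces the iteration.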

N-grams

It would be cool to have a general generator for n-grams.

In [13]:
from collections import deque


def ngrams(words, ngram_len=2):
    """Generate ngrmas from a sequence of word.
    
    :param iter words: the sequence of words
    :param int ngram_len: the length of the n-grams to be generated
    
    :returns: n-grams
    
    """
    # collections.deque might be seen as a list of limited size.
    # If an element is appended when the deque is full, an element
    # from the other side of the deque is removed, keeping the
    # length of the deque constant.
    #
    # Append dummy tokens marking the beginning of the sequence.
    ngram = deque([None] * ngram_len, maxlen=ngram_len)
    
    for word in chain(words, [None] * (ngram_len - 1)):
        ngram.append(word)
        yield tuple(ngram)
In [14]:
list(ngrams('abcd', ngram_len=3))
Out[14]:
[(None, None, 'a'),
 (None, 'a', 'b'),
 ('a', 'b', 'c'),
 ('b', 'c', 'd'),
 ('c', 'd', None),
 ('d', None, None)]
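
To make the role of maxlen concrete, here is a small illustration of how a bounded deque discards elements from the opposite end as new ones are appended:

from collections import deque

window = deque([None, None], maxlen=2)
window.append('a')   # deque([None, 'a'])
window.append('b')   # deque(['a', 'b'])
window.append('c')   # deque(['b', 'c']) -- the oldest element is dropped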

Read a file word by word

We have a generator that produces n-grams, but we are not yet able to get a sequence of words from a corpus, which so far is just a text file.

In [15]:
def read_words(f_name):
    """Read a file word by word."""
    with open(f_name) as f:
        for line in f:
            # Tokenization is a difficult task; here
            # a word is anything between two spaces.
            # split() without arguments also ignores the
            # surrounding whitespace, so no strip() is needed.
            for word in line.split():
                yield word
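
Since read_words() is a generator, the file is opened lazily and read line by line only while the words are being consumed; the with statement closes it once the generator is exhausted.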

Let's use Alice's Adventures in Wonderland by Lewis Carroll as a corpus.

We download it from the Internet ...

In [16]:
from urllib import urlretrieve


f_name, _ = urlretrieve('http://eecs.io/static/notebooks/pg11.txt')

... and it is stored in a temporary file whose path urlretrieve() returns.

In [17]:
f_name
Out[17]:
'/var/folders/j5/m6pcb_9n6c1gqssrbtzmdmcc0000gn/T/tmp0LsuU_.txt'

Now we are ready to get a sequence of words from the corpus. It's too big to be shown here, so only the first 10 words are shown.

In [18]:
from itertools import islice


print ' '.join(islice(read_words(f_name), 10))
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll This

We can chain iterators together as processing blocks in an information flow.

In [19]:
for w1, w2 in islice(bigrams(read_words(f_name)), 10):
    print w1, w2
None Project
Project Gutenberg's
Gutenberg's Alice's
Alice's Adventures
Adventures in
in Wonderland,
Wonderland, by
by Lewis
Lewis Carroll
Carroll This

Task: clean up the words

Even though the tokenization in read_words() is very simple, we can still filter out non-letter characters. Lowercasing the words might be a good idea as well.

In [20]:
def clean_words(words):
    """Lowercase every word in a sequence of words."""
    for word in words:
        yield word.lower()
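
The version above only lowercases. A possible extension that also strips non-letter characters could look like the sketch below (the name strict_clean_words is just for illustration; it is not used for the counts that follow, which is why punctuation still shows up in them):

import re


def strict_clean_words(words):
    """Lowercase words and drop any character that is not a letter."""
    for word in words:
        word = re.sub('[^a-z]', '', word.lower())
        if word:  # skip tokens that were pure punctuation or digits
            yield word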

Counting n-grams

It would be cool if we managed to pump the bigrams into a database; then counting elements would be a matter of seconds (given that you know SQL). Luckily, there is pandas, which lets us process data frames (tables) the way we want: http://pandas.pydata.org/pandas-docs/stable/groupby.html.

In [21]:
import pandas as pd
In [22]:
pd.DataFrame?
In [23]:
bigram_frame = pd.DataFrame(ngrams(clean_words(read_words(f_name)), ngram_len=2), columns=('Word 1', 'Word 2'))
In [24]:
bigram_frame.head()
Out[24]:
        Word 1       Word 2
0         None      project
1      project  gutenberg's
2  gutenberg's      alice's
3      alice's   adventures
4   adventures           in

Count unique bigrams

In [25]:
bigram_frame['count'] = 1  # Create a new column
In [26]:
bigram_counts = bigram_frame.groupby(('Word 1', 'Word 2')).sum()
In [27]:
bigram_counts.sort('count', ascending=False, inplace=True)
In [28]:
bigram_counts.head()
Out[28]:
               count
Word 1 Word 2
said   the       207
of     the       154
in     a         101
       the        92
to     the        84
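
The same counts can be obtained a bit more directly with groupby(...).size() instead of the dummy count column; note also that in recent pandas versions DataFrame.sort() has been replaced by sort_values(). A sketch under those assumptions:

bigram_counts = (bigram_frame
                 .groupby(['Word 1', 'Word 2'])
                 .size()
                 .to_frame('count')
                 .sort_values('count', ascending=False))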

Dictionary size

In [29]:
dictionary = pd.DataFrame(clean_words(read_words(f_name)), columns=('Word', ))
In [30]:
dictionary['count'] = 1
In [31]:
dictionary = dictionary.groupby('Word').sum().sort('count', ascending=False)
In [32]:
len(dictionary)
Out[32]:
5581
In [33]:
dictionary.head()
Out[33]:
       count
Word
the     1777
and      833
to       782
a        670
of       610
In [44]:
dictionary.plot().loglog()  # Check out Seaborn http://web.stanford.edu/~mwaskom/software/seaborn/ to get nicer plots.
Out[44]:
[]
[figure: log-log plot of the sorted word counts]

Count bigrams that were seen more than n times

In [35]:
len(bigram_counts[bigram_counts['count'] > 5])
Out[35]:
498
In [36]:
len(bigram_counts[bigram_counts['count'] > 4])
Out[36]:
672

Joint probability
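
The joint probability of a bigram is estimated here as its relative frequency: P(w1, w2) = count(w1, w2) / N, where N is the total number of bigrams in the corpus.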

In [37]:
bigram_counts['Joint probability'] = bigram_counts['count'] / bigram_counts['count'].sum()
In [38]:
bigram_counts.head()
Out[38]:
               count  Joint probability
Word 1 Word 2
said   the       207           0.007026
of     the       154           0.005227
in     a         101           0.003428
       the        92           0.003123
to     the        84           0.002851
In [39]:
bigram_counts.loc['to'].head()
Out[39]:
          count  Joint probability
Word 2
the          84           0.002851
be           49           0.001663
herself,     25           0.000849
see          23           0.000781
get          18           0.000611

Conditional probability
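
Here the bigram count is normalised by the count of the first word: P(w2 | w1) = count(w1, w2) / count(w1).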

In [40]:
bigram_counts.head()
Out[40]:
               count  Joint probability
Word 1 Word 2
said   the       207           0.007026
of     the       154           0.005227
in     a         101           0.003428
       the        92           0.003123
to     the        84           0.002851
In [41]:
bigram_counts['Word 1 count'] = dictionary.loc[bigram_counts.index.get_level_values('Word 1')]['count'].values
In [42]:
bigram_counts['Conditional probability'] = bigram_counts['count'] / bigram_counts['Word 1 count']
In [43]:
bigram_counts.loc['she'].sort('Conditional probability', ascending=False).head()
Out[43]:
       count  Joint probability  Word 1 count  Conditional probability
Word 2
had       57           0.001935           518                 0.110039
was       51           0.001731           518                 0.098456
said      28           0.000950           518                 0.054054
went      28           0.000950           518                 0.054054
could     19           0.000645           518                 0.036680

Discounting

In [ ]:
 

Model evaluation

Get two corpora drawn from different domains (for example, Russian classics and novels about Clarissa), and divide each into a training and a test set. Build a language model on the training data for each domain. Then calculate the cross-entropy figures for the test sets using both the language model trained on that domain and the other language model. How much do the cross-entropy estimates differ?

This is Exercise 6.8 from Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
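
As a reminder, the per-word cross-entropy of a bigram model m on a test text w_1 ... w_N is usually estimated as H = -(1/N) * sum_i log2 m(w_i | w_{i-1}); the lower it is, the better the model predicts the test data.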

In [ ]:
 

Experiment with different tokenization methods. Try to implement the experiments in such a way that it is easy to

  • add new models
  • vary model parameters