# Iterators¶

The concept of iterating over containers is very powerful. Python defines a protocol that specifies how iteration is performed. To see it in action, let's get the ASCII letters and enumerate them.

Press Shift+Enter to execute a cell.

In [1]:
from string import lowercase

In [2]:
lowercase

Out[2]:
'abcdefghijklmnopqrstuvwxyz'

enumerate() enumerates a sequence of things. Appending ? or ?? to a function, module or object shows its help message.

In [3]:
enumerate?


Create an iterator

In [4]:
enumerated_letters = enumerate(lowercase)


next() returns the next value from an iterator, which in the case of enumerate() are (index, value) pairs.

In [5]:
next(enumerated_letters)

Out[5]:
(0, 'a')
In [6]:
next(enumerated_letters)

Out[6]:
(1, 'b')

One way of accessing the values in a tuple is to assign all of them to variables (tuple unpacking).

In [7]:
position, letter = next(enumerated_letters)


If we need to show a user something that mixes text and data, .format() is handy. Defining a string template with data placeholders and formatting it with values avoids string concatenation using +.

In [8]:
'{letter} is enumerated as {position}'.format(letter=letter, position=position)

Out[8]:
'c is enumerated as 2'

In case a sequence of strings has to be joined, use .join().

In [9]:
print ', '.join('{l} is enumerated as {p}'.format(l=l, p=p) for p, l in enumerated_letters)

d is enumerated as 3, e is enumerated as 4, f is enumerated as 5, g is enumerated as 6, h is enumerated as 7, i is enumerated as 8, j is enumerated as 9, k is enumerated as 10, l is enumerated as 11, m is enumerated as 12, n is enumerated as 13, o is enumerated as 14, p is enumerated as 15, q is enumerated as 16, r is enumerated as 17, s is enumerated as 18, t is enumerated as 19, u is enumerated as 20, v is enumerated as 21, w is enumerated as 22, x is enumerated as 23, y is enumerated as 24, z is enumerated as 25


Once an iterator is exhausted, a StopIteration exception is raised. However, it's possible to have infinite iterators.

In [10]:
next(enumerated_letters)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-10-6592de41862c> in <module>()
----> 1 next(enumerated_letters)

StopIteration: 
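itertools.count(), for instance, counts forever and never raises StopIteration; and next() accepts an optional default that is returned instead of raising the exception. A small sketch (Python 3 syntax):

```python
from itertools import count, islice

# count() is an infinite iterator: it never raises StopIteration,
# so we must cut it off ourselves, e.g. with islice().
counter = count(start=1)
first_five = list(islice(counter, 5))  # take only the first five values
# first_five == [1, 2, 3, 4, 5]

# next() with a default returns the fallback value instead of
# raising StopIteration once the iterator is exhausted.
letters = iter('ab')
values = [next(letters), next(letters), next(letters, 'exhausted')]
# values == ['a', 'b', 'exhausted']
```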

# Bigrams¶

Given a sequence of words (or any iterable), bigrams() returns (yields) its bigrams. Here is another way of defining an iterator (actually, a generator):

In [11]:
from itertools import chain

def bigrams(words):
    """Bigrams generated from a sequence of words.

    :param iter words: the sequence of words

    """
    # In the very beginning, there is no previous word;
    # this indicates that words[0] has been seen at the
    # beginning of the text.
    previous = None

    # We also need to append None to the end of the text.
    # itertools.chain chains iterables together
    # (yes, a list in this regard is an iterable, as a
    # string is, and many other things).
    for word in chain(words, [None]):
        # A bigram is the previous word and the current word.
        # yield is used to pause an iterator, "return" a value
        # to the caller and later resume the caller's execution.
        yield previous, word

        previous = word



Check whether it works. Note that we don't actually care whether words is a sequence of strings; it might be a sequence of anything.

In [12]:
list(bigrams([1, 2, 3, 4]))

Out[12]:
[(None, 1), (1, 2), (2, 3), (3, 4), (4, None)]
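The same pairing can be sketched without an explicit generator by zipping the sequence with a shifted copy of itself. This is a hypothetical alternative to bigrams() above (not used in the rest of the notebook); note that it materializes the words into a list first, so it only suits inputs that fit in memory:

```python
from itertools import chain

def bigrams_zip(words):
    """Pair each word with its successor by zipping two shifted copies."""
    words = list(words)  # materialize so we can iterate twice
    return zip(chain([None], words), chain(words, [None]))

result = list(bigrams_zip([1, 2, 3, 4]))
# result == [(None, 1), (1, 2), (2, 3), (3, 4), (4, None)]
```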

# N-grams¶

It would be cool to have a general generator for n-grams.

In [13]:
from collections import deque

def ngrams(words, ngram_len=2):
    """Generate ngrams from a sequence of words.

    :param iter words: the sequence of words
    :param int ngram_len: the length of the ngrams to be generated

    :returns: ngrams

    """
    # collections.deque might be seen as a list of limited size.
    # If an element is appended when a deque is full, an element
    # from the other side of the deque is removed, keeping the
    # length of the deque constant.
    #
    # Append dummy tokens marking the beginning of the sequence.
    ngram = deque([None] * ngram_len, maxlen=ngram_len)

    for word in chain(words, [None] * (ngram_len - 1)):
        ngram.append(word)
        yield tuple(ngram)

In [14]:
list(ngrams('abcd', ngram_len=3))

Out[14]:
[(None, None, 'a'),
(None, 'a', 'b'),
('a', 'b', 'c'),
('b', 'c', 'd'),
('c', 'd', None),
('d', None, None)]
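For inputs that fit in memory, the same windows can be sketched by zipping shifted slices of the padded sequence. This is a hypothetical list-based alternative to the deque version, equivalent on the example above:

```python
def ngrams_zip(words, ngram_len=2):
    """N-grams via zip over shifted slices of the padded sequence."""
    pad = [None] * (ngram_len - 1)
    padded = pad + list(words) + pad
    # zip stops at the shortest slice, which yields exactly the
    # sliding windows of length ngram_len.
    return zip(*(padded[i:] for i in range(ngram_len)))

trigrams = list(ngrams_zip('abcd', ngram_len=3))
# trigrams[0] == (None, None, 'a'), trigrams[-1] == ('d', None, None)
```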

# Read a file word by word¶

We have a generator that produces ngrams, but we are not yet able to get a sequence of words from a corpus, which so far is just a text file.

In [15]:
def read_words(f_name):
    """Read a file word by word."""
    with open(f_name) as f:
        for line in f:
            line = line.strip()

            # Tokenization is a difficult task; here
            # a word is anything between two spaces.
            for word in line.split():
                yield word
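A quick way to check read_words() is to write a tiny file and read it back. The sketch below repeats the function so it is self-contained (Python 3 syntax):

```python
import os
import tempfile

def read_words(f_name):
    """Read a file word by word (same logic as above)."""
    with open(f_name) as f:
        for line in f:
            for word in line.split():
                yield word

# Write a tiny two-line corpus to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('Alice was beginning\nto get very tired\n')

words = list(read_words(f.name))
os.unlink(f.name)  # clean up the temporary file
# words == ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired']
```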


Let's use Alice's Adventures in Wonderland by Lewis Carroll as a corpus.

Other books are available from Project Gutenberg.

In [16]:
from urllib import urlretrieve

f_name, _ = urlretrieve('http://eecs.io/static/notebooks/pg11.txt')


... which stores it in a temporary file:

In [17]:
f_name

Out[17]:
'/var/folders/j5/m6pcb_9n6c1gqssrbtzmdmcc0000gn/T/tmp0LsuU_.txt'

Now we are ready to get a sequence of words from the corpus. It's too big to be shown here, so only the first 10 words are shown.

In [18]:
from itertools import islice

print ' '.join(islice(read_words(f_name), 10))


﻿Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll This


We can chain iterators as processing blocks in an information flow.

In [19]:
for w1, w2 in islice(bigrams(read_words(f_name)), 10):
    print w1, w2

None ﻿Project
﻿Project Gutenberg's
Gutenberg's Alice's
Alice's Adventures
Adventures in
in Wonderland,
Wonderland, by
by Lewis
Lewis Carroll
Carroll This


## Task: clean up the words¶

Even though the read_words() tokenization is very simple, we can filter out non-letters. Lowercasing the words might be a good idea as well.

In [20]:
def clean_words(words):
    for word in words:
        yield word.lower()
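A stricter variant could also drop the non-letter characters, as suggested above. The sketch below is hypothetical (the counts further down were produced with the simple lowercasing version):

```python
import re

def clean_words_strict(words):
    """Lowercase and drop any non-letter characters.

    Words that become empty (pure punctuation) are skipped.
    """
    for word in words:
        word = re.sub(r'[^a-z]+', '', word.lower())
        if word:
            yield word

cleaned = list(clean_words_strict(['Alice,', "wasn't", '--', 'Tired!']))
# cleaned == ['alice', 'wasnt', 'tired']
```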


# Counting ngrams¶

It would be cool if we managed to pump the bigrams into a database; then counting elements would be a matter of seconds (given that you know SQL). Luckily, there is Pandas, which lets us process data frames (tables) the way we want: http://pandas.pydata.org/pandas-docs/stable/groupby.html.

In [21]:
import pandas as pd

In [22]:
pd.DataFrame?

In [23]:
bigram_frame = pd.DataFrame(ngrams(clean_words(read_words(f_name)), ngram_len=2), columns=('Word 1', 'Word 2'))

In [24]:
bigram_frame.head()

Out[24]:
Word 1 Word 2
0 None ﻿project
1 ﻿project gutenberg's
2 gutenberg's alice's

Count unique bigrams

In [25]:
bigram_frame['count'] = 1  # Create a new column

In [26]:
bigram_counts = bigram_frame.groupby(('Word 1', 'Word 2')).sum()

In [27]:
bigram_counts.sort('count', ascending=False, inplace=True)

In [28]:
bigram_counts.head()

Out[28]:
count
Word 1 Word 2
said the 207
of the 154
in a 101
the 92
to the 84

# Dictionary size¶

In [29]:
dictionary = pd.DataFrame(clean_words(read_words(f_name)), columns=('Word', ))

In [30]:
dictionary['count'] = 1

In [31]:
dictionary = dictionary.groupby('Word').sum().sort('count', ascending=False)

In [32]:
len(dictionary)

Out[32]:
5581
In [33]:
dictionary.head()

Out[33]:
count
Word
the 1777
and 833
to 782
a 670
of 610
In [44]:
dictionary.plot().loglog()  # Check out Seaborn http://web.stanford.edu/~mwaskom/software/seaborn/ to get nicer plots.

Out[44]:
[]

# Count bigrams that were seen more than n times¶

In [35]:
len(bigram_counts[bigram_counts['count'] > 5])

Out[35]:
498
In [36]:
len(bigram_counts[bigram_counts['count'] > 4])

Out[36]:
672

# Joint probability¶

In [37]:
bigram_counts['Joint probability'] = bigram_counts['count'] / bigram_counts['count'].sum()

In [38]:
bigram_counts.head()

Out[38]:
count Joint probability
Word 1 Word 2
said the 207 0.007026
of the 154 0.005227
in a 101 0.003428
the 92 0.003123
to the 84 0.002851
In [39]:
bigram_counts.loc['to'].head()

Out[39]:
count Joint probability
Word 2
the 84 0.002851
be 49 0.001663
herself, 25 0.000849
see 23 0.000781
get 18 0.000611

# Conditional probability¶

In [40]:
bigram_counts.head()

Out[40]:
count Joint probability
Word 1 Word 2
said the 207 0.007026
of the 154 0.005227
in a 101 0.003428
the 92 0.003123
to the 84 0.002851
In [41]:
bigram_counts['Word 1 count'] = dictionary.loc[bigram_counts.index.get_level_values('Word 1')]['count'].values

In [42]:
bigram_counts['Conditional probability'] = bigram_counts['count'] / bigram_counts['Word 1 count']

In [43]:
bigram_counts.loc['she'].sort('Conditional probability', ascending=False).head()

Out[43]:
count Joint probability Word 1 count Conditional probability
Word 2
was 51 0.001731 518 0.098456
said 28 0.000950 518 0.054054
went 28 0.000950 518 0.054054
could 19 0.000645 518 0.036680
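The same estimate can be sketched without Pandas using collections.Counter: P(w2 | w1) = count(w1, w2) / count(w1). The pairs below are toy data, not the corpus (Python 3 division):

```python
from collections import Counter

def conditional_probs(pairs):
    """Estimate P(w2 | w1) as count(w1, w2) / count(w1) from a list of bigrams."""
    bigram_counts = Counter(pairs)
    first_word_counts = Counter(w1 for w1, _ in pairs)
    return {(w1, w2): count / first_word_counts[w1]
            for (w1, w2), count in bigram_counts.items()}

pairs = [('she', 'was'), ('she', 'said'), ('she', 'was'), ('he', 'said')]
probs = conditional_probs(pairs)
# probs[('she', 'was')] == 2 / 3, probs[('he', 'said')] == 1.0
```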

# Discounting¶
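One simple discounting scheme is add-one (Laplace) smoothing: every bigram count is incremented by one, so unseen bigrams get a small nonzero probability at the expense of the seen ones. A minimal sketch on toy data, with an assumed vocabulary size V (Python 3 division):

```python
from collections import Counter

def laplace_conditional(pairs, vocab_size):
    """Add-one smoothed estimate: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)."""
    bigram_counts = Counter(pairs)
    first_word_counts = Counter(w1 for w1, _ in pairs)

    def prob(w1, w2):
        return (bigram_counts[(w1, w2)] + 1) / (first_word_counts[w1] + vocab_size)

    return prob

pairs = [('she', 'was'), ('she', 'said'), ('she', 'was')]
prob = laplace_conditional(pairs, vocab_size=4)
# prob('she', 'was') == (2 + 1) / (3 + 4)
# prob('she', 'went') == (0 + 1) / (3 + 4) -- unseen, but not zero
```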

In [ ]:



# Model evaluation¶

Get two corpora drawn from different domains (for example, Russian classics and novels about Clarissa), and divide each into a training and a test set. Build language models on the training data for each domain. Then calculate the cross-entropy figures for the test sets using both the language model trained on that domain and the other language model. How much do the cross-entropy estimates differ?

This is Exercise 6.8 from Manning, Christopher D. "Foundations of statistical natural language processing". Ed. Hinrich Schütze. MIT press, 1999.
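Per-word cross-entropy is H = -(1/N) * sum over the test set of log2 P(w_i): the lower it is, the better the model predicts that domain. A minimal sketch with a toy unigram model (a simplification; the exercise asks for the bigram models built above):

```python
from math import log2

def cross_entropy(model_probs, test_words):
    """Per-word cross-entropy of a model over a test sequence, in bits."""
    return -sum(log2(model_probs[w]) for w in test_words) / len(test_words)

model = {'a': 0.5, 'b': 0.25, 'c': 0.25}
entropy = cross_entropy(model, ['a', 'a', 'b', 'c'])
# entropy == (1 + 1 + 2 + 2) / 4 == 1.5 bits
```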

In [ ]:



Experiment with different tokenisation methods. Try to implement the experiments in such a way that it is easy to