Biggish data processing

Word similarity is the core notion in distributional semantics, where word meanings are represented as vectors. In such a vector space, word similarity is modeled as the distance between two vectors. There are many datasets for evaluating distributional models, for example SimLex-999.

The task

Predict word similarity using co-occurrence based distributional semantic methods.

We are going to exploit Zellig Harris's intuition that semantically similar words tend to appear in similar contexts, in the following manner: given a large piece of text, for every word we count its co-occurrences with other words in a symmetric window of size N (for N = 5, the 5 words before the word and the 5 words after it). The word in the middle of a window is referred to as the target word; the words before and after it are the context words.
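The windowing scheme can be illustrated on a toy sentence (a minimal sketch; the slicing helper below is only for illustration and is not part of the corpus code):

```python
# For the target "fox" with a symmetric window of N = 2,
# the context words are the 2 words before and the 2 words after.
words = 'the quick brown fox jumps over the lazy dog'.split()
N = 2
i = words.index('fox')
context = words[max(0, i - N):i] + words[i + 1:i + 1 + N]
print(context)  # ['quick', 'brown', 'jumps', 'over']
```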

Refer to the Idea section for more details.

The data

Download a bunch of books

In [2]:
from urllib import urlretrieve

f_names = tuple(urlretrieve('{}'.format(f))[0]
                for f in ['pg11.txt', 'pg2600.txt', 'pg2554.txt',
                          'pg9296.txt', 'pg9798.txt', 'pg9881.txt'])

Because of some restrictions (the worker processes we use later can only call functions that are importable from a module), we are going to store the functions and generators we define in files. The basic corpus reading generators are stored in util.py.

In [3]:
from itertools import chain

def read_words(f_name):
    """Read a file word by word."""
    with open(f_name) as f:
        for line in f:
            # Tokenization is a difficult task; here,
            # a word is anything between two spaces.
            for word in line.split():
                yield word

def clean_words(words):
    """Clean up words."""
    for word in words:
        w = ''.join(ch for ch in word.lower() if ch.isalpha())

        if w:
            yield w

def corpus(f_names):
    """Treat a collection of files as a single resource."""
    return chain.from_iterable(clean_words(read_words(f)) for f in f_names)
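A quick sanity check of the cleaning step (a self-contained copy of clean_words; the sample tokens are made up):

```python
def clean_words(words):
    """Lowercase each word and keep only its alphabetic characters."""
    for word in words:
        w = ''.join(ch for ch in word.lower() if ch.isalpha())
        if w:  # drop tokens that contained no letters at all
            yield w

print(list(clean_words(['Hello,', 'WORLD!', '--', '42'])))
# ['hello', 'world']
```

Note that purely non-alphabetic tokens such as '--' and '42' are dropped entirely.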

Count how many words there are in the corpus.

In [4]:
# Before using the generators and functions defined in files,
# we enable the autoreload extension, so IPython reloads
# imported things when the source files change.
%load_ext autoreload
%autoreload 2

# We import the corpus() function defined previously in the file (module) util.py
from util import corpus


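Counting tokens is then a single pass over the generator, e.g. `sum(1 for _ in corpus(f_names))`. A self-contained sketch with a toy stand-in for corpus() (the sample texts are made up):

```python
def corpus(texts):
    """Toy stand-in for the corpus() generator defined above."""
    for text in texts:
        for word in text.split():
            yield word.lower()

# sum(1 for _ in ...) consumes the generator lazily,
# without materializing the whole corpus in memory.
n_tokens = sum(1 for _ in corpus(['The cat sat', 'on the mat']))
print(n_tokens)  # 6
```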
Count how many distinct words there are.

In [5]:

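Distinct words can be collected into a set, which deduplicates as the corpus streams through (again with a toy stand-in for corpus()):

```python
def corpus(texts):
    """Toy stand-in for the corpus() generator defined above."""
    for text in texts:
        for word in text.split():
            yield word.lower()

vocabulary = set(corpus(['the cat sat on the mat']))
print(len(vocabulary))  # 5 -- 'the' appears twice
```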

Implement a function that yields co-occurrence pairs for a given window. E.g.

>>> list(co_occurrence('abcde', 2))
[('a', 'b'),
 ('a', 'c'),
 ('b', 'a'),
 ('b', 'c'),
 ('b', 'd'),
 ('c', 'a'),
 ('c', 'b'),
 ('c', 'd'),
 ('c', 'e'),
 ('d', 'b'),
 ('d', 'c'),
 ('d', 'e'),
 ('e', 'c'),
 ('e', 'd')]
In [6]:
from collections import deque
from itertools import islice, chain

def cooccurrence(words, window_size=5):
    """Yield co-occurrence pairs in an iterable of words."""
    words = iter(words)

    before = deque([], maxlen=window_size)
    after = deque(islice(words, window_size))

    while after:
        try:
            after.append(next(words))
        except StopIteration:
            pass  # There are no more words.

        target = after.popleft()

        for context in chain(before, after):
            yield target, context

        before.append(target)

In [7]:
from cooccurrence import cooccurrence

list(cooccurrence('abcd', 2))
[('a', 'b'),
 ('a', 'c'),
 ('b', 'a'),
 ('b', 'c'),
 ('b', 'd'),
 ('c', 'a'),
 ('c', 'b'),
 ('c', 'd'),
 ('d', 'b'),
 ('d', 'c')]

Count co-occurrence pairs

In [8]:
import pandas as pd

from cooccurrence import cooccurrence

def count_cooccurrence(words):
    """Count co-occurrence pairs.

    :param iter words: an iterable of words.
    :return: a pandas.DataFrame where `target` and `context`
             are the index columns and `count` is a data column.
    """
    frame = pd.DataFrame(
        cooccurrence(words),
        columns=('target', 'context'),
    )
    frame['count'] = 1
    return frame.groupby(('target', 'context')).sum()

It takes some time (12 seconds on my machine) to retrieve the co-occurrence counts of a relatively small collection (1 million tokens). In real life, much larger data sets are used; for example, Wikipedia is about 2 billion tokens.

In [9]:
from count_cooccurrence import count_cooccurrence

%time cooccurrence_frame = count_cooccurrence(corpus(f_names))
CPU times: user 10.8 s, sys: 1.61 s, total: 12.4 s
Wall time: 12.6 s
In [10]:
cooccurrence_frame.sort('count', ascending=False).head()
                count
target  context
the     the     26898
of      the     23753
the     of      23753
        and     19269
and     the     19269

Parallelizing computation over multiple cores

Most modern CPUs have several cores, meaning that they can perform several computations at the same time.

In our example, we could compute the co-occurrence counts independently for each file in parallel and then sum them up. Note, however, that the result won't be identical to count_cooccurrence(corpus(f_names)). Why? Does it matter? Which approach is better?
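To see where the difference comes from, compare the two strategies on a toy example (a hypothetical helper with window size 1, not the implementation above):

```python
from collections import Counter

def pairs(words, n=1):
    """Naive co-occurrence pairs with a symmetric window of size n."""
    result = []
    for i, target in enumerate(words):
        for context in words[max(0, i - n):i] + words[i + 1:i + 1 + n]:
            result.append((target, context))
    return result

joint = Counter(pairs(['a', 'b', 'c', 'd']))
per_file = Counter(pairs(['a', 'b'])) + Counter(pairs(['c', 'd']))

# Pairs that span the boundary between the two "files"
# appear only in the joint counts.
print(joint - per_file)  # Counter({('b', 'c'): 1, ('c', 'b'): 1})
```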

Before scaling our implementation to several CPU cores, we need to get familiar with the map() function.

In [11]:

In short, map() takes two arguments: a function and an iterable. It applies the function to each element of the iterable. For example, to lowercase a list of letters, one could write this:

In [12]:
from string import lower

list(map(lower, ['A', 'B', 'C']))
['a', 'b', 'c']

To spread the computation over several cores, we can use multiprocessing.Pool, which provides a map method as well:

In [13]:
from multiprocessing import Pool

pool = Pool()
list(pool.map(lower, ['A', 'B', 'C']))
['a', 'b', 'c']

To spread the co-occurrence counting over several cores, we need to come up with a function that takes a file name and returns a DataFrame with co-occurrence counts.

In [14]:
from count_cooccurrence import count_cooccurrence
from util import corpus

def count_cooccurrence_file(f_name):
    return count_cooccurrence(corpus([f_name]))

Serial implementation

In [15]:
from count_cooccurrence_file import count_cooccurrence_file

# Read each file twice, to make the parallel implementation's improvement more evident!
%time len(list(map(count_cooccurrence_file, f_names * 2)))
CPU times: user 20 s, sys: 2.34 s, total: 22.3 s
Wall time: 22.3 s

Parallel implementation

In [16]:
%time len(list(pool.map(count_cooccurrence_file, f_names * 2)))
CPU times: user 212 ms, sys: 222 ms, total: 434 ms
Wall time: 13 s

Merging results together

In [17]:
import pandas as pd

map_result = list(pool.map(count_cooccurrence_file, f_names))
In [18]:
cooccurrence_counts = (
    pd.concat(map_result)
    .groupby(level=('target', 'context'))
    .sum()
)
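The merge pattern here — concatenate the per-file frames, then group by the (target, context) index levels and sum — can be checked on two toy frames (made-up counts):

```python
import pandas as pd

# Two per-file count frames with an overlapping (target, context) pair.
a = pd.DataFrame({'target': ['the', 'a'], 'context': ['cat', 'dog'],
                  'count': [2, 1]}).set_index(['target', 'context'])
b = pd.DataFrame({'target': ['the'], 'context': ['cat'],
                  'count': [3]}).set_index(['target', 'context'])

# Overlapping index entries are summed, disjoint ones pass through.
merged = pd.concat([a, b]).groupby(level=['target', 'context']).sum()
print(merged.loc[('the', 'cat'), 'count'])  # 5
```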
In [19]:
cooccurrence_counts.loc[['morning', 'evening']].sort('count', ascending=False).head()
                 count
target  context
morning the        251
evening the        240
morning in         110
        and         97
evening that        94

Building a semantic space

In [20]:
toy_space = (
    cooccurrence_counts.loc[['morning', 'evening', 'john', 'mary', 'red', 'green']]  # select only some target words
    .reset_index()  # get rid of the index, so pivoting works
    .pivot(index='target', columns='context', values='count')
    .fillna(0)  # pairs that never co-occurred get a zero count
)
In [26]:
toy_space[['a', 'the', 'book', 'run']]
context   a  the  book  run
target
evening  46  240     0    1
green    30   43     0    0
john      2    5     0    0
mary     68  213     2    0
morning  66  251     0    0
red      74   72     0    0

Semantic similarity

In [1]:
import pandas as pd

from sklearn.metrics import pairwise
In [28]:
pd.DataFrame(pairwise.cosine_similarity(toy_space.values), index=toy_space.index, columns=toy_space.index)
target evening green john mary morning red
evening 1.000000 0.798374 0.307136 0.620545 0.952338 0.710514
green 0.798374 1.000000 0.240387 0.560840 0.785051 0.876843
john 0.307136 0.240387 1.000000 0.286026 0.350444 0.242568
mary 0.620545 0.560840 0.286026 1.000000 0.620141 0.559805
morning 0.952338 0.785051 0.350444 0.620141 1.000000 0.705259
red 0.710514 0.876843 0.242568 0.559805 0.705259 1.000000
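pairwise.cosine_similarity computes, for each pair of rows, the dot product of the two count vectors divided by the product of their norms; a minimal NumPy version of the same measure:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])  # same direction as u
w = np.array([0.0, 0.0, 3.0])  # orthogonal to u

print(round(cosine(u, v), 6))  # 1.0
print(cosine(u, w))            # 0.0
```

Because cosine ignores vector length, it is insensitive to raw frequency differences between words, which is why it is the usual choice for count vectors.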


In [36]:
simlex = pd.read_csv('SimLex-999.txt', sep='\t')
In [37]:
simlex.head()
word1 word2 POS SimLex999 conc(w1) conc(w2) concQ Assoc(USF) SimAssoc333 SD(SimLex)
0 old new A 1.58 2.72 2.81 2 7.25 1 0.41
1 smart intelligent A 9.20 1.75 2.46 1 7.11 1 0.67
2 hard difficult A 8.77 3.76 2.21 2 5.94 1 1.19
3 happy cheerful A 9.55 2.56 2.34 1 5.85 1 2.18
4 hard easy A 0.95 3.76 2.07 2 5.82 1 0.93
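Models are conventionally scored by the Spearman correlation between the predicted similarities and the SimLex999 judgments (in practice scipy.stats.spearmanr does this). A dependency-free sketch of Spearman's rho on made-up scores, assuming no ties:

```python
def ranks(xs):
    """1-based ranks of the values in xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        result[i] = r
    return result

def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical model scores vs. hypothetical human judgments.
model = [0.9, 0.1, 0.5, 0.7]
human = [8.5, 1.2, 5.0, 6.0]
print(spearman(model, human))  # 1.0 -- identical rankings
```

Rank correlation is used rather than Pearson because only the ordering of the predicted similarities matters, not their absolute scale.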