Python Tidbits for NLP¶

Anoop Sarkar¶

This is an extremely concise introduction to Python for programmers already proficient in at least one other programming language.

A slower and more thorough tutorial is the Python Tutorial by Guido van Rossum. Read it at least upto Chapter 10.

These code fragments are for Python version 3.x

Be agnostic of the operating system¶

Python code, especially for file system interaction, can be written so that it runs on many different operating systems. This makes your code more portable and easier to maintain as well.

In [2]:

import os
import sys
if sys.platform == 'win32':
    ROOT = os.path.splitdrive(os.path.abspath('.'))[0]
elif sys.platform == 'linux2' or sys.platform == 'darwin':
    ROOT = os.sep
else:
    raise ValueError("unknown operating system")
dictfile = os.path.join(ROOT, 'usr','share','dict','words')
print(dictfile)

/usr/share/dict/words

For loops¶

Use built in functions to create ranges.

In [3]:

for i in range(1,10,2):
    print(i)

Opening and closing file handles¶

Always open a file using the with statement because it closes the file at the end of the statement (even if there is an exception during interaction with the file system). A for loop can be used to iterate through lines using the file handle.

In [4]:

with open(dictfile, 'r') as fhandle:
    for line in fhandle:
        line = line.strip()
        if len(line) > 23:
            print(line)

antidisestablishmentarianism
formaldehydesulphoxylate
pathologicopsychological
scientificophilosophical
tetraiodophenolphthalein
thyroparathyroidectomize

List comprehensions¶

List comprehensions are very useful to replace a for-loop. Example below finds unique elements as a one line python program using the built-in set data structure.

In [5]:

x = ['a', 'b', 'c', 'd', 'a', 'b', 'c']
print([ i for i in set(x) ])

['a', 'd', 'b', 'c']

Also, you can use an 'if' statement in a list comprehension.

In [6]:

print([ i for i in set(x) if i != 'a'])

['d', 'b', 'c']

Using list comprehensions, the following Python code prints out the lowercased tokens of length greater than 15 from Sense and Sensibility (note that one of them occurs twice).

In [7]:

import nltk
longwords = [ word.lower() for word in nltk.corpus.gutenberg.words('austen-sense.txt') if len(word) > 15]
print(longwords)

['incomprehensible', 'incomprehensible', 'disinterestedness', 'companionableness', 'disqualifications']

Enumerate¶

enumerate is very useful when you want a counter variable for each element in a list.

In [8]:

x = ['a', 'c', 'b', 'd']
for (index,element) in enumerate(x): print(index, element)

0 a
1 c
2 b
3 d

Dictionary comprehensions¶

Dictionary comprehensions are just like list comprehensions except they let you build a dictionary instead of a list. Say we want to build a dictionary where the dictionary keys are lowercase ASCII characters and the values are the probabilities for each character. In the following we just assign a random probability to each lowercase character.

In [9]:

import string
import numpy
# set up a random probability distribution over lowercase ASCII characters
counts = [ numpy.random.random() for c in string.ascii_lowercase ]
total = sum(counts)
# the following is a dictionary comprehension
prob = { c: (counts[i] / total) for (i,c) in enumerate(string.ascii_lowercase) }
print(prob['e'])
print(prob['z'])

0.040804314032715325
0.04517300968758016

argmax¶

Often we wish to compute the argmax using a probability distribution. The argmax function returns the element that has the highest probability. $$\hat{x} = \arg\max_x P(x)$$

In [10]:

def P(c):
    return prob[c]
# the character with the highest probability is given by argmax_c P(c)
argmax_char = max(string.ascii_lowercase, key=P)
print(argmax_char, P(argmax_char))

w 0.07334002043038697

Formatted Strings¶

Formatted strings, where you want to insert a value into a string, where %s is a string value, %d is a decimal integer, %f is a floating point number.

In [11]:

print("%s = %d and %s = %f" % ("x", 10, "y", 0.0003))

x = 10 and y = 0.000300

In [12]:

print("The %(foo)s is %(bar)i." % {'foo': 'answer', 'bar':42})

The answer is 42.

In [13]:

print("The {foo} is {bar}".format(foo='answer', bar=42))

The answer is 42

Tuples¶

The builtin function 'tuple' can be used to create n-grams from a list of words.

In [14]:

words = ['a', 'good', 'book', 'is', 'all', 'you', 'need', '.']
print("print unigrams aka 1-grams: ", end='')
print(words)

print("print bigrams aka 2-grams: ", end='')
print([ tuple(words[i:i+2]) for i in range(len(words)-1) ])

print("print trigrams aka 3-grams: ", end=''),
print([ tuple(words[i:i+3]) for i in range(len(words)-2) ])

print unigrams aka 1-grams: ['a', 'good', 'book', 'is', 'all', 'you', 'need', '.']
print bigrams aka 2-grams: [('a', 'good'), ('good', 'book'), ('book', 'is'), ('is', 'all'), ('all', 'you'), ('you', 'need'), ('need', '.')]
print trigrams aka 3-grams: [('a', 'good', 'book'), ('good', 'book', 'is'), ('book', 'is', 'all'), ('is', 'all', 'you'), ('all', 'you', 'need'), ('you', 'need', '.')]

Sorting¶

The function itemgetter from the operator module in Python provides a concise way to sort on different tuple elements in a list of tuples. Note that itemgetter(1) is set to the 2nd component of the tuple, and used as a key to sort the tuples.

In [15]:

word_freq = [ ('the', 1223), ('a', 2413), ('Mr.', 450), ('Elton', 10) ]
print(word_freq)
from operator import itemgetter
word_freq.sort(key=itemgetter(1), reverse=True)
print(word_freq)

[('the', 1223), ('a', 2413), ('Mr.', 450), ('Elton', 10)]
[('a', 2413), ('the', 1223), ('Mr.', 450), ('Elton', 10)]

You can also use the built-in 'map' function to get the sorted values.

In [16]:

print(list(map(itemgetter(1), word_freq)))

[2413, 1223, 450, 10]

In [17]:

print(list(map(itemgetter(0), word_freq)))

['a', 'the', 'Mr.', 'Elton']

Classes¶

A class works pretty much like what you would expect from other languages such as C++ or Java. Methods of a class are determined by indentation. Each method that is part of the class must take at least one argument. The first argument of each method in a class is a pointer to the object derived from the class definition. By convention this first argument is typically called self and it is analogous but not exactly the same as the C++ this pointer.

In [18]:

class C:
    def foo(self):
        return self.a
    def bar(self, a):
        self.a = a
x = C()
x.bar('a')
print(x.foo())

Constructor and Destructor methods in a class¶

The magic method __init__ is the constructor method for the class and __del__ is the destructor method which is called by the garbage collector (Python is similar to Java -- it does not require explicit memory management).

In [19]:

from itertools import islice

class FileObject:
    '''Wrapper for file objects to make sure the file gets closed on deletion.'''

    def __init__(self, filename):
        self.file = open(filename, 'r')

    def __del__(self):
        self.file.close()
        del self.file

f = FileObject(dictfile) # dictfile is defined in an earlier cell
for line in islice(f.file, 5):
    print(line,end='')
del f # get rid of f -- this is typically not explicitly done in Python. trust the garbage collector to do it for you.

A
a
aa
aal
aalii

Iterators¶

A class is an iterator if it has a __iter__ and next method defined as shown in this example.

In [20]:

# circular queue 
class cq:
  q = [] # needs to be initialized with a list
  def __init__(self,q): # the argument q is a list 
    self.q = q 
  def __iter__(self): 
    return self 
  def __next__(self): 
    r = self.q[0]
    self.q = self.q[1:] + [r]  # rotate the list
    return r

x = cq([1,2,3])
print(x.__next__())
print(x.__next__())
print(x.__next__())
print(x.__next__())
print(x.__next__())

Magic!¶

Methods like __iter__ in the above code for cq is called a magic method. Here is a guide to all Python magic methods

Iteration tools¶

The function islice allows you to take a slice of an iterator.

In [21]:

from itertools import islice
x = cq([1,2,3])
for i in islice(x, 5):
  print(i)

y = cq([1,2,3,4,5])
for i in islice(y,3): print(i)
z = [i for i in islice(y,10)]
print(z)

1
2
3
1
2
1
2
3
[4, 5, 1, 2, 3, 4, 5, 1, 2, 3]

Convenient Dictionaries¶

The class defaultdict allows convenient insertion into a dictionary. You do not need to check if a key exists first before updating the value when using defaultdict.

In [22]:

from collections import defaultdict
foo = defaultdict(int)
bar = defaultdict(list)
foo['a'] += 1
foo['a'] += 1
bar['b'].append(1)
bar['b'].append(2)
print(foo, bar)

defaultdict(<class 'int'>, {'a': 2}) defaultdict(<class 'list'>, {'b': [1, 2]})

Generators¶

Use generators instead of lists. Generators behave like streams which you can iterate over while lists are statically allocated.

In [23]:

def sum_of_squares(n):
   v = 0
   for i in range(1,n+1):
       v += i*i
       yield v
for i in sum_of_squares(10): print(i)

Generator expressions¶

In [24]:

a = [1,2,3,4] # this is a list
b = [2*x for x in a] # this is a list comprehension
c = (2*x for x in a) # this is a generator, not a list. it creates an iterator object
print(b)
print(c)

[2, 4, 6, 8]
<generator object <genexpr> at 0x17b552730>

In [25]:

n = ((a,b) for a in range(0,2) for b in range(4,6))
for i in n:
    print(i)

(0, 4)
(0, 5)
(1, 4)
(1, 5)

More on Generators¶

Read Generator Tricks for Systems Programmers by David Beazley.

Use built-in functions¶

In [26]:

# from Part 2 of Peter Norvig's excellent essay on xkcd 1313 
# http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb
import re
searcher = re.compile('^a.o').search
data = frozenset('''all particularly just less indeed over soon course still yet before 
    certainly how actually better to finally pretty then around very early nearly now 
    always either where right often hard back home best out even away enough probably 
    ever recently never however here quite alone both about ok ahead of usually already 
    suddenly down simply long directly little fast there only least quickly much forward 
    today more on exactly else up sometimes eventually almost thus tonight as in close 
    clearly again no perhaps that when also instead really most why ago off 
    especially maybe later well together rather so far once'''.split())
%timeit { s for s in data if searcher(s) }
%timeit set(filter(searcher, data))

9.43 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
8.34 µs ± 9.61 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So about 18% faster to use the built-in command filter instead of the set comprehension with an if statement.

Unpacking tuples and dictionaries¶

In [27]:

def concat(x, y): 
    return x + y 

foo = ('A', 'B')
bar = {'y': 'B', 'x': 'A'}

print(concat(*foo))
print(concat(**bar))

AB
AB

Exceptions¶

In [28]:

def doit(x,y):
  if x < 0:
    raise ValueError("x should be >= 0")
  return y

print(doit(0,10))

In [29]:

# This will raise an exception, if you uncomment the following line:
# print doit(-1,10)

Advanced Features¶

Easter Eggs¶

In [30]:

import this, codecs
print(this.s)
print(codecs.encode(this.s, "rot-13")) # -> uryyb

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Gur Mra bs Clguba, ol Gvz Crgref

Ornhgvshy vf orggre guna htyl.
Rkcyvpvg vf orggre guna vzcyvpvg.
Fvzcyr vf orggre guna pbzcyrk.
Pbzcyrk vf orggre guna pbzcyvpngrq.
Syng vf orggre guna arfgrq.
Fcnefr vf orggre guna qrafr.
Ernqnovyvgl pbhagf.
Fcrpvny pnfrf nera'g fcrpvny rabhtu gb oernx gur ehyrf.
Nygubhtu cenpgvpnyvgl orngf chevgl.
Reebef fubhyq arire cnff fvyragyl.
Hayrff rkcyvpvgyl fvyraprq.
Va gur snpr bs nzovthvgl, ershfr gur grzcgngvba gb thrff.
Gurer fubhyq or bar-- naq cersrenoyl bayl bar --boivbhf jnl gb qb vg.
Nygubhtu gung jnl znl abg or boivbhf ng svefg hayrff lbh'er Qhgpu.
Abj vf orggre guna arire.
Nygubhtu arire vf bsgra orggre guna *evtug* abj.
Vs gur vzcyrzragngvba vf uneq gb rkcynva, vg'f n onq vqrn.
Vs gur vzcyrzragngvba vf rnfl gb rkcynva, vg znl or n tbbq vqrn.
Anzrfcnprf ner bar ubaxvat terng vqrn -- yrg'f qb zber bs gubfr!
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Uncomment the following easter eggs to see what happens.

In [31]:

# from __future__ import braces

In [32]:

# import __phello__

In [33]:

import antigravity

Scoping and Namespaces in Python¶

This section is strictly for programming language wonks. Scoping in Python can sometimes be tricky.

In [34]:

x = 'a'
class wat:
    x = 'b'
    def __init__(self):
        print("1:", x)
        print("2:", self.x)
f = wat()
print("3:", f.x)

1: a
2: b
3: b

In [35]:

x = 'a'
print("1:", list(x for x in (1,2)), x)
print("2:", [x for x in (1,2)], x)

1: [1, 2] a
2: [1, 2] a

For more visit http://programmingwats.tumblr.com/

Function Decorators¶

Python has syntactic support for function composition.

In [36]:

## function composition of foo with bar: foo(bar(args)) using a decorator

def foo(f):
    def decorator_func(*args, **keyword_args):
        f(*args, **keyword_args)
        print("Desecration is the smile on my face\n")
    return decorator_func

@foo
def bar(n):
    print(n)
bar("Never in the wrong time or wrong place")

## function composition directly by calling foo(bar_bar(args))

def bar_bar(n):
    print(n)

# notice how I give a function as an argument to foo which returns a function
# and I then provide an argument to that function returned by foo
foo(bar_bar)("My face my face, hey")

Never in the wrong time or wrong place
Desecration is the smile on my face

My face my face, hey
Desecration is the smile on my face

End¶

In [37]:

from IPython.core.display import HTML


def css_styling():
    styles = open("../css/notebook.css", "r").read()
    return HTML(styles)
css_styling()

Out[37]: