Lazy sequences working hard

@thomasguestwordaligned.org@clinithinkwales

A lazy sequence defers supplying values to clients until they're needed. This makes it possible to work with large and even infinite sequences using limited memory.

Python provides language and library level support for lazy sequences.

In [2]:
G = 10 ** 100
R = range(G)
R
Out[2]:
range(0, 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
In [3]:
def take(xs, n):
    xs = iter(xs)
    result = [next(xs) for _ in range(n)]
    return result

rs = iter(R)
print(take(rs, 10), '\n', next(rs))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
 10

Procrastination

The call to range() creates a range object. This isn't a collection in memory. Instead, the object stores start, stop, step values, so it can be iterated over etc.

Activity is deferred until necessary.

This is lazy working, not lazy thinking.

Historical note

In Python 2 range() created a list and xrange() behaved lazily. Python 3 simplified things: there's no need to have both. Lazy is sufficient.

In [4]:
list(range(10))
Out[4]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

See also: map(), filter(), dict.keys() ...

Lazy-ish

Range objects are not very lazy or sequential. They give the illusion of being all there all of the time.

  • You can get their length directly
  • You can access the value at an index
  • You can slice them
  • You can check membership
In [5]:
len(R)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-5-3c42657e37a7> in <module>()
----> 1 len(R)

OverflowError: Python int too large to convert to C ssize_t
In [6]:
R[666]  
Out[6]:
666
In [7]:
R[100:200:25]
Out[7]:
range(100, 200, 25)
In [8]:
289 in R[::17]
Out[8]:
True

Generators

Generators offer language level support for lazy lists.

Generators yield values one by one.

You can step using next() or iterate using for.

You cannot go backwards, query the length, etc.

In [9]:
# Generator function
import random

def flip_a_coin():
    'Generate a random sequence of Heads/Tails'
    while True:
        yield random.choice(['Heads', 'Tails'])

flip_a_coin()
Out[9]:
<generator object flip_a_coin at 0x0000002B595309E8>
In [10]:
ht = flip_a_coin()
print(next(ht), next(ht))
' '.join(next(ht) for _ in range(10))
Heads Tails
Out[10]:
'Tails Tails Tails Heads Heads Heads Tails Heads Heads Heads'
In [11]:
M = 1000 * 1000
heads = sum(1 for h, _ in zip(ht, range(M)) if h == 'Heads')
heads, heads/M
Out[11]:
(500536, 0.500536)
In [12]:
import collections

last10 = collections.deque(take(ht, 10), maxlen=10)

tosses = enumerate(ht)
print(take(tosses, 5))
while len(set(last10)) == 2:
    last10.append(next(tosses)[1])

next(tosses)
[(0, 'Tails'), (1, 'Heads'), (2, 'Tails'), (3, 'Tails'), (4, 'Tails')]
Out[12]:
(749, 'Tails')
In [13]:
# Generator expressions look like list comprehensions
# but they create generators, not lists
(x * x for x in R)
Out[13]:
<generator object <genexpr> at 0x0000002B59530A40>
In [14]:
sqs = (x * x for x in R)
print(take(sqs,10))
print(next(sqs))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
100

Standard Generators

Many functions in the Python standard library are generators.

In [15]:
import os

PY = 'C:/Python35'
tree = os.walk(PY)
next(tree)
Out[15]:
('C:/Python35',
 ['DLLs', 'Doc', 'include', 'Lib', 'libs', 'Scripts', 'Tools'],
 ['LICENSE.txt',
  'NEWS.txt',
  'python.exe',
  'python.pdb',
  'python3.dll',
  'python35.dll',
  'python35.pdb',
  'python35_d.dll',
  'python35_d.pdb',
  'python3_d.dll',
  'pythonw.exe',
  'pythonw.pdb',
  'pythonw_d.exe',
  'pythonw_d.pdb',
  'python_d.exe',
  'python_d.pdb',
  'README.txt',
  'vcruntime140.dll'])
In [16]:
subtrees = (r for r, d, f in os.walk(PY))
take(subtrees, 5)
Out[16]:
['C:/Python35',
 'C:/Python35\\DLLs',
 'C:/Python35\\Doc',
 'C:/Python35\\include',
 'C:/Python35\\Lib']
In [17]:
def ilen(xs):
    "Return the length of an iterable"
    return sum(1 for _ in xs)

all_files = (os.path.join(r, f)
             for r, _, ff in os.walk(PY)
             for f in ff)

py_files = (py for py in all_files if py.endswith('.py'))

max(py_files, key=lambda py: ilen(open(py, errors='ignore')))
Out[17]:
'C:/Python35\\Lib\\_pydecimal.py'

Shell Pipeline

$ find /c/Python35 -name "*.py" | xargs -L1 wc -l | sort -rn | head -1
6399 /c/Python35/Lib/_pydecimal.py

Chunked

In [18]:
rs = iter(R)
r3 = [rs, rs, rs]
[next(r) for r in r3]
Out[18]:
[0, 1, 2]
In [19]:
[next(r) for r in r3]
Out[19]:
[3, 4, 5]
In [20]:
z = zip(*r3)
take(z, 3)
Out[20]:
[(6, 7, 8), (9, 10, 11), (12, 13, 14)]
In [21]:
def chunks(xs, n):  
    return zip(*[iter(xs)]*n)

take(chunks(R[::1000], 2), 5)
Out[21]:
[(0, 1000), (2000, 3000), (4000, 5000), (6000, 7000), (8000, 9000)]

Flattened

In [22]:
import itertools
flatten = itertools.chain
flatten('abc', 'def')
Out[22]:
<itertools.chain at 0x2b52c201d0>
In [23]:
list(_)
Out[23]:
['a', 'b', 'c', 'd', 'e', 'f']
In [24]:
xs = chunks(R[::1000], 2)
flatten = itertools.chain.from_iterable
ys = flatten(xs)
take(ys, 5)
Out[24]:
[0, 1000, 2000, 3000, 4000]

Itertools

  • my favourite Python module
  • composable functions for working with lazy lists
  • examples
    • count()
    • islice()
    • cycle()
    • repeat()
In [25]:
import itertools as its
def ilen(xs): return sum(1 for _ in xs)
def run_length_encode(xs):
    ''' Run length encodes the stream of values, xs
    
    The result is a stream of (value, repeat-count) pairs
    '''
    return ((x, ilen(g)) for x, g in its.groupby(xs))
    
rle = run_length_encode(flip_a_coin())
list(its.islice(rle, 5))
Out[25]:
[('Heads', 2), ('Tails', 1), ('Heads', 2), ('Tails', 2), ('Heads', 2)]
In [26]:
def run_length_decode(vc):
    '''Reverses run_length_encode
    
    Expands a stream of (value, repeat-count) pairs into 
    a stream of values.
    '''
    return its.chain.from_iterable(its.starmap(its.repeat, vc))

scream = ('A', 6), ('R', 4), ('G', 5), ('H', 3), ('!', 13)
''.join(run_length_decode(scream))
Out[26]:
'AAAAAARRRRGGGGGHHH!!!!!!!!!!!!!'
In [27]:
import random

def story_points():
    while True:
        yield int(random.gauss(100, 25))

def smooth(xs, n=10):
    b, e = its.tee(xs)
    s = sum(its.islice(e, n))
    for bn, en in zip(b, e):
        s += en - bn
        yield s / n

take(smooth(story_points(), 10000), 10)
Out[27]:
[99.8272,
 99.8263,
 99.828,
 99.8271,
 99.8272,
 99.83,
 99.8312,
 99.8306,
 99.8301,
 99.8264]

Emulating Shell Syntax

You have to read the expression:

take(smooth(story_points(), 10), 10)

inside out.

Wouldn't it be nice to borrow shell syntax?

story_points() | smooth(10) | take(10)
In [28]:
class pipe:
    def __init__(self, fn):
        self._fn = fn

    def __ror__(self, other):
        return self._fn(other)

    def __call__(self, *v, **k):
        return pipe(lambda x: self._fn(x, *v, **k))

# Credit: https://github.com/JulienPalard/Pipe
In [29]:
psmooth = pipe(smooth)
ptake = pipe(take)
story_points() | psmooth(2) | ptake(10) 
Out[29]:
[96.5, 109.0, 102.5, 103.5, 80.0, 69.0, 80.0, 107.5, 112.5, 106.5]
In [30]:
@pipe
def bounds(xs):
    m, M = 1e9, -1e9
    for x in xs:
        m, M = min(m, x), max(M, x)
        yield m, M

story_points() | bounds() | ptake(10)
Out[30]:
[(96, 96),
 (96, 106),
 (96, 118),
 (96, 124),
 (73, 124),
 (73, 124),
 (73, 124),
 (73, 124),
 (73, 124),
 (73, 124)]