Lazy sequences working hard

@thomasguestwordaligned.org@clinithinkwales

Summary

  • A lazy sequence yields values when they're needed
  • This makes it possible to work with large and even infinite sequences using limited memory
  • Lazy sequences can be chained together
  • Python provides language and library level support for lazy sequences
In [2]:
G = 1000**100
R = range(G)
R
Out[2]:
range(0, 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
In [3]:
R[666]
Out[3]:
666
In [4]:
R[::1000]
Out[4]:
range(0, 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, 1000)
In [5]:
42 in R
Out[5]:
True
In [6]:
len(R)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-6-3c42657e37a7> in <module>()
----> 1 len(R)

OverflowError: Python int too large to convert to C ssize_t
In [ ]:
sum(R)
In [7]:
def take(xs, n):
    xs = iter(xs)
    return [next(xs) for _ in range(n)]

rs = iter(R)
print(take(rs, 10), '\n', next(rs))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
 10

Procrastination

range() creates a range object.

It's not a collection in memory. The range stores start, stop, step values, so it can be iterated over etc. Activity is deferred.

Lazy working, not lazy thinking.

Historical note

  • In Python 2 range() created a list and xrange() behaved lazily.
  • Python 3 simplifies things.
  • Lazy is sufficient.
In [8]:
list(range(10))
Out[8]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

See also: map(), filter(), dict.keys() ...

Lazy-ish

Range objects are not very lazy nor particularly sequential. They give the illusion of being all there all the time.

  • You can get their length directly
  • You can access the value at an index
  • You can slice them
  • You can check membership

Generators

Generators offer language level support for lazy sequences.

Generators yield values one at a time.

Step using next(), iterate using for.

You cannot go backwards, query the length, etc.

In [9]:
# Generator function
import random

def flip_a_coin():
    'Generate a random sequence of Heads/Tails'
    while True:
        yield random.choice(['Heads', 'Tails'])

flip_a_coin()
Out[9]:
<generator object flip_a_coin at 0x000000348A978C50>
In [10]:
ht = flip_a_coin()
print(next(ht), next(ht))
' '.join(next(ht) for _ in range(10))
Heads Heads
Out[10]:
'Heads Heads Heads Heads Heads Heads Heads Tails Tails Tails'
In [11]:
M = 1000 * 1000
heads = sum(1 for h, _ in zip(ht, range(M)) if h == 'Heads')
heads, heads/M
Out[11]:
(499857, 0.499857)
In [12]:
import collections

last10 = collections.deque(take(ht, 10), maxlen=10)

tosses = enumerate(ht)

while set(last10) == {'Heads', 'Tails'}:
    last10.append(next(tosses)[1])

next(tosses)
Out[12]:
(282, 'Tails')
In [13]:
# Generator expressions look like list comprehensions
# but they create generators, not lists
(x * x for x in R)
Out[13]:
<generator object <genexpr> at 0x000000348A9B7048>
In [14]:
sqs = (x * x for x in R)
print(take(sqs,10))
print(next(sqs))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
100

Standard Generators

Many functions in the Python standard library are generators.

In [15]:
import os

PY = 'C:/Python36'
tree = os.walk(PY)
next(tree)
Out[15]:
('C:/Python36',
 ['DLLs',
  'Doc',
  'include',
  'Lib',
  'libs',
  'Scripts',
  'selenium',
  'tcl',
  'Tools'],
 ['LICENSE.txt',
  'NEWS.txt',
  'python.exe',
  'python.pdb',
  'python3.dll',
  'python36.dll',
  'python36.pdb',
  'python36_d.dll',
  'python36_d.pdb',
  'python3_d.dll',
  'pythonw.exe',
  'pythonw.pdb',
  'pythonw_d.exe',
  'pythonw_d.pdb',
  'python_d.exe',
  'python_d.pdb',
  'vcruntime140.dll'])
In [16]:
subtrees = (r for r, d, f in os.walk(PY))
take(subtrees, 5)
Out[16]:
['C:/Python36',
 'C:/Python36\\DLLs',
 'C:/Python36\\Doc',
 'C:/Python36\\include',
 'C:/Python36\\Lib']
In [17]:
def ilen(xs):
    "Return the length of an iterable"
    return sum(1 for _ in xs)

all_files = (os.path.join(r, f)
             for r, _, ff in os.walk(PY)
             for f in ff)

py_files = (py for py in all_files if py.endswith('.py'))

max(py_files, key=lambda py: ilen(open(py, errors='ignore')))
Out[17]:
'C:/Python36\\Lib\\pydoc_data\\topics.py'

Shell Pipeline

$  find /c/Python36 -name "*.py" | xargs wc -l | grep -v total$ | sort -rn | head -1
12966 /c/Python36/Lib/pydoc_data/topics.py

Chunked

In [18]:
rs = iter(R)
r3 = [rs, rs, rs]
[next(r) for r in r3]
Out[18]:
[0, 1, 2]
In [19]:
[next(r) for r in r3]
Out[19]:
[3, 4, 5]
In [20]:
z = zip(*r3)
take(z, 3)
Out[20]:
[(6, 7, 8), (9, 10, 11), (12, 13, 14)]
In [21]:
def chunks(xs, n):  
    return zip(*[iter(xs)]*n)

take(chunks(R[::1000], 2), 5)
Out[21]:
[(0, 1000), (2000, 3000), (4000, 5000), (6000, 7000), (8000, 9000)]

Flattened

In [22]:
import itertools
flatten = itertools.chain
flatten('abc', 'def')
Out[22]:
<itertools.chain at 0x3489510c88>
In [23]:
list(_)
Out[23]:
['a', 'b', 'c', 'd', 'e', 'f']
In [24]:
xs = chunks(R[::1000], 2)
flatten = itertools.chain.from_iterable
ys = flatten(xs)
take(ys, 5)
Out[24]:
[0, 1000, 2000, 3000, 4000]

Itertools

  • composable functions for working with lazy sequences
    • chain()
    • count()
    • islice()
    • cycle()
    • repeat()
    • etc
In [25]:
import itertools as its
def ilen(xs): return sum(1 for _ in xs)
def run_length_encode(xs):
    ''' Run length encodes the stream of values, xs
    
    The result is a stream of (value, repeat-count) pairs
    '''
    return ((x, ilen(g)) for x, g in its.groupby(xs))
    
rle = run_length_encode(flip_a_coin())
list(its.islice(rle, 5))
Out[25]:
[('Tails', 1), ('Heads', 1), ('Tails', 2), ('Heads', 1), ('Tails', 3)]
In [26]:
def run_length_decode(vc):
    '''Reverses run_length_encode
    
    Expands a stream of (value, repeat-count) pairs into 
    a stream of values.
    '''
    return its.chain.from_iterable(its.starmap(its.repeat, vc))

scream = ('A', 6), ('R', 4), ('G', 5), ('H', 3), ('!', 13)
''.join(run_length_decode(scream))
Out[26]:
'AAAAAARRRRGGGGGHHH!!!!!!!!!!!!!'
In [27]:
import random

def story_points():
    while True:
        yield int(random.gauss(100, 25))

def smooth(xs, n=10):
    b, e = its.tee(xs)
    s = sum(its.islice(e, n))
    for bn, en in zip(b, e):
        s += en - bn
        yield s / n

take(smooth(story_points(), 10000), 10)
Out[27]:
[99.2981,
 99.3007,
 99.2982,
 99.3028,
 99.3117,
 99.3128,
 99.3161,
 99.3211,
 99.3171,
 99.3166]

Emulating Shell Syntax

You have to read the expression:

take(smooth(story_points(), 10), 10)

inside out.

Wouldn't it be nice to borrow shell syntax?

story_points() | smooth(10) | take(10)
In [28]:
class pipe:
    def __init__(self, fn):
        self._fn = fn

    def __ror__(self, stdin):
        return self._fn(stdin)

    def __call__(self, *v, **k):
        return pipe(lambda x: self._fn(x, *v, **k))

# Credit: https://github.com/JulienPalard/Pipe
In [29]:
psmooth = pipe(smooth)
ptake = pipe(take)
story_points() | psmooth(2) | ptake(10) 
Out[29]:
[113.0, 129.0, 123.0, 110.5, 84.0, 74.5, 63.0, 77.0, 101.0, 117.0]
In [30]:
@pipe
def bounds(xs):
    m, M = 1e9, -1e9
    for x in xs:
        m, M = min(m, x), max(M, x)
        yield m, M

story_points() | bounds() | ptake(10)
Out[30]:
[(101, 101),
 (101, 102),
 (29, 102),
 (29, 128),
 (29, 128),
 (29, 128),
 (29, 128),
 (29, 128),
 (29, 128),
 (29, 128)]