Revisit nested generators from 2013-06-07 道場 Scribbles which was used to solve Problem #2 of Project Euler.
XY mentioned that I could break up the complicated ugly even_fibonacci() generators into three simple generators.
def even_fibonacci(maximum):
a, b = 0, 1
while True:
if a > maximum:
break
if a % 2 == 0:
yield a
a, b = b, a + b
maximum = 4000000 # per Problem #2 of Project Euler
even_fibonacci(maximum)
<generator object even_fibonacci at 0xb41e9644>
list(even_fibonacci(maximum))
[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]
sum(even_fibonacci(maximum))
4613732
even_fibonacci() works. It is correct. It is fast. But it is complicated, hard to read, and ugly. So it is broken up into three little generator functions below.
def fibonacci():
a, b = 0, 1
while True:
yield a
a, b = b, a + b
def even(numbers):
for i in numbers:
if i % 2 == 0:
yield i
def lte(numbers, maximum):
for i in numbers:
if i > maximum:
break
yield i
lte(even(fibonacci()), maximum)
<generator object lte at 0xb41e9f2c>
list(lte(even(fibonacci()), maximum))
[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]
sum(lte(even(fibonacci()), maximum))
4613732
The individual generators are simple and easy to understand. It is easy to create many simple generators. Each little generator does just one thing (and does it well). By combining these little generators, one can do powerful things. The magic is in how one chooses to combine the little generators.
However the Python syntax for chaining (nesting) generators gets ugly. One normally reads from left to right, but in
sum(lte(even(fibonacci()), maximum))
One starts in the middle:
fibonacci()
reads left, adding the even() generator:
even(fibonacci())
then the lte() generator:
lte(even(fibonacci()), maximum)
then sum():
sum(lte(even(fibonacci()), maximum))
Even worse, is that one has to think about which generator the maximum argument is for.
So although chaining generators in Python can be amazingly powerful, the syntax can be hard to read.
UNIX solved this over forty years ago with pipes. In UNIX, an obvious solution would look like:
fibonacci | even | lte $maximum | sum
It reads from left to right, and the argument for lte is right after it. This is easy to read. So I implemented the pipe for generators in Python. Passing arguments requires parentheses in Python, so the Python syntax is:
fibonacci() | even() | lte(maximum) | fsum()
Below is my code to implement that. It is very green and naïve. There is plenty that it can not do, but this crude naïve code is a start.
class Filter():
def __init__(self, g, *args, **kwargs):
# print('Filter: g %r, args %r, kwargs %r' % (g, args, kwargs))
# print(' self %r' % (self))
self.g = g
self.args = args
self.kwargs = kwargs
# self.gen = g(*args, **kwargs)
def not__or__(self, other):
print('__or__ self %s, other %s' % (self.g, other.g))
return other.g(self.g(*self.args, **self.kwargs), *other.args, **other.kwargs)
def __ror__(self, left):
# print('__ror__ self %s, left %s' % (self.g, left))
# print('__ror__ left %s, right (self) %s' % (left, self.g))
g = self.g(left, *self.args, **self.kwargs)
# print(' -> %r' % g)
return g
def piperize(f):
def g(*args, **kwargs):
return Filter(f, *args, **kwargs)
return g
def fibonacci():
a, b = 0, 1
while True:
yield a
a, b = b, a + b
@piperize
def even(numbers):
for i in numbers:
if i % 2 == 0:
yield i
@piperize
def lte(numbers, maximum):
for i in numbers:
if i > maximum:
break
yield i
fibonacci() | even() | lte(maximum)
<generator object lte at 0xb41e948c>
list(fibonacci() | even() | lte(maximum))
[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]
sum(fibonacci() | even() | lte(maximum))
4613732
@piperize
def fsum(iterable):
return sum(iterable) # I am surprised I got away with a return statement.
fibonacci() | even() | lte(maximum) | fsum()
4613732
Compare the readability of the above cell with the original invocation repeated below:
sum(lte(even(fibonacci()), maximum))
Let's play a little bit more with sum().
F = Filter
fibonacci() | even() | lte(maximum) | F(sum)
4613732
sum = F(sum)
fibonacci() | even() | lte(maximum) | sum
4613732
Hmmm. That's curious. Are no parentheses a good thing or a bad thing?
Now to revisit a previous month's word counting challenge. In UNIX shell, it can be done in one long line as follows.
cat pg84.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | sort -n -r | head -n 5
Now let's implement that in python. First we'll do it with regular nested generators.
from itertools import islice
def tolower(text):
for s in text:
yield s.lower()
def chop_into_words(text):
for s in text:
for word in s.split():
yield word.strip()
def sort(words, key=None, reverse=False):
yield from sorted(words, key=key, reverse=reverse)
def uniq_count(words):
counts = {}
old_word = None
for word in words:
if word != old_word:
counts[word] = 0
old_word = word
counts[word] += 1
for word, count in counts.items():
yield count, word
def head(iterable, n=10):
yield from islice(iterable, n)
text_filename = 'pg84.txt'
list(head(sort(
uniq_count(sort(chop_into_words(tolower(open(text_filename))))),
key=lambda x: x[0],
reverse=True
), 5))
[(4327, 'the'), (3004, 'and'), (2754, 'of'), (2720, 'i'), (2160, 'to')]
Yuch! It works correctly, but it sure is hard to read. It is ugly!!!
Let's try it with the pipe syntax.
from itertools import islice
@piperize
def tolower(text):
for s in text:
yield s.lower()
@piperize
def chop_into_words(text):
for s in text:
for word in s.split():
yield word.strip()
@piperize
def sort(words, key=None, reverse=False):
yield from sorted(words, key=key, reverse=reverse)
@piperize
def uniq_count(words):
counts = {}
old_word = None
for word in words:
if word != old_word:
counts[word] = 0
old_word = word
counts[word] += 1
for word, count in counts.items():
yield count, word
@piperize
def head(iterable, n=10):
yield from islice(iterable, n)
text_filename = 'pg84.txt'
list(
open(text_filename) |
tolower() |
chop_into_words() |
sort() |
uniq_count() |
sort(key=lambda x: x[0], reverse=True) |
head(5)
)
[(4327, 'the'), (3004, 'and'), (2754, 'of'), (2720, 'i'), (2160, 'to')]
Compare the readability of the pipe syntax with that of the nested parentheses.
The last four filters can be replaced with a Counter object and its most_common() method.
from collections import Counter
@piperize
def count_words(words, n=10):
yield from Counter(words).most_common(n)
list(
open(text_filename) |
tolower() |
chop_into_words() |
count_words(5)
)
[('the', 4327), ('and', 3004), ('of', 2754), ('i', 2720), ('to', 2160)]
That's short enough to fit on one line, so:
list(open(text_filename) | tolower() | chop_into_words() | count_words(5))
[('the', 4327), ('and', 3004), ('of', 2754), ('i', 2720), ('to', 2160)]
Using syntax similar to UNIX pipe is so obvious to me that I wonder why I have not seen other people do this. Why have other people not already done this? Have other people already done this and I just don't know about it? Have other people not done this because it is a bad idea? What am I missing?
Could generalize the even() generator by passing arbitrary modulus and remainder, or even more broadly by passing a function (which could be a lambda function). That reinvents filter() with the arguments swapped.
Could generalize lte() to be generic quitter generator by passing a (lambda) function. That reinvents itertools.takewhile() with the arguments swapped.
Need to be able to mix with standard generators like islice, without needing to create wrapper functions like I did above with head(). Perhaps create function or class for in-line wrapping.
Consider using the extended generator function protocol: send versus next (and not pass iterable as first argument).
How about implementing something for Hartmann pipelines (later)? (pull not push?)
Pay attention to iter() and functools.partial().
Ponder whether parentheses are good or bad.
!fibonacci | head
0 1 1 2 3 5 8 13 21 34
!seq 10
1 2 3 4 5 6 7 8 9 10
!seq 10 | even
2 4 6 8 10
!seq 10 | lte 4
1 2 3 4
!seq 10 | lte 4 | sum
10
!fibonacci | even | lte 4000000 | sum
4613732