Note: Click on "Kernel" > "Restart Kernel and Run All" in JupyterLab after finishing the exercises to ensure that your solution runs top to bottom without any errors. If you cannot run this file on your machine, you may want to open it in the cloud.

Chapter 8: Map, Filter, & Reduce (Coding Exercises)

The exercises below assume that you have read the first and second part of Chapter 8.

The ...'s in the code cells indicate where you need to fill in code snippets. The number of ...'s within a code cell gives you a rough idea of how many lines of code are needed to solve the task. You should not need to create any additional code cells for your final solution. However, you may want to use temporary code cells to try out some ideas.

Removing Outliers in Streaming Data

Let's say we are given a list object with random integers like sample below, and we want to calculate some basic statistics on them.

In [ ]:

sample = [
    45, 46, 40, 49, 36, 53, 49, 42, 25, 40, 39, 36, 38, 40, 40, 52, 36, 52, 40, 41,
    35, 29, 48, 43, 42, 30, 29, 33, 55, 33, 38, 50, 39, 56, 52, 28, 37, 56, 45, 37,
    41, 41, 37, 30, 51, 32, 23, 40, 53, 40, 45, 39, 99, 42, 34, 42, 34, 39, 39, 53,
    43, 37, 46, 36, 45, 42, 32, 38, 57, 34, 36, 44, 47, 51, 46, 39, 28, 40, 35, 46,
    41, 51, 41, 23, 46, 40, 40, 51, 50, 32, 47, 36, 38, 29, 32, 53, 34, 43, 39, 41,
    40, 34, 44, 40, 41, 43, 47, 57, 50, 42, 38, 25, 45, 41, 58, 37, 45, 55, 44, 53,
    82, 31, 45, 33, 32, 39, 46, 48, 42, 47, 40, 45, 51, 35, 31, 46, 40, 44, 61, 57,
    40, 36, 35, 55, 40, 56, 36, 35, 86, 36, 51, 40, 54, 50, 49, 36, 41, 37, 48, 41,
    42, 44, 40, 43, 51, 47, 46, 50, 40, 23, 40, 39, 28, 38, 42, 46, 46, 42, 46, 31,
    32, 40, 48, 27, 40, 40, 30, 32, 25, 31, 30, 43, 44, 29, 45, 41, 63, 32, 33, 58,
]

In [ ]:

len(sample)

Q1: list objects are sequences. What four behaviors do they always come with?

< your answer >

Q2: Write a function mean() that calculates the simple arithmetic mean of a given sequence with numbers!

Hints: You can solve this task with built-in functions only. A for-loop is not needed.

In [ ]:

def mean(sequence):
    ...

In [ ]:

sample_mean = mean(sample)

In [ ]:

sample_mean

Q3: Write a function std() that calculates the standard deviation of a sequence of numbers! Integrate your mean() version from before and the sqrt() function from the math module in the standard library provided to you below. Make sure std() calls mean() only once internally! Repeated calls to mean() would be a waste of computational resources.

Hints: Parts of the code are probably too long to fit within the suggested 79 characters per line. So, use temporary variables inside your function. Instead of a for-loop, you may want to use a list comprehension or, even better, a memoryless generator expression.
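
To illustrate the difference the hint alludes to, here is a minimal sketch on toy data. The name `numbers` and the squaring step are made up for this illustration and are unrelated to `sample` or to the task itself. A list comprehension materializes all intermediate values at once, whereas a generator expression hands them to `sum()` one by one.

```python
numbers = [1, 2, 3, 4]  # hypothetical toy data

# List comprehension: builds a temporary list object with all squares first.
squares_list = [n ** 2 for n in numbers]
total_from_list = sum(squares_list)

# Generator expression: produces the squares one at a time, on demand.
squares_gen = (n ** 2 for n in numbers)
total_from_gen = sum(squares_gen)

print(total_from_list, total_from_gen)
```

Both totals are the same. The generator object is exhausted afterwards, so it cannot be summed a second time, while the list object can.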

In [ ]:

from math import sqrt

In [ ]:

def std(sequence):
    ...

In [ ]:

sample_std = std(sample)

In [ ]:

sample_std

Q4: Complete standardize() below that takes a sequence of numbers and returns a list object with the z-scores of these numbers! A z-score is calculated by subtracting the mean and dividing by the standard deviation. Re-use mean() and std() from before. Again, ensure that standardize() calls mean() and std() only once! Further, round all z-scores with the built-in round() function and pass on the keyword-only argument digits to it.

Hint: You may want to use a list comprehension instead of a for-loop.

In [ ]:

def standardize(sequence, *, digits=3):
    ...

In [ ]:

z_scores = standardize(sample)

The pprint() function from the pprint module in the standard library allows us to "pretty print" long list objects compactly.

In [ ]:

from pprint import pprint

In [ ]:

pprint(z_scores, compact=True)
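
As a side note, the effect of `compact=True` can be seen in a small sketch that uses `pformat()`, the sibling of `pprint()` that returns the formatted text instead of printing it. The list `numbers` is hypothetical toy data.

```python
from pprint import pformat

numbers = list(range(40))  # hypothetical toy data, too long for one line

regular = pformat(numbers)                # one element per line
compact = pformat(numbers, compact=True)  # as many elements per line as fit

print(len(regular.splitlines()), len(compact.splitlines()))
```

The compact version needs far fewer lines for the same content.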

We know that standardize() works correctly if the resulting z-scores' mean and standard deviation approach 0 and 1 for a long enough sequence.

In [ ]:

mean(z_scores), std(z_scores)

Even though standardize() calls mean() and std() only once each, mean() is called twice! That is so because std() internally also re-uses mean()!

Q5.1: Rewrite std() to take an optional keyword-only argument seq_mean, defaulting to None. If provided, seq_mean is used instead of the result of calling mean(). Otherwise, the latter is called.

Hint: You must check if seq_mean is still the default value.
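
A common way to check for a default value of `None` is the `is` operator. The sketch below shows the idiom on a made-up function `shift()` with a hypothetical keyword-only argument `offset`; it is not a solution to the task.

```python
def shift(numbers, *, offset=None):
    """Subtract offset from each number; fall back to the smallest number."""
    if offset is None:  # only the untouched default triggers the fallback
        offset = min(numbers)
    return [n - offset for n in numbers]


print(shift([3, 4, 5]))            # fallback is used
print(shift([3, 4, 5], offset=0))  # an explicit 0 is respected
```

Comparing with `is None` instead of relying on truthiness matters whenever `0` is a legitimate value: a check like `if not offset` would wrongly treat an explicit `0` as "not provided."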

In [ ]:

def std(sequence, *, seq_mean=None):
    ...

std() continues to work as before.

In [ ]:

sample_std = std(sample)

In [ ]:

sample_std

Q5.2: Now, rewrite standardize() to pass on the return value of mean() to std()! In summary, standardize() calculates the z-scores for the numbers in the sequence with as few computational steps as possible.

In [ ]:

def standardize(sequence, *, digits=3):
    ...

In [ ]:

z_scores = standardize(sample)

In [ ]:

mean(z_scores), std(z_scores)

Q6: With both sample and z_scores being materialized list objects, we can loop over pairs consisting of a number from sample and its corresponding z-score. Write a for-loop that prints out all the "outliers," which we define as numbers with an absolute z-score above 1.96. There are four of them in the sample.

Hint: Use the abs() and zip() built-ins.
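
As a reminder of how the two built-ins work together, here is a small sketch on unrelated toy data. The names `cities`, `deviations`, and `far_off` are hypothetical and exist only for this illustration.

```python
cities = ["Berlin", "Oslo", "Rome"]  # hypothetical labels
deviations = [-0.5, 2.5, -3.0]       # hypothetical signed deviations

far_off = []
# zip() pairs up the elements of both lists position by position.
for city, deviation in zip(cities, deviations):
    # abs() ignores the sign, so large negative values are caught as well.
    if abs(deviation) > 2:
        far_off.append((city, deviation))

print(far_off)
```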

In [ ]:

...

We provide a stream module with a data object that models an infinite stream of data (cf., the stream.py file in the repository).

In [ ]:

from stream import data

In [ ]:

data

data is of type generator and has no length.

In [ ]:

type(data)

In [ ]:

len(data)

So, the only thing we can do with it is to pass it to the built-in next() function and go over the numbers it streams one by one.

In [ ]:

next(data)
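
The same behavior can be observed with any generator object. The sketch below uses a small, finite generator named `countdown`, which is made up for this illustration. Unlike data, it eventually runs out of elements.

```python
countdown = (n for n in [3, 2, 1])  # hypothetical finite generator

try:
    len(countdown)
except TypeError as error:
    length_error = str(error)  # generator objects do not support len()

first = next(countdown)
second = next(countdown)
third = next(countdown)

# The generator is exhausted now. With a second argument, next() returns
# that default instead of raising a StopIteration exception.
leftover = next(countdown, None)

print(length_error)
print(first, second, third, leftover)
```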

Q7: What happens if you call mean() with data as the argument? What is the problem?

Hints: If you try it out, you may have to press the "Stop" button in the toolbar at the top. Your computer should not crash, but you will have to restart this Jupyter notebook with "Kernel" > "Restart" and import data again.

< your answer >

In [ ]:

mean(data)

Q8: Write a function take_sample() that takes an iterable as its argument, like data, and creates a materialized list object out of its first n elements, defaulting to 1_000!

Hints: next() and the range() built-in may be helpful. You may want to use a list comprehension instead of a for-loop and write a one-liner. Audacious students may want to look at islice() in the itertools module in the standard library.
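
For those curious about the itertools route, the following sketch shows how `islice()` cuts a finite piece out of an infinite iterator. It uses `count()` from the same module as a stand-in for a stream; the variable names are hypothetical.

```python
from itertools import count, islice

naturals = count(1)  # infinite iterator: 1, 2, 3, ...

# islice() is lazy itself, so list() is needed to materialize the elements.
first_five = list(islice(naturals, 5))

# The underlying iterator remembers its position between the two calls.
next_three = list(islice(naturals, 3))

print(first_five, next_three)
```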

In [ ]:

def take_sample(iterable, *, n=1_000):
    ...

We take a new_sample from the stream of data, and its statistics are similar to the initial sample.

In [ ]:

new_sample = take_sample(data)

In [ ]:

len(new_sample)

In [ ]:

mean(new_sample)

In [ ]:

std(new_sample)

Q9: Convert standardize() into a new function standardized() that implements the same logic but works on a possibly infinite stream of data, provided as an iterable, instead of a finite sequence.

To calculate a z-score, we need the stream's overall mean and standard deviation, and that is impossible to calculate if we do not know how long the stream is, and, in particular, if it is infinite. So, standardized() first takes a sample from the iterable internally, and uses the sample's mean and standard deviation to calculate the z-scores.

Hint: standardized() must return a generator object. So, use a generator expression as the return value, unless you already know about the yield statement (cf., reference).
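
The two options mentioned in the hint can be compared in a short sketch. Both functions below are made up for this illustration and simply double every element of an iterable; `count()` from the itertools module serves as a stand-in for an infinite stream.

```python
from itertools import count


def doubled_with_expression(iterable):
    """Return a generator object built from a generator expression."""
    return (2 * x for x in iterable)


def doubled_with_yield(iterable):
    """Generator function: calling it returns a generator object."""
    for x in iterable:
        yield 2 * x


from_expression = doubled_with_expression(count(1))
from_yield = doubled_with_yield(count(1))

print(next(from_expression), next(from_expression))
print(next(from_yield), next(from_yield))
```

One difference worth knowing: in the first version, any code placed before the return statement runs as soon as the function is called, whereas the body of a generator function does not start running until the first call of next().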

In [ ]:

def standardized(iterable, *, digits=3):
    ...

standardized() works almost like standardize() except that we use it with next() to obtain the z-scores one by one.

In [ ]:

z_scores = standardized(data)

In [ ]:

z_scores

In [ ]:

type(z_scores)

In [ ]:

next(z_scores)

Q10.1: standardized() allows us to go over an infinite stream of z-scores. What we want to do instead is to loop over the stream's raw numbers and skip the outliers. In the remainder of this exercise, you look at the parts that make up the skip_outliers() function below to achieve precisely that.

The first steps in skip_outliers() are the same as in standardized(): We take a sample from the stream of data and calculate its statistics.

In [ ]:

sample = ...
seq_mean = ...
seq_std = ...

Q10.2: Just as in standardized(), write a generator expression that produces z-scores one by one! However, instead of just generating a z-score, the resulting generator object should produce tuple objects consisting of a "raw" number from data and its z-score.

Hint: Look at the revisited "Averaging Even Numbers" example in Chapter 7 for some inspiration, which also contains a generator expression producing tuple objects.

In [ ]:

standardizer = (... for ... in data)

standardizer should produce tuple objects.

In [ ]:

next(standardizer)

Q10.3: Write another generator expression that loops over standardizer. It contains an if-clause that keeps only numbers with an absolute z-score below the threshold_z. If you fancy, use tuple unpacking.
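
In case tuple unpacking inside a generator expression is unfamiliar, here is a sketch of the syntax on unrelated toy data. The names `pairs`, `limit`, and `kept` are hypothetical.

```python
pairs = [("a", 0.5), ("b", -2.5), ("c", 1.0)]  # hypothetical (label, score) tuples
limit = 2                                      # hypothetical threshold

# Each tuple is unpacked into label and score; only the label is produced,
# and only if the if-clause holds for the score.
kept = (label for label, score in pairs if abs(score) < limit)

print(list(kept))
```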

In [ ]:

threshold_z = 1.96

In [ ]:

no_outliers = (... for ... in standardizer if ...)

no_outliers should produce int objects.

In [ ]:

next(no_outliers)

Q10.4: Lastly, put everything together in the skip_outliers() function! Make sure you refer to iterable inside the function and not the global data.

In [ ]:

def skip_outliers(iterable, *, threshold_z=1.96):
    sample = ...
    seq_mean = ...
    seq_std = ...
    standardizer = ...
    no_outliers = ...
    return no_outliers

Now, we can create a generator object and loop over the data in the stream with outliers skipped. Instead of the default 1.96, we use a threshold_z of only 0.05: That filters out all numbers except 42.

In [ ]:

skipper = skip_outliers(data, threshold_z=0.05)

In [ ]:

skipper

In [ ]:

type(skipper)

In [ ]:

next(skipper)

Q11: You implemented the functions mean(), std(), standardize(), standardized(), and skip_outliers(). Which of them are eager, and which are lazy? How do these two concepts relate to finite and infinite data?

< your answer >