Note: Click on "Kernel" > "Restart Kernel and Run All" in JupyterLab after finishing the exercises to ensure that your solution runs top to bottom without any errors. If you cannot run this file on your machine, you may want to open it in the cloud.
The exercises below assume that you have read the first and second parts of Chapter 8.
The ...'s in the code cells indicate where you need to fill in code snippets. The number of ...'s within a code cell gives you a rough idea of how many lines of code are needed to solve the task. You should not need to create any additional code cells for your final solution. However, you may want to use temporary code cells to try out some ideas.
Let's say we are given a list object with random integers like sample below, and we want to calculate some basic statistics on them.
sample = [
45, 46, 40, 49, 36, 53, 49, 42, 25, 40, 39, 36, 38, 40, 40, 52, 36, 52, 40, 41,
35, 29, 48, 43, 42, 30, 29, 33, 55, 33, 38, 50, 39, 56, 52, 28, 37, 56, 45, 37,
41, 41, 37, 30, 51, 32, 23, 40, 53, 40, 45, 39, 99, 42, 34, 42, 34, 39, 39, 53,
43, 37, 46, 36, 45, 42, 32, 38, 57, 34, 36, 44, 47, 51, 46, 39, 28, 40, 35, 46,
41, 51, 41, 23, 46, 40, 40, 51, 50, 32, 47, 36, 38, 29, 32, 53, 34, 43, 39, 41,
40, 34, 44, 40, 41, 43, 47, 57, 50, 42, 38, 25, 45, 41, 58, 37, 45, 55, 44, 53,
82, 31, 45, 33, 32, 39, 46, 48, 42, 47, 40, 45, 51, 35, 31, 46, 40, 44, 61, 57,
40, 36, 35, 55, 40, 56, 36, 35, 86, 36, 51, 40, 54, 50, 49, 36, 41, 37, 48, 41,
42, 44, 40, 43, 51, 47, 46, 50, 40, 23, 40, 39, 28, 38, 42, 46, 46, 42, 46, 31,
32, 40, 48, 27, 40, 40, 30, 32, 25, 31, 30, 43, 44, 29, 45, 41, 63, 32, 33, 58,
]
len(sample)
Q1: list objects are sequences. What four behaviors do they always come with?
< your answer >
Q2: Write a function mean() that calculates the simple arithmetic mean of a given sequence of numbers!
Hints: You can solve this task with built-in functions only. A for-loop is not needed.
def mean(sequence):
    ...
sample_mean = mean(sample)
sample_mean
Q3: Write a function std() that calculates the standard deviation of a sequence of numbers! Integrate your mean() version from before and the sqrt() function from the math module in the standard library provided to you below. Make sure std() calls mean() only once internally! Repeated calls to mean() would be a waste of computational resources.
Hints: Parts of the code are probably too long to fit within the suggested 79 characters per line. So, use temporary variables inside your function. Instead of a for-loop, you may want to use a list comprehension or, even better, a generator expression that avoids building an intermediate list object in memory.
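As a minimal sketch of the difference mentioned in the hint, assume a short, hypothetical list object called numbers that is not part of the exercise. Both variants below compute the same sum of squares, but only the first one builds a temporary list object.

```python
# Hypothetical toy data, unrelated to sample above.
numbers = [1, 2, 3, 4]

# A list comprehension materializes all squares in memory first.
total_from_list = sum([n ** 2 for n in numbers])

# A generator expression hands the squares to sum() one by one.
total_from_generator = sum(n ** 2 for n in numbers)

print(total_from_list, total_from_generator)
```

For four numbers, the difference is negligible. It only starts to matter for long sequences.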
from math import sqrt
def std(sequence):
    ...
sample_std = std(sample)
sample_std
Q4: Complete standardize() below that takes a sequence of numbers and returns a list object with the z-scores of these numbers! A z-score is calculated by subtracting the mean and dividing by the standard deviation. Re-use mean() and std() from before. Again, ensure that standardize() calls mean() and std() only once! Further, round all z-scores with the built-in round() function and pass on the keyword-only argument digits to it.
Hint: You may want to use a list comprehension instead of a for-loop.
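To illustrate what "keyword-only" means, here is a small sketch with a hypothetical function halve() that is not part of the exercise. The * in the parameter list forces callers to pass digits by name, and the value is simply forwarded to round() as its second argument.

```python
def halve(number, *, digits=3):
    """Return half of number, rounded to the given number of digits."""
    return round(number / 2, digits)


default_result = halve(1.23456)           # falls back to digits=3
coarse_result = halve(1.23456, digits=1)  # digits must be passed by name

try:
    halve(1.23456, 1)  # passing digits positionally is not allowed
except TypeError:
    positional_call_failed = True
else:
    positional_call_failed = False

print(default_result, coarse_result, positional_call_failed)
```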
def standardize(sequence, *, digits=3):
    ...
z_scores = standardize(sample)
The pprint() function from the pprint module in the standard library allows us to "pretty print" long list objects compactly.
from pprint import pprint
pprint(z_scores, compact=True)
We know that standardize() works correctly if the resulting z-scores' mean and standard deviation approach 0 and 1, respectively, for a long enough sequence.
mean(z_scores), std(z_scores)
Even though standardize() calls mean() and std() only once each, mean() is called twice! That is because std() internally also re-uses mean()!
Q5.1: Rewrite std() to take an optional keyword-only argument seq_mean, defaulting to None. If provided, seq_mean is used instead of the result of calling mean(). Otherwise, the latter is called.
Hint: You must check if seq_mean is still the default value.
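One common way to check for a default is an identity comparison with None. The sketch below uses a hypothetical function shift() with a hypothetical argument offset, neither of which belongs to the exercise. Testing "offset is None" rather than relying on truthiness matters because a legitimate value like 0 is falsy.

```python
def shift(numbers, *, offset=None):
    """Subtract offset from all numbers; default to the smallest number."""
    if offset is None:  # only True if the caller passed nothing
        offset = min(numbers)
    return [n - offset for n in numbers]


shifted_by_default = shift([3, 4, 5])        # offset becomes 3
shifted_by_zero = shift([3, 4, 5], offset=0)  # 0 is respected, not replaced

print(shifted_by_default, shifted_by_zero)
```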
def std(sequence, *, seq_mean=None):
    ...
std() continues to work as before.
sample_std = std(sample)
sample_std
Q5.2: Now, rewrite standardize() to pass on the return value of mean() to std()! In summary, standardize() calculates the z-scores for the numbers in the sequence with as few computational steps as possible.
def standardize(sequence, *, digits=3):
    ...
z_scores = standardize(sample)
mean(z_scores), std(z_scores)
Q6: With both sample and z_scores being materialized list objects, we can loop over pairs consisting of a number from sample and its corresponding z-score. Write a for-loop that prints out all the "outliers," which we define as numbers with an absolute z-score above 1.96. There are four of them in the sample.
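If the built-in zip() function is new to you, the following sketch shows one way to loop over pairs taken from two equally long list objects. The names temperatures and labels are made up for illustration and have nothing to do with the sample.

```python
# Hypothetical toy data: each temperature belongs to the label at the same index.
temperatures = [12, 35, 18]
labels = ["mild", "hot", "mild"]

hot_days = []
for temperature, label in zip(temperatures, labels):
    # zip() yields one (temperature, label) pair per iteration.
    if label == "hot":
        hot_days.append(temperature)

print(hot_days)
```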
...
We provide a stream module with a data object that models an infinite stream of data (cf., the stream.py file in the repository).
from stream import data
data
data is of type generator and has no length.
type(data)
len(data)
So, the only thing we can do with it is to pass it to the built-in next() function and go over the numbers it streams one by one.
next(data)
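The same behavior can be sketched with a generator we build ourselves. Here, count() from the itertools module in the standard library yields 1, 2, 3, and so on forever, and the hypothetical generator tens wraps it lazily. It is only a stand-in and says nothing about how data is implemented in stream.py.

```python
from itertools import count

# An infinite generator: nothing is computed until next() asks for it.
tens = (10 * n for n in count(1))

first = next(tens)
second = next(tens)

try:
    len(tens)  # generators do not know how many elements are left
except TypeError:
    has_length = False
else:
    has_length = True

print(first, second, has_length)
```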
Q7: What happens if you call mean() with data as the argument? What is the problem?
Hints: If you try it out, you may have to press the "Stop" button in the toolbar at the top. Your computer should not crash, but you will have to restart this Jupyter notebook with "Kernel" > "Restart" and import data again.
< your answer >
mean(data)
Q8: Write a function take_sample() that takes an iterable as its argument, like data, and creates a materialized list object out of its first n elements, defaulting to 1_000!
Hints: next() and the range() built-in may be helpful. You may want to use a list comprehension instead of a for-loop and write a one-liner. Audacious students may want to look at islice() in the itertools module in the standard library.
def take_sample(iterable, *, n=1_000):
    ...
We take a new_sample from the stream of data, and its statistics are similar to those of the initial sample.
new_sample = take_sample(data)
len(new_sample)
mean(new_sample)
std(new_sample)
Q9: Convert standardize() into a new function standardized() that implements the same logic but works on a possibly infinite stream of data, provided as an iterable, instead of a finite sequence.
To calculate a z-score, we need the stream's overall mean and standard deviation, and these are impossible to calculate if we do not know how long the stream is and, in particular, if it is infinite. So, standardized() first takes a sample from the iterable internally, and uses the sample's mean and standard deviation to calculate the z-scores.
Hint: standardized() must return a generator object. So, use a generator expression as the return value, unless you already know about the yield statement (cf., reference).
def standardized(iterable, *, digits=3):
    ...
standardized() works almost like standardize() except that we use it with next() to obtain the z-scores one by one.
z_scores = standardized(data)
z_scores
type(z_scores)
next(z_scores)
Q10.1: standardized() allows us to go over an infinite stream of z-scores. What we want to do instead is to loop over the stream's raw numbers and skip the outliers. In the remainder of this exercise, you look at the parts that make up the skip_outliers() function below to achieve precisely that.
The first steps in skip_outliers() are the same as in standardized(): We take a sample from the stream of data and calculate its statistics.
sample = ...
seq_mean = ...
seq_std = ...
Q10.2: Just as in standardized(), write a generator expression that produces z-scores one by one! However, instead of just generating a z-score, the resulting generator object should produce tuple objects consisting of a "raw" number from data and its z-score.
Hint: Look at the revisited "Averaging Even Numbers" example in Chapter 7 for some inspiration, which also contains a generator expression producing tuple objects.
standardizer = (... for ... in data)
standardizer should produce tuple objects.
next(standardizer)
Q10.3: Write another generator expression that loops over standardizer. It contains an if-clause that keeps only numbers with an absolute z-score below the threshold_z. If you like, use tuple unpacking.
threshold_z = 1.96
no_outliers = (... for ... in standardizer if ...)
no_outliers should produce int objects.
next(no_outliers)
Q10.4: Lastly, put everything together in the skip_outliers() function! Make sure you refer to iterable inside the function and not the global data.
def skip_outliers(iterable, *, threshold_z=1.96):
    sample = ...
    seq_mean = ...
    seq_std = ...
    standardizer = ...
    no_outliers = ...
    return no_outliers
Now, we can create a generator object and loop over the data in the stream with outliers skipped. Instead of the default 1.96, we use a threshold_z of only 0.05: That filters out all numbers except 42.
skipper = skip_outliers(data, threshold_z=0.05)
skipper
type(skipper)
next(skipper)
Q11: You implemented the functions mean(), std(), standardize(), standardized(), and skip_outliers(). Which of them are eager, and which are lazy? How do these two concepts relate to finite and infinite data?
< your answer >