Sometimes you do a thing that leaves you scratching your head for a few minutes until you realize "Yes, of course that's how it works."
Today at the (virtual) SciPy 2020 xarray tutorial, one of the exercises involved taking the mean of a 2D array that had some missing values. For some reason, my initial inclination was to take the mean of the 2D array over the first dimension and then take the mean of the resulting 1D array. But that actually gave a different result from taking the mean over the entire array at once!
In the end, the reason is simple: because the dataset has missing values that get skipped, calculating the means successively effectively weights the measurements differently.
But I thought it was a fun example of a seemingly innocuous mistake that could lead to large errors, particularly when analyzing large datasets. So let's break it down!
To start, let's import numpy and build a random 4 by 4 array:

import numpy as np

b = np.random.rand(4, 4)
To find the mean of the whole array, we can just use b.mean(). But it's also true that if you take the means of just the columns and then take the mean of the result, you'll get the same answer. Taking the mean along an axis reduces the dimensionality by 1, so we get a 1D array here:

b.mean(axis=0)
array([0.67246434, 0.68927864, 0.7035238 , 0.31920885])
And if we take the mean of that result, we'll get the same answer as before (with a tiiiiiiny bit of error in the final digit):
And we get the same answer if we take the mean along the other axis first:
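The no-missing-values case fits in one runnable sketch (seeded here for reproducibility, so the values differ from the array above):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded; values differ from the session above
b = rng.random((4, 4))

overall = b.mean()                     # mean of all 16 values at once
col_then_mean = b.mean(axis=0).mean()  # column means, then their mean
row_then_mean = b.mean(axis=1).mean()  # row means, then their mean

# With no missing values, all three agree (up to floating-point rounding)
print(np.isclose(overall, col_then_mean), np.isclose(overall, row_then_mean))
```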
But what happens when we have some missing values? Let's put some in!
b[2, 3] = np.nan
b[2, 1] = np.nan
b[3, 3] = np.nan
Since we have nan values, we can use np.nanmean to find the mean while skipping them:

np.nanmean(b)
Ok, and when we take the mean of the means?

np.nanmean(np.nanmean(b, axis=0))
Whoa! Very different results! What's going on here?
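Here's the whole failure mode in one self-contained snippet (seeded, so the numbers are reproducible but differ from the session above):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded; values differ from the session above
b = rng.random((4, 4))
b[2, 3] = b[2, 1] = b[3, 3] = np.nan  # same missing-value positions as above

overall = np.nanmean(b)                         # one mean over all 13 valid values
successive = np.nanmean(np.nanmean(b, axis=0))  # column means first, then their mean

# The two generally disagree whenever the number of valid values varies by column
print(overall, successive)
```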
Let's think a minute about what a mean is. Given a vector, $B$, with $N$ elements, we add up the elements and divide by $N$:

$$\bar{B} = \frac{1}{N}\sum_{i=1}^{N} B_i$$
But when we have nan values, $N$ changes when we take the mean over axes successively!
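You can see this in miniature with a 2-by-2 array and a single missing value (a toy example, not the array from above):

```python
import numpy as np

# Tiny example: one nan in the second column
c = np.array([[1.0, 4.0],
              [3.0, np.nan]])

overall = np.nanmean(c)                    # (1 + 4 + 3) / 3 = 8/3, about 2.667
successive = np.nanmean(c, axis=0).mean()  # mean of [2.0, 4.0] = 3.0

# Taken successively, the lone 4.0 carries weight 1/2 (after the nan is
# skipped, it's a column average all by itself), while 1.0 and 3.0 each
# carry weight 1/4 -- so the values no longer count as 1/3 each.
```

The successive result gets pulled toward 4.0 because skipping the nan silently promotes that one value to a full column average.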
Let's take a look at b again:
array([[0.80532163, 0.36377322, 0.72169393, 0.13140395],
       [0.94655188, 0.91627271, 0.64348503, 0.10323544],
       [0.39911748,        nan, 0.74945358,        nan],
       [0.53886636, 0.85992325, 0.69946265,        nan]])
So when we take the mean over the rows (axis=0), collapsing each column down to its mean:
array([0.67246434, 0.71332306, 0.7035238 , 0.1173197 ])
The first column uses N=4, the second uses N=3, and the fourth only N=2, which we can confirm by counting the non-nan values in each column:

np.sum(~np.isnan(b), axis=0)

array([4, 3, 4, 2])

This means the values in the second and fourth columns carry a larger weight than the values in the first column, which is not what we want. There may be situations in which you want that behavior, but in most cases you just want the overall mean.
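To make the implicit weighting explicit, we can rebuild the mean-of-column-means as a weighted sum: each valid value in column j gets weight 1/(4 * Nj), where Nj is the number of valid values in that column. A sketch, with a seeded array standing in for the one above:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded stand-in for the array above
b = rng.random((4, 4))
b[2, 3] = b[2, 1] = b[3, 3] = np.nan

valid = ~np.isnan(b)
n_per_col = valid.sum(axis=0)  # array([4, 3, 4, 2])

# Each valid value in column j implicitly gets weight 1 / (4 * n_per_col[j]):
# 1/4 for averaging the four column means, 1/n_per_col[j] within its column.
weights = valid / (4 * n_per_col)
weighted = np.nansum(b * weights)

# This reproduces the successive mean-of-column-means exactly
successive = np.nanmean(np.nanmean(b, axis=0))
```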
What we could do instead is sum over the axes successively but divide by the total number of non-nan values in the original array:

NnotNan = np.count_nonzero(~np.isnan(b))
np.nansum(b) / NnotNan
np.nansum(b,axis=0).sum() / NnotNan
np.nansum(b,axis=1).sum() / NnotNan
And we see these answers agree once again!
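The fix, end to end (seeded; np.count_nonzero over the non-nan mask is one way to get NnotNan):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded stand-in for the array above
b = rng.random((4, 4))
b[2, 3] = b[2, 1] = b[3, 3] = np.nan

n_valid = np.count_nonzero(~np.isnan(b))  # 13 valid values

direct = np.nansum(b) / n_valid
by_cols = np.nansum(b, axis=0).sum() / n_valid
by_rows = np.nansum(b, axis=1).sum() / n_valid

# All three agree with np.nanmean(b): the summation order doesn't matter,
# as long as we divide by the global count of valid values once at the end.
```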
But really, just use np.nanmean, and remember: means of means won't give you the mean when you have missing values.