Taking means of arrays with missing values

Sometimes you do a thing that leaves you scratching your head for a few minutes until you realize "Yes, of course that's how it works."

Today at the (virtual) SciPy 2020 xarray tutorial, one of the exercises involved taking the mean of a 2D array that had some missing values. For some reason, my initial inclination was to take the mean of the 2D array over the first dimension and then take the mean of the resulting 1D array. But that actually gave a different result from taking the mean over the entire array at once!

The reason turns out to be very simple: because the missing values get skipped, taking the means successively effectively weights the measurements differently.
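Here's the effect in miniature, with a made-up 2-by-2 array (nothing to do with the tutorial data): a single missing value is enough to make the two routes disagree.

import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, np.nan]])

np.nanmean(a)                 # (1 + 2 + 3) / 3 = 2.0
np.nanmean(a, axis=1).mean()  # mean of the row means [1.5, 3.0] = 2.25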

But I thought it was a fun example of a seemingly innocuous mistake that could lead to large errors, particularly when analyzing large datasets. So let's break it down!

To start, let's import numpy and build a random 4 by 4 array:

In [1]:
import numpy as np
b = np.random.rand(4, 4)

To find the mean, we can just use b.mean():

In [2]:
b.mean()
Out[2]:
0.5961189081171596

But it's also true that if you take the mean of each column and then take the mean of the result, you'll get the same answer. When you take the mean along an axis, the result drops that dimension, so here we get a 1D array:

In [3]:
b.mean(axis=0)
Out[3]:
array([0.67246434, 0.68927864, 0.7035238 , 0.31920885])
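
A quick shape check makes the reduction explicit:

b.shape               # (4, 4)
b.mean(axis=0).shape  # (4,)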

And if we take the mean of that result, we'll get the same answer as before (with a tiiiiiiny bit of error in the final digit):

In [4]:
b.mean(axis=0).mean()
Out[4]:
0.5961189081171595

And we get the same answer if we take the mean along the other axis first:

In [5]:
print(b.mean(axis=1).mean())
0.5961189081171596

But what happens when we have some missing values? Let's put some in!

In [6]:
b[2, 3] = np.nan
b[2, 1] = np.nan
b[3, 3] = np.nan

Since we have nan values, a plain b.mean() would just return nan (the NaNs propagate through the sum), so we use np.nanmean instead:

In [7]:
np.nanmean(b)
Out[7]:
0.6060431635413454

Ok, and when we take the mean of the means?

In [8]:
np.nanmean(b,axis=0).mean()
Out[8]:
0.5516577241118834
In [9]:
np.nanmean(b,axis=1).mean()
Out[9]:
0.6079093503218946

Whoa! Very different results! What's going on here?

Let's think a minute about what a mean is. Given a vector, $B$, with $N$ elements, we add up the elements and divide by $N$:

$\frac{1}{N}\sum_{i=1}^{N} B_i$

But when we have nan values, $N$ changes when we take the mean over axes successively!
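
Written out, np.nanmean sums and counts only the valid entries:

$\frac{1}{N_{\mathrm{valid}}}\sum_{i\,:\,B_i\ \mathrm{valid}} B_i$

while taking the column means first and then averaging those gives

$\frac{1}{M}\sum_{j=1}^{M}\frac{1}{N_j}\sum_{i\,:\,B_{ij}\ \mathrm{valid}} B_{ij}$

where $M$ is the number of columns and $N_j$ is the number of valid entries in column $j$. The two expressions only agree in general when every $N_j$ is the same.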

Let's take a look at b:

In [10]:
b
Out[10]:
array([[0.80532163, 0.36377322, 0.72169393, 0.13140395],
       [0.94655188, 0.91627271, 0.64348503, 0.10323544],
       [0.39911748,        nan, 0.74945358,        nan],
       [0.53886636, 0.85992325, 0.69946265,        nan]])

So when we take the mean along the first axis (averaging down each column):

In [11]:
np.nanmean(b,axis=0)
Out[11]:
array([0.67246434, 0.71332306, 0.7035238 , 0.1173197 ])

The first column uses N=4, the second column uses N=3, which we can confirm:

In [12]:
b[:,0].sum()/4
Out[12]:
0.6724643365974177
In [13]:
col2=b[:,1]
col2[~np.isnan(col2)].sum()/3
Out[13]:
0.7133230623983705

This means the values in the second column get a larger weight than the values in the first column! Which is not what we want. There may be situations in which you want that, but in most cases you just want the overall mean.
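
To see the weighting explicitly, we can give every valid value in column $j$ a weight of $1/(4 N_j)$, where $N_j$ is that column's count of valid values, and sum; this lands right back on the number from Out[8]:

counts  = (~np.isnan(b)).sum(axis=0)   # valid values per column: [4, 3, 4, 2]
weights = 1.0 / (b.shape[1] * counts)  # per-value weight in each column
np.nansum(b * weights)                 # same as np.nanmean(b, axis=0).mean()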

What we could do is sum over the axes successively but divide by the total number of non-nan values in the original array:

In [14]:
NnotNan=(1-np.isnan(b)).sum()
NnotNan
Out[14]:
13
In [15]:
np.nansum(b) / NnotNan
Out[15]:
0.6060431635413454
In [16]:
np.nansum(b,axis=0).sum() / NnotNan
Out[16]:
0.6060431635413455
In [17]:
np.nansum(b,axis=1).sum() / NnotNan
Out[17]:
0.6060431635413455

And we see these answers agree once again!
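
And if you genuinely need to reduce in stages (say the data arrives in chunks), you can carry the counts along and take a count-weighted mean of the partial means; a quick sketch (assuming no row is entirely nan):

row_means  = np.nanmean(b, axis=1)        # partial means, one per row
row_counts = (~np.isnan(b)).sum(axis=1)   # valid values behind each partial mean
np.sum(row_means * row_counts) / row_counts.sum()   # recovers np.nanmean(b)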

But really, just use

In [18]:
np.nanmean(b)
Out[18]:
0.6060431635413454

and remember: means of means won't give you the mean when you have missing values.
