In [1]:
%load_ext watermark

In [2]:
%watermark -v -p numpy -d -u

Last updated: 31/07/2014

CPython 3.4.1
IPython 2.1.0

numpy 1.8.1


[More information](https://github.com/rasbt/watermark) about the watermark magic command extension.

# Quick guide for dealing with missing numbers in NumPy

This is a quick overview of how to deal with missing values (i.e., NaNs, short for "Not-a-Number") in NumPy, which I am happy to expand over time. A separate guide for pandas will follow at some point.

## Sample data from a CSV file

Let's assume that we have a CSV file with missing elements like the one shown below.

In [3]:
%%file example.csv
1,2,3,4
5,6,,8
10,11,12,

Writing example.csv


By default, the np.genfromtxt function translates empty fields into np.nan values (its missing_values parameter controls which strings are treated as missing). This allows us to construct a new NumPy ndarray object even if elements are missing.

In [4]:
import numpy as np
ary = np.genfromtxt('./example.csv', delimiter=',')

print('%s x %s array:\n' %(ary.shape[0], ary.shape[1]))
print(ary)

3 x 4 array:

[[  1.   2.   3.   4.]
[  5.   6.  nan   8.]
[ 10.  11.  12.  nan]]
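The same parsing step can be tried without writing a file to disk. The following is a minimal sketch that feeds the CSV text to np.genfromtxt through an in-memory buffer; the names `csv_text` and `ary_mem` are made up for this illustration.

```python
import io
import numpy as np

# Same content as example.csv, held in memory instead of on disk
# (csv_text and ary_mem are hypothetical names for this sketch).
csv_text = "1,2,3,4\n5,6,,8\n10,11,12,\n"

# genfromtxt accepts file-like objects as well as file names.
ary_mem = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(ary_mem.shape)   # (3, 4)
print(ary_mem)         # the two empty fields show up as nan
```

Because NaN is a floating-point value, the resulting array has a float dtype even though the file only contains integers.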


## Determining if a value is missing

A handy way to test whether a value is NaN is the np.isnan function.

In [5]:
np.isnan(np.nan)

Out[5]:
True

It is especially useful for creating Boolean masks for the so-called "fancy indexing" of NumPy arrays, which we will come back to later.

In [6]:
np.isnan(ary)

Out[6]:
array([[False, False, False, False],
[False, False,  True, False],
[False, False, False,  True]], dtype=bool)
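It may be tempting to test for missing values with `==` instead of np.isnan. The short sketch below shows why that does not work: NaN never compares equal to anything, including itself. The array is rebuilt by hand here so that the snippet runs on its own, and `eq_mask` is just an illustrative name.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

# NaN is not equal to itself, so an equality test finds nothing.
print(np.nan == np.nan)        # False

eq_mask = (ary == np.nan)      # all False, even where values are missing
print(eq_mask.any())           # False

# np.isnan is the reliable test.
print(np.isnan(ary).any())     # True
```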

## Counting the number of missing values

In order to find out how many elements are missing in our array, we can use the np.isnan function that we have seen in the previous section.

In [7]:
np.count_nonzero(np.isnan(ary))

Out[7]:
2

If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask via the handy "tilde" operator (~).

In [8]:
np.count_nonzero(~np.isnan(ary))

Out[8]:
10
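Often it is more informative to know where the missing values are concentrated. As a small sketch, summing the Boolean mask along an axis gives the count per column or per row (True counts as 1); the names `nan_mask`, `per_column`, and `per_row` are chosen for this example only.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

nan_mask = np.isnan(ary)

# Summing a Boolean mask counts the True entries along the given axis.
per_column = nan_mask.sum(axis=0)
per_row = nan_mask.sum(axis=1)

print('missing per column:', per_column)   # [0 0 1 1]
print('missing per row:', per_row)         # [0 1 1]
```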

## Calculating the sum of an array that contains NaNs

As the following code snippet shows, we can't use NumPy's regular sum function on an array that contains NaNs: the result is itself NaN.

In [9]:
np.sum(ary)

Out[9]:
nan

Since any arithmetic that involves NaN yields NaN, np.sum is of no use here. Use np.nansum instead, which treats NaNs as zero:

In [10]:
print('total sum:', np.nansum(ary))

total sum: 62.0

In [11]:
print('column sums:', np.nansum(ary, axis=0))

column sums: [ 16.  19.  15.  12.]

In [12]:
print('row sums:', np.nansum(ary, axis=1))

row sums: [ 10.  19.  33.]
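NumPy offers NaN-aware counterparts for other reductions as well. The sketch below uses np.nanmean, np.nanmin, and np.nanmax on the same data; unlike a sum, a mean ignores the missing entries entirely, so each column mean is divided by the number of values actually present.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

total_mean = np.nanmean(ary)             # 62 / 10 non-missing values
column_means = np.nanmean(ary, axis=0)   # columns 3 and 4 average 2 values each

print('overall mean:', total_mean)
print('column means:', column_means)
print('min / max:', np.nanmin(ary), np.nanmax(ary))
```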


## Removing all rows that contain missing values

Here, we will use the Boolean mask again to return only those rows that DON'T contain missing values; the .any(axis=1) call flags every row that has at least one NaN. If we wanted only the rows that contain NaNs instead, we could simply drop the ~.

In [14]:
ary[~np.isnan(ary).any(axis=1)]

Out[14]:
array([[ 1.,  2.,  3.,  4.]])
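The same idea works along the other axis. As a sketch, checking with `axis=0` flags columns instead of rows, and the mask then goes into the second index position; `cols_clean` and `rows_with_nan` are names made up for this example.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

# Keep only the columns without any NaN (mask indexes the column axis).
cols_clean = ary[:, ~np.isnan(ary).any(axis=0)]
print(cols_clean)

# Without the ~ we get the rows that DO contain a NaN.
rows_with_nan = ary[np.isnan(ary).any(axis=1)]
print(rows_with_nan)
```

For this data, dropping columns keeps the first two columns of all three rows, whereas dropping rows keeps only the first row.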

## Convert missing values to 0

Certain operations, algorithms, and other analyses might not work with NaN values in our data array. But that's not a problem: the convenient np.nan_to_num function converts them to the value 0 (it also replaces infinities with very large finite numbers).

In [15]:
ary0 = np.nan_to_num(ary)
ary0

Out[15]:
array([[  1.,   2.,   3.,   4.],
[  5.,   6.,   0.,   8.],
[ 10.,  11.,  12.,   0.]])
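Zero is not always a sensible replacement. As a sketch of a more general approach, np.where can substitute any value wherever the mask is True. Filling with the column mean, as done here, is only one illustrative choice and not a recommendation; `column_means` and `filled` are hypothetical names.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

# Mean of each column, ignoring NaNs (illustrative fill value only).
column_means = np.nanmean(ary, axis=0)

# Where the mask is True take the column mean, otherwise the original value.
# column_means (shape (4,)) is broadcast across the rows.
filled = np.where(np.isnan(ary), column_means, ary)
print(filled)
```

np.where returns a new array, so `ary` itself still contains its NaNs afterwards.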

## Converting certain numbers to NaN

Vice versa, we can also convert any number to np.nan. Here, we use the array that we created in the previous section and convert the 0s back to np.nan. Note that this replaces every 0 in the array, including any that were genuine values rather than former NaNs.

In [16]:
ary0[ary0==0] = np.nan
ary0

Out[16]:
array([[  1.,   2.,   3.,   4.],
[  5.,   6.,  nan,   8.],
[ 10.,  11.,  12.,  nan]])
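A common use of this trick is to replace a placeholder value that marks missing data. The sketch below assumes a made-up sentinel of -999 in an integer array. Since NaN is a floating-point value, the array has to be converted to float first; `raw`, `as_float`, and `cleaned` are hypothetical names.

```python
import numpy as np

# Hypothetical integer data where -999 marks a missing measurement.
raw = np.array([[1, 2, -999],
                [4, -999, 6]])

# Integer arrays cannot hold NaN, so convert to float first.
as_float = raw.astype(float)

# Build a new array instead of modifying the data in place.
cleaned = np.where(as_float == -999, np.nan, as_float)
print(cleaned)
```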

## Remove all missing elements from an array

This one is a little bit more tricky. We can remove missing values via a combination of the Boolean mask and fancy indexing; however, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array).

In [17]:
ary[~np.isnan(ary)]

Out[17]:
array([  1.,   2.,   3.,   4.,   5.,   6.,   8.,  10.,  11.,  12.])

Thus, this method works better on individual rows:

In [21]:
x = np.array([1, 2, np.nan])

x[~np.isnan(x)]

Out[21]:
array([ 1.,  2.])
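To apply the row-wise approach to the whole 2D array, one option is a list comprehension. This is only a sketch: because the rows end up with different lengths, the result is a plain Python list of 1D arrays rather than a single ndarray. `rows_no_nan` is a name chosen for this example.

```python
import numpy as np

# Hand-built copy of the example array so the snippet is self-contained.
ary = np.array([[1, 2, 3, 4],
                [5, 6, np.nan, 8],
                [10, 11, 12, np.nan]])

# Filter each row separately; iterating a 2D array yields its rows.
rows_no_nan = [row[~np.isnan(row)] for row in ary]

for row in rows_no_nan:
    print(row)
```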