%watermark -v -p numpy -d -u
Last updated: 31/07/2014 CPython 3.4.1 IPython 2.1.0 numpy 1.8.1
[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension.
This is just a quick overview of how to deal with missing values (i.e., "NaN"s, short for "Not-a-Number") in NumPy, and I am happy to expand it over time. And yes, there will also be a separate one for pandas at some point!
Let's assume that we have a CSV file with missing elements like the one shown below.
%%file example.csv
1,2,3,4
5,6,,8
10,11,12,
The `np.genfromtxt` function has a `missing_values` parameter and, by default, translates missing values into `np.nan`. This allows us to construct a new NumPy `ndarray` object, even if elements are missing.
import numpy as np
# empty fields in the CSV file become np.nan
ary = np.genfromtxt('./example.csv', delimiter=',')
# report rows x columns
print('%s x %s array:\n' % (ary.shape[0], ary.shape[1]))
print(ary)
3 x 4 array:

[[ 1. 2. 3. 4.]
 [ 5. 6. nan 8.]
 [ 10. 11. 12. nan]]
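As a related sketch, `np.genfromtxt` also accepts a `filling_values` argument that replaces missing fields with a value of our choice instead of `np.nan`. The example below is not part of the original walkthrough: it reads the same data from an in-memory string (an assumption, so that no file is needed) and uses `-1` as an arbitrary, hypothetical placeholder.

```python
import io
import numpy as np

# Same content as example.csv, held in memory for this sketch
csv_text = "1,2,3,4\n5,6,,8\n10,11,12,\n"

# Default behavior: empty fields become np.nan
default_ary = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# Assumed placeholder: fill empty fields with -1 instead
filled_ary = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                           filling_values=-1)

print(default_ary)
print(filled_ary)
```

Whether a placeholder such as `-1` is appropriate depends on the data; keeping `np.nan` makes the missing entries easy to find later on.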
A handy function to test whether a value is a NaN or not is the `np.isnan` function.
It is especially useful to create boolean masks for the so-called "fancy indexing" of NumPy arrays, which we will come back to later.
array([[False, False, False, False],
       [False, False, True, False],
       [False, False, False, True]], dtype=bool)
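A mask like the one above can be produced by calling `np.isnan` on the array. The following is a minimal sketch; the array is typed in by hand (rather than read from `example.csv`) so that the snippet is self-contained.

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Elementwise test; the result has the same shape as ary
mask = np.isnan(ary)
print(mask)

# A plain equality check does not work, because NaN is not equal to itself
print(np.nan == np.nan)
```

This is also why `np.isnan` is needed in the first place: comparing against `np.nan` with `==` is always `False`.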
In order to find out how many elements are missing in our array, we can count the `True` entries in the mask returned by the `np.isnan` function that we have seen in the previous section.
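One way to do the counting is sketched below, using `np.count_nonzero` on the mask. The array is again typed in by hand so the snippet runs on its own; summing the mask would work as well, since `True` counts as 1.

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Total number of missing elements
n_missing = np.count_nonzero(np.isnan(ary))
print('missing elements:', n_missing)

# Missing elements per column (True is counted as 1)
missing_per_column = np.isnan(ary).sum(axis=0)
print('missing per column:', missing_per_column)
```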
If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask via the handy "tilde" sign (`~`).
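A short sketch of the inverted mask, with the array typed in by hand so the snippet is self-contained:

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# ~ flips every entry of the mask: True marks the non-missing elements
n_present = np.count_nonzero(~np.isnan(ary))
print('non-missing elements:', n_present)
```

Together with the number of missing elements, this adds up to `ary.size`.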
As we will find out via the following code snippet, we can't use NumPy's regular `sum` function to calculate the sum of an array that contains NaNs.
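A minimal sketch of the problem, with the array typed in by hand: any arithmetic that involves a NaN yields NaN, so a single missing value is enough to spoil the total.

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# NaN propagates through the addition
total = np.sum(ary)
print('total sum:', total)
```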
Since the `np.sum` function does not work here, we use `np.nansum` instead:
print('total sum:', np.nansum(ary))
total sum: 62.0
print('column sums:', np.nansum(ary, axis=0))
column sums: [ 16. 19. 15. 12.]
print('row sums:', np.nansum(ary, axis=1))
row sums: [ 10. 19. 33.]
Here, we will use the Boolean mask again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain NaNs, we could simply drop the tilde (`~`).
array([[ 1., 2., 3., 4.]])
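One way to build such a row filter is sketched below; it is an illustration under the assumption that a row counts as incomplete as soon as any of its elements is NaN. The array is typed in by hand so the snippet runs on its own.

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# One Boolean per row: True if the row contains at least one NaN
row_has_nan = np.isnan(ary).any(axis=1)

# Rows without missing values
complete_rows = ary[~row_has_nan]
print(complete_rows)

# Dropping the tilde returns the rows that do contain NaNs
incomplete_rows = ary[row_has_nan]
print(incomplete_rows)
```

Because the mask has one entry per row, the result keeps its two-dimensional shape.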
Certain operations, algorithms, and other analyses might not work with NaN objects in our data array. But that's not a problem: We can use the convenient `np.nan_to_num` function, which converts NaNs to the value 0.
ary0 = np.nan_to_num(ary)
ary0
array([[ 1., 2., 3., 4.],
       [ 5., 6., 0., 8.],
       [ 10., 11., 12., 0.]])
Vice versa, we can also convert any number to `np.nan`. Here, we use the array that we created in the previous section and convert the 0s back to `np.nan`.
ary0[ary0 == 0] = np.nan
ary0
array([[ 1., 2., 3., 4.],
       [ 5., 6., nan, 8.],
       [ 10., 11., 12., nan]])
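One caveat worth sketching: replacing every 0 with NaN also overwrites zeros that were real data. The example below uses a hypothetical variant of our array with a genuine 0 in the first position (an assumption made for illustration, not the original data) and restores only the former NaNs by remembering the mask.

```python
import numpy as np

# Hypothetical variant of the example array with a genuine 0 at [0, 0]
ary = np.array([[0., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Remember where the NaNs were before replacing them
nan_mask = np.isnan(ary)
ary0 = np.nan_to_num(ary)

# Set back only the positions that were NaN originally
restored = ary0.copy()
restored[nan_mask] = np.nan
print(restored)
```

With `ary0[ary0 == 0] = np.nan`, the genuine 0 in the first row would have been turned into NaN as well.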
This one is a little bit trickier. We can remove missing values via a combination of the Boolean mask and fancy indexing; however, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array).
array([ 1., 2., 3., 4., 5., 6., 8., 10., 11., 12.])
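A sketch of the flattening behavior, with the array typed in by hand so the snippet is self-contained:

```python
import numpy as np

# Same values as example.csv, written out by hand for this sketch
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Indexing with an elementwise mask always returns a 1-D array
no_nans = ary[~np.isnan(ary)]
print(no_nans)
print(no_nans.shape)
```

The rows would otherwise end up with different lengths, which a regular NumPy array cannot represent.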
Thus, this is a method that works better on individual rows:
x = np.array([1, 2, np.nan])
x[~np.isnan(x)]
array([ 1., 2.])