import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
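0    False
1    False
2     True
3    False
dtype: bool
pandas follows a convention from the R language and refers to missing data as NA (not available). In statistical applications, NA data may either be data that does not exist or data that exists but could not be observed. When cleaning data, it is important to analyze the missing values themselves, to determine whether they point to a data collection problem or could introduce bias.
The built-in Python None value is also treated as NA: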
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
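As a starting point for the kind of missing-data analysis mentioned above, the following is a minimal sketch that counts missing values per column. The survey frame and its column names are invented for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are invented for illustration.
survey = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

# isnull() flags both NaN and None; summing the booleans counts them per column.
missing_per_column = survey.isnull().sum()

# The mean of the boolean mask is the fraction of missing entries per column.
missing_fraction = survey.isnull().mean()

print(missing_per_column)
print(missing_fraction)
```

A column with an unusually high missing fraction is often a hint that something went wrong during collection, which is worth checking before dropping or filling anything.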
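The main functions for handling missing values are dropna, fillna, isnull, and notnull.
There are several ways to filter out missing values. You can always do it by hand with pandas.isnull and boolean indexing, but dropna is usually more convenient. On a Series, it returns only the non-null data and their index values: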
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
The above is equivalent to:
data[data.notnull()]
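0    1.0
2    3.5
4    7.0
dtype: float64
With a DataFrame, things are a little more complex. You may want to drop rows or columns that are entirely NA, or only those that contain any NA. By default, dropna drops every row that contains a missing value: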
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
cleaned = data.dropna()
cleaned
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
Passing how='all' drops only the rows that are all NA:
data.dropna(how='all')
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
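To make the difference between the default behavior and how='all' concrete, here is a small sketch that reproduces both with boolean masks built from notnull. The names any_mask and all_mask are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Default (how='any'): a row survives only if every value in it is present.
any_mask = data.notnull().all(axis=1)

# how='all': a row survives if at least one value in it is present.
all_mask = data.notnull().any(axis=1)

print(data[any_mask])   # same rows as data.dropna()
print(data[all_mask])   # same rows as data.dropna(how='all')
```

The naming can be confusing at first: how='any' drops a row if any value is missing, which means keeping rows where all values are present.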
To drop columns in the same way, pass axis=1:
data[4] = NA
data
| | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1, how='all')
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
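A related way of filtering DataFrame rows often comes up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations (non-NA values). You can specify this with the thresh argument: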
df = pd.DataFrame(np.random.randn(7, 3))
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | 0.487466 | -0.251823 |
1 | 2.008704 | -0.177133 | 1.827761 |
2 | 2.240856 | -0.587865 | 0.273062 |
3 | 0.777182 | -0.629568 | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
df.iloc[:4, 1] = NA
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | NaN | -0.251823 |
1 | 2.008704 | NaN | 1.827761 |
2 | 2.240856 | NaN | 0.273062 |
3 | 0.777182 | NaN | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
df.iloc[:2, 2] = NA
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | NaN | NaN |
1 | 2.008704 | NaN | NaN |
2 | 2.240856 | NaN | 0.273062 |
3 | 0.777182 | NaN | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
df.dropna()
| | 0 | 1 | 2 |
---|---|---|---|
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
df.dropna(thresh=2)
| | 0 | 1 | 2 |
---|---|---|---|
2 | 2.240856 | NaN | 0.273062 |
3 | 0.777182 | NaN | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
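The thresh argument can be read as "keep rows with at least this many non-NA values". The sketch below checks that reading against a hand-built mask, using a small fixed frame instead of random data so that the result is reproducible. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],   # 1 observed value
                      [2.0, np.nan, 0.3],      # 2 observed values
                      [3.0, 0.8, -0.7]])       # 3 observed values

# Count the non-missing values in each row.
non_missing = frame.notnull().sum(axis=1)

# Keep rows that have at least two observed values.
by_mask = frame[non_missing >= 2]
by_thresh = frame.dropna(thresh=2)

print(by_thresh)
```

Both approaches keep rows 1 and 2 and drop row 0, which has only a single observed value.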
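Instead of dropping missing values, you may want to fill them in. For most purposes, fillna is the method to use. Calling fillna with a constant replaces every missing value with that constant: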
df.fillna(0)
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | 0.000000 | 0.000000 |
1 | 2.008704 | 0.000000 | 0.000000 |
2 | 2.240856 | 0.000000 | 0.273062 |
3 | 0.777182 | 0.000000 | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
Passing a dict to fillna lets you use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | 0.500000 | 0.000000 |
1 | 2.008704 | 0.500000 | 0.000000 |
2 | 2.240856 | 0.500000 | 0.273062 |
3 | 0.777182 | 0.500000 | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
fillna returns a new object, but you can modify the existing object in place by passing inplace=True:
_ = df.fillna(0, inplace=True)
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.986575 | 0.000000 | 0.000000 |
1 | 2.008704 | 0.000000 | 0.000000 |
2 | 2.240856 | 0.000000 | 0.273062 |
3 | 0.777182 | 0.000000 | -0.220044 |
4 | 0.327522 | 0.781662 | -0.651949 |
5 | 1.454611 | -0.170581 | -1.740959 |
6 | -0.711897 | 0.074983 | 1.343807 |
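The difference between the two calling styles is easy to miss, so here is a small sketch that contrasts them on a throwaway frame. The frame and the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# Without inplace, fillna builds a new object and leaves frame untouched.
filled = frame.fillna(0)
missing_after_copy = int(frame.isnull().sum().sum())

# With inplace=True, frame itself is modified.
frame.fillna(0, inplace=True)
missing_after_inplace = int(frame.isnull().sum().sum())

print(missing_after_copy, missing_after_inplace)
```

Reassigning the result (frame = frame.fillna(0)) is a common alternative to inplace=True and makes the data flow more explicit.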
The same interpolation methods that are available for reindexing, such as forward fill, can also be used to fill missing values:
df = pd.DataFrame(np.random.randn(6, 3))
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.151508 | 1.185176 | -1.766933 |
1 | 0.544729 | -0.807814 | 0.696087 |
2 | -1.461950 | 0.448852 | 0.189045 |
3 | 0.559766 | 0.341335 | 1.469807 |
4 | -0.362789 | 1.117338 | -0.383870 |
5 | -0.452329 | -0.282040 | -0.541759 |
df.iloc[2:, 1] = NA
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.151508 | 1.185176 | -1.766933 |
1 | 0.544729 | -0.807814 | 0.696087 |
2 | -1.461950 | NaN | 0.189045 |
3 | 0.559766 | NaN | 1.469807 |
4 | -0.362789 | NaN | -0.383870 |
5 | -0.452329 | NaN | -0.541759 |
df.iloc[4:, 2] = NA
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.151508 | 1.185176 | -1.766933 |
1 | 0.544729 | -0.807814 | 0.696087 |
2 | -1.461950 | NaN | 0.189045 |
3 | 0.559766 | NaN | 1.469807 |
4 | -0.362789 | NaN | NaN |
5 | -0.452329 | NaN | NaN |
df.ffill()  # forward fill; the method keyword of fillna is deprecated since pandas 2.1
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.151508 | 1.185176 | -1.766933 |
1 | 0.544729 | -0.807814 | 0.696087 |
2 | -1.461950 | -0.807814 | 0.189045 |
3 | 0.559766 | -0.807814 | 1.469807 |
4 | -0.362789 | -0.807814 | 1.469807 |
5 | -0.452329 | -0.807814 | 1.469807 |
df.ffill(limit=2)  # fill at most 2 consecutive missing values per column
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.151508 | 1.185176 | -1.766933 |
1 | 0.544729 | -0.807814 | 0.696087 |
2 | -1.461950 | -0.807814 | 0.189045 |
3 | 0.559766 | -0.807814 | 1.469807 |
4 | -0.362789 | NaN | 1.469807 |
5 | -0.452329 | NaN | 1.469807 |
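To see forward filling and the limit argument on data that does not depend on a random draw, here is a small sketch on a fixed Series. It also shows bfill, the backward-filling counterpart, which the text above does not use. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: each gap takes the last observed value before it.
forward = series.ffill()

# limit=2 fills at most two consecutive missing values, so the third stays NaN.
forward_limited = series.ffill(limit=2)

# Backward fill: each gap takes the next observed value after it.
backward = series.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

Forward filling is a natural choice when a value is assumed to persist until the next observation. A limit guards against carrying a stale value across a long gap.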
With a little creativity, fillna can do much more. For example, you can pass the mean or median of a Series as the fill value:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
The following are some of the parameters of fillna:
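The same idea extends to a DataFrame: passing the Series returned by mean or median fills each column with its own statistic. The following is a minimal sketch on an invented frame. The column names x and y are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0, 8.0],
                      "y": [np.nan, 10.0, 20.0, 60.0]})

# mean() and median() skip NaN by default and return one value per column.
column_means = frame.mean()
column_medians = frame.median()

# fillna aligns the Series index with the column labels,
# so each column is filled with its own statistic.
filled_mean = frame.fillna(column_means)
filled_median = frame.fillna(column_medians)

print(filled_mean)
print(filled_median)
```

The median is less sensitive to outliers than the mean, which can matter for skewed columns such as y here.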