统计运算¶

这一章包含数据分析用得最多的函数操作。

@auther: sunzhenhang
@zhihu: https://www.zhihu.com/people/HANGZS/activities

In [2]:

!cd

E:\ML\实战\pandas实用教程 - 副本

In [3]:

import numpy as np
import pandas as pd

1. 数值型统计运算¶

这些统计操作只对元素类型为数值型的列有效，返回以列索引或行索引为索引的Series。

1.1 一元统计¶

顾名思义，这些统计只是自身分布情况的反映。

1.1.1 `.sum()`¶

`DataFrame.sum(axis='index')`¶

axis：'index'-沿列加，'columns'-沿行加

In [12]:

df = pd.DataFrame([[1,2],[3,5]], index = ['a','b'],columns = ['A','B'])
df

Out[12]:

	A	B
a	1	2
b	3	5

In [13]:

df.sum()  # 按列加

Out[13]:

A    4
B    7
dtype: int64

In [14]:

df.sum(axis = 'columns')  # 按行加

Out[14]:

a    3
b    8
dtype: int64

1.1.2 `.mean(), .std(), .var()`¶

均值、标准差、方差

1.1.3 `.max(), .min(), .median()`¶

最大、最小、中值

In [16]:

df.mad(axis = 'index')

Out[16]:

A    0.75
B    0.75
dtype: float64

In [ ]:

1.2 二元统计¶

计算任意两列直接的统计量，返回以列索引为新行索引和列索引的DataFrame

1.2.1 `.cov()`¶

`DataFrame.cov(min_periods=None)`¶

min_periods：每一列去除NaN后，要求能够参与运算的最少元素个数。

In [41]:

df1 = pd.DataFrame([[1,2],[2,0]],columns = ['B','C'])
df1

Out[41]:

	B	C
0	1	2
1	2	0

In [42]:

df1.cov()

Out[42]:

	B	C
B	0.5	-1.0
C	-1.0	2.0

1.2.2 `.corr()`¶

1.2.3 `.corrwith()`¶

corr是自身列之间的关系，而这个函数可以对不同的DataFrame进行运算，不要要记得运算发生在同名列和同索引的行之间。

`DataFrame.corrwith(other, axis=0, drop=False)`¶

other：另一个DataFrame或Series
axis：'index'或'columns'
drop：是否丢掉结果中的NaN

In [62]:

df1 = pd.DataFrame([[1,2],[2,0],[2,3]],index = [0,1,2],columns = ['B','C'])
df1

Out[62]:

	B	C
0	1	2
1	2	0
2	2	3

In [63]:

df

Out[63]:

	A	B
0	1	2
1	2	0

In [65]:

df.corrwith(df1)  #只对 同名列 和 同名行 进行计算

Out[65]:

A    NaN
B   -1.0
C    NaN
dtype: float64

In [69]:

s = pd.Series([1,2], index = [0,1], name = 'B')
s

Out[69]:

0    1
1    2
Name: B, dtype: int64

In [67]:

df

Out[67]:

	A	B
0	1	2
1	2	0

In [72]:

df.corrwith(s)

Out[72]:

A    1.0
B   -1.0
dtype: float64

2. 类型型统计运算¶

2.1. `value_counts()`¶

不适合DataFrame。

`Series/Index.value_counts(normalize=False, ascending=False, bins=None)`¶

normalize：True or False，计算频次或者频率比；
ascending：True or False，排序方式，默认降序；
bins：int，pd.cut的一种快捷操作，对连续数值型效果好；

In [12]:

s = pd.Series([1,2,1,2,1,3])
s

Out[12]:

0    1
1    2
2    1
3    2
4    1
5    3
dtype: int64

In [13]:

s.value_counts()

Out[13]:

1    3
2    2
3    1
dtype: int64

In [14]:

s.value_counts(ascending = True)

Out[14]:

3    1
2    2
1    3
dtype: int64

In [36]:

s.value_counts( bins = 2)   # bins按照int平均分割，左开右闭，左侧外延1%以包含最左值

Out[36]:

(0.997, 2.0]    5
(2.0, 3.0]      1
dtype: int64

2.2 `.count()`¶

计算统计每一类non-NaN元素个数，这个函数可以快速了解哪些特征或哪些样本缺失比较严重。

`DataFrame.count(axis=0)`¶

axis: 0-查看列，1-查看行；

In [73]:

df

Out[73]:

	A	B
0	1	2
1	2	0

In [74]:

df.count(axis = 0)

Out[74]:

A    2
B    2
dtype: int64

In [75]:

df.count(axis = 1)

Out[75]:

0    2
1    2
dtype: int64

In [ ]:

统计运算¶

1. 数值型统计运算¶

1.1 一元统计¶

1.1.1 .sum()¶

DataFrame.sum(axis='index')¶

1.1.2 .mean(), .std(), .var()¶

1.1.3 .max(), .min(), .median()¶

1.2 二元统计¶

1.2.1 .cov()¶

DataFrame.cov(min_periods=None)¶

1.2.2 .corr()¶

1.2.3 .corrwith()¶

DataFrame.corrwith(other, axis=0, drop=False)¶