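pandas

pandas features and import

pandas provides high-level data structures and well-designed tools for working with data.
pandas is built on top of NumPy.
Import:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame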

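pandas data structures

SERIES

A Series is a one-dimensional, array-like object.

It holds an array of data (of any NumPy data type) together with an associated index.

Without an explicit index: a = Series([1,2,3]) prints as

0    1
1    2
2    3
dtype: int64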

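The attributes a.index and a.values hold the index and the values, respectively.

With an explicit index: a = Series([1,2,3],index=['a','b','c'])

Values can then be accessed by label: a['b']

To check whether a label exists in the index: 'b' in a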
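A minimal sketch of the Series basics described above. The variable names (labels, values, second, has_b, has_2) are only for illustration.

```python
import pandas as pd

# Series with an explicit label index
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

labels = list(a.index)       # the index labels
values = a.values.tolist()   # the underlying data as a plain list
second = a['b']              # label-based access
has_b = 'b' in a             # membership tests the index labels ...
has_2 = 2 in a               # ... not the values, so this is False

print(labels, values, second, has_b, has_2)
```

Note that the in operator checks labels only; to test for a value, use something like 2 in a.values.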

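Creating a Series from a dict

data = {'china':10,'america':30,'indian':20}
print(Series(data))

Output (recent pandas versions keep the dict's insertion order; older versions sorted the keys):

china      10
america    30
indian     20
dtype: int64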

Finding which index labels have no value:

data = {'china':10,'america':30,'indian':20}
state = ['china','america','test']
a = Series(data, index=state)
print(a.isnull())

Output (the label test has no corresponding value, so it is NaN):

china      False
america    False
test        True
dtype: bool

Arithmetic operations automatically align data by index label

a = Series([10,20], index=['china','test'])
b = Series([10,20], index=['test','china'])
print(a+b)

Output:

china    30
test     30
dtype: int64
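Alignment also matters when the two indexes do not contain the same labels. The sketch below uses a made-up second Series c to show that labels present on only one side produce NaN, and that Series.add with fill_value is one way to avoid that.

```python
import pandas as pd

a = pd.Series([10, 20], index=['china', 'test'])
c = pd.Series([1, 2], index=['china', 'other'])   # hypothetical data for illustration

total = a + c                     # labels found in only one Series become NaN
filled = a.add(c, fill_value=0)   # treat the missing side as 0 instead

print(total)
print(filled)
```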

Setting the name attribute of a Series and of its index

a = Series([10,20], index=['china','test'])
a.index.name = 'state'
a.name = 'number'
print(a)

Output:

state
china    10
test     20
Name: number, dtype: int64

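DATAFRAME

A DataFrame represents a table, a spreadsheet-like data structure.

It contains an ordered collection of columns.

Each column can hold a different value type (numeric, string, boolean, and so on).

Internally a DataFrame stores its data in a two-dimensional layout, so hierarchical indexing can be used to represent higher-dimensional data in tabular form.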

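Creation:

From a dict

data = {'state': ['a', 'b', 'c', 'd', 'd'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)

Output: (recent pandas versions keep the dict's insertion order for the columns, while older versions sorted them by name; if you pass columns=[...] explicitly, that order is used. The index is assigned automatically.)

  state  year  pop
0     a  2000  1.5
1     b  2001  1.7
2     c  2002  3.6
3     d  2001  2.4
4     d  2002  2.9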

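Access

Columns: as with a Series, access by column name: frame['state'] or frame.state (attribute access only works when the name is a valid Python identifier and does not clash with a DataFrame method; frame.pop, for example, is the pop method, not the column)

Rows: use the loc (label-based) or iloc (position-based) indexers. frame.loc[2] returns the row labeled 2, which here is the third row, as a Series with one value per column. The older ix indexer was deprecated and has been removed from pandas.

Assignment: frame2['debt'] = np.arange(5.) (if there is no debt column, a new one is added)

Deleting a column: del frame2['eastern']

As with a Series, the values attribute returns the data in the DataFrame as a two-dimensional ndarray

All column labels: frame.columns

Transpose: frame2.T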
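The access patterns above, put together as a small runnable sketch. It reuses the example data from the dict-based creation; frame2 here is simply a DataFrame built from that data, which is an assumption since the notes do not define it.

```python
import numpy as np
import pandas as pd

data = {'state': ['a', 'b', 'c', 'd', 'd'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data)

states = frame2['state']       # a column, returned as a Series
third_row = frame2.iloc[2]     # row by integer position
same_row = frame2.loc[2]       # row by label (the default labels are 0..4)

frame2['debt'] = np.arange(5.)           # column does not exist yet, so it is created
had_debt = 'debt' in frame2.columns
del frame2['debt']                       # remove it again

matrix = frame2.values         # two-dimensional ndarray of the data
transposed = frame2.T          # rows and columns swapped

print(third_row)
print(matrix.shape, transposed.shape)
```

Because the column is named pop, frame2['pop'] is required here; frame2.pop refers to the DataFrame.pop method.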

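Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis name or names)

Index objects are immutable, so their elements cannot be modified in place

Creating one: index = pd.Index([1,2,3])

Common operations

append --> concatenates additional Index objects, producing a new Index

difference --> computes the set difference (this was called diff in very old pandas versions)

intersection --> computes the set intersection

union --> computes the set union

isin --> computes a boolean array indicating whether each value is contained in the passed collection

delete --> computes a new Index with the element at position i removed

drop --> computes a new Index with the passed values removed

insert --> computes a new Index with an element inserted at position i

is_monotonic_increasing --> True if every element is greater than or equal to the one before it (the older is_monotonic alias has been removed)

is_unique --> True if the Index has no duplicate values

unique --> computes the unique values in the Index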
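A short sketch that exercises the Index operations listed above on two small, made-up Index objects. It uses the current method names (difference, is_monotonic_increasing).

```python
import pandas as pd

index = pd.Index([1, 2, 3])
other = pd.Index([3, 4])

appended = index.append(other)         # plain concatenation, duplicates are kept
diff = index.difference(other)         # values in index but not in other
common = index.intersection(other)
merged = index.union(other)
mask = index.isin([2, 5])              # boolean array, one entry per element
without_first = index.delete(0)        # remove by position
without_three = index.drop([3])        # remove by value
inserted = index.insert(1, 10)         # insert 10 at position 1

print(list(appended), list(diff), list(common), list(merged))
print(mask, index.is_monotonic_increasing, index.is_unique)

# Index objects are immutable: item assignment raises TypeError
try:
    index[0] = 99
    mutated = True
except TypeError:
    mutated = False
print('mutated:', mutated)
```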

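Reindexing with reindex

SERIES

Reordering

a = Series([2,3,1],index=['b','a','c'])
b = a.reindex(['a','b','c'])
print(b)

Reordering, filling labels that do not exist with 0: b = a.reindex(['a','b','c','d'], fill_value=0)

Interpolating or filling values while reindexing

a = Series(['a','b','c'],index=[0,2,4])
b = a.reindex(range(6),method='ffill')
print(b)

Output:

0    a
1    a
2    b
3    b
4    c
5    c
dtype: object

Options for the method argument

ffill or pad --> fill forward (carry the previous value forward)

bfill or backfill --> fill backward (carry the next value backward)

DATAFRAME

As with a Series, reindex can reorder the index (rows); it can also reindex the columns: frame.reindex(columns=['a','b'])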
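The reindex variants above as one runnable sketch. The small DataFrame frame is made up for illustration.

```python
import pandas as pd

a = pd.Series([2, 3, 1], index=['b', 'a', 'c'])

reordered = a.reindex(['a', 'b', 'c'])                      # values travel with their labels
with_nan = a.reindex(['a', 'b', 'c', 'd'])                  # new label 'd' gets NaN
with_zero = a.reindex(['a', 'b', 'c', 'd'], fill_value=0)   # ... or a chosen fill value

letters = pd.Series(['a', 'b', 'c'], index=[0, 2, 4])
forward = letters.reindex(range(6), method='ffill')         # gaps take the previous value
backward = letters.reindex(range(6), method='bfill')        # gaps take the next value

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
subset = frame.reindex(columns=['a', 'b'])                  # select/reorder columns by name

print(reordered)
print(forward)
```

With bfill, position 5 has no later value to copy, so it stays NaN.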

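Dropping entries from an axis

SERIES

a.drop(['a','b']) returns a new object without the index labels a and b

DATAFRAME

Dropping index labels (rows) works the same way as for a Series

Dropping a column --> a.drop(['one'], axis=1) drops the column named one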
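A brief sketch of drop on both structures, using made-up data. drop returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
s_dropped = s.drop(['a', 'b'])             # new Series; s itself still has 4 entries

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['x', 'y'], columns=['one', 'two', 'three'])
rows_dropped = df.drop(['x'])              # axis=0 (rows) is the default
cols_dropped = df.drop(['one'], axis=1)    # equivalent to df.drop(columns=['one'])

print(s_dropped)
print(cols_dropped)
```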

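Indexing, selection, and filtering

SERIES

Data can be accessed by index label or by integer position. For a = Series(np.arange(4.), index=['a', 'b', 'c', 'd']), a['b'] and a.iloc[1] return the same value (using plain a[1] as a positional lookup on a labeled index is deprecated in recent pandas).

Slicing with labels differs from normal Python slicing: the endpoint is included.

a = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(a['b':'c'])

The output includes the value for label c

DATAFRAME

First two rows: a[:2]

Boolean selection: a[a['two']>5]

Using the loc indexer (which replaces the removed ix):

Row with index 2, columns 'one' and 'two' --> a.loc[[2],['one','two']]

The row with index 2: a.loc[2]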
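The selection patterns above in runnable form. The DataFrame a is assumed to be a 4x4 frame with columns one, two, three, four and index 0..3.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
label_slice = s['b':'c']      # label slice: both endpoints included
pos_slice = s.iloc[1:2]       # positional slice: end excluded, as in plain Python

a = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=[0, 1, 2, 3], columns=['one', 'two', 'three', 'four'])
first_two = a[:2]                      # first two rows
big = a[a['two'] > 5]                  # rows where column 'two' exceeds 5
cell = a.loc[[2], ['one', 'two']]      # row label 2, two columns (a DataFrame)
row = a.loc[2]                         # row label 2 (a Series)

print(label_slice)
print(big)
```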

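Operations between DataFrame and Series

Subtracting a Series from every row of a DataFrame (the Series index is matched against the DataFrame's columns)

a = pd.DataFrame(np.arange(16).reshape(4,4), index=[0,1,2,3], columns=['one','two','three','four'])
print(a)
b = Series([0,1,2,3], index=['one','two','three','four'])
print(b)
print(a-b)

Output: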

   one  two  three  four
0    0    1      2     3
1    4    5      6     7
2    8    9     10    11
3   12   13     14    15
one      0
two      1
three    2
four     3
dtype: int64
   one  two  three  four
0    0    0      0     0
1    4    4      4     4
2    8    8      8     8
3   12   12     12    12
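The example above broadcasts across rows. To go the other way and subtract a Series from every column (matching on the row index), one option is the sub method with axis=0. A minimal sketch on the same frame:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=[0, 1, 2, 3], columns=['one', 'two', 'three', 'four'])

col = a['one']                 # Series indexed by the row labels 0..3
by_col = a.sub(col, axis=0)    # subtract col from each column, aligned on the rows

print(by_col)
```

Each row of the result holds that row's offsets from its value in column one.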

Reading files

CSV files

pd.read_csv(r"data/train.csv") returns a DataFrame
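read_csv also accepts a file-like object, which makes it easy to try without a file on disk. The CSV text below is made up for illustration and is not the contents of data/train.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content; the empty field in the second data row becomes NaN
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,\n3,1,26\n"

train_data = pd.read_csv(io.StringIO(csv_text))

print(type(train_data).__name__)
print(train_data)
```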

Inspecting a DataFrame

train_data.describe()  # summary statistics for each numeric column

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000

Locating values in a column and replacing them

df.loc[df.Age.isnull(),'Age'] = 23  # fill the missing values in the 'Age' column with 23
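A small check of the loc assignment on made-up data, next to fillna, which is a common alternative for this particular case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 26.0]})    # hypothetical data
df.loc[df.Age.isnull(), 'Age'] = 23                 # boolean row mask + column label

alt = pd.DataFrame({'Age': [22.0, np.nan, 26.0]})
alt['Age'] = alt['Age'].fillna(23)                  # same result for missing values

print(df)
```

The loc form is the more general one, since the mask can be any boolean condition, not just a test for missing values.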

Converting categorical variables to indicator variables with `get_dummies()`

s = pd.Series(list('abca'))
pd.get_dummies(s)
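To see what the result looks like: each distinct value becomes a column, and each row marks which value it had. The dtype of the indicator columns depends on the pandas version (boolean in recent releases, small integers in older ones), so the sketch casts to int before printing.

```python
import pandas as pd

s = pd.Series(list('abca'))
dummies = pd.get_dummies(s)

# Cast so the output looks the same regardless of the version-specific dtype
print(dummies.astype(int))
```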

Converting between list and string

string to list

>>> s = 'abcde'
>>> chars = list(s)
>>> chars
['a', 'b', 'c', 'd', 'e']

list to string

>>> str_convert = ','.join(chars)
>>> str_convert
'a,b,c,d,e'

Dropping the old index and renumbering from 0 to n-1

x = x.reset_index(drop=True)
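This is typically needed after filtering, which leaves gaps in the index. A minimal sketch with made-up values:

```python
import pandas as pd

x = pd.Series([10, 20, 30], index=[5, 2, 9])
x = x[x > 10]                      # keeps the labels 2 and 9
old_labels = list(x.index)

x = x.reset_index(drop=True)       # drop=True discards the old labels
print(old_labels, list(x.index))
```

Without drop=True, the old index is kept as a new column and the result of calling this on a Series becomes a DataFrame.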

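The apply function

DataFrame.apply(func, axis=0, raw=False, result_type=None, ...)

(The broadcast and reduce parameters found in older versions have been removed; result_type covers their use cases.)

df.apply(np.sqrt) # returns DataFrame

This is equivalent to df.apply(lambda x: np.sqrt(x)); the lambda form is more flexible

df.apply(np.sum, axis=0) # equiv to df.sum(0)

df.apply(np.sum, axis=1) # equiv to df.sum(1)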
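The calls above on a small made-up DataFrame, plus one lambda that has no ready-made NumPy equivalent, to show where the lambda form is useful.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0], 'b': [9.0, 16.0]})   # hypothetical data

roots = df.apply(np.sqrt)                  # element-wise, returns a DataFrame
col_sums = df.apply(np.sum, axis=0)        # one value per column
row_sums = df.apply(np.sum, axis=1)        # one value per row
col_range = df.apply(lambda col: col.max() - col.min())   # custom per-column reduction

print(roots)
print(col_sums.tolist(), row_sums.tolist(), col_range.tolist())
```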

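The `re.search().group()` functions

re.search(pattern, string, flags=0)

re.search returns a match object, or None if the pattern does not match anywhere in the string. The match object's group() method returns the matched text: group() or group(0) is the whole match, group(1) is the first parenthesized group, and passing several group numbers, for example group(0,1), returns a tuple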
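A short example of search and group. The name string and the pattern are illustrative choices, not taken from the notes.

```python
import re

name = 'Braund, Mr. Owen Harris'                   # illustrative input
match = re.search(r',\s*([A-Za-z]+)\.', name)      # a word between ", " and "."

if match is not None:          # search returns None when nothing matches
    whole = match.group(0)     # the entire matched text
    title = match.group(1)     # the first parenthesized group
    both = match.group(0, 1)   # several groups at once, as a tuple
    print(whole, title, both)

no_match = re.search(r'\d+', name)   # no digits in the string
print(no_match)
```

Calling .group() directly on the result of re.search raises AttributeError when there is no match, so the None check is worth keeping.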

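The pandas.cut() function

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

x is a one-dimensional array
bins can be an int or a sequence
- if an int, the range of x is split into that many equal-width bins
- if a sequence, its values are the bin edges you specify
right controls whether each bin includes its right edge; the default is True
labels is an array or the boolean False
- if an array, its length must equal the number of resulting bins
- if False, only the integer bin number of each value is returned

e.g. pd.cut(full["FamilySize"], bins=[0,1,4,20], labels=[0,1,2])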
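The same call on a small made-up Series standing in for full["FamilySize"]. With the default right=True the bins are (0, 1], (1, 4] and (4, 20].

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11])     # hypothetical values

labeled = pd.cut(family_size, bins=[0, 1, 4, 20], labels=[0, 1, 2])
codes = pd.cut(family_size, bins=[0, 1, 4, 20], labels=False)    # integer bin numbers
intervals = pd.cut(family_size, bins=[0, 1, 4, 20])              # default: interval labels
equal_width = pd.cut(family_size, bins=2, labels=False)          # int bins: equal-width split

print(labeled.tolist())
print(intervals.astype(str).tolist())
```

A value of exactly 0 would fall outside the first bin and become NaN unless include_lowest=True is passed.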

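Appending a row of data

Define an empty DataFrame: data_process = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'])

Define one new row: new = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'],index=[j])

The index here can be anything; set it explicitly if you need a particular label

Append: data_process = pd.concat([data_process, new], ignore_index=True). Note that the result must be assigned back (data_process = ...), because the original is not modified in place. Older code used data_process.append(new, ignore_index=True), but DataFrame.append was removed in pandas 2.0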
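A minimal sketch of adding rows with pd.concat, using a shortened, made-up column list and row values. It also shows the usually faster alternative of collecting rows in a list and building the DataFrame once.

```python
import pandas as pd

columns = ['route', 'date', '1', '2']      # shortened, hypothetical column list

data_process = pd.DataFrame([['r1', '2017-01-01', 3, 4]], columns=columns)
new = pd.DataFrame([['r2', '2017-01-02', 5, 6]], columns=columns, index=[7])

# concat returns a new DataFrame; ignore_index=True renumbers the rows 0..n-1
data_process = pd.concat([data_process, new], ignore_index=True)
print(data_process)

# Alternative when adding many rows: accumulate plain lists, then build once
rows = [['r1', '2017-01-01', 3, 4], ['r2', '2017-01-02', 5, 6]]
built_once = pd.DataFrame(rows, columns=columns)
```

Concatenating inside a loop copies the whole table on every iteration, which is why the list-based form tends to scale better.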