In [3]:

import numpy as np
import pandas as pd

Creating DataFrames

The easiest way to create a DataFrame is from a NumPy ndarray. If no index is specified, a default integer index (0 to n-1) is created automatically.

In [4]:

dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
dframe

Out[4]:

	A	B	C	D
0	-1.329538	-1.252777	-1.888587	-0.008552
1	-0.601241	0.665922	-0.342204	0.707170
2	0.724993	-0.200555	0.226544	0.751363
3	0.888530	0.292810	-1.039384	1.020441
4	-1.330474	0.194110	1.433966	0.335734
5	1.009005	0.620863	0.360362	-1.552132
6	1.139423	-0.888833	0.512584	-1.293783
7	1.152392	2.318118	1.918517	-1.064239
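The default index can be inspected directly. The sketch below is a minimal illustration, not part of the original notebook: the name `demo` and the seed value are arbitrary choices, and a seeded generator is used only so that the random values repeat between runs.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the random values repeatable between runs.
rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice
demo = pd.DataFrame(rng.standard_normal((3, 2)), columns=['A', 'B'])

# With no index argument, pandas builds a RangeIndex: 0, 1, ..., n-1.
print(type(demo.index).__name__)
print(list(demo.index))
```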

To set the index yourself, pass the index argument. The length of the index must match the number of rows.

In [5]:

dframe = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=list('abcdefgh'))
dframe

Out[5]:

	A	B	C	D
a	-0.088492	0.837208	-0.206235	0.308445
b	0.514290	1.551145	1.129503	0.706509
c	-0.289319	-0.665703	-0.752778	1.089550
d	0.241300	-1.148401	-0.282621	1.000189
e	-0.080722	-0.824329	1.102019	0.845919
f	-1.732742	0.901447	-2.782270	-0.206729
g	0.632871	0.826829	-0.452925	0.435213
h	0.813164	-0.036353	0.462501	-0.309223
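To see what the length requirement means in practice, here is a small sketch (the names `values`, `ok` and `mismatch_rejected` are made up for this illustration). A matching index works; an index with too few labels is rejected with a ValueError.

```python
import numpy as np
import pandas as pd

values = np.zeros((3, 2))

# Three rows and three labels: this works.
ok = pd.DataFrame(values, columns=['A', 'B'], index=list('abc'))

# Three rows but only two labels: pandas refuses to build the frame.
try:
    pd.DataFrame(values, columns=['A', 'B'], index=list('ab'))
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print('ValueError:', err)
```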

Another convenient way to create a DataFrame is from a Python dict. The keys of the dict become the column names and the values become the columns themselves. The values need to be collections of the same length.

In [6]:

dfdict = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=['one', 'two', 'three', 'four'])
dfdict

Out[6]:

	A	B
one	1	5
two	2	6
three	3	7
four	4	8
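The two rules for dicts (keys become columns, values must have equal length) can be checked with a short sketch. The names `data`, `frame` and `unequal_rejected` are hypothetical and used only here.

```python
import pandas as pd

data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
frame = pd.DataFrame(data)

# The dict keys become the column labels, in insertion order.
print(list(frame.columns))

# Lists of different lengths cannot be lined up into rows.
try:
    pd.DataFrame({'A': [1, 2, 3], 'B': [5, 6]})
    unequal_rejected = False
except ValueError as err:
    unequal_rejected = True
    print('ValueError:', err)
```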

Selecting rows from DataFrames:

Location based selection:

Slicing with integer positions selects rows, using the standard NumPy slice notation (the end position is excluded):

In [7]:

dframe[0:4]

Out[7]:

	A	B	C	D
a	-0.088492	0.837208	-0.206235	0.308445
b	0.514290	1.551145	1.129503	0.706509
c	-0.289319	-0.665703	-0.752778	1.089550
d	0.241300	-1.148401	-0.282621	1.000189

However, the .iloc indexer is the recommended way to select by position, because it makes the intent explicit. iloc stands for 'integer location' and accepts integers, lists of integers, or integer slices for both the rows and the columns:

In [8]:

dframe.iloc[0:4,0:2]

Out[8]:

	A	B
a	-0.088492	0.837208
b	0.514290	1.551145
c	-0.289319	-0.665703
d	0.241300	-1.148401
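The different kinds of positional keys that iloc accepts can be sketched on a small frame with predictable values. The frame `demo` below is an assumption made for illustration; it is not the `dframe` used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Values 0..11 laid out in 4 rows and 3 columns, so results are easy to check.
demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    columns=['A', 'B', 'C'], index=list('abcd'))

single = demo.iloc[1, 2]      # one value: row position 1, column position 2
rows = demo.iloc[[0, 3]]      # a list picks the rows at positions 0 and 3
block = demo.iloc[0:2, 0:2]   # slices exclude the end position
last_row = demo.iloc[-1]      # negative positions count from the end

print(single)
print(block)
```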

Label based selection:

In [9]:

dframe.loc[list('bca'),['A','B']]

Out[9]:

	A	B
b	0.514290	1.551145
c	-0.289319	-0.665703
a	-0.088492	0.837208
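One difference from iloc is worth a quick sketch: a label slice with loc includes its end label, while a positional slice with iloc excludes its end position. The frame `demo` is again a made-up example, not the notebook's `dframe`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    columns=['A', 'B', 'C'], index=list('abcd'))

# Label slices include both endpoints: rows 'b' and 'c', columns 'A' and 'B'.
by_label = demo.loc['b':'c', 'A':'B']

# The equivalent positional slice needs an end position one past the last row.
by_position = demo.iloc[1:3, 0:2]

# A list of labels returns the rows in the order requested.
reordered = demo.loc[list('ca'), ['B']]

print(by_label.equals(by_position))
```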

Criteria based selection:

How do you select rows based on values satisfying certain criteria? This is likely the part where most time is spent while doing exploratory data analysis: slicing the dataframe based on certain criteria.

In [10]:

dframe[dframe['C']>0]

Out[10]:

	A	B	C	D
b	0.514290	1.551145	1.129503	0.706509
e	-0.080722	-0.824329	1.102019	0.845919
h	0.813164	-0.036353	0.462501	-0.309223
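It may help to look at the intermediate step. The comparison itself produces a boolean Series that shares the DataFrame's index, and indexing with that Series keeps the rows where it is True. A minimal sketch on a made-up frame (`demo`, `mask`, `selected` and `only_d` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({'C': [-1.0, 2.0, 0.5], 'D': [3.0, -4.0, 5.0]},
                    index=list('xyz'))

mask = demo['C'] > 0          # boolean Series with the same index as demo
print(mask.tolist())          # [False, True, True]

selected = demo[mask]         # keeps the rows where the mask is True
only_d = demo.loc[mask, 'D']  # same rows, restricted to column 'D'
```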

Boolean criteria can be combined using '&' (and), '|' (or) and '~' (not). Each comparison must be wrapped in parentheses, because these operators bind more tightly than comparison operators such as '>' and '<'.

In [11]:

dframe[(dframe['C']>0) & (dframe['D']<0)]

Out[11]:

	A	B	C	D
h	0.813164	-0.036353	0.462501	-0.309223
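The sketch below tries the three operators on a small made-up frame and also shows what happens when the parentheses are left out. The names are illustrative; the exact exception type for the unparenthesized form can depend on the column dtypes, so both ValueError and TypeError are caught here.

```python
import pandas as pd

demo = pd.DataFrame({'C': [1, -2, 3, -4], 'D': [-1, -1, 1, 1]},
                    index=list('wxyz'))

both = demo[(demo['C'] > 0) & (demo['D'] < 0)]        # AND
either = demo[(demo['C'] > 0) | (demo['D'] < 0)]      # OR
neither = demo[~((demo['C'] > 0) | (demo['D'] < 0))]  # NOT of the OR

# Without parentheses, & is evaluated before > and <, so Python sees a
# chained comparison on whole Series and the expression fails.
try:
    demo[demo['C'] > 0 & demo['D'] < 0]
    unparenthesized_failed = False
except (ValueError, TypeError):
    unparenthesized_failed = True
```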

Applying functions to a group of values:

Using map

To apply a function to every value in a column, use the map method of the column (a Series). The function is called once per element:

In [12]:

dframe['A'] = dframe['A'].map(lambda x: x+1)
dframe

Out[12]:

	A	B	C	D
a	0.911508	0.837208	-0.206235	0.308445
b	1.514290	1.551145	1.129503	0.706509
c	0.710681	-0.665703	-0.752778	1.089550
d	1.241300	-1.148401	-0.282621	1.000189
e	0.919278	-0.824329	1.102019	0.845919
f	-0.732742	0.901447	-2.782270	-0.206729
g	1.632871	0.826829	-0.452925	0.435213
h	1.813164	-0.036353	0.462501	-0.309223
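For simple arithmetic like this, plain vectorized operators give the same result without calling a Python function per element. map is more useful when the transformation is a lookup or an arbitrary function. A small sketch with made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list('abc'))

mapped = s.map(lambda x: x + 1)  # calls the lambda once per element
vectorized = s + 1               # same result, computed as one array operation

# map also accepts a dict, which then acts as a lookup table.
codes = pd.Series(['lo', 'hi', 'lo']).map({'lo': 0, 'hi': 1})

print(mapped.equals(vectorized))
```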

Groupby

Grouping by a property can be done as follows:

In [14]:

gb = dframe.groupby(dframe['A']>0)
gb.size()

Out[14]:

A
False    1
True     7
dtype: int64

Aggregate functions can be applied to these groups.

In [15]:

gb.sum()

Out[15]:

	A	B	C	D
A
False	-0.732742	0.901447	-2.782270	-0.206729
True	8.743092	0.540395	0.999464	4.076603

In [16]:

gb.median()

Out[16]:

	A	B	C	D
A
False	-0.732742	0.901447	-2.782270	-0.206729
True	1.241300	-0.036353	-0.206235	0.706509

In [17]:

gb.groups

Out[17]:

{False: ['f'], True: ['a', 'b', 'c', 'd', 'e', 'g', 'h']}
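The same pattern can be sketched on a tiny frame where the results are easy to verify by hand. Renaming the boolean key (here to the hypothetical name `A_positive`) avoids having 'A' appear both as the group label and as a column; agg with a dict applies a different aggregate to each column.

```python
import pandas as pd

demo = pd.DataFrame({'A': [1.0, -2.0, 3.0, 4.0], 'B': [10, 20, 30, 40]},
                    index=list('abcd'))

# Boolean grouping key; the new name is an illustrative choice.
key = (demo['A'] > 0).rename('A_positive')
gb = demo.groupby(key)

sizes = gb.size()                             # number of rows per group
b_sum = gb['B'].sum()                         # aggregate a single column
summary = gb.agg({'A': 'mean', 'B': 'sum'})   # one aggregate per column

print(summary)
```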

Apply

apply calls a function on each value of a column. If you need the square of every value in a DataFrame column, shown next to the original value, here is how you would do it with apply:

In [19]:

def square(x):
    # Return the original value together with its square.
    return pd.Series([x, x**2])

dframe['A'].apply(square)

Out[19]:

	0	1
a	0.911508	0.830848
b	1.514290	2.293074
c	0.710681	0.505067
d	1.241300	1.540824
e	0.919278	0.845072
f	-0.732742	0.536911
g	1.632871	2.666266
h	1.813164	3.287565
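If only the squares are needed, a vectorized power is a simpler alternative, and the side-by-side table can be built explicitly with named columns. This is a sketch of an alternative approach, not the method used above; the Series `s` and the column names `value` and `squared` are made up for the example.

```python
import pandas as pd

s = pd.Series([1.5, -2.0, 3.0], index=list('abc'), name='A')

squared = s ** 2                       # vectorized, no Python-level loop
via_apply = s.apply(lambda x: x ** 2)  # same values, one call per element

# Original values and squares side by side, with descriptive column names.
side_by_side = pd.DataFrame({'value': s, 'squared': squared})

print(side_by_side)
```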