In [3]:
import numpy as np
import pandas as pd

Creating DataFrames

A simple way to create a DataFrame is from a NumPy ndarray. If no index is specified, pandas automatically creates a default integer index.

In [4]:
dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
dframe
Out[4]:
A B C D
0 -1.329538 -1.252777 -1.888587 -0.008552
1 -0.601241 0.665922 -0.342204 0.707170
2 0.724993 -0.200555 0.226544 0.751363
3 0.888530 0.292810 -1.039384 1.020441
4 -1.330474 0.194110 1.433966 0.335734
5 1.009005 0.620863 0.360362 -1.552132
6 1.139423 -0.888833 0.512584 -1.293783
7 1.152392 2.318118 1.918517 -1.064239
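The default index can be inspected directly. The sketch below uses a seeded generator so the values are reproducible; the name `demo` and the seed value are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same values (seed 0 is arbitrary).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((8, 4)), columns=list("ABCD"))

# With no index argument, pandas builds a RangeIndex starting at 0.
print(type(demo.index).__name__)
print(list(demo.index))
print(demo.shape)
```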

To set the index explicitly, pass the index argument. The length of the index must match the number of rows.

In [5]:
dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'], index = list('abcdefgh'))
dframe
Out[5]:
A B C D
a -0.088492 0.837208 -0.206235 0.308445
b 0.514290 1.551145 1.129503 0.706509
c -0.289319 -0.665703 -0.752778 1.089550
d 0.241300 -1.148401 -0.282621 1.000189
e -0.080722 -0.824329 1.102019 0.845919
f -1.732742 0.901447 -2.782270 -0.206729
g 0.632871 0.826829 -0.452925 0.435213
h 0.813164 -0.036353 0.462501 -0.309223
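To see what happens when the index length does not match, here is a minimal sketch that passes seven labels for eight rows. The variable names are illustrative, and the exact error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

data = np.zeros((8, 4))  # 8 rows, 4 columns

mismatch_error = None
try:
    # Only 7 labels for 8 rows: pandas rejects the mismatch.
    pd.DataFrame(data, columns=list("ABCD"), index=list("abcdefg"))
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```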

Another convenient way to create a DataFrame is from a Python dict. The keys of the dict become the column names, and the values become the columns themselves. The values of the dict need to be collections of the same length.

In [6]:
dfdict = pd.DataFrame({'A':[1,2,3,4],'B': [5,6,7,8]},index = ['one','two','three','four'])
dfdict
Out[6]:
A B
one 1 5
two 2 6
three 3 7
four 4 8
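The same-length requirement can be checked with a short sketch. The names `ok` and `length_error` are illustrative; the second call deliberately passes lists of different lengths.

```python
import pandas as pd

# Equal-length lists: one column per key.
ok = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
print(list(ok.columns))

length_error = None
try:
    # 'B' is one element short, so the columns cannot be lined up.
    pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7]})
except ValueError as err:
    length_error = err
    print("ValueError:", err)
```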

Selecting rows from DataFrames:

Location based selection:

Selecting rows by integer position, using NumPy-style slice notation:

In [7]:
dframe[0:4]
Out[7]:
A B C D
a -0.088492 0.837208 -0.206235 0.308445
b 0.514290 1.551145 1.129503 0.706509
c -0.289319 -0.665703 -0.752778 1.089550
d 0.241300 -1.148401 -0.282621 1.000189

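Plain slicing works for rows, but the .iloc indexer makes positional selection explicit and can select rows and columns at once. iloc stands for 'integer location' and takes row and column positions as integers, lists of integers, or slices (the end of a slice is excluded):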

In [8]:
dframe.iloc[0:4,0:2]
Out[8]:
A B
a -0.088492 0.837208
b 0.514290 1.551145
c -0.289319 -0.665703
d 0.241300 -1.148401
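As a rough illustration of the forms .iloc accepts, the sketch below uses a small frame with known values. The frame `demo` is made up for this example.

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=list("abcd"),
)

first_two = demo.iloc[0:2, 0:2]  # rows 0-1 and columns 0-1 (slice end excluded)
picked = demo.iloc[[0, 3], [2]]  # explicit lists of positions: rows a and d, column C
cell = demo.iloc[1, 1]           # a single value: row b, column B
last_row = demo.iloc[-1]         # negative positions count from the end

print(first_two)
print(picked)
print(cell)
print(last_row)
```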

Label based selection:

The .loc indexer selects by the labels of the DataFrame. It accepts integers as arguments, but these are always interpreted as labels, never as the position of a row or column.

In [9]:
dframe.loc[list('bca'),['A','B']]
Out[9]:
A B
b 0.514290 1.551145
c -0.289319 -0.665703
a -0.088492 0.837208
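The difference between labels and positions is easiest to see when the index itself holds integers. A minimal sketch, assuming a made-up frame whose index labels are 10, 20 and 30:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = demo.loc[10, "A"]     # the row whose label is 10
by_position = demo.iloc[0]["A"]  # the row at position 0 (the same row here)

label_slice = demo.loc[10:20]    # label slices include both endpoints
position_slice = demo.iloc[0:1]  # position slices exclude the end

missing = None
try:
    demo.loc[0]                  # there is no row labelled 0
except KeyError as err:
    missing = err

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note that .loc slices include the end label, unlike .iloc and ordinary Python slices.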

Criteria based selection:

How do you select rows based on values satisfying certain criteria? This is likely the part where most time is spent while doing exploratory data analysis: slicing the dataframe based on certain criteria.

In [10]:
dframe[dframe['C']>0]
Out[10]:
A B C D
b 0.514290 1.551145 1.129503 0.706509
e -0.080722 -0.824329 1.102019 0.845919
h 0.813164 -0.036353 0.462501 -0.309223
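It can help to look at the comparison on its own: it produces a boolean Series, which is then used as a mask. The sketch below uses a small made-up frame and also shows that the mask can be passed to .loc, optionally together with a column label.

```python
import pandas as pd

demo = pd.DataFrame(
    {"C": [-1.0, 2.0, 0.5], "D": [3.0, -4.0, 5.0]},
    index=list("abc"),
)

mask = demo["C"] > 0              # boolean Series, one entry per row
print(mask.tolist())

selected = demo[mask]             # keep rows where the mask is True
same = demo.loc[mask]             # equivalent selection through .loc
col_subset = demo.loc[mask, "D"]  # filter rows and pick a column in one step

print(selected)
print(col_subset.tolist())
```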

Boolean criteria can be combined using '&' (and), '|' (or) and '~' (not). Each comparison must be wrapped in parentheses, because these operators bind more tightly than comparison operators such as '>' and '<'.

In [11]:
dframe[(dframe['C']>0) & (dframe['D']<0)]
Out[11]:
A B C D
h 0.813164 -0.036353 0.462501 -0.309223
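The '|' and '~' operators follow the same pattern as '&'. A small sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame(
    {"C": [-1.0, 2.0, 0.5, -3.0], "D": [3.0, -4.0, 5.0, -6.0]},
    index=list("abcd"),
)

# Rows where C is positive OR D is negative.
either = demo[(demo["C"] > 0) | (demo["D"] < 0)]

# Negating the combined condition keeps the remaining rows.
neither = demo[~((demo["C"] > 0) | (demo["D"] < 0))]

print(list(either.index))
print(list(neither.index))
```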

Applying functions to a group of values:

Using map

A function can be applied to each element of a column using the map method:

In [12]:
dframe['A'] = dframe['A'].map(lambda x: x+1)
dframe
Out[12]:
A B C D
a 0.911508 0.837208 -0.206235 0.308445
b 1.514290 1.551145 1.129503 0.706509
c 0.710681 -0.665703 -0.752778 1.089550
d 1.241300 -1.148401 -0.282621 1.000189
e 0.919278 -0.824329 1.102019 0.845919
f -0.732742 0.901447 -2.782270 -0.206729
g 1.632871 0.826829 -0.452925 0.435213
h 1.813164 -0.036353 0.462501 -0.309223
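For simple arithmetic like this, the same result can be obtained with a vectorized expression on the column, which is usually the more idiomatic choice. A minimal comparison on a made-up column:

```python
import pandas as pd

demo = pd.DataFrame({"A": [0.5, -1.5, 2.0]}, index=list("abc"))

mapped = demo["A"].map(lambda x: x + 1)  # function called once per element
vectorized = demo["A"] + 1               # arithmetic applied to the whole column

print(mapped.tolist())
print(mapped.equals(vectorized))
```

map remains useful when the transformation is not plain arithmetic, for example an arbitrary Python function.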

Groupby

Grouping by a property can be done as follows:

In [14]:
gb = dframe.groupby(dframe['A']>0)
gb.size()
Out[14]:
A
False    1
True     7
dtype: int64

Aggregate functions can be applied to these groups.

In [15]:
gb.sum()
Out[15]:
A B C D
A
False -0.732742 0.901447 -2.782270 -0.206729
True 8.743092 0.540395 0.999464 4.076603

In [16]:
gb.median()
Out[16]:
A B C D
A
False -0.732742 0.901447 -2.782270 -0.206729
True 1.241300 -0.036353 -0.206235 0.706509

In [17]:
gb.groups
Out[17]:
{False: ['f'], True: ['a', 'b', 'c', 'd', 'e', 'g', 'h']}
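Several aggregates can also be computed in one call with agg. The sketch below uses a small made-up frame so the results are easy to verify by hand; it is one possible way to do this, not the only one.

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=list("abcd"),
)

# Group rows by whether A is positive (keys are False and True).
gb = demo.groupby(demo["A"] > 0)

sizes = gb.size()                       # number of rows per group
b_stats = gb["B"].agg(["sum", "mean"])  # two aggregates of column B per group

print(sizes.to_dict())
print(b_stats)
```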

Apply

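apply calls a function on each value of a column. If the function returns a Series, the results are expanded into a DataFrame with one column per element. Here each value of column A is returned alongside its square: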

In [19]:
def square(x):
    return pd.Series([x, x**2])

dframe['A'].apply(square)
Out[19]:
0 1
a 0.911508 0.830848
b 1.514290 2.293074
c 0.710681 0.505067
d 1.241300 1.540824
e 0.919278 0.845072
f -0.732742 0.536911
g 1.632871 2.666266
h 1.813164 3.287565
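If only the squared values are needed, the function can return a scalar, or the column can be squared directly. The sketch below compares both on a made-up column and also names the expanded columns by returning a Series built from a dict; the labels "value" and "squared" are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0, 3.0]}, index=list("abc"))

squared_apply = demo["A"].apply(lambda x: x ** 2)  # scalar result per element -> Series
squared_vec = demo["A"] ** 2                       # vectorized equivalent

# Returning a Series per element expands into a DataFrame with named columns.
both = demo["A"].apply(lambda x: pd.Series({"value": x, "squared": x ** 2}))

print(squared_apply.tolist())
print(both)
```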