import numpy as np
import pandas as pd
One of the easiest ways to create a DataFrame is from a NumPy ndarray. If no index is specified, a default integer index is created automatically.
dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
dframe
| | A | B | C | D |
---|---|---|---|---|
0 | -1.329538 | -1.252777 | -1.888587 | -0.008552 |
1 | -0.601241 | 0.665922 | -0.342204 | 0.707170 |
2 | 0.724993 | -0.200555 | 0.226544 | 0.751363 |
3 | 0.888530 | 0.292810 | -1.039384 | 1.020441 |
4 | -1.330474 | 0.194110 | 1.433966 | 0.335734 |
5 | 1.009005 | 0.620863 | 0.360362 | -1.552132 |
6 | 1.139423 | -0.888833 | 0.512584 | -1.293783 |
7 | 1.152392 | 2.318118 | 1.918517 | -1.064239 |
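As a minimal sketch of the default index, the snippet below builds a small frame from a seeded random array (the seed and the names `rng`, `values` and `df_default` are illustrative assumptions, not part of the original example) and inspects the index pandas generates.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)
values = rng.standard_normal((3, 2))

df_default = pd.DataFrame(values, columns=['A', 'B'])

# With no index argument, pandas builds a RangeIndex starting at 0.
print(df_default.index)
print(df_default.shape)
```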
To set a custom index, pass the index argument. The length of the index must match the number of rows.
dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'], index = list('abcdefgh'))
dframe
| | A | B | C | D |
---|---|---|---|---|
a | -0.088492 | 0.837208 | -0.206235 | 0.308445 |
b | 0.514290 | 1.551145 | 1.129503 | 0.706509 |
c | -0.289319 | -0.665703 | -0.752778 | 1.089550 |
d | 0.241300 | -1.148401 | -0.282621 | 1.000189 |
e | -0.080722 | -0.824329 | 1.102019 | 0.845919 |
f | -1.732742 | 0.901447 | -2.782270 | -0.206729 |
g | 0.632871 | 0.826829 | -0.452925 | 0.435213 |
h | 0.813164 | -0.036353 | 0.462501 | -0.309223 |
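To illustrate the length requirement, here is a small sketch that deliberately passes an index that is too short. The array contents and the name `mismatch_error` are placeholders chosen for this illustration.

```python
import numpy as np
import pandas as pd

values = np.zeros((3, 2))  # three rows, two columns

try:
    # Only two labels for three rows: pandas refuses to build the frame.
    pd.DataFrame(values, columns=['A', 'B'], index=list('ab'))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```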
Another convenient way to create a DataFrame is from a Python dict. The keys of the dict become the column names, and the values become the columns themselves. The values of the dict need to be collections of the same length.
dfdict = pd.DataFrame({'A':[1,2,3,4],'B': [5,6,7,8]},index = ['one','two','three','four'])
dfdict
| | A | B |
---|---|---|
one | 1 | 5 |
two | 2 | 6 |
three | 3 | 7 |
four | 4 | 8 |
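The following sketch, using made-up data, shows both halves of that rule: the dict keys end up as column names, and lists of unequal length are rejected. The names `df_ok` and `length_error` are illustrative only.

```python
import pandas as pd

# Keys become column names; without an index argument the rows get a default index.
df_ok = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(list(df_ok.columns))

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5]})
    length_error = None
except ValueError as exc:
    length_error = exc

print(type(length_error).__name__)
```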
Rows can be selected by integer position using NumPy-style slice notation, where the end position is excluded:
dframe[0:4]
| | A | B | C | D |
---|---|---|---|---|
a | -0.088492 | 0.837208 | -0.206235 | 0.308445 |
b | 0.514290 | 1.551145 | 1.129503 | 0.706509 |
c | -0.289319 | -0.665703 | -0.752778 | 1.089550 |
d | 0.241300 | -1.148401 | -0.282621 | 1.000189 |
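The .iloc indexer is the recommended way to select by position: it is explicit about what is being selected and is one of the optimized access paths pandas provides. iloc stands for 'integer location' and accepts integer positions or ranges for both rows and columns: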
dframe.iloc[0:4,0:2]
| | A | B |
---|---|---|
a | -0.088492 | 0.837208 |
b | 0.514290 | 1.551145 |
c | -0.289319 | -0.665703 |
d | 0.241300 | -1.148401 |
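Beyond ranges, .iloc also accepts single positions, negative positions and lists of positions. The sketch below uses a small made-up frame (`df` and the result names are assumptions for illustration) to show what each form returns.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]}, index=list('abcd'))

first_row = df.iloc[0]         # a single position returns a Series
last_two = df.iloc[-2:]        # negative positions count from the end
picked = df.iloc[[0, 2], [1]]  # lists select rows 0 and 2, column 1

print(first_row)
print(last_two)
print(picked)
```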
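Selecting by label works the same way with the .loc indexer, which takes row and column labels instead of positions. The rows are returned in the order the labels are given:
dframe.loc[list('bca'),['A','B']]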
| | A | B |
---|---|---|
b | 0.514290 | 1.551145 |
c | -0.289319 | -0.665703 |
a | -0.088492 | 0.837208 |
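One difference between the two indexers is easy to miss: a positional slice with .iloc excludes its end, while a label slice with .loc includes it. A minimal sketch on made-up data (all names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]}, index=list('abcd'))

by_position = df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = df.loc['a':'c']   # labels 'a' through 'c'; 'c' is included

print(by_position)
print(by_label)
```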
How do you select rows whose values satisfy certain criteria? This is likely where most of the time goes during exploratory data analysis: filtering the DataFrame with a boolean condition.
dframe[dframe['C']>0]
| | A | B | C | D |
---|---|---|---|---|
b | 0.514290 | 1.551145 | 1.129503 | 0.706509 |
e | -0.080722 | -0.824329 | 1.102019 | 0.845919 |
h | 0.813164 | -0.036353 | 0.462501 | -0.309223 |
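It can help to look at the condition on its own. The comparison produces a boolean Series, and that mask is what selects the rows; the same mask can be passed to .loc to pick columns at the same time. The frame and names below are invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'C': [-1.0, 2.0, 0.5], 'D': [3.0, -4.0, 5.0]}, index=list('xyz'))

mask = df['C'] > 0           # boolean Series aligned with the rows of df
selected = df[mask]          # keeps only the rows where mask is True
d_values = df.loc[mask, 'D'] # same filter, restricted to column D

print(mask)
print(selected)
print(d_values)
```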
Boolean criteria can be combined using '&' (and), '|' (or) and '~' (not). Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as '>' and '<'.
dframe[(dframe['C']>0) & (dframe['D']<0)]
| | A | B | C | D |
---|---|---|---|---|
h | 0.813164 | -0.036353 | 0.462501 | -0.309223 |
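The example above uses '&'. As a complementary sketch on made-up data, the snippet below shows '|' and '~' with the same parenthesized style (the frame and variable names are assumptions for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {'C': [-1.0, 2.0, 0.5, -3.0], 'D': [3.0, -4.0, 5.0, -6.0]},
    index=list('wxyz'),
)

both = df[(df['C'] > 0) & (df['D'] < 0)]        # both conditions hold
either = df[(df['C'] > 0) | (df['D'] < 0)]      # at least one condition holds
neither = df[~((df['C'] > 0) | (df['D'] < 0))]  # negation of the combined mask

print(both)
print(either)
print(neither)
```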
A function can be applied to every element of a column using the map method:
dframe['A'] = dframe['A'].map(lambda x: x+1)
dframe
| | A | B | C | D |
---|---|---|---|---|
a | 0.911508 | 0.837208 | -0.206235 | 0.308445 |
b | 1.514290 | 1.551145 | 1.129503 | 0.706509 |
c | 0.710681 | -0.665703 | -0.752778 | 1.089550 |
d | 1.241300 | -1.148401 | -0.282621 | 1.000189 |
e | 0.919278 | -0.824329 | 1.102019 | 0.845919 |
f | -0.732742 | 0.901447 | -2.782270 | -0.206729 |
g | 1.632871 | 0.826829 | -0.452925 | 0.435213 |
h | 1.813164 | -0.036353 | 0.462501 | -0.309223 |
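For simple arithmetic like this, the same result can usually be obtained without a Python-level function by operating on the column directly. The sketch below compares the two on a small made-up Series; `s`, `mapped` and `vectorized` are illustrative names.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

mapped = s.map(lambda x: x + 1)  # calls the lambda once per element
vectorized = s + 1               # arithmetic applied to the whole column at once

print(mapped.equals(vectorized))
```

map remains the natural choice when the per-element logic cannot be expressed as column arithmetic.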
Rows can be grouped by a computed property, here whether the value in column A is positive:
gb = dframe.groupby(dframe['A']>0)
gb.size()
A
False    1
True     7
dtype: int64
Aggregate functions can be applied to these groups.
gb.sum()
| | A | B | C | D |
---|---|---|---|---|
A | ||||
False | -0.732742 | 0.901447 | -2.782270 | -0.206729 |
True | 8.743092 | 0.540395 | 0.999464 | 4.076603 |
gb.median()
| | A | B | C | D |
---|---|---|---|---|
A | ||||
False | -0.732742 | 0.901447 | -2.782270 | -0.206729 |
True | 1.241300 | -0.036353 | -0.206235 | 0.706509 |
gb.groups
{False: ['f'], True: ['a', 'b', 'c', 'd', 'e', 'g', 'h']}
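Several aggregates can also be computed in one pass with agg. The following is a minimal sketch on made-up data; the key name `A_positive` and the variable names are assumptions introduced so that the group key does not share a name with column A.

```python
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0, 3.0, -4.0], 'B': [10.0, 20.0, 30.0, 45.0]})

# Name the boolean key explicitly so the resulting index is easy to read.
is_positive = (df['A'] > 0).rename('A_positive')

# One row per group (False, then True), one column per aggregate.
summary = df.groupby(is_positive)['B'].agg(['sum', 'mean'])

print(summary)
```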
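To compute the square of every value in a DataFrame column, you can pass a function to apply. Because the function below returns a two-element Series, the result is a DataFrame with the original value in column 0 and its square in column 1:
def square(x):
    # Return the original value and its square as a two-element Series
    return pd.Series([x, x**2])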
dframe['A'].apply(square)
| | 0 | 1 |
---|---|---|
a | 0.911508 | 0.830848 |
b | 1.514290 | 2.293074 |
c | 0.710681 | 0.505067 |
d | 1.241300 | 1.540824 |
e | 0.919278 | 0.845072 |
f | -0.732742 | 0.536911 |
g | 1.632871 | 2.666266 |
h | 1.813164 | 3.287565 |
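A similar two-column result can be sketched without apply by squaring the column directly and assembling the frame by hand. This is an alternative to the approach above, not a description of it; the data and the column names `value` and `squared` are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([1.5, -2.0, 3.0], index=list('abc'), name='A')

# Build both columns at once; ** operates on the whole Series.
squares = pd.DataFrame({'value': s, 'squared': s ** 2})

print(squares)
```

Naming the columns explicitly avoids the default 0 and 1 labels that appear in the apply output.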