Pandas

Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data.

Importing Data

Importing data from text files is easy.

In [1]:
import pandas as pd
data = pd.read_csv('../data/examp-data.txt')
print(data)
   x    y  z
0  1  2.0  3
1  2  2.4  6
2  3  1.9  8

We can also choose to use the first row of the data as an the index (the numbers to the left of the table). The index has special properties and is particularly useful for time-series data. Values in index columns should be unique.

In [2]:
data_w_index = pd.read_csv('../data/examp-data.txt', index_col=0)
print(data_w_index)
     y  z
x        
1  2.0  3
2  2.4  6
3  1.9  8

Exporting Data

You can export to a number of different formats directly from Pandas including: csv, Excel, json, hdf, and SQL. You can export things to using the .to_* methods on a data frame. So, if you want to export to csv:

In [3]:
data.to_csv('../data/pandas_output.csv')
!cat ./data/pandas_output.csv
,x,y,z
0,1,2.0,3
1,2,2.4,6
2,3,1.9,8

The !cat command is just a quick way to show what the resulting datafile looks like.

If you don't want the index information you can easily get rid of it.

In [4]:
data.to_csv('../data/pandas_output.csv', index=False)
!cat ./data/pandas_output.csv
x,y,z
1,2.0,3
2,2.4,6
3,1.9,8

Selecting columns

Specific columns can be selected using their names.

In [5]:
data['y']
Out[5]:
0    2.0
1    2.4
2    1.9
Name: y, dtype: float64

Multiple columns can selected using a list of names in the desired order.

In [6]:
data[['z', 'x']]
Out[6]:
z x
0 3 1
1 6 2
2 8 3

Selecting rows

Rows can be selected by their index (the values to the left-hand side of the table) or by position.

In [7]:
data_w_index
Out[7]:
y z
x
1 2.0 3
2 2.4 6
3 1.9 8
In [8]:
data_w_index.loc[2]
Out[8]:
y    2.4
z    6.0
Name: 2, dtype: float64
In [9]:
data_w_index.iloc[0]
Out[9]:
y    2
z    3
Name: 1, dtype: float64
In [10]:
data.iloc[0]
Out[10]:
x    1
y    2
z    3
Name: 0, dtype: float64

You can also grab a slice of rows from the dataframe.

In [11]:
data_w_index[0:2]
Out[11]:
y z
x
1 2.0 3
2 2.4 6

Looping over rows

Looping over the rows in a dataframe involves looping over the data itself, which is stored in data.values.

In [12]:
for row in data.values:
    print row * 2
[ 2.  4.  6.]
[  4.    4.8  12. ]
[  6.    3.8  16. ]
In [13]:
data.values[0]
data.ix[1]
Out[13]:
x    2.0
y    2.4
z    6.0
Name: 1, dtype: float64

This is because often when working with data frames we won't want to loop over rows but instead use vectorized operations for things like subsetting and math (see below)

Subsetting

You can get subsets of the data by giving conditions inside [].

In [14]:
data[data['y'] > 2.0]
Out[14]:
x y z
1 2 2.4 6

The syntax for specifying multiple conditions for subsetting is to include each condition in () and separate them with the & symbol.

In [15]:
data[(data['x'] > 1) & (data['y'] > 2)]
Out[15]:
x y z
1 2 2.4 6

Math

Math is done by treating columsn from the data from just like they are variables.

In [16]:
data['y'] * data['z'] + 2
Out[16]:
0     8.0
1    16.4
2    17.2
dtype: float64

If you want to use mathematical functions use the ones in numpy not the ones in math.

In [17]:
import numpy as np
np.log(data['y']) * np.sqrt(data['z'])
Out[17]:
0    1.200566
1    2.144452
2    1.815437
dtype: float64

Grouping

To work on chunks of data in automated ways we can use grouping. First, let's grab some data.

In [18]:
url = "http://esapubs.org/archive/ecol/E084/093/Mammal_lifehistories_v2.txt"
data = pd.read_csv(url, delimiter="\t")
data.head()
Out[18]:
order family Genus species mass(g) gestation(mo) newborn(g) weaning(mo) wean mass(g) AFR(mo) max. life(mo) litter size litters/year refs
0 Artiodactyla Antilocapridae Antilocapra americana 45375 8.13 3246.36 3.00 8900 13.53 142 1.85 1.00 1,2,6,9,23,26,27
1 Artiodactyla Bovidae Addax nasomaculatus 182375 9.39 5480.00 6.50 -999 27.27 308 1.00 0.99 1,2,17,23,26
2 Artiodactyla Bovidae Aepyceros melampus 41480 6.35 5093.00 5.63 15900 16.66 213 1.00 0.95 1,2,8,9,23,29
3 Artiodactyla Bovidae Alcelaphus buselaphus 150000 7.90 10166.67 6.50 -999 23.02 240 1.00 -999.00 1,2,17,23
4 Artiodactyla Bovidae Ammodorcas clarkei 28500 6.80 -999.00 -999.00 -999 -999.00 -999 1.00 -999.00 1,2

The command data.head() shows us the first few rows of the dataset.

The following code groups the data based on the value in the order column, calculates the average of the mass(g) column for every order, and prints out the results.

In [19]:
data_by_order = data.groupby('order')
for order, order_data in data_by_order:
    avg_mass = np.mean(order_data['mass(g)'])
    print "The average mass of {} is {} grams".format(order, avg_mass)
The average mass of Artiodactyla is 112939.74354 grams
The average mass of Carnivora is 42705.6617766 grams
The average mass of Cetacea is 9115442.46673 grams
The average mass of Dermoptera is 0.5 grams
The average mass of Hyracoidea is 3030.835 grams
The average mass of Insectivora is 58.0276923077 grams
The average mass of Lagomorpha is 1316.13261905 grams
The average mass of Macroscelidea is 11.522 grams
The average mass of Perissodactyla is 694486.666667 grams
The average mass of Pholidota is 7980.0 grams
The average mass of Primates is 5066.5575641 grams
The average mass of Proboscidea is 3342500.0 grams
The average mass of Rodentia is 496.813082707 grams
The average mass of Scandentia is 190.357142857 grams
The average mass of Sirenia is 1169400.0 grams
The average mass of Tubulidentata is 60000.0 grams
The average mass of Xenarthra is 7238.5 grams