Pandas¶

Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data.

Importing Data¶

Importing data from text files is easy.

In [1]:

import pandas as pd
data = pd.read_csv('../data/examp-data.txt')
print(data)

   x    y  z
0  1  2.0  3
1  2  2.4  6
2  3  1.9  8

We can also choose to use the first row of the data as an the index (the numbers to the left of the table). The index has special properties and is particularly useful for time-series data. Values in index columns should be unique.

In [2]:

data_w_index = pd.read_csv('../data/examp-data.txt', index_col=0)
print(data_w_index)

Exporting Data¶

You can export to a number of different formats directly from Pandas including: csv, Excel, json, hdf, and SQL. You can export things to using the .to_* methods on a data frame. So, if you want to export to csv:

In [3]:

data.to_csv('../data/pandas_output.csv')
!cat ./data/pandas_output.csv

,x,y,z
0,1,2.0,3
1,2,2.4,6
2,3,1.9,8

The !cat command is just a quick way to show what the resulting datafile looks like.

If you don't want the index information you can easily get rid of it.

In [4]:

data.to_csv('../data/pandas_output.csv', index=False)
!cat ./data/pandas_output.csv

x,y,z
1,2.0,3
2,2.4,6
3,1.9,8

Selecting columns¶

Specific columns can be selected using their names.

In [5]:

data['y']

Out[5]:

0    2.0
1    2.4
2    1.9
Name: y, dtype: float64

Multiple columns can selected using a list of names in the desired order.

In [6]:

data[['z', 'x']]

Out[6]:

	z	x
0	3	1
1	6	2
2	8	3

Selecting rows¶

Rows can be selected by their index (the values to the left-hand side of the table) or by position.

In [7]:

data_w_index

Out[7]:

	y	z
x
1	2.0	3
2	2.4	6
3	1.9	8

In [8]:

data_w_index.loc[2]

Out[8]:

y    2.4
z    6.0
Name: 2, dtype: float64

In [9]:

data_w_index.iloc[0]

Out[9]:

y    2
z    3
Name: 1, dtype: float64

In [10]:

data.iloc[0]

Out[10]:

x    1
y    2
z    3
Name: 0, dtype: float64

You can also grab a slice of rows from the dataframe.

In [11]:

data_w_index[0:2]

Out[11]:

	y	z
x
1	2.0	3
2	2.4	6

Looping over rows¶

Looping over the rows in a dataframe involves looping over the data itself, which is stored in data.values.

In [12]:

for row in data.values:
    print row * 2

[ 2.  4.  6.]
[  4.    4.8  12. ]
[  6.    3.8  16. ]

In [13]:

data.values[0]
data.ix[1]

Out[13]:

x    2.0
y    2.4
z    6.0
Name: 1, dtype: float64

This is because often when working with data frames we won't want to loop over rows but instead use vectorized operations for things like subsetting and math (see below)

Subsetting¶

You can get subsets of the data by giving conditions inside [].

In [14]:

data[data['y'] > 2.0]

Out[14]:

	x	y	z
1	2	2.4	6

The syntax for specifying multiple conditions for subsetting is to include each condition in () and separate them with the & symbol.

In [15]:

data[(data['x'] > 1) & (data['y'] > 2)]

Out[15]:

	x	y	z
1	2	2.4	6

Math¶

Math is done by treating columsn from the data from just like they are variables.

In [16]:

data['y'] * data['z'] + 2

Out[16]:

0     8.0
1    16.4
2    17.2
dtype: float64

If you want to use mathematical functions use the ones in numpy not the ones in math.

In [17]:

import numpy as np
np.log(data['y']) * np.sqrt(data['z'])

Out[17]:

0    1.200566
1    2.144452
2    1.815437
dtype: float64

Grouping¶

To work on chunks of data in automated ways we can use grouping. First, let's grab some data.

In [18]:

url = "http://esapubs.org/archive/ecol/E084/093/Mammal_lifehistories_v2.txt"
data = pd.read_csv(url, delimiter="\t")
data.head()

Out[18]:

	order	family	Genus	species	mass(g)	gestation(mo)	newborn(g)	weaning(mo)	wean mass(g)	AFR(mo)	max. life(mo)	litter size	litters/year	refs
0	Artiodactyla	Antilocapridae	Antilocapra	americana	45375	8.13	3246.36	3.00	8900	13.53	142	1.85	1.00	1,2,6,9,23,26,27
1	Artiodactyla	Bovidae	Addax	nasomaculatus	182375	9.39	5480.00	6.50	-999	27.27	308	1.00	0.99	1,2,17,23,26
2	Artiodactyla	Bovidae	Aepyceros	melampus	41480	6.35	5093.00	5.63	15900	16.66	213	1.00	0.95	1,2,8,9,23,29
3	Artiodactyla	Bovidae	Alcelaphus	buselaphus	150000	7.90	10166.67	6.50	-999	23.02	240	1.00	-999.00	1,2,17,23
4	Artiodactyla	Bovidae	Ammodorcas	clarkei	28500	6.80	-999.00	-999.00	-999	-999.00	-999	1.00	-999.00	1,2

The command data.head() shows us the first few rows of the dataset.

The following code groups the data based on the value in the order column, calculates the average of the mass(g) column for every order, and prints out the results.

In [19]:

data_by_order = data.groupby('order')
for order, order_data in data_by_order:
    avg_mass = np.mean(order_data['mass(g)'])
    print "The average mass of {} is {} grams".format(order, avg_mass)

The average mass of Artiodactyla is 112939.74354 grams
The average mass of Carnivora is 42705.6617766 grams
The average mass of Cetacea is 9115442.46673 grams
The average mass of Dermoptera is 0.5 grams
The average mass of Hyracoidea is 3030.835 grams
The average mass of Insectivora is 58.0276923077 grams
The average mass of Lagomorpha is 1316.13261905 grams
The average mass of Macroscelidea is 11.522 grams
The average mass of Perissodactyla is 694486.666667 grams
The average mass of Pholidota is 7980.0 grams
The average mass of Primates is 5066.5575641 grams
The average mass of Proboscidea is 3342500.0 grams
The average mass of Rodentia is 496.813082707 grams
The average mass of Scandentia is 190.357142857 grams
The average mass of Sirenia is 1169400.0 grams
The average mass of Tubulidentata is 60000.0 grams
The average mass of Xenarthra is 7238.5 grams