Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data.
Importing data from text files is easy.
import pandas as pd
data = pd.read_csv('../data/examp-data.txt')
print(data)
x y z 0 1 2.0 3 1 2 2.4 6 2 3 1.9 8
We can also choose to use the first row of the data as an the index (the numbers to the left of the table). The index has special properties and is particularly useful for time-series data. Values in index columns should be unique.
data_w_index = pd.read_csv('../data/examp-data.txt', index_col=0)
print(data_w_index)
y z x 1 2.0 3 2 2.4 6 3 1.9 8
You can export to a number of different formats directly from Pandas including: csv, Excel, json, hdf, and SQL. You can export things to using the .to_*
methods on a data frame. So, if you want to export to csv:
data.to_csv('../data/pandas_output.csv')
!cat ./data/pandas_output.csv
,x,y,z 0,1,2.0,3 1,2,2.4,6 2,3,1.9,8
The !cat
command is just a quick way to show what the resulting datafile looks like.
If you don't want the index information you can easily get rid of it.
data.to_csv('../data/pandas_output.csv', index=False)
!cat ./data/pandas_output.csv
x,y,z 1,2.0,3 2,2.4,6 3,1.9,8
Specific columns can be selected using their names.
data['y']
0 2.0 1 2.4 2 1.9 Name: y, dtype: float64
Multiple columns can selected using a list of names in the desired order.
data[['z', 'x']]
z | x | |
---|---|---|
0 | 3 | 1 |
1 | 6 | 2 |
2 | 8 | 3 |
Rows can be selected by their index (the values to the left-hand side of the table) or by position.
data_w_index
y | z | |
---|---|---|
x | ||
1 | 2.0 | 3 |
2 | 2.4 | 6 |
3 | 1.9 | 8 |
data_w_index.loc[2]
y 2.4 z 6.0 Name: 2, dtype: float64
data_w_index.iloc[0]
y 2 z 3 Name: 1, dtype: float64
data.iloc[0]
x 1 y 2 z 3 Name: 0, dtype: float64
You can also grab a slice of rows from the dataframe.
data_w_index[0:2]
y | z | |
---|---|---|
x | ||
1 | 2.0 | 3 |
2 | 2.4 | 6 |
Looping over the rows in a dataframe involves looping over the data itself, which is stored in data.values
.
for row in data.values:
print row * 2
[ 2. 4. 6.] [ 4. 4.8 12. ] [ 6. 3.8 16. ]
data.values[0]
data.ix[1]
x 2.0 y 2.4 z 6.0 Name: 1, dtype: float64
This is because often when working with data frames we won't want to loop over rows but instead use vectorized operations for things like subsetting and math (see below)
You can get subsets of the data by giving conditions inside []
.
data[data['y'] > 2.0]
x | y | z | |
---|---|---|---|
1 | 2 | 2.4 | 6 |
The syntax for specifying multiple conditions for subsetting is to include each condition in ()
and separate them with the &
symbol.
data[(data['x'] > 1) & (data['y'] > 2)]
x | y | z | |
---|---|---|---|
1 | 2 | 2.4 | 6 |
Math is done by treating columsn from the data from just like they are variables.
data['y'] * data['z'] + 2
0 8.0 1 16.4 2 17.2 dtype: float64
If you want to use mathematical functions use the ones in numpy
not the ones in math
.
import numpy as np
np.log(data['y']) * np.sqrt(data['z'])
0 1.200566 1 2.144452 2 1.815437 dtype: float64
To work on chunks of data in automated ways we can use grouping. First, let's grab some data.
url = "http://esapubs.org/archive/ecol/E084/093/Mammal_lifehistories_v2.txt"
data = pd.read_csv(url, delimiter="\t")
data.head()
order | family | Genus | species | mass(g) | gestation(mo) | newborn(g) | weaning(mo) | wean mass(g) | AFR(mo) | max. life(mo) | litter size | litters/year | refs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Artiodactyla | Antilocapridae | Antilocapra | americana | 45375 | 8.13 | 3246.36 | 3.00 | 8900 | 13.53 | 142 | 1.85 | 1.00 | 1,2,6,9,23,26,27 |
1 | Artiodactyla | Bovidae | Addax | nasomaculatus | 182375 | 9.39 | 5480.00 | 6.50 | -999 | 27.27 | 308 | 1.00 | 0.99 | 1,2,17,23,26 |
2 | Artiodactyla | Bovidae | Aepyceros | melampus | 41480 | 6.35 | 5093.00 | 5.63 | 15900 | 16.66 | 213 | 1.00 | 0.95 | 1,2,8,9,23,29 |
3 | Artiodactyla | Bovidae | Alcelaphus | buselaphus | 150000 | 7.90 | 10166.67 | 6.50 | -999 | 23.02 | 240 | 1.00 | -999.00 | 1,2,17,23 |
4 | Artiodactyla | Bovidae | Ammodorcas | clarkei | 28500 | 6.80 | -999.00 | -999.00 | -999 | -999.00 | -999 | 1.00 | -999.00 | 1,2 |
The command data.head()
shows us the first few rows of the dataset.
The following code groups the data based on the value in the order
column, calculates the average of the mass(g)
column for every order, and prints out the results.
data_by_order = data.groupby('order')
for order, order_data in data_by_order:
avg_mass = np.mean(order_data['mass(g)'])
print "The average mass of {} is {} grams".format(order, avg_mass)
The average mass of Artiodactyla is 112939.74354 grams The average mass of Carnivora is 42705.6617766 grams The average mass of Cetacea is 9115442.46673 grams The average mass of Dermoptera is 0.5 grams The average mass of Hyracoidea is 3030.835 grams The average mass of Insectivora is 58.0276923077 grams The average mass of Lagomorpha is 1316.13261905 grams The average mass of Macroscelidea is 11.522 grams The average mass of Perissodactyla is 694486.666667 grams The average mass of Pholidota is 7980.0 grams The average mass of Primates is 5066.5575641 grams The average mass of Proboscidea is 3342500.0 grams The average mass of Rodentia is 496.813082707 grams The average mass of Scandentia is 190.357142857 grams The average mass of Sirenia is 1169400.0 grams The average mass of Tubulidentata is 60000.0 grams The average mass of Xenarthra is 7238.5 grams