Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental, high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
In [1]:
import pandas as pd
import numpy as np

Pandas Data Structures

Series

A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.

In [2]:
counts = pd.Series([632, 1638, 569, 115])
counts
Out[2]:
0     632
1    1638
2     569
3     115
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.

In [3]:
counts.values
Out[3]:
array([ 632, 1638,  569,  115])
In [4]:
counts.index
Out[4]:
RangeIndex(start=0, stop=4, step=1)

We can assign meaningful labels to the index, if they are available:

In [5]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria
Out[5]:
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

These labels can be used to refer to the values in the Series.

In [6]:
bacteria['Actinobacteria']
Out[6]:
569
In [7]:
bacteria[[name.endswith('bacteria') for name in bacteria.index]]
Out[7]:
Proteobacteria    1638
Actinobacteria     569
dtype: int64
In [8]:
[name.endswith('bacteria') for name in bacteria.index]
Out[8]:
[False, True, True, False]

Notice that the indexing operation preserved the association between the values and the corresponding indices.

We can still use positional indexing if we wish.

In [9]:
bacteria[0]
Out[9]:
632
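
With a labeled index, an integer lookup like bacteria[0] is ambiguous (is 0 a label or a position?), and recent pandas versions deprecate it. The iloc attribute makes positional access explicit. A minimal sketch:

```python
import pandas as pd

bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

# iloc always indexes by position, regardless of the index labels
first = bacteria.iloc[0]    # 632
last = bacteria.iloc[-1]    # 115
```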

We can give meaningful labels to both the array of values and the index itself:

In [10]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria
Out[10]:
phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [11]:
np.log(bacteria)
Out[11]:
phylum
Firmicutes        6.448889
Proteobacteria    7.401231
Actinobacteria    6.343880
Bacteroidetes     4.744932
Name: counts, dtype: float64

We can also filter according to the values in the Series:

In [12]:
bacteria[bacteria>1000]
Out[12]:
phylum
Proteobacteria    1638
Name: counts, dtype: int64

A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:

In [13]:
bacteria_dict = {'Proteobacteria': 1638, 'Actinobacteria': 569,'Firmicutes': 632, 
                 'Bacteroidetes': 115}
pd.Series(bacteria_dict)
Out[13]:
Proteobacteria    1638
Actinobacteria     569
Firmicutes         632
Bacteroidetes      115
dtype: int64

If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the NaN (not a number) type for missing values.

In [14]:
bacteria2 = pd.Series(bacteria_dict, 
                      index=['Cyanobacteria','Firmicutes',
                             'Proteobacteria','Actinobacteria'])
bacteria2
Out[14]:
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
In [15]:
bacteria2.isnull()
Out[15]:
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool
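
Once missing values are flagged, they can be dropped or replaced. A short sketch, reusing the counts from above:

```python
import pandas as pd

bacteria_dict = {'Proteobacteria': 1638, 'Actinobacteria': 569,
                 'Firmicutes': 632}
bacteria2 = pd.Series(bacteria_dict,
                      index=['Cyanobacteria', 'Firmicutes',
                             'Proteobacteria', 'Actinobacteria'])

cleaned = bacteria2.dropna()    # removes the NaN Cyanobacteria entry
filled = bacteria2.fillna(0)    # replaces NaN with 0 instead
```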

Critically, the labels are used to align data when used in operations with other Series objects:

In [16]:
bacteria + bacteria2
Out[16]:
Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64

Contrast this with NumPy arrays, where arrays of the same length combine values element-wise; adding Series combines values with the same label in the resulting series. Notice also that the missing values were propagated by addition.
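
If NaN is not the desired result for labels present on only one side, the add method accepts a fill_value that substitutes for the missing entries before the operation:

```python
import pandas as pd

s1 = pd.Series({'Firmicutes': 632, 'Bacteroidetes': 115})
s2 = pd.Series({'Firmicutes': 433, 'Cyanobacteria': 12})

# s1 + s2 yields NaN for labels missing on either side;
# fill_value=0 treats those missing entries as zero instead
total = s1.add(s2, fill_value=0)
```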

DataFrame

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple Series, like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.

In [17]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data
Out[17]:
value patient phylum
0 632 1 Firmicutes
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
3 115 1 Bacteroidetes
4 433 2 Firmicutes
5 1130 2 Proteobacteria
6 754 2 Actinobacteria
7 555 2 Bacteroidetes

We can change the order by indexing them in the order we desire:

In [18]:
data[['phylum','value','patient']]
Out[18]:
phylum value patient
0 Firmicutes 632 1
1 Proteobacteria 1638 1
2 Actinobacteria 569 1
3 Bacteroidetes 115 1
4 Firmicutes 433 2
5 Proteobacteria 1130 2
6 Actinobacteria 754 2
7 Bacteroidetes 555 2

A DataFrame has a second index, representing the columns:

In [19]:
data.columns
Out[19]:
Index(['value', 'patient', 'phylum'], dtype='object')

The dtypes attribute reveals the data type for each column in our DataFrame.

  • int64: numeric integer values
  • object: strings (letters and numbers)
  • float64: floating-point values
In [20]:
data.dtypes
Out[20]:
value       int64
patient     int64
phylum     object
dtype: object
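
Column dtypes can also be converted explicitly with astype; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638], 'patient': [1, 2]})

# astype returns a converted copy; assign it back to change the column
df['value'] = df['value'].astype(float)
```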

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [21]:
data['patient']
Out[21]:
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: patient, dtype: int64
In [22]:
data.patient
Out[22]:
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: patient, dtype: int64
In [23]:
type(data.value)
Out[23]:
pandas.core.series.Series
In [24]:
data[['value']]
Out[24]:
value
0 632
1 1638
2 569
3 115
4 433
5 1130
6 754
7 555

Notice this is different from a Series, where dict-like indexing retrieves a particular element (row).

If we want access to a row in a DataFrame, we index its loc attribute.

In [25]:
data.loc[1]
Out[25]:
value                1638
patient                 1
phylum     Proteobacteria
Name: 1, dtype: object
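
loc indexes rows by label; its positional counterpart is iloc. With the default integer index the two coincide, but they differ once the index carries labels. A quick sketch using a hypothetical string index:

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638, 569],
                   'phylum': ['Firmicutes', 'Proteobacteria',
                              'Actinobacteria']},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']      # row with index label 'b'
by_position = df.iloc[1]    # second row; the same row here
```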

Exercise

Try out these commands to see what they return:

  • data.head()
  • data.tail(3)
  • data.shape

An alternative way of initializing a DataFrame is with a list of dicts:

In [26]:
data = pd.DataFrame([{'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                    {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                    {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                    {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                    {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                    {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                    {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                    {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}])
In [27]:
data
Out[27]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 1130
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

It's important to note that the Series returned when a DataFrame is indexed is merely a view on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:

In [28]:
vals = data.value
vals
Out[28]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
In [29]:
vals[5] = 0
vals
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[29]:
0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64

Now look at the original DataFrame!

In [30]:
data
Out[30]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 0
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

If we plan on modifying an extracted Series, it's a good idea to make a copy.

In [31]:
vals = data.value.copy()
vals[5] = 1000
data
Out[31]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 115
4 2 Firmicutes 433
5 2 Proteobacteria 0
6 2 Actinobacteria 754
7 2 Bacteroidetes 555

We can create or modify columns by assignment:

In [32]:
data.value[[3,4,6]] = [14, 21, 5]
data
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[32]:
patient phylum value
0 1 Firmicutes 632
1 1 Proteobacteria 1638
2 1 Actinobacteria 569
3 1 Bacteroidetes 14
4 2 Firmicutes 21
5 2 Proteobacteria 0
6 2 Actinobacteria 5
7 2 Bacteroidetes 555
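
The warning above arises because chained indexing (data.value[...] = ...) may write to a temporary copy rather than the DataFrame itself. A single loc call that selects the rows and the column together always writes through; a sketch:

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})

# One indexing operation: rows [3, 4, 6] of column 'value'
data.loc[[3, 4, 6], 'value'] = [14, 21, 5]
```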
In [33]:
data['year'] = 2013
data
Out[33]:
patient phylum value year
0 1 Firmicutes 632 2013
1 1 Proteobacteria 1638 2013
2 1 Actinobacteria 569 2013
3 1 Bacteroidetes 14 2013
4 2 Firmicutes 21 2013
5 2 Proteobacteria 0 2013
6 2 Actinobacteria 5 2013
7 2 Bacteroidetes 555 2013

But note, we cannot use the attribute indexing method to add a new column:

In [34]:
data.treatment = 1
data
Out[34]:
patient phylum value year
0 1 Firmicutes 632 2013
1 1 Proteobacteria 1638 2013
2 1 Actinobacteria 569 2013
3 1 Bacteroidetes 14 2013
4 2 Firmicutes 21 2013
5 2 Proteobacteria 0 2013
6 2 Actinobacteria 5 2013
7 2 Bacteroidetes 555 2013
In [35]:
data.treatment
Out[35]:
1
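
Attribute assignment merely attaches an ordinary Python attribute to the object, while bracket assignment creates a real column. A sketch contrasting the two:

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638]})

df.treatment = 1                           # sets an attribute, not a column
has_attr_col = 'treatment' in df.columns   # False

df['treatment'] = 1                        # creates (and broadcasts) a real column
has_real_col = 'treatment' in df.columns   # True
```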

Exercise

From the data table above, create an index to return all rows for which the phylum name ends in "bacteria" and the value is greater than 1000.

In [36]:
data[data.phylum.str.endswith('bacteria') & (data.value > 1000)]
Out[36]:
patient phylum value year
1 1 Proteobacteria 1638 2013

Specifying a Series as a new column causes its values to be added according to the DataFrame's index:

In [37]:
treatment = pd.Series([0]*4 + [1]*2)
treatment
Out[37]:
0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64
In [38]:
data['treatment'] = treatment
data
Out[38]:
patient phylum value year treatment
0 1 Firmicutes 632 2013 0.0
1 1 Proteobacteria 1638 2013 0.0
2 1 Actinobacteria 569 2013 0.0
3 1 Bacteroidetes 14 2013 0.0
4 2 Firmicutes 21 2013 1.0
5 2 Proteobacteria 0 2013 1.0
6 2 Actinobacteria 5 2013 NaN
7 2 Bacteroidetes 555 2013 NaN
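
The NaN rows appear because assignment aligns the Series on the DataFrame's index. Reindexing the Series first, with an explicit fill value, controls what the unmatched rows receive; a sketch:

```python
import pandas as pd

data = pd.DataFrame({'patient': [1, 1, 2, 2]})
treatment = pd.Series([0, 0, 1])   # one element short of the DataFrame

# reindex to the DataFrame's index, filling the gap explicitly
data['treatment'] = treatment.reindex(data.index, fill_value=0)
```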

Other Python data structures (ones without an index) need to be the same length as the DataFrame:

In [39]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
data['month'] = month
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-1fbc7bd65338> in <module>
      1 month = ['Jan', 'Feb', 'Mar', 'Apr']
----> 2 data['month'] = month

~/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index
In [40]:
data['month'] = ['Jan']*len(data)
data
Out[40]:
patient phylum value year treatment month
0 1 Firmicutes 632 2013 0.0 Jan
1 1 Proteobacteria 1638 2013 0.0 Jan
2 1 Actinobacteria 569 2013 0.0 Jan
3 1 Bacteroidetes 14 2013 0.0 Jan
4 2 Firmicutes 21 2013 1.0 Jan
5 2 Proteobacteria 0 2013 1.0 Jan
6 2 Actinobacteria 5 2013 NaN Jan
7 2 Bacteroidetes 555 2013 NaN Jan

We can use the drop method to remove rows or columns, which by default drops rows. We can be explicit by using the axis argument:

In [41]:
data.drop('month', axis=1, inplace=True)
In [42]:
data
Out[42]:
patient phylum value year treatment
0 1 Firmicutes 632 2013 0.0
1 1 Proteobacteria 1638 2013 0.0
2 1 Actinobacteria 569 2013 0.0
3 1 Bacteroidetes 14 2013 0.0
4 2 Firmicutes 21 2013 1.0
5 2 Proteobacteria 0 2013 1.0
6 2 Actinobacteria 5 2013 NaN
7 2 Bacteroidetes 555 2013 NaN

We can extract the underlying data as a simple ndarray by accessing the values attribute:

In [43]:
data.values
Out[43]:
array([[1, 'Firmicutes', 632, 2013, 0.0],
       [1, 'Proteobacteria', 1638, 2013, 0.0],
       [1, 'Actinobacteria', 569, 2013, 0.0],
       [1, 'Bacteroidetes', 14, 2013, 0.0],
       [2, 'Firmicutes', 21, 2013, 1.0],
       [2, 'Proteobacteria', 0, 2013, 1.0],
       [2, 'Actinobacteria', 5, 2013, nan],
       [2, 'Bacteroidetes', 555, 2013, nan]], dtype=object)

Notice that because of the mix of string and integer (and NaN) values, the dtype of the array is object. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

In [44]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
df.values
Out[44]:
array([[ 1. ,  0.4],
       [ 2. , -1. ],
       [ 3. ,  4.5]])

Pandas uses a custom data structure to represent the indices of Series and DataFrames.

In [45]:
data.index
Out[45]:
RangeIndex(start=0, stop=8, step=1)

Index objects are immutable:

In [46]:
data.index[0] = 15
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-90b10fa0e9b5> in <module>
----> 1 data.index[0] = 15

~/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2063 
   2064     def __setitem__(self, key, value):
-> 2065         raise TypeError("Index does not support mutable operations")
   2066 
   2067     def __getitem__(self, key):

TypeError: Index does not support mutable operations

This is so that Index objects can be shared between data structures without fear that they will be changed.

In [47]:
bacteria2.index = bacteria.index
In [48]:
bacteria2
Out[48]:
phylum
Firmicutes           NaN
Proteobacteria     632.0
Actinobacteria    1638.0
Bacteroidetes      569.0
dtype: float64

Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:

genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Let's start with some more bacteria data, stored in csv format.

In [49]:
!cat ../data/microbiome.csv
Taxon,Patient,Group,Tissue,Stool
Firmicutes,1,0,136,4182
Firmicutes,2,1,1174,703
Firmicutes,3,0,408,3946
Firmicutes,4,1,831,8605
Firmicutes,5,0,693,50
Firmicutes,6,1,718,717
Firmicutes,7,0,173,33
Firmicutes,8,1,228,80
Firmicutes,9,0,162,3196
Firmicutes,10,1,372,32
Firmicutes,11,0,4255,4361
Firmicutes,12,1,107,1667
Firmicutes,13,0,96,223
Firmicutes,14,1,281,2377
Proteobacteria,1,0,2469,1821
Proteobacteria,2,1,839,661
Proteobacteria,3,0,4414,18
Proteobacteria,4,1,12044,83
Proteobacteria,5,0,2310,12
Proteobacteria,6,1,3053,547
Proteobacteria,7,0,395,2174
Proteobacteria,8,1,2651,767
Proteobacteria,9,0,1195,76
Proteobacteria,10,1,6857,795
Proteobacteria,11,0,483,666
Proteobacteria,12,1,2950,3994
Proteobacteria,13,0,1541,816
Proteobacteria,14,1,1307,53
Actinobacteria,1,0,1590,4
Actinobacteria,2,1,25,2
Actinobacteria,3,0,259,300
Actinobacteria,4,1,568,7
Actinobacteria,5,0,1102,9
Actinobacteria,6,1,678,377
Actinobacteria,7,0,260,58
Actinobacteria,8,1,424,233
Actinobacteria,9,0,548,21
Actinobacteria,10,1,201,83
Actinobacteria,11,0,42,75
Actinobacteria,12,1,109,59
Actinobacteria,13,0,51,183
Actinobacteria,14,1,310,204
Bacteroidetes,1,0,67,0
Bacteroidetes,2,1,0,0
Bacteroidetes,3,0,85,5
Bacteroidetes,4,1,143,7
Bacteroidetes,5,0,678,2
Bacteroidetes,6,1,4829,209
Bacteroidetes,7,0,74,651
Bacteroidetes,8,1,169,254
Bacteroidetes,9,0,106,10
Bacteroidetes,10,1,73,381
Bacteroidetes,11,0,30,359
Bacteroidetes,12,1,51,51
Bacteroidetes,13,0,2473,2314
Bacteroidetes,14,1,102,33
Other,1,0,195,18
Other,2,1,42,2
Other,3,0,316,43
Other,4,1,202,40
Other,5,0,116,0
Other,6,1,527,12
Other,7,0,357,11
Other,8,1,106,11
Other,9,0,67,14
Other,10,1,203,6
Other,11,0,392,6
Other,12,1,28,25
Other,13,0,12,22
Other,14,1,305,32

This table can be read into a DataFrame using read_csv:

In [50]:
mb = pd.read_csv("../data/microbiome.csv")
mb
Out[50]:
Taxon Patient Group Tissue Stool
0 Firmicutes 1 0 136 4182
1 Firmicutes 2 1 1174 703
2 Firmicutes 3 0 408 3946
3 Firmicutes 4 1 831 8605
4 Firmicutes 5 0 693 50
5 Firmicutes 6 1 718 717
6 Firmicutes 7 0 173 33
7 Firmicutes 8 1 228 80
8 Firmicutes 9 0 162 3196
9 Firmicutes 10 1 372 32
10 Firmicutes 11 0 4255 4361
11 Firmicutes 12 1 107 1667
12 Firmicutes 13 0 96 223
13 Firmicutes 14 1 281 2377
14 Proteobacteria 1 0 2469 1821
15 Proteobacteria 2 1 839 661
16 Proteobacteria 3 0 4414 18
17 Proteobacteria 4 1 12044 83
18 Proteobacteria 5 0 2310 12
19 Proteobacteria 6 1 3053 547
20 Proteobacteria 7 0 395 2174
21 Proteobacteria 8 1 2651 767
22 Proteobacteria 9 0 1195 76
23 Proteobacteria 10 1 6857 795
24 Proteobacteria 11 0 483 666
25 Proteobacteria 12 1 2950 3994
26 Proteobacteria 13 0 1541 816
27 Proteobacteria 14 1 1307 53
28 Actinobacteria 1 0 1590 4
29 Actinobacteria 2 1 25 2
... ... ... ... ... ...
40 Actinobacteria 13 0 51 183
41 Actinobacteria 14 1 310 204
42 Bacteroidetes 1 0 67 0
43 Bacteroidetes 2 1 0 0
44 Bacteroidetes 3 0 85 5
45 Bacteroidetes 4 1 143 7
46 Bacteroidetes 5 0 678 2
47 Bacteroidetes 6 1 4829 209
48 Bacteroidetes 7 0 74 651
49 Bacteroidetes 8 1 169 254
50 Bacteroidetes 9 0 106 10
51 Bacteroidetes 10 1 73 381
52 Bacteroidetes 11 0 30 359
53 Bacteroidetes 12 1 51 51
54 Bacteroidetes 13 0 2473 2314
55 Bacteroidetes 14 1 102 33
56 Other 1 0 195 18
57 Other 2 1 42 2
58 Other 3 0 316 43
59 Other 4 1 202 40
60 Other 5 0 116 0
61 Other 6 1 527 12
62 Other 7 0 357 11
63 Other 8 1 106 11
64 Other 9 0 67 14
65 Other 10 1 203 6
66 Other 11 0 392 6
67 Other 12 1 28 25
68 Other 13 0 12 22
69 Other 14 1 305 32

70 rows × 5 columns

Notice that read_csv automatically considered the first row in the file to be a header row.

We can override the default behavior by customizing some of the arguments, like header, names or index_col.

In [51]:
pd.read_csv("../data/microbiome.csv", header=None).head()
Out[51]:
0 1 2 3 4
0 Taxon Patient Group Tissue Stool
1 Firmicutes 1 0 136 4182
2 Firmicutes 2 1 1174 703
3 Firmicutes 3 0 408 3946
4 Firmicutes 4 1 831 8605
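
Passing header=0 together with names consumes the file's header row and substitutes custom column labels. A sketch using an inline sample standing in for the first rows of microbiome.csv:

```python
import io
import pandas as pd

# Inline stand-in for the first rows of microbiome.csv
csv = io.StringIO("Taxon,Patient,Group,Tissue,Stool\n"
                  "Firmicutes,1,0,136,4182\n")

# header=0 consumes the original header; names supplies replacements
df = pd.read_csv(csv, header=0,
                 names=['taxon', 'patient', 'group', 'tissue', 'stool'])
```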

read_csv is just a convenience function for read_table, since csv is such a common format:

In [52]:
mb = pd.read_table("../data/microbiome.csv", sep=',')

The sep argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately very common in some data formats:

sep='\s+'
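
A sketch of the regex separator in action, on an inline whitespace-delimited sample (made up for illustration):

```python
import io
import pandas as pd

# Columns separated by a variable number of spaces
txt = io.StringIO("Taxon Tissue\n"
                  "Firmicutes    136\n"
                  "Proteobacteria  2469\n")

df = pd.read_csv(txt, sep=r'\s+')
```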

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.

In [53]:
mb = pd.read_csv("../data/microbiome.csv", index_col=['Patient','Taxon'])
mb.head()
Out[53]:
Group Tissue Stool
Patient Taxon
1 Firmicutes 0 136 4182
2 Firmicutes 1 1174 703
3 Firmicutes 0 408 3946
4 Firmicutes 1 831 8605
5 Firmicutes 0 693 50

This is called a hierarchical index, which we will revisit later in the section.

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the skiprows argument:

In [54]:
pd.read_csv("../data/microbiome.csv", skiprows=[3,4,6]).head()
Out[54]:
Taxon Patient Group Tissue Stool
0 Firmicutes 1 0 136 4182
1 Firmicutes 2 1 1174 703
2 Firmicutes 5 0 693 50
3 Firmicutes 7 0 173 33
4 Firmicutes 8 1 228 80

If we only want to import a small number of rows from, say, a very large data file we can use nrows:

In [55]:
pd.read_csv("../data/microbiome.csv", nrows=4)
Out[55]:
Taxon Patient Group Tissue Stool
0 Firmicutes 1 0 136 4182
1 Firmicutes 2 1 1174 703
2 Firmicutes 3 0 408 3946
3 Firmicutes 4 1 831 8605

Alternately, if we want to process our data in reasonable chunks, the chunksize argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 14 patients represented in each:

In [56]:
data_chunks = pd.read_csv("../data/microbiome.csv", chunksize=14)

mean_tissue = pd.Series({chunk.iloc[0].Taxon:chunk.Tissue.mean() for chunk in data_chunks})
    
mean_tissue
Out[56]:
Firmicutes         688.142857
Proteobacteria    3036.285714
Actinobacteria     440.500000
Bacteroidetes      634.285714
Other              204.857143
dtype: float64

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including NA and NULL.

In [57]:
!cat ../data/microbiome_missing.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
Firmicutes,10,162,3196
Firmicutes,11,372,-99999
Firmicutes,12,4255,4361
Firmicutes,13,107,1667
Firmicutes,14,?,223
Firmicutes,15,281,2377
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
Proteobacteria,3,839,661
Proteobacteria,4,4414,18
Proteobacteria,5,12044,83
Proteobacteria,6,2310,12
Proteobacteria,7,3053,547
Proteobacteria,8,395,2174
Proteobacteria,9,2651,767
Proteobacteria,10,1195,76
Proteobacteria,11,6857,795
Proteobacteria,12,483,666
Proteobacteria,13,2950,3994
Proteobacteria,14,1541,816
Proteobacteria,15,1307,53
Actinobacteria,1,569,648
Actinobacteria,2,1590,4
Actinobacteria,3,25,2
Actinobacteria,4,259,300
Actinobacteria,5,568,7
Actinobacteria,6,1102,9
Actinobacteria,7,678,377
Actinobacteria,8,260,58
Actinobacteria,9,424,233
Actinobacteria,10,548,21
Actinobacteria,11,201,83
Actinobacteria,12,42,75
Actinobacteria,13,109,59
Actinobacteria,14,51,183
Actinobacteria,15,310,204
Bacteroidetes,1,115,380
Bacteroidetes,2,67,0
Bacteroidetes,3,0,0
Bacteroidetes,4,85,5
Bacteroidetes,5,143,7
Bacteroidetes,6,678,2
Bacteroidetes,7,4829,209
Bacteroidetes,8,74,651
Bacteroidetes,9,169,254
Bacteroidetes,10,106,10
Bacteroidetes,11,73,381
Bacteroidetes,12,30,359
Bacteroidetes,13,51,51
Bacteroidetes,14,2473,2314
Bacteroidetes,15,102,33
Other,1,114,277
Other,2,195,18
Other,3,42,2
Other,4,316,43
Other,5,202,40
Other,6,116,0
Other,7,527,12
Other,8,357,11
Other,9,106,11
Other,10,67,14
Other,11,203,6
Other,12,392,6
Other,13,28,25
Other,14,12,22
Other,15,305,32
In [58]:
pd.read_csv("../data/microbiome_missing.csv").head(20)
Out[58]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305.0
1 Firmicutes 2 136 4182.0
2 Firmicutes 3 NaN 703.0
3 Firmicutes 4 408 3946.0
4 Firmicutes 5 831 8605.0
5 Firmicutes 6 693 50.0
6 Firmicutes 7 718 717.0
7 Firmicutes 8 173 33.0
8 Firmicutes 9 228 NaN
9 Firmicutes 10 162 3196.0
10 Firmicutes 11 372 -99999.0
11 Firmicutes 12 4255 4361.0
12 Firmicutes 13 107 1667.0
13 Firmicutes 14 ? 223.0
14 Firmicutes 15 281 2377.0
15 Proteobacteria 1 1638 3886.0
16 Proteobacteria 2 2469 1821.0
17 Proteobacteria 3 839 661.0
18 Proteobacteria 4 4414 18.0
19 Proteobacteria 5 12044 83.0

Above, Pandas recognized NA and an empty field as missing data.

In [59]:
pd.isnull(pd.read_csv("../data/microbiome_missing.csv")).head(20)
Out[59]:
Taxon Patient Tissue Stool
0 False False False False
1 False False False False
2 False False True False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 False False False False
15 False False False False
16 False False False False
17 False False False False
18 False False False False
19 False False False False

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values argument:

In [60]:
pd.read_csv("../data/microbiome_missing.csv", na_values=['?', -99999]).head(20)
Out[60]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632.0 305.0
1 Firmicutes 2 136.0 4182.0
2 Firmicutes 3 NaN 703.0
3 Firmicutes 4 408.0 3946.0
4 Firmicutes 5 831.0 8605.0
5 Firmicutes 6 693.0 50.0
6 Firmicutes 7 718.0 717.0
7 Firmicutes 8 173.0 33.0
8 Firmicutes 9 228.0 NaN
9 Firmicutes 10 162.0 3196.0
10 Firmicutes 11 372.0 NaN
11 Firmicutes 12 4255.0 4361.0
12 Firmicutes 13 107.0 1667.0
13 Firmicutes 14 NaN 223.0
14 Firmicutes 15 281.0 2377.0
15 Proteobacteria 1 1638.0 3886.0
16 Proteobacteria 2 2469.0 1821.0
17 Proteobacteria 3 839.0 661.0
18 Proteobacteria 4 4414.0 18.0
19 Proteobacteria 5 12044.0 83.0

These can be specified on a column-wise basis using an appropriate dict as the argument for na_values.
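
A sketch of the dict form, scoping each sentinel to a single column (inline sample standing in for microbiome_missing.csv):

```python
import io
import pandas as pd

csv = io.StringIO("Taxon,Patient,Tissue,Stool\n"
                  "Firmicutes,14,?,223\n"
                  "Firmicutes,11,372,-99999\n")

# '?' counts as missing only in Tissue; -99999 only in Stool
df = pd.read_csv(csv, na_values={'Tissue': ['?'], 'Stool': [-99999]})
```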

Microsoft Excel

Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), Pandas' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies installed (depending on which version of Excel file is being imported): xlrd and openpyxl (these may be installed with pip or conda).

The read_excel convenience function in pandas imports a specific sheet from an Excel file:

In [61]:
mb = pd.read_excel('../data/microbiome/MID2.xls', sheetname='Sheet 1', header=None)
mb.head()
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/pandas/io/excel.py:329: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  **kwds)
Out[61]:
0 1
0 Archaea "Crenarchaeota" Thermoprotei Acidiloba... 2
1 Archaea "Crenarchaeota" Thermoprotei Acidiloba... 14
2 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 23
3 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 1
4 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2

There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include JSON, XML, HDF5, relational and non-relational databases, and various web APIs. These are beyond the scope of this tutorial, but are covered in Python for Data Analysis.


References

Python for Data Analysis, by Wes McKinney