About Pandas

From the pandas homepage:

What problem does pandas solve?

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

pandas is probably overkill for what we want to do, but it does make the process of handling and organizing data, and then turning into visualizations, less tedious. Its data structures are based off of structures fundamental and familiar to plain Python: lists and dictionaries.

Pandas was created by Wes McKinney, who also authored Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, which I recommend and also quote from in this guide.

Here's a quick 10-minute tour of pandas from McKinney:

More documentation and guides

The official pandas homepage is pandas.pydata.org. Its docs live here.

Sections most relevant to us:

Greg Reda has an excellent tutorial that covers roughly the same scope of concepts I'm cover here. Check them out for another perspective:

Importing the pandas package

For the rest of this lesson, assume that pandas has been imported into the environment like so:

In [2]:
import pandas as pd

In virtually every tutorial, including this one, you'll see the convention of pd being used as shorthand for pandas.

The pandas.Series data structure

Pandas docs: pandas.Series

The pandas.Series structure is pretty similar to a list, in that it holds an ordered list of values:

import pandas as pd
myseries = pd.Series(['alpha', 1, 2, 3, 'zeta'])
print("The first value is:", myseries[0])
# The first value is: alpha

However, you can also think of a Series object as a dictionary, in that its elements can also be indexed by keys. Via Wes McKinney's Python for Data Analysis:

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

Constructing a Series object

Since Series is so similar to a dictionary, we can construct a new Series object by passing in a dictionary to its constructor function:

In [16]:
mydict = {'gamma': 42, 'beta': 30, 'delta' : 101}
myseries = pd.Series(mydict)

print("This is gamma:", myseries['gamma'])
print("This is the 3rd element:", myseries[2])
This is gamma: 42
This is the 3rd element: 42

As indicated in the above example, when creating the Series from a dictionary, pandas will sort the keys by default. Unlike a typical dictionary, this ordering will be enforced, as it would be in a list.

Specifying the index

The concept of an ordered index is central to Python lists and for pandas Series as well. In lieu of constructing a Series using a dictionary, you can create a new Series by passing in a list of values and then specifying the index argument.

Note in the example below how the Series will not sort the index alphabetically and will keep it in the same order as we've specified in the index argument:

In [13]:
mylist = [42, 30, 10]
myseries = pd.Series(mylist, index = ['gamma', 'beta', 'delta'])
print("This is gamma:", myseries['gamma'])
print("This is the 3rd element:", myseries[2])
This is gamma: 42
This is the 3rd element: 10

Indexing and slices

As seen above, referring to the rows of a Series uses the same bracket [] notation as used for lists and dictionaries. However, there are a few more options in selecting multiple values when working with Series.

While using an individual index/key will return the single corresponding value, e.g. index[0] gets you 42, the following examples of multiple indexes/keys will create new Series objects:

In [40]:
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])

myseries[[0, 2]]
Out[40]:
gamma    42
delta    10
dtype: int64
In [41]:
myseries[['alpha', 'beta']]
Out[41]:
alpha    99
beta     30
dtype: int64

A note about the output above: The pandas data structures come with metadata; the dtype: int64 refers to the fact that every value in the resulting Series objects from the above commands is an integer.

Making slices

Pandas docs: Slicing ranges

This is similar to slicing plain Python lists:

In [47]:
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])

myseries[2:]
Out[47]:
delta    10
alpha    99
dtype: int64

However, we can also slice a Series by its index values:

In [45]:
myseries['gamma':'beta']
Out[45]:
gamma    42
beta     30
dtype: int64

A couple of things to note:

  1. Unlike ranges that use integers, ranges with index values are inclusive, i.e. the row indexed as beta is also included in the resulting sub-Series.
  2. You can see how confusing things get if you choose to index your series in non-traditional order, which, in the above scenario, would be alphabetical order.

Series-wide operations and transformations

Assuming that everything in a given Series is of a single type, we can perform mass operations (better known as scalar and array operations) on that Series, creating a new Series object in the process.

Here's how to mass-assign the string 'fluffy' to the same range of rows accessed in the previous example:

In [24]:
myseries[2:] = 'fluffy'
myseries
Out[24]:
gamma        42
beta         30
delta    fluffy
alpha    fluffy
dtype: object

Note how the dtype of myseries changed to object; the series no longer contains just integers. Although Series are just like lists and dictionaries in that they can contain a sequence of any type of objects, we normally strive to keep them all of one type in most data-wrangling/analysis scenarios.

Here's an example of scalar arithmetic, e.g. multiplying all of the elements in the Series with a single value:

In [48]:
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries * 10
Out[48]:
gamma    420
beta     300
delta    100
alpha    990
dtype: int64

And here's an example of array arithmetic, e.g. adding elements of a Series with corresponding elements from another sequence (i.e. a list or Series) to produce a new Series:

In [50]:
myseries + [1000, 2000, 3000, -4000]
Out[50]:
gamma    1042
beta     2030
delta    3010
alpha   -3901
dtype: int64

Attempting to perform array operations against differently-sized sequences will result in an error:

myseries + [1000, 2000]

Boolean operations

Boolean expressions are also possible. The following command creates a new Series object that contains the result of the boolean expression as it is applied to each row:

In [30]:
myseries > 20
Out[30]:
gamma     True
beta      True
delta    False
alpha     True
dtype: bool

Aggregating series

The pandas Series objects come with a variety of useful and familiar aggregation methods and attributes:

In [65]:
myseries = pd.Series({'a': 20, 'b': -5, 'c': 42})
myseries.size
Out[65]:
3
In [67]:
myseries.sum()
Out[67]:
57
In [68]:
myseries.mean()
Out[68]:
19.0
In [69]:
myseries.min()
Out[69]:
-5

Filtering series

One of the biggest advantages for us in using pandas data structures are their many methods for filtering data.

Consider how we filter a dictionary in plain Python:

In [62]:
mydict = {'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7}
newdict = {}
for k, v in mydict.items():
    if v > 10:
        newdict[k] = v

Or if you prefer using a comprehension:

In [56]:
{k: v for k, v in mydict.items() if v > 10}
Out[56]:
{'apples': 42, 'carrots': 303}

Filtering a Series can be done using the same bracket [] notation used in indexing a Series:

In [93]:
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7})

myseries[myseries > 10]
Out[93]:
apples      42
carrots    303
dtype: int64

Changing the index

It can't be emphasized enough that while a pandas Series is as accessible as a Python dictionary, it still maintains its specified order of keys, i.e. its index, and the values they map to.

When creating a Series from a Python list, its index is simply the numerical position of each element. And in creating a Series from a dictionary, its index are the keys of the dictionary:

In [82]:
series_a = pd.Series([4, 10, 3])
series_a.index
Out[82]:
Int64Index([0, 1, 2], dtype='int64')
In [81]:
series_b = pd.Series({'x': 500, 'y': 600, 'z': 700})
series_b.index
Out[81]:
Index(['x', 'y', 'z'], dtype='object')

The index= method can be used to change the index, effectively relabeling the data, by substituting a list of new index values (the Series and the list of new index values have to be the same size):

In [92]:
myseries = pd.Series([4, 10, 3])
myseries.index = ['a', 'b', 'c']
myseries.index
Out[92]:
Index(['a', 'b', 'c'], dtype='object')

Reindexing

The reindex() method allows you to rearrange the values of a Series; note that it does not change the Series object but creates a new one:

In [96]:
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303})
myseries.reindex(['bagels', 'apples', 'carrots'])
Out[96]:
bagels       9
apples      42
carrots    303
dtype: int64

Reindexing the series using values not found in the original index will return a Series that contains the new index values pointing to null/NaN values:

In [98]:
myseries.reindex(['oranges', 'bagels', 'apples', 'carrots'])
Out[98]:
oranges    NaN
bagels       9
apples      42
carrots    303
dtype: float64

The pandas.Dataframes structures

The DataFrame structure can be thought of as series of Series objects, except that they share two indices: the index that represents the row order, and the index that represents the columns order. In other words, it's pretty much like a spreadsheet.

Constructing a DataFrame

You can create a DataFrame by passing in a list-of-lists, i.e. a 2-dimensional array. The more common route is to use the read_csv() method, which can read from a local filename or, very conveniently, a URL:

In [3]:
# data from Yahoo Finance Historical Prices:
# http://finance.yahoo.com/q/hp?s=AAPL
csvurl = "http://real-chart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=1&c=2013&d=04&e=1&f=2015&g=d"
prices = pd.read_csv(csvurl, parse_dates = [0])
prices.head()
Out[3]:
Date Open High Low Close Volume Adj Close
0 2015-05-01 126.10000 130.13000 125.30000 128.95000 57195000 128.95000
1 2015-04-30 128.64000 128.64000 124.58000 125.15000 82475900 125.15000
2 2015-04-29 130.16000 131.59000 128.30000 128.64000 62410800 128.64000
3 2015-04-28 134.46001 134.53999 129.57001 130.56000 118580700 130.56000
4 2015-04-27 132.31000 133.13000 131.14999 132.64999 84783100 132.64999

In the read_csv() call, I make use of the parse_dates argument and specify that the first column contains a date that should be converted to a proper timestamp object. Again, something you could write a for-loop with the datetime.strptime() function, but a very handy conveneince that pandas brings to the table.

By default, loading data via a CSV file (or via any other format besides a dictionary) will create a DataFrame in which the index is a sequential list of integers. But, as a DataFrame is a group of series, we can actually specify that one of its columns be the index.

In the following snippet, I create a new copy of the prices DataFrame, but using the Date column as the index:

In [5]:
dprices = prices.set_index(prices['Date'])
dprices.head()
Out[5]:
Date Open High Low Close Volume Adj Close
Date
2015-05-01 2015-05-01 126.10000 130.13000 125.30000 128.95000 57195000 128.95000
2015-04-30 2015-04-30 128.64000 128.64000 124.58000 125.15000 82475900 125.15000
2015-04-29 2015-04-29 130.16000 131.59000 128.30000 128.64000 62410800 128.64000
2015-04-28 2015-04-28 134.46001 134.53999 129.57001 130.56000 118580700 130.56000
2015-04-27 2015-04-27 132.31000 133.13000 131.14999 132.64999 84783100 132.64999

The practical implication is that I can now use a date string as a key to get rows:

In [6]:
dprices['2015-04-27']
Out[6]:
Date Open High Low Close Volume Adj Close
Date
2015-04-27 2015-04-27 132.31 133.13 131.14999 132.64999 84783100 132.64999

I've made a separate walkthrough using more Yahoo Finance data with data frames here.

For more thorough walkthroughs with DataFrames, check out Greg Reda's work: