From the pandas homepage:

What problem does pandas solve? Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

*pandas* is probably overkill for what we want to do, but it does make the process of handling and organizing data, and then turning it into visualizations, less tedious. Its data structures build on structures fundamental and familiar to plain Python: **lists** and **dictionaries**.

Pandas was created by Wes McKinney, who also authored Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, which I recommend and also quote from in this guide.

Here's a quick 10-minute tour of pandas from McKinney:

The official pandas homepage is pandas.pydata.org. Its docs live here.

Sections most relevant to us:

- 10 Minutes to Pandas
- Intro to Data Structures
- Essential Basic Functionality
- Indexing and Selecting Data
- Reshaping and pivot tables
- A "Cookbook"

Greg Reda has an excellent tutorial that covers roughly the same concepts I cover here. Check it out for another perspective:

`pandas` package

For the rest of this lesson, assume that **pandas** has been imported into the environment like so:

In [2]:

```
import pandas as pd
```

In virtually every tutorial, including this one, you'll see the convention of `pd` being used as *shorthand* for `pandas`.

`pandas.Series` data structure

Pandas docs: pandas.Series

The `pandas.Series` structure is pretty similar to a *list*, in that it holds an ordered sequence of values:

```
import pandas as pd
myseries = pd.Series(['alpha', 1, 2, 3, 'zeta'])
print("The first value is:", myseries[0])
# The first value is: alpha
```

However, you can also think of a `Series` object as a *dictionary*, in that its elements can also be indexed by *keys*. Via Wes McKinney's Python for Data Analysis:

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.
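To make the dictionary comparison concrete, here is a minimal sketch of a few dict-like operations on a Series. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series({'gamma': 42, 'beta': 30, 'delta': 101})

# Membership tests look at the index labels, just like dict keys
has_gamma = 'gamma' in s
has_omega = 'omega' in s

# .get() returns a default instead of raising KeyError
omega_value = s.get('omega', 0)

# Convert back to a plain dictionary
as_dict = s.to_dict()

print(has_gamma, has_omega, omega_value)  # True False 0
print(as_dict)
```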

`Series` object

Since `Series` is so similar to a dictionary, we can construct a new `Series` object by passing a dictionary to its constructor:

In [16]:

```
mydict = {'gamma': 42, 'beta': 30, 'delta' : 101}
myseries = pd.Series(mydict)
print("This is gamma:", myseries['gamma'])
# .iloc selects by position; plain myseries[2] is ambiguous when the index holds labels
print("This is the 3rd element:", myseries.iloc[2])
```

As the above example shows, a `Series` created from a dictionary keeps the keys in the dictionary's insertion order in current versions of pandas (older releases sorted the keys instead). Either way, that ordering is enforced, as it would be in a list, so elements can also be fetched by position.

The concept of an ordered **index** is central to Python lists and to pandas Series as well. Instead of constructing a Series from a dictionary, you can create one by passing in a list of values and specifying the **index** argument.

Note in the example below how the Series does *not* sort the index alphabetically and keeps it in the order specified in the **index** argument:

In [13]:

```
mylist = [42, 30, 10]
myseries = pd.Series(mylist, index = ['gamma', 'beta', 'delta'])
print("This is gamma:", myseries['gamma'])
# .iloc selects by position; plain myseries[2] is ambiguous when the index holds labels
print("This is the 3rd element:", myseries.iloc[2])
```

As seen above, referring to the rows of a Series uses the same bracket `[]` notation as used for lists and dictionaries. However, there are a few more options for selecting *multiple* values when working with a Series.

While using an individual index/key returns the single corresponding value, e.g. `myseries['gamma']` gets you `42`, the following examples of multiple indexes/keys create new Series objects:

In [40]:

```
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
# select the 1st and 3rd rows by position
myseries.iloc[[0, 2]]
```

Out[40]:

In [41]:

```
myseries[['alpha', 'beta']]
```

Out[41]:

**A note about the output above:** The `pandas` data structures come with metadata; the `dtype: int64` refers to the fact that every value in the resulting Series objects from the above commands is a 64-bit integer.
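As a sketch of the multi-value selections above, the explicit `.iloc` (position) and `.loc` (label) accessors make it clear which kind of key is meant. The names `by_position` and `by_label` are just for illustration.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

by_position = myseries.iloc[[0, 2]]          # 1st and 3rd rows
by_label = myseries.loc[['alpha', 'beta']]   # rows labeled 'alpha' and 'beta'

print(by_position.tolist())  # [42, 10]
print(by_label.tolist())     # [99, 30]
print(by_label.dtype)        # an integer dtype such as int64
```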

In [47]:

```
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries[2:]
```

Out[47]:

However, we can also slice a Series by its index values:

In [45]:

```
myseries['gamma':'beta']
```

Out[45]:

A couple of things to note:

- Unlike ranges that use integers, ranges with index values are *inclusive*, i.e. the row indexed as `beta` is also included in the resulting sub-Series.
- You can see how confusing things get if you choose to index your series in non-traditional order, which, in the above scenario, would be alphabetical order.
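The difference between the two kinds of ranges can be sketched side by side. This example uses the explicit `.iloc` and `.loc` accessors so that the intent of each slice is unambiguous.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

# Integer range: stops *before* position 2, as with a list
positional = myseries.iloc[0:2]

# Label range: the end label 'delta' is included
labeled = myseries.loc['gamma':'delta']

print(list(positional.index))  # ['gamma', 'beta']
print(list(labeled.index))     # ['gamma', 'beta', 'delta']
```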

Assuming that everything in a given Series is of a single type, we can perform mass operations (better known as scalar and array operations) on that Series, creating a new Series object in the process.

Here's how to mass-assign the string `'fluffy'` to the rows selected by the integer slice `[2:]` used earlier:

In [24]:

```
myseries[2:] = 'fluffy'
myseries
```

Out[24]:

Note how the `dtype` of `myseries` changed to `object`; the series no longer contains just integers. Recent pandas releases warn about this kind of implicit dtype change. Although *Series* are just like lists and dictionaries in that they can contain a sequence of any type of objects, we normally strive to keep them all of one type in most data-wrangling/analysis scenarios.
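If mixed types really are wanted, one option is to convert the Series to the `object` dtype explicitly before assigning. This is only a sketch of that approach; the name `mixed` is illustrative.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

# Opt in to mixed types up front; astype() returns a new Series
mixed = myseries.astype(object)
mixed.iloc[2:] = 'fluffy'

print(mixed.tolist())        # [42, 30, 'fluffy', 'fluffy']
print(mixed.dtype)           # object
print(myseries.tolist())     # [42, 30, 10, 99] (the original is untouched)
```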

Here's an example of **scalar arithmetic**, e.g. multiplying all of the elements in the Series by a single value:

In [48]:

```
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries * 10
```

Out[48]:

And here's an example of **array arithmetic**, e.g. adding the elements of a Series to the corresponding elements of another sequence (e.g. a list or Series) to produce a new Series:

In [50]:

```
myseries + [1000, 2000, 3000, -4000]
```

Out[50]:
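For reference, here is a sketch that reproduces the two operations above and adds one detail worth knowing: when the other operand is a Series rather than a list, pandas matches elements by index *label*, not by position. The Series `a` and `b` are made up for illustration.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

scaled = myseries * 10                          # scalar arithmetic
shifted = myseries + [1000, 2000, 3000, -4000]  # list: matched by position

print(scaled.tolist())   # [420, 300, 100, 990]
print(shifted.tolist())  # [1042, 2030, 3010, -3901]

# Series + Series: matched by label, even if the order differs
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])
total = a + b
print(total.to_dict())   # {'x': 31, 'y': 22, 'z': 13}
```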

Attempting to perform array operations against differently-sized sequences will result in an error:

```
# raises ValueError: the list has 2 elements but the Series has 4
myseries + [1000, 2000]
```
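If you'd rather handle the mismatch than let the script stop, a minimal sketch is to catch the `ValueError`. The `raised` flag is only there to make the outcome visible.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

raised = False
try:
    myseries + [1000, 2000]   # 2 values against a 4-element Series
except ValueError as err:
    raised = True
    print("Could not add the sequences:", err)
```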

Boolean expressions are also possible. The following command creates a new Series object that contains the result of the boolean expression as it is applied to each row:

In [30]:

```
myseries > 20
```

Out[30]:
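As a sketch of what that expression produces: the result is a Series of `True`/`False` values with the same index, and because `True` counts as 1, summing it tells you how many rows passed the test. The name `mask` is illustrative.

```python
import pandas as pd

myseries = pd.Series([42, 30, 10, 99], index=['gamma', 'beta', 'delta', 'alpha'])

mask = myseries > 20
print(mask.tolist())        # [True, True, False, True]
print(list(mask.index))     # ['gamma', 'beta', 'delta', 'alpha']
print(int(mask.sum()))      # 3
```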

The *pandas* Series objects come with a variety of useful and familiar aggregation methods and attributes:

In [65]:

```
myseries = pd.Series({'a': 20, 'b': -5, 'c': 42})
myseries.size
```

Out[65]:

In [67]:

```
myseries.sum()
```

Out[67]:

In [68]:

```
myseries.mean()
```

Out[68]:

In [69]:

```
myseries.min()
```

Out[69]:
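Collected in one place, the calls above give the following values for this small Series. The `idxmax()` call is an extra shown here for illustration; it returns the *label* of the largest value.

```python
import pandas as pd

myseries = pd.Series({'a': 20, 'b': -5, 'c': 42})

print(myseries.size)      # 3
print(myseries.sum())     # 57
print(myseries.mean())    # 19.0
print(myseries.min())     # -5
print(myseries.idxmax())  # c
```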

One of the biggest advantages for us in using *pandas* data structures is their many methods for filtering data.

Consider how we filter a dictionary in plain Python:

In [62]:

```
mydict = {'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7}
newdict = {}
for k, v in mydict.items():
    if v > 10:
        newdict[k] = v
```

Or if you prefer using a comprehension:

In [56]:

```
{k: v for k, v in mydict.items() if v > 10}
```

Out[56]:

Filtering a Series can be done using the same bracket `[]` notation used in indexing a Series:

In [93]:

```
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7})
myseries[myseries > 10]
```

Out[93]:
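Here is a sketch of the same filter with its result, plus a combined condition. Note that conditions are joined with `&` (not `and`) and each one needs its own parentheses. The names `big` and `midsize` are illustrative.

```python
import pandas as pd

myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7})

big = myseries[myseries > 10]
midsize = myseries[(myseries > 8) & (myseries < 100)]

print(big.to_dict())      # {'apples': 42, 'carrots': 303}
print(midsize.to_dict())  # {'apples': 42, 'bagels': 9}
```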

It can't be emphasized enough that while a *pandas* Series is as accessible as a Python dictionary, it still maintains its specified order of keys, i.e. its **index**, and the values they map to.

When creating a Series from a Python list, its index is simply the numerical position of each element. When creating a Series from a dictionary, its index consists of the keys of the dictionary:

In [82]:

```
series_a = pd.Series([4, 10, 3])
series_a.index
```

Out[82]:

In [81]:

```
series_b = pd.Series({'x': 500, 'y': 600, 'z': 700})
series_b.index
```

Out[81]:
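A quick sketch of what those two index objects contain, converted to plain lists for easy comparison:

```python
import pandas as pd

series_a = pd.Series([4, 10, 3])
series_b = pd.Series({'x': 500, 'y': 600, 'z': 700})

print(list(series_a.index))  # [0, 1, 2]
print(list(series_b.index))  # ['x', 'y', 'z']

# The default integer index is a compact RangeIndex, similar to Python's range()
print(isinstance(series_a.index, pd.RangeIndex))  # True
```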

Assigning to the `index` attribute changes the index, effectively **relabeling** the data, by substituting a list of new index values (the Series and the list of new index values have to be the same size):

In [92]:

```
myseries = pd.Series([4, 10, 3])
myseries.index = ['a', 'b', 'c']
myseries.index
```

Out[92]:
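The size requirement can be sketched as follows: relabeling with three labels works, while assigning only two labels to a three-element Series raises a `ValueError`. The `raised` flag and `b_value` name are illustrative.

```python
import pandas as pd

myseries = pd.Series([4, 10, 3])
myseries.index = ['a', 'b', 'c']
b_value = myseries['b']
print(b_value)  # 10

raised = False
try:
    myseries.index = ['a', 'b']   # too few labels
except ValueError as err:
    raised = True
    print("Relabeling failed:", err)
```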

The `reindex()` method allows you to rearrange the values of a Series; note that it **does not change** the Series object but creates a new one:

In [96]:

```
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303})
myseries.reindex(['bagels', 'apples', 'carrots'])
```

Out[96]:

Reindexing the series using values not found in the original index will return a Series that contains the new index values pointing to null/`NaN` values:

In [98]:

```
myseries.reindex(['oranges', 'bagels', 'apples', 'carrots'])
```

Out[98]:
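A sketch of that behavior: the missing label gets `NaN`, which forces the values to become floats, and the original Series is left alone. The `fill_value` argument of `reindex()` offers a way to supply a default instead of `NaN`.

```python
import pandas as pd

myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303})
new_labels = ['oranges', 'bagels', 'apples', 'carrots']

with_nan = myseries.reindex(new_labels)
with_zero = myseries.reindex(new_labels, fill_value=0)

print(with_nan.isna().tolist())  # [True, False, False, False]
print(with_nan.dtype)            # float64, because NaN is a float
print(with_zero.tolist())        # [0, 9, 42, 303]
print(list(myseries.index))      # ['apples', 'bagels', 'carrots']
```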

`pandas.DataFrame` structures

The **DataFrame** structure can be thought of as a series of *Series* objects, except that they share two indices: the index that represents the *row* order, and the index that represents the *column* order. In other words, it's pretty much like a spreadsheet.
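To make the spreadsheet analogy concrete, here is a small sketch that builds a DataFrame from a list of lists and pulls one column back out as a Series. The rows and column names are made up for illustration.

```python
import pandas as pd

# Made-up sample rows: item name, count, unit price
rows = [
    ['apples', 42, 0.5],
    ['bagels', 9, 1.25],
    ['carrots', 303, 0.1],
]
df = pd.DataFrame(rows, columns=['item', 'count', 'price'])

print(list(df.index))    # [0, 1, 2]  (the row index)
print(list(df.columns))  # ['item', 'count', 'price']  (the column index)

# Each column is itself a Series that shares the row index
counts = df['count']
print(type(counts).__name__)  # Series
print(counts.sum())           # 354
```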

You can create a DataFrame by passing in a list of lists, i.e. a 2-dimensional array. The more common route is to use the **read_csv()** function, which can read from a local filename or, very conveniently, a URL:

In [3]:

```
# data from Yahoo Finance Historical Prices:
# http://finance.yahoo.com/q/hp?s=AAPL
csvurl = "http://real-chart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=1&c=2013&d=04&e=1&f=2015&g=d"
prices = pd.read_csv(csvurl, parse_dates = [0])
prices.head()
```

Out[3]:

In the `read_csv()` call, I make use of the `parse_dates` argument and specify that the *first* column contains dates that should be converted to proper timestamp objects. Again, this is something you could do yourself with a for-loop and the `datetime.strptime()` function, but it's a very handy convenience that `pandas` brings to the table.
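To try `parse_dates` without a network connection, here is a self-contained sketch that reads CSV text from memory via `io.StringIO`. The rows are made-up sample values, not real prices, and the column names are assumptions chosen to resemble the data above.

```python
import io
import pandas as pd

# Made-up sample data standing in for a downloaded CSV file
csv_text = """Date,Open,Close
2015-04-27,132.31,132.65
2015-04-28,134.46,130.56
2015-04-29,130.16,128.64
"""

# parse_dates=[0] converts the first column into timestamps
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])

print(sample.shape)                     # (3, 3)
print(sample['Date'].dtype)             # a datetime64 dtype
print(sample['Date'].dt.year.tolist())  # [2015, 2015, 2015]
```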

By default, loading data via a CSV file (or via any other format besides a dictionary) will create a *DataFrame* in which the *index* is a sequential list of integers. But, as a DataFrame is a group of series, we can actually specify that one of its *columns* be the index.

In the following snippet, I create a new copy of the `prices` DataFrame, but using the `Date` column as the index:

In [5]:

```
dprices = prices.set_index(prices['Date'])
dprices.head()
```

Out[5]:

The practical implication is that I can now use a date string as a key to get rows:

In [6]:

```
# .loc looks up rows by index label; plain dprices[...] would look for a column
dprices.loc['2015-04-27']
```

Out[6]:
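The date-keyed lookup can be sketched on a small in-memory table. The values below are made up, not real prices, and this sketch passes the column *name* to `set_index()`, an alternative that moves `Date` out of the columns and into the index.

```python
import io
import pandas as pd

# Made-up sample data, sorted by date
csv_text = """Date,Open,Close
2015-04-27,132.31,132.65
2015-04-28,134.46,130.56
2015-04-29,130.16,128.64
"""
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])
dprices = prices.set_index('Date')

# A full date string picks out that day's data
one_day = dprices.loc['2015-04-28']
print(one_day)

# A partial date string selects every row in that month
april = dprices.loc['2015-04']
print(len(april))  # 3
```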

I've made a separate walkthrough using more Yahoo Finance data with data frames here.

For more thorough walkthroughs with DataFrames, check out Greg Reda's work: