Pandas
Pandas is one of the primary data analysis libraries in Python. It offers a wide range of operations to aid in data exploration, cleaning, and transformation, making it one of the most popular data science tools. To name a few examples, Pandas provides methods for handling missing data and pivoting data, easy sorting and description capabilities, fast generation of data plots, and Boolean indexing for fast masking operations such as image processing.
Pandas builds upon NumPy and other Python packages to provide easy-to-use data structures and data manipulation functions with integrated indexing.
Additional Recommended Resources:
Introduction to Pandas Data Structures
Series in Pandas
Pandas Series are one-dimensional labeled arrays. Since they act like ndarrays, they are valid arguments to most NumPy methods. Series support many data types, including integers, strings, floating point numbers, Python objects, etc. Their axis labels are collectively referred to as the index, and we can get and set values using these index labels. You can think of a Series as a flexible dictionary-like object.
Let's look at some code examples with the Pandas Series.
# import the Pandas package
import pandas as pd
# create a Series called sr
# syntax is: pd.Series([data elements], [index elements])
# note that the elements in the data and index sets do not have to be the same
sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])
# view the Series
sr
peach      10
plum      foo
dog        30
band     90.4
dtype: object
# view the indices
sr.index
Index(['peach', 'plum', 'dog', 'band'], dtype='object')
# access the data at an index
sr['plum']
'foo'
# OR
sr.loc['plum']
'foo'
# access the data at multiple indices
sr[['peach', 'band']]
peach      10
band     90.4
dtype: object
# OR
sr.loc[['peach', 'band']]
peach      10
band     90.4
dtype: object
You can see that the data is represented so that you can access it like a list with numeric indices (list[x]) or more like a dictionary (dic['key']).
# access a data element by position in the list
# (note: in recent pandas versions, positional access with plain brackets on a labeled
# Series is deprecated; the .iloc form below is preferred)
sr[2]
30
# OR
sr.iloc[2]
30
# access multiple data elements by positions in the list
sr[[0, 1, 2]]
peach     10
plum     foo
dog       30
dtype: object
# OR
sr.iloc[[1, 2, 3]]
plum      foo
dog        30
band     90.4
dtype: object
# is the index 'peach' in the Series?
'peach' in sr
True
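One detail worth noting: the `in` operator tests membership against the index labels, not the stored values. A minimal sketch, reusing the same Series, of checking for a value instead:

```python
import pandas as pd

sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])

# 'in' checks the index labels, so a stored value is not found this way
'foo' in sr          # False: 'foo' is a value, not an index label

# to test for a value, look at the underlying data instead
'foo' in sr.values   # True
```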
We can also use basic Python operations like multiplication on a Series. In the code below, we multiply the whole Series by 2. Note that this operation is performed on all data types, even strings, where the string is doubled.
sr * 2  # notice 'foo' becomes 'foofoo' when multiplied by 2
peach        20
plum     foofoo
dog          60
band      180.8
dtype: object
sr  # because we did not set sr = sr * 2, sr doesn't change values
peach      10
plum      foo
dog        30
band     90.4
dtype: object
We can square the numerical values in a Series by selecting them with their index labels. If we tried to square a value that's not a numeric data type, however, we would get an error.
sr[['peach', 'band']] ** 2 # you cannot square a string, so if you include 'foo' you will get an error
peach        100
band     8172.16
dtype: object
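If you don't know in advance which entries are numeric, one way to square only the numeric values is to build a Boolean mask by type. This is just an illustrative sketch, not the only approach:

```python
import pandas as pd

sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])

# keep only the entries whose values are numeric, then square them;
# the string 'foo' is filtered out, so no error is raised
numeric_mask = sr.apply(lambda x: isinstance(x, (int, float)))
squared = sr[numeric_mask] ** 2
```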
DataFrames in Pandas
Pandas DataFrames are flexible 2-dimensional labeled data structures. They also support heterogeneous data and have labeled axes for rows and columns. We can think of a DataFrame as a container for Series objects, where each column is a Series.
Below we give some examples of things you can do with the Pandas DataFrame. You can find the full documentation for DataFrames in the official pandas documentation.
Creating a DataFrame
There are many ways to create Pandas DataFrames. We often just read and ingest data into a DataFrame, but in this example, we create the DataFrame manually by starting with a dictionary of Series. Note that we are adding another dimension to our data structure, so we need to label each Series. Here, we label the first Series 'a' and the second 'b'.
# create a dictionary called df_data
df_data = {'a' : pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird']),
'b' : pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])}
# create and output the DataFrame
df = pd.DataFrame(df_data)
df
| | a | b |
|---|---|---|
| bird | 4.0 | NaN |
| cake | NaN | 10.0 |
| cat | 2.0 | NaN |
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| ice cream | NaN | 30.0 |
Series 'a' and 'b' don't share all of the same indices. When we print the DataFrame, we see NaN values, which indicate that the corresponding Series does not contain that index.
df.index
Index(['bird', 'cake', 'cat', 'dog', 'fruit', 'ice cream'], dtype='object')
df.columns
Index(['a', 'b'], dtype='object')
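Since mismatched indices produce NaN entries, this is a natural place to use pandas' missing-data helpers mentioned at the start. A minimal sketch on the same DataFrame (each call returns a new object; `df` itself is unchanged):

```python
import pandas as pd

df_data = {'a': pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird']),
           'b': pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])}
df = pd.DataFrame(df_data)

df.isna()      # Boolean mask marking the missing entries
df.fillna(0)   # copy of df with NaN replaced by a fill value
df.dropna()    # copy of df keeping only rows with no missing values
```

Here `df.dropna()` keeps only the 'fruit' row, the one index present in both Series.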
We can also create a smaller DataFrame using a subset of the same data, this time specifying which indices we want to be included.
pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'])
| | a | b |
|---|---|---|
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| bird | 4.0 | NaN |
By specifying the columns parameter, you can select which columns you'd like the new DataFrame to include. In the code below, we ask the DataFrame to include column 'e', which doesn't exist in the original dictionary. Because of this, a new column 'e' is created with all of its entries NaN.
pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'], columns=['a', 'e'])
| | a | e |
|---|---|---|
| dog | 1.0 | NaN |
| fruit | 3.0 | NaN |
| bird | 4.0 | NaN |
Creating a DataFrame from a list of Python dictionaries
Another way to create a DataFrame is to use a list of Python dictionaries as your data. In the code below, we create a list of Python dictionaries called 'df_data2' and use this to make a DataFrame called 'df2'. We then use many of the same techniques as above to explore the DataFrame.
Refer to the Python documentation for a reminder on dictionaries.
# create a list of Python dictionaries
df_data2 = [{'apple': 5, 'cherry': 10}, {'peter': 1, 'emily': 2, 'brian': 6}]
# labels get created as column headers
pd.DataFrame(df_data2)
| | apple | brian | cherry | emily | peter |
|---|---|---|---|---|---|
| 0 | 5.0 | NaN | 10.0 | NaN | NaN |
| 1 | NaN | 6.0 | NaN | 2.0 | 1.0 |
# rename the rows from 0 and 1 to 'blue' and 'yellow' by specifying the index parameter
pd.DataFrame(df_data2, index=['blue', 'yellow'])
| | apple | brian | cherry | emily | peter |
|---|---|---|---|---|---|
| blue | 5.0 | NaN | 10.0 | NaN | NaN |
| yellow | NaN | 6.0 | NaN | 2.0 | 1.0 |
# create a smaller DataFrame by specifying the columns
pd.DataFrame(df_data2, columns=['cherry', 'emily','brian'])
| | cherry | emily | brian |
|---|---|---|---|
| 0 | 10.0 | NaN | NaN |
| 1 | NaN | 2.0 | 6.0 |
Exploring some basic DataFrame operations
Now let's look into how we can get data out of a DataFrame with some basic DataFrame operations. In the following code, we perform some operations on our DataFrame df.
# our DataFrame of interest
df
| | a | b |
|---|---|---|
| bird | 4.0 | NaN |
| cake | NaN | 10.0 |
| cat | 2.0 | NaN |
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| ice cream | NaN | 30.0 |
# display only column 'a' of df (subsetting)
df['a']
bird         4.0
cake         NaN
cat          2.0
dog          1.0
fruit        3.0
ice cream    NaN
Name: a, dtype: float64
# create a new column 'c' by adding 'a' and 'b' together
df['c'] = df['a'] + df['b']
df
| | a | b | c |
|---|---|---|---|
| bird | 4.0 | NaN | NaN |
| cake | NaN | 10.0 | NaN |
| cat | 2.0 | NaN | NaN |
| dog | 1.0 | NaN | NaN |
| fruit | 3.0 | 20.0 | 23.0 |
| ice cream | NaN | 30.0 | NaN |
Note that adding NaN to a floating point value yields NaN, so the resulting values in 'c' are NaN wherever either 'a' or 'b' is missing. For index 'fruit', however, both 'a' and 'b' are floating point values and can be added together.
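If you instead want a missing value treated as 0 rather than letting NaN propagate, `Series.add` accepts a `fill_value` argument. A minimal sketch, rebuilding the same DataFrame:

```python
import pandas as pd

df_data = {'a': pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird']),
           'b': pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])}
df = pd.DataFrame(df_data)

# substitute 0 for a missing operand instead of producing NaN
# (rows missing on both sides would still be NaN)
c_filled = df['a'].add(df['b'], fill_value=0)
```

Now 'cake' gets 10.0 and 'bird' gets 4.0 instead of NaN, while 'fruit' is still 23.0.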
# create a new column 'd' of boolean values indicating whether or not an index's value in 'a' is greater than 2.0
# NaN values evaluate to False
df['d'] = df['a'] > 2.0
df
| | a | b | c | d |
|---|---|---|---|---|
| bird | 4.0 | NaN | NaN | True |
| cake | NaN | 10.0 | NaN | False |
| cat | 2.0 | NaN | NaN | False |
| dog | 1.0 | NaN | NaN | False |
| fruit | 3.0 | 20.0 | 23.0 | True |
| ice cream | NaN | 30.0 | NaN | False |
# set cee equal to the 'c' column in the DataFrame
cee = df.pop('c')
cee
bird          NaN
cake          NaN
cat           NaN
dog           NaN
fruit        23.0
ice cream     NaN
Name: c, dtype: float64
# the pop method has removed 'c' from df
df
| | a | b | d |
|---|---|---|---|
| bird | 4.0 | NaN | True |
| cake | NaN | 10.0 | False |
| cat | 2.0 | NaN | False |
| dog | 1.0 | NaN | False |
| fruit | 3.0 | 20.0 | True |
| ice cream | NaN | 30.0 | False |
# delete column 'b' from the DataFrame
del df['b']
df
| | a | d |
|---|---|---|
| bird | 4.0 | True |
| cake | NaN | False |
| cat | 2.0 | False |
| dog | 1.0 | False |
| fruit | 3.0 | True |
| ice cream | NaN | False |
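Both `pop` and `del` modify the DataFrame in place. When you want to remove a column without touching the original, `df.drop` returns a new DataFrame instead. A small self-contained sketch (the toy data here is illustrative, not the df above):

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 2.], 'b': [10., 20.], 'd': [True, False]})

# drop returns a copy without the named columns; df itself is unchanged
smaller = df.drop(columns=['b'])

list(smaller.columns)  # ['a', 'd']
list(df.columns)       # ['a', 'b', 'd'] -- original intact
```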
# insert a new column that is a copy of column 'a'
df.insert(2, 'copy_of_a', df['a'])
df
| | a | d | copy_of_a |
|---|---|---|---|
| bird | 4.0 | True | 4.0 |
| cake | NaN | False | NaN |
| cat | 2.0 | False | 2.0 |
| dog | 1.0 | False | 1.0 |
| fruit | 3.0 | True | 3.0 |
| ice cream | NaN | False | NaN |
# insert a new column containing only the first three values of 'a' (selected by position);
# the remaining rows are filled with NaN
df['a_upper_half'] = df['a'][:3]
df
| | a | d | copy_of_a | a_upper_half |
|---|---|---|---|---|
| bird | 4.0 | True | 4.0 | 4.0 |
| cake | NaN | False | NaN | NaN |
| cat | 2.0 | False | 2.0 | 2.0 |
| dog | 1.0 | False | 1.0 | NaN |
| fruit | 3.0 | True | 3.0 | NaN |
| ice cream | NaN | False | NaN | NaN |
Note that while both methods above (df.insert and df['col'] assignment) let us add new columns to the DataFrame, only df.insert lets us specify the position of the new column.
Data Manipulation with Pandas
Pandas supports several core data manipulation operations, including filtering, subsetting, combining, and mutating data. Let's apply these tools to the iris dataset.
# Load the iris dataset from sklearn and create a corresponding DataFrame
from sklearn import datasets
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
iris_data.head()
| | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
First, let's filter the data so we keep only samples with Petal Width > 1.0.
filtered = iris_data[iris_data['Petal Width'] > 1.0]
filtered.head()
| | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 |
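Boolean masks like the one above can also be combined. A brief sketch: conditions are joined with `&` (and) or `|` (or), and each condition needs its own parentheses because of operator precedence.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data,
                         columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])

# keep samples satisfying both conditions; note the parentheses around each one
both = iris_data[(iris_data['Petal Width'] > 1.0) & (iris_data['Sepal Length'] < 6.0)]
```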
Next, we use subsetting to find the variance in the Petal Width.
petal_width = iris_data['Petal Width']
petal_width.var(axis=0)
0.5824143176733784
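One detail worth knowing here: pandas' `.var()` computes the sample variance (`ddof=1`) by default, whereas NumPy's `np.var` defaults to the population variance (`ddof=0`). A tiny sketch on toy data to make the difference concrete:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0])

s.var()           # 1.666..., sample variance (ddof=1), the pandas default
np.var(s.values)  # 1.25, population variance (ddof=0), the NumPy default
s.var(ddof=0)     # 1.25, matches NumPy when ddof is set explicitly
```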
In pandas, it's easy to work with data column by column, since any column can be pulled out by name with dataframe['column_name']. Now say we want to make a new DataFrame that adds a categorical column indicating which type of iris flower each sample is.
labels = pd.DataFrame(iris.target,columns=['Flower Type'])
labels
result = pd.concat([iris_data,labels],axis=1) # requires iterable argument so the DataFrames are in a list
result.head()
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Flower Type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
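The `axis` argument controls how `pd.concat` combines its inputs: `axis=0` (the default) stacks rows vertically, while `axis=1`, as used above, aligns on the index and adds columns side by side. A small self-contained sketch with toy DataFrames:

```python
import pandas as pd

top = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
bottom = pd.DataFrame({'x': [3, 4]}, index=[2, 3])
side = pd.DataFrame({'y': [10, 20]}, index=[0, 1])

rows = pd.concat([top, bottom])        # axis=0 (default): stack vertically -> 4 rows, 1 column
cols = pd.concat([top, side], axis=1)  # axis=1: align on the index -> 2 rows, 2 columns
```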
Lastly, mutations let you modify the data in an entire row or column. We will use mutation to perform a simple standard score (z-score) normalization, which is commonly used to scale data (though, as with any normalization, whether it is statistically appropriate depends on the data at hand).
mean = iris_data['Sepal Width'].mean()
std = iris_data['Sepal Width'].std()
iris_data['Sepal Width'] = (iris_data['Sepal Width'] - mean) / std
iris_data['Sepal Width'].head()
0    1.028611
1   -0.124540
2    0.336720
3    0.106090
4    1.259242
Name: Sepal Width, dtype: float64
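The same mutation can be applied to every column at once, since arithmetic on a DataFrame broadcasts column-wise. A sketch that z-scores the entire (freshly loaded) iris DataFrame in one expression:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data,
                         columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])

# subtract each column's mean and divide by its standard deviation;
# every column now has mean ~0 and standard deviation 1
normalized = (iris_data - iris_data.mean()) / iris_data.std()
```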