Pandas

http://pandas.pydata.org

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

The primary data structures of pandas, Series (1-dimensional), DataFrame (2-dimensional), and Panel (3-dimensional, since deprecated in favor of DataFrames with a MultiIndex), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

In [1]:
import numpy as np
import pandas as pd
np.__version__, pd.__version__
Out[1]:
('1.9.1', '0.15.1')

Overview

Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [2]:
pd.Series([1,3,5,np.nan,6,8])
Out[2]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
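An explicit index can also be supplied instead of the default integer labels; a minimal sketch (not from the original session):

```python
import pandas as pd

# an explicit index replaces the default 0..n-1 labels
s = pd.Series([1, 3, 5], index=["a", "b", "c"])
```

Values are then addressable by label, e.g. `s["b"]`.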

Creating a DatetimeIndex that we will use below as row labels for a DataFrame:

In [3]:
dates = pd.date_range('20130101',periods=6)

Creating a DataFrame by passing a dict of objects that can be converted to series-like:

In [4]:
df = pd.DataFrame({"A": 1.,
                    "B": pd.Timestamp("20130102"),
                    "C": pd.Series(1, index=list(xrange(4)), dtype="float32"),
                    "D": np.arange(10, 14),
                    "E": pd.Categorical(["test", "train", "test", "train"]),
                    "F": "foo"})
In [5]:
df.dtypes
Out[5]:
A           float64
B    datetime64[ns]
C           float32
D             int64
E          category
F            object
dtype: object

Viewing Data

In [6]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
df.head()
Out[6]:
A B C D
2013-01-01 -0.013521 -0.538476 -0.304598 -0.770429
2013-01-02 -1.254896 1.801479 -0.620739 0.151766
2013-01-03 -1.354482 1.254227 0.611957 -0.850364
2013-01-04 -0.498235 1.300603 0.450060 1.791270
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721
In [7]:
df.tail(2)
Out[7]:
A B C D
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721
2013-01-06 0.137794 0.174232 0.402498 1.520787
In [8]:
df.index
Out[8]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [9]:
df.columns
Out[9]:
Index([u'A', u'B', u'C', u'D'], dtype='object')
In [10]:
df.values
Out[10]:
array([[-0.01352054, -0.53847571, -0.30459809, -0.77042864],
       [-1.25489567,  1.80147894, -0.62073864,  0.15176625],
       [-1.3544823 ,  1.25422737,  0.61195665, -0.85036394],
       [-0.49823537,  1.30060341,  0.45005968,  1.79127043],
       [-0.36811699, -1.64965572,  0.78651726, -0.41172073],
       [ 0.13779431,  0.17423239,  0.40249794,  1.52078737]])
In [11]:
df.describe()
Out[11]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.558576 0.390402 0.220949 0.238552
std 0.622938 1.314319 0.555490 1.156653
min -1.354482 -1.649656 -0.620739 -0.850364
25% -1.065731 -0.360299 -0.127824 -0.680752
50% -0.433176 0.714230 0.426279 -0.129977
75% -0.102170 1.289009 0.571482 1.178532
max 0.137794 1.801479 0.786517 1.791270
In [12]:
df.sort(columns="B") # sort rows by column B (renamed to sort_values in later pandas versions)
Out[12]:
A B C D
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721
2013-01-01 -0.013521 -0.538476 -0.304598 -0.770429
2013-01-06 0.137794 0.174232 0.402498 1.520787
2013-01-03 -1.354482 1.254227 0.611957 -0.850364
2013-01-04 -0.498235 1.300603 0.450060 1.791270
2013-01-02 -1.254896 1.801479 -0.620739 0.151766

Selection

Getting

In [13]:
df["A"]
Out[13]:
2013-01-01   -0.013521
2013-01-02   -1.254896
2013-01-03   -1.354482
2013-01-04   -0.498235
2013-01-05   -0.368117
2013-01-06    0.137794
Freq: D, Name: A, dtype: float64
In [14]:
df.A
Out[14]:
2013-01-01   -0.013521
2013-01-02   -1.254896
2013-01-03   -1.354482
2013-01-04   -0.498235
2013-01-05   -0.368117
2013-01-06    0.137794
Freq: D, Name: A, dtype: float64
In [15]:
df[1:3]
Out[15]:
A B C D
2013-01-02 -1.254896 1.801479 -0.620739 0.151766
2013-01-03 -1.354482 1.254227 0.611957 -0.850364

Selection by label

The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits.

  • .loc is strictly label based (both endpoints are included)
  • .iloc is strictly integer position based (half-open interval)
  • .ix supports mixed integer and label based access (deprecated and later removed; prefer .loc or .iloc)
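As a quick illustration of the endpoint rules above (data regenerated here, so values differ from the session):

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)

by_label = df.loc["20130102":"20130104"]  # both endpoints included: 3 rows
by_position = df.iloc[1:4]                # half-open, like Python slicing: rows 1, 2, 3
```

Both expressions select the same three rows here, but only because the half-open position slice was chosen to match the inclusive label slice.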
In [16]:
df.loc[:, ["B", "C"]]
Out[16]:
B C
2013-01-01 -0.538476 -0.304598
2013-01-02 1.801479 -0.620739
2013-01-03 1.254227 0.611957
2013-01-04 1.300603 0.450060
2013-01-05 -1.649656 0.786517
2013-01-06 0.174232 0.402498
In [17]:
df.loc["20130102": "20130104", ["B", "C"]]
Out[17]:
B C
2013-01-02 1.801479 -0.620739
2013-01-03 1.254227 0.611957
2013-01-04 1.300603 0.450060

Selection by Position

In [18]:
df.iloc[1:3]
Out[18]:
A B C D
2013-01-02 -1.254896 1.801479 -0.620739 0.151766
2013-01-03 -1.354482 1.254227 0.611957 -0.850364
In [19]:
df.iloc[:, 1:3]
Out[19]:
B C
2013-01-01 -0.538476 -0.304598
2013-01-02 1.801479 -0.620739
2013-01-03 1.254227 0.611957
2013-01-04 1.300603 0.450060
2013-01-05 -1.649656 0.786517
2013-01-06 0.174232 0.402498
In [20]:
df.iloc[3:5, -1:1:-1]
Out[20]:
D C
2013-01-04 1.791270 0.450060
2013-01-05 -0.411721 0.786517
In [21]:
df.iloc[[1, 2, 4], [0, 2]]
Out[21]:
A C
2013-01-02 -1.254896 -0.620739
2013-01-03 -1.354482 0.611957
2013-01-05 -0.368117 0.786517

Boolean Indexing

In [22]:
df[df.A > 0]
Out[22]:
A B C D
2013-01-06 0.137794 0.174232 0.402498 1.520787
In [23]:
df > 0
Out[23]:
A B C D
2013-01-01 False False False False
2013-01-02 False True False True
2013-01-03 False True True False
2013-01-04 False True True True
2013-01-05 False False True False
2013-01-06 True True True True
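The isin() method is another common boolean filter; a small example with a hypothetical frame (not the df above):

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3, 4],
                    "E": ["one", "two", "three", "four"]})
# keep only rows whose E value is in the given set
filtered = df2[df2["E"].isin(["two", "four"])]
```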

Setting

In [24]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1 # add a new column, automatically expands
df
Out[24]:
A B C D F
2013-01-01 -0.013521 -0.538476 -0.304598 -0.770429 NaN
2013-01-02 -1.254896 1.801479 -0.620739 0.151766 1
2013-01-03 -1.354482 1.254227 0.611957 -0.850364 2
2013-01-04 -0.498235 1.300603 0.450060 1.791270 3
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721 4
2013-01-06 0.137794 0.174232 0.402498 1.520787 5
In [25]:
# setting values by label
df.loc[dates[0], "A"] = 0
df
Out[25]:
A B C D F
2013-01-01 0.000000 -0.538476 -0.304598 -0.770429 NaN
2013-01-02 -1.254896 1.801479 -0.620739 0.151766 1
2013-01-03 -1.354482 1.254227 0.611957 -0.850364 2
2013-01-04 -0.498235 1.300603 0.450060 1.791270 3
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721 4
2013-01-06 0.137794 0.174232 0.402498 1.520787 5
In [26]:
# setting values by position
df.iloc[0, 1] = 0
df
Out[26]:
A B C D F
2013-01-01 0.000000 0.000000 -0.304598 -0.770429 NaN
2013-01-02 -1.254896 1.801479 -0.620739 0.151766 1
2013-01-03 -1.354482 1.254227 0.611957 -0.850364 2
2013-01-04 -0.498235 1.300603 0.450060 1.791270 3
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721 4
2013-01-06 0.137794 0.174232 0.402498 1.520787 5

Missing Data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.

In [27]:
# drop any rows that have missing data
df.dropna()
Out[27]:
A B C D F
2013-01-02 -1.254896 1.801479 -0.620739 0.151766 1
2013-01-03 -1.354482 1.254227 0.611957 -0.850364 2
2013-01-04 -0.498235 1.300603 0.450060 1.791270 3
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721 4
2013-01-06 0.137794 0.174232 0.402498 1.520787 5
In [28]:
# fill in missing data
df.fillna(value=42)
Out[28]:
A B C D F
2013-01-01 0.000000 0.000000 -0.304598 -0.770429 42
2013-01-02 -1.254896 1.801479 -0.620739 0.151766 1
2013-01-03 -1.354482 1.254227 0.611957 -0.850364 2
2013-01-04 -0.498235 1.300603 0.450060 1.791270 3
2013-01-05 -0.368117 -1.649656 0.786517 -0.411721 4
2013-01-06 0.137794 0.174232 0.402498 1.520787 5
In [29]:
df.isnull()
Out[29]:
A B C D F
2013-01-01 False False False False True
2013-01-02 False False False False False
2013-01-03 False False False False False
2013-01-04 False False False False False
2013-01-05 False False False False False
2013-01-06 False False False False False
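Besides dropping rows or filling with a constant, missing values can be forward-filled or interpolated; a sketch (not from the original session):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill()          # carry the last valid value forward
smoothed = s.interpolate()  # linear interpolation between valid values
```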

Operations

Stats

In [30]:
df.mean() # column-wise mean; same as df.mean(axis=0)
Out[30]:
A   -0.556323
B    0.480148
C    0.220949
D    0.238552
F    3.000000
dtype: float64
In [31]:
df.mean(axis=1) # row-wise mean
Out[31]:
2013-01-01   -0.268757
2013-01-02    0.215522
2013-01-03    0.332268
2013-01-04    1.208740
2013-01-05    0.471405
2013-01-06    1.447062
Freq: D, dtype: float64
In [32]:
df["F"].value_counts() # histogram
Out[32]:
5    1
4    1
3    1
2    1
1    1
dtype: int64

Apply

In [33]:
df.apply(np.cumsum) # along axis=0
Out[33]:
A B C D F
2013-01-01 0.000000 0.000000 -0.304598 -0.770429 NaN
2013-01-02 -1.254896 1.801479 -0.925337 -0.618662 1
2013-01-03 -2.609378 3.055706 -0.313380 -1.469026 3
2013-01-04 -3.107613 4.356310 0.136680 0.322244 6
2013-01-05 -3.475730 2.706654 0.923197 -0.089477 10
2013-01-06 -3.337936 2.880886 1.325695 1.431311 15
In [34]:
df.apply(lambda x: x.max() - x.min()) # calculate the range
Out[34]:
A    1.492277
B    3.451135
C    1.407256
D    2.641634
F    4.000000
dtype: float64
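apply also accepts axis=1 to work row-wise rather than column-wise; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
# axis=1 applies the function to each row instead of each column
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)
```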

Merge

In [35]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[35]:
0 1 2 3
0 -0.221306 1.403735 -1.121338 -1.323353
1 1.557017 -2.327843 0.161665 -0.366561
2 -0.778079 1.494476 0.441979 -0.486226
3 -0.755218 -0.789131 -0.096869 1.275961
4 -1.923050 1.305421 2.322076 -1.448612
5 1.264380 0.815069 -0.809809 -1.241545
6 0.871387 -2.283768 -0.159607 -1.296561
7 0.382978 -0.324404 -1.142943 0.029581
8 0.490227 0.500010 0.904099 -0.193891
9 0.668668 -0.633312 -0.190546 0.732018
In [36]:
pieces = [df[:3], df[3:7], df[7:]] # break it into pieces
pd.concat(pieces) # concatenates rows
Out[36]:
0 1 2 3
0 -0.221306 1.403735 -1.121338 -1.323353
1 1.557017 -2.327843 0.161665 -0.366561
2 -0.778079 1.494476 0.441979 -0.486226
3 -0.755218 -0.789131 -0.096869 1.275961
4 -1.923050 1.305421 2.322076 -1.448612
5 1.264380 0.815069 -0.809809 -1.241545
6 0.871387 -2.283768 -0.159607 -1.296561
7 0.382978 -0.324404 -1.142943 0.029581
8 0.490227 0.500010 0.904099 -0.193891
9 0.668668 -0.633312 -0.190546 0.732018
In [37]:
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
left
Out[37]:
key lval
0 foo 1
1 foo 2
In [38]:
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right
Out[38]:
key rval
0 foo 4
1 foo 5
In [39]:
pd.merge(left, right, on="key") # this is an SQL style merge / join
Out[39]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
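With non-identical keys, the how parameter controls which keys survive the merge; a sketch with hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")               # only keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN where absent
```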
In [40]:
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])
s = df.iloc[3]
df.append(s, ignore_index=True) # repeat row 3 as a new row (append was later removed; use pd.concat)
Out[40]:
A B C D
0 -0.802930 1.064959 -1.456150 -1.254707
1 0.175702 -0.855831 -0.528988 1.425909
2 0.948361 0.180468 -0.672670 0.665948
3 0.299942 0.084080 -0.548596 -0.792614
4 0.907037 1.929694 2.267796 0.116309
5 -0.576249 0.383789 0.285241 -1.926414
6 1.234840 -0.428826 1.340080 -1.132456
7 -0.795434 0.634670 0.396663 -0.578337
8 0.299942 0.084080 -0.548596 -0.792614

But beware:

In [41]:
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])
s = df.iloc[3]
df2 = df.append(s) # repeat row 3 "as is" - the duplicated index label is probably not what you want!
df2
Out[41]:
A B C D
0 -0.914665 -0.866787 0.092680 0.413569
1 -0.134892 -0.473999 -0.421309 0.039511
2 0.195560 -0.695880 -0.016367 1.943548
3 -0.174682 1.584903 -0.276233 1.290800
4 1.888258 0.216033 1.147503 -1.190807
5 2.687998 0.648702 1.939244 -0.435992
6 0.036630 -0.920107 -0.687598 -0.576330
7 -0.271766 2.123034 -0.189470 -0.209918
3 -0.174682 1.584903 -0.276233 1.290800
In [42]:
df2.index[-1]
Out[42]:
3
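One way to recover from the duplicated label, sketched here with a minimal frame, is reset_index:

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0.1, 0.2, 0.3]}, index=[0, 1, 1])  # duplicated label 1
dup = df2.loc[1]                    # label selection returns *both* matching rows
clean = df2.reset_index(drop=True)  # back to a unique default integer index
```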

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
In [43]:
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar",
                         "foo", "bar", "foo", "foo"],
                   "B": ["one", "one", "two", "three",
                         "two", "two", "one", "three"],
                   "C": np.random.randn(8),
                   "D": np.random.randn(8)})
df
Out[43]:
A B C D
0 foo one -1.179360 2.020667
1 bar one 0.138223 0.917891
2 foo two 0.481357 -0.892525
3 bar three 0.425901 0.702857
4 foo two -0.257032 -0.439499
5 bar two -0.659646 0.116992
6 foo one -0.080860 -1.159786
7 foo three -0.920501 0.383545
In [44]:
df.groupby("A").sum()
Out[44]:
C D
A
bar -0.095521 1.737740
foo -1.956396 -0.087598
In [45]:
df.groupby(["A", "B"]).sum() # hierarchical
Out[45]:
C D
A B
bar one 0.138223 0.917891
three 0.425901 0.702857
two -0.659646 0.116992
foo one -1.260220 0.860882
three -0.920501 0.383545
two 0.224325 -1.332024
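agg() can apply a different reduction to each column; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0],
                   "D": [10.0, 20.0, 30.0, 40.0]})
# a different reduction for each column
res = df.groupby("A").agg({"C": "sum", "D": "mean"})
```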

Categoricals

In [46]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "f"]})
df
Out[46]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 f
In [47]:
df["grade"] = df["raw_grade"].astype("category") # convert the raw grades to a categorical data type
In [48]:
df["grade"]
Out[48]:
0    a
1    b
2    b
3    a
4    a
5    f
Name: grade, dtype: category
Categories (3, object): [a < b < f]
In [49]:
df["grade"].cat.categories
Out[49]:
Index([u'a', u'b', u'f'], dtype='object')
In [50]:
df["grade"].cat.categories = ["very good", "good", "very bad"] # Rename the categories to more meaningful names
In [51]:
df["grade"]
Out[51]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good < good < very bad]
In [52]:
try:
    df.loc[0, "grade"] = "acceptable" # not an existing category
except ValueError:
    print("This won't work...")
This won't work...
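Because the categories above are ordered, sorting respects that order rather than alphabetical order. A sketch using sort_values (the later name for the sort method):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["good", "very bad", "very good"],
                             categories=["very good", "good", "very bad"],
                             ordered=True))
# sorting follows the declared category order, not lexical order
ordered = s.sort_values()
```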

Plotting

In [53]:
%matplotlib inline
from matplotlib import pyplot as plt
In [54]:
df = pd.DataFrame(np.random.randn(100, 4),
                  columns=["A", "B", "C", "D"])
df = df.cumsum()
In [55]:
plt.figure()
df.plot()
plt.legend(loc="best")
Out[55]:
<matplotlib.legend.Legend at 0x112239490>
<matplotlib.figure.Figure at 0x111b47a10>

Input / Output

In [56]:
df.to_csv("foo.csv")
In [57]:
df2 = pd.read_csv("foo.csv")
df2.loc[:10]
Out[57]:
Unnamed: 0 A B C D
0 0 -0.333823 0.776894 0.673178 -0.699580
1 1 -0.497605 0.298587 -0.145233 -0.894474
2 2 1.048358 -0.054480 -0.968650 -1.497525
3 3 0.832802 -0.245384 -2.532872 -2.051829
4 4 1.223810 -0.135948 -3.295947 -3.371382
5 5 1.164426 0.472430 -3.129992 -2.538316
6 6 -0.149887 -0.306995 -2.482947 -0.707087
7 7 -0.099956 0.000243 -0.673620 0.194677
8 8 -0.540222 -0.750738 -1.383149 -0.679118
9 9 -2.111503 -0.445074 -0.919335 -2.912505
10 10 -1.489700 -2.161447 -2.477718 -2.165269
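The "Unnamed: 0" column appears because the index was written to the file; passing index_col=0 to read_csv restores it as the index. A self-contained sketch using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5]})
buf = io.StringIO()
df.to_csv(buf)      # the index is written as the first column
buf.seek(0)
# index_col=0 reads it back as the index instead of an "Unnamed: 0" column
df2 = pd.read_csv(buf, index_col=0)
```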

Some Recipes

In [58]:
df = pd.DataFrame({"AAA": [4, 5, 6, 7], 
                   "BBB": [10, 20, 30, 40], 
                   "CCC": [100, 50, -30, -50]})
df
Out[58]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [59]:
df.loc[df.AAA >= 5, "BBB"] = -1 # conditional assignment via a boolean mask
df
Out[59]:
AAA BBB CCC
0 4 10 100
1 5 -1 50
2 6 -1 -30
3 7 -1 -50
In [60]:
df.loc[df.AAA <= 5] # select the rows meeting a condition
Out[60]:
AAA BBB CCC
0 4 10 100
1 5 -1 50
In [61]:
df = pd.DataFrame({"AAA": [4, 5, 6, 7], 
                   "BBB": [10, 20, 30, 40], 
                   "CCC": [100, 50, -30, -50]})
Crit1 = df.AAA <= 5.5
Crit2 = df.BBB == 10.0
Crit3 = df.CCC > -40.0
AllCrit = Crit1 & Crit2 & Crit3
df.loc[AllCrit]
Out[61]:
AAA BBB CCC
0 4 10 100
In [62]:
df.loc[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[62]:
AAA BBB CCC
0 4 10 100
2 6 30 -30

Creating new columns

In [63]:
df = pd.DataFrame({"AAA": [1, 2, 1, 3], 
                   "BBB": [1, 1, 2, 2], 
                   "CCC": [2,1,3,1]})
df
Out[63]:
AAA BBB CCC
0 1 1 2
1 2 1 1
2 1 2 3
3 3 2 1

Efficiently and dynamically creating new columns using applymap:

In [64]:
source_cols = df.columns 
new_cols = [str(x) + "_cat" for x in source_cols]
categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}
df[new_cols] = df[source_cols].applymap(categories.get)
df
Out[64]:
AAA BBB CCC AAA_cat BBB_cat CCC_cat
0 1 1 2 Alpha Alpha Beta
1 2 1 1 Beta Alpha Alpha
2 1 2 3 Alpha Beta Charlie
3 3 2 1 Charlie Beta Alpha

More grouping tricks:

In [65]:
df = pd.DataFrame({"animal": "cat dog cat fish dog cat cat".split(),
                   "size": list("SSMMMLL"),
                   "weight": [8, 10, 11, 1, 20, 12, 12],
                   "adult": [False] * 5 + [True] * 2})
df
Out[65]:
adult animal size weight
0 False cat S 8
1 False dog S 10
2 False cat M 11
3 False fish M 1
4 False dog M 20
5 True cat L 12
6 True cat L 12
In [66]:
df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
Out[66]:
animal
cat       L
dog       M
fish      M
dtype: object
In [67]:
gb = df.groupby(["animal"])
gb.get_group("cat") # same as df.loc[df.animal == "cat"]
Out[67]:
adult animal size weight
0 False cat S 8
2 False cat M 11
5 True cat L 12
6 True cat L 12
In [68]:
def growup(x):
    # note: x["size"], not x.size -- DataFrame.size is the element-count attribute,
    # so x.size == "S" would silently compare against that number instead
    avg_weight = sum(x[x["size"] == "S"].weight * 1.5)
    avg_weight += sum(x[x["size"] == "M"].weight * 1.25)
    avg_weight += sum(x[x["size"] == "L"].weight)
    avg_weight /= len(x)
    return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])

gb.apply(growup)
Out[68]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
In [69]:
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})
df
Out[69]:
A B
0 1 1
1 1 -1
2 2 1
3 2 2
In [70]:
gb = df.groupby("A") # replacing negative values with mean of the rest of a group
def replace(g):
  mask = g < 0
  g.loc[mask] = g[~mask].mean()
  return g
            
gb.transform(replace)
Out[70]:
B
0 1
1 1
2 1
3 2
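transform always returns a result indexed like the original frame; another common use, sketched here, is demeaning within groups:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})
# subtract each group's mean, keeping the original row order and index
demeaned = df.groupby("A")["B"].transform(lambda g: g - g.mean())
```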