Python Pandas Cheat Sheet

As a data analyst, these are common tasks I perform using Pandas

Setting display output options

In [70]:
import pandas as pd
pd.set_option("display.max_rows",1000)    # or pd.options.display.max_rows=1000
pd.set_option("display.max_columns",20)   # or pd.options.display.max_columns=20 
pd.set_option('precision',7)
pd.set_option('large_repr', 'truncate')
In [4]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[4]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [5]:
data.describe()
Out[5]:
ounces
count 9.000000
mean 6.055556
std 2.855307
min 3.000000
25% 4.000000
50% 6.000000
75% 7.500000
max 12.000000

df.info() shows data types, number of rows and columns, and memory usage of your data frame

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 2 columns):
group     9 non-null object
ounces    9 non-null float64
dtypes: float64(1), object(1)
memory usage: 216.0+ bytes

Sorting More Than One Column

Just pass a list of columns you want to sort and pass corresponding list of True/False to the ascending parameter

In [16]:
data.sort_values(by=['group','ounces'], ascending=[False, True], inplace=True)
data
Out[16]:
group ounces
6 c 3.0
7 c 5.0
8 c 6.0
3 b 6.0
4 b 7.5
5 b 8.0
1 a 3.0
0 a 4.0
2 a 12.0

Removing duplicates

In [19]:
import pandas as pd
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 4, 4]})
In [20]:
data.sort_values(by='k2')
data
Out[20]:
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4
In [21]:
data.drop_duplicates()  # by default, duplicate is defined by all columns
Out[21]:
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
5 two 4

Define duplicates by column name(s):

In [22]:
data.drop_duplicates(subset='k1')  # duplicate in column k1 only
Out[22]:
k1 k2
0 one 3
3 two 3

Creating a new column based on values from another column

In [5]:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[5]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal using a dictionary and a function:

In [7]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

def meat2animal(series):
    if series["food"]=='bacon':
        return 'pig'
    elif series["food"]=='pulled pork':
        return 'pig'
    elif series["food"]=='pastrami':
        return 'cow'
    elif series["food"]=='corned beef':
        return 'cow'
    elif series["food"]=='honey ham':
        return 'pig'
    else:
        return 'salmon'
In [8]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data
Out[8]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

Or we could use apply() and apply the meat2animal() function to the new column:

In [9]:
# axis=1 means to apply the function for each row, but I prefer to use axis='columns' instead
data['animal2'] = data.apply(meat2animal,axis='columns')
data
Out[9]:
food ounces animal animal2
0 bacon 4.0 pig pig
1 pulled pork 3.0 pig pig
2 bacon 12.0 pig pig
3 Pastrami 6.0 cow salmon
4 corned beef 7.5 cow cow
5 Bacon 8.0 pig salmon
6 pastrami 3.0 cow cow
7 honey ham 5.0 pig pig
8 nova lox 6.0 salmon salmon

Use apply() along with lambda function to create new columns:

In [38]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=10)
data = {'value': [value[0] for value in np.random.randn(10,1).tolist()], 'date': dates }
df = pd.DataFrame.from_dict(data)
df
Out[38]:
date value
0 2013-01-01 -0.164750
1 2013-01-02 -0.626383
2 2013-01-03 0.537509
3 2013-01-04 0.659939
4 2013-01-05 0.464557
5 2013-01-06 0.919569
6 2013-01-07 0.955645
7 2013-01-08 1.140392
8 2013-01-09 1.108809
9 2013-01-10 -1.141781
In [39]:
df.dtypes
Out[39]:
date     datetime64[ns]
value           float64
dtype: object

Since date column is an actual datetime object, we can leverage its strftime() function:

In [36]:
df['month'] = df['date'].apply(lambda x: x.strftime('%Y-%m'))
df['year'] = df['date'].apply(lambda x: x.strftime('%Y'))
df
Out[36]:
date value month year
0 2013-01-01 1.218454 2013-01 2013
1 2013-01-02 -1.624252 2013-01 2013
2 2013-01-03 -0.378726 2013-01 2013
3 2013-01-04 0.181450 2013-01 2013
4 2013-01-05 -0.606842 2013-01 2013
5 2013-01-06 -0.891220 2013-01 2013
6 2013-01-07 -1.478226 2013-01 2013
7 2013-01-08 1.084743 2013-01 2013
8 2013-01-09 -1.789110 2013-01 2013
9 2013-01-10 -0.418485 2013-01 2013

We can also use dataframe.assign() function which was added in pandas version 0.16

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df.assign(ratio = df['data1'] / df['data2'])
Out[4]:
data1 data2 ratio
0 0.246394 -0.496658 -0.496105
1 1.690199 1.245658 1.356873
2 -0.132093 0.834331 -0.158322
3 -0.046662 0.717323 -0.065050
4 -0.355658 0.439520 -0.809196

Removing or dropping a column

In [10]:
data.drop('animal2', axis='columns', inplace=True)  # dropping/removing a column
data
Out[10]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

Replacing Values in a Series

In [4]:
import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
Out[4]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

If you want to replace -999 with NaN:

In [5]:
data.replace(-999, np.nan, inplace=True)
data
Out[5]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list then the substitute value:

In [9]:
import pandas as pd
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
Out[9]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
In [10]:
data.replace([-999, -1000], np.nan, inplace=True)
data
Out[10]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

Filling or Dropping NaN ("Not a Number") Values

Dropping rows that have NaN:

In [8]:
data.dropna()
Out[8]:
0    1.0
2    2.0
5    3.0
dtype: float64

Filling/replacing NaN values with something else (replace NaN with 0/zero):

In [12]:
data.fillna(value=0)
Out[12]:
0    1.0
1    0.0
2    2.0
3    0.0
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes

In [47]:
import pandas as pd
data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data
Out[47]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

3 rows × 4 columns

In [48]:
data.index.map(str.upper)
Out[48]:
array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)
In [51]:
data.index = data.index.map(str.upper)
data
Out[51]:
one two three four
OHIO 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11

3 rows × 4 columns

If you want to create a transformed version of a data set without modifying the original, a useful method is rename:

In [65]:
data.rename(index=str.title, columns=str.upper, inplace=True)  # str.title means to make the 1st letter capitalized only
data
Out[65]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

3 rows × 4 columns

Renaming Indexes or Columns

rename() can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [46]:
import pandas as pd
data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data
Out[46]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [47]:
data.rename(index={'Ohio': 'INDIANA'},columns={'three': 'peekaboo'},inplace=True)
data
Out[47]:
one two peekaboo four
INDIANA 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

You can also apply str functions to modify the index or column labels

In [48]:
data.rename(index=str.title, columns=str.upper, inplace=True)  # str.title means to make the 1st letter capitalized only
data
Out[48]:
ONE TWO PEEKABOO FOUR
Indiana 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

Binning Data Into Intervals

In [71]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so, you have to use cut() function:

In [73]:
bins = [18, 25, 35, 60, 100]
In [75]:
cats = pd.cut(ages, bins)
cats
Out[75]:
  (18, 25]
  (18, 25]
  (18, 25]
  (25, 35]
  (18, 25]
  (18, 25]
  (35, 60]
  (25, 35]
 (60, 100]
  (35, 60]
  (35, 60]
  (25, 35]
Levels (4): Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype=object)
In [77]:
cats.labels
Out[77]:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int32)
In [78]:
cats.levels
Out[78]:
Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')
In [79]:
pd.value_counts(cats)
Out[79]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

If you want the right value to be exclusive in the intervals:

In [80]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[80]:
  [18, 26)
  [18, 26)
  [18, 26)
  [26, 36)
  [18, 26)
  [18, 26)
  [36, 61)
  [26, 36)
 [61, 100)
  [36, 61)
  [36, 61)
  [26, 36)
Levels (4): Index(['[18, 26)', '[26, 36)', '[36, 61)', '[61, 100)'], dtype=object)

You can also pass your own bin names by passing a list or array to the labels option:

In [83]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
pd.value_counts(pd.cut(ages, bins, labels=group_names))
Out[83]:
Youth         5
YoungAdult    3
MiddleAged    3
Senior        1
dtype: int64

Grouping Data ( see also value_count() and pivot_table() )

In [2]:
import pandas as pd

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[2]:
data1 data2 key1 key2
0 1.378812 0.622186 a one
1 1.821631 0.652364 a two
2 -0.961298 -1.039626 b one
3 0.360465 0.671288 b two
4 -1.140391 0.691223 a one
In [3]:
grouped = df['A'].groupby(df['key1'])
grouped.mean()
Out[3]:
key1
a      -0.388562
b       0.167782
dtype: float64
In [1]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=10)
df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=list('ABCD'))
df
Out[1]:
A B C D
2013-01-01 1.183730 1.295833 1.085529 -0.279914
2013-01-02 -1.104646 0.539370 -0.628320 -0.739858
2013-01-03 0.688594 0.691522 -0.691798 1.003799
2013-01-04 -0.063884 0.049397 -0.658725 -1.715724
2013-01-05 0.091806 -0.677499 0.871060 -0.258136
2013-01-06 -0.011150 1.042270 -0.332827 0.986962
2013-01-07 0.628330 -1.179083 0.088909 0.444550
2013-01-08 -1.565315 1.049263 -1.243827 1.706111
2013-01-09 0.313865 0.792014 -0.200633 0.715121
2013-01-10 -0.061469 1.359909 1.824914 -1.684415

New in version 0.20 - agg() function

In [2]:
df.agg('sum')
Out[2]:
A    0.099859
B    4.962997
C    0.114282
D    0.178495
dtype: float64
In [3]:
df.agg(['sum', 'min'])
Out[3]:
A B C D
sum 0.099859 4.962997 0.114282 0.178495
min -1.565315 -1.179083 -1.243827 -1.715724
In [4]:
df.A.agg('sum')
Out[4]:
0.099859321519735381
In [5]:
df[['A','C']].agg('sum')
Out[5]:
A    0.099859
C    0.114282
dtype: float64

Sub-selecting or Slicing a Data Frame

Filtering by label name: [loc]
Filtering by index row and/or column: [iloc]

In [1]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[1]:
A B C D
2013-01-01 0.522775 -2.031792 -0.266824 1.780004
2013-01-02 0.122243 0.934587 -0.037507 0.425352
2013-01-03 -1.023274 -0.921369 -0.278390 1.650902
2013-01-04 0.139989 0.503434 -0.161022 2.345143
2013-01-05 0.412856 -0.338236 -0.665537 -0.970697
2013-01-06 1.029252 -1.025318 -1.284003 -0.243114

Getting first n rows of data frame using index slicing syntax

In [2]:
df[0:3]  # get first 3 rows of the data frame
Out[2]:
A B C D
2013-01-01 0.522775 -2.031792 -0.266824 1.780004
2013-01-02 0.122243 0.934587 -0.037507 0.425352
2013-01-03 -1.023274 -0.921369 -0.278390 1.650902

Slicing based on data frame's index range

In [3]:
df['20130102':'20130104'] # get rows by index range
Out[3]:
A B C D
2013-01-02 0.122243 0.934587 -0.037507 0.425352
2013-01-03 -1.023274 -0.921369 -0.278390 1.650902
2013-01-04 0.139989 0.503434 -0.161022 2.345143

Slicing based on column labels/names using loc

In [4]:
df.loc[:,['A','B']]  # syntax is: df.loc[rows_index, cols_index]
Out[4]:
A B
2013-01-01 0.522775 -2.031792
2013-01-02 0.122243 0.934587
2013-01-03 -1.023274 -0.921369
2013-01-04 0.139989 0.503434
2013-01-05 0.412856 -0.338236
2013-01-06 1.029252 -1.025318

Slicing based on row index label and column label combined using loc

In [5]:
df.loc['20130102':'20130104',['A','B']]
Out[5]:
A B
2013-01-02 0.122243 0.934587
2013-01-03 -1.023274 -0.921369
2013-01-04 0.139989 0.503434

Resampling Time Series Data

For complete list of resampling time periods - link

In [11]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=180)
df = pd.DataFrame(np.random.randn(180,1),index=dates,columns=list('A'))
df.head()
Out[11]:
A
2013-01-01 -0.100352
2013-01-02 0.533593
2013-01-03 -0.709328
2013-01-04 0.561621
2013-01-05 1.285367

Partial date filtering using loc, get just January's data:

In [14]:
df.loc['2013-01']
Out[14]:
A
2013-01-01 -0.100352
2013-01-02 0.533593
2013-01-03 -0.709328
2013-01-04 0.561621
2013-01-05 1.285367
2013-01-06 -0.359507
2013-01-07 -1.316567
2013-01-08 -0.998065
2013-01-09 0.243157
2013-01-10 -1.376942
2013-01-11 0.513412
2013-01-12 0.190791
2013-01-13 0.939814
2013-01-14 2.952349
2013-01-15 -0.133369
2013-01-16 0.827884
2013-01-17 0.796712
2013-01-18 0.090352
2013-01-19 0.407645
2013-01-20 0.202147
2013-01-21 1.214386
2013-01-22 0.765865
2013-01-23 -0.026111
2013-01-24 1.300111
2013-01-25 -0.776675
2013-01-26 0.568263
2013-01-27 1.320500
2013-01-28 1.055734
2013-01-29 -1.776905
2013-01-30 0.352055
2013-01-31 -0.020482

Resample daily data to monthly data

In [12]:
monthly = df.resample('M').sum()
monthly.head(10)
Out[12]:
A
2013-01-31 8.527454
2013-02-28 8.428312
2013-03-31 -6.245119
2013-04-30 -6.931797
2013-05-31 4.965161
2013-06-30 -2.025121

But now, I prefer to see YYYY-MM format - for complete list of date formats - link

In [10]:
monthly.index = monthly.index.strftime('%Y-%m')
monthly.head(20)
Out[10]:
A
2013-01 0.660669
2013-02 -6.322408
2013-03 1.345849
2013-04 -3.176901
2013-05 -3.136590
2013-06 -1.975411

Slicing based on index position of the row or column using iloc

In [6]:
df.iloc[3]  # returns 4th row (index=3) of the data frame
Out[6]:
A    0.139989
B    0.503434
C   -0.161022
D    2.345143
Name: 2013-01-04 00:00:00, dtype: float64
In [7]:
df.iloc[3:5,0:2]  # returns specific range of rows and columns of the data frame
Out[7]:
A B
2013-01-04 0.139989 0.503434
2013-01-05 0.412856 -0.338236
In [8]:
df.iloc[[1,5],[0,2]] # returns specific rows and columns using lists containing columns or row indexes
Out[8]:
A C
2013-01-02 0.122243 -0.037507
2013-01-06 1.029252 -1.284003
In [9]:
df.iloc[1:3,:]  # returning specific rows and returning all columns
Out[9]:
A B C D
2013-01-02 0.122243 0.934587 -0.037507 0.425352
2013-01-03 -1.023274 -0.921369 -0.278390 1.650902
In [10]:
df.iloc[:,1:3]  # returning all rows and specific columns
Out[10]:
B C
2013-01-01 -2.031792 -0.266824
2013-01-02 0.934587 -0.037507
2013-01-03 -0.921369 -0.278390
2013-01-04 0.503434 -0.161022
2013-01-05 -0.338236 -0.665537
2013-01-06 -1.025318 -1.284003
In [11]:
df.iloc[1,1]  # getting secific scalar/single value
Out[11]:
0.93458693139263815

The boolean operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

In [33]:
df[df.A > -0.5]  # or df[df["A"] > -0.5], this syntax works when there is a space in the column name
Out[33]:
A B C D
2013-01-03 2.351035 0.085318 -2.265766 0.432268
2013-01-04 0.447390 0.205224 -0.614702 0.212125
2013-01-05 -0.455682 -2.765288 0.475253 0.432910
2013-01-06 1.000494 -0.462591 -0.638053 0.481087

or I've seen some people do boolean filtering by passing "criteria" variable to the data frame like so:

In [34]:
criteria = df.A > -0.5
df[criteria]
Out[34]:
A B C D
2013-01-03 2.351035 0.085318 -2.265766 0.432268
2013-01-04 0.447390 0.205224 -0.614702 0.212125
2013-01-05 -0.455682 -2.765288 0.475253 0.432910
2013-01-06 1.000494 -0.462591 -0.638053 0.481087
In [2]:
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2
Out[2]:
A B C D E
2013-01-01 0.618255 0.699026 -0.857985 -0.167018 one
2013-01-02 1.193699 0.304388 0.036477 0.073046 one
2013-01-03 -1.409751 0.599982 1.359995 0.901001 two
2013-01-04 0.165882 -0.156601 -0.764553 0.665824 three
2013-01-05 0.190956 -0.515755 -0.866012 -0.908422 four
2013-01-06 0.849617 -1.473412 0.013182 0.443061 three
In [36]:
df2[df2['E'].isin(['two','four'])]  # read as "return rows where column E contains two or four"
Out[36]:
A B C D E
2013-01-03 2.351035 0.085318 -2.265766 0.432268 two
2013-01-05 -0.455682 -2.765288 0.475253 0.432910 four

We can also do partial string matching. So let's say you don't know the exact spelling a word is you want to match, you can do this:

In [3]:
df2[df2.E.str.contains("tw|ou")]
Out[3]:
A B C D E
2013-01-03 -1.409751 0.599982 1.359995 0.901001 two
2013-01-05 0.190956 -0.515755 -0.866012 -0.908422 four

Using ~ to do a "NOT"

In [38]:
df2[~df2['E'].isin(['two','four'])]  # column E containing values not in two or four
Out[38]:
A B C D E
2013-01-01 -0.564503 -1.195000 0.441140 -0.291384 one
2013-01-02 -0.893038 -0.372430 -0.436048 -0.545141 one
2013-01-04 0.447390 0.205224 -0.614702 0.212125 three
2013-01-06 1.000494 -0.462591 -0.638053 0.481087 three

Filtering using query() method

In [52]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[52]:
A B C D
2013-01-01 -0.683630 -0.961407 1.052018 0.512382
2013-01-02 -0.755844 1.672273 0.365384 1.030149
2013-01-03 -0.532466 0.616725 1.066731 -1.056756
2013-01-04 0.153878 -1.270198 0.346587 0.150570
2013-01-05 1.645820 0.817894 0.168950 -0.833291
2013-01-06 -0.768034 1.145290 -0.612168 0.325120
In [58]:
df.query('A > C')
Out[58]:
A B C D
2013-01-05 1.64582 0.817894 0.16895 -0.833291
In [69]:
df.query('A > 0')
Out[69]:
A B C D
2013-01-04 0.153878 -1.270198 0.346587 0.150570
2013-01-05 1.645820 0.817894 0.168950 -0.833291
In [68]:
df.query('A > 0 & A < 1')
Out[68]:
A B C D
2013-01-04 0.153878 -1.270198 0.346587 0.15057
In [65]:
df.query('A > B | A > C')  # where A is greater than B or A is greater than C
Out[65]:
A B C D
2013-01-01 -0.683630 -0.961407 1.052018 0.512382
2013-01-04 0.153878 -1.270198 0.346587 0.150570
2013-01-05 1.645820 0.817894 0.168950 -0.833291

Obtaining columns with partial column labels

In [40]:
import pandas as pd

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[40]:
data1 data2 key1 key2
0 0.740132 1.220225 a one
1 1.223369 0.830145 a two
2 -2.046775 -1.149754 b one
3 -3.068403 0.675124 b two
4 -0.467039 0.640412 a one
In [42]:
df.filter(like='data')
Out[42]:
data1 data2
0 0.740132 1.220225
1 1.223369 0.830145
2 -2.046775 -1.149754
3 -3.068403 0.675124
4 -0.467039 0.640412

Getting Value Counts

It is so tempting to use the groupby() function or pivot_table, but most of the time, value_counts() function is all we need.

In [1]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[1]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [3]:
data.group.value_counts()
Out[3]:
a    3
b    3
c    3
Name: group, dtype: int64

Getting Cumulative Sum

In [3]:
data.group.value_counts().cumsum()
Out[3]:
a    3
b    6
c    9
dtype: int64

Pivot table example - see also this excellent article on pivot tables

In [51]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[51]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0

Calculating the means of each group

In [3]:
data.pivot_table(values='ounces',index='group',aggfunc=np.mean)
Out[3]:
group
a        6.333333
b        7.166667
c        4.666667
Name: ounces, dtype: float64

Getting counts by group

In [5]:
data.pivot_table(values='ounces',index='group',aggfunc='count')
Out[5]:
group
a        3
b        3
c        3
Name: ounces, dtype: int64

Getting cumulative sum or running total of the group counts

In [6]:
data.pivot_table(values='ounces',index='group',aggfunc='count').cumsum()
Out[6]:
group
a        3
b        6
c        9
Name: ounces, dtype: int64

Let's get sample data from dplython library which is a clone of dplyr

In [6]:
from dplython import *
In [7]:
diamonds.head()
Out[7]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

Goal: Create a column that contains the difference between a diamond's carat and the average of the carats for that diamond's color.

Key take-away is to use the transform method of your specific column of the groupby object.

In [21]:
(diamonds.assign(carat_diff=diamonds
                 .groupby('color')['carat']
                 .transform(lambda x:x-x.mean())
                )
                .sort_values(by=['color','carat_diff'])
                .head(10)
)
Out[21]:
Unnamed: 0 carat cut color clarity depth table price x y z carat_diff
31597 31598 0.20 Ideal D VS2 61.5 57.0 367 3.81 3.77 2.33 -0.457795
31600 31601 0.20 Premium D VS2 62.3 60.0 367 3.73 3.68 2.31 -0.457795
31601 31602 0.20 Premium D VS2 61.7 60.0 367 3.77 3.72 2.31 -0.457795
38276 38277 0.21 Premium D VS2 61.6 59.0 386 3.82 3.78 2.34 -0.447795
38277 38278 0.21 Premium D VS2 60.6 60.0 386 3.85 3.81 2.32 -0.447795
38278 38279 0.21 Premium D VS2 59.1 62.0 386 3.89 3.86 2.29 -0.447795
38279 38280 0.21 Premium D VS2 58.3 59.0 386 3.96 3.93 2.30 -0.447795
54 55 0.22 Premium D VS2 59.3 62.0 404 3.91 3.88 2.31 -0.437795
28 29 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40 -0.427795
34 35 0.23 Very Good D VS1 61.9 58.0 402 3.92 3.96 2.44 -0.427795

For those curious how to do this using dplyr / dplython:

In [20]:
# Using dplyr/dplython syntax
(diamonds >>
  group_by(X.color) >>
  mutate(carat_diff = X.carat - X.carat.mean()) >>
  ungroup() >>
  arrange(X.color, X.carat_diff) >>
  head(10)
)
/home/pybokeh/envs/jupyter/lib/python3.5/site-packages/dplython/dplython.py:379: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[key] = val.applyFcns(df)
/home/pybokeh/envs/jupyter/lib/python3.5/site-packages/dplython/dplython.py:429: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  return lambda df: DplyFrame(df.sort(names))
Out[20]:
Unnamed: 0 carat cut color clarity depth table price x y z carat_diff
25981 31598 0.20 Ideal D VS2 61.5 57.0 367 3.81 3.77 2.33 -0.457795
25982 31601 0.20 Premium D VS2 62.3 60.0 367 3.73 3.68 2.31 -0.457795
25983 31602 0.20 Premium D VS2 61.7 60.0 367 3.77 3.72 2.31 -0.457795
26960 38277 0.21 Premium D VS2 61.6 59.0 386 3.82 3.78 2.34 -0.447795
26961 38278 0.21 Premium D VS2 60.6 60.0 386 3.85 3.81 2.32 -0.447795
26962 38279 0.21 Premium D VS2 59.1 62.0 386 3.89 3.86 2.29 -0.447795
26963 38280 0.21 Premium D VS2 58.3 59.0 386 3.96 3.93 2.30 -0.447795
22409 55 0.22 Premium D VS2 59.3 62.0 404 3.91 3.88 2.31 -0.437795
22404 29 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40 -0.427795
22405 35 0.23 Very Good D VS1 61.9 58.0 402 3.92 3.96 2.44 -0.427795

Here's a comparison I made between dplython and pandas.

In [1]:
import pandas as pd
data = pd.DataFrame({'group': [80, 70, 75, 75],
                     'ounces': [20, 30, 25, 25],
                     'size': [100, 100, 100, 100]}
                   )
data
Out[1]:
group ounces size
0 80 20 100
1 70 30 100
2 75 25 100
3 75 25 100
In [2]:
perc_of_rows = data.apply(lambda x : x / x.sum() * 100, axis='columns') # or axis=1
perc_of_rows
Out[2]:
group ounces size
0 40.0 10.0 50.0
1 35.0 15.0 50.0
2 37.5 12.5 50.0
3 37.5 12.5 50.0
In [38]:
perc_of_columns = data.apply(lambda x : x / x.sum() * 100, axis='index') # or axis=0
perc_of_columns
Out[38]:
group ounces size
0 26.666667 20.0 25.0
1 23.333333 30.0 25.0
2 25.000000 25.0 25.0
3 25.000000 25.0 25.0

or percent of a specific column:

In [39]:
data['Perc_of_Group_Column'] = data['group'] / data['group'].sum() * 100
data
Out[39]:
group ounces size Perc_of_Group_Column
0 80 20 100 26.666667
1 70 30 100 23.333333
2 75 25 100 25.000000
3 75 25 100 25.000000

Let's say you have a data frame with several columns and don't want to horizontally scroll to see the columns, but want to vertically scroll instead. Then that case, transpose the data frame.

In [2]:
import pandas as pd

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[2]:
data1 data2 key1 key2
0 -0.548420 0.381045 a one
1 -0.327573 0.273854 a two
2 -1.356345 -0.045686 b one
3 -0.476385 1.075987 b two
4 -0.923575 -0.973773 a one
In [3]:
df.head(1).transpose()
Out[3]:
0
data1 -0.5484204
data2 0.3810447
key1 a
key2 one

You can also do df.dtypes to accomplish similar output, but it's nice to see actual data along with the columns.

In [4]:
df.dtypes
Out[4]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
In [49]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[49]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [50]:
data.reset_index(level=0, inplace=True)
data
Out[50]:
index group ounces
0 0 a 4.0
1 1 a 3.0
2 2 a 12.0
3 3 b 6.0
4 4 b 7.5
5 5 b 8.0
6 6 c 3.0
7 7 c 5.0
8 8 c 6.0
In [6]:
import pandas as pd
col = ['a','b','c']
data = pd.DataFrame([[1,2,3],[10,11,12],[20,21,22]],columns=col)
data
Out[6]:
a b c
0 1 2 3
1 10 11 12
2 20 21 22
In [7]:
data = data.set_index('a')
data
Out[7]:
b c
a
1 2 3
10 11 12
20 21 22

To remove the index name ("a"), do:

In [8]:
data.index.name = None
data
Out[8]:
b c
1 2 3
10 11 12
20 21 22
In [1]:
import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s
Out[1]:
09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1
dtype: int64

From above, we have holes in our data. Let's fill in those missing holes using the idx date_range we specified above.

In [63]:
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
s
Out[63]:
2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
2013-09-09     0
2013-09-10     0
2013-09-11     0
2013-09-12     0
2013-09-13     0
2013-09-14     0
2013-09-15     0
2013-09-16     0
2013-09-17     0
2013-09-18     0
2013-09-19     0
2013-09-20     0
2013-09-21     0
2013-09-22     0
2013-09-23     0
2013-09-24     0
2013-09-25     0
2013-09-26     0
2013-09-27     0
2013-09-28     0
2013-09-29     0
2013-09-30     0
Freq: D, dtype: int64
In [ ]:
import pyodbc           # If using PostgreSQL, MySQL, SQLite, Redshift, MS SQL Server, or Oracle, use db.py instead
import pandas as pd
from getpass import getpass    # Module that will create a text input widget AND mask your password

userid = 'your_userid'
pw = getpass(prompt='Enter your password: ')

cnxn_string = 'DSN=your_dsn;UID=' + userid + ';PWD=' + pw

cnxn = pyodbc.connect(cnxn_string)
cursor = cnxn.cursor()

sql = """
SELECT * from your_table...
"""

df = pd.read_sql(sql, cnxn, index_col=None, parse_dates={'some_column':"%Y-%m-%d"})

# Close connections
cursor.close()
cnxn.close()

Also check out Yhat's db.py. Currently, doesn't support IBM DB2.

Certain charting libaries like ggplot expect the data to be in long format. Let's get some data in wide format.

In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
from pandas_datareader import data as web
from datetime import datetime
import pandas as pd

stocks = {'tech':['GOOGL', 'MSFT', 'LNKD', 'YHOO', 'FB','HPQ','AMZN'],
          'auto':['TM','F','GM','HMC','NSANY','HYMTF'],
          'housing':['HD','WMT','LOW']
          }

start_date = datetime(2016,1,1)
end_date = datetime(2016, 4, 21)

def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']

df = pd.DataFrame({n: get_px(n, start_date, end_date) for n in stocks['tech']})

Below is what data looks like in 'wide' format:

In [18]:
df.head(10)
Out[18]:
AMZN FB GOOGL HPQ LNKD MSFT YHOO
Date
2016-01-04 636.989990 102.220001 759.440002 11.471342 225.550003 54.409346 31.400000
2016-01-05 633.789978 102.730003 761.530029 11.580122 229.000000 54.657563 32.200001
2016-01-06 632.650024 102.970001 759.330017 11.164780 226.080002 53.664692 32.160000
2016-01-07 607.940002 97.919998 741.000000 10.650548 217.000000 51.798093 30.160000
2016-01-08 607.049988 97.330002 730.909973 10.462655 215.899994 51.956956 30.629999
2016-01-11 617.739990 97.510002 733.070007 10.531878 210.929993 51.927167 30.170000
2016-01-12 617.890015 99.370003 745.340027 10.699993 210.429993 52.403745 30.690001
2016-01-13 581.809998 95.440002 719.570007 10.462655 203.130005 51.271873 29.440001
2016-01-14 593.000000 98.370003 731.390015 10.502211 203.210007 52.731395 30.320000
2016-01-15 570.179993 94.970001 710.489990 9.997867 196.320007 50.626508 29.139999
In [19]:
# Create a column from the index values
df.reset_index(level=0, inplace=True)
df.head(10)
Out[19]:
Date AMZN FB GOOGL HPQ LNKD MSFT YHOO
0 2016-01-04 636.989990 102.220001 759.440002 11.471342 225.550003 54.409346 31.400000
1 2016-01-05 633.789978 102.730003 761.530029 11.580122 229.000000 54.657563 32.200001
2 2016-01-06 632.650024 102.970001 759.330017 11.164780 226.080002 53.664692 32.160000
3 2016-01-07 607.940002 97.919998 741.000000 10.650548 217.000000 51.798093 30.160000
4 2016-01-08 607.049988 97.330002 730.909973 10.462655 215.899994 51.956956 30.629999
5 2016-01-11 617.739990 97.510002 733.070007 10.531878 210.929993 51.927167 30.170000
6 2016-01-12 617.890015 99.370003 745.340027 10.699993 210.429993 52.403745 30.690001
7 2016-01-13 581.809998 95.440002 719.570007 10.462655 203.130005 51.271873 29.440001
8 2016-01-14 593.000000 98.370003 731.390015 10.502211 203.210007 52.731395 30.320000
9 2016-01-15 570.179993 94.970001 710.489990 9.997867 196.320007 50.626508 29.139999

This is what data looks like in 'long' format:

In [20]:
df_long = pd.melt(df, id_vars=['Date']).dropna()
df_long.head()
Out[20]:
Date variable value
0 2016-01-04 AMZN 636.989990
1 2016-01-05 AMZN 633.789978
2 2016-01-06 AMZN 632.650024
3 2016-01-07 AMZN 607.940002
4 2016-01-08 AMZN 607.049988
In [21]:
df_long.tail()
Out[21]:
Date variable value
527 2016-04-15 YHOO 36.509998
528 2016-04-18 YHOO 36.520000
529 2016-04-19 YHOO 36.330002
530 2016-04-20 YHOO 37.840000
531 2016-04-21 YHOO 37.669998
In [22]:
df_long.rename(columns={'variable': 'Stock', 'value':'Price'},inplace=True)
df_long.head(10)
Out[22]:
Date Stock Price
0 2016-01-04 AMZN 636.989990
1 2016-01-05 AMZN 633.789978
2 2016-01-06 AMZN 632.650024
3 2016-01-07 AMZN 607.940002
4 2016-01-08 AMZN 607.049988
5 2016-01-11 AMZN 617.739990
6 2016-01-12 AMZN 617.890015
7 2016-01-13 AMZN 581.809998
8 2016-01-14 AMZN 593.000000
9 2016-01-15 AMZN 570.179993

Now we can plot the stock prices. Since the data is in long format, we can take advantage of ggplot's color= parameter, which will magically make line charts with different colors for each stock. If the data was in wide format, you would have had to manually specify all the different stocks to plot or use a loop construct.

In [13]:
%matplotlib inline
import matplotlib.pyplot as plt
from ggplot import *

stock_plot = ggplot(df_long, aes(x='Date', y='Price', color='Stock')) + \
        geom_line() + \
        ylab('Price (US$)') + \
        ggtitle('Auto Stocks: Adj Close') + \
        scale_x_date(labels='%b %d %y',breaks=date_breaks('3 months') )
    
# Have to use MATPLOTLIB api to rotate x tick labels.  Hopefully some day, ggplot will implement a rotation option.
fig = stock_plot.draw()
ax = fig.gca()
labels = ax.get_xticklabels() 
for label in labels: 
    label.set_rotation(270)
plt.show()

Sadly, ggplot's development appears to have stagnated. I would recommend using other charting libraries instead.

In contrast to ggplot, there are charting libraries like plotly and bokeh that expect the data to be in wide format. If you're plotting directly from pandas data frames, the data frames must be in wide format also.

In [24]:
pivoted = df_long.pivot(index='Date', columns='Stock', values='Price')
pivoted.head()
Out[24]:
Stock AMZN FB GOOGL HPQ LNKD MSFT YHOO
Date
2016-01-04 636.989990 102.220001 759.440002 11.471342 225.550003 54.409346 31.400000
2016-01-05 633.789978 102.730003 761.530029 11.580122 229.000000 54.657563 32.200001
2016-01-06 632.650024 102.970001 759.330017 11.164780 226.080002 53.664692 32.160000
2016-01-07 607.940002 97.919998 741.000000 10.650548 217.000000 51.798093 30.160000
2016-01-08 607.049988 97.330002 730.909973 10.462655 215.899994 51.956956 30.629999
In [27]:
# since data is now in wide format, we can plot data frames directly
pivoted.plot.line()
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdc7fa50438>
In [27]:
import pandas as pd
data = pd.DataFrame({'medal': ['bronze', 'silver', 'silver', 'gold','bronze', 'bronze', 'gold', 'gold','gold'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[27]:
medal ounces
0 bronze 4.0
1 silver 3.0
2 silver 12.0
3 gold 6.0
4 bronze 7.5
5 bronze 8.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
In [28]:
data["medal"] = data["medal"].astype("category")
data.dtypes
Out[28]:
medal     category
ounces     float64
dtype: object
In [29]:
data.sort_values(by=["medal"])
Out[29]:
medal ounces
0 bronze 4.0
4 bronze 7.5
5 bronze 8.0
3 gold 6.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
1 silver 3.0
2 silver 12.0

Now define the order of the catgorical data using set_categories()

In [30]:
data["medal"].cat.set_categories([ "gold","silver","bronze"],inplace=True)
In [31]:
data.sort_values(by=["medal"])
Out[31]:
medal ounces
3 gold 6.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
1 silver 3.0
2 silver 12.0
0 bronze 4.0
4 bronze 7.5
5 bronze 8.0
In [72]:
import pandas as pd
left = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
left
Out[72]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [76]:
right = pd.DataFrame({'label': ['a','b','c'],
                     'value': ['alpha','beta','charlie']})
right
Out[76]:
label value
0 a alpha
1 b beta
2 c charlie

By default, merge() does an inner join, but you can specify 'left', 'right', or 'outer' joins. These types of joins are familiar if you have SQL background. Here's a good SQL joins visualizer.

In [85]:
inner_joined = pd.merge(left, right, how='inner', left_on='group', right_on='label')
inner_joined
Out[85]:
group ounces label value
0 a 4.0 a alpha
1 a 3.0 a alpha
2 a 12.0 a alpha
3 b 6.0 b beta
4 b 7.5 b beta
5 b 8.0 b beta
6 c 3.0 c charlie
7 c 5.0 c charlie
8 c 6.0 c charlie

NOTE: To merge 2 Series, you have to use pd.concat() instead

For more on merge, join, and concatenate, see the official docs

YouTube video on joining data sets using pandas

In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame([range(3), [0, np.NaN, 0], [0, 0, np.NaN], range(3), range(3)])
df
Out[1]:
0 1 2
0 0 1 2
1 0 NaN 0
2 0 0 NaN
3 0 1 2
4 0 1 2
In [2]:
df.isnull().any(axis=1)
Out[2]:
0    False
1     True
2     True
3    False
4    False
dtype: bool
In [3]:
df[df.isnull().any(axis=1)]
Out[3]:
0 1 2
1 0 NaN 0
2 0 0 NaN
In [10]:
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[10]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
In [13]:
df.dtypes
Out[13]:
one      object
two      object
three    object
dtype: object
In [14]:
df[['two', 'three']] = df[['two', 'three']].astype(float)
In [15]:
df.dtypes
Out[15]:
one       object
two      float64
three    float64
dtype: object

Starting with version 1.5, you can now pass a "data" parameter inside MATPLOTLIB's plot() function

Prior to version 1.5:

In [3]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"var1":[1,2,3,4,5,6], "var2":[1,2,3,4,5,6]})
plt.plot(df["var1"], df["var2"])
plt.show()

With version 1.5+

In [4]:
plt.plot("var1", "var2", data=df)
plt.show()

When making histograms, pandas knows to make sub-plots based on the by=column option:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_excel('/home/pybokeh/Downloads/temp.xlsx', sheetname='Claims')
df.hist(column='DTF', by='GRADE_SHORT')
plt.show()

For box plots, instead of making sub-plots, pandas intelligently splits the data using the by= parameter":

In [2]:
df.boxplot(column='DTF', by='GRADE_SHORT')
plt.show()

conda install bokeh or pip instal bokeh

In [28]:
from bokeh.sampledata.autompg import autompg as df
from bokeh.charts import BoxPlot, output_notebook, show, hplot
In [29]:
df.head()
Out[29]:
mpg cyl displ hp weight accel yr origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
In [31]:
box = BoxPlot(df, values='mpg', label='cyl', color='cyl',
               title="MPG Box Plot by Cylinder Count", plot_width=400, plot_height=400)

# Embed the chart within this notebook
output_notebook()

show(box)
Loading BokehJS ...
Out[31]:

<Bokeh Notebook handle for In[31]>

Need to first install plotly and cufflinks: pip install plotly and pip install cufflinks

In [33]:
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import numpy as np
import cufflinks as cf
init_notebook_mode()

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

iplot(df.iplot(asFigure=True, kind='box', title='Plotly Pandas', dimensions=(800,500)), show_link=False)

With method chaining, it eliminates the need for making intermediary variables. You can process your data frame in a series of method calls when you enclose your data frame with parenthesis. Here's a very contrived example:

In [6]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = (pd.read_csv('/home/pybokeh/temp/vehicles.csv',
                 usecols=['year', 'make', 'model', 'comb08', 'fuelType', 'fuelType1', 
                          'fuelType2', 'atvType', 'cylinders', 'VClass'])
      .rename(columns={'comb08':'combmpg'})
      .query("make in('Honda','Acura','Toyota','Lexus') "
             "& fuelType1 in('Regular Gasoline','Premium Gasoline','Midgrade Gasoline') "
             "& cylinders in(4, 6) "
             "& VClass in('Compact Cars','Subcompact Cars','Midsize Cars','Large Cars','Sport Utility','Minivan') "
             "& ~(fuelType2 in('E85','Electricity','Natural Gas','Propane'))")
      ['combmpg'].plot.hist(alpha=0.5, label='Honda Motor Co')
     )
plt.title("Histogram of Combined Hwy+City MPG", weight="bold")
plt.show()

Also check out pipe() method.

Assuming you have R and rpy2 installed, we can send a Pandas data frame to R using the rpy2 IPython notebook extension to take advantage of R's various statistical and graphing packages.

Simple example where I will pull Yen currency exchange rate data from the web and graph it using R's ggplot2 library:

In [1]:
# load the rpy2 extension
%load_ext rpy2.ipython

Let's fetch the exchange rate data:

In [2]:
import pandas as pd

pandas_df = pd.read_csv('http://research.stlouisfed.org/fred2/data/EXJPUS.txt', sep='  ', skiprows=28, engine='python',
                 parse_dates=[0], header=None, names=['Month','Value'])

Sanity check on the data:

In [3]:
pandas_df.dtypes
Out[3]:
Month    datetime64[ns]
Value           float64
dtype: object
In [4]:
pandas_df.tail()
Out[4]:
Month Value
528 2015-01-01 118.2500
529 2015-02-01 118.7600
530 2015-03-01 120.3945
531 2015-04-01 119.5095
532 2015-05-01 119.7064

We'll use this one-liner %R magic command to send the Pandas data frame to R as input using the "-i" flag. See here for other flags.

In [7]:
%R -i pandas_df

Alternatively, you can use pandas built-in facility to do this:

In [5]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

Next, we'll use %%R magic command which will make the entire code cell R executable:

In [8]:
%%R -w 800 -h 500 -u px # instead of px, you can also choose 'in', 'cm', or 'mm'
df = as.data.frame(pandas_df)  # ensure data frame that was passed in from Python will be R's data frame type
library(ggplot2)        # plotting library
library(scales)         # enables us to change the x label time scale

p = ggplot(pandas_df, aes(pandas_df$Month, pandas_df$Value))
p = p + 
    geom_line() +
    scale_x_datetime(labels=date_format("%Y-%m"), breaks="5 years", minor_breaks="1 year") +
    ggtitle("Yen to US Dollars") +
    xlab("Time") +
    ylab("Yen Value")
print(p)