Python Pandas Cheat Sheet

As a data analyst, below are common tasks I perform using Pandas. A good rule of thumb, however, is to do as much of the transformation or summarization as possible in SQL before pulling data into pandas, since pandas is limited to your local machine's resources. Another reason to favor SQL: there will likely be more people in your organization who are familiar with SQL than with Python/pandas, so limiting pandas to what cannot easily be done in SQL helps reduce "technical debt".

Setting display output options

By default, a large pandas dataframe (several columns or several rows) won't be completely rendered in your browser. You can override the defaults. However, you should not be surprised if your browser crashes if it tries to render several hundreds of thousands of rows. Here's the link to pandas display options documentation.

In [1]:
import pandas as pd
pd.set_option("display.max_rows",1000)    # or pd.options.display.max_rows=1000
pd.set_option("display.max_columns",20)   # or pd.options.display.max_columns=20 
pd.set_option('display.precision', 7)
pd.set_option('display.large_repr', 'truncate')
In [2]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[2]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [3]:
data.describe()
Out[3]:
ounces
count 9.0000000
mean 6.0555556
std 2.8553069
min 3.0000000
25% 4.0000000
50% 6.0000000
75% 7.5000000
max 12.0000000

df.info() shows data types, number of rows and columns, and memory usage of your data frame

In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   group   9 non-null      object 
 1   ounces  9 non-null      float64
dtypes: float64(1), object(1)
memory usage: 272.0+ bytes

Sorting More Than One Column

Just pass a list of the columns you want to sort by, along with a corresponding list of True/False values for the ascending parameter.

In [5]:
data.sort_values(by=['group','ounces'], ascending=[False, True], inplace=True)
data
Out[5]:
group ounces
6 c 3.0
7 c 5.0
8 c 6.0
3 b 6.0
4 b 7.5
5 b 8.0
1 a 3.0
0 a 4.0
2 a 12.0

Removing duplicates

In [6]:
import pandas as pd
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 4, 4]})
In [7]:
data.sort_values(by='k2')  # returns a sorted copy; without assignment or inplace=True, data itself is unchanged
data
Out[7]:
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4
In [9]:
data.drop_duplicates()  # by default, a "duplicate" is defined by all columns
Out[9]:
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
5 two 4

Define duplicates by column name(s):

In [10]:
data.drop_duplicates(subset='k1')  # duplicate in column k1 only
Out[10]:
k1 k2
0 one 3
3 two 3
In [11]:
data
Out[11]:
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4
In [12]:
help(data.duplicated)
Help on method duplicated in module pandas.core.frame:

duplicated(subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first') -> 'Series' method of pandas.core.frame.DataFrame instance
    Return boolean Series denoting duplicate rows.
    
    Considering certain columns is optional.
    
    Parameters
    ----------
    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns.
    keep : {'first', 'last', False}, default 'first'
        Determines which duplicates (if any) to mark.
    
        - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
        - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
        - False : Mark all duplicates as ``True``.
    
    Returns
    -------
    Series
        Boolean series for each duplicated rows.
    
    See Also
    --------
    Index.duplicated : Equivalent method on index.
    Series.duplicated : Equivalent method on Series.
    Series.drop_duplicates : Remove duplicate values from Series.
    DataFrame.drop_duplicates : Remove duplicate values from DataFrame.
    
    Examples
    --------
    Consider dataset containing ramen rating.
    
    >>> df = pd.DataFrame({
    ...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    ...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    ...     'rating': [4, 4, 3.5, 15, 5]
    ... })
    >>> df
        brand style  rating
    0  Yum Yum   cup     4.0
    1  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    
    By default, for each set of duplicated values, the first occurrence
    is set on False and all others on True.
    
    >>> df.duplicated()
    0    False
    1     True
    2    False
    3    False
    4    False
    dtype: bool
    
    By using 'last', the last occurrence of each set of duplicated values
    is set on False and all others on True.
    
    >>> df.duplicated(keep='last')
    0     True
    1    False
    2    False
    3    False
    4    False
    dtype: bool
    
    By setting ``keep`` on False, all duplicates are True.
    
    >>> df.duplicated(keep=False)
    0     True
    1     True
    2    False
    3    False
    4    False
    dtype: bool
    
    To find duplicates on specific column(s), use ``subset``.
    
    >>> df.duplicated(subset=['brand'])
    0    False
    1     True
    2    False
    3     True
    4     True
    dtype: bool

In [13]:
duplicate_rows = data[data.duplicated()]
duplicate_rows
Out[13]:
k1 k2
4 two 3
6 two 4

Creating a new column based on values from one or more columns

If you are familiar with Excel's IF/ELSE function or SQL's CASE WHEN statements, then this section is for you. For smallish data sets and not too complex IF/ELSE logic, map() and apply() are fine. But for larger data sets, you will notice a significant slowdown. For large data sets, you have a few options:

  • use pandas .loc
  • use np.where
  • use np.select
  • use swifter library
  • Dask's dataframe apply which swifter does for you under the hood

This article summarizes the first 3 options above and explains the reasons for slowdown. You may have to open the link using your browser's private/incognito mode.

If the data was originally sourced from a database table, then apply the CASE WHEN logic before bringing the data into pandas. As noted at the top of this cheat sheet, prefer SQL for transformations and summarizations where you can: it runs on the database server rather than your local machine, and more people in your organization are likely to know SQL than Python/pandas.

In [14]:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami','corned beef', 'bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[14]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal using a dictionary and also a function:

In [15]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

def meat2animal(food):
    # map() passes each individual value of the column to this function
    if food == 'bacon':
        return 'pig'
    elif food == 'pulled pork':
        return 'pig'
    elif food == 'pastrami':
        return 'cow'
    elif food == 'corned beef':
        return 'cow'
    elif food == 'honey ham':
        return 'pig'
    else:
        return 'salmon'

Using the Python dictionary from above:

In [16]:
data['animal'] = data['food'].map(meat_to_animal)
data
Out[16]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

or using the function meat2animal():

In [17]:
data['animal'] = data['food'].map(meat2animal)
data
Out[17]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

We can also modify the function slightly to accept a dataframe row and apply logic to more than one column:

In [18]:
def use2columns(row):
    if row['animal'] == 'pig' and row['ounces'] > 4:
        return 'Big Pig'
    elif row['animal'] == 'pig' and row['ounces'] <= 4:
        return 'Little Pig'
    else:
        return 'Other Animal'

NOTE: With large dataframes, applying a function across an entire row will impose a huge performance penalty.

In [19]:
data['animal2'] = data.apply(use2columns, axis='columns')
data
Out[19]:
food ounces animal animal2
0 bacon 4.0 pig Little Pig
1 pulled pork 3.0 pig Little Pig
2 bacon 12.0 pig Big Pig
3 pastrami 6.0 cow Other Animal
4 corned beef 7.5 cow Other Animal
5 bacon 8.0 pig Big Pig
6 pastrami 3.0 cow Other Animal
7 honey ham 5.0 pig Big Pig
8 nova lox 6.0 salmon Other Animal
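
For larger frames, the same two-column logic can be written without a row-wise apply(). Below is a minimal np.select() sketch (one of the faster options listed earlier); it assumes the data frame from the previous cell and produces the same animal2 column:

In [ ]:
import numpy as np

# Sketch: vectorized equivalent of use2columns(); conditions are evaluated
# top to bottom, much like a SQL CASE WHEN
conditions = [
    (data['animal'] == 'pig') & (data['ounces'] > 4),
    (data['animal'] == 'pig') & (data['ounces'] <= 4),
]
choices = ['Big Pig', 'Little Pig']
data['animal2'] = np.select(conditions, choices, default='Other Animal')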

Use apply() along with lambda function to create new columns:

In [20]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=10)
data = {'value': np.random.randn(10), 'date': dates}
df = pd.DataFrame.from_dict(data)
df
Out[20]:
value date
0 -0.9893220 2013-01-01
1 0.1526321 2013-01-02
2 1.4833127 2013-01-03
3 1.4310336 2013-01-04
4 -2.3151007 2013-01-05
5 0.4999976 2013-01-06
6 0.8193681 2013-01-07
7 -1.0378093 2013-01-08
8 -0.6673418 2013-01-09
9 0.3181329 2013-01-10
In [21]:
df.dtypes
Out[21]:
value           float64
date     datetime64[ns]
dtype: object

OBJECTIVE: Let's say you want to create a month and year column

Since the date column is an actual datetime column, we can leverage the strftime() function:

In [22]:
df['month'] = df['date'].apply(lambda x: x.strftime('%Y-%m'))
df['year'] = df['date'].apply(lambda x: x.strftime('%Y'))
df
Out[22]:
value date month year
0 -0.9893220 2013-01-01 2013-01 2013
1 0.1526321 2013-01-02 2013-01 2013
2 1.4833127 2013-01-03 2013-01 2013
3 1.4310336 2013-01-04 2013-01 2013
4 -2.3151007 2013-01-05 2013-01 2013
5 0.4999976 2013-01-06 2013-01 2013
6 0.8193681 2013-01-07 2013-01 2013
7 -1.0378093 2013-01-08 2013-01 2013
8 -0.6673418 2013-01-09 2013-01 2013
9 0.3181329 2013-01-10 2013-01 2013
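
As a side note, because the column is already datetime64, the same columns can be built with the .dt accessor, which avoids a Python-level apply(). A small sketch:

In [ ]:
# Sketch: the .dt accessor operates on the whole column at once
df['month'] = df['date'].dt.strftime('%Y-%m')
df['year'] = df['date'].dt.strftime('%Y')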

Creating a new column based on values in a separate dataframe using a combination of map, query and lambda function

A better (and usually faster) alternative is the merge() function, shown later in this cheat sheet.

In [23]:
import pandas as pd
left = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
left
Out[23]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [24]:
right = pd.DataFrame({'label': ['a','b','c'],
                     'value': ['alpha','beta','charlie']})
right
Out[24]:
label value
0 a alpha
1 b beta
2 c charlie
In [25]:
left['Label'] = left['group'].map(
    lambda group: right.query("label == @group")['value'].iloc[0]
)
left
Out[25]:
group ounces Label
0 a 4.0 alpha
1 a 3.0 alpha
2 a 12.0 alpha
3 b 6.0 beta
4 b 7.5 beta
5 b 8.0 beta
6 c 3.0 charlie
7 c 5.0 charlie
8 c 6.0 charlie
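
The per-row query() above is expensive because it scans `right` once per element. A cheaper sketch with the same result is to map through a lookup Series (or use merge(), shown later):

In [ ]:
# Sketch: build a lookup Series keyed by label, then map it
lookup = right.set_index('label')['value']
left['Label'] = left['group'].map(lookup)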

An equivalent to SQL's ROW_NUMBER / PARTITION BY

OBJECTIVE:

- To identify duplicate records with row numbering
- In the example below, a duplicate record is based on CLAIM_NUM and PART_NUM (both sorted in ascending order)
- For records with a row number other than 1, we want to set the cost amounts to zero

This example mimics SQL's row_number() function.

In [26]:
import pandas as pd
import numpy as np
In [27]:
df = pd.read_clipboard()
In [28]:
df
Out[28]:
CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD TOTAL_COST_USD
0 1 062315LH 645.33 60.34 46.30 751.97
1 1 062345LH 323.55 67.25 20.56 751.97
2 1 062015LH 303.13 80.45 35.34 751.97
3 2 062315LH 613.45 60.34 46.30 720.09
4 2 062015LH 300.25 80.45 35.34 720.09
5 3 062345LH 333.10 67.25 20.56 420.91
6 4 062345LH 300.25 80.45 46.30 427.00

Use a combination of the sort_values(), groupby(), and cumcount() functions to create the ROW_NUM column:

In [29]:
df['ROW_NUM'] = df.sort_values(by=['PART_NUM']).groupby(['CLAIM_NUM']).cumcount() + 1
df.sort_values(by=['CLAIM_NUM', 'ROW_NUM'], inplace=True)
df
Out[29]:
CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD TOTAL_COST_USD ROW_NUM
2 1 062015LH 303.13 80.45 35.34 751.97 1
0 1 062315LH 645.33 60.34 46.30 751.97 2
1 1 062345LH 323.55 67.25 20.56 751.97 3
4 2 062015LH 300.25 80.45 35.34 720.09 1
3 2 062315LH 613.45 60.34 46.30 720.09 2
5 3 062345LH 333.10 67.25 20.56 420.91 1
6 4 062345LH 300.25 80.45 46.30 427.00 1

Then use np.where() to "zero out" the cost columns whose row numbers are not 1:

In [30]:
df['PART_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['PART_COST_USD'], 0)
df['LABOR_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['LABOR_COST_USD'], 0)
df['HANDLING_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['HANDLING_COST_USD'], 0)
df['TOTAL_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['TOTAL_COST_USD'], 0)
In [31]:
df
Out[31]:
CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD TOTAL_COST_USD ROW_NUM
2 1 062015LH 303.13 80.45 35.34 751.97 1
0 1 062315LH 0.00 0.00 0.00 0.00 2
1 1 062345LH 0.00 0.00 0.00 0.00 3
4 2 062015LH 300.25 80.45 35.34 720.09 1
3 2 062315LH 0.00 0.00 0.00 0.00 2
5 3 062345LH 333.10 67.25 20.56 420.91 1
6 4 062345LH 300.25 80.45 46.30 427.00 1
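
The four np.where() calls above can also be collapsed into a single .loc assignment; a sketch, assuming the same column names:

In [ ]:
# Sketch: zero out all cost columns for non-first rows in one assignment
cost_cols = ['PART_COST_USD', 'LABOR_COST_USD', 'HANDLING_COST_USD', 'TOTAL_COST_USD']
df.loc[df['ROW_NUM'] != 1, cost_cols] = 0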

We can also use the dataframe assign() function, added in pandas 0.16, to create a new column.

In [32]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df.assign(ratio = df['data1'] / df['data2'])
Out[32]:
data1 data2 ratio
0 -0.1957977 -1.3006177 0.1505420
1 1.5043693 1.8548645 0.8110400
2 -2.1327051 -0.9218325 2.3135495
3 0.4580521 -1.6185442 -0.2830025
4 -1.4260360 0.2567198 -5.5548338
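
assign() also accepts a callable, which is handy in method chains because the lambda receives the intermediate data frame; a small sketch:

In [ ]:
# Sketch: assign() with a callable; 'd' is the data frame being assigned to
df.assign(ratio=lambda d: d['data1'] / d['data2'])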

Text to Columns Example

In [33]:
import pandas as pd

df = pd.DataFrame({'name': ['John Doe','Jane Doe','Someone Else'], 
                   'number': [12, 234, 235]
                  })
df
Out[33]:
name number
0 John Doe 12
1 Jane Doe 234
2 Someone Else 235
In [34]:
df[['first_name', 'last_name']] = df['name'].str.split(' ', expand=True)
In [35]:
df
Out[35]:
name number first_name last_name
0 John Doe 12 John Doe
1 Jane Doe 234 Jane Doe
2 Someone Else 235 Someone Else
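
One caveat: a name containing more than one space would produce extra columns and make the assignment above fail. A sketch that limits the split to the first space only:

In [ ]:
# Sketch: n=1 splits on the first space only, so everything after it stays together
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)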

Removing or dropping a column

In [36]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami','corned beef', 'bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[36]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
In [37]:
data.drop(columns=['food'], inplace=True)
data
Out[37]:
ounces
0 4.0
1 3.0
2 12.0
3 6.0
4 7.5
5 8.0
6 3.0
7 5.0
8 6.0

Replacing Values in a Series

In [38]:
import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
Out[38]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

If you want to replace -999 with NaN:

In [39]:
data.replace(-999, np.nan, inplace=True)
data
Out[39]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of values followed by the substitute value:

In [40]:
import pandas as pd
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
Out[40]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
In [41]:
data.replace([-999, -1000], np.nan, inplace=True)
data
Out[41]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

Dropping rows that have NaN:

In [42]:
data.dropna()
Out[42]:
0    1.0
2    2.0
5    3.0
dtype: float64

Filling/replacing NaN values with something else (replace NaN with 0/zero):

In [43]:
data.fillna(value=0)
Out[43]:
0    1.0
1    0.0
2    2.0
3    0.0
4    0.0
5    3.0
dtype: float64
In [44]:
import pandas as pd
from datetime import datetime
In [45]:
df = pd.read_clipboard()
df.head()
Out[45]:
Date GBP_to_USD
0 Wednesday 15 November 2017 1 GBP = 1.317 USD
1 Tuesday 14 November 2017 1 GBP = 1.3158 USD
2 Monday 13 November 2017 1 GBP = 1.312 USD
3 Sunday 12 November 2017 1 GBP = 1.32 USD
4 Saturday 11 November 2017 1 GBP = 1.32 USD

OBJECTIVE: You want to convert the dates to actual dates and extract the USD amounts

We need to convert the string dates into actual Python dates and extract just the numeric values from the GBP_to_USD column

In [46]:
df = pd.read_clipboard(converters={'Date': lambda x: datetime.strptime(x, "%A %d %B %Y"),
                                   'GBP_to_USD': lambda x: x.split(" ")[3]
                                  })
df
Out[46]:
Date GBP_to_USD
0 2017-11-15 1.317
1 2017-11-14 1.3158
2 2017-11-13 1.312
3 2017-11-12 1.32
4 2017-11-11 1.32
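
If you would rather clean the columns after reading (instead of passing converters), here is a sketch using pd.to_datetime() and the .str accessor on the raw frame from In [45] above:

In [ ]:
# Sketch: read the raw strings first, then parse/extract afterwards
raw = pd.read_clipboard()
raw['Date'] = pd.to_datetime(raw['Date'], format='%A %d %B %Y')
raw['GBP_to_USD'] = raw['GBP_to_USD'].str.split(' ').str[3].astype(float)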

Renaming Index or Columns

NOTE: In pandas land, index is equivalent to your row labels or row numbers

rename() can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [51]:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=['Ohio', 'Colorado', 'New York'], columns=['one', 'two', 'three', 'four'])
data
Out[51]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [52]:
data.rename(index={'Ohio': 'INDIANA'},columns={'three': 'peekaboo'},inplace=True)
data
Out[52]:
one two peekaboo four
INDIANA 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

You can also apply str functions to modify the index or column labels

In [53]:
data.rename(index=str.title, columns=str.upper, inplace=True)  # str.title capitalizes only the first letter of each word
data
Out[53]:
ONE TWO PEEKABOO FOUR
Indiana 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

Binning Data Into Intervals

In [54]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so, use the cut() function:

In [55]:
bins = [18, 25, 35, 60, 100]
In [56]:
cats = pd.cut(ages, bins)
cats
Out[56]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
In [59]:
pd.value_counts(cats)
Out[59]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

If you want the right value to be exclusive in the intervals:

In [60]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[60]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

You can also pass your own bin names by passing a list or array to the labels option:

In [61]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
pd.value_counts(pd.cut(ages, bins, labels=group_names))
Out[61]:
Youth         5
MiddleAged    3
YoungAdult    3
Senior        1
dtype: int64

Sub-selecting or Slicing a Data Frame

Filtering by label name: loc
Filtering by integer position (row and/or column): iloc

In [62]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[62]:
A B C D
2013-01-01 -0.6096394 -1.4504565 -0.4148328 -0.4758915
2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733
2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354
2013-01-04 1.7385611 0.0070701 0.0479491 0.8737716
2013-01-05 0.1361647 0.8550785 -0.3441474 1.5101716
2013-01-06 1.3321668 0.5266501 -0.3535708 1.8317850

Getting first n rows of data frame using index slicing syntax

In [63]:
df[0:3]  # get first 3 rows of the data frame
Out[63]:
A B C D
2013-01-01 -0.6096394 -1.4504565 -0.4148328 -0.4758915
2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733
2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354

Slicing based on data frame's index range

In [64]:
df['20130102':'20130104'] # get rows by index range
Out[64]:
A B C D
2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733
2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354
2013-01-04 1.7385611 0.0070701 0.0479491 0.8737716

Slicing based on column labels/names using loc

In [65]:
df.loc[:,['A','B']]  # syntax is: df.loc[rows_index, cols_index]
Out[65]:
A B
2013-01-01 -0.6096394 -1.4504565
2013-01-02 1.6101875 -0.7751663
2013-01-03 -2.7636373 -0.1262512
2013-01-04 1.7385611 0.0070701
2013-01-05 0.1361647 0.8550785
2013-01-06 1.3321668 0.5266501

Slicing based on row index label and column label combined using loc

In [66]:
df.loc['20130102':'20130104',['A','B']]
Out[66]:
A B
2013-01-02 1.6101875 -0.7751663
2013-01-03 -2.7636373 -0.1262512
2013-01-04 1.7385611 0.0070701

Resampling Time Series Data

For a complete list of resampling frequencies (offset aliases), see the pandas documentation.

In [67]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=180)
df = pd.DataFrame(np.random.randn(180,1),index=dates,columns=list('A'))
df.head()
Out[67]:
A
2013-01-01 1.0478521
2013-01-02 -0.8181170
2013-01-03 -0.6467086
2013-01-04 0.6722718
2013-01-05 0.6782096

Partial date filtering using loc, get just January's data:

In [68]:
df.loc['2013-01']
Out[68]:
A
2013-01-01 1.0478521
2013-01-02 -0.8181170
2013-01-03 -0.6467086
2013-01-04 0.6722718
2013-01-05 0.6782096
2013-01-06 -0.8074374
2013-01-07 1.5643882
2013-01-08 0.3723588
2013-01-09 0.4197943
2013-01-10 1.5196560
2013-01-11 -0.4932106
2013-01-12 0.4736382
2013-01-13 0.2886090
2013-01-14 0.8174301
2013-01-15 0.0395030
2013-01-16 1.5106605
2013-01-17 -2.0664942
2013-01-18 0.8735501
2013-01-19 -0.8793671
2013-01-20 1.0978811
2013-01-21 -0.5022753
2013-01-22 -1.1285848
2013-01-23 -1.2105503
2013-01-24 -0.3503544
2013-01-25 0.0748389
2013-01-26 0.8998185
2013-01-27 -0.0481116
2013-01-28 -1.7208055
2013-01-29 -0.1856816
2013-01-30 -0.0957440
2013-01-31 0.0110788

UPDATE: If your index is already a DatetimeIndex, you can also select rows with a partial date string directly, without .loc, as shown below. Note that recent pandas versions deprecate this shortcut for DataFrames, so .loc remains the safer choice:

In [69]:
df['2013-01']
Out[69]:
A
2013-01-01 1.0478521
2013-01-02 -0.8181170
2013-01-03 -0.6467086
2013-01-04 0.6722718
2013-01-05 0.6782096
2013-01-06 -0.8074374
2013-01-07 1.5643882
2013-01-08 0.3723588
2013-01-09 0.4197943
2013-01-10 1.5196560
2013-01-11 -0.4932106
2013-01-12 0.4736382
2013-01-13 0.2886090
2013-01-14 0.8174301
2013-01-15 0.0395030
2013-01-16 1.5106605
2013-01-17 -2.0664942
2013-01-18 0.8735501
2013-01-19 -0.8793671
2013-01-20 1.0978811
2013-01-21 -0.5022753
2013-01-22 -1.1285848
2013-01-23 -1.2105503
2013-01-24 -0.3503544
2013-01-25 0.0748389
2013-01-26 0.8998185
2013-01-27 -0.0481116
2013-01-28 -1.7208055
2013-01-29 -0.1856816
2013-01-30 -0.0957440
2013-01-31 0.0110788

Resample daily data to monthly data

In [70]:
monthly = df.resample('M').sum()
monthly.head(10)
Out[70]:
A
2013-01-31 1.4080967
2013-02-28 -3.1467609
2013-03-31 -4.7020674
2013-04-30 -10.9058725
2013-05-31 2.5918750
2013-06-30 11.2540062

But I prefer to see a YYYY-MM format; for a complete list of date format codes, see the strftime documentation.

In [71]:
monthly.index = monthly.index.strftime('%Y-%m')
monthly.head(20)
Out[71]:
A
2013-01 1.4080967
2013-02 -3.1467609
2013-03 -4.7020674
2013-04 -10.9058725
2013-05 2.5918750
2013-06 11.2540062
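
An alternative sketch: instead of reformatting the index with strftime() (which turns it into strings), convert the resampled index to monthly periods, which displays as YYYY-MM but stays date-aware:

In [ ]:
# Sketch: a PeriodIndex prints as YYYY-MM yet still supports date operations
monthly_periods = df.resample('M').sum().to_period('M')
monthly_periods.head()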

Slicing based on index / numerical position of the row or column using iloc

In [77]:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[77]:
A B C D
2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940
2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759
2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699
2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924
In [78]:
df.iloc[3]  # returns 4th row (index=3) of the data frame
Out[78]:
A    0.1283517
B    0.1085160
C    0.5731905
D    0.3059759
Name: 2013-01-04 00:00:00, dtype: float64
In [79]:
df.iloc[3:5,0:2]  # returns specific range of rows and columns of the data frame
Out[79]:
A B
2013-01-04 0.1283517 0.1085160
2013-01-05 -0.8193987 -0.5878911
In [80]:
df.iloc[1:3,:]  # returning specific rows and returning all columns
Out[80]:
A B C D
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940
In [81]:
df.iloc[:,1:3]  # returning all rows and specific columns
Out[81]:
B C
2013-01-01 1.1939995 1.0831108
2013-01-02 -0.7338731 0.5363171
2013-01-03 -0.9869213 -0.7907918
2013-01-04 0.1085160 0.5731905
2013-01-05 -0.5878911 -0.7304131
2013-01-06 -1.6513622 -0.7548627
In [82]:
df.iloc[1,1]  # getting a specific scalar/single value
Out[82]:
-0.7338730678633689

The boolean operators are: | for or, & for and, and ~ for not. Each condition must be wrapped in parentheses.

By the way, there are two ways to access a dataframe column: bracket syntax (df["column_name"]) or dot notation (df.column_name). Bracket syntax is recommended because a column name could, in theory, collide with an existing dataframe attribute or method. Another reason to use bracket syntax: dot notation does not work at all when the column name contains a space.

In [83]:
df[df.A > -0.5]  # or df[df["A"] > -0.5], this syntax works when there is a space in the column name
Out[83]:
A B C D
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940
2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759
2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924

or I've seen some people do boolean filtering by passing "criteria" variable to the data frame like so:

In [84]:
criteria = df['A'] > -0.5
df[criteria]
Out[84]:
A B C D
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940
2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759
2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924
In [85]:
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2
Out[85]:
A B C D E
2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371 one
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336 one
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two
2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759 three
2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four
2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924 three
In [86]:
df2[df2['E'].isin(['two','four'])]  # read as "return rows where column E contains two or four"
Out[86]:
A B C D E
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two
2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four

We can also do partial string matching. Let's say you don't know the exact spelling of the word you want to match; you can do this:

In [87]:
df2[df2.E.str.contains("tw|ou")]
Out[87]:
A B C D E
2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two
2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four

Using ~ to perform a "NOT" filtering

In [88]:
df2[~df2['E'].isin(['two','four'])]  # column E containing values not in two or four
Out[88]:
A B C D E
2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371 one
2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336 one
2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759 three
2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924 three

Filtering using the query() method. Starting with version 0.25, query() / eval() also support column names containing spaces by wrapping them in backticks.

This is equivalent to using SQL's WHERE clause

In [89]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[89]:
A B C D
2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354
2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250
2013-01-03 -0.9531144 1.0518699 -0.0242550 -0.3607456
2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688
2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664
2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577
In [90]:
df.query("A > C")
Out[90]:
A B C D
2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354
2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688
2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577
In [91]:
df.query("A > 0")
Out[91]:
A B C D
2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250
2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688
2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664
In [92]:
df.query("A > 0 & A < 1")
Out[92]:
A B C D
2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250
2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664
In [93]:
df.query("A > B | A > C")  # where A is greater than B or A is greater than C
Out[93]:
A B C D
2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354
2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250
2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688
2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577
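
The backtick support mentioned above looks like this (a hypothetical frame with a space in a column name, purely for illustration):

In [ ]:
# Hypothetical sketch: backticks around a column name that contains a space
costs = pd.DataFrame({'total cost': [10, 20, 30], 'qty': [1, 2, 3]})
costs.query("`total cost` > 15")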

You've already seen an example usage of the .str accessor above. There are other accessors as well that help you perform datetime (.dt) and categorical (.cat) operations.

In [94]:
import pandas as pd
In [95]:
pd.Series._accessors
Out[95]:
{'cat', 'dt', 'sparse', 'str'}
In [96]:
dir(pd.Series.dt)
Out[96]:
['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_add_delegate_accessors',
 '_constructor',
 '_delegate_method',
 '_delegate_property_get',
 '_delegate_property_set',
 '_deprecations',
 '_dir_additions',
 '_dir_deletions',
 '_freeze',
 '_get_values',
 '_reset_cache',
 'asfreq',
 'ceil',
 'components',
 'date',
 'day',
 'day_name',
 'dayofweek',
 'dayofyear',
 'days',
 'days_in_month',
 'daysinmonth',
 'end_time',
 'floor',
 'freq',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'isocalendar',
 'microsecond',
 'microseconds',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'nanoseconds',
 'normalize',
 'quarter',
 'qyear',
 'round',
 'second',
 'seconds',
 'start_time',
 'strftime',
 'time',
 'timetz',
 'to_period',
 'to_pydatetime',
 'to_pytimedelta',
 'to_timestamp',
 'total_seconds',
 'tz',
 'tz_convert',
 'tz_localize',
 'week',
 'weekday',
 'weekofyear',
 'year']

From the dir() output above, we see that there are some useful date-related attributes that we can leverage

In [97]:
daterng = pd.Series(pd.date_range('2017', periods=9, freq='Q'))
In [98]:
daterng
Out[98]:
0   2017-03-31
1   2017-06-30
2   2017-09-30
3   2017-12-31
4   2018-03-31
5   2018-06-30
6   2018-09-30
7   2018-12-31
8   2019-03-31
dtype: datetime64[ns]
In [99]:
daterng.dt.day_name()
Out[99]:
0      Friday
1      Friday
2    Saturday
3      Sunday
4    Saturday
5    Saturday
6      Sunday
7      Monday
8      Sunday
dtype: object
In [100]:
daterng[daterng.dt.is_year_end]
Out[100]:
3   2017-12-31
7   2018-12-31
dtype: datetime64[ns]

Obtaining columns with partial column labels

In [101]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[101]:
key1 key2 data1 data2
0 a one -0.5009150 -1.1067497
1 a two -0.4532828 0.3262502
2 b one -0.2458276 0.2946161
3 b two -0.3467547 -0.6925296
4 a one 0.1528616 -0.0112956
In [102]:
df.filter(like='data')
Out[102]:
data1 data2
0 -0.5009150 -1.1067497
1 -0.4532828 0.3262502
2 -0.2458276 0.2946161
3 -0.3467547 -0.6925296
4 0.1528616 -0.0112956

Getting Value Counts

It is tempting to reach for the groupby() or pivot_table() functions, but most of the time the value_counts() function is all we need.

In [103]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[103]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [104]:
data['group'].value_counts()
Out[104]:
a    3
b    3
c    3
Name: group, dtype: int64

There is also a handy normalize option:

In [105]:
data['group'].value_counts(normalize=True) 
Out[105]:
a    0.3333333
b    0.3333333
c    0.3333333
Name: group, dtype: float64

Getting Cumulative Sum

In [106]:
data['group'].value_counts().cumsum()
Out[106]:
a    3
b    6
c    9
Name: group, dtype: int64

Grouping Data (see also value_counts() and pivot_table())

New in version 0.20 - agg() function. Now aggregating is less verbose!

Now for some simple aggregations:

In [107]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=10)
df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=list('ABCD'))
df
Out[107]:
A B C D
2013-01-01 0.9596146 1.9972782 0.6017156 1.4154062
2013-01-02 -1.4137538 -1.3745631 -0.3215356 0.7472984
2013-01-03 0.7632499 1.6742977 1.3021326 0.1888954
2013-01-04 0.8005111 -0.1462794 0.3436599 0.1310278
2013-01-05 0.9066343 0.5098032 -0.1090932 -0.6649942
2013-01-06 -0.5543259 -0.7512375 1.7489706 -1.6296998
2013-01-07 0.2970382 1.2624804 1.0115971 0.2765671
2013-01-08 -0.9845397 0.8508958 0.5400208 0.6634031
2013-01-09 -0.1188055 0.4146147 -0.8512962 0.1770277
2013-01-10 -1.3645758 0.0008273 -0.8191429 0.3820062

Sum for each column:

In [108]:
df.agg('sum')
Out[108]:
A   -0.7089525
B    4.4381173
C    3.4470286
D    1.6869378
dtype: float64

Sum and min of each column:

In [109]:
df.agg(['sum', 'min'])
Out[109]:
A B C D
sum -0.7089525 4.4381173 3.4470286 1.6869378
min -1.4137538 -1.3745631 -0.8512962 -1.6296998

Sum of just one or more columns:

In [110]:
df['A'].agg('sum')
Out[110]:
-0.7089524679704087
In [111]:
df[['A','C']].agg('sum')
Out[111]:
A   -0.7089525
C    3.4470286
dtype: float64

Now for some groupby() examples:

In [112]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[112]:
key1 key2 data1 data2
0 a one 0.1564853 -0.6441757
1 a two 1.3446095 0.2890500
2 b one 0.4096645 -1.2109906
3 b two -0.1875962 1.3129774
4 a one -0.7794965 0.4034388

Using a combination of groupby() and agg():

In a lot of cases, instead of pivot_table(), we can use the groupby() / agg() combination.

General preferred syntax: df.groupby('grouping_column').agg({'aggregating_column': 'aggregating_functions'})

In [114]:
df
Out[114]:
key1 key2 data1 data2
0 a one 0.1564853 -0.6441757
1 a two 1.3446095 0.2890500
2 b one 0.4096645 -1.2109906
3 b two -0.1875962 1.3129774
4 a one -0.7794965 0.4034388

NOTE: You can pass a list of aggregating functions:

In [113]:
df.groupby('key1').agg({'data1': ['sum','max']})
Out[113]:
data1
sum max
key1
a 0.7215983 1.3446095
b 0.2220683 0.4096645
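
Passing a list of aggregating functions produces a MultiIndex on the columns. If you prefer flat column names, here is a small sketch:

In [ ]:
# Sketch: join the two column levels into single names like 'data1_sum'
out = df.groupby('key1').agg({'data1': ['sum', 'max']})
out.columns = ['_'.join(col) for col in out.columns]
out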

Pivot table example

See also this excellent article on pivot tables by Chris Moffitt

In [115]:
import pandas as pd
import numpy as np

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[115]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0

Calculating the means of each group

In [116]:
data.pivot_table(values='ounces',index='group',aggfunc=np.mean)
Out[116]:
ounces
group
a 6.3333333
b 7.1666667
c 4.6666667

Getting counts by group

In [117]:
data.pivot_table(values='ounces',index='group',aggfunc='count')
Out[117]:
ounces
group
a 3
b 3
c 3

Getting cumulative sum or running total of the group counts

In [118]:
data.pivot_table(values='ounces',index='group',aggfunc='count').cumsum()
Out[118]:
ounces
group
a 3
b 6
c 9

You can pass a dictionary to the aggfunc= parameter to apply different aggregate functions to different columns

In [119]:
data.pivot_table(values='ounces', index='group', aggfunc={'group': len, 'ounces': np.sum})
Out[119]:
group ounces
group
a 3 19.0
b 3 21.5
c 3 14.0

With named aggregations, we can create new columns based on aggregations on other columns.

In [120]:
import numpy as np
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})
animals
Out[120]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [122]:
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column='height', aggfunc='min'),
    max_height=pd.NamedAgg(column='height', aggfunc='max'),
    average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean)
)
Out[122]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75

There is also a less verbose syntax using tuples instead of NamedAgg() function:

In [123]:
animals.groupby("kind").agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'),
    average_weight=('weight', np.mean)
)
Out[123]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Calculating percent of row totals and column totals:

In [124]:
import pandas as pd
data = pd.DataFrame({'group': [80, 70, 75, 75],
                     'ounces': [20, 30, 25, 25],
                     'size': [100, 100, 100, 100]}
                   )
data
Out[124]:
group ounces size
0 80 20 100
1 70 30 100
2 75 25 100
3 75 25 100
In [125]:
perc_of_rows = data.apply(lambda x : x / x.sum() * 100, axis='columns') # or axis=1
perc_of_rows
Out[125]:
group ounces size
0 40.0 10.0 50.0
1 35.0 15.0 50.0
2 37.5 12.5 50.0
3 37.5 12.5 50.0
In [126]:
perc_of_columns = data.apply(lambda x : x / x.sum() * 100, axis='index') # or axis=0
perc_of_columns
Out[126]:
group ounces size
0 26.6666667 20.0 25.0
1 23.3333333 30.0 25.0
2 25.0000000 25.0 25.0
3 25.0000000 25.0 25.0

or percent of a specific column:

In [127]:
data['Perc_of_Group_Column'] = data['group'] / data['group'].sum() * 100
data
Out[127]:
group ounces size Perc_of_Group_Column
0 80 20 100 26.6666667
1 70 30 100 23.3333333
2 75 25 100 25.0000000
3 75 25 100 25.0000000

Let's say you have a data frame with several columns and don't want to horizontally scroll to see the columns, but want to vertically scroll instead. In that case, transpose the data frame.

In [128]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[128]:
key1 key2 data1 data2
0 a one 1.2570738 1.2159738
1 a two 0.6660120 -1.4213658
2 b one 0.6255760 -0.0309261
3 b two 0.1350255 1.4271483
4 a one 1.4709446 -0.3034248
In [129]:
df.head(1).transpose()
Out[129]:
0
key1 a
key2 one
data1 1.257074
data2 1.215974
Resetting the index (moving the index back into a regular column):

In [131]:
import pandas as pd
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[131]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [132]:
data.reset_index(level=0, inplace=True)
data
Out[132]:
index group ounces
0 0 a 4.0
1 1 a 3.0
2 2 a 12.0
3 3 b 6.0
4 4 b 7.5
5 5 b 8.0
6 6 c 3.0
7 7 c 5.0
8 8 c 6.0
In [133]:
import pandas as pd
col = ['a','b','c']
data = pd.DataFrame([[1,2,3],[10,11,12],[20,21,22]],columns=col)
data
Out[133]:
a b c
0 1 2 3
1 10 11 12
2 20 21 22
In [134]:
data = data.set_index('a')
data
Out[134]:
b c
a
1 2 3
10 11 12
20 21 22

To remove the index name ("a"), do:

In [135]:
data.index.name = None
data
Out[135]:
b c
1 2 3
10 11 12
20 21 22
In [136]:
import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s
Out[136]:
09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1
dtype: int64

From above, we have holes in our data. Let's fill in those missing dates using the idx date_range we specified above and the reindex() function.

In [137]:
# First need to convert the dataframe index type as type DateTime
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
s
Out[137]:
2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
2013-09-09     0
2013-09-10     0
2013-09-11     0
2013-09-12     0
2013-09-13     0
2013-09-14     0
2013-09-15     0
2013-09-16     0
2013-09-17     0
2013-09-18     0
2013-09-19     0
2013-09-20     0
2013-09-21     0
2013-09-22     0
2013-09-23     0
2013-09-24     0
2013-09-25     0
2013-09-26     0
2013-09-27     0
2013-09-28     0
2013-09-29     0
2013-09-30     0
Freq: D, dtype: int64
Reading data from a database with pyodbc:

In [ ]:
import pyodbc
import pandas as pd
from getpass import getpass    # Module that will create a text input widget AND mask your password

userid = 'your_userid'
pw = getpass(prompt='Enter your password: ')

cnxn_string = 'DSN=your_dsn;UID=' + userid + ';PWD=' + pw

sql = """
SELECT * from your_table...
"""

with pyodbc.connect(cnxn_string) as conn:
    df = pd.read_sql(sql, conn, index_col=None, parse_dates={'some_column':"%Y-%m-%d"})

Certain charting libraries expect the data to be in long format. Let's get some data in wide format first.

In [ ]:
import matplotlib.pyplot as plt
from pandas_datareader import data as pdr
from datetime import datetime
import pandas as pd

stocks = {'tech':['GOOGL', 'MSFT', 'LNKD', 'YHOO', 'FB','HPQ','AMZN'],
          'auto':['TM','F','GM','HMC','NSANY','HYMTF'],
          'housing':['HD','WMT','LOW']
          }

start_date = datetime(2016,1,1)
end_date = datetime(2016, 4, 21)

def get_px(stock, start, end):
    return pdr.get_data_yahoo(stock, start, end)['Adj Close']

df = pd.DataFrame({n: get_px(n, start_date, end_date) for n in stocks['tech']})

Below is what data looks like in 'wide' format:

In [146]:
df = pd.read_clipboard()
df
Out[146]:
Date AMZN FB GOOGL HPQ LNKD MSFT YHOO
0 1/4/2016 636.989990 102.220001 759.440002 11.471342 225.550003 54.409346 31.400000
1 1/5/2016 633.789978 102.730003 761.530029 11.580122 229.000000 54.657563 32.200001
2 1/6/2016 632.650024 102.970001 759.330017 11.164780 226.080002 53.664692 32.160000
3 1/7/2016 607.940002 97.919998 741.000000 10.650548 217.000000 51.798093 30.160000
4 1/8/2016 607.049988 97.330002 730.909973 10.462655 215.899994 51.956956 30.629999
5 1/11/2016 617.739990 97.510002 733.070007 10.531878 210.929993 51.927167 30.170000
6 1/12/2016 617.890015 99.370003 745.340027 10.699993 210.429993 52.403745 30.690001
7 1/13/2016 581.809998 95.440002 719.570007 10.462655 203.130005 51.271873 29.440001
8 1/14/2016 593.000000 98.370003 731.390015 10.502211 203.210007 52.731395 30.320000
9 1/15/2016 570.179993 94.970001 710.489990 9.997867 196.320007 50.626508 29.139999

This is what data looks like in 'long' format:

In [147]:
df_long = pd.melt(df, id_vars=['Date']).dropna()
df_long
Out[147]:
Date variable value
0 1/4/2016 AMZN 636.989990
1 1/5/2016 AMZN 633.789978
2 1/6/2016 AMZN 632.650024
3 1/7/2016 AMZN 607.940002
4 1/8/2016 AMZN 607.049988
5 1/11/2016 AMZN 617.739990
6 1/12/2016 AMZN 617.890015
7 1/13/2016 AMZN 581.809998
8 1/14/2016 AMZN 593.000000
9 1/15/2016 AMZN 570.179993
10 1/4/2016 FB 102.220001
11 1/5/2016 FB 102.730003
12 1/6/2016 FB 102.970001
13 1/7/2016 FB 97.919998
14 1/8/2016 FB 97.330002
15 1/11/2016 FB 97.510002
16 1/12/2016 FB 99.370003
17 1/13/2016 FB 95.440002
18 1/14/2016 FB 98.370003
19 1/15/2016 FB 94.970001
20 1/4/2016 GOOGL 759.440002
21 1/5/2016 GOOGL 761.530029
22 1/6/2016 GOOGL 759.330017
23 1/7/2016 GOOGL 741.000000
24 1/8/2016 GOOGL 730.909973
25 1/11/2016 GOOGL 733.070007
26 1/12/2016 GOOGL 745.340027
27 1/13/2016 GOOGL 719.570007
28 1/14/2016 GOOGL 731.390015
29 1/15/2016 GOOGL 710.489990
30 1/4/2016 HPQ 11.471342
31 1/5/2016 HPQ 11.580122
32 1/6/2016 HPQ 11.164780
33 1/7/2016 HPQ 10.650548
34 1/8/2016 HPQ 10.462655
35 1/11/2016 HPQ 10.531878
36 1/12/2016 HPQ 10.699993
37 1/13/2016 HPQ 10.462655
38 1/14/2016 HPQ 10.502211
39 1/15/2016 HPQ 9.997867
40 1/4/2016 LNKD 225.550003
41 1/5/2016 LNKD 229.000000
42 1/6/2016 LNKD 226.080002
43 1/7/2016 LNKD 217.000000
44 1/8/2016 LNKD 215.899994
45 1/11/2016 LNKD 210.929993
46 1/12/2016 LNKD 210.429993
47 1/13/2016 LNKD 203.130005
48 1/14/2016 LNKD 203.210007
49 1/15/2016 LNKD 196.320007
50 1/4/2016 MSFT 54.409346
51 1/5/2016 MSFT 54.657563
52 1/6/2016 MSFT 53.664692
53 1/7/2016 MSFT 51.798093
54 1/8/2016 MSFT 51.956956
55 1/11/2016 MSFT 51.927167
56 1/12/2016 MSFT 52.403745
57 1/13/2016 MSFT 51.271873
58 1/14/2016 MSFT 52.731395
59 1/15/2016 MSFT 50.626508
60 1/4/2016 YHOO 31.400000
61 1/5/2016 YHOO 32.200001
62 1/6/2016 YHOO 32.160000
63 1/7/2016 YHOO 30.160000
64 1/8/2016 YHOO 30.629999
65 1/11/2016 YHOO 30.170000
66 1/12/2016 YHOO 30.690001
67 1/13/2016 YHOO 29.440001
68 1/14/2016 YHOO 30.320000
69 1/15/2016 YHOO 29.139999
In [148]:
df_long.rename(columns={'variable': 'Stock', 'value':'Price'},inplace=True)
df_long
Out[148]:
Date Stock Price
0 1/4/2016 AMZN 636.989990
1 1/5/2016 AMZN 633.789978
2 1/6/2016 AMZN 632.650024
3 1/7/2016 AMZN 607.940002
4 1/8/2016 AMZN 607.049988
5 1/11/2016 AMZN 617.739990
6 1/12/2016 AMZN 617.890015
7 1/13/2016 AMZN 581.809998
8 1/14/2016 AMZN 593.000000
9 1/15/2016 AMZN 570.179993
10 1/4/2016 FB 102.220001
11 1/5/2016 FB 102.730003
12 1/6/2016 FB 102.970001
13 1/7/2016 FB 97.919998
14 1/8/2016 FB 97.330002
15 1/11/2016 FB 97.510002
16 1/12/2016 FB 99.370003
17 1/13/2016 FB 95.440002
18 1/14/2016 FB 98.370003
19 1/15/2016 FB 94.970001
20 1/4/2016 GOOGL 759.440002
21 1/5/2016 GOOGL 761.530029
22 1/6/2016 GOOGL 759.330017
23 1/7/2016 GOOGL 741.000000
24 1/8/2016 GOOGL 730.909973
25 1/11/2016 GOOGL 733.070007
26 1/12/2016 GOOGL 745.340027
27 1/13/2016 GOOGL 719.570007
28 1/14/2016 GOOGL 731.390015
29 1/15/2016 GOOGL 710.489990
30 1/4/2016 HPQ 11.471342
31 1/5/2016 HPQ 11.580122
32 1/6/2016 HPQ 11.164780
33 1/7/2016 HPQ 10.650548
34 1/8/2016 HPQ 10.462655
35 1/11/2016 HPQ 10.531878
36 1/12/2016 HPQ 10.699993
37 1/13/2016 HPQ 10.462655
38 1/14/2016 HPQ 10.502211
39 1/15/2016 HPQ 9.997867
40 1/4/2016 LNKD 225.550003
41 1/5/2016 LNKD 229.000000
42 1/6/2016 LNKD 226.080002
43 1/7/2016 LNKD 217.000000
44 1/8/2016 LNKD 215.899994
45 1/11/2016 LNKD 210.929993
46 1/12/2016 LNKD 210.429993
47 1/13/2016 LNKD 203.130005
48 1/14/2016 LNKD 203.210007
49 1/15/2016 LNKD 196.320007
50 1/4/2016 MSFT 54.409346
51 1/5/2016 MSFT 54.657563
52 1/6/2016 MSFT 53.664692
53 1/7/2016 MSFT 51.798093
54 1/8/2016 MSFT 51.956956
55 1/11/2016 MSFT 51.927167
56 1/12/2016 MSFT 52.403745
57 1/13/2016 MSFT 51.271873
58 1/14/2016 MSFT 52.731395
59 1/15/2016 MSFT 50.626508
60 1/4/2016 YHOO 31.400000
61 1/5/2016 YHOO 32.200001
62 1/6/2016 YHOO 32.160000
63 1/7/2016 YHOO 30.160000
64 1/8/2016 YHOO 30.629999
65 1/11/2016 YHOO 30.170000
66 1/12/2016 YHOO 30.690001
67 1/13/2016 YHOO 29.440001
68 1/14/2016 YHOO 30.320000
69 1/15/2016 YHOO 29.139999
In [149]:
pivoted = df_long.pivot(index='Date', columns='Stock', values='Price')
pivoted
Out[149]:
Stock AMZN FB GOOGL HPQ LNKD MSFT YHOO
Date
1/11/2016 617.739990 97.510002 733.070007 10.531878 210.929993 51.927167 30.170000
1/12/2016 617.890015 99.370003 745.340027 10.699993 210.429993 52.403745 30.690001
1/13/2016 581.809998 95.440002 719.570007 10.462655 203.130005 51.271873 29.440001
1/14/2016 593.000000 98.370003 731.390015 10.502211 203.210007 52.731395 30.320000
1/15/2016 570.179993 94.970001 710.489990 9.997867 196.320007 50.626508 29.139999
1/4/2016 636.989990 102.220001 759.440002 11.471342 225.550003 54.409346 31.400000
1/5/2016 633.789978 102.730003 761.530029 11.580122 229.000000 54.657563 32.200001
1/6/2016 632.650024 102.970001 759.330017 11.164780 226.080002 53.664692 32.160000
1/7/2016 607.940002 97.919998 741.000000 10.650548 217.000000 51.798093 30.160000
1/8/2016 607.049988 97.330002 730.909973 10.462655 215.899994 51.956956 30.629999
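
Note that the pivoted rows above sort as strings (1/11 comes before 1/4). A sketch that parses Date first so the rows sort chronologically:

In [ ]:
# Sketch: convert Date to datetime before pivoting so the index sorts by date
df_long['Date'] = pd.to_datetime(df_long['Date'])
df_long.pivot(index='Date', columns='Stock', values='Price').head()
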
Working with categorical data:

In [150]:
import pandas as pd
data = pd.DataFrame({'medal': ['bronze', 'silver', 'silver', 'gold','bronze', 'bronze', 'gold', 'gold','gold'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[150]:
medal ounces
0 bronze 4.0
1 silver 3.0
2 silver 12.0
3 gold 6.0
4 bronze 7.5
5 bronze 8.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
In [151]:
data.dtypes
Out[151]:
medal      object
ounces    float64
dtype: object
In [152]:
data["medal"] = data["medal"].astype("category")
data.dtypes
Out[152]:
medal     category
ounces     float64
dtype: object
In [153]:
data.sort_values(by=["medal"])
Out[153]:
medal ounces
0 bronze 4.0
4 bronze 7.5
5 bronze 8.0
3 gold 6.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
1 silver 3.0
2 silver 12.0

Now define the order of the categorical data using set_categories()

In [154]:
data["medal"].cat.set_categories([ "gold","silver","bronze"], inplace=True)
In [155]:
data.sort_values(by='medal')
Out[155]:
medal ounces
3 gold 6.0
6 gold 3.0
7 gold 5.0
8 gold 6.0
1 silver 3.0
2 silver 12.0
0 bronze 4.0
4 bronze 7.5
5 bronze 8.0

Creating a column of type categorical that is also ordered

In [156]:
df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'quality': ['good', 'very good', 'good', 'excellent']})
df
Out[156]:
ID quality
0 100 good
1 101 very good
2 102 good
3 103 excellent
In [159]:
cat_dtype = pd.api.types.CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
In [161]:
df['quality'] = df.quality.astype(cat_dtype)
df.sort_values(by='quality')
Out[161]:
ID quality
0 100 good
2 102 good
1 101 very good
3 103 excellent
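
Because the dtype is ordered, comparison operators follow the category order rather than alphabetical order; a small sketch:

In [ ]:
# Sketch: with an ordered categorical, > compares by category order
df[df['quality'] > 'good']
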
Merging (joining) data frames:

In [162]:
import pandas as pd
left = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
left
Out[162]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [163]:
right = pd.DataFrame({'label': ['a','b','c'],
                     'value': ['alpha','beta','charlie']})
right
Out[163]:
label value
0 a alpha
1 b beta
2 c charlie

By default, merge() does an inner join, but you can specify 'left', 'right', or 'outer' joins.

These types of joins are familiar if you have SQL background. Here's a good SQL joins visualizer.

In [164]:
inner_joined = pd.merge(left, right, how='inner', left_on='group', right_on='label')
inner_joined
Out[164]:
group ounces label value
0 a 4.0 a alpha
1 a 3.0 a alpha
2 a 12.0 a alpha
3 b 6.0 b beta
4 b 7.5 b beta
5 b 8.0 b beta
6 c 3.0 c charlie
7 c 5.0 c charlie
8 c 6.0 c charlie
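
A sketch of a left join, which keeps every row from `left` even when there is no match in `right`:

In [ ]:
# Sketch: how='left' keeps unmatched rows from the left frame (NaN on the right side)
left_joined = pd.merge(left, right, how='left', left_on='group', right_on='label')
left_joined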

NOTE: To merge 2 Series, you have to use pd.concat() instead

For more on merge, join, and concatenate, see the official docs

YouTube video on joining data sets using pandas

Finding rows that contain NaN values:

In [165]:
import numpy as np
import pandas as pd
df = pd.DataFrame([range(3), [0, np.nan, 1], [1, 1, np.nan], range(3), range(3)])
df.columns = ['A','B','C']
df
Out[165]:
A B C
0 0 1.0 2.0
1 0 NaN 1.0
2 1 1.0 NaN
3 0 1.0 2.0
4 0 1.0 2.0
In [166]:
df.isna().any(axis=1)
Out[166]:
0    False
1     True
2     True
3    False
4    False
dtype: bool
In [167]:
df[df.isna().any(axis=1)]
Out[167]:
A B C
1 0 NaN 1.0
2 1 1.0 NaN

To filter a column for na/null values and also filter another column:

In [168]:
df[(df['C']==1) & (df['B'].isna())]
Out[168]:
A B C
1 0 NaN 1.0
Converting column data types with astype():

In [169]:
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[169]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
In [170]:
df.dtypes
Out[170]:
one      object
two      object
three    object
dtype: object
In [171]:
df[['two', 'three']] = df[['two', 'three']].astype(float)
In [172]:
df.dtypes
Out[172]:
one       object
two      float64
three    float64
dtype: object
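
A related sketch: pd.to_numeric() with errors='coerce' is more forgiving than astype() when some values may not parse, turning them into NaN instead of raising an error:

In [ ]:
# Sketch: unparseable strings become NaN instead of raising
raw = pd.DataFrame(a, columns=['one', 'two', 'three'])   # re-create the string frame from above
raw['two'] = pd.to_numeric(raw['two'], errors='coerce')
raw.dtypes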

Starting with version 1.5, you can pass a "data" parameter to matplotlib's plot() function

Prior to version 1.5:

In [180]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"var1":[1,2,3,4,5,6], "var2":[1,2,3,4,5,6]})
plt.plot(df["var1"], df["var2"])
plt.show()

With version 1.5+

In [174]:
plt.plot("var1", "var2", data=df)
plt.show()

Method chaining eliminates the need for intermediate variables. You can process your data frame in a series of method calls by enclosing the whole expression in parentheses. Here's a very contrived example that creates intermediate variables between transformation steps:

In [ ]:
df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(df['some_column'], ...)
df = df.pivot_table(...)
df = df.rename(...)

Instead of the above, we can chain the methods as long as you surround the code with parentheses:

In [ ]:
(pd.read_csv('data.csv')
   .fillna(...)
   .query('some_condition')
   .assign(new_column=lambda d: pd.cut(d['some_column'], ...))
   .pivot_table(...)
   .rename(...)
)

It is recommended that you don't use method chaining if it will significantly impact readability.

Also check out the pipe() method, which can be used to organize data transformations into discrete functions.

A short series of video tutorials on using pipe()

In [175]:
import pandas as pd
df = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [176]:
df
Out[176]:
group ounces
0 a 4.0
1 a 3.0
2 a 12.0
3 b 6.0
4 b 7.5
5 b 8.0
6 c 3.0
7 c 5.0
8 c 6.0
In [177]:
def mean_age_by_group(df, col):
    return df.groupby(col).mean()
In [178]:
def uppercase_column_names(df):
    df.columns = df.columns.str.upper()
    return df
In [179]:
(df.pipe(mean_age_by_group, col='group')
     .pipe(uppercase_column_names)
)
Out[179]:
OUNCES
group
a 6.3333333
b 7.1666667
c 4.6666667