5 new changes in pandas you need to know about (video)

In [1]:
import pandas as pd
pd.__version__
Out[1]:
'0.22.0'

1. ix has been deprecated

New in 0.20.0

In [2]:
# read the drinks dataset into a DataFrame
drinks = pd.read_csv('http://bit.ly/drinksbycountry', index_col='country')
drinks.head()
Out[2]:
             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
country
Afghanistan              0                0              0                           0.0       Asia
Albania                 89              132             54                           4.9     Europe
Algeria                 25                0             14                           0.7     Africa
Andorra                245              138            312                          12.4     Europe
Angola                 217               57             45                           5.9     Africa
In [3]:
# loc accesses by label
drinks.loc['Angola', 'spirit_servings']
Out[3]:
57
In [4]:
# iloc accesses by position
drinks.iloc[4, 1]
Out[4]:
57
In [5]:
# ix accesses by label OR position (newly deprecated): here, row label and column position
drinks.ix['Angola', 1]
/Users/kevin/miniconda3/envs/pd22.0/lib/python3.5/site-packages/ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  
Out[5]:
57
In [6]:
# alternative: use loc
drinks.loc['Angola', drinks.columns[1]]
Out[6]:
57
In [7]:
# alternative: use iloc
drinks.iloc[drinks.index.get_loc('Angola'), 1]
Out[7]:
57
In [8]:
# ix accesses by label OR position (newly deprecated): here, row position and column label
drinks.ix[4, 'spirit_servings']
Out[8]:
57
In [9]:
# alternative: use loc
drinks.loc[drinks.index[4], 'spirit_servings']
Out[9]:
57
In [10]:
# alternative: use iloc
drinks.iloc[4, drinks.columns.get_loc('spirit_servings')]
Out[10]:
57
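
The two replacement patterns above can be tried without downloading the dataset. This is a minimal sketch, assuming a hand-built stand-in (the name drinks_small is hypothetical) that holds two columns of the first five rows shown in Out[2]:

```python
import pandas as pd

# Hypothetical stand-in for the drinks dataset (values copied from Out[2])
drinks_small = pd.DataFrame(
    {'beer_servings': [0, 89, 25, 245, 217],
     'spirit_servings': [0, 132, 0, 138, 57]},
    index=pd.Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
                   name='country'))

# row label + column position: turn the position into a label, then use loc
by_loc = drinks_small.loc['Angola', drinks_small.columns[1]]

# row position + column label: turn the label into a position, then use iloc
by_iloc = drinks_small.iloc[4, drinks_small.columns.get_loc('spirit_servings')]

print(by_loc, by_iloc)
```

Either way the lookup is now explicit about whether each part is a label or a position, which is the ambiguity that ix left open.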

2. Aliases have been added for isnull and notnull

New in 0.21.0

In [11]:
# read the UFO dataset into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()
Out[11]:
                   City  Colors Reported  Shape Reported  State             Time
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
In [12]:
# check which values are missing
ufo.isnull().head()
Out[12]:
    City  Colors Reported  Shape Reported  State   Time
0  False             True           False  False  False
1  False             True           False  False  False
2  False             True           False  False  False
3  False             True           False  False  False
4  False             True           False  False  False
In [13]:
# check which values are not missing
ufo.notnull().head()
Out[13]:
   City  Colors Reported  Shape Reported  State  Time
0  True            False            True   True  True
1  True            False            True   True  True
2  True            False            True   True  True
3  True            False            True   True  True
4  True            False            True   True  True
In [14]:
# drop rows with missing values
ufo.dropna().head()
Out[14]:
          City  Colors Reported  Shape Reported  State             Time
12      Belton              RED          SPHERE     SC  6/30/1939 20:00
19  Bering Sea              RED           OTHER     AK  4/30/1943 23:00
36  Portsmouth              RED       FORMATION     VA   7/10/1945 1:30
44   Blairsden            GREEN          SPHERE     CA  6/30/1946 19:00
82    San Jose             BLUE         CHEVRON     CA  7/15/1947 21:00
In [15]:
# fill in missing values
ufo.fillna(value='UNKNOWN').head()
Out[15]:
                   City  Colors Reported  Shape Reported  State             Time
0                Ithaca          UNKNOWN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro          UNKNOWN           OTHER     NJ  6/30/1930 20:00
2               Holyoke          UNKNOWN            OVAL     CO  2/15/1931 14:00
3               Abilene          UNKNOWN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair          UNKNOWN           LIGHT     NY  4/18/1933 19:00
In [16]:
# new alias for isnull
ufo.isna().head()
Out[16]:
    City  Colors Reported  Shape Reported  State   Time
0  False             True           False  False  False
1  False             True           False  False  False
2  False             True           False  False  False
3  False             True           False  False  False
4  False             True           False  False  False
In [17]:
# new alias for notnull
ufo.notna().head()
Out[17]:
   City  Colors Reported  Shape Reported  State  Time
0  True            False            True   True  True
1  True            False            True   True  True
2  True            False            True   True  True
3  True            False            True   True  True
4  True            False            True   True  True
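
To confirm that the new names are pure aliases, the results can be compared directly. This is a small sketch on a hypothetical two-row frame (ufo_small) rather than the full UFO dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample in the shape of the UFO data
ufo_small = pd.DataFrame({'City': ['Ithaca', 'Belton'],
                          'Colors Reported': [np.nan, 'RED']})

# isna/notna should return exactly what isnull/notnull return
same_missing = ufo_small.isna().equals(ufo_small.isnull())
same_present = ufo_small.notna().equals(ufo_small.notnull())

# the aliases chain like any other method, e.g. missing values per column
missing_per_column = ufo_small.isna().sum()

print(same_missing, same_present)
print(missing_per_column)
```

The new names line up with dropna and fillna, so all four missing-value methods now share the "na" spelling.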

3. drop now accepts "index" and "columns" keywords

New in 0.21.0

In [18]:
# read the UFO dataset into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()
Out[18]:
                   City  Colors Reported  Shape Reported  State             Time
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
In [19]:
# old way to drop rows: specify labels and axis
# (axis=0 and axis='index' are equivalent; only the last result is displayed)
ufo.drop([0, 1], axis=0).head()
ufo.drop([0, 1], axis='index').head()
Out[19]:
                   City  Colors Reported  Shape Reported  State             Time
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
5           Valley City              NaN            DISK     ND  9/15/1934 15:30
6           Crater Lake              NaN          CIRCLE     CA   6/15/1935 0:00
In [20]:
# new way to drop rows: specify index
ufo.drop(index=[0, 1]).head()
Out[20]:
                   City  Colors Reported  Shape Reported  State             Time
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
5           Valley City              NaN            DISK     ND  9/15/1934 15:30
6           Crater Lake              NaN          CIRCLE     CA   6/15/1935 0:00
In [21]:
# old way to drop columns: specify labels and axis
# (axis=1 and axis='columns' are equivalent; only the last result is displayed)
ufo.drop(['City', 'State'], axis=1).head()
ufo.drop(['City', 'State'], axis='columns').head()
Out[21]:
   Colors Reported  Shape Reported             Time
0              NaN        TRIANGLE   6/1/1930 22:00
1              NaN           OTHER  6/30/1930 20:00
2              NaN            OVAL  2/15/1931 14:00
3              NaN            DISK   6/1/1931 13:00
4              NaN           LIGHT  4/18/1933 19:00
In [22]:
# new way to drop columns: specify columns
ufo.drop(columns=['City', 'State']).head()
Out[22]:
   Colors Reported  Shape Reported             Time
0              NaN        TRIANGLE   6/1/1930 22:00
1              NaN           OTHER  6/30/1930 20:00
2              NaN            OVAL  2/15/1931 14:00
3              NaN            DISK   6/1/1931 13:00
4              NaN           LIGHT  4/18/1933 19:00
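
Because index and columns are separate keywords, rows and columns can be dropped in a single call. A minimal sketch, assuming a hypothetical three-row sample (ufo_small) in place of the full dataset:

```python
import pandas as pd

# Hypothetical three-row sample in the shape of the UFO data
ufo_small = pd.DataFrame({'City': ['Ithaca', 'Willingboro', 'Holyoke'],
                          'State': ['NY', 'NJ', 'CO'],
                          'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00']})

# drop rows 0 and 1 and the City and State columns in one call
trimmed = ufo_small.drop(index=[0, 1], columns=['City', 'State'])

print(trimmed)
```

With the older labels-plus-axis form, the same result takes two chained drop calls, one per axis.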

4. rename and reindex now accept "axis" keyword

New in 0.21.0

In [23]:
# old way to rename columns: specify columns
ufo.rename(columns={'City':'CITY', 'State':'STATE'}).head()
Out[23]:
                   CITY  Colors Reported  Shape Reported  STATE             Time
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
In [24]:
# new way to rename columns: specify mapper and axis
ufo.rename({'City':'CITY', 'State':'STATE'}, axis='columns').head()
Out[24]:
                   CITY  Colors Reported  Shape Reported  STATE             Time
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
In [25]:
# note: mapper can be a function
ufo.rename(str.upper, axis='columns').head()
Out[25]:
                   CITY  COLORS REPORTED  SHAPE REPORTED  STATE             TIME
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
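
The cells above only show rename. The same labels-plus-axis style also works for reindex, sketched here on a hypothetical three-row sample (ufo_small) rather than the full dataset:

```python
import pandas as pd

# Hypothetical three-row sample in the shape of the UFO data
ufo_small = pd.DataFrame({'City': ['Ithaca', 'Willingboro', 'Holyoke'],
                          'State': ['NY', 'NJ', 'CO'],
                          'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00']})

# select and reorder columns: pass the labels, then name the axis
by_columns = ufo_small.reindex(['Time', 'City'], axis='columns')

# select and reorder rows the same way
by_rows = ufo_small.reindex([2, 0], axis='index')

print(by_columns)
print(by_rows)
```

This matches the calling convention of drop, so rename, reindex, and drop can all be written as labels followed by axis.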

5. Ordered categories must be specified independent of the data

New in 0.21.0

In [26]:
# create a small DataFrame
df = pd.DataFrame({'ID':[100, 101, 102, 103],
                   'quality':['good', 'very good', 'good', 'excellent']})
df
Out[26]:
    ID    quality
0  100       good
1  101  very good
2  102       good
3  103  excellent
In [27]:
# old way to create an ordered category (deprecated)
df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
/Users/kevin/miniconda3/envs/pd22.0/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  
Out[27]:
0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
In [28]:
# new way to create an ordered category
from pandas.api.types import CategoricalDtype
quality_cat = CategoricalDtype(['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_cat)
df.quality
Out[28]:
0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
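
Once the column has an ordered category dtype, sorting and comparisons follow the logical order rather than alphabetical order. A short sketch that rebuilds the same small DataFrame; the stable sort kind is an added choice here so that the two tied 'good' rows keep their original order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'quality': ['good', 'very good', 'good', 'excellent']})

# define the order independently of the data, then apply it
quality_cat = CategoricalDtype(['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df['quality'].astype(quality_cat)

# sort by the logical order (mergesort is stable, so ties keep their order)
sorted_ids = df.sort_values('quality', kind='mergesort')['ID'].tolist()

# comparisons use the category order: keep rows rated better than 'good'
better_than_good = df.loc[df['quality'] > 'good', 'ID'].tolist()

print(sorted_ids)        # [100, 102, 101, 103]
print(better_than_good)  # [101, 103]
```

Because quality_cat is a standalone object, the same ordering can be reused for other columns or DataFrames.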