Pandas Fundamentals

This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.

For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 10)

baseball = pd.read_csv("../data/baseball.csv", index_col='id')

Notice that we specified the id column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining player and year:

In [2]:
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind
Out[2]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
schilcu012006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
myersmi012006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
helliri012006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
johnsra052006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

100 rows × 22 columns

This looks okay, but let's check:

In [3]:
baseball_newind.index.is_unique
Out[3]:
False

So, indices need not be unique. Our choice here is not unique because some players changed teams within a single year, giving them multiple records for that year.

In [4]:
pd.Series(baseball_newind.index).value_counts()
Out[4]:
wellsda012007    2
gomezch022007    2
wickmbo012007    2
sweenma012007    2
loftoke012007    2
                ..
thomafr042007    1
mesajo012007     1
thomeji012007    1
mabryjo012007    1
johnsra052006    1
Length: 88, dtype: int64

The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [5]:
baseball_newind.loc['wickmbo012007']
Out[5]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
wickmbo012007 wickmbo01 2007 2 ARI NL 8 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 22 columns

We will learn more about indexing below.
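One practical consequence worth noting: with a non-unique index, loc returns a DataFrame for a repeated label but a Series for a unique one, so the return type can vary. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical DataFrame with a repeated index label
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'b'])

print(type(df.loc['a']))  # unique label: returns a Series (one row)
print(type(df.loc['b']))  # repeated label: returns a DataFrame (two rows)
```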

We can create a truly unique index by combining player, team and year:

In [6]:
player_unique = baseball.player + baseball.team + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_unique
baseball_newind.head()
Out[6]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
womacto01CHN2006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
schilcu01BOS2006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
myersmi01NYA2006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
helliri01MIL2006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
johnsra05NYA2006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

In [7]:
baseball_newind.index.is_unique
Out[7]:
True

We can create meaningful indices more easily using a hierarchical index; for now, we will stick with the numeric id field as our index.

Manipulating indices

Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, optionally filling in missing data if requested.

A simple use of reindex is to alter the order of the rows:

In [8]:
baseball.reindex(baseball.index[::-1]).head()
Out[8]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

Notice that the id index is not sequential. Say we wanted to populate the table with every id value. We could specify an index that is a sequence from the first to the last id numbers in the database, and Pandas would fill in the missing data with NaN values:

In [9]:
id_range = range(baseball.index.values.min(), baseball.index.values.max() + 1)  # range excludes its stop value
baseball.reindex(id_range).head()
Out[9]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006.0 2.0 CHN NL 19.0 50.0 6.0 14.0 1.0 ... 2.0 1.0 1.0 4.0 4.0 0.0 0.0 3.0 0.0 0.0
88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88643 schilcu01 2006.0 1.0 BOS AL 31.0 2.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88645 myersmi01 2006.0 1.0 NYA AL 62.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

Missing values can be filled as desired, either with selected values, or by rule:

In [10]:
baseball.reindex(id_range, method='ffill').head()
Out[10]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88642 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88644 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

In [11]:
baseball.reindex(id_range, fill_value='charliebrown', columns=['player']).head()
Out[11]:
player
id
88641 womacto01
88642 charliebrown
88643 schilcu01
88644 charliebrown
88645 myersmi01

Keep in mind that reindex raises an error if the existing index contains duplicate labels, since reindexing from a duplicated axis is ambiguous.
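To see why, here is a small sketch (with made-up labels) of what happens when reindex is called on a Series with a duplicated index:

```python
import pandas as pd

# A Series whose index contains a duplicated label
s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Reindexing from a duplicate axis is ambiguous, so pandas raises a ValueError
try:
    s.reindex(['a', 'b', 'c'])
except ValueError as err:
    print("reindex failed:", err)
```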

We can remove rows or columns via the drop method:

In [12]:
baseball.shape
Out[12]:
(100, 22)
In [13]:
baseball.drop([89525, 89526])
Out[13]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 ... 66.0 5.0 0.0 132 54.0 43.0 3.0 0.0 2.0 13.0
89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

98 rows × 22 columns

In [14]:
baseball.drop(['ibb','hbp'], axis=1)
Out[14]:
player year stint team lg g ab r h X2b X3b hr rbi sb cs bb so sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 1.0 1.0 4 4.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 0 0 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 0 0 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3 3 25.0 6.0 1.0 37 74.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 3.0 0.0 27 30.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0

100 rows × 20 columns

Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the Index object to extract values in addition to arrays of integers.

In [15]:
# Sample Series object
hits = baseball_newind.h
hits
Out[15]:
womacto01CHN2006     14
schilcu01BOS2006      1
myersmi01NYA2006      0
helliri01MIL2006      0
johnsra05NYA2006      1
                   ... 
benitar01FLO2007      0
benitar01SFN2007      0
ausmubr01HOU2007     82
aloumo01NYN2007     112
alomasa02NYN2007      3
Name: h, Length: 100, dtype: int64
In [16]:
# Numpy-style indexing
hits[:3]
Out[16]:
womacto01CHN2006    14
schilcu01BOS2006     1
myersmi01NYA2006     0
Name: h, dtype: int64
In [17]:
# Indexing by label
hits[['womacto01CHN2006','schilcu01BOS2006']]
Out[17]:
womacto01CHN2006    14
schilcu01BOS2006     1
Name: h, dtype: int64

We can also slice with data labels, since they have an intrinsic order within the Index. Note that, unlike positional slicing, label-based slices include both endpoints:

In [18]:
hits['womacto01CHN2006':'gonzalu01ARI2006']
Out[18]:
womacto01CHN2006     14
schilcu01BOS2006      1
myersmi01NYA2006      0
helliri01MIL2006      0
johnsra05NYA2006      1
finlest01SFN2006    105
gonzalu01ARI2006    159
Name: h, dtype: int64
In [19]:
hits['womacto01CHN2006':'gonzalu01ARI2006'] = 5
hits
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[19]:
womacto01CHN2006      5
schilcu01BOS2006      5
myersmi01NYA2006      5
helliri01MIL2006      5
johnsra05NYA2006      5
                   ... 
benitar01FLO2007      0
benitar01SFN2007      0
ausmubr01HOU2007     82
aloumo01NYN2007     112
alomasa02NYN2007      3
Name: h, Length: 100, dtype: int64
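The SettingWithCopyWarning above appears because hits may be a view on baseball_newind, so pandas cannot guarantee where the assignment lands. Two common ways to avoid it, sketched on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'h': [14, 1, 0, 0, 1]},
                  index=['w2006', 's2006', 'm2006', 'h2006', 'j2006'])

# Option 1: take an explicit copy before modifying the extracted Series
hits = df['h'].copy()
hits['w2006':'m2006'] = 5       # modifies only the copy, no warning

# Option 2: assign through .loc on the parent DataFrame itself
df.loc['w2006':'m2006', 'h'] = 5
print(df['h'].tolist())         # [5, 5, 5, 0, 1]
```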

In a DataFrame we can slice along either or both axes:

In [20]:
baseball_newind[['h','ab']]
Out[20]:
h ab
womacto01CHN2006 5 50
schilcu01BOS2006 5 2
myersmi01NYA2006 5 0
helliri01MIL2006 5 3
johnsra05NYA2006 5 6
... ... ...
benitar01FLO2007 0 0
benitar01SFN2007 0 0
ausmubr01HOU2007 82 349
aloumo01NYN2007 112 328
alomasa02NYN2007 3 22

100 rows × 2 columns

In [21]:
baseball_newind[baseball_newind.ab>500]
Out[21]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 14.0
vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 14.0
thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 14.0
rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 16.0
griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 14.0
delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 12.0
biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0

7 rows × 22 columns

For a more concise (and readable) syntax, we can use the query method to perform selection on a DataFrame. Instead of typing the fully-qualified column name, we can simply pass a string that describes what to select. The query above then becomes:

In [22]:
baseball_newind.query('ab > 500')
Out[22]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 14.0
vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 14.0
thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 14.0
rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 16.0
griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 14.0
delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 12.0
biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0

7 rows × 22 columns

The DataFrame.index and DataFrame.columns are placed in the query namespace by default. If you want to refer to a variable in the current namespace, you can prefix the variable with @:

In [23]:
min_ab = 450
In [24]:
baseball_newind.query('ab > @min_ab')
Out[24]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
gonzalu01ARI2006 gonzalu01 2006 1 ARI NL 153 586 93 5 52 ... 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 14.0
vizquom01SFN2007 vizquom01 2007 1 SFN NL 145 513 54 126 18 ... 51.0 14.0 6.0 44 48.0 6.0 1.0 14.0 3.0 14.0
thomafr04TOR2007 thomafr04 2007 1 TOR AL 155 531 63 147 30 ... 95.0 0.0 0.0 81 94.0 3.0 7.0 0.0 5.0 14.0
sheffga01DET2007 sheffga01 2007 1 DET AL 133 494 107 131 20 ... 75.0 22.0 5.0 84 71.0 2.0 9.0 0.0 6.0 10.0
rodriiv01DET2007 rodriiv01 2007 1 DET AL 129 502 50 141 31 ... 63.0 2.0 2.0 9 96.0 1.0 1.0 1.0 2.0 16.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
kentje01LAN2007 kentje01 2007 1 LAN NL 136 494 78 149 36 ... 79.0 1.0 3.0 57 61.0 4.0 5.0 0.0 6.0 17.0
griffke02CIN2007 griffke02 2007 1 CIN NL 144 528 78 146 24 ... 93.0 6.0 1.0 85 99.0 14.0 1.0 0.0 9.0 14.0
gonzalu01LAN2007 gonzalu01 2007 1 LAN NL 139 464 70 129 23 ... 68.0 6.0 2.0 56 56.0 4.0 4.0 0.0 2.0 11.0
delgaca01NYN2007 delgaca01 2007 1 NYN NL 139 538 71 139 30 ... 87.0 4.0 0.0 52 118.0 8.0 11.0 0.0 6.0 12.0
biggicr01HOU2007 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0

11 rows × 22 columns

The loc indexer allows us to select subsets of rows and columns in an intuitive way:

In [25]:
baseball_newind.loc['gonzalu01ARI2006', ['h','X2b', 'X3b', 'hr']]
Out[25]:
h       5
X2b    52
X3b     2
hr     15
Name: gonzalu01ARI2006, dtype: object
In [26]:
baseball_newind.loc[:'myersmi01NYA2006', 'hr']
Out[26]:
womacto01CHN2006    1
schilcu01BOS2006    0
myersmi01NYA2006    0
Name: hr, dtype: int64

In addition to using loc to select rows and columns by label, pandas also allows indexing by position using the iloc attribute.

So, we can query rows and columns by absolute position, rather than by name:

In [27]:
baseball_newind.iloc[:5, 5:8]
Out[27]:
g ab r
womacto01CHN2006 19 50 6
schilcu01BOS2006 31 2 0
myersmi01NYA2006 62 0 0
helliri01MIL2006 20 3 0
johnsra05NYA2006 33 6 0

Exercise

You can use the isin method to query a DataFrame based upon a list of values as follows:

data['phylum'].isin(['Firmicutes', 'Bacteroidetes'])

Use isin to find all players that played for the Los Angeles Dodgers (LAN) or the San Francisco Giants (SFN). How many records contain these values?

In [28]:
# Write your answer here

Operations

DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years. First, let's (artificially) construct two Series, consisting of home runs hit in years 2006 and 2007, respectively:

In [29]:
hr2006 = baseball.loc[baseball.year==2006, 'hr']
hr2006.index = baseball.player[baseball.year==2006]

hr2007 = baseball.loc[baseball.year==2007, 'hr']
hr2007.index = baseball.player[baseball.year==2007]
In [30]:
hr2007
Out[30]:
player
francju01     0
francju01     1
zaungr01     10
witasja01     0
williwo02     1
             ..
benitar01     0
benitar01     0
ausmubr01     3
aloumo01     13
alomasa02     0
Name: hr, Length: 92, dtype: int64

Now, let's add them together, in hopes of getting 2-year home run totals:

In [31]:
hr_total = hr2006 + hr2007
hr_total
Out[31]:
player
alomasa02   NaN
aloumo01    NaN
ausmubr01   NaN
benitar01   NaN
benitar01   NaN
             ..
wickmbo01   NaN
williwo02   NaN
witasja01   NaN
womacto01   NaN
zaungr01    NaN
Name: hr, Length: 94, dtype: float64

Pandas' data alignment places NaN values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years.

In [32]:
hr_total[hr_total.notnull()]
Out[32]:
player
finlest01     7.0
gonzalu01    30.0
johnsra05     0.0
myersmi01     0.0
schilcu01     0.0
seleaa01      0.0
Name: hr, dtype: float64

While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with NaN. We can use the add method with the fill_value argument to calculate player home run totals, inserting a zero where labels do not overlap:

In [33]:
hr2007.add(hr2006, fill_value=0)
Out[33]:
player
alomasa02     0.0
aloumo01     13.0
ausmubr01     3.0
benitar01     0.0
benitar01     0.0
             ... 
wickmbo01     0.0
williwo02     1.0
witasja01     0.0
womacto01     1.0
zaungr01     10.0
Name: hr, Length: 94, dtype: float64
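The alignment-plus-fill behavior is easiest to see on a tiny pair of Series with made-up labels:

```python
import pandas as pd

a = pd.Series({'x': 1, 'y': 2})
b = pd.Series({'y': 10, 'z': 20})

# Plain addition aligns on labels; non-overlapping labels become NaN
print((a + b).to_dict())                 # x and z are NaN, y is 12.0

# add with fill_value treats missing entries as 0 instead
print(a.add(b, fill_value=0).to_dict())  # {'x': 1.0, 'y': 12.0, 'z': 20.0}
```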

Operations can also be broadcast between rows or columns.

For example, if we subtract the maximum number of home runs hit from the hr column, we get how many fewer than the maximum were hit by each player:

In [34]:
baseball.hr - baseball.hr.max()
Out[34]:
id
88641   -34
88643   -35
88645   -35
88649   -35
88650   -35
         ..
89525   -35
89526   -35
89530   -32
89533   -22
89534   -35
Name: hr, Length: 100, dtype: int64

Or, looking at things row-wise, we can see how a particular player compares with the rest of the group with respect to important statistics:

In [35]:
baseball.loc[89521, "player"]
Out[35]:
'bondsba01'
In [36]:
stats = baseball[['h','X2b', 'X3b', 'hr']]
diff = stats - stats.loc[89521]
diff[:10]
Out[36]:
h X2b X3b hr
id
88641 -80 -13 0 -27
88643 -93 -14 0 -28
88645 -94 -14 0 -28
88649 -94 -14 0 -28
88650 -93 -14 0 -28
88652 11 7 12 -22
88653 65 38 2 -13
88662 -89 -13 0 -28
89177 -84 -11 0 -28
89178 -84 -14 0 -27

We can also apply functions to each column or row of a DataFrame:

In [37]:
import numpy as np

stats.apply(np.median)
Out[37]:
h      8.0
X2b    1.0
X3b    0.0
hr     0.0
dtype: float64
In [38]:
def range_calc(x):
    return x.max() - x.min()
In [39]:
stat_range = lambda x: x.max() - x.min()
stats.apply(stat_range)
Out[39]:
h      159
X2b     52
X3b     12
hr      35
dtype: int64

Let's use apply to calculate a meaningful baseball statistic, slugging percentage:

$$SLG = \frac{1B + (2 \times 2B) + (3 \times 3B) + (4 \times HR)}{AB}$$

And just for fun, we will format the resulting estimate.

In [40]:
def slugging(x): 
    bases = x['h']-x['X2b']-x['X3b']-x['hr'] + 2*x['X2b'] + 3*x['X3b'] + 4*x['hr']
    ab = x['ab']+1e-6
    
    return bases/ab

baseball.apply(slugging, axis=1).round(3)
Out[40]:
id
88641    0.360
88643    0.500
88645    0.000
88649    0.000
88650    0.167
         ...  
89525    0.000
89526    0.000
89530    0.324
89533    0.524
89534    0.182
Length: 100, dtype: float64

Sorting and Ranking

Pandas objects include methods for re-ordering data.

In [41]:
baseball_newind.sort_index().head()
Out[41]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
alomasa02NYN2007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
aloumo01NYN2007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
ausmubr01HOU2007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
benitar01FLO2007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
benitar01SFN2007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

In [42]:
baseball_newind.sort_index(ascending=False).head()
Out[42]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
zaungr01TOR2007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0
womacto01CHN2006 womacto01 2006 2 CHN NL 19 50 6 5 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
witasja01TBA2007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
williwo02HOU2007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0
wickmbo01ATL2007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

Try sorting the columns instead of the rows, in ascending order:

In [43]:
baseball_newind.sort_index(axis=1).head()
Out[43]:
X2b X3b ab bb cs g gidp h hbp hr ... player r rbi sb sf sh so stint team year
womacto01CHN2006 1 0 50 4 1.0 19 0.0 5 0.0 1 ... womacto01 6 2.0 1.0 0.0 3.0 4.0 2 CHN 2006
schilcu01BOS2006 0 0 2 0 0.0 31 0.0 5 0.0 0 ... schilcu01 0 0.0 0.0 0.0 0.0 1.0 1 BOS 2006
myersmi01NYA2006 0 0 0 0 0.0 62 0.0 5 0.0 0 ... myersmi01 0 0.0 0.0 0.0 0.0 0.0 1 NYA 2006
helliri01MIL2006 0 0 3 0 0.0 20 0.0 5 0.0 0 ... helliri01 0 0.0 0.0 0.0 0.0 2.0 1 MIL 2006
johnsra05NYA2006 0 0 6 0 0.0 33 0.0 5 0.0 0 ... johnsra05 0 0.0 0.0 0.0 0.0 4.0 1 NYA 2006

5 rows × 22 columns

We can also use sort_values to sort a Series by value, rather than by label.

In [44]:
baseball.hr.sort_values()
Out[44]:
id
89534     0
89365     0
89469     0
89367     0
89370     0
         ..
89378    25
89361    26
89521    28
89462    30
89360    35
Name: hr, Length: 100, dtype: int64

For a DataFrame, we can sort according to the values of one or more columns using the by argument of sort_values:

In [45]:
baseball[['player','sb','cs']].sort_values(ascending=[False,True], 
                                           by=['sb', 'cs']).head(10)
Out[45]:
player sb cs
id
89378 sheffga01 22.0 5.0
89430 loftoke01 21.0 4.0
89347 vizquom01 14.0 6.0
89463 greensh01 11.0 1.0
88652 finlest01 7.0 0.0
89462 griffke02 6.0 1.0
89530 ausmubr01 6.0 1.0
89466 gonzalu01 6.0 2.0
89521 bondsba01 5.0 0.0
89438 kleskry01 5.0 1.0

Ranking does not re-arrange the data; instead, it returns a Series giving the rank of each value relative to the others in the Series.

In [46]:
baseball.hr.rank()
Out[46]:
id
88641    62.5
88643    29.0
88645    29.0
88649    29.0
88650    29.0
         ... 
89525    29.0
89526    29.0
89530    71.5
89533    88.0
89534    29.0
Name: hr, Length: 100, dtype: float64

Ties are assigned the mean value of the tied ranks, which may result in decimal values.

In [47]:
pd.Series([100,100]).rank()
Out[47]:
0    1.5
1    1.5
dtype: float64

Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:

In [48]:
baseball.hr.rank(method='first')
Out[48]:
id
88641    58.0
88643     1.0
88645     2.0
88649     3.0
88650     4.0
         ... 
89525    55.0
89526    56.0
89530    72.0
89533    88.0
89534    57.0
Name: hr, Length: 100, dtype: float64
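The available tie-breaking methods can be compared side by side on a small made-up Series:

```python
import pandas as pd

s = pd.Series([7, 7, 2, 9])

# 'average' is the default; the others resolve the tied 7s differently
for method in ['average', 'min', 'max', 'first', 'dense']:
    print(method, s.rank(method=method).tolist())
# average [2.5, 2.5, 1.0, 4.0]
# min     [2.0, 2.0, 1.0, 4.0]
# max     [3.0, 3.0, 1.0, 4.0]
# first   [2.0, 3.0, 1.0, 4.0]
# dense   [2.0, 2.0, 1.0, 3.0]
```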

Calling the DataFrame's rank method results in the ranks of all columns:

In [49]:
baseball.rank(ascending=False).head()
Out[49]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 2.0 96.5 7.0 82.0 31.5 70.0 47.5 40.5 39.0 50.5 ... 51.0 24.5 17.5 44.5 59.0 66.0 65.5 16.0 70.0 76.5
88643 37.5 96.5 57.0 88.0 81.5 55.5 73.0 81.0 63.5 78.0 ... 78.5 63.5 62.5 79.0 73.0 66.0 65.5 67.5 70.0 76.5
88645 47.5 96.5 57.0 40.5 81.5 36.0 91.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5
88649 66.0 96.5 57.0 47.0 31.5 67.5 69.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 67.0 66.0 65.5 67.5 70.0 76.5
88650 61.5 96.5 57.0 40.5 81.5 51.0 64.5 81.0 63.5 78.0 ... 78.5 63.5 62.5 79.0 59.0 66.0 65.5 67.5 70.0 76.5

5 rows × 22 columns

In [50]:
baseball[['r','h','hr']].rank(ascending=False).head()
Out[50]:
r h hr
id
88641 40.5 39.0 38.5
88643 81.0 63.5 72.0
88645 81.0 84.5 72.0
88649 81.0 84.5 72.0
88650 81.0 63.5 72.0

Exercise

Calculate on base percentage for each player, and return the ordered series of estimates.

$$OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}$$
In [51]:
# Write your answer here

Hierarchical indexing

In the baseball example, I was forced to combine 3 fields to obtain a unique index that was not simply an integer value. A more elegant way to have done this would be to create a hierarchical index from the three fields.

In [52]:
baseball_h = baseball.set_index(['year', 'team', 'player'])
baseball_h.head(10)
Out[52]:
stint lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
year team player
2006 CHN womacto01 2 NL 19 50 6 14 1 0 1 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
BOS schilcu01 1 AL 31 2 0 1 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
NYA myersmi01 1 AL 62 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
MIL helliri01 1 NL 20 3 0 0 0 0 0 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
NYA johnsra05 1 AL 33 6 0 1 0 0 0 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
SFN finlest01 1 NL 139 426 66 105 21 12 6 40.0 7.0 0.0 46 55.0 2.0 2.0 3.0 4.0 6.0
ARI gonzalu01 1 NL 153 586 93 159 52 2 15 73.0 0.0 1.0 69 58.0 10.0 7.0 0.0 6.0 14.0
LAN seleaa01 1 NL 28 26 2 5 1 0 0 0.0 0.0 0.0 1 7.0 0.0 0.0 6.0 0.0 1.0
2007 ATL francju01 2 NL 15 40 1 10 3 0 0 8.0 0.0 0.0 4 10.0 1.0 0.0 0.0 1.0 1.0
NYN francju01 1 NL 40 50 7 10 0 0 1 8.0 2.0 1.0 10 13.0 0.0 0.0 0.0 1.0 1.0

This index is a MultiIndex object that consists of a sequence of tuples, the elements of which are some combination of the three columns used to create the index. Where there are multiple repeated values, Pandas does not print the repeats, making it easy to identify groups of values.

In [53]:
baseball_h.index[:10]
Out[53]:
MultiIndex(levels=[[2006, 2007], ['ARI', 'ATL', 'BAL', 'BOS', 'CHA', 'CHN', 'CIN', 'CLE', 'COL', 'DET', 'FLO', 'HOU', 'KCA', 'LAA', 'LAN', 'MIL', 'MIN', 'NYA', 'NYN', 'OAK', 'PHI', 'SDN', 'SFN', 'SLN', 'TBA', 'TEX', 'TOR'], ['alomasa02', 'aloumo01', 'ausmubr01', 'benitar01', 'biggicr01', 'bondsba01', 'cirilje01', 'claytro01', 'clemero02', 'coninje01', 'cormirh01', 'delgaca01', 'easleda01', 'edmonji01', 'embreal01', 'finlest01', 'floydcl01', 'francju01', 'glavito02', 'gomezch02', 'gonzalu01', 'gordoto01', 'graffto01', 'greensh01', 'griffke02', 'guarded01', 'helliri01', 'hernaro01', 'hoffmtr01', 'johnsra05', 'jonesto02', 'kentje01', 'kleskry01', 'loaizes01', 'loftoke01', 'mabryjo01', 'maddugr01', 'martipe02', 'mesajo01', 'moyerja01', 'mussimi01', 'myersmi01', 'oliveda02', 'parkch01', 'perezne01', 'piazzmi01', 'ramirma02', 'rodriiv01', 'rogerke01', 'sandere02', 'schilcu01', 'schmija01', 'seaneru01', 'seleaa01', 'sheffga01', 'smoltjo01', 'sosasa01', 'sprinru01', 'stairma01', 'stantmi02', 'stinnke01', 'suppaje01', 'sweenma01', 'tavarju01', 'thomafr04', 'thomeji01', 'timlimi01', 'trachst01', 'valenjo03', 'villoro01', 'vizquom01', 'wakefti01', 'walketo04', 'weathda01', 'wellsda01', 'whiteri01', 'whitero02', 'wickmbo01', 'williwo02', 'witasja01', 'womacto01', 'zaungr01']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [5, 3, 17, 15, 17, 22, 0, 14, 1, 18], [80, 50, 41, 26, 29, 15, 20, 53, 17, 17]],
           names=['year', 'team', 'player'])
In [54]:
baseball_h.index.is_unique
Out[54]:
True

Try using this hierarchical index to retrieve Julio Franco (francju01), who played for the Atlanta Braves (ATL) in 2007:

In [55]:
baseball_h.loc[(2007, 'ATL', 'francju01')]
Out[55]:
stint     2
lg       NL
g        15
ab       40
r         1
         ..
ibb       1
hbp       0
sh        0
sf        1
gidp      1
Name: (2007, ATL, francju01), Length: 19, dtype: object

Recall earlier we imported some microbiome data using two index columns. This created a 2-level hierarchical index:

In [56]:
mb = pd.read_csv("../data/microbiome.csv", index_col=['Taxon','Patient'])
In [57]:
mb.head(10)
Out[57]:
Group Tissue Stool
Taxon Patient
Firmicutes 1 0 136 4182
2 1 1174 703
3 0 408 3946
4 1 831 8605
5 0 693 50
6 1 718 717
7 0 173 33
8 1 228 80
9 0 162 3196
10 1 372 32

With a hierarchical index, we can select subsets of the data based on a partial index:

In [58]:
mb.loc['Proteobacteria']
Out[58]:
Group Tissue Stool
Patient
1 0 2469 1821
2 1 839 661
3 0 4414 18
4 1 12044 83
5 0 2310 12
... ... ... ...
10 1 6857 795
11 0 483 666
12 1 2950 3994
13 0 1541 816
14 1 1307 53

14 rows × 3 columns

Hierarchical indices can be created on either or both axes. Here is a trivial example:

In [59]:
frame = pd.DataFrame(np.arange(12).reshape(( 4, 3)), 
                  index =[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], 
                  columns =[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])

frame
Out[59]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

If you want to get fancy, both the row and column indices themselves can be given names:

In [60]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[60]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

With this, we can do all sorts of custom indexing:

In [61]:
frame.loc['a', 'Ohio']
Out[61]:
color Green Red
key2
1 0 1
2 3 4

Try retrieving the value corresponding to b2 in Colorado:

In [62]:
# Write your answer here

Additionally, the order of the set of indices in a hierarchical MultiIndex can be changed by swapping them pairwise:

In [63]:
mb.swaplevel('Patient', 'Taxon').head()
Out[63]:
Group Tissue Stool
Patient Taxon
1 Firmicutes 0 136 4182
2 Firmicutes 1 1174 703
3 Firmicutes 0 408 3946
4 Firmicutes 1 831 8605
5 Firmicutes 0 693 50

Data can also be sorted by any index level, using sortlevel (now deprecated in favor of sort_index with a level argument):

In [64]:
mb.sortlevel('Patient', ascending=False).head()
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
  """Entry point for launching an IPython kernel.
Out[64]:
Group Tissue Stool
Taxon Patient
Proteobacteria 14 1 1307 53
Other 14 1 305 32
Firmicutes 14 1 281 2377
Bacteroidetes 14 1 102 33
Actinobacteria 14 1 310 204
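In recent pandas versions, the same operation is written with sort_index and a level argument. A sketch on a small made-up frame with the same index structure:

```python
import pandas as pd

mb_demo = pd.DataFrame(
    {'Tissue': [136, 1174, 2469, 839], 'Stool': [4182, 703, 1821, 661]},
    index=pd.MultiIndex.from_tuples(
        [('Firmicutes', 1), ('Firmicutes', 2),
         ('Proteobacteria', 1), ('Proteobacteria', 2)],
        names=['Taxon', 'Patient']))

# Modern replacement for the deprecated sortlevel:
# sort on the Patient level, highest patient number first
sorted_demo = mb_demo.sort_index(level='Patient', ascending=False)
print(sorted_demo.index.tolist())
```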

Missing data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.

Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used to represent missing values in Python code.

In [65]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo
Out[65]:
0       NaN
1        -3
2      None
3    foobar
dtype: object
In [66]:
foo.isnull()
Out[66]:
0     True
1    False
2     True
3    False
dtype: bool

Missing values may be dropped or indexed out:

In [67]:
test_scores = pd.read_csv('../data/test_scores.csv', index_col=0, nrows=50)
test_scores
Out[67]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False NaN 55 1.0 NaN False False
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 NaN False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False NaN 58 1.0 1.0 True False

50 rows × 11 columns

In [68]:
test_scores.dropna()
Out[68]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
7 39 0 2.0 3.0 True 1.0 50 1.0 1.0 False True
12 109 1 1.0 0.0 False 0.0 48 1.0 1.0 True False
14 91 0 2.0 0.0 False 0.0 58 1.0 1.0 True True
19 65 1 0.0 2.0 False 0.0 50 1.0 1.0 False True
21 71 1 1.0 2.0 False 1.0 58 0.0 0.0 True True
... ... ... ... ... ... ... ... ... ... ... ...
50 61 1 1.0 1.0 True 0.0 57 0.0 1.0 False True
51 84 0 2.0 0.0 False 0.0 52 0.0 1.0 False False
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False

21 rows × 11 columns

In [69]:
test_scores.isnull()
Out[69]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 False False False False False True False False True False False
1 False False False True False False False False False False False
2 False False False False False False False False True False False
3 False False False True False False False False False False False
5 False False False False False False False False True False False
... ... ... ... ... ... ... ... ... ... ... ...
52 False False False False False False False False False False False
53 False False False False False False False False False False False
54 False False False True False False False False False False False
55 False False False False False False False False False False False
56 False False False False False True False False False False False

50 rows × 11 columns

In [70]:
test_scores[test_scores.notnull()]
Out[70]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False NaN 55 1.0 NaN False False
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 NaN False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False NaN 58 1.0 1.0 True False

50 rows × 11 columns

By default, dropna drops entire rows in which one or more values are missing.

This can be overridden by passing the how='all' argument, which only drops a row when every field is a missing value.

In [71]:
test_scores.dropna(how='all')
Out[71]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False NaN 55 1.0 NaN False False
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 NaN False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False NaN 58 1.0 1.0 True False

50 rows × 11 columns

This can be customized further via the thresh argument, which specifies the minimum number of non-missing values a row must contain in order to be kept.

In [72]:
test_scores.dropna(thresh=10)
Out[72]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
6 85 1 1.0 1.0 False 0.0 53 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 NaN False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False NaN 58 1.0 1.0 True False

42 rows × 11 columns

This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
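The thresh behavior is easiest to see on a small frame; the subjects and visits below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements: three visits per subject,
# some of them missing
visits = pd.DataFrame({
    'visit1': [1.0, np.nan, 3.0],
    'visit2': [np.nan, np.nan, 4.0],
    'visit3': [2.0, np.nan, 5.0],
}, index=['s1', 's2', 's3'])

# Keep only subjects with at least two non-missing measurements;
# s2 (all missing) and any row with fewer than two values are dropped
print(visits.dropna(thresh=2).index.tolist())  # ['s1', 's3']
```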

Exercise

Try using the axis argument to drop columns with missing values:

In [73]:
# Write your answer here
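One possible approach, sketched here on a small made-up frame rather than test_scores (not the only answer):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' contains a missing value
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [np.nan, 3.0], 'c': [4.0, 5.0]})

# axis=1 (or axis='columns') drops columns with any missing value
print(df.dropna(axis=1).columns.tolist())  # ['a', 'c']
```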

Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the fillna method.

In [74]:
test_scores.fillna(-999)
Out[74]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False -999.0 55 1.0 -999.0 False False
1 31 1 0.0 -999.0 False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 -999.0 False True
3 75 0 3.0 -999.0 False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 -999.0 False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 -999.0 False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False -999.0 58 1.0 1.0 True False

50 rows × 11 columns

In [75]:
test_scores.fillna({'family_inv': 0, 'prev_disab': 1})
Out[75]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False 1.0 55 1.0 NaN False False
1 31 1 0.0 0.0 False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 0.0 False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 0.0 False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False 1.0 58 1.0 1.0 True False

50 rows × 11 columns

Notice that fillna by default returns a new object with the desired filling behavior, rather than changing the Series or DataFrame in place; in general, this is the behavior we want.

We can alter values in-place using inplace=True.

In [76]:
test_scores.prev_disab.fillna(0, inplace=True)
test_scores
Out[76]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False 0.0 55 1.0 NaN False False
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 NaN False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False 0.0 58 1.0 1.0 True False

50 rows × 11 columns

Missing values can also be filled from adjacent values, using any one of a variety of methods; here, bfill propagates the next valid observation backward:

In [77]:
test_scores.fillna(method='bfill')
Out[77]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False 0.0 55 1.0 0.0 False False
1 31 1 0.0 1.0 False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 1.0 False True
3 75 0 3.0 4.0 False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 1.0 False False
... ... ... ... ... ... ... ... ... ... ... ...
52 67 0 2.0 2.0 False 0.0 49 0.0 1.0 True True
53 89 1 1.0 0.0 False 0.0 48 1.0 1.0 False True
54 76 1 1.0 1.0 False 1.0 56 1.0 1.0 False True
55 98 1 1.0 1.0 True 0.0 51 1.0 1.0 True False
56 93 1 0.0 0.0 False 0.0 58 1.0 1.0 True False

50 rows × 11 columns
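For true interpolation between known values, Series and DataFrame objects also provide an interpolate method; a minimal example on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation (the default) fills the gap between known values
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0]
```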

Data summarization

We often wish to summarize data in Series or DataFrame objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.

In [78]:
baseball.sum()
Out[78]:
player    womacto01schilcu01myersmi01helliri01johnsra05f...
year                                                 200692
stint                                                   113
team      CHNBOSNYAMILNYASFNARILANATLNYNTORTBAHOUARIATLM...
lg        NLALALNLALNLNLNLNLNLALALNLNLNLALNLNLNLNLALALNL...
                                ...                        
ibb                                                     177
hbp                                                     112
sh                                                      138
sf                                                      120
gidp                                                    354
Length: 22, dtype: object

Clearly, sum is more meaningful for some columns than others. For methods like mean, for which application to string variables is not just meaningless but impossible, these columns are automatically excluded:

In [79]:
baseball.mean()
Out[79]:
year     2006.92
stint       1.13
g          52.38
ab        136.54
r          18.69
          ...   
ibb         1.77
hbp         1.12
sh          1.38
sf          1.20
gidp        3.54
Length: 19, dtype: float64

The important difference between NumPy's functions and Pandas' methods is that the latter have built-in support for handling missing data.
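The contrast is easy to demonstrate on a small made-up sequence containing a missing value:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# NumPy's function lets the NaN propagate...
print(np.mean(values))           # nan

# ...while the Pandas method skips missing values by default
print(pd.Series(values).mean())  # 2.0
```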

In [80]:
test_scores.mean()
Out[80]:
score            78.500000
male              0.480000
siblings          1.020000
family_inv        0.972973
non_english       0.320000
                   ...    
age_test         53.120000
non_severe_hl     0.420000
mother_hs         0.787879
early_ident       0.280000
non_white         0.460000
Length: 11, dtype: float64

Sometimes we may not want to ignore missing values, preferring instead to allow the NaN to propagate:

In [81]:
test_scores.mean(skipna=False)
Out[81]:
score            78.50
male              0.48
siblings          1.02
family_inv         NaN
non_english       0.32
                 ...  
age_test         53.12
non_severe_hl     0.42
mother_hs          NaN
early_ident       0.28
non_white         0.46
Length: 11, dtype: float64

Passing axis=1 will summarize over rows instead of columns, which only makes sense in certain situations.

In [82]:
extra_bases = baseball[['X2b','X3b','hr']].sum(axis=1)
extra_bases.sort_values(ascending=False)
Out[82]:
id
88653    69
89439    57
89361    56
89462    55
89396    54
         ..
89355     0
89354     0
89480     0
89348     0
89420     0
Length: 100, dtype: int64

A useful summarization that gives a quick snapshot of multiple statistics for a Series or DataFrame is describe:

In [83]:
baseball.describe()
Out[83]:
year stint g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
count 100.00000 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 100.000000
mean 2006.92000 1.130000 52.380000 136.540000 18.69000 35.820000 7.390000 0.550000 4.370000 18.47000 1.380000 0.460000 15.490000 24.080000 1.770000 1.12000 1.380000 1.200000 3.540000
std 0.27266 0.337998 48.031299 181.936853 27.77496 50.221807 11.117277 1.445124 7.975537 28.34793 3.694878 1.067613 25.812649 32.804496 5.042957 2.23055 2.919042 2.035046 5.201826
min 2006.00000 1.000000 1.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
25% 2007.00000 1.000000 9.500000 2.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 1.000000 0.000000 0.00000 0.000000 0.000000 0.000000
50% 2007.00000 1.000000 33.000000 40.500000 2.00000 8.000000 1.000000 0.000000 0.000000 2.00000 0.000000 0.000000 1.000000 7.000000 0.000000 0.00000 0.000000 0.000000 1.000000
75% 2007.00000 1.000000 83.250000 243.750000 33.25000 62.750000 11.750000 1.000000 6.000000 27.00000 1.000000 0.000000 19.250000 37.250000 1.250000 1.00000 1.000000 2.000000 6.000000
max 2007.00000 2.000000 155.000000 586.000000 107.00000 159.000000 52.000000 12.000000 35.000000 96.00000 22.000000 6.000000 132.000000 134.000000 43.000000 11.00000 14.000000 9.000000 21.000000

describe can detect non-numeric data and sometimes yield useful information about it.

In [84]:
baseball.player.describe()
Out[84]:
count           100
unique           82
top       gomezch02
freq              2
Name: player, dtype: object

We can also calculate summary statistics across multiple columns, for example, correlation and covariance.

$$cov(x,y) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$$
In [85]:
baseball.hr.cov(baseball.X2b)
Out[85]:
69.07646464646454
$$corr(x,y) = \frac{cov(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
In [86]:
baseball.hr.corr(baseball.X2b)
Out[86]:
0.7790615182539742
In [87]:
baseball.ab.corr(baseball.h)
Out[87]:
0.9942174036272377
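The corr method can be reproduced directly from the formula above; a quick check on two small made-up Series:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0])

# Correlation from the definition: centered cross-products over
# the product of the sums of squared deviations
manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

print(manual)            # 0.6
print(x.corr(y))         # 0.6 (matches the manual calculation)
```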

Try running corr on the entire baseball DataFrame to see what is returned:

In [88]:
# Write answer here

If we have a DataFrame with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:

In [89]:
mb.head()
Out[89]:
Group Tissue Stool
Taxon Patient
Firmicutes 1 0 136 4182
2 1 1174 703
3 0 408 3946
4 1 831 8605
5 0 693 50
In [90]:
mb.sum(level='Taxon')
Out[90]:
Group Tissue Stool
Taxon
Firmicutes 7 9634 30172
Proteobacteria 7 42508 12483
Actinobacteria 7 6167 1615
Bacteroidetes 7 8880 4276
Other 7 2868 242

Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

In [91]:
mb.to_csv("mb.csv")

The to_csv method writes a DataFrame to a comma-separated values (CSV) file. You can specify custom delimiters (via the sep argument), how missing values are written (na_rep), whether the index is written (index), and whether the header is included (header), among other options.
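A few of these options combined, writing a small made-up frame to an in-memory buffer rather than a file on disk:

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

# Tab-delimited, missing values written as 'NA', index column omitted
buf = io.StringIO()
df.to_csv(buf, sep='\t', na_rep='NA', index=False)
print(buf.getvalue())
# a	b
# 1.0	3.0
# NA	4.0
```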

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.

In [92]:
baseball.to_pickle("baseball_pickle")

The complement to to_pickle is the read_pickle function, which restores the pickle to a DataFrame or Series:

In [93]:
pd.read_pickle("baseball_pickle")
Out[93]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

100 rows × 22 columns

As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python.


References

McKinney, W. Python for Data Analysis. O'Reilly Media.