An incredible amount of work has gone into making the open-source Retrosheet and Lahman data sets as accurate as possible, but they are not perfectly accurate.
Prior to performing any data analysis, it is necessary to check the accuracy of the data. The limitations of the data will define the limitations of the analysis.
Performing data consistency tests is also a great way to validate the data processing pipeline and to learn about the data.
The pytest test suite supplied in this repo automatically runs a superset of the tests performed in this notebook. The purpose of this notebook is to explain the key data consistency tests in more detail to better understand the data.
A brief summary of the wrangled data is provided below in the MLB Data Summary section.
Prior to performing the data consistency tests, it is necessary to verify that the fields which should uniquely identify a record actually do. This is performed in the section called Primary and Foreign Key tests.
Three types of data consistency tests will be performed.
1. Retrosheet Stats vs Lahman Stats
2. Individual Stats vs Team Stats (Retrosheet and Lahman)
3. Batting Stats vs Pitching Allowed Stats (Retrosheet and Lahman)
All data between 1974 and 2019 inclusive will be used. Retrosheet has all play-by-play data over this time period.
The data consistency tests use several different hard-coded tolerance values, as small discrepancies between independently compiled data sets are to be expected.
Retrosheet vs Lahman

- Retrosheet Batting vs Lahman Batting Players
- Retrosheet Batting vs Lahman Batting Stats
- Retrosheet Pitching vs Lahman Pitching Players
- Retrosheet Pitching vs Lahman Pitching Stats
- Retrosheet Fielding vs Lahman Fielding Players
- Retrosheet Fielding vs Lahman Fielding Stats

Batting vs Pitching Allowed

- Retrosheet Pitching Allowed vs Retrosheet Hitting
- Lahman Pitching Allowed vs Lahman Hitting

Individual vs Team

- Retrosheet Batting vs Retrosheet Team Batting
- Lahman Batting vs Lahman Team Batting
- Retrosheet Fielding vs Retrosheet Team Fielding
- Lahman Fielding vs Lahman Team Fielding
- Retrosheet Pitching vs Retrosheet Team Pitching
- Lahman Pitching vs Lahman Team Pitching
This notebook assumes that the Lahman and Retrosheet data sets have been downloaded and wrangled using the scripts in the ../download_scripts directory of this repo.
For this notebook, Retrosheet data from 1974 through 2019 inclusive is used.
The from_csv_with_types() function in ../download_scripts/data_helper.py uses pd.read_csv() with dtypes set from the types read in from <filename>_types.csv. This allows Pandas to reuse the previously optimized data types, which require about one-third as much memory.
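The actual loader lives in data_helper.py; the sketch below illustrates the idea using in-memory CSV text and a hypothetical two-column types file (name, dtype) — the real file format may differ.

```python
import io
import pandas as pd

def from_csv_with_types_sketch(csv_text, types_text):
    """Read a CSV using dtypes persisted in a companion _types.csv (sketch)."""
    types = pd.read_csv(io.StringIO(types_text))
    dtype_map = dict(zip(types['name'], types['dtype']))
    # datetime columns must go through parse_dates=, not dtype=
    date_cols = [c for c, t in dtype_map.items() if t.startswith('datetime')]
    for c in date_cols:
        del dtype_map[c]
    return pd.read_csv(io.StringIO(csv_text), dtype=dtype_map, parse_dates=date_cols)

# hypothetical data and types files
csv_text = "player_id,year,hr\naaroh101,1974,20\n"
types_text = "name,dtype\nplayer_id,object\nyear,uint16\nhr,uint8\n"
df = from_csv_with_types_sketch(csv_text, types_text)
```

Persisting the optimized dtypes means the narrow integer types are restored on every load, rather than letting read_csv() default everything to int64 and float64.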
Main CSV files:
Lahman
Retrosheet
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re
from scipy.stats import linregress
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100 # increase dpi, will make figures larger and clearer
import sys
# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()
pd.set_option("display.max_columns", 50)
Loading all the data up front makes the code clearer, but uses more memory.
As optimized Pandas data types were persisted when the data was wrangled, total memory usage is about one-third of what it would be if Pandas inferred the data types with pd.read_csv().
A minor drawback of using the smallest integer type that will hold the data is that some Pandas methods do not compare different integer types properly without first promoting them to the same larger integer type. Aggregations work properly for all Pandas data types.
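The work-around used later in this notebook is to promote the small nullable unsigned types to a common wider type before arithmetic. A minimal sketch on toy data:

```python
import pandas as pd

# toy frame using small nullable unsigned integer dtypes
df = pd.DataFrame({'po': pd.array([250, 200], dtype='UInt8'),
                   'a':  pd.array([300, 400], dtype='UInt16')})

# promote UInt8/UInt16 columns to a common wider type before aggregating
promoted = df.copy()
small = promoted.dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in promoted.columns[small]:
    promoted[col] = promoted[col].astype('Int32')

totals = promoted.agg('sum')
```

After promotion, sums and cross-column comparisons behave the same regardless of the original narrow dtypes.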
lahman_people = dh.from_csv_with_types(lahman_data / 'people.csv')
lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')
lahman_batting = dh.from_csv_with_types(lahman_data / 'batting.csv')
lahman_pitching = dh.from_csv_with_types(lahman_data / 'pitching.csv')
lahman_fielding = dh.from_csv_with_types(lahman_data / 'fielding.csv')
# restrict Lahman data to be between 1974 and 2019 inclusive
lahman_batting = lahman_batting.query('1974 <= year <= 2019')
lahman_pitching = lahman_pitching.query('1974 <= year <= 2019')
lahman_fielding = lahman_fielding.query('1974 <= year <= 2019')
lahman_teams = lahman_teams.query('1974 <= year <= 2019')
retro_batting = dh.from_csv_with_types(retrosheet_data / 'batting.csv.gz')
retro_pitching = dh.from_csv_with_types(retrosheet_data / 'pitching.csv.gz')
retro_fielding = dh.from_csv_with_types(retrosheet_data / 'fielding.csv.gz')
retro_team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz')
retro_game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz')
# restrict Retrosheet data to be between 1974 and 2019 inclusive
retro_batting = retro_batting.query('1974 <= year <= 2019')
retro_pitching = retro_pitching.query('1974 <= year <= 2019')
retro_fielding = retro_fielding.query('1974 <= year <= 2019')
retro_team_game = retro_team_game.query('1974 <= year <= 2019')
retro_game = retro_game.query('1974 <= game_start.dt.year <= 2019')
# verify these years exist in the downloaded data
(retro_batting['year'].agg(['min', 'max']) == (1974, 2019)).all()
True
# verify all the years are in the downloaded data
retro_batting['year'].nunique() == (2019 - 1974) + 1
True
Perform these tests before checking for data consistency. In the optional ../download_scripts/postgres_load_data.py script, primary and foreign key constraints are used to enforce these data checks.
The Lahman people.csv file contains a mapping between Lahman player_id and Retrosheet player_id.
Ensure that this mapping is "one-to-one" and "onto", which just means: each Retrosheet player_id appears at most once in the mapping, and every Retrosheet player_id is mapped to by some Lahman player_id.
Although the code that follows allows for a missing mapping, there are no missing mappings for players between 1974 and 2019 (as of the December 2019 Lahman data update).
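On toy data (hypothetical ids), the two properties can be checked directly with is_unique and a subset test:

```python
import pandas as pd

# toy mapping table in the style of Lahman's people.csv (hypothetical ids)
people = pd.DataFrame({'player_id': ['aaroh01', 'olivt01'],
                       'retro_id':  ['aaroh101', 'olivt102']})
retro_players = pd.Series(['aaroh101', 'olivt102'], name='player_id')

# one-to-one: no Retrosheet id maps to two Lahman ids
one_to_one = people['retro_id'].is_unique
# onto: every Retrosheet id appears in the mapping
onto = set(retro_players).issubset(set(people['retro_id']))
```

The inner join used below is an equivalent check: if the join has exactly as many rows as there are unique Retrosheet players, the mapping is both one-to-one and onto.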
# verify Lahman player_id is unique
dh.is_unique(lahman_people, ['player_id'])
True
# verify Lahman's retrosheet player_id is unique or missing
dh.is_unique(lahman_people, ['retro_id'], ignore_null=True)
True
# get the unique player ids from Retrosheet
retro_players = pd.Series(retro_batting['player_id'].unique(), name='player_id')
print(f'There are {len(retro_players)} unique players in Retrosheet')
There are 9383 unique players in Retrosheet
# use an inner join to verify that the Lahman player_id to Retrosheet player_id
# mapping is one-to-one and onto
mapping = lahman_people[['player_id', 'retro_id']].merge(retro_players, how='inner',
left_on=['retro_id'], right_on=['player_id'],
suffixes=('_lahman', '_retro'))
print(f'There are {len(mapping)} player_ids that match between Lahman and Retrosheet')
There are 9383 player_ids that match between Lahman and Retrosheet
len(retro_players) == len(mapping)
True
# Given the above, the following must be True
r_players = set(retro_players)
l_players = set(lahman_people['retro_id'])
r_players.issubset(l_players)
True
Same analysis as above, but with (team_id, year) instead of player_id.
# verify Lahman (team_id, year) is unique
dh.is_unique(lahman_teams, ['team_id', 'year'])
True
# verify Lahman's retrosheet (team_id, year) is unique
dh.is_unique(lahman_teams, ['team_id_retro', 'year'])
True
# alternatively, Pandas' drop_duplicates() could have been used
retro_team_ids = set(zip(retro_team_game['team_id'],
retro_team_game['year']))
retro_team_ids_df = pd.DataFrame(retro_team_ids, columns=['team_id', 'year'])
print(f'There are {len(retro_team_ids_df)} unique (team_id, year) in Retrosheet')
There are 1288 unique (team_id, year) in Retrosheet
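The drop_duplicates() alternative mentioned in the comment above would look like this on toy data:

```python
import pandas as pd

# toy team-game rows: BOS appears in two games in the same year
team_game = pd.DataFrame({'team_id': ['BOS', 'BOS', 'NYA'],
                          'year':    [1974, 1974, 1974]})

# drop_duplicates() yields the same unique (team_id, year) pairs
# as building a set of tuples
retro_team_ids_df = team_game[['team_id', 'year']].drop_duplicates().reset_index(drop=True)
```

Either approach works; the set-of-tuples version makes the later issubset() test direct, while drop_duplicates() stays in DataFrame form.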
# use an inner join to verify that the mapping is one-to-one and onto
mapping = lahman_teams.merge(retro_team_ids_df, how='inner',
left_on=['team_id_retro', 'year'],
right_on=['team_id','year'])
print(f'There are {len(mapping)} (team_id, year) that match between Lahman and Retrosheet')
There are 1288 (team_id, year) that match between Lahman and Retrosheet
len(retro_team_ids) == len(mapping)
True
# Given the above, the following must be True
lahman_team_ids = set(zip(lahman_teams['team_id_retro'], lahman_teams['year']))
retro_team_ids.issubset(lahman_team_ids)
True
dh.is_unique(lahman_batting, ['player_id', 'year', 'stint'])
True
dh.is_unique(lahman_pitching, ['player_id', 'year', 'stint'])
True
dh.is_unique(lahman_fielding, ['player_id', 'year', 'stint', 'pos'])
True
# Note: 1 duplicate key was cleaned during wrangling for data between 1948 and 2019
# Additional entries were cleaned for earlier years
# The two entries for Chris Young (younc004) were summed into a single entry
# Manually spot check with the box score for this game at Baseball Reference
# to verify that summing all stats (other than g) was the correct approach
dh.game_id_to_url('BOS201708250')
dh.is_unique(retro_batting, ['player_id', 'game_id'])
True
dh.is_unique(retro_pitching, ['player_id', 'game_id'])
True
dh.is_unique(retro_fielding, ['player_id', 'game_id', 'pos'])
True
dh.is_unique(retro_team_game, ['team_id', 'game_id'])
True
dh.is_unique(retro_game, ['game_id'])
True
The download scripts ensured that columns with the same meaning were given the same name. This makes it easy to compare data between CSV files using their column names.
# verify that Lahman and Retrosheet have stats on the same set of batters
lahman_batters = pd.merge(lahman_batting['player_id'], lahman_people[['player_id', 'retro_id']])
l_batters = set(lahman_batters['retro_id'].unique())
r_batters = set(retro_batting['player_id'].unique())
r_batters == l_batters
True
# batting columns to compare
cols = set(retro_batting.columns) & set(lahman_batting.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
17
cols
{'ab', 'bb', 'cs', 'double', 'g', 'gidp', 'h', 'hbp', 'hr', 'ibb', 'r', 'rbi', 'sb', 'sf', 'sh', 'so', 'triple'}
# aggregate the stats in common for all players for all years (1974 thru 2019)
l = lahman_batting[cols]
r = retro_batting[cols]
l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
ab 1.426074e-07 bb 1.487730e-06 cs 3.295273e-05 double 2.922456e-06 g 1.397775e-06 gidp 8.906306e-05 h 0.000000e+00 hbp 0.000000e+00 hr 0.000000e+00 ibb 1.776294e-05 r 0.000000e+00 rbi 5.724236e-06 sb 2.964764e-05 sf 0.000000e+00 sh 1.411652e-05 so 0.000000e+00 triple 0.000000e+00 dtype: float64
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')
0.000089
# all 17 batting attributes from 1974-2019 between Lahman and Retrosheet
# are within plus/minus 0.01% of each other when summed
(np.abs(1.0 - (l_sums / r_sums)) < .0001).all()
True
# verify that Lahman and Retrosheet have stats on exactly the same set of pitchers
lahman_pitchers = pd.merge(lahman_pitching['player_id'], lahman_people[['player_id', 'retro_id']])
l_pitchers = set(lahman_pitchers['retro_id'].unique())
r_pitchers = set(retro_pitching['player_id'].unique())
r_pitchers == l_pitchers
True
# pitching columns to compare
cols = set(retro_pitching.columns) & set(lahman_pitching.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
21
cols
{'bb', 'bk', 'cg', 'er', 'g', 'gf', 'gidp', 'gs', 'h', 'hbp', 'hr', 'ibb', 'l', 'r', 'sf', 'sh', 'sho', 'so', 'sv', 'w', 'wp'}
# aggregate the stats in common for all players for all years (1974 thru 2019)
l = lahman_pitching[cols]
r = retro_pitching[cols]
l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
bb 0.000001 bk 0.000403 cg 0.000327 er 0.000002 g 0.000105 gf 0.000032 gidp 0.000006 gs 0.000000 h 0.000000 hbp 0.000000 hr 0.000005 ibb 0.000018 l 0.000000 r 0.000000 sf 0.000000 sh 0.000014 sho 0.000598 so 0.000000 sv 0.000081 w 0.000000 wp 0.000355 dtype: float64
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')
0.000598
# verify all values between 1974 and 2019 are within plus/minus 0.06% of each other
(np.abs(1.0 - (l_sums / r_sums)) < .0006).all()
True
# verify that Lahman and Retrosheet have stats on exactly the same set of fielders
lahman_fielders = pd.merge(lahman_fielding['player_id'], lahman_people[['player_id', 'retro_id']])
l_fielders = set(lahman_fielders['retro_id'].unique())
r_fielders = set(retro_fielding['player_id'].unique())
r_fielders == l_fielders
False
# Lahman has no fielders that are not in Retrosheet
l_fielders - r_fielders
set()
# Retrosheet has one fielder that is not in Lahman
r_fielders - l_fielders
{'olivt102'}
# the fielder missing from Lahman
missing_fielder = f'{(r_fielders - l_fielders).pop()}'
missing = retro_fielding.query(f'player_id == "{missing_fielder}"')
missing
game_id | player_id | pos | team_id | g | gs | inn_outs | tc | po | a | e | dp | tp | pb | xi | game_start | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
791764 | BOS197604190 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-19 11:05:00 | 1976 |
791786 | BOS197604200 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-20 15:03:00 | 1976 |
816021 | NYA197604180 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-18 14:07:00 | 1976 |
831226 | TEX197604110 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-11 00:00:00 | 1976 |
# Tony had no fielding Total Chances (tc)
missing['tc'].sum()
0
# Tony was on the field for 0 outs (inn_outs)
missing['inn_outs'].sum()
0
Tony Oliva (olivt102) started 4 games as a second baseman.
He had zero total chances and no outs were recorded during the time he played second base.
This sounds like a pinch hitter or a designated hitter, rather than a fielder. Let's check the box scores and player information on Baseball-Reference.
# right click on each generated link to see the box score on Baseball Reference
for game_id in missing['game_id']:
dh.game_id_to_url(game_id)
# right click on generated link to see Tony Oliva on Baseball Reference
bb_player_id = lahman_people.query('retro_id == "olivt102"')['bb_ref_id'].values[0]
dh.player_id_to_url(bb_player_id)
Tony Oliva led off in each of these four away games and was immediately replaced, whether he got on base or not. He was only in the game to hit once. He might have been unable to run or field well for these games, but he could still hit, so he was in the lineup.
Tony was listed as the starting second baseman, even though he was a career right fielder. Tony was not going to play second base; the player who replaced him, Jerry Terrell, was.
The cwdaily parser created a fielding record for Tony as he was on the lineup card as the starting second baseman.
It would be hard to argue that either the Lahman or the Retrosheet data is wrong in this scenario. The only difference is that Retrosheet shows 4 starts at second base for Tony Oliva that Lahman does not.
The fielding players are effectively the same in Lahman and Retrosheet (1974 through 2019).
# fielding columns to compare
cols = set(retro_fielding.columns) & set(lahman_fielding.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
8
cols
{'a', 'dp', 'e', 'g', 'gs', 'inn_outs', 'po', 'pos'}
# aggregate the stats in common per position over all players over all years (1974 thru 2019)
l = lahman_fielding[cols].copy()
r = retro_fielding[cols]
# work-around for Pandas 1.0.1 bugs
# 1) sum does not up-cast for nullable integer types
# 2) select_dtypes does not distinguish between nullable and non-nullable integer types
idx = l.dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in l.columns[idx]:
l[col] = l[col].astype('Int32')
l_sums = l.groupby('pos').agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.groupby('pos').agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 2.470157e-05 | 0.000737 | 0.000008 | 0.006393 | 0.000833 | 0.000029 | 0.000482 |
2B | 2.179551e-06 | 0.000007 | 0.000017 | 0.001008 | 0.000309 | 0.000063 | 0.002122 |
3B | 9.807978e-06 | 0.000908 | 0.000177 | 0.001483 | 0.000109 | 0.000019 | 0.001850 |
C | 5.448877e-07 | 0.000982 | 0.000059 | 0.007810 | 0.000683 | 0.000005 | 0.000668 |
CF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
LF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
OF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P | 4.722360e-06 | 0.002302 | 0.000105 | 0.001890 | 0.001415 | 0.000000 | 0.005532 |
RF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SS | 8.354944e-06 | 0.000730 | 0.000004 | 0.001009 | 0.000102 | 0.000068 | 0.001301 |
# Lahman uses OF for sum of LF, CF, RF -- account for this
r_sums.loc['OF'] = r_sums.loc['LF'] + r_sums.loc['CF'] + r_sums.loc['RF']
r_sums = r_sums.drop(['LF', 'CF', 'RF'])
r_sums.sort_index(inplace=True)
r_sums
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 5505722.0 | 171005.0 | 237097.0 | 138119.0 | 14409.0 | 205280.0 | 1798401.0 |
2B | 5505722.0 | 135428.0 | 235233.0 | 599404.0 | 19443.0 | 205280.0 | 422753.0 |
3B | 5505722.0 | 37465.0 | 237078.0 | 404527.0 | 27552.0 | 205280.0 | 149770.0 |
C | 5505722.0 | 13236.0 | 236247.0 | 106279.0 | 14642.0 | 205280.0 | 1349560.0 |
OF | 16517166.0 | 8256.0 | 735430.0 | 38885.0 | 24381.0 | 615840.0 | 1352944.0 |
P | 5505722.0 | 17807.0 | 695259.0 | 243922.0 | 16956.0 | 205280.0 | 112264.0 |
SS | 5505722.0 | 127447.0 | 232003.0 | 610365.0 | 29333.0 | 205280.0 | 319687.0 |
# The above will overcount games, as Lahman will have a player in the game once as an OF,
# whereas Retrosheet may have him in the game as both a LF and CF, for example.
r_of = retro_fielding.query('pos in ["LF", "CF", "RF"]')
total_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()
counted_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()
r_sums.loc['OF', 'g'] -= (total_dups - counted_dups)
r_sums
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 5505722.0 | 171005.0 | 237097.0 | 138119.0 | 14409.0 | 205280.0 | 1798401.0 |
2B | 5505722.0 | 135428.0 | 235233.0 | 599404.0 | 19443.0 | 205280.0 | 422753.0 |
3B | 5505722.0 | 37465.0 | 237078.0 | 404527.0 | 27552.0 | 205280.0 | 149770.0 |
C | 5505722.0 | 13236.0 | 236247.0 | 106279.0 | 14642.0 | 205280.0 | 1349560.0 |
OF | 16517166.0 | 8256.0 | 711605.0 | 38885.0 | 24381.0 | 615840.0 | 1352944.0 |
P | 5505722.0 | 17807.0 | 695259.0 | 243922.0 | 16956.0 | 205280.0 | 112264.0 |
SS | 5505722.0 | 127447.0 | 232003.0 | 610365.0 | 29333.0 | 205280.0 | 319687.0 |
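The keep argument to duplicated() drives the adjustment above: keep=False flags every row of a duplicated (player_id, game_id) pair, while keep='first' leaves the first occurrence unflagged, so the difference counts the over-counted games. A toy example (hypothetical ids):

```python
import pandas as pd

# toy outfield appearances: one player at both LF and CF in the same game
of = pd.DataFrame({'player_id': ['smithj01', 'smithj01', 'jonesb01'],
                   'game_id':   ['BOS197404090'] * 3,
                   'pos':       ['LF', 'CF', 'RF']})

# keep=False flags both smithj01 rows; keep='first' flags only the second
total_dups = of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()
counted_dups = of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()
extra_games = total_dups - counted_dups  # games over-counted in the OF total
```

Here smithj01 played one game but would be counted twice when LF and CF rows are summed into OF, so one game must be subtracted.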
# compute relative differences
np.abs(1.0 - l_sums / r_sums)
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 2.470157e-05 | 0.000737 | 0.000008 | 0.006393 | 0.000833 | 0.000029 | 0.000482 |
2B | 2.179551e-06 | 0.000007 | 0.000017 | 0.001008 | 0.000309 | 0.000063 | 0.002122 |
3B | 9.807978e-06 | 0.000908 | 0.000177 | 0.001483 | 0.000109 | 0.000019 | 0.001850 |
C | 5.448877e-07 | 0.000982 | 0.000059 | 0.007810 | 0.000683 | 0.000005 | 0.000668 |
OF | 6.599195e-06 | 0.001453 | 0.000311 | 0.002546 | 0.000656 | 0.000013 | 0.000072 |
P | 4.722360e-06 | 0.002302 | 0.000105 | 0.001890 | 0.001415 | 0.000000 | 0.005532 |
SS | 8.354944e-06 | 0.000730 | 0.000004 | 0.001009 | 0.000102 | 0.000068 | 0.001301 |
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max().max():8.6f}')
0.007810
# verify all values between 1974 and 2019 are within plus/minus 0.8% of each other
np.abs(1.0 - (l_sums / r_sums)).max().max() < 0.008
True
exclude = ['game_id', 'team_id', 'player_id', 'g', 'game_start', 'year']
cols = set(retro_pitching.columns) & set(retro_batting.columns) - set(exclude)
cols = list(cols)
cols
['gidp', 'triple', 'sf', 'hr', 'h', 'so', 'ab', 'hr4', 'hbp', 'double', 'r', 'ibb', 'sh', 'xi', 'bb', 'tb']
# sum over all pitchers over all years
p = retro_pitching[cols].agg('sum')
p
gidp 157192 triple 41532 sf 58275 hr 195985 h 1826823 so 1301700 ab 7012260 hr4 4710 hbp 58097 double 342178 r 923811 ibb 56297 sh 70839 xi 1060 bb 672165 tb 2840020 dtype: int64
# sum over all batters over all years
b = retro_batting[cols].agg('sum')
b
gidp 157192 triple 41532 sf 58275 hr 195985 h 1826823 so 1301700 ab 7012260 hr4 4710 hbp 58097 double 342178 r 923811 ibb 56297 sh 70839 xi 1060 bb 672165 tb 2840020 dtype: int64
# Retrosheet is completely consistent
p.equals(b)
True
exclude = ['lg_id', 'player_id', 'stint', 'team_id', 'year', 'g']
pkey = ['player_id', 'stint', 'year']
cols = set(lahman_pitching.columns) & set(lahman_batting.columns)
cols -= set(exclude)
cols
{'bb', 'gidp', 'h', 'hbp', 'hr', 'ibb', 'r', 'sf', 'sh', 'so'}
# sum over all pitchers over all years (1974 thru 2019)
p = lahman_pitching[cols].agg('sum')
p
gidp 157193.0 so 1301700.0 sf 58275.0 h 1826823.0 hr 195986.0 hbp 58097.0 r 923811.0 ibb 56296.0 sh 70838.0 bb 672166.0 dtype: float64
# sum over all batters over all years (1974 thru 2019)
b = lahman_batting[cols].agg('sum')
b
gidp 157178.0 so 1301700.0 sf 58275.0 h 1826823.0 hr 195985.0 hbp 58097.0 r 923811.0 ibb 56296.0 sh 70838.0 bb 672166.0 dtype: float64
# compute the relative differences
np.abs(1.0 - p/b)
gidp 0.000095 so 0.000000 sf 0.000000 h 0.000000 hr 0.000005 hbp 0.000000 r 0.000000 ibb 0.000000 sh 0.000000 bb 0.000000 dtype: float64
# find the max difference
np.abs(1.0 - p/b).max()
9.543320311999892e-05
# the biggest difference is less than 0.01%
np.abs(1.0 - p/b).max() < 0.0001
True
exclude = ['game_id', 'team_id', 'player_id', 'game_start', 'year']
cols = set(retro_batting.columns) & set(retro_team_game.columns) - set(exclude)
cols = list(cols)
len(cols)
17
cols
['gidp', 'triple', 'sf', 'hr', 'h', 'cs', 'ab', 'rbi', 'hbp', 'so', 'double', 'r', 'ibb', 'sh', 'sb', 'xi', 'bb']
# retro_batting was wrangled from the output of cwdaily
b = retro_batting[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
b = b.reset_index()
b.head(4)
game_id | team_id | gidp | triple | sf | hr | h | cs | ab | rbi | hbp | so | double | r | ibb | sh | sb | xi | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 0 | 0 | 1 | 1 | 12 | 0 | 38 | 5 | 0 | 7 | 0 | 5 | 0 | 0 | 2 | 0 | 5 |
1 | ANA199704020 | BOS | 0 | 0 | 0 | 1 | 9 | 0 | 36 | 6 | 1 | 12 | 2 | 6 | 0 | 0 | 0 | 0 | 8 |
2 | ANA199704030 | ANA | 2 | 0 | 1 | 0 | 7 | 0 | 27 | 2 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 4 |
3 | ANA199704030 | BOS | 3 | 0 | 0 | 0 | 5 | 0 | 29 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)
game_id | team_id | gidp | triple | sf | hr | h | cs | ab | rbi | hbp | so | double | r | ibb | sh | sb | xi | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 0 | 0 | 1 | 1 | 12 | 0 | 38 | 5 | 0 | 7 | 0 | 5 | 0 | 0 | 2 | 0 | 5 |
1 | ANA199704020 | BOS | 0 | 0 | 0 | 1 | 9 | 0 | 36 | 6 | 1 | 12 | 2 | 6 | 0 | 0 | 0 | 0 | 8 |
2 | ANA199704030 | ANA | 2 | 0 | 1 | 0 | 7 | 0 | 27 | 2 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 4 |
3 | ANA199704030 | BOS | 3 | 0 | 0 | 0 | 5 | 0 | 29 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Retrosheet is completely consistent between cwdaily and cwgame for batting
b.equals(tg)
True
exclude = ['lg_id', 'team_id', 'year', 'g']
key = ['team_id', 'year']
cols = set(lahman_batting.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
12
cols
['triple', 'sf', 'hr', 'h', 'cs', 'ab', 'so', 'hbp', 'double', 'r', 'sb', 'bb']
# work-around for Pandas 1.0.1 bugs
# 1) sum does not up-cast for nullable integer types
# 2) select_dtypes does not distinguish between nullable and non-nullable int types
idx = lahman_batting[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in lahman_batting[cols].columns[idx]:
lahman_batting[col] = lahman_batting[col].astype('Int32')
idx = lahman_teams[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in lahman_teams[cols].columns[idx]:
lahman_teams[col] = lahman_teams[col].astype('Int32')
b = lahman_batting[key + cols].groupby(key).agg('sum')
b = b.reset_index()
b.head(4)
team_id | year | triple | sf | hr | h | cs | ab | so | hbp | double | r | sb | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 25.0 | 57 | 161.0 | 1531 | 72 | 5628 | 953 | 45 | 279.0 | 829.0 | 126 | 617.0 |
1 | ANA | 1998 | 27.0 | 41 | 147.0 | 1530 | 45 | 5630 | 1028 | 48 | 314.0 | 787.0 | 93 | 510.0 |
2 | ANA | 1999 | 22.0 | 42 | 158.0 | 1404 | 45 | 5494 | 1022 | 43 | 248.0 | 711.0 | 71 | 511.0 |
3 | ANA | 2000 | 34.0 | 43 | 236.0 | 1574 | 52 | 5628 | 1024 | 47 | 309.0 | 864.0 | 93 | 608.0 |
t = lahman_teams[key + cols].sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | triple | sf | hr | h | cs | ab | so | hbp | double | r | sb | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 25 | 57 | 161 | 1531 | 72 | 5628 | 953 | 45 | 279 | 829 | 126 | 617 |
1 | ANA | 1998 | 27 | 41 | 147 | 1530 | 45 | 5630 | 1028 | 48 | 314 | 787 | 93 | 510 |
2 | ANA | 1999 | 22 | 42 | 158 | 1404 | 45 | 5494 | 1022 | 43 | 248 | 711 | 71 | 511 |
3 | ANA | 2000 | 34 | 43 | 236 | 1574 | 52 | 5628 | 1024 | 47 | 309 | 864 | 93 | 608 |
# ensure the dtypes are the same
for col in t.columns:
if not col == 'team_id' and not col =='year':
b[col] = b[col].astype('int')
t[col] = t[col].astype('int')
b[cols].equals(t[cols])
True
Double plays (dp) are not compared because each fielder involved gets credit for a double play whereas the team as a whole only gets credit for the complete double play.
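For example, a 6-4-3 double play credits three fielders with one dp each, while the team records a single double play (toy data, hypothetical ids):

```python
import pandas as pd

# toy 6-4-3 double play: shortstop, second baseman, and first baseman
# are each credited with one dp
fielder_dp = pd.DataFrame({'player_id': ['ss01', 'sb01', 'fb01'],
                           'dp': [1, 1, 1]})
individual_total = fielder_dp['dp'].sum()  # sum of individual credits
team_total = 1                             # the team records one double play
```

Summing individual dp therefore cannot match the team total, which is why dp is excluded from this comparison.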
cols = ['a', 'e', 'po', 'pb']
cols
['a', 'e', 'po', 'pb']
# retro_fielding was wrangled from the output of cwdaily
f = retro_fielding[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
f = f.reset_index()
f.head(4)
game_id | team_id | a | e | po | pb | |
---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(
['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)
game_id | team_id | a | e | po | pb | |
---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |
# Retrosheet is completely consistent between cwdaily and cwgame for fielding
f.equals(tg)
True
exclude = ['lg_id', 'team_id', 'year', 'g', 'dp', 'player_id']
key = ['team_id', 'year']
cols = set(lahman_fielding.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
1
cols
['e']
f = lahman_fielding[key + cols].groupby(key).agg('sum')
f = f.sort_index().reset_index()
f.head(4)
team_id | year | e | |
---|---|---|---|
0 | ANA | 1997 | 123 |
1 | ANA | 1998 | 106 |
2 | ANA | 1999 | 106 |
3 | ANA | 2000 | 134 |
t = lahman_teams[key+cols]
t = t.sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | e | |
---|---|---|---|
0 | ANA | 1997 | 123 |
1 | ANA | 1998 | 106 |
2 | ANA | 1999 | 106 |
3 | ANA | 2000 | 134 |
# ensure the dtypes are the same for errors (e)
f[cols] = f[cols].astype('int')
t[cols] = t[cols].astype('int')
(f[cols] - t[cols]).max()
e 2 dtype: int64
m = pd.merge(f, t,
left_on=['team_id', 'year'],
right_on=['team_id', 'year'],
suffixes=['_fielding', '_teams'])
m.query("e_fielding != e_teams")
team_id | year | e_fielding | e_teams | |
---|---|---|---|---|
653 | MIN | 1977 | 144 | 143 |
1014 | SDN | 2007 | 94 | 92 |
When comparing large numbers, it is best to examine their relative differences. When comparing small numbers, it is best to look at their absolute differences.
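A quick illustration of why:

```python
import numpy as np

# large counts: a relative difference is meaningful
big_l, big_r = 1_000_100, 1_000_000
rel_diff = np.abs(1.0 - big_l / big_r)   # about 1e-4

# small counts (e.g. a season's team errors): a 2-error gap is easier
# to judge as an absolute difference than as a percentage
small_l, small_r = 94, 92
abs_diff = abs(small_l - small_r)        # 2 errors
```

For the error totals above, "off by at most 2 errors per team-season" is a more interpretable tolerance than a relative threshold.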
((f[cols] - t[cols]).max() <= 2).all()
True
cols = ['wp', 'bk', 'er']
cols
['wp', 'bk', 'er']
# retro_pitching was wrangled from the output of cwdaily
p = retro_pitching[cols].agg('sum')
p
wp 64799 bk 9919 er 841197 dtype: int64
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[cols].agg('sum')
tg
wp 64799 bk 9919 er 841197 dtype: int64
# Retrosheet is completely consistent between cwdaily and cwgame for pitching
p.equals(tg)
True
Most of the column names shared between pitching and team pitching are for batting stats, not pitching stats.
exclude = ['lg_id', 'team_id', 'year', 'g', 'era',
'bb', 'h', 'hbp', 'hr', 'r', 'sf', 'so', 'sho', 'er']
key = ['team_id', 'year']
cols = set(lahman_pitching.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
5
cols
['cg', 'l', 'sv', 'w', 'ip_outs']
p = lahman_pitching[key + cols].groupby(key).agg('sum').reset_index()
p.head(4)
team_id | year | cg | l | sv | w | ip_outs | |
---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 9 | 78 | 39 | 84 | 4364 |
1 | ANA | 1998 | 3 | 77 | 52 | 85 | 4332 |
2 | ANA | 1999 | 4 | 92 | 37 | 70 | 4294 |
3 | ANA | 2000 | 5 | 80 | 46 | 82 | 4344 |
t = lahman_teams[key+cols].sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | cg | l | sv | w | ip_outs | |
---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 9 | 78 | 39 | 84 | 4364 |
1 | ANA | 1998 | 3 | 77 | 52 | 85 | 4332 |
2 | ANA | 1999 | 4 | 92 | 37 | 70 | 4294 |
3 | ANA | 2000 | 5 | 80 | 46 | 82 | 4344 |
# dtypes need to be the same
for col in p.columns:
if not col == 'year' and not col == 'team_id':
p[col] = p[col].astype('int')
t[col] = t[col].astype('int')
np.abs(p[cols] - t[cols]).max()
cg 0 l 1 sv 0 w 1 ip_outs 0 dtype: int64
m = pd.merge(p, t, on = ['team_id','year'], suffixes=['_p', '_t'])
s_cols = m.columns.sort_values()
m = m[s_cols]
# display the differences
for col in cols:
cp = col+'_p'
ct = col+'_t'
query = cp + ' != '+ ct
tmp = m.query(query)[['team_id', 'year'] +[cp, ct]]
if len(tmp):
display(tmp)
team_id | year | l_p | l_t | |
---|---|---|---|---|
196 | CHA | 1979 | 86 | 87 |
329 | CLE | 1974 | 84 | 85 |
team_id | year | w_p | w_t | |
---|---|---|---|---|
407 | DET | 1979 | 84 | 85 |
1184 | TEX | 1974 | 83 | 84 |
Twice, the team value for wins or losses was 1 greater than the sum of that team's pitchers' wins or losses.
Unlike shutouts and earned runs, which are intentionally counted differently for individuals and teams, wins and losses should be counted the same. The above shows a minor data discrepancy.
It may seem like a lot of work to cross check the data in so many different ways, but doing so provides accuracy bounds which are important to the subsequent data analysis.