An incredible amount of work has gone into making the open-source Retrosheet and Lahman data sets as accurate as possible, but they are not perfectly accurate.
Prior to performing any data analysis, it is necessary to check the accuracy of the data. The limitations of the data will define the limitations of the analysis.
Performing data consistency tests is also a great way to validate the data processing pipeline and to learn about the data.
The pytest test suite supplied in this repo automatically runs a superset of the tests performed in this notebook. The purpose of this notebook is to explain the key data consistency tests in more detail to better understand the data.
A brief summary of the wrangled data is provided below in the MLB Data Summary section.
Prior to performing the data consistency tests, it is necessary to verify that the fields which should uniquely identify a record actually do. This is performed in the section called Primary and Foreign Key tests.
Three types of data consistency tests will be performed.
1. Retrosheet Stats vs Lahman Stats
2. Individual Stats vs Team Stats (Retrosheet and Lahman)
3. Batting Stats vs Pitching Allowed Stats (Retrosheet and Lahman)
All data between 1974 and 2019 inclusive will be used. Retrosheet has all play-by-play data over this time period.
The data consistency tests use several different hard-coded tolerance values, as small discrepancies between independently compiled data sets are to be expected.
Retrosheet vs Lahman

- Retrosheet Batting vs Lahman Batting Players
- Retrosheet Batting vs Lahman Batting Stats
- Retrosheet Pitching vs Lahman Pitching Players
- Retrosheet Pitching vs Lahman Pitching Stats
- Retrosheet Fielding vs Lahman Fielding Players
- Retrosheet Fielding vs Lahman Fielding Stats

Batting vs Pitching Allowed

- Retrosheet Pitching Allowed vs Retrosheet Hitting
- Lahman Pitching Allowed vs Lahman Hitting

Individual vs Team

- Retrosheet Batting vs Retrosheet Team Batting
- Lahman Batting vs Lahman Team Batting
- Retrosheet Fielding vs Retrosheet Team Fielding
- Lahman Fielding vs Lahman Team Fielding
- Retrosheet Pitching vs Retrosheet Team Pitching
- Lahman Pitching vs Lahman Team Pitching
This notebook assumes that the Lahman and Retrosheet data sets have been downloaded and wrangled using the scripts in the ../download_scripts directory of this repo.
For this notebook, Retrosheet data from 1974 through 2019 inclusive is used.
The from_csv_with_types() function in ../download_scripts/data_helper.py uses pd.read_csv() with dtypes set from the types read in from <filename>_types.csv. This allows Pandas to reuse the previously optimized data types, which require about one-third as much memory.
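The actual loader lives in data_helper.py; the sketch below illustrates the idea using in-memory CSV text and a hypothetical two-column types file (name, dtype) — the real file format may differ.

```python
import io
import pandas as pd

def from_csv_with_types_sketch(csv_text, types_text):
    """Read a CSV using dtypes persisted in a companion _types.csv (sketch)."""
    types = pd.read_csv(io.StringIO(types_text))
    dtype_map = dict(zip(types['name'], types['dtype']))
    # datetime columns must go through parse_dates=, not dtype=
    date_cols = [c for c, t in dtype_map.items() if t.startswith('datetime')]
    for c in date_cols:
        del dtype_map[c]
    return pd.read_csv(io.StringIO(csv_text), dtype=dtype_map, parse_dates=date_cols)

# hypothetical data and types files
csv_text = "player_id,year,hr\naaroh101,1974,20\n"
types_text = "name,dtype\nplayer_id,object\nyear,uint16\nhr,uint8\n"
df = from_csv_with_types_sketch(csv_text, types_text)
```

Persisting the optimized dtypes means the narrow integer types are restored on every load, rather than letting read_csv() default everything to int64 and float64.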
Main CSV files:
Lahman
Retrosheet
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re
from scipy.stats import linregress
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100 # increase dpi, will make figures larger and clearer
import sys
# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()
pd.set_option("display.max_columns", 50)
Loading all the data up front makes the code clearer, but uses more memory.
As optimized Pandas data types were persisted when the data was wrangled, total memory usage is about one-third of what it would be if Pandas inferred the data types with pd.read_csv().
A minor drawback of using the smallest integer type that will hold the data is that some Pandas methods do not compare different integer types properly without first promoting them to the same larger integer type. Aggregations work properly for all Pandas data types.
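The work-around used later in this notebook is to promote the small nullable unsigned types to a common wider type before arithmetic. A minimal sketch on toy data:

```python
import pandas as pd

# toy frame using small nullable unsigned integer dtypes
df = pd.DataFrame({'po': pd.array([250, 200], dtype='UInt8'),
                   'a':  pd.array([300, 400], dtype='UInt16')})

# promote UInt8/UInt16 columns to a common wider type before aggregating
promoted = df.copy()
small = promoted.dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in promoted.columns[small]:
    promoted[col] = promoted[col].astype('Int32')

totals = promoted.agg('sum')
```

After promotion, sums and cross-column comparisons behave the same regardless of the original narrow dtypes.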
lahman_people = dh.from_csv_with_types(lahman_data / 'people.csv')
lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')
lahman_batting = dh.from_csv_with_types(lahman_data / 'batting.csv')
lahman_pitching = dh.from_csv_with_types(lahman_data / 'pitching.csv')
lahman_fielding = dh.from_csv_with_types(lahman_data / 'fielding.csv')
# restrict Lahman data to be between 1974 and 2019 inclusive
lahman_batting = lahman_batting.query('1974 <= year <= 2019')
lahman_pitching = lahman_pitching.query('1974 <= year <= 2019')
lahman_fielding = lahman_fielding.query('1974 <= year <= 2019')
lahman_teams = lahman_teams.query('1974 <= year <= 2019')
retro_batting = dh.from_csv_with_types(retrosheet_data / 'batting.csv.gz')
retro_pitching = dh.from_csv_with_types(retrosheet_data / 'pitching.csv.gz')
retro_fielding = dh.from_csv_with_types(retrosheet_data / 'fielding.csv.gz')
retro_team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz')
retro_game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz')
# restrict Retrosheet data to be between 1974 and 2019 inclusive
retro_batting = retro_batting.query('1974 <= year <= 2019')
retro_pitching = retro_pitching.query('1974 <= year <= 2019')
retro_fielding = retro_fielding.query('1974 <= year <= 2019')
retro_team_game = retro_team_game.query('1974 <= year <= 2019')
retro_game = retro_game.query('1974 <= game_start.dt.year <= 2019')
# verify these years exist in the downloaded data
(retro_batting['year'].agg(['min', 'max']) == (1974, 2019)).all()
True
# verify all the years are in the downloaded data
retro_batting['year'].nunique() == (2019 - 1974) + 1
True
Perform these tests before checking for data consistency. In the optional ../download_scripts/postgres_load_data.py script, primary and foreign key constraints are used to enforce these data checks.
The Lahman people.csv file contains a mapping between Lahman player_id and Retrosheet player_id.
Ensure that this mapping is "one-to-one" and "onto", which just means: each Retrosheet player_id appears at most once in the mapping, and every Retrosheet player_id is mapped to by some Lahman player_id.
Although the code that follows allows for a missing mapping, there are no missing mappings for players between 1974 and 2019 (as of the December 2019 Lahman data update).
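On toy data (hypothetical ids), the two properties can be checked directly with is_unique and a subset test:

```python
import pandas as pd

# toy mapping table in the style of Lahman's people.csv (hypothetical ids)
people = pd.DataFrame({'player_id': ['aaroh01', 'olivt01'],
                       'retro_id':  ['aaroh101', 'olivt102']})
retro_players = pd.Series(['aaroh101', 'olivt102'], name='player_id')

# one-to-one: no Retrosheet id maps to two Lahman ids
one_to_one = people['retro_id'].is_unique
# onto: every Retrosheet id appears in the mapping
onto = set(retro_players).issubset(set(people['retro_id']))
```

The inner join used below is an equivalent check: if the join has exactly as many rows as there are unique Retrosheet players, the mapping is both one-to-one and onto.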
# verify Lahman player_id is unique
dh.is_unique(lahman_people, ['player_id'])
True
# verify Lahman's retrosheet player_id is unique or missing
dh.is_unique(lahman_people, ['retro_id'], ignore_null=True)
True
# get the unique player ids from Retrosheet
retro_players = pd.Series(retro_batting['player_id'].unique(), name='player_id')
print(f'There are {len(retro_players)} unique players in Retrosheet')
There are 9383 unique players in Retrosheet
# use an inner join to verify that the Lahman player_id to Retrosheet player_id
# mapping is one-to-one and onto
mapping = lahman_people[['player_id', 'retro_id']].merge(retro_players, how='inner',
left_on=['retro_id'], right_on=['player_id'],
suffixes=('_lahman', '_retro'))
print(f'There are {len(mapping)} player_ids that match between Lahman and Retrosheet')
There are 9383 player_ids that match between Lahman and Retrosheet
len(retro_players) == len(mapping)
True
# Given the above, the following must be True
r_players = set(retro_players)
l_players = set(lahman_people['retro_id'])
r_players.issubset(l_players)
True
Same analysis as above, but with (team_id, year) instead of player_id.
# verify Lahman (team_id, year) is unique
dh.is_unique(lahman_teams, ['team_id', 'year'])
True
# verify Lahman's retrosheet (team_id, year) is unique
dh.is_unique(lahman_teams, ['team_id_retro', 'year'])
True
# alternatively, Pandas' drop_duplicates() could have been used
retro_team_ids = set(zip(retro_team_game['team_id'],
retro_team_game['year']))
retro_team_ids_df = pd.DataFrame(retro_team_ids, columns=['team_id', 'year'])
print(f'There are {len(retro_team_ids_df)} unique (team_id, year) in Retrosheet')
There are 1288 unique (team_id, year) in Retrosheet
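The drop_duplicates() alternative mentioned in the comment above would look like this on toy data:

```python
import pandas as pd

# toy team-game rows: BOS appears in two games in the same year
team_game = pd.DataFrame({'team_id': ['BOS', 'BOS', 'NYA'],
                          'year':    [1974, 1974, 1974]})

# drop_duplicates() yields the same unique (team_id, year) pairs
# as building a set of tuples
retro_team_ids_df = team_game[['team_id', 'year']].drop_duplicates().reset_index(drop=True)
```

Either approach works; the set-of-tuples version makes the later issubset() test direct, while drop_duplicates() stays in DataFrame form.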
# use an inner join to verify that the mapping is one-to-one and onto
mapping = lahman_teams.merge(retro_team_ids_df, how='inner',
left_on=['team_id_retro', 'year'],
right_on=['team_id','year'])
print(f'There are {len(mapping)} (team_id, year) that match between Lahman and Retrosheet')
There are 1288 (team_id, year) that match between Lahman and Retrosheet
len(retro_team_ids) == len(mapping)
True
# Given the above, the following must be True
lahman_team_ids = set(zip(lahman_teams['team_id_retro'], lahman_teams['year']))
retro_team_ids.issubset(lahman_team_ids)
True
dh.is_unique(lahman_batting, ['player_id', 'year', 'stint'])
True
dh.is_unique(lahman_pitching, ['player_id', 'year', 'stint'])
True
dh.is_unique(lahman_fielding, ['player_id', 'year', 'stint', 'pos'])
True
# Note: 1 duplicate key was cleaned during wrangling for data between 1948 and 2019
# Additional entries were cleaned for earlier years
# The two entries for Chris Young (younc004) were summed into a single entry
# Manually spot check with the box score for this game at Baseball Reference
# to verify that summing all stats (other than g) was the correct approach
dh.game_id_to_url('BOS201708250')
dh.is_unique(retro_batting, ['player_id', 'game_id'])
True
dh.is_unique(retro_pitching, ['player_id', 'game_id'])
True
dh.is_unique(retro_fielding, ['player_id', 'game_id', 'pos'])
True
dh.is_unique(retro_team_game, ['team_id', 'game_id'])
True
dh.is_unique(retro_game, ['game_id'])
True
The download scripts ensured that columns with the same meaning were given the same name. This makes it easy to compare data between CSV files using their column names.
# verify that Lahman and Retrosheet have stats on the same set of batters
lahman_batters = pd.merge(lahman_batting['player_id'], lahman_people[['player_id', 'retro_id']])
l_batters = set(lahman_batters['retro_id'].unique())
r_batters = set(retro_batting['player_id'].unique())
r_batters == l_batters
True
# batting columns to compare
cols = set(retro_batting.columns) & set(lahman_batting.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
17
cols
{'ab', 'bb', 'cs', 'double', 'g', 'gidp', 'h', 'hbp', 'hr', 'ibb', 'r', 'rbi', 'sb', 'sf', 'sh', 'so', 'triple'}
# aggregate the stats in common for all players for all years (1974 thru 2019)
l = lahman_batting[cols]
r = retro_batting[cols]
l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
ab 1.426074e-07 bb 1.487730e-06 cs 3.295273e-05 double 2.922456e-06 g 1.397775e-06 gidp 8.906306e-05 h 0.000000e+00 hbp 0.000000e+00 hr 0.000000e+00 ibb 1.776294e-05 r 0.000000e+00 rbi 5.724236e-06 sb 2.964764e-05 sf 0.000000e+00 sh 1.411652e-05 so 0.000000e+00 triple 0.000000e+00 dtype: float64
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')
0.000089
# all 17 batting attributes from 1974-2019 between Lahman and Retrosheet
# are within plus/minus 0.01% of each other when summed
(np.abs(1.0 - (l_sums / r_sums)) < .0001).all()
True
# verify that Lahman and Retrosheet have stats on exactly the same set of pitchers
lahman_pitchers = pd.merge(lahman_pitching['player_id'], lahman_people[['player_id', 'retro_id']])
l_pitchers = set(lahman_pitchers['retro_id'].unique())
r_pitchers = set(retro_pitching['player_id'].unique())
r_pitchers == l_pitchers
True
# pitching columns to compare
cols = set(retro_pitching.columns) & set(lahman_pitching.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
21
cols
{'bb', 'bk', 'cg', 'er', 'g', 'gf', 'gidp', 'gs', 'h', 'hbp', 'hr', 'ibb', 'l', 'r', 'sf', 'sh', 'sho', 'so', 'sv', 'w', 'wp'}
# aggregate the stats in common for all players for all years (1974 thru 2019)
l = lahman_pitching[cols]
r = retro_pitching[cols]
l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
bb 0.000001 bk 0.000403 cg 0.000327 er 0.000002 g 0.000105 gf 0.000032 gidp 0.000006 gs 0.000000 h 0.000000 hbp 0.000000 hr 0.000005 ibb 0.000018 l 0.000000 r 0.000000 sf 0.000000 sh 0.000014 sho 0.000598 so 0.000000 sv 0.000081 w 0.000000 wp 0.000355 dtype: float64
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')
0.000598
# verify all values between 1974 and 2019 are within plus/minus 0.06% of each other
(np.abs(1.0 - (l_sums / r_sums)) < .0006).all()
True
# verify that Lahman and Retrosheet have stats on exactly the same set of fielders
lahman_fielders = pd.merge(lahman_fielding['player_id'], lahman_people[['player_id', 'retro_id']])
l_fielders = set(lahman_fielders['retro_id'].unique())
r_fielders = set(retro_fielding['player_id'].unique())
r_fielders == l_fielders
False
# Lahman has no fielders that are not in Retrosheet
l_fielders - r_fielders
set()
# Retrosheet has one fielder that is not in Lahman
r_fielders - l_fielders
{'olivt102'}
# the fielder missing from Lahman
missing_fielder = f'{(r_fielders - l_fielders).pop()}'
missing = retro_fielding.query(f'player_id == "{missing_fielder}"')
missing
game_id | player_id | pos | team_id | g | gs | inn_outs | tc | po | a | e | dp | tp | pb | xi | game_start | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
791764 | BOS197604190 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-19 11:05:00 | 1976 |
791786 | BOS197604200 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-20 15:03:00 | 1976 |
816021 | NYA197604180 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-18 14:07:00 | 1976 |
831226 | TEX197604110 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-11 00:00:00 | 1976 |
# Tony had no fielding Total Chances (tc)
missing['tc'].sum()
0
# Tony was on the field for 0 outs (inn_outs)
missing['inn_outs'].sum()
0
Tony Oliva (olivt102) started 4 games as a second baseman.
He had zero total chances and no outs were recorded during the time he played second base.
This sounds like a pinch hitter or a designated hitter, rather than a fielder. Let's check the box scores and player information on Baseball-Reference.
# right click on each generated link to see the box score on Baseball Reference
for game_id in missing['game_id']:
dh.game_id_to_url(game_id)
# right click on generated link to see Tony Oliva on Baseball Reference
bb_player_id = lahman_people.query('retro_id == "olivt102"')['bb_ref_id'].values[0]
dh.player_id_to_url(bb_player_id)
Tony Oliva led off in each of these four away games and was immediately replaced, whether he got on base or not. He was only in the game to hit once. He might have been unable to run or field well for these games, but he could still hit, so he was in the lineup.
Tony was listed as the starting second baseman, even though he was a career right fielder. Tony was not going to play second base; the player who replaced him, Jerry Terrell, was.
The cwdaily parser created a fielding record for Tony as he was on the lineup card as the starting second baseman.
It would be hard to argue that either the Lahman or the Retrosheet data is wrong in this scenario. The only difference is that Retrosheet shows 4 starts at second base for Tony Oliva that Lahman does not.
The fielding players are effectively the same in Lahman and Retrosheet (1974 through 2019).
# fielding columns to compare
cols = set(retro_fielding.columns) & set(lahman_fielding.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)
8
cols
{'a', 'dp', 'e', 'g', 'gs', 'inn_outs', 'po', 'pos'}
# aggregate the stats in common per position over all players over all years (1974 thru 2019)
l = lahman_fielding[cols].copy()
r = retro_fielding[cols]
# work-around for Pandas 1.0.1 bugs
# 1) sum does not up-cast for nullable integer types
# 2) select_dtypes does not distinguish between nullable and non-nullable integer types
idx = l.dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in l.columns[idx]:
l[col] = l[col].astype('Int32')
l_sums = l.groupby('pos').agg('sum')
l_sums.sort_index(inplace=True)
r_sums = r.groupby('pos').agg('sum')
r_sums.sort_index(inplace=True)
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 2.470157e-05 | 0.000737 | 0.000008 | 0.006393 | 0.000833 | 0.000029 | 0.000482 |
2B | 2.179551e-06 | 0.000007 | 0.000017 | 0.001008 | 0.000309 | 0.000063 | 0.002122 |
3B | 9.807978e-06 | 0.000908 | 0.000177 | 0.001483 | 0.000109 | 0.000019 | 0.001850 |
C | 5.448877e-07 | 0.000982 | 0.000059 | 0.007810 | 0.000683 | 0.000005 | 0.000668 |
CF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
LF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
OF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P | 4.722360e-06 | 0.002302 | 0.000105 | 0.001890 | 0.001415 | 0.000000 | 0.005532 |
RF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SS | 8.354944e-06 | 0.000730 | 0.000004 | 0.001009 | 0.000102 | 0.000068 | 0.001301 |
# Lahman uses OF for sum of LF, CF, RF -- account for this
r_sums.loc['OF'] = r_sums.loc['LF'] + r_sums.loc['CF'] + r_sums.loc['RF']
r_sums = r_sums.drop(['LF', 'CF', 'RF'])
r_sums.sort_index(inplace=True)
r_sums
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 5505722.0 | 171005.0 | 237097.0 | 138119.0 | 14409.0 | 205280.0 | 1798401.0 |
2B | 5505722.0 | 135428.0 | 235233.0 | 599404.0 | 19443.0 | 205280.0 | 422753.0 |
3B | 5505722.0 | 37465.0 | 237078.0 | 404527.0 | 27552.0 | 205280.0 | 149770.0 |
C | 5505722.0 | 13236.0 | 236247.0 | 106279.0 | 14642.0 | 205280.0 | 1349560.0 |
OF | 16517166.0 | 8256.0 | 735430.0 | 38885.0 | 24381.0 | 615840.0 | 1352944.0 |
P | 5505722.0 | 17807.0 | 695259.0 | 243922.0 | 16956.0 | 205280.0 | 112264.0 |
SS | 5505722.0 | 127447.0 | 232003.0 | 610365.0 | 29333.0 | 205280.0 | 319687.0 |
# The above will overcount games, as Lahman will have a player in the game once as an OF,
# whereas Retrosheet may have him in the game as both a LF and CF, for example.
r_of = retro_fielding.query('pos in ["LF", "CF", "RF"]')
total_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()
counted_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()
r_sums.loc['OF', 'g'] -= (total_dups - counted_dups)
r_sums
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 5505722.0 | 171005.0 | 237097.0 | 138119.0 | 14409.0 | 205280.0 | 1798401.0 |
2B | 5505722.0 | 135428.0 | 235233.0 | 599404.0 | 19443.0 | 205280.0 | 422753.0 |
3B | 5505722.0 | 37465.0 | 237078.0 | 404527.0 | 27552.0 | 205280.0 | 149770.0 |
C | 5505722.0 | 13236.0 | 236247.0 | 106279.0 | 14642.0 | 205280.0 | 1349560.0 |
OF | 16517166.0 | 8256.0 | 711605.0 | 38885.0 | 24381.0 | 615840.0 | 1352944.0 |
P | 5505722.0 | 17807.0 | 695259.0 | 243922.0 | 16956.0 | 205280.0 | 112264.0 |
SS | 5505722.0 | 127447.0 | 232003.0 | 610365.0 | 29333.0 | 205280.0 | 319687.0 |
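The keep argument to duplicated() drives the adjustment above: keep=False flags every row of a duplicated (player_id, game_id) pair, while keep='first' leaves the first occurrence unflagged, so the difference counts the over-counted games. A toy example (hypothetical ids):

```python
import pandas as pd

# toy outfield appearances: one player at both LF and CF in the same game
of = pd.DataFrame({'player_id': ['smithj01', 'smithj01', 'jonesb01'],
                   'game_id':   ['BOS197404090'] * 3,
                   'pos':       ['LF', 'CF', 'RF']})

# keep=False flags both smithj01 rows; keep='first' flags only the second
total_dups = of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()
counted_dups = of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()
extra_games = total_dups - counted_dups  # games over-counted in the OF total
```

Here smithj01 played one game but would be counted twice when LF and CF rows are summed into OF, so one game must be subtracted.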
# compute relative differences
np.abs(1.0 - l_sums / r_sums)
inn_outs | dp | g | a | e | gs | po | |
---|---|---|---|---|---|---|---|
pos | |||||||
1B | 2.470157e-05 | 0.000737 | 0.000008 | 0.006393 | 0.000833 | 0.000029 | 0.000482 |
2B | 2.179551e-06 | 0.000007 | 0.000017 | 0.001008 | 0.000309 | 0.000063 | 0.002122 |
3B | 9.807978e-06 | 0.000908 | 0.000177 | 0.001483 | 0.000109 | 0.000019 | 0.001850 |
C | 5.448877e-07 | 0.000982 | 0.000059 | 0.007810 | 0.000683 | 0.000005 | 0.000668 |
OF | 6.599195e-06 | 0.001453 | 0.000311 | 0.002546 | 0.000656 | 0.000013 | 0.000072 |
P | 4.722360e-06 | 0.002302 | 0.000105 | 0.001890 | 0.001415 | 0.000000 | 0.005532 |
SS | 8.354944e-06 | 0.000730 | 0.000004 | 0.001009 | 0.000102 | 0.000068 | 0.001301 |
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max().max():8.6f}')
0.007810
# verify all values between 1974 and 2019 are within plus/minus 0.8% of each other
np.abs(1.0 - (l_sums / r_sums)).max().max() < 0.008
True
exclude = ['game_id', 'team_id', 'player_id', 'g', 'game_start', 'year']
cols = set(retro_pitching.columns) & set(retro_batting.columns) - set(exclude)
cols = list(cols)
cols
['gidp', 'triple', 'sf', 'hr', 'h', 'so', 'ab', 'hr4', 'hbp', 'double', 'r', 'ibb', 'sh', 'xi', 'bb', 'tb']
# sum over all pitchers over all years
p = retro_pitching[cols].agg('sum')
p
gidp 157192 triple 41532 sf 58275 hr 195985 h 1826823 so 1301700 ab 7012260 hr4 4710 hbp 58097 double 342178 r 923811 ibb 56297 sh 70839 xi 1060 bb 672165 tb 2840020 dtype: int64
# sum over all batters over all years
b = retro_batting[cols].agg('sum')
b
gidp 157192 triple 41532 sf 58275 hr 195985 h 1826823 so 1301700 ab 7012260 hr4 4710 hbp 58097 double 342178 r 923811 ibb 56297 sh 70839 xi 1060 bb 672165 tb 2840020 dtype: int64
# Retrosheet is completely consistent
p.equals(b)
True
exclude = ['lg_id', 'player_id', 'stint', 'team_id', 'year', 'g']
pkey = ['player_id', 'stint', 'year']
cols = set(lahman_pitching.columns) & set(lahman_batting.columns)
cols -= set(exclude)
cols
{'bb', 'gidp', 'h', 'hbp', 'hr', 'ibb', 'r', 'sf', 'sh', 'so'}
# sum over all pitchers over all years (1974 thru 2019)
p = lahman_pitching[cols].agg('sum')
p
gidp 157193.0 so 1301700.0 sf 58275.0 h 1826823.0 hr 195986.0 hbp 58097.0 r 923811.0 ibb 56296.0 sh 70838.0 bb 672166.0 dtype: float64
# sum over all batters over all years (1974 thru 2019)
b = lahman_batting[cols].agg('sum')
b
gidp 157178.0 so 1301700.0 sf 58275.0 h 1826823.0 hr 195985.0 hbp 58097.0 r 923811.0 ibb 56296.0 sh 70838.0 bb 672166.0 dtype: float64
# compute the relative differences
np.abs(1.0 - p/b)
gidp 0.000095 so 0.000000 sf 0.000000 h 0.000000 hr 0.000005 hbp 0.000000 r 0.000000 ibb 0.000000 sh 0.000000 bb 0.000000 dtype: float64
# find the max difference
np.abs(1.0 - p/b).max()
9.543320311999892e-05
# the biggest difference is less than 0.01%
np.abs(1.0 - p/b).max() < 0.0001
True
exclude = ['game_id', 'team_id', 'player_id', 'game_start', 'year']
cols = set(retro_batting.columns) & set(retro_team_game.columns) - set(exclude)
cols = list(cols)
len(cols)
17
cols
['gidp', 'triple', 'sf', 'hr', 'h', 'cs', 'ab', 'rbi', 'hbp', 'so', 'double', 'r', 'ibb', 'sh', 'sb', 'xi', 'bb']
# retro_batting was wrangled from the output of cwdaily
b = retro_batting[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
b = b.reset_index()
b.head(4)
game_id | team_id | gidp | triple | sf | hr | h | cs | ab | rbi | hbp | so | double | r | ibb | sh | sb | xi | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 0 | 0 | 1 | 1 | 12 | 0 | 38 | 5 | 0 | 7 | 0 | 5 | 0 | 0 | 2 | 0 | 5 |
1 | ANA199704020 | BOS | 0 | 0 | 0 | 1 | 9 | 0 | 36 | 6 | 1 | 12 | 2 | 6 | 0 | 0 | 0 | 0 | 8 |
2 | ANA199704030 | ANA | 2 | 0 | 1 | 0 | 7 | 0 | 27 | 2 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 4 |
3 | ANA199704030 | BOS | 3 | 0 | 0 | 0 | 5 | 0 | 29 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)
game_id | team_id | gidp | triple | sf | hr | h | cs | ab | rbi | hbp | so | double | r | ibb | sh | sb | xi | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 0 | 0 | 1 | 1 | 12 | 0 | 38 | 5 | 0 | 7 | 0 | 5 | 0 | 0 | 2 | 0 | 5 |
1 | ANA199704020 | BOS | 0 | 0 | 0 | 1 | 9 | 0 | 36 | 6 | 1 | 12 | 2 | 6 | 0 | 0 | 0 | 0 | 8 |
2 | ANA199704030 | ANA | 2 | 0 | 1 | 0 | 7 | 0 | 27 | 2 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 4 |
3 | ANA199704030 | BOS | 3 | 0 | 0 | 0 | 5 | 0 | 29 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Retrosheet is completely consistent between cwdaily and cwgame for batting
b.equals(tg)
True
exclude = ['lg_id', 'team_id', 'year', 'g']
key = ['team_id', 'year']
cols = set(lahman_batting.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
12
cols
['triple', 'sf', 'hr', 'h', 'cs', 'ab', 'so', 'hbp', 'double', 'r', 'sb', 'bb']
# work-around for Pandas 1.0.1 bugs
# 1) sum does not up-cast for nullable integer types
# 2) select_dtypes does not distinguish between nullable and non-nullable int types
idx = lahman_batting[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in lahman_batting[cols].columns[idx]:
lahman_batting[col] = lahman_batting[col].astype('Int32')
idx = lahman_teams[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])
for col in lahman_teams[cols].columns[idx]:
lahman_teams[col] = lahman_teams[col].astype('Int32')
b = lahman_batting[key + cols].groupby(key).agg('sum')
b = b.reset_index()
b.head(4)
team_id | year | triple | sf | hr | h | cs | ab | so | hbp | double | r | sb | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 25.0 | 57 | 161.0 | 1531 | 72 | 5628 | 953 | 45 | 279.0 | 829.0 | 126 | 617.0 |
1 | ANA | 1998 | 27.0 | 41 | 147.0 | 1530 | 45 | 5630 | 1028 | 48 | 314.0 | 787.0 | 93 | 510.0 |
2 | ANA | 1999 | 22.0 | 42 | 158.0 | 1404 | 45 | 5494 | 1022 | 43 | 248.0 | 711.0 | 71 | 511.0 |
3 | ANA | 2000 | 34.0 | 43 | 236.0 | 1574 | 52 | 5628 | 1024 | 47 | 309.0 | 864.0 | 93 | 608.0 |
t = lahman_teams[key + cols].sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | triple | sf | hr | h | cs | ab | so | hbp | double | r | sb | bb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 25 | 57 | 161 | 1531 | 72 | 5628 | 953 | 45 | 279 | 829 | 126 | 617 |
1 | ANA | 1998 | 27 | 41 | 147 | 1530 | 45 | 5630 | 1028 | 48 | 314 | 787 | 93 | 510 |
2 | ANA | 1999 | 22 | 42 | 158 | 1404 | 45 | 5494 | 1022 | 43 | 248 | 711 | 71 | 511 |
3 | ANA | 2000 | 34 | 43 | 236 | 1574 | 52 | 5628 | 1024 | 47 | 309 | 864 | 93 | 608 |
# ensure the dtypes are the same
for col in t.columns:
if not col == 'team_id' and not col =='year':
b[col] = b[col].astype('int')
t[col] = t[col].astype('int')
b[cols].equals(t[cols])
True
Double plays (dp) are not compared because each fielder involved gets credit for a double play whereas the team as a whole only gets credit for the complete double play.
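For example, a 6-4-3 double play credits three fielders with one dp each, while the team records a single double play (toy data, hypothetical ids):

```python
import pandas as pd

# toy 6-4-3 double play: shortstop, second baseman, and first baseman
# are each credited with one dp
fielder_dp = pd.DataFrame({'player_id': ['ss01', 'sb01', 'fb01'],
                           'dp': [1, 1, 1]})
individual_total = fielder_dp['dp'].sum()  # sum of individual credits
team_total = 1                             # the team records one double play
```

Summing individual dp therefore cannot match the team total, which is why dp is excluded from this comparison.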
cols = ['a', 'e', 'po', 'pb']
cols
['a', 'e', 'po', 'pb']
# retro_fielding was wrangled from the output of cwdaily
f = retro_fielding[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
f = f.reset_index()
f.head(4)
game_id | team_id | a | e | po | pb | |
---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(
['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)
game_id | team_id | a | e | po | pb | |
---|---|---|---|---|---|---|
0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |
# Retrosheet is completely consistent between cwdaily and cwgame for fielding
f.equals(tg)
True
exclude = ['lg_id', 'team_id', 'year', 'g', 'dp', 'player_id']
key = ['team_id', 'year']
cols = set(lahman_fielding.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
1
cols
['e']
f = lahman_fielding[key + cols].groupby(key).agg('sum')
f = f.sort_index().reset_index()
f.head(4)
team_id | year | e | |
---|---|---|---|
0 | ANA | 1997 | 123 |
1 | ANA | 1998 | 106 |
2 | ANA | 1999 | 106 |
3 | ANA | 2000 | 134 |
t = lahman_teams[key+cols]
t = t.sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | e | |
---|---|---|---|
0 | ANA | 1997 | 123 |
1 | ANA | 1998 | 106 |
2 | ANA | 1999 | 106 |
3 | ANA | 2000 | 134 |
# ensure the dtypes are the same for errors (e)
f[cols] = f[cols].astype('int')
t[cols] = t[cols].astype('int')
(f[cols] - t[cols]).max()
e 2 dtype: int64
m = pd.merge(f, t,
left_on=['team_id', 'year'],
right_on=['team_id', 'year'],
suffixes=['_fielding', '_teams'])
m.query("e_fielding != e_teams")
team_id | year | e_fielding | e_teams | |
---|---|---|---|---|
653 | MIN | 1977 | 144 | 143 |
1014 | SDN | 2007 | 94 | 92 |
When comparing large numbers, it is best to examine their relative differences. When comparing small numbers, it is best to look at their absolute differences.
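A quick illustration of why:

```python
import numpy as np

# large counts: a relative difference is meaningful
big_l, big_r = 1_000_100, 1_000_000
rel_diff = np.abs(1.0 - big_l / big_r)   # about 1e-4

# small counts (e.g. a season's team errors): a 2-error gap is easier
# to judge as an absolute difference than as a percentage
small_l, small_r = 94, 92
abs_diff = abs(small_l - small_r)        # 2 errors
```

For the error totals above, "off by at most 2 errors per team-season" is a more interpretable tolerance than a relative threshold.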
((f[cols] - t[cols]).max() <= 2).all()
True
cols = ['wp', 'bk', 'er']
cols
['wp', 'bk', 'er']
# retro_pitching was wrangled from the output of cwdaily
p = retro_pitching[cols].agg('sum')
p
wp 64799 bk 9919 er 841197 dtype: int64
# retro_team_game was wrangled from the output of cwgame
tg = retro_team_game[cols].agg('sum')
tg
wp 64799 bk 9919 er 841197 dtype: int64
# Retrosheet is completely consistent between cwdaily and cwgame for pitching
p.equals(tg)
True
Most of the column names shared between pitching and team pitching are for batting stats, not pitching stats.
exclude = ['lg_id', 'team_id', 'year', 'g', 'era',
'bb', 'h', 'hbp', 'hr', 'r', 'sf', 'so', 'sho', 'er']
key = ['team_id', 'year']
cols = set(lahman_pitching.columns) & set(lahman_teams.columns) - set(exclude)
cols = list(cols)
len(cols)
5
cols
['cg', 'l', 'sv', 'w', 'ip_outs']
p = lahman_pitching[key + cols].groupby(key).agg('sum').reset_index()
p.head(4)
team_id | year | cg | l | sv | w | ip_outs | |
---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 9 | 78 | 39 | 84 | 4364 |
1 | ANA | 1998 | 3 | 77 | 52 | 85 | 4332 |
2 | ANA | 1999 | 4 | 92 | 37 | 70 | 4294 |
3 | ANA | 2000 | 5 | 80 | 46 | 82 | 4344 |
t = lahman_teams[key+cols].sort_values(key).reset_index(drop=True)
t.head(4)
team_id | year | cg | l | sv | w | ip_outs | |
---|---|---|---|---|---|---|---|
0 | ANA | 1997 | 9 | 78 | 39 | 84 | 4364 |
1 | ANA | 1998 | 3 | 77 | 52 | 85 | 4332 |
2 | ANA | 1999 | 4 | 92 | 37 | 70 | 4294 |
3 | ANA | 2000 | 5 | 80 | 46 | 82 | 4344 |
# dtypes need to be the same
for col in p.columns:
if not col == 'year' and not col == 'team_id':
p[col] = p[col].astype('int')
t[col] = t[col].astype('int')
np.abs(p[cols] - t[cols]).max()
cg 0 l 1 sv 0 w 1 ip_outs 0 dtype: int64
m = pd.merge(p, t, on = ['team_id','year'], suffixes=['_p', '_t'])
s_cols = m.columns.sort_values()
m = m[s_cols]
# display the differences
for col in cols:
cp = col+'_p'
ct = col+'_t'
query = cp + ' != '+ ct
tmp = m.query(query)[['team_id', 'year'] +[cp, ct]]
if len(tmp):
display(tmp)
team_id | year | l_p | l_t | |
---|---|---|---|---|
196 | CHA | 1979 | 86 | 87 |
329 | CLE | 1974 | 84 | 85 |
team_id | year | w_p | w_t | |
---|---|---|---|---|
407 | DET | 1979 | 84 | 85 |
1184 | TEX | 1974 | 83 | 84 |
Twice, the team value for wins or losses was 1 greater than the sum of that team's pitchers' wins or losses.
Unlike shutouts and earned runs, which are intentionally counted differently for individuals and teams, wins and losses should be counted the same. The above shows a minor data discrepancy.
It may seem like a lot of work to cross check the data in so many different ways, but doing so provides accuracy bounds which are important to the subsequent data analysis.