We will be working with a file of Major League Baseball games from Retrosheet. Retrosheet compiles detailed statistics on baseball games from the 1800s through to today.
*Here, we'll create a Database of Major League Baseball games by compiling data from various sources.*
The main file we will be working from game_log.csv
, has been produced by combining 127 separate CSV files from retrosheet, and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game which we will normalize into several separate tables using SQL, providing a robust database of game-level statistics.
In addition to the main file, we have also included three 'helper' files, also sourced from Retrosheet:
park_codes.csv
person_codes.csv
team_codes.csv
These three helper files in some cases contain extra data, but will also make things easier as they will form the basis for three of our normalized tables.
Information regarding the columns of game_log.csv
can be found in game_log_fields.txt
import pandas as pd
pd.set_option('max_columns', 180)
pd.set_option('max_rows', 200000)
pd.set_option('max_colwidth', 5000)
games = pd.read_csv('game_log.csv',low_memory=False)
parks = pd.read_csv('park_codes.csv')
persons = pd.read_csv('person_codes.csv')
teams = pd.read_csv('team_codes.csv')
Now, we will do some exploratory analysis, to understand the Data and it's columns.
games DataFrame
print(games.shape)
games.head()
(171907, 161)
date | number_of_game | day_of_week | v_name | v_league | v_game_number | h_name | h_league | h_game_number | v_score | h_score | length_outs | day_night | completion | forefeit | protest | park_id | attendance | length_minutes | v_line_score | h_line_score | v_at_bats | v_hits | v_doubles | v_triples | v_homeruns | v_rbi | v_sacrifice_hits | v_sacrifice_flies | v_hit_by_pitch | v_walks | v_intentional_walks | v_strikeouts | v_stolen_bases | v_caught_stealing | v_grounded_into_double | v_first_catcher_interference | v_left_on_base | v_pitchers_used | v_individual_earned_runs | v_team_earned_runs | v_wild_pitches | v_balks | v_putouts | v_assists | v_errors | v_passed_balls | v_double_plays | v_triple_plays | h_at_bats | h_hits | h_doubles | h_triples | h_homeruns | h_rbi | h_sacrifice_hits | h_sacrifice_flies | h_hit_by_pitch | h_walks | h_intentional_walks | h_strikeouts | h_stolen_bases | h_caught_stealing | h_grounded_into_double | h_first_catcher_interference | h_left_on_base | h_pitchers_used | h_individual_earned_runs | h_team_earned_runs | h_wild_pitches | h_balks | h_putouts | h_assists | h_errors | h_passed_balls | h_double_plays | h_triple_plays | hp_umpire_id | hp_umpire_name | 1b_umpire_id | 1b_umpire_name | 2b_umpire_id | 2b_umpire_name | 3b_umpire_id | 3b_umpire_name | lf_umpire_id | lf_umpire_name | rf_umpire_id | rf_umpire_name | v_manager_id | v_manager_name | h_manager_id | h_manager_name | winning_pitcher_id | winning_pitcher_name | losing_pitcher_id | losing_pitcher_name | saving_pitcher_id | saving_pitcher_name | winning_rbi_batter_id | winning_rbi_batter_id_name | v_starting_pitcher_id | v_starting_pitcher_name | h_starting_pitcher_id | h_starting_pitcher_name | v_player_1_id | v_player_1_name | v_player_1_def_pos | v_player_2_id | v_player_2_name | v_player_2_def_pos | v_player_3_id | v_player_3_name | v_player_3_def_pos | v_player_4_id | v_player_4_name | v_player_4_def_pos | v_player_5_id | v_player_5_name | v_player_5_def_pos | v_player_6_id | v_player_6_name | v_player_6_def_pos | v_player_7_id | v_player_7_name | v_player_7_def_pos | v_player_8_id | v_player_8_name | v_player_8_def_pos | v_player_9_id | v_player_9_name | v_player_9_def_pos | h_player_1_id | h_player_1_name | h_player_1_def_pos | h_player_2_id | h_player_2_name | h_player_2_def_pos | h_player_3_id | h_player_3_name | h_player_3_def_pos | h_player_4_id | h_player_4_name | h_player_4_def_pos | h_player_5_id | h_player_5_name | h_player_5_def_pos | h_player_6_id | h_player_6_name | h_player_6_def_pos | h_player_7_id | h_player_7_name | h_player_7_def_pos | h_player_8_id | h_player_8_name | h_player_8_def_pos | h_player_9_id | h_player_9_name | h_player_9_def_pos | additional_info | acquisition_info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18710504 | 0 | Thu | CL1 | NaN | 1 | FW1 | NaN | 1 | 0 | 2 | 54.0 | D | NaN | NaN | NaN | FOR01 | 200.0 | 120.0 | 000000000 | 010010000 | 30.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 6.0 | 1.0 | NaN | -1.0 | NaN | 4.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 27.0 | 9.0 | 0.0 | 3.0 | 0.0 | 0.0 | 31.0 | 4.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | -1.0 | NaN | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | boakj901 | John Boake | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | paboc101 | Charlie Pabor | lennb101 | Bill Lennon | mathb101 | Bobby Mathews | prata101 | Al Pratt | NaN | NaN | NaN | NaN | prata101 | Al Pratt | mathb101 | Bobby Mathews | whitd102 | Deacon White | 2.0 | kimbg101 | Gene Kimball | 4.0 | paboc101 | Charlie Pabor | 7.0 | allia101 | Art Allison | 8.0 | white104 | Elmer White | 9.0 | prata101 | Al Pratt | 1.0 | sutte101 | Ezra Sutton | 5.0 | carlj102 | Jim Carleton | 3.0 | bassj101 | John Bass | 6.0 | selmf101 | Frank Sellman | 5.0 | mathb101 | Bobby Mathews | 1.0 | foraj101 | Jim Foran | 3.0 | goldw101 | Wally Goldsmith | 6.0 | lennb101 | Bill Lennon | 2.0 | caret101 | Tom Carey | 4.0 | mince101 | Ed Mincher | 7.0 | mcdej101 | James McDermott | 8.0 | kellb105 | Bill Kelly | 9.0 | NaN | Y |
1 | 18710505 | 0 | Fri | BS1 | NaN | 1 | WS3 | NaN | 1 | 20 | 18 | 54.0 | D | NaN | NaN | NaN | WAS01 | 5000.0 | 145.0 | 107000435 | 640113030 | 41.0 | 13.0 | 1.0 | 2.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 18.0 | NaN | 5.0 | 3.0 | NaN | -1.0 | NaN | 12.0 | 1.0 | 6.0 | 6.0 | 1.0 | 0.0 | 27.0 | 13.0 | 10.0 | 1.0 | 2.0 | 0.0 | 49.0 | 14.0 | 2.0 | 0.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 10.0 | NaN | 2.0 | 1.0 | NaN | -1.0 | NaN | 14.0 | 1.0 | 7.0 | 7.0 | 0.0 | 0.0 | 27.0 | 20.0 | 10.0 | 2.0 | 3.0 | 0.0 | dobsh901 | Henry Dobson | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | wrigh101 | Harry Wright | younn801 | Nick Young | spala101 | Al Spalding | braia102 | Asa Brainard | NaN | NaN | NaN | NaN | spala101 | Al Spalding | braia102 | Asa Brainard | wrigg101 | George Wright | 6.0 | barnr102 | Ross Barnes | 4.0 | birdd102 | Dave Birdsall | 9.0 | mcvec101 | Cal McVey | 2.0 | wrigh101 | Harry Wright | 8.0 | goulc101 | Charlie Gould | 3.0 | schah101 | Harry Schafer | 5.0 | conef101 | Fred Cone | 7.0 | spala101 | Al Spalding | 1.0 | watef102 | Fred Waterman | 5.0 | forcd101 | Davy Force | 6.0 | mille105 | Everett Mills | 3.0 | allid101 | Doug Allison | 2.0 | hallg101 | George Hall | 7.0 | leona101 | Andy Leonard | 4.0 | braia102 | Asa Brainard | 1.0 | burrh101 | Henry Burroughs | 9.0 | berth101 | Henry Berthrong | 8.0 | HTBF | Y |
2 | 18710506 | 0 | Sat | CL1 | NaN | 2 | RC1 | NaN | 1 | 12 | 4 | 54.0 | D | NaN | NaN | NaN | RCK01 | 1000.0 | 140.0 | 610020003 | 010020100 | 49.0 | 11.0 | 1.0 | 1.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | NaN | -1.0 | NaN | 10.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 27.0 | 12.0 | 8.0 | 5.0 | 0.0 | 0.0 | 36.0 | 7.0 | 2.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 3.0 | 5.0 | NaN | -1.0 | NaN | 5.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 27.0 | 12.0 | 13.0 | 3.0 | 0.0 | 0.0 | mawnj901 | J.H. Manny | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | paboc101 | Charlie Pabor | hasts101 | Scott Hastings | prata101 | Al Pratt | fishc102 | Cherokee Fisher | NaN | NaN | NaN | NaN | prata101 | Al Pratt | fishc102 | Cherokee Fisher | whitd102 | Deacon White | 2.0 | kimbg101 | Gene Kimball | 4.0 | paboc101 | Charlie Pabor | 7.0 | allia101 | Art Allison | 8.0 | white104 | Elmer White | 9.0 | prata101 | Al Pratt | 1.0 | sutte101 | Ezra Sutton | 5.0 | carlj102 | Jim Carleton | 3.0 | bassj101 | John Bass | 6.0 | mackd101 | Denny Mack | 3.0 | addyb101 | Bob Addy | 4.0 | fishc102 | Cherokee Fisher | 1.0 | hasts101 | Scott Hastings | 8.0 | ham-r101 | Ralph Ham | 5.0 | ansoc101 | Cap Anson | 2.0 | sagep101 | Pony Sager | 6.0 | birdg101 | George Bird | 7.0 | stirg101 | Gat Stires | 9.0 | NaN | Y |
3 | 18710508 | 0 | Mon | CL1 | NaN | 3 | CH1 | NaN | 1 | 12 | 14 | 54.0 | D | NaN | NaN | NaN | CHI01 | 5000.0 | 150.0 | 101403111 | 077000000 | 46.0 | 15.0 | 2.0 | 1.0 | 2.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | NaN | -1.0 | NaN | 7.0 | 1.0 | 6.0 | 6.0 | 0.0 | 0.0 | 27.0 | 15.0 | 11.0 | 6.0 | 0.0 | 0.0 | 43.0 | 11.0 | 2.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 4.0 | NaN | 2.0 | 1.0 | NaN | -1.0 | NaN | 6.0 | 1.0 | 4.0 | 4.0 | 0.0 | 0.0 | 27.0 | 14.0 | 7.0 | 2.0 | 0.0 | 0.0 | willg901 | Gardner Willard | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | paboc101 | Charlie Pabor | woodj106 | Jimmy Wood | zettg101 | George Zettlein | prata101 | Al Pratt | NaN | NaN | NaN | NaN | prata101 | Al Pratt | zettg101 | George Zettlein | whitd102 | Deacon White | 2.0 | kimbg101 | Gene Kimball | 4.0 | paboc101 | Charlie Pabor | 7.0 | allia101 | Art Allison | 8.0 | white104 | Elmer White | 9.0 | prata101 | Al Pratt | 1.0 | sutte101 | Ezra Sutton | 5.0 | carlj102 | Jim Carleton | 3.0 | bassj101 | John Bass | 6.0 | mcatb101 | Bub McAtee | 3.0 | kingm101 | Marshall King | 8.0 | hodec101 | Charlie Hodes | 2.0 | woodj106 | Jimmy Wood | 4.0 | simmj101 | Joe Simmons | 9.0 | folet101 | Tom Foley | 7.0 | duffe101 | Ed Duffy | 6.0 | pinke101 | Ed Pinkham | 5.0 | zettg101 | George Zettlein | 1.0 | NaN | Y |
4 | 18710509 | 0 | Tue | BS1 | NaN | 2 | TRO | NaN | 1 | 9 | 5 | 54.0 | D | NaN | NaN | NaN | TRO01 | 3250.0 | 145.0 | 000002232 | 101003000 | 46.0 | 17.0 | 4.0 | 1.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 2.0 | NaN | 0.0 | 1.0 | NaN | -1.0 | NaN | 12.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 27.0 | 12.0 | 5.0 | 0.0 | 1.0 | 0.0 | 36.0 | 9.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | NaN | 0.0 | 2.0 | NaN | -1.0 | NaN | 7.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 27.0 | 11.0 | 7.0 | 3.0 | 0.0 | 0.0 | leroi901 | Isaac Leroy | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | wrigh101 | Harry Wright | pikel101 | Lip Pike | spala101 | Al Spalding | mcmuj101 | John McMullin | NaN | NaN | NaN | NaN | spala101 | Al Spalding | mcmuj101 | John McMullin | wrigg101 | George Wright | 6.0 | barnr102 | Ross Barnes | 4.0 | birdd102 | Dave Birdsall | 9.0 | mcvec101 | Cal McVey | 2.0 | wrigh101 | Harry Wright | 8.0 | goulc101 | Charlie Gould | 3.0 | schah101 | Harry Schafer | 5.0 | conef101 | Fred Cone | 7.0 | spala101 | Al Spalding | 1.0 | flync101 | Clipper Flynn | 9.0 | mcgem101 | Mike McGeary | 2.0 | yorkt101 | Tom York | 8.0 | mcmuj101 | John McMullin | 1.0 | kings101 | Steve King | 7.0 | beave101 | Edward Beavens | 4.0 | bells101 | Steve Bellan | 5.0 | pikel101 | Lip Pike | 3.0 | cravb101 | Bill Craver | 6.0 | HTBF | Y |
games.tail()
date | number_of_game | day_of_week | v_name | v_league | v_game_number | h_name | h_league | h_game_number | v_score | h_score | length_outs | day_night | completion | forefeit | protest | park_id | attendance | length_minutes | v_line_score | h_line_score | v_at_bats | v_hits | v_doubles | v_triples | v_homeruns | v_rbi | v_sacrifice_hits | v_sacrifice_flies | v_hit_by_pitch | v_walks | v_intentional_walks | v_strikeouts | v_stolen_bases | v_caught_stealing | v_grounded_into_double | v_first_catcher_interference | v_left_on_base | v_pitchers_used | v_individual_earned_runs | v_team_earned_runs | v_wild_pitches | v_balks | v_putouts | v_assists | v_errors | v_passed_balls | v_double_plays | v_triple_plays | h_at_bats | h_hits | h_doubles | h_triples | h_homeruns | h_rbi | h_sacrifice_hits | h_sacrifice_flies | h_hit_by_pitch | h_walks | h_intentional_walks | h_strikeouts | h_stolen_bases | h_caught_stealing | h_grounded_into_double | h_first_catcher_interference | h_left_on_base | h_pitchers_used | h_individual_earned_runs | h_team_earned_runs | h_wild_pitches | h_balks | h_putouts | h_assists | h_errors | h_passed_balls | h_double_plays | h_triple_plays | hp_umpire_id | hp_umpire_name | 1b_umpire_id | 1b_umpire_name | 2b_umpire_id | 2b_umpire_name | 3b_umpire_id | 3b_umpire_name | lf_umpire_id | lf_umpire_name | rf_umpire_id | rf_umpire_name | v_manager_id | v_manager_name | h_manager_id | h_manager_name | winning_pitcher_id | winning_pitcher_name | losing_pitcher_id | losing_pitcher_name | saving_pitcher_id | saving_pitcher_name | winning_rbi_batter_id | winning_rbi_batter_id_name | v_starting_pitcher_id | v_starting_pitcher_name | h_starting_pitcher_id | h_starting_pitcher_name | v_player_1_id | v_player_1_name | v_player_1_def_pos | v_player_2_id | v_player_2_name | v_player_2_def_pos | v_player_3_id | v_player_3_name | v_player_3_def_pos | v_player_4_id | v_player_4_name | v_player_4_def_pos | v_player_5_id | v_player_5_name | v_player_5_def_pos | v_player_6_id | v_player_6_name | v_player_6_def_pos | v_player_7_id | v_player_7_name | v_player_7_def_pos | v_player_8_id | v_player_8_name | v_player_8_def_pos | v_player_9_id | v_player_9_name | v_player_9_def_pos | h_player_1_id | h_player_1_name | h_player_1_def_pos | h_player_2_id | h_player_2_name | h_player_2_def_pos | h_player_3_id | h_player_3_name | h_player_3_def_pos | h_player_4_id | h_player_4_name | h_player_4_def_pos | h_player_5_id | h_player_5_name | h_player_5_def_pos | h_player_6_id | h_player_6_name | h_player_6_def_pos | h_player_7_id | h_player_7_name | h_player_7_def_pos | h_player_8_id | h_player_8_name | h_player_8_def_pos | h_player_9_id | h_player_9_name | h_player_9_def_pos | additional_info | acquisition_info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
171902 | 20161002 | 0 | Sun | MIL | NL | 162 | COL | NL | 162 | 6 | 4 | 60.0 | D | NaN | NaN | NaN | DEN02 | 27762.0 | 203.0 | 0200000202 | 1100100010 | 39.0 | 10.0 | 4.0 | 1.0 | 2.0 | 6.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 12.0 | 2.0 | 1.0 | 0.0 | 0.0 | 8.0 | 7.0 | 4.0 | 4.0 | 1.0 | 0.0 | 30.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 41.0 | 13.0 | 4.0 | 0.0 | 1.0 | 4.0 | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | 11.0 | 0.0 | 1.0 | 0.0 | 0.0 | 12.0 | 5.0 | 6.0 | 6.0 | 0.0 | 0.0 | 30.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | barrs901 | Scott Barry | woodt901 | Tom Woodring | randt901 | Tony Randazzo | ortir901 | Roberto Ortiz | NaN | NaN | NaN | NaN | counc001 | Craig Counsell | weisw001 | Walt Weiss | thort001 | Tyler Thornburg | rusic001 | Chris Rusin | knebc001 | Corey Knebel | susaa001 | Andrew Susac | cravt001 | Tyler Cravy | marqg001 | German Marquez | villj001 | Jonathan Villar | 5.0 | genns001 | Scooter Gennett | 4.0 | cartc002 | Chris Carter | 3.0 | santd002 | Domingo Santana | 9.0 | pereh001 | Hernan Perez | 8.0 | arcio002 | Orlando Arcia | 6.0 | susaa001 | Andrew Susac | 2.0 | elmoj001 | Jake Elmore | 7.0 | cravt001 | Tyler Cravy | 1.0 | blacc001 | Charlie Blackmon | 8.0 | dahld001 | David Dahl | 7.0 | arenn001 | Nolan Arenado | 5.0 | gonzc001 | Carlos Gonzalez | 9.0 | murpt002 | Tom Murphy | 2.0 | pattj005 | Jordan Patterson | 3.0 | valap001 | Pat Valaika | 4.0 | adamc001 | Cristhian Adames | 6.0 | marqg001 | German Marquez | 1.0 | NaN | Y |
171903 | 20161002 | 0 | Sun | NYN | NL | 162 | PHI | NL | 162 | 2 | 5 | 51.0 | D | NaN | NaN | NaN | PHI13 | 36935.0 | 159.0 | 000001100 | 00100031x | 33.0 | 8.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 9.0 | 1.0 | 1.0 | 1.0 | 0.0 | 6.0 | 6.0 | 3.0 | 3.0 | 0.0 | 0.0 | 24.0 | 12.0 | 3.0 | 1.0 | 2.0 | 0.0 | 33.0 | 10.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 7.0 | 5.0 | 2.0 | 2.0 | 0.0 | 0.0 | 27.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | barkl901 | Lance Barksdale | herna901 | Angel Hernandez | barrt901 | Ted Barrett | littw901 | Will Little | NaN | NaN | NaN | NaN | collt801 | Terry Collins | mackp101 | Pete Mackanin | murrc002 | Colton Murray | goede001 | Erik Goeddel | nerih001 | Hector Neris | hernc005 | Cesar Hernandez | ynoag001 | Gabriel Ynoa | eickj001 | Jerad Eickhoff | granc001 | Curtis Granderson | 8.0 | cabra002 | Asdrubal Cabrera | 6.0 | brucj001 | Jay Bruce | 9.0 | dudal001 | Lucas Duda | 3.0 | johnk003 | Kelly Johnson | 4.0 | confm001 | Michael Conforto | 7.0 | campe001 | Eric Campbell | 5.0 | plawk001 | Kevin Plawecki | 2.0 | ynoag001 | Gabriel Ynoa | 1.0 | hernc005 | Cesar Hernandez | 4.0 | parej002 | Jimmy Paredes | 7.0 | herro001 | Odubel Herrera | 8.0 | franm004 | Maikel Franco | 5.0 | howar001 | Ryan Howard | 3.0 | ruppc001 | Cameron Rupp | 2.0 | blana001 | Andres Blanco | 6.0 | altha001 | Aaron Altherr | 9.0 | eickj001 | Jerad Eickhoff | 1.0 | NaN | Y |
171904 | 20161002 | 0 | Sun | LAN | NL | 162 | SFN | NL | 162 | 1 | 7 | 51.0 | D | NaN | NaN | NaN | SFO03 | 41445.0 | 184.0 | 000100000 | 23000002x | 30.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | 7.0 | 7.0 | 7.0 | 0.0 | 0.0 | 24.0 | 5.0 | 1.0 | 0.0 | 0.0 | 0.0 | 39.0 | 16.0 | 3.0 | 1.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 11.0 | 2.0 | 1.0 | 0.0 | 0.0 | 12.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 27.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | knigb901 | Brian Knight | westj901 | Joe West | fleta901 | Andy Fletcher | danlk901 | Kerwin Danley | NaN | NaN | NaN | NaN | robed001 | Dave Roberts | bochb002 | Bruce Bochy | moorm003 | Matt Moore | maedk001 | Kenta Maeda | NaN | NaN | poseb001 | Buster Posey | maedk001 | Kenta Maeda | moorm003 | Matt Moore | kendh001 | Howie Kendrick | 7.0 | turnj001 | Justin Turner | 5.0 | seagc001 | Corey Seager | 6.0 | puigy001 | Yasiel Puig | 9.0 | gonza003 | Adrian Gonzalez | 3.0 | grany001 | Yasmani Grandal | 2.0 | pedej001 | Joc Pederson | 8.0 | utlec001 | Chase Utley | 4.0 | maedk001 | Kenta Maeda | 1.0 | spand001 | Denard Span | 8.0 | beltb001 | Brandon Belt | 3.0 | poseb001 | Buster Posey | 2.0 | pench001 | Hunter Pence | 9.0 | crawb001 | Brandon Crawford | 6.0 | pagaa001 | Angel Pagan | 7.0 | panij002 | Joe Panik | 4.0 | gillc001 | Conor Gillaspie | 5.0 | moorm003 | Matt Moore | 1.0 | NaN | Y |
171905 | 20161002 | 0 | Sun | PIT | NL | 162 | SLN | NL | 162 | 4 | 10 | 51.0 | D | NaN | NaN | NaN | STL10 | 44615.0 | 192.0 | 000020200 | 00100360x | 35.0 | 9.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 11.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.0 | 6.0 | 8.0 | 8.0 | 0.0 | 0.0 | 24.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 36.0 | 12.0 | 2.0 | 0.0 | 1.0 | 10.0 | 0.0 | 2.0 | 0.0 | 4.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 3.0 | 4.0 | 4.0 | 0.0 | 0.0 | 27.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | cuzzp901 | Phil Cuzzi | ticht901 | Todd Tichenor | vanol901 | Larry Vanover | marqa901 | Alfonso Marquez | NaN | NaN | NaN | NaN | hurdc001 | Clint Hurdle | mathm001 | Mike Matheny | broxj001 | Jonathan Broxton | nicaj001 | Juan Nicasio | NaN | NaN | piscs001 | Stephen Piscotty | voger001 | Ryan Vogelsong | waina001 | Adam Wainwright | jasoj001 | John Jaso | 3.0 | polag001 | Gregory Polanco | 9.0 | mccua001 | Andrew McCutchen | 8.0 | kangj001 | Jung Ho Kang | 5.0 | joycm001 | Matt Joyce | 7.0 | hansa001 | Alen Hanson | 4.0 | fryee001 | Eric Fryer | 2.0 | florp001 | Pedro Florimon | 6.0 | voger001 | Ryan Vogelsong | 1.0 | carpm002 | Matt Carpenter | 3.0 | diaza003 | Aledmys Diaz | 6.0 | moliy001 | Yadier Molina | 2.0 | piscs001 | Stephen Piscotty | 9.0 | peraj001 | Jhonny Peralta | 5.0 | mossb001 | Brandon Moss | 7.0 | gyorj001 | Jedd Gyorko | 4.0 | gricr001 | Randal Grichuk | 8.0 | waina001 | Adam Wainwright | 1.0 | NaN | Y |
171906 | 20161002 | 0 | Sun | MIA | NL | 161 | WAS | NL | 162 | 7 | 10 | 51.0 | D | NaN | NaN | NaN | WAS11 | 28730.0 | 216.0 | 000230020 | 03023002x | 38.0 | 14.0 | 1.0 | 1.0 | 2.0 | 7.0 | 1.0 | 0.0 | 0.0 | 3.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 0.0 | 8.0 | 7.0 | 10.0 | 10.0 | 1.0 | 0.0 | 24.0 | 11.0 | 0.0 | 0.0 | 1.0 | 0.0 | 30.0 | 10.0 | 2.0 | 0.0 | 1.0 | 10.0 | 1.0 | 1.0 | 1.0 | 8.0 | 0.0 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 7.0 | 6.0 | 7.0 | 7.0 | 1.0 | 0.0 | 27.0 | 11.0 | 0.0 | 0.0 | 1.0 | 0.0 | tumpj901 | John Tumpane | porta901 | Alan Porter | onorb901 | Brian O'Nora | kellj901 | Jeff Kellogg | NaN | NaN | NaN | NaN | mattd001 | Don Mattingly | baked002 | Dusty Baker | schem001 | Max Scherzer | brica001 | Austin Brice | melam001 | Mark Melancon | difow001 | Wilmer Difo | koeht001 | Tom Koehler | schem001 | Max Scherzer | gordd002 | Dee Gordon | 4.0 | telit001 | Tomas Telis | 2.0 | pradm001 | Martin Prado | 5.0 | yelic001 | Christian Yelich | 8.0 | bourj002 | Justin Bour | 3.0 | scrux001 | Xavier Scruggs | 7.0 | hoodd001 | Destin Hood | 9.0 | hecha001 | Adeiny Hechavarria | 6.0 | koeht001 | Tom Koehler | 1.0 | turnt001 | Trea Turner | 8.0 | reveb001 | Ben Revere | 7.0 | harpb003 | Bryce Harper | 9.0 | zimmr001 | Ryan Zimmerman | 3.0 | drews001 | Stephen Drew | 5.0 | difow001 | Wilmer Difo | 4.0 | espid001 | Danny Espinosa | 6.0 | lobaj001 | Jose Lobaton | 2.0 | schem001 | Max Scherzer | 1.0 | NaN | Y |
It looks like the game log has a record of over 170,000 games. It looks like these games are chronologically ordered and occur between 1871 and 2016.
For each game we have:
We have a "game_log_fields.txt" file that tell us that the player number corresponds with the order in which they batted.
It's worth noting that there is no natural primary key column for this table.
parks DataFrame
print(parks.shape)
parks.head()
(252, 9)
park_id | name | aka | city | state | start | end | league | notes | |
---|---|---|---|---|---|---|---|---|---|
0 | ALB01 | Riverside Park | NaN | Albany | NY | 09/11/1880 | 05/30/1882 | NL | TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882 |
1 | ALT01 | Columbia Park | NaN | Altoona | PA | 04/30/1884 | 05/31/1884 | UA | NaN |
2 | ANA01 | Angel Stadium of Anaheim | Edison Field; Anaheim Stadium | Anaheim | CA | 04/19/1966 | NaN | AL | NaN |
3 | ARL01 | Arlington Stadium | NaN | Arlington | TX | 04/21/1972 | 10/03/1993 | AL | NaN |
4 | ARL02 | Rangers Ballpark in Arlington | The Ballpark in Arlington; Ameriquest Fl | Arlington | TX | 04/11/1994 | NaN | AL | NaN |
This seems to be a list of all baseball parks. There are IDs which seem to match with the game log, as well as names, nicknames, city and league.
persons DataFrame
print(persons.shape)
persons.head()
(20494, 7)
id | last | first | player_debut | mgr_debut | coach_debut | ump_debut | |
---|---|---|---|---|---|---|---|
0 | aardd001 | Aardsma | David | 04/06/2004 | NaN | NaN | NaN |
1 | aaroh101 | Aaron | Hank | 04/13/1954 | NaN | NaN | NaN |
2 | aarot101 | Aaron | Tommie | 04/10/1962 | NaN | 04/06/1979 | NaN |
3 | aased001 | Aase | Don | 07/26/1977 | NaN | NaN | NaN |
4 | abada001 | Abad | Andy | 09/10/2001 | NaN | NaN | NaN |
This seems to be a list of people with IDs. The IDs look like they match up with those used in the game log. There are debut dates, for players, managers, coaches and umpires. We can see that some people might have been one or more of these roles.
It also looks like coaches and managers are two different things in baseball. After some research, managers are what would be called a 'coach' or 'head coach' in other sports, and coaches are more specialized, like base coaches. It also seems like coaches aren't recorded in the game log.
teams DataFrame
print(teams.shape)
teams.head()
(150, 8)
team_id | league | start | end | city | nickname | franch_id | seq | |
---|---|---|---|---|---|---|---|---|
0 | ALT | UA | 1884 | 1884 | Altoona | Mountain Cities | ALT | 1 |
1 | ARI | NL | 1998 | 0 | Arizona | Diamondbacks | ARI | 1 |
2 | BFN | NL | 1879 | 1885 | Buffalo | Bisons | BFN | 1 |
3 | BFP | PL | 1890 | 1890 | Buffalo | Bisons | BFP | 1 |
4 | BL1 | NaN | 1872 | 1874 | Baltimore | Canaries | BL1 | 1 |
This seems to be a list of all teams, with team_ids which seem to match the game log. Interestingly, there is a franch_id, let's take a look at this:
teams['franch_id'].value_counts()
BS1 4 MLA 3 PHA 3 SE1 3 TRN 3 LAA 3 BR3 3 WS2 2 CN2 2 SL2 2 PT1 2 BLA 2 IND 2 NY2 2 PH1 2 BL2 2 WS9 2 CH2 2 SLU 2 CL3 2 HR1 2 FLO 2 LS2 2 SL4 2 MON 2 BSP 2 WS1 2 PRO 1 HOU 1 CL2 1 CH1 1 WOR 1 NY4 1 KCF 1 CHP 1 PH4 1 CN3 1 CHA 1 PH2 1 NYP 1 CNU 1 ARI 1 BLF 1 RC1 1 CN1 1 KEO 1 IN3 1 KCN 1 TOR 1 TL1 1 PTF 1 BR4 1 BFP 1 WSU 1 BR1 1 PH3 1 CLP 1 BSU 1 BRP 1 NH1 1 CL1 1 KCU 1 CL6 1 BLU 1 SLF 1 WIL 1 TRO 1 FW1 1 RC2 1 WS8 1 PHP 1 WS4 1 CHU 1 IN1 1 BUF 1 WS3 1 ELI 1 SR2 1 ALT 1 RIC 1 PTP 1 PTU 1 TBA 1 BL4 1 ML3 1 WS5 1 SPU 1 TL2 1 MLU 1 SL1 1 DTN 1 CHF 1 DET 1 BOS 1 CLE 1 LS1 1 PHI 1 BFN 1 BRF 1 IN2 1 MID 1 SEA 1 NYN 1 BR2 1 SR1 1 CL5 1 SDN 1 COL 1 WS6 1 BL1 1 KC2 1 WS7 1 PHU 1 ML1 1 KCA 1 Name: franch_id, dtype: int64
We might have franch_id occurring a few times for some teams, let's look at the first one in more detail.
teams[teams['franch_id'] == 'BS1']
team_id | league | start | end | city | nickname | franch_id | seq | |
---|---|---|---|---|---|---|---|---|
21 | BS1 | NaN | 1871 | 1875 | Boston | Braves | BS1 | 1 |
22 | BSN | NL | 1876 | 1952 | Boston | Braves | BS1 | 2 |
23 | MLN | NL | 1953 | 1965 | Milwaukee | Braves | BS1 | 3 |
24 | ATL | NL | 1966 | 0 | Atlanta | Braves | BS1 | 4 |
It appears that teams move between leagues and cities. The team_id changes when this happens, franch_id (which is probably 'Franchise') helps us tie all of this together.
In the game log, each player has a defensive position listed, which seems to be a number between 1-10. Doing some research around this, I found this article which gives us a list of names for each numbered position:
- Pitcher
The 10th position isn't included, it may be a way of describing a designated hitter that does not field. I can find a retrosheet page that indicates that position 0 is used for this, but we don't have any position 0 in our data. I have chosen to make this an 'Unknown Position' so I'm not including data based on a hunch.
Wikipedia tells us there are currently two leagues - the American (AL) and National (NL). Let's start by finding out what leagues are listed in the main game log:
games['h_league'].value_counts(dropna=False)
NL 88867 AL 74712 AA 5039 FL 1243 NaN 1086 PL 532 UA 428 Name: h_league, dtype: int64
It looks like most of our games fall into the two current leagues, but that there are four other leagues. Let's write a quick function to get some info on the years of these leagues:
import datetime as dt
import numpy as np
def league_info(l):
league_games = games[games['h_league'] == l]
first_appr = league_games['date'].min()
last_appr = league_games['date'].max()
try:
first_appr = dt.datetime.strptime(str(first_appr), "%Y%m%d").strftime("%Y")
last_appr = dt.datetime.strptime(str(last_appr), "%Y%m%d").strftime("%Y")
except ValueError:
pass
print("{} appered from {} to {}".format(l, first_appr, last_appr))
for i in games['h_league'].unique():
league_info(i)
nan appered from nan to nan NL appered from 1876 to 2016 AA appered from 1882 to 1891 UA appered from 1884 to 1884 PL appered from 1890 to 1890 AL appered from 1901 to 2016 FL appered from 1914 to 1915
Now we have some years which will help us do some research. After some googling we come up with:
It also looks like we have about 1000 games where the home team doesn't have a value for league.
import sqlite3
db = 'mlb.db'
def run_query(q):
with sqlite3.connect(db) as conn:
return pd.read_sql(q, conn)
def run_command(c):
with sqlite3.connect(db) as conn:
conn.isolation_level = None
conn.execute(c)
def show_tables():
q = '''
SELECT
name,
type
FROM sqlite_master
WHERE type IN ('table', 'view')
'''
return run_query(q)
DataFrame.to_sql()
to create tables for each of our dataframes in a new SQLite database, mlb.db
table_list = {
"game_log" : games,
"park_codes" : parks,
"person_codes" : persons,
"team_codes" : teams
}
with sqlite3.connect(db) as conn:
for key, val in table_list.items():
conn.execute('DROP TABLE IF EXISTS {};'.format(key))
val.to_sql(key, conn, index=False)
game_log
DataFrame doesn't have a unique column. We will create a new column game_id
in it's corresponding table, which will act as a primary key,c1 = '''
ALTER TABLE game_log
add game_id TEXT;
'''
try:
run_command(c1)
except:
pass
c2 = '''
UPDATE game_log
SET game_id = h_name || date || number_of_game
WHERE game_id IS NULL;
'''
run_command(c2)
q1 = '''
SELECT
game_id,
h_name,
date,
number_of_game
FROM game_log
LIMIT 5;
'''
run_query(q1)
game_id | h_name | date | number_of_game | |
---|---|---|---|---|
0 | FW1187105040 | FW1 | 18710504 | 0 |
1 | WS3187105050 | WS3 | 18710505 | 0 |
2 | RC1187105060 | RC1 | 18710506 | 0 |
3 | CH1187105080 | CH1 | 18710508 | 0 |
4 | TRO187105090 | TRO | 18710509 | 0 |
game_log table
run_query("select * from game_log limit 2")
date | number_of_game | day_of_week | v_name | v_league | v_game_number | h_name | h_league | h_game_number | v_score | h_score | length_outs | day_night | completion | forefeit | protest | park_id | attendance | length_minutes | v_line_score | h_line_score | v_at_bats | v_hits | v_doubles | v_triples | v_homeruns | v_rbi | v_sacrifice_hits | v_sacrifice_flies | v_hit_by_pitch | v_walks | v_intentional_walks | v_strikeouts | v_stolen_bases | v_caught_stealing | v_grounded_into_double | v_first_catcher_interference | v_left_on_base | v_pitchers_used | v_individual_earned_runs | v_team_earned_runs | v_wild_pitches | v_balks | v_putouts | v_assists | v_errors | v_passed_balls | v_double_plays | v_triple_plays | h_at_bats | h_hits | h_doubles | h_triples | h_homeruns | h_rbi | h_sacrifice_hits | h_sacrifice_flies | h_hit_by_pitch | h_walks | h_intentional_walks | h_strikeouts | h_stolen_bases | h_caught_stealing | h_grounded_into_double | h_first_catcher_interference | h_left_on_base | h_pitchers_used | h_individual_earned_runs | h_team_earned_runs | h_wild_pitches | h_balks | h_putouts | h_assists | h_errors | h_passed_balls | h_double_plays | h_triple_plays | hp_umpire_id | hp_umpire_name | 1b_umpire_id | 1b_umpire_name | 2b_umpire_id | 2b_umpire_name | 3b_umpire_id | 3b_umpire_name | lf_umpire_id | lf_umpire_name | rf_umpire_id | rf_umpire_name | v_manager_id | v_manager_name | h_manager_id | h_manager_name | winning_pitcher_id | winning_pitcher_name | losing_pitcher_id | losing_pitcher_name | saving_pitcher_id | saving_pitcher_name | winning_rbi_batter_id | winning_rbi_batter_id_name | v_starting_pitcher_id | v_starting_pitcher_name | h_starting_pitcher_id | h_starting_pitcher_name | v_player_1_id | v_player_1_name | v_player_1_def_pos | v_player_2_id | v_player_2_name | v_player_2_def_pos | v_player_3_id | v_player_3_name | v_player_3_def_pos | v_player_4_id | v_player_4_name | v_player_4_def_pos | v_player_5_id | v_player_5_name | v_player_5_def_pos | v_player_6_id | v_player_6_name | v_player_6_def_pos | v_player_7_id | v_player_7_name | v_player_7_def_pos | v_player_8_id | v_player_8_name | v_player_8_def_pos | v_player_9_id | v_player_9_name | v_player_9_def_pos | h_player_1_id | h_player_1_name | h_player_1_def_pos | h_player_2_id | h_player_2_name | h_player_2_def_pos | h_player_3_id | h_player_3_name | h_player_3_def_pos | h_player_4_id | h_player_4_name | h_player_4_def_pos | h_player_5_id | h_player_5_name | h_player_5_def_pos | h_player_6_id | h_player_6_name | h_player_6_def_pos | h_player_7_id | h_player_7_name | h_player_7_def_pos | h_player_8_id | h_player_8_name | h_player_8_def_pos | h_player_9_id | h_player_9_name | h_player_9_def_pos | additional_info | acquisition_info | game_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18710504 | 0 | Thu | CL1 | None | 1 | FW1 | None | 1 | 0 | 2 | 54.0 | D | None | None | None | FOR01 | 200.0 | 120.0 | 000000000 | 010010000 | 30.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | None | 6.0 | 1.0 | None | -1.0 | None | 4.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 27.0 | 9.0 | 0.0 | 3.0 | 0.0 | 0.0 | 31.0 | 4.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | None | 0.0 | 0.0 | None | -1.0 | None | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | boakj901 | John Boake | None | None | None | None | None | None | None | None | None | None | paboc101 | Charlie Pabor | lennb101 | Bill Lennon | mathb101 | Bobby Mathews | prata101 | Al Pratt | None | None | None | None | prata101 | Al Pratt | mathb101 | Bobby Mathews | whitd102 | Deacon White | 2.0 | kimbg101 | Gene Kimball | 4.0 | paboc101 | Charlie Pabor | 7.0 | allia101 | Art Allison | 8.0 | white104 | Elmer White | 9.0 | prata101 | Al Pratt | 1.0 | sutte101 | Ezra Sutton | 5.0 | carlj102 | Jim Carleton | 3.0 | bassj101 | John Bass | 6.0 | selmf101 | Frank Sellman | 5.0 | mathb101 | Bobby Mathews | 1.0 | foraj101 | Jim Foran | 3.0 | goldw101 | Wally Goldsmith | 6.0 | lennb101 | Bill Lennon | 2.0 | caret101 | Tom Carey | 4.0 | mince101 | Ed Mincher | 7.0 | mcdej101 | James McDermott | 8.0 | kellb105 | Bill Kelly | 9.0 | None | Y | FW1187105040 |
1 | 18710505 | 0 | Fri | BS1 | None | 1 | WS3 | None | 1 | 20 | 18 | 54.0 | D | None | None | None | WAS01 | 5000.0 | 145.0 | 107000435 | 640113030 | 41.0 | 13.0 | 1.0 | 2.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 18.0 | None | 5.0 | 3.0 | None | -1.0 | None | 12.0 | 1.0 | 6.0 | 6.0 | 1.0 | 0.0 | 27.0 | 13.0 | 10.0 | 1.0 | 2.0 | 0.0 | 49.0 | 14.0 | 2.0 | 0.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 10.0 | None | 2.0 | 1.0 | None | -1.0 | None | 14.0 | 1.0 | 7.0 | 7.0 | 0.0 | 0.0 | 27.0 | 20.0 | 10.0 | 2.0 | 3.0 | 0.0 | dobsh901 | Henry Dobson | None | None | None | None | None | None | None | None | None | None | wrigh101 | Harry Wright | younn801 | Nick Young | spala101 | Al Spalding | braia102 | Asa Brainard | None | None | None | None | spala101 | Al Spalding | braia102 | Asa Brainard | wrigg101 | George Wright | 6.0 | barnr102 | Ross Barnes | 4.0 | birdd102 | Dave Birdsall | 9.0 | mcvec101 | Cal McVey | 2.0 | wrigh101 | Harry Wright | 8.0 | goulc101 | Charlie Gould | 3.0 | schah101 | Harry Schafer | 5.0 | conef101 | Fred Cone | 7.0 | spala101 | Al Spalding | 1.0 | watef102 | Fred Waterman | 5.0 | forcd101 | Davy Force | 6.0 | mille105 | Everett Mills | 3.0 | allid101 | Doug Allison | 2.0 | hallg101 | George Hall | 7.0 | leona101 | Andy Leonard | 4.0 | braia102 | Asa Brainard | 1.0 | burrh101 | Henry Burroughs | 9.0 | berth101 | Henry Berthrong | 8.0 | HTBF | Y | WS3187105050 |
park_codes table
run_query("select * from park_codes limit 2")
park_id | name | aka | city | state | start | end | league | notes | |
---|---|---|---|---|---|---|---|---|---|
0 | ALB01 | Riverside Park | None | Albany | NY | 09/11/1880 | 05/30/1882 | NL | TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882 |
1 | ALT01 | Columbia Park | None | Altoona | PA | 04/30/1884 | 05/31/1884 | UA | None |
person_codes table
run_query("select * from person_codes limit 2")
id | last | first | player_debut | mgr_debut | coach_debut | ump_debut | |
---|---|---|---|---|---|---|---|
0 | aardd001 | Aardsma | David | 04/06/2004 | None | None | None |
1 | aaroh101 | Aaron | Hank | 04/13/1954 | None | None | None |
team_codes table
run_query("select * from team_codes limit 2")
team_id | league | start | end | city | nickname | franch_id | seq | |
---|---|---|---|---|---|---|---|---|
0 | ALT | UA | 1884 | 1884 | Altoona | Mountain Cities | ALT | 1 |
1 | ARI | NL | 1998 | 0 | Arizona | Diamondbacks | ARI | 1 |
q = '''
SELECT * from pragma_table_info('game_log')
'''
run_query(q)
cid | name | type | notnull | dflt_value | pk | |
---|---|---|---|---|---|---|
0 | 0 | date | INTEGER | 0 | None | 0 |
1 | 1 | number_of_game | INTEGER | 0 | None | 0 |
2 | 2 | day_of_week | TEXT | 0 | None | 0 |
3 | 3 | v_name | TEXT | 0 | None | 0 |
4 | 4 | v_league | TEXT | 0 | None | 0 |
5 | 5 | v_game_number | INTEGER | 0 | None | 0 |
6 | 6 | h_name | TEXT | 0 | None | 0 |
7 | 7 | h_league | TEXT | 0 | None | 0 |
8 | 8 | h_game_number | INTEGER | 0 | None | 0 |
9 | 9 | v_score | INTEGER | 0 | None | 0 |
10 | 10 | h_score | INTEGER | 0 | None | 0 |
11 | 11 | length_outs | REAL | 0 | None | 0 |
12 | 12 | day_night | TEXT | 0 | None | 0 |
13 | 13 | completion | TEXT | 0 | None | 0 |
14 | 14 | forefeit | TEXT | 0 | None | 0 |
15 | 15 | protest | TEXT | 0 | None | 0 |
16 | 16 | park_id | TEXT | 0 | None | 0 |
17 | 17 | attendance | REAL | 0 | None | 0 |
18 | 18 | length_minutes | REAL | 0 | None | 0 |
19 | 19 | v_line_score | TEXT | 0 | None | 0 |
20 | 20 | h_line_score | TEXT | 0 | None | 0 |
21 | 21 | v_at_bats | REAL | 0 | None | 0 |
22 | 22 | v_hits | REAL | 0 | None | 0 |
23 | 23 | v_doubles | REAL | 0 | None | 0 |
24 | 24 | v_triples | REAL | 0 | None | 0 |
25 | 25 | v_homeruns | REAL | 0 | None | 0 |
26 | 26 | v_rbi | REAL | 0 | None | 0 |
27 | 27 | v_sacrifice_hits | REAL | 0 | None | 0 |
28 | 28 | v_sacrifice_flies | REAL | 0 | None | 0 |
29 | 29 | v_hit_by_pitch | REAL | 0 | None | 0 |
30 | 30 | v_walks | REAL | 0 | None | 0 |
31 | 31 | v_intentional_walks | REAL | 0 | None | 0 |
32 | 32 | v_strikeouts | REAL | 0 | None | 0 |
33 | 33 | v_stolen_bases | REAL | 0 | None | 0 |
34 | 34 | v_caught_stealing | REAL | 0 | None | 0 |
35 | 35 | v_grounded_into_double | REAL | 0 | None | 0 |
36 | 36 | v_first_catcher_interference | REAL | 0 | None | 0 |
37 | 37 | v_left_on_base | REAL | 0 | None | 0 |
38 | 38 | v_pitchers_used | REAL | 0 | None | 0 |
39 | 39 | v_individual_earned_runs | REAL | 0 | None | 0 |
40 | 40 | v_team_earned_runs | REAL | 0 | None | 0 |
41 | 41 | v_wild_pitches | REAL | 0 | None | 0 |
42 | 42 | v_balks | REAL | 0 | None | 0 |
43 | 43 | v_putouts | REAL | 0 | None | 0 |
44 | 44 | v_assists | REAL | 0 | None | 0 |
45 | 45 | v_errors | REAL | 0 | None | 0 |
46 | 46 | v_passed_balls | REAL | 0 | None | 0 |
47 | 47 | v_double_plays | REAL | 0 | None | 0 |
48 | 48 | v_triple_plays | REAL | 0 | None | 0 |
49 | 49 | h_at_bats | REAL | 0 | None | 0 |
50 | 50 | h_hits | REAL | 0 | None | 0 |
51 | 51 | h_doubles | REAL | 0 | None | 0 |
52 | 52 | h_triples | REAL | 0 | None | 0 |
53 | 53 | h_homeruns | REAL | 0 | None | 0 |
54 | 54 | h_rbi | REAL | 0 | None | 0 |
55 | 55 | h_sacrifice_hits | REAL | 0 | None | 0 |
56 | 56 | h_sacrifice_flies | REAL | 0 | None | 0 |
57 | 57 | h_hit_by_pitch | REAL | 0 | None | 0 |
58 | 58 | h_walks | REAL | 0 | None | 0 |
59 | 59 | h_intentional_walks | REAL | 0 | None | 0 |
60 | 60 | h_strikeouts | REAL | 0 | None | 0 |
61 | 61 | h_stolen_bases | REAL | 0 | None | 0 |
62 | 62 | h_caught_stealing | REAL | 0 | None | 0 |
63 | 63 | h_grounded_into_double | REAL | 0 | None | 0 |
64 | 64 | h_first_catcher_interference | REAL | 0 | None | 0 |
65 | 65 | h_left_on_base | REAL | 0 | None | 0 |
66 | 66 | h_pitchers_used | REAL | 0 | None | 0 |
67 | 67 | h_individual_earned_runs | REAL | 0 | None | 0 |
68 | 68 | h_team_earned_runs | REAL | 0 | None | 0 |
69 | 69 | h_wild_pitches | REAL | 0 | None | 0 |
70 | 70 | h_balks | REAL | 0 | None | 0 |
71 | 71 | h_putouts | REAL | 0 | None | 0 |
72 | 72 | h_assists | REAL | 0 | None | 0 |
73 | 73 | h_errors | REAL | 0 | None | 0 |
74 | 74 | h_passed_balls | REAL | 0 | None | 0 |
75 | 75 | h_double_plays | REAL | 0 | None | 0 |
76 | 76 | h_triple_plays | REAL | 0 | None | 0 |
77 | 77 | hp_umpire_id | TEXT | 0 | None | 0 |
78 | 78 | hp_umpire_name | TEXT | 0 | None | 0 |
79 | 79 | 1b_umpire_id | TEXT | 0 | None | 0 |
80 | 80 | 1b_umpire_name | TEXT | 0 | None | 0 |
81 | 81 | 2b_umpire_id | TEXT | 0 | None | 0 |
82 | 82 | 2b_umpire_name | TEXT | 0 | None | 0 |
83 | 83 | 3b_umpire_id | TEXT | 0 | None | 0 |
84 | 84 | 3b_umpire_name | TEXT | 0 | None | 0 |
85 | 85 | lf_umpire_id | TEXT | 0 | None | 0 |
86 | 86 | lf_umpire_name | TEXT | 0 | None | 0 |
87 | 87 | rf_umpire_id | TEXT | 0 | None | 0 |
88 | 88 | rf_umpire_name | TEXT | 0 | None | 0 |
89 | 89 | v_manager_id | TEXT | 0 | None | 0 |
90 | 90 | v_manager_name | TEXT | 0 | None | 0 |
91 | 91 | h_manager_id | TEXT | 0 | None | 0 |
92 | 92 | h_manager_name | TEXT | 0 | None | 0 |
93 | 93 | winning_pitcher_id | TEXT | 0 | None | 0 |
94 | 94 | winning_pitcher_name | TEXT | 0 | None | 0 |
95 | 95 | losing_pitcher_id | TEXT | 0 | None | 0 |
96 | 96 | losing_pitcher_name | TEXT | 0 | None | 0 |
97 | 97 | saving_pitcher_id | TEXT | 0 | None | 0 |
98 | 98 | saving_pitcher_name | TEXT | 0 | None | 0 |
99 | 99 | winning_rbi_batter_id | TEXT | 0 | None | 0 |
100 | 100 | winning_rbi_batter_id_name | TEXT | 0 | None | 0 |
101 | 101 | v_starting_pitcher_id | TEXT | 0 | None | 0 |
102 | 102 | v_starting_pitcher_name | TEXT | 0 | None | 0 |
103 | 103 | h_starting_pitcher_id | TEXT | 0 | None | 0 |
104 | 104 | h_starting_pitcher_name | TEXT | 0 | None | 0 |
105 | 105 | v_player_1_id | TEXT | 0 | None | 0 |
106 | 106 | v_player_1_name | TEXT | 0 | None | 0 |
107 | 107 | v_player_1_def_pos | REAL | 0 | None | 0 |
108 | 108 | v_player_2_id | TEXT | 0 | None | 0 |
109 | 109 | v_player_2_name | TEXT | 0 | None | 0 |
110 | 110 | v_player_2_def_pos | REAL | 0 | None | 0 |
111 | 111 | v_player_3_id | TEXT | 0 | None | 0 |
112 | 112 | v_player_3_name | TEXT | 0 | None | 0 |
113 | 113 | v_player_3_def_pos | REAL | 0 | None | 0 |
114 | 114 | v_player_4_id | TEXT | 0 | None | 0 |
115 | 115 | v_player_4_name | TEXT | 0 | None | 0 |
116 | 116 | v_player_4_def_pos | REAL | 0 | None | 0 |
117 | 117 | v_player_5_id | TEXT | 0 | None | 0 |
118 | 118 | v_player_5_name | TEXT | 0 | None | 0 |
119 | 119 | v_player_5_def_pos | REAL | 0 | None | 0 |
120 | 120 | v_player_6_id | TEXT | 0 | None | 0 |
121 | 121 | v_player_6_name | TEXT | 0 | None | 0 |
122 | 122 | v_player_6_def_pos | REAL | 0 | None | 0 |
123 | 123 | v_player_7_id | TEXT | 0 | None | 0 |
124 | 124 | v_player_7_name | TEXT | 0 | None | 0 |
125 | 125 | v_player_7_def_pos | REAL | 0 | None | 0 |
126 | 126 | v_player_8_id | TEXT | 0 | None | 0 |
127 | 127 | v_player_8_name | TEXT | 0 | None | 0 |
128 | 128 | v_player_8_def_pos | REAL | 0 | None | 0 |
129 | 129 | v_player_9_id | TEXT | 0 | None | 0 |
130 | 130 | v_player_9_name | TEXT | 0 | None | 0 |
131 | 131 | v_player_9_def_pos | REAL | 0 | None | 0 |
132 | 132 | h_player_1_id | TEXT | 0 | None | 0 |
133 | 133 | h_player_1_name | TEXT | 0 | None | 0 |
134 | 134 | h_player_1_def_pos | REAL | 0 | None | 0 |
135 | 135 | h_player_2_id | TEXT | 0 | None | 0 |
136 | 136 | h_player_2_name | TEXT | 0 | None | 0 |
137 | 137 | h_player_2_def_pos | REAL | 0 | None | 0 |
138 | 138 | h_player_3_id | TEXT | 0 | None | 0 |
139 | 139 | h_player_3_name | TEXT | 0 | None | 0 |
140 | 140 | h_player_3_def_pos | REAL | 0 | None | 0 |
141 | 141 | h_player_4_id | TEXT | 0 | None | 0 |
142 | 142 | h_player_4_name | TEXT | 0 | None | 0 |
143 | 143 | h_player_4_def_pos | REAL | 0 | None | 0 |
144 | 144 | h_player_5_id | TEXT | 0 | None | 0 |
145 | 145 | h_player_5_name | TEXT | 0 | None | 0 |
146 | 146 | h_player_5_def_pos | REAL | 0 | None | 0 |
147 | 147 | h_player_6_id | TEXT | 0 | None | 0 |
148 | 148 | h_player_6_name | TEXT | 0 | None | 0 |
149 | 149 | h_player_6_def_pos | REAL | 0 | None | 0 |
150 | 150 | h_player_7_id | TEXT | 0 | None | 0 |
151 | 151 | h_player_7_name | TEXT | 0 | None | 0 |
152 | 152 | h_player_7_def_pos | REAL | 0 | None | 0 |
153 | 153 | h_player_8_id | TEXT | 0 | None | 0 |
154 | 154 | h_player_8_name | TEXT | 0 | None | 0 |
155 | 155 | h_player_8_def_pos | REAL | 0 | None | 0 |
156 | 156 | h_player_9_id | TEXT | 0 | None | 0 |
157 | 157 | h_player_9_name | TEXT | 0 | None | 0 |
158 | 158 | h_player_9_def_pos | REAL | 0 | None | 0 |
159 | 159 | additional_info | TEXT | 0 | None | 0 |
160 | 160 | acquisition_info | TEXT | 0 | None | 0 |
161 | 161 | game_id | TEXT | 0 | None | 0 |
The following are opportunities for normalization of our data:
We'll start by creating new tables which don't contain any foreign key relations. It's important to start with these tables, as other tables will have relations to these tables, and so these tables will need to exist first.
We will create following tables:
person
park
league
appearance_type
person
c1 = '''
CREATE TABLE IF NOT EXISTS person (
person_id TEXT PRIMARY KEY,
first_name TEXT,
last_name TEXT
);
'''
c2 = '''
INSERT OR IGNORE INTO person
SELECT
id,
first,
last
FROM person_codes;
'''
q = '''
SELECT
*
FROM person
LIMIT 5;
'''
run_command(c1)
run_command(c2)
run_query(q)
person_id | first_name | last_name | |
---|---|---|---|
0 | aardd001 | David | Aardsma |
1 | aaroh101 | Hank | Aaron |
2 | aarot101 | Tommie | Aaron |
3 | aased001 | Don | Aase |
4 | abada001 | Andy | Abad |
park
c1 = '''
CREATE TABLE IF NOT EXISTS park (
park_id TEXT PRIMARY KEY,
name TEXT,
nickname TEXT,
city TEXT,
state TEXT,
notes TEXT
);
'''
c2 = '''
INSERT OR IGNORE INTO park
SELECT
park_id,
name,
aka,
city,
state,
notes
FROM park_codes;
'''
q = '''
SELECT
*
FROM park
LIMIT 5;
'''
run_command(c1)
run_command(c2)
run_query(q)
park_id | name | nickname | city | state | notes | |
---|---|---|---|---|---|---|
0 | ALB01 | Riverside Park | None | Albany | NY | TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882 |
1 | ALT01 | Columbia Park | None | Altoona | PA | None |
2 | ANA01 | Angel Stadium of Anaheim | Edison Field; Anaheim Stadium | Anaheim | CA | None |
3 | ARL01 | Arlington Stadium | None | Arlington | TX | None |
4 | ARL02 | Rangers Ballpark in Arlington | The Ballpark in Arlington; Ameriquest Fl | Arlington | TX | None |
league
c1 = '''
CREATE TABLE IF NOT EXISTS league (
league_id TEXT PRIMARY KEY,
league_name TEXT
);
'''
c2 = '''
INSERT OR IGNORE INTO league
VALUES
("NL", "National League"),
("AL", "American League"),
("AA", "American Association"),
("FL", "Federal League"),
("PL", "Players League"),
("UA", "Union Association")
;
'''
q = '''
SELECT
*
FROM league;
'''
run_command(c1)
run_command(c2)
run_query(q)
league_id | league_name | |
---|---|---|
0 | NL | National League |
1 | AL | American League |
2 | AA | American Association |
3 | FL | Federal League |
4 | PL | Players League |
5 | UA | Union Association |
appearance_type
We have a appearance_type.csv file, which contains all values need for this table.
c1 = '''DROP TABLE IF EXISTS appearance_type'''
c2 = '''
CREATE TABLE IF NOT EXISTS appearance_type (
appearance_type_id TEXT PRIMARY KEY,
name TEXT,
category TEXT
);
'''
run_command(c1)
run_command(c2)
appearance_pd = pd.read_csv('appearance_type.csv')
with sqlite3.connect('mlb.db') as conn:
appearance_pd.to_sql('appearance_type',
conn,
index=False,
if_exists='append')
q = '''
SELECT
*
FROM appearance_type;
'''
run_query(q)
appearance_type_id | name | category | |
---|---|---|---|
0 | O1 | Batter 1 | offense |
1 | O2 | Batter 2 | offense |
2 | O3 | Batter 3 | offense |
3 | O4 | Batter 4 | offense |
4 | O5 | Batter 5 | offense |
5 | O6 | Batter 6 | offense |
6 | O7 | Batter 7 | offense |
7 | O8 | Batter 8 | offense |
8 | O9 | Batter 9 | offense |
9 | D1 | Pitcher | defense |
10 | D2 | Catcher | defense |
11 | D3 | 1st Base | defense |
12 | D4 | 2nd Base | defense |
13 | D5 | 3rd Base | defense |
14 | D6 | Shortstop | defense |
15 | D7 | Left Field | defense |
16 | D8 | Center Field | defense |
17 | D9 | Right Field | defense |
18 | D10 | Unknown Position | defense |
19 | UHP | Home Plate | umpire |
20 | U1B | First Base | umpire |
21 | U2B | Second Base | umpire |
22 | U3B | Third Base | umpire |
23 | ULF | Left Field | umpire |
24 | URF | Right Field | umpire |
25 | MM | Manager | manager |
26 | AWP | Winning Pitcher | award |
27 | ALP | Losing Pitcher | award |
28 | ASP | Saving Pitcher | award |
29 | AWB | Winning RBI Batter | award |
30 | PSP | Starting Pitcher | pitcher |
Re-using the run_command() function defined earlier, we can add a single line to enable enforcement of foreign key restraints:
def run_command(c): with sqlite3.connect(DB) as conn: conn.execute('PRAGMA foreign_keys = ON;') conn.isolation_level = None conn.execute(c)
team
def run_command(c):
with sqlite3.connect(db) as conn:
conn.execute('PRAGMA foreign_keys = ON;')
conn.isolation_level = None
conn.execute(c)
c1 = """
CREATE TABLE IF NOT EXISTS team (
team_id TEXT PRIMARY KEY,
league_id TEXT,
city TEXT,
nickname TEXT,
franch_id TEXT,
FOREIGN KEY (league_id) REFERENCES league(league_id)
);
"""
c2 = """
INSERT OR IGNORE INTO team
SELECT
team_id,
league,
city,
nickname,
franch_id
FROM team_codes;
"""
q = """
SELECT * FROM team
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
team_id | league_id | city | nickname | franch_id | |
---|---|---|---|---|---|
0 | ALT | UA | Altoona | Mountain Cities | ALT |
1 | ARI | NL | Arizona | Diamondbacks | ARI |
2 | BFN | NL | Buffalo | Bisons | BFN |
3 | BFP | PL | Buffalo | Bisons | BFP |
4 | BL1 | None | Baltimore | Canaries | BL1 |
game
c1 = """
CREATE TABLE IF NOT EXISTS game (
game_id TEXT PRIMARY KEY,
date TEXT,
number_of_game INTEGER,
park_id TEXT,
length_outs INTEGER,
day BOOLEAN,
completion TEXT,
forefeit TEXT,
protest TEXT,
attendance INTEGER,
legnth_minutes INTEGER,
additional_info TEXT,
acquisition_info TEXT,
FOREIGN KEY (park_id) REFERENCES park(park_id)
);
"""
c2 = """
INSERT OR IGNORE INTO game
SELECT
game_id,
date,
number_of_game,
park_id,
length_outs,
CASE
WHEN day_night = "D" THEN 1
WHEN day_night = "N" THEN 0
ELSE NULL
END
AS day,
completion,
forefeit,
protest,
attendance,
length_minutes,
additional_info,
acquisition_info
FROM game_log;
"""
q = """
SELECT * FROM game
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
game_id | date | number_of_game | park_id | length_outs | day | completion | forefeit | protest | attendance | legnth_minutes | additional_info | acquisition_info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FW1187105040 | 18710504 | 0 | FOR01 | 54 | 1 | None | None | None | 200 | 120 | None | Y |
1 | WS3187105050 | 18710505 | 0 | WAS01 | 54 | 1 | None | None | None | 5000 | 145 | HTBF | Y |
2 | RC1187105060 | 18710506 | 0 | RCK01 | 54 | 1 | None | None | None | 1000 | 140 | None | Y |
3 | CH1187105080 | 18710508 | 0 | CHI01 | 54 | 1 | None | None | None | 5000 | 150 | None | Y |
4 | TRO187105090 | 18710509 | 0 | TRO01 | 54 | 1 | None | None | None | 3250 | 145 | HTBF | Y |
Team Appearance
The team_appearance table has a compound primary key composed of the team name and the game ID. In addition, a boolean column home is used to differentiate between the home and the away team. The rest of the columns are scores or statistics that in our original game log are repeated for each of the home and away teams.
In order to insert this data cleanly, we'll need to use a UNION clause. This will combine data from home team and versus team.
c1 = """
CREATE TABLE IF NOT EXISTS team_appearance (
team_id TEXT,
game_id TEXT,
home BOOLEAN,
league_id TEXT,
score INTEGER,
line_score TEXT,
at_bats INTEGER,
hits INTEGER,
doubles INTEGER,
triples INTEGER,
homeruns INTEGER,
rbi INTEGER,
sacrifice_hits INTEGER,
sacrifice_flies INTEGER,
hit_by_pitch INTEGER,
walks INTEGER,
intentional_walks INTEGER,
strikeouts INTEGER,
stolen_bases INTEGER,
caught_stealing INTEGER,
grounded_into_double INTEGER,
first_catcher_interference INTEGER,
left_on_base INTEGER,
pitchers_used INTEGER,
individual_earned_runs INTEGER,
team_earned_runs INTEGER,
wild_pitches INTEGER,
balks INTEGER,
putouts INTEGER,
assists INTEGER,
errors INTEGER,
passed_balls INTEGER,
double_plays INTEGER,
triple_plays INTEGER,
PRIMARY KEY (team_id, game_id),
FOREIGN KEY (team_id) REFERENCES team(team_id),
FOREIGN KEY (game_id) REFERENCES game(game_id),
FOREIGN KEY (league_id) REFERENCES league(league_id)
);
"""
run_command(c1)
c2 = """
INSERT OR IGNORE INTO team_appearance
SELECT
h_name,
game_id,
1 AS home,
h_league,
h_score,
h_line_score,
h_at_bats,
h_hits,
h_doubles,
h_triples,
h_homeruns,
h_rbi,
h_sacrifice_hits,
h_sacrifice_flies,
h_hit_by_pitch,
h_walks,
h_intentional_walks,
h_strikeouts,
h_stolen_bases,
h_caught_stealing,
h_grounded_into_double,
h_first_catcher_interference,
h_left_on_base,
h_pitchers_used,
h_individual_earned_runs,
h_team_earned_runs,
h_wild_pitches,
h_balks,
h_putouts,
h_assists,
h_errors,
h_passed_balls,
h_double_plays,
h_triple_plays
FROM game_log
UNION
SELECT
v_name,
game_id,
0 AS home,
v_league,
v_score,
v_line_score,
v_at_bats,
v_hits,
v_doubles,
v_triples,
v_homeruns,
v_rbi,
v_sacrifice_hits,
v_sacrifice_flies,
v_hit_by_pitch,
v_walks,
v_intentional_walks,
v_strikeouts,
v_stolen_bases,
v_caught_stealing,
v_grounded_into_double,
v_first_catcher_interference,
v_left_on_base,
v_pitchers_used,
v_individual_earned_runs,
v_team_earned_runs,
v_wild_pitches,
v_balks,
v_putouts,
v_assists,
v_errors,
v_passed_balls,
v_double_plays,
v_triple_plays
from game_log;
"""
run_command(c2)
q = """
SELECT * FROM team_appearance
WHERE game_id = (SELECT MIN(game_id) FROM team_appearance)
OR game_id = (SELECT MAX(game_id) FROM team_appearance)
ORDER By game_id, home;
"""
run_query(q)
team_id | game_id | home | league_id | score | line_score | at_bats | hits | doubles | triples | homeruns | rbi | sacrifice_hits | sacrifice_flies | hit_by_pitch | walks | intentional_walks | strikeouts | stolen_bases | caught_stealing | grounded_into_double | first_catcher_interference | left_on_base | pitchers_used | individual_earned_runs | team_earned_runs | wild_pitches | balks | putouts | assists | errors | passed_balls | double_plays | triple_plays | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SLU | ALT188404300 | 0 | UA | 15 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
1 | ALT | ALT188404300 | 1 | UA | 2 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
2 | BSU | WSU188409250 | 0 | UA | 2 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
3 | WSU | WSU188409250 | 1 | UA | 10 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
Person Appearance
The final table we need to create is person_appearance. It has foreign key relations to four tables: - team - person - game - appearance_type
The person_appearance table will be used to store information on appearances in games by managers, players, and umpires as detailed in the appearance_type table.
We'll need to use a similar technique to insert data as we used with the team_appearance table, however we will have to write much larger queries - one for each column instead of one for each team as before. We will need to work out for each column what the appearance_type_id will be by cross-referencing the columns with the appearance_type table.
We have decided to create an integer primary key for this table, because having every column be a compound primary quickly becomes cumbersome when writing queries.
When we get to the offensive and defensive positions for both teams, we essentially are performing 36 permutations: 2 (home, away) * 2 (offense + defense) * 9 (9 positions).
To save us from manually copying this out, we can instead use a loop and python string formatting to generate the queries:
template = """
INSERT INTO person_appearance (
game_id,
team_id,
person_id,
appearance_type_id
)
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"O{num}"
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL
UNION
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"D" || CAST({hv}_player_{num}_def_pos AS INT)
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL;
"""
run_command(c1)
run_command(c2)
for hv in ["h","v"]:
for num in range(1,10):
query_vars = {
"hv": hv,
"num": num
}
# run commmand is a helper function which runs
# a query against our database.
run_command(template.format(**query_vars))
However, we will still need to manually write down queries for inserting data related to umpires, managers, pitchers and awards.
c0 = "DROP TABLE IF EXISTS person_appearance"
run_command(c0)
c1 = """
CREATE TABLE person_appearance (
appearance_id INTEGER PRIMARY KEY,
person_id TEXT,
team_id TEXT,
game_id TEXT,
appearance_type_id,
FOREIGN KEY (person_id) REFERENCES person(person_id),
FOREIGN KEY (team_id) REFERENCES team(team_id),
FOREIGN KEY (game_id) REFERENCES game(game_id),
FOREIGN KEY (appearance_type_id) REFERENCES appearance_type(appearance_type_id)
);
"""
c2 = """
INSERT OR IGNORE INTO person_appearance (
game_id,
team_id,
person_id,
appearance_type_id
)
SELECT
game_id,
NULL,
hp_umpire_id,
"UHP"
FROM game_log
WHERE hp_umpire_id IS NOT NULL
UNION
SELECT
game_id,
NULL,
[1b_umpire_id],
"U1B"
FROM game_log
WHERE "1b_umpire_id" IS NOT NULL
UNION
SELECT
game_id,
NULL,
[2b_umpire_id],
"U2B"
FROM game_log
WHERE [2b_umpire_id] IS NOT NULL
UNION
SELECT
game_id,
NULL,
[3b_umpire_id],
"U3B"
FROM game_log
WHERE [3b_umpire_id] IS NOT NULL
UNION
SELECT
game_id,
NULL,
lf_umpire_id,
"ULF"
FROM game_log
WHERE lf_umpire_id IS NOT NULL
UNION
SELECT
game_id,
NULL,
rf_umpire_id,
"URF"
FROM game_log
WHERE rf_umpire_id IS NOT NULL
UNION
SELECT
game_id,
v_name,
v_manager_id,
"MM"
FROM game_log
WHERE v_manager_id IS NOT NULL
UNION
SELECT
game_id,
h_name,
h_manager_id,
"MM"
FROM game_log
WHERE h_manager_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
winning_pitcher_id,
"AWP"
FROM game_log
WHERE winning_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score < v_score THEN h_name
ELSE v_name
END,
losing_pitcher_id,
"ALP"
FROM game_log
WHERE losing_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
saving_pitcher_id,
"ASP"
FROM game_log
WHERE saving_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
winning_rbi_batter_id,
"AWB"
FROM game_log
WHERE winning_rbi_batter_id IS NOT NULL
UNION
SELECT
game_id,
v_name,
v_starting_pitcher_id,
"PSP"
FROM game_log
WHERE v_starting_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
h_name,
h_starting_pitcher_id,
"PSP"
FROM game_log
WHERE h_starting_pitcher_id IS NOT NULL;
"""
template = """
INSERT INTO person_appearance (
game_id,
team_id,
person_id,
appearance_type_id
)
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"O{num}"
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL
UNION
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"D" || CAST({hv}_player_{num}_def_pos AS INT)
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL;
"""
run_command(c1)
run_command(c2)
for hv in ["h","v"]:
for num in range(1,10):
query_vars = {
"hv": hv,
"num": num
}
run_command(template.format(**query_vars))
print(run_query("SELECT COUNT(DISTINCT game_id) games_game FROM game"))
print(run_query("SELECT COUNT(DISTINCT game_id) games_person_appearance FROM person_appearance"))
q = """
SELECT
pa.*,
at.name,
at.category
FROM person_appearance pa
INNER JOIN appearance_type at on at.appearance_type_id = pa.appearance_type_id
WHERE PA.game_id = (
SELECT max(game_id)
FROM person_appearance
)
ORDER BY team_id, appearance_type_id
"""
run_query(q)
games_game 0 171907 games_person_appearance 0 171907
appearance_id | person_id | team_id | game_id | appearance_type_id | name | category | |
---|---|---|---|---|---|---|---|
0 | 1646114 | steab101 | None | WSU188409250 | UHP | Home Plate | umpire |
1 | 1646116 | murnt101 | BSU | WSU188409250 | MM | Manager | manager |
2 | 1646115 | crane101 | BSU | WSU188409250 | PSP | Starting Pitcher | pitcher |
3 | 1646118 | scanm801 | WSU | WSU188409250 | MM | Manager | manager |
4 | 1646117 | dailh101 | WSU | WSU188409250 | PSP | Starting Pitcher | pitcher |
Drop the tables we created to hold our unnormalized data:
show_tables()
name | type | |
---|---|---|
0 | person | table |
1 | park | table |
2 | league | table |
3 | team | table |
4 | game | table |
5 | team_appearance | table |
6 | game_log | table |
7 | park_codes | table |
8 | person_codes | table |
9 | team_codes | table |
10 | appearance_type | table |
11 | person_appearance | table |
tables = [
"game_log",
"park_codes",
"team_codes",
"person_codes"
]
for t in tables:
c = '''
DROP TABLE {}
'''.format(t)
run_command(c)
*Our new database consists of the following mentioned tables*
show_tables()
name | type | |
---|---|---|
0 | person | table |
1 | park | table |
2 | league | table |
3 | team | table |
4 | game | table |
5 | team_appearance | table |
6 | appearance_type | table |
7 | person_appearance | table |
ml.db
with normalised tables¶