The Fight for McLaren Dominance: Lewis Hamilton vs Fernando Alonso

1. Project Overview

I obtained this data from Kaggle.

We'll be focusing on the main race on Sunday.

In this project, I use different metrics to compare the two McLaren drivers in the 2007 season: Fernando Alonso (from Spain) and Lewis Hamilton (from the UK). A two-time world champion, Alonso has driven in F1 since 2001 and is known for extracting strong performances from poorly engineered cars. By contrast, this is Hamilton's first season in Formula 1; he previously raced in GP2 (a former F1 feeder series).

Hamilton (22 years old in 2007) is younger and less experienced than Alonso (26 years old in 2007), so we would expect Alonso to perform better.

The metrics we'll be using to evaluate each driver will be their ability to:

  • Begin towards the front of the pack (good grid/starting positioning).
  • Maintain or improve their position on each lap of a given race.
  • Win races.
  • Win points.

Wins and points will be weighted more heavily since they directly determine a driver's standing in a season. However, grid and lap positions heavily influence both of these and can help uncover nuances between drivers who have a similar number of wins/points.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<span style="font-weight:bold">
NOTE: The raw code for this notebook is by default hidden for easier reading.

To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.</span>''')
Out[1]:
NOTE: The raw code for this notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

2. Data Cleaning

Data cleaning mostly consisted of flagging outliers, setting placeholders for missing values, checking consistency between related variables, and renaming columns. After the files were cleaned, I used SQL to join relevant fields from different tables and then made some final changes prior to this analysis.
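
For readers who want a feel for what these cleaning steps look like, here is a minimal sketch in pandas and SQLite. It is not the cleaning code used for this project: the toy tables, the raw column names (`positionOrder`, `milliseconds`), the 1.5 x IQR outlier rule, and the -1 placeholder are all assumptions chosen for illustration.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical raw tables; every value below is made up for illustration.
raw = pd.DataFrame({
    "raceId": [1, 1, 1, 1, 1],
    "driverId": [10, 11, 12, 13, 14],
    "positionOrder": [1, 2, 3, 4, 5],
    "milliseconds": [5400000.0, 5401000.0, 5402500.0, np.nan, 9900000.0],
})
drivers = pd.DataFrame({
    "driverId": [10, 11, 12, 13, 14],
    "dname": ["A", "B", "C", "D", "E"],
})

# 1. Rename columns to something more readable.
clean = raw.rename(columns={"positionOrder": "position",
                            "milliseconds": "race_time_ms"})

# 2. Flag outliers with a 1.5 x IQR rule (assumed rule, computed before
#    placeholders are filled in so they cannot distort the quartiles).
q1, q3 = clean["race_time_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["time_outlier"] = ((clean["race_time_ms"] < q1 - 1.5 * iqr)
                         | (clean["race_time_ms"] > q3 + 1.5 * iqr))

# 3. Set a placeholder for missing values (assumed placeholder: -1).
clean["race_time_ms"] = clean["race_time_ms"].fillna(-1)

# 4. Consistency check: no two drivers share a finishing position in a race.
has_duplicate_positions = clean.duplicated(["raceId", "position"]).any()

# 5. Join relevant fields from the two tables with SQL.
conn = sqlite3.connect(":memory:")
clean[["raceId", "driverId", "position"]].to_sql("results", conn, index=False)
drivers.to_sql("drivers", conn, index=False)
joined = pd.read_sql_query(
    """
    SELECT r.raceId, d.dname, r.position
    FROM results AS r
    JOIN drivers AS d ON d.driverId = r.driverId
    ORDER BY r.position
    """,
    conn,
)
conn.close()
```

Flagging outliers in a separate column, rather than dropping them, keeps the original rows available in case a flagged value turns out to be legitimate (for example, a race interrupted by a long stoppage).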

3. Preparation for Analysis

We'll start out by importing the relevant libraries and data. Please note you won't see anything in this section unless you toggle on the raw code.

In [2]:
# import libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# load up datasets
re = pd.read_pickle('final_data/agg_results')
ham_2007 = pd.read_pickle('final_data/ham_results')
ham_2007_lap = pd.read_pickle('final_data/lap_times')
lap_all = pd.read_pickle('final_data/lap_times_all')
ham_quali_2007 = pd.read_pickle('final_data/qualifying')

4. Grid Positioning

Right before the race begins, the drivers line up on the starting grid. The grid order is determined by each driver's performance in qualifying. We consider a driver to have outperformed their teammate in a given race if they start ahead of that teammate on the grid.
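
The teammate comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the column names `race`, `dname`, and `grid` mirror the ones used in this notebook, but the drivers and grid positions below are made up, and the sketch assumes every race has a valid grid position for both drivers.

```python
import pandas as pd

# Hypothetical grid positions for two teammates over three races.
grid = pd.DataFrame({
    "race": ["R1", "R1", "R2", "R2", "R3", "R3"],
    "dname": ["Driver A", "Driver B"] * 3,
    "grid": [2, 4, 3, 1, 1, 2],
})

# Reshape to one row per race and one column per driver.
wide = grid.pivot(index="race", columns="dname", values="grid")

# A lower grid number means a start closer to the front,
# so the driver with the minimum value in each row started ahead.
ahead = wide.idxmin(axis=1)

# Count how many races each driver started ahead of their teammate.
tally = ahead.value_counts()
```

With these made-up numbers, `tally` credits Driver A with two races and Driver B with one.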

In [3]:
# explore relationship between qualifying/grid positioning and wins/podiums/positions on aggregate level
re_pos = re[['race', 'dname', 'position', 'grid']].copy()
re_pos['win'] = (re_pos.position == 1).astype('int64')
re_win_agg = re_pos.groupby(['grid'], as_index=False)['win'].sum()
# keep only rows with a positive grid position
re_win_agg = re_win_agg[re_win_agg['grid'] > 0]

fig = px.line(re_win_agg, x='grid', y='win', range_x=[0, 33], labels={'win': 'Number of Wins', 'grid': 'Grid Position'},
             title='Relationship of Starting Position and Wins (1950-2017)')
fig.show()