I obtained this data from Kaggle, which can be found here
We'll be focusing on the main race on Sunday.
In this project, I use several metrics to compare the two McLaren drivers in the 2007 season: Fernando Alonso (from Spain) and Lewis Hamilton (from the UK). A two-time world champion, Alonso has driven in F1 since 2001 and is known for extracting strong results from poorly engineered cars. Conversely, this is Hamilton's first season in Formula 1; he previously raced in GP2 (a former F1 feeder series).
Hamilton (22 years old in 2007) is younger and less experienced than Alonso (26 years old in 2007), so we would expect Alonso to perform better.
The metrics we'll be using to evaluate each driver will be their ability to:
Wins and points will be weighted more heavily since these directly dictate a driver's standing in a season. However, grid and lap positioning heavily influence these two factors and can help uncover any nuances between drivers who have a similar number of wins/points.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<span style="font-weight:bold">
NOTE: The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.</span>''')
Data cleaning mostly consisted of flagging outliers, setting placeholders for missing values, checking consistency between related variables, and renaming columns. After the files were cleaned, I used SQL to join relevant fields from different tables and then made some final changes prior to this analysis.
We'll start out by importing the relevant libraries and data. Please note you won't see anything in this section unless you toggle on the raw code.
# import libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
# load up datasets
re = pd.read_pickle('final_data/agg_results')
ham_2007 = pd.read_pickle('final_data/ham_results')
ham_2007_lap = pd.read_pickle('final_data/lap_times')
lap_all = pd.read_pickle('final_data/lap_times_all')
ham_quali_2007 = pd.read_pickle('final_data/qualifying')
Right before the actual race begins, all the drivers position themselves along the track. The order of this positioning is determined by each driver's performance in qualifying. We deem that one driver outperforms another in any given race if their grid positioning is better/ahead of their teammate's grid positioning.
# explore relationship between qualifying/grid positioning and wins/podiums/positions on aggregate level
re_pos = re[['race', 'dname', 'position', 'grid']].copy()
re_pos['win'] = (re_pos.position == 1).astype('int64')
re_win_agg = re_pos.groupby(['grid'], as_index=False)['win'].sum()
re_win_agg = re_win_agg.loc[re_win_agg['grid'] > 0]
fig = px.line(re_win_agg, x='grid', y='win', range_x=[0, 33], labels={'win': 'Number of Wins', 'grid': 'Grid Position'},
title='Relationship of Starting Position and Wins (1950-2017)')
fig.show()
In the graph above, we see that final results are heavily influenced by grid (initial) positioning. The closer to the front a driver starts, the more likely they are to win.
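Raw win counts can be complemented by a win rate, which divides wins by the number of starts from each grid slot. The sketch below is a minimal illustration, assuming a results table with one row per driver per race and the same `grid` and `position` columns as above; the `toy` frame and its values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy results table (hypothetical values); the real input would be the
# aggregated results frame loaded earlier.
toy = pd.DataFrame({
    'race':     ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'grid':     [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'position': [1, 2, 3, 2, 1, 3, 1, 3, 2, 1, 2, 3],
})

# Flag race winners, as in the cell above.
toy['win'] = (toy['position'] == 1).astype('int64')

# Count starts and wins per grid slot, then take the ratio.
win_rate = toy.groupby('grid')['win'].agg(starts='count', wins='sum')
win_rate['win_rate'] = win_rate['wins'] / win_rate['starts']
print(win_rate)
```

A rate is easier to read than a count when some grid slots are occupied far less often than others, which is likely for the back of the grid over a long period.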
From this point forward, all of our analysis will focus on the 2007 season.
ham_2007_piv = ham_2007.copy()
ham_2007_piv['race_date'] = ham_2007_piv['race_date'].astype('str')
ham_2007_piv['race'] = ham_2007_piv['race_date'] + '_' + ham_2007_piv['race']
ham_2007_piv = ham_2007_piv.pivot(index='race', columns='dname', values=['grid', 'position', 'points', 'fastestLapSpeed',
'fastestLapTime']).reset_index()
# make pivoted version for easier analysis
ham_2007_piv['race'] = ham_2007_piv['race'].str.split('_', n=1, expand=False)
ham_2007_piv['race_date'] = pd.to_datetime(ham_2007_piv['race'].str[0])
ham_2007_piv['race_'] = ham_2007_piv['race'].str[1]
ham_2007_piv = ham_2007_piv[['race_date', 'race_', 'grid', 'position', 'points', 'fastestLapSpeed', 'fastestLapTime']]
ham_2007_piv.rename(columns={'race_':'race'}, inplace=True)
ham_2007_piv.sort_values(by=['race_date', 'race'], inplace=True)
# create dataframe focusing on positioning
ham_2007_piv_pos = ham_2007_piv[['race_date', 'race', 'position', 'grid']].copy()
ham_2007_piv_pos['grid'] = ham_2007_piv_pos['grid'].astype('float')
ham_2007_piv_pos['position'] = ham_2007_piv_pos['position'].astype('float')
# number of positions behind first place
ham_2007_piv_pos['alon_grid_from1'] = ham_2007_piv_pos['grid']['Fernando Alonso'] - 1
ham_2007_piv_pos['ham_grid_from1'] = ham_2007_piv_pos['grid']['Lewis Hamilton'] - 1
fig = go.Figure()
fig.add_trace(go.Bar(x=ham_2007_piv_pos.race, y=ham_2007_piv_pos.alon_grid_from1,
name='Alonso', marker_color='#1C8356'))
fig.add_trace(go.Bar(x=ham_2007_piv_pos.race, y=ham_2007_piv_pos.ham_grid_from1,
name='Hamilton', marker_color='#2E91E5'))
fig.update_layout(barmode='group', xaxis_tickangle=-45,
yaxis=dict(title='Number of Positions'),
xaxis=dict(title='Race'),
title='Number of Positions Each Driver Began Behind 1st Place (2007 Season)')
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()
The graph above illustrates the starting position of Alonso and Hamilton for each race in the 2007 season; the shorter the bar, the closer that driver started to first place.
We also observe the following:
Hamilton succeeded in strengthening his chance of finishing in a podium position more frequently than his McLaren counterpart.
# plot number of times Hamilton began ahead of alonso and vice-versa
grid_comp = ham_2007[['race', 'dname', 'grid']].copy()
grid_comp['grid_dif_ham'] = grid_comp.groupby('race')['grid'].diff()
grid_comp['grid_dif_alon'] = grid_comp.groupby('race')['grid'].diff(-1)
grid_comp['ham_ahead'] = (grid_comp.grid_dif_ham < 0).astype('int64')
grid_comp['alon_ahead'] = (grid_comp.grid_dif_alon < 0).astype('int64')
# aggregate number of times started ahead of teammate
grid_comp = grid_comp.groupby(['dname'])[['ham_ahead', 'alon_ahead']].sum()
grid_comp.reset_index(inplace=True)
# consolidate into one column since only one of the two drivers can be ahead at a time
grid_comp['ahead'] = grid_comp['ham_ahead'] + grid_comp['alon_ahead']
grid_comp = grid_comp[['dname', 'ahead']]
grid_comp['pct_ahead'] = grid_comp['ahead']/17
fig = go.Figure(data=[go.Pie(labels=grid_comp.dname, values=grid_comp.pct_ahead, textinfo='label+percent',
insidetextorientation='radial', hole=.3)])
colors=['#1C8356', '#2E91E5']
fig.update_traces(marker=dict(colors=colors))
fig.update_layout(title_text='Percentage of Races Where Driver Began Ahead of Teammate (2007 Season)')
fig.show()
Hamilton started closer to first place in 10 of the 17 races in the 2007 season, making him the better driver in this first metric. Given the trend we saw before, we would expect Hamilton to have better finishing positions and subsequently gain more points.
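The head-to-head count can also be computed without depending on the row order within each race or on a hard-coded number of races. This is a minimal sketch of that alternative, assuming long-format data with `race`, `dname`, and `grid` columns; the `toy` frame holds hypothetical values purely for illustration.

```python
import pandas as pd

# Toy long-format data (hypothetical values), one row per driver per race.
toy = pd.DataFrame({
    'race':  ['A', 'A', 'B', 'B', 'C', 'C'],
    'dname': ['Fernando Alonso', 'Lewis Hamilton'] * 3,
    'grid':  [2, 4, 3, 1, 2, 1],
})

# One row per race and one column per driver, so row order no longer matters.
wide = toy.pivot(index='race', columns='dname', values='grid')

# A lower grid number means a start closer to the front.
ham_ahead = wide['Lewis Hamilton'] < wide['Fernando Alonso']
alo_ahead = wide['Fernando Alonso'] < wide['Lewis Hamilton']

# .mean() of a boolean Series is the share of races, so the denominator
# comes from the data rather than a constant.
share = pd.Series({
    'Lewis Hamilton': ham_ahead.mean(),
    'Fernando Alonso': alo_ahead.mean(),
})
print(share)
```

Comparing named columns makes the direction of each comparison explicit, which is harder to verify with `diff()` and `diff(-1)`.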
This next stage of the analysis looks at the position each driver tended to hold on each lap and at how much that position varied. As we touched upon before, a better driver will produce consistent results and spend more time towards the front.
fig = px.violin(ham_2007_lap, y='position', x='dname', box=True, points='all', color='dname',
color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
labels={'dname': 'Driver', 'position':'Position'},
title='Distribution of Positions in Each Lap Raced (2007 Season)')
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()
The graph presents the following:
Focusing on consistency, we'll look at the standard deviation of each driver's positioning throughout each race. The smaller the standard deviation, the less their position varied. Since both Hamilton and Alonso (nearly) always began towards the front of the pack, they only had a few positions to gain while simultaneously having many positions to lose. Therefore, any improvement will only marginally affect the standard deviation, while any loss in position will likely have a large impact.
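The asymmetry described above can be seen in a small worked example. The sketch below uses three made-up 50-lap position series (not data from the season): a driver who holds second place throughout, one who gains a single place at half distance, and one who drops six places at half distance.

```python
import numpy as np

laps = 50
steady = np.full(laps, 2.0)   # runs second for the whole race

gained = steady.copy()
gained[25:] = 1.0             # moves up one place at half distance

dropped = steady.copy()
dropped[25:] = 8.0            # falls six places at half distance

for label, series in [('steady', steady), ('gained', gained), ('dropped', dropped)]:
    # ddof=1 gives the sample standard deviation, matching pandas' default .std()
    print(label, round(series.std(ddof=1), 3))
```

A front-runner has at most a place or two to gain, so the "gained" case is about as much as an improvement can move the standard deviation, whereas a single incident can move it several times further.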
# look at difference in SD - smaller one means more consistent
ham_lap_stddev = ham_2007_lap.groupby(['race_date', 'race', 'dname'])['position'].std()
ham_lap_stddev = ham_lap_stddev.unstack()
ham_lap_stddev.reset_index(inplace=True)
ham_lap_stddev['dif'] = ham_lap_stddev['Lewis Hamilton'] - ham_lap_stddev['Fernando Alonso']
# add column to indicate which driver better
ham_lap_stddev['better'] = np.where(ham_lap_stddev.dif < 0 , 'Hamilton More Consistent',
(np.where(ham_lap_stddev.dif > 0, 'Alonso More Consistent', 'Equally Consistent')))
# function below sets the color based on whether marker is above/below y = 0 line
def SetColor(x):
    if x < 0:
        return "#2E91E5"
    elif x > 0:
        return "#1C8356"
    else:
        return "red"
fig = go.Figure(data=go.Scatter(x=ham_lap_stddev.race, y=ham_lap_stddev.dif, mode='markers',
marker_symbol='star-square', text=ham_lap_stddev.better))
fig.update_traces(marker=dict(size=12, color=list(map(SetColor, ham_lap_stddev.dif)),
line=dict(width=2,
color='Black')),
selector=dict(mode='markers'))
fig.update_layout(title='Comparison of Consistency Maintained Throughout Races (2007 Season)', xaxis=dict(title='Race'),
yaxis=dict(title='Difference in Standard Deviation'))
fig.add_shape(type='line', x0=-1, y0=0, x1=17, y1=0, line=dict(color='Black', dash='dot'),)
fig.show()
Breaking this all down we see that:
While this plot illustrates that Alonso's positioning fluctuated more than Hamilton's, the violin plot showed Alonso to be the more dependable driver. More importantly, though, Hamilton ran at the front of the pack for considerably more laps, which more heavily influences points and final positioning. As a result, Hamilton outperformed Alonso again and has so far been the stronger driver this season.
As we discussed earlier, wins are one of the easiest ways to decide which driver performed better in a season.
Hamilton came in first in:
Alonso came in first in:
Surprisingly, both McLaren drivers ended up with the same number of wins at the end of the season.
We'll take a closer look at the positions in which Hamilton and Alonso finished in order to distinguish the two drivers.
# number of positions each driver finished behind first
ham_2007_pos_piv = ham_2007_piv[['race', 'position']].copy()
ham_2007_pos_piv['position'] = ham_2007_pos_piv['position'].astype('int64')
ham_2007_pos_piv['ham_fin1'] = ham_2007_pos_piv['position']['Lewis Hamilton'] - 1
ham_2007_pos_piv['alon_fin1'] = ham_2007_pos_piv['position']['Fernando Alonso'] - 1
fig = go.Figure()
fig.add_trace(go.Bar(x=ham_2007_pos_piv.race, y=ham_2007_pos_piv.alon_fin1,
name='Alonso', marker_color='#1C8356'))
fig.add_trace(go.Bar(x=ham_2007_pos_piv.race, y=ham_2007_pos_piv.ham_fin1,
name='Hamilton', marker_color='#2E91E5'))
fig.update_layout(barmode='group', xaxis_tickangle=-45,
yaxis=dict(title='Number of Positions'),
xaxis=dict(title='Race'),
title='Number of Positions Each Driver Finished Behind 1st Place (2007 Season)')
fig.update_layout(
yaxis = dict(
tickmode = 'linear',
tick0 = 0,
dtick = 2
)
)
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()
The chart above exhibits the following:
While these two produce nearly identical statistics, Hamilton slightly edges out Alonso by finishing in second place one more time than his teammate.
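Comparing place counts in order (wins first, then second places, then third places) is the idea behind a countback tie-break. The sketch below is a minimal illustration of that comparison; `countback_key` is a hypothetical helper written for this example, and the two finishing lists are made-up values rather than the drivers' actual results.

```python
from collections import Counter


def countback_key(finishes, depth=3):
    """Return (wins, seconds, thirds, ...) so tuples compare place by place."""
    counts = Counter(finishes)
    return tuple(counts.get(place, 0) for place in range(1, depth + 1))


# Hypothetical finishing positions for two drivers who are level on wins.
driver_a = [1, 1, 2, 2, 3, 5]
driver_b = [1, 1, 2, 3, 3, 4]

key_a = countback_key(driver_a)
key_b = countback_key(driver_b)

# Python compares tuples element by element, so the first place count
# that differs decides the order.
leader = 'driver_a' if key_a > key_b else 'driver_b'
print(key_a, key_b, leader)
```

Because the tuples are compared left to right, an extra second place outweighs any number of additional third places.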
# difference in positions
ham_2007_pos_piv['end_dif'] = ham_2007_pos_piv['position']['Fernando Alonso'] - ham_2007_pos_piv['position']['Lewis Hamilton']
# add column to indicate which driver finished ahead
ham_2007_pos_piv['Leading_Driver'] = np.where(ham_2007_pos_piv.end_dif < 0 , 'Alonso', 'Hamilton')
fig = px.bar(ham_2007_pos_piv, x='race', y='end_dif',
labels={'race': 'Race', 'end_dif': 'Difference'},
color='end_dif', hover_data=['race', 'Leading_Driver'],
title='Number of Positions Hamilton Finished Ahead of Alonso (2007 Season)')
fig.show()
# plot number of times Hamilton finished ahead of alonso and vice-versa
end_comp = ham_2007[['race', 'dname', 'position']].copy()
end_comp['end_dif_ham'] = end_comp.groupby('race')['position'].diff()
end_comp['end_dif_alon'] = end_comp.groupby('race')['position'].diff(-1)
end_comp['ham_ahead'] = (end_comp.end_dif_ham < 0).astype('int64')
end_comp['alon_ahead'] = (end_comp.end_dif_alon < 0).astype('int64')
# aggregate number of times finished ahead of teammate
end_comp = end_comp.groupby(['dname'])[['ham_ahead', 'alon_ahead']].sum()
end_comp.reset_index(inplace=True)
# consolidate into one column since only one of the two drivers can be ahead at a time
end_comp['ahead'] = end_comp['ham_ahead'] + end_comp['alon_ahead']
end_comp = end_comp[['dname', 'ahead']]
end_comp['pct_ahead'] = end_comp['ahead']/17
fig = go.Figure(data=[go.Pie(labels=end_comp.dname, values=end_comp.pct_ahead, textinfo='label+percent',
insidetextorientation='radial', hole=.3)])
colors=['#1C8356', '#2E91E5']
fig.update_traces(marker=dict(colors=colors))
fig.update_layout(title_text='Percentage of Races Where Driver Finished Ahead of Teammate (2007 Season)')
fig.show()
In the two plots above, we notice that:
Evidently, we have favorable findings for both drivers.
# box plot of finishing positions for Alonso/Hamilton - median seems to equal 75th percentile for Hamilton
ham_2007_pos = ham_2007[['race_date', 'race', 'dname', 'grid', 'position']].copy()
ham_2007_pos['grid'] = ham_2007_pos['grid'].astype('int64')
ham_2007_pos['position'] = ham_2007_pos['position'].astype('int64')
fig = px.box(ham_2007_pos, y='dname', x='position', color='dname',
color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
labels={'dname': 'Driver', 'position': 'Ending Position'},
title='Distribution of Finishing Position (2007 Season)')
fig.show()
The box plots provide these conclusions:
While Alonso achieved better placement than his teammate in more races, he usually finished ahead by only a single position. However, when it comes to podium placement, the most important component of final position, Hamilton holds the crown. Furthermore, Hamilton tended to finish in better positions when looking at descriptive statistics. While the comparison in this metric is much closer than the previous ones, Hamilton emerges as the superior driver in terms of final positioning.
As mentioned earlier, a driver's points determine where they place in the overall standings at the end of a season. While positioning and points are directly linked, we can still explore the subtleties to draw further distinctions between Alonso and Hamilton.
# interactive graph - number of points
ham_2007_finish = ham_2007[['race_date', 'race', 'dname', 'position', 'points']].copy()
ham_2007_finish['win'] = (ham_2007_finish.position == 1).astype('int64')
ham_2007_finish['sum_pts'] = ham_2007_finish.groupby(['dname'])['points'].cumsum()
fig = px.bar(ham_2007_finish, x='sum_pts', y='dname', animation_frame='race', color='dname',
labels={'dname': 'Driver', 'sum_pts': 'Points Earned'},
color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'})
fig.update_traces(dict(marker_line_width=1, marker_line_color="black"))
fig.update_layout(xaxis=dict(range=[0, 120]))
fig.update_layout(title_text="Points Earned By Each Driver (2007 Season)")
# add buttons
fig.update_layout(
updatemenus = [
{
'buttons': [
{
'args': [None, {'frame': {'duration': 800, 'redraw': False},
'fromcurrent': True, 'transition': {'duration': 800,
'easing': 'linear'}}],
'label': 'Play',
'method': 'animate'
},
{
'args': [[None], {'frame': {'duration': 0, 'redraw': False},
'mode': 'immediate',
'transition': {'duration':0}}],
'label': 'Pause',
'method': 'animate'
}
]
}
]
)
fig.show()
At the end of the season, both drivers finished with 109 points. As a result, the two most decisive factors in determining the better driver fail to clearly distinguish which driver outperforms the other.
# explore difference in points per race
ham_2007_pts_piv = ham_2007_piv[['race', 'points']].copy()
ham_2007_pts_piv['points'] = ham_2007_pts_piv['points'].astype('float')
ham_2007_pts_piv['dif'] = ham_2007_pts_piv['points']['Lewis Hamilton'] - ham_2007_pts_piv['points']['Fernando Alonso']
ham_2007_pts_piv['Leader'] = np.where(ham_2007_pts_piv.dif < 0 , 'Alonso', 'Hamilton')
fig = px.bar(ham_2007_pts_piv, x='race', y='dif',
labels={'race': 'Race', 'dif': 'Difference'},
color='dif', hover_data=['race', 'Leader'],
title='Difference in Points (2007 Season)')
fig.show()
Alonso scored more points in 10 races, which directly corresponds to the 10 races where he finished ahead of Hamilton. Once again, when Alonso scored more, it was only by a couple of points. Meanwhile, Hamilton created a larger gap when he scored more points.
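The contrast between how often a driver outscored his teammate and by how much can be summarised in a small table. This is a minimal sketch under assumed inputs: the `pts` frame, its column names, and its values are hypothetical. Unlike the `np.where` labelling used in the notebook cells, it keeps ties as a separate category.

```python
import pandas as pd

# Toy per-race points (hypothetical values), one row per race.
pts = pd.DataFrame({
    'race':     ['A', 'B', 'C', 'D', 'E'],
    'alonso':   [8, 10, 6, 0, 8],
    'hamilton': [6, 8, 10, 10, 6],
})

# Positive difference: Hamilton outscored Alonso in that race.
pts['dif'] = pts['hamilton'] - pts['alonso']
pts['leader'] = pts['dif'].map(
    lambda d: 'Hamilton' if d > 0 else ('Alonso' if d < 0 else 'Tie')
)
pts['gap'] = pts['dif'].abs()

# Number of races each driver outscored the other, and the average margin.
summary = pts.groupby('leader')['gap'].agg(races='count', mean_gap='mean')
print(summary)
```

Reporting the count and the mean gap side by side shows whether a driver's advantage came from many narrow margins or a few large ones.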
# box plot of points for Alonso/Hamilton - median seems to equal 75th percentile for Hamilton
ham_2007_pts = ham_2007[['race', 'dname', 'points']].copy()
fig = px.box(ham_2007_pts, y='dname', x='points', color='dname',
color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
labels={'dname': 'Driver', 'points': 'Points'}, title='Distribution of Points (2007 Season)')
fig.show()
These distributions highlight that:
In conclusion, we derive similar insights here as we did in the finishing position comparison since the two go hand in hand. Hamilton barely beats Alonso in this criterion too.
Putting all the metrics together, we can sum up what we found as the following:
Grid positioning: Hamilton is the favorite in this category.
Driving positioning lap-to-lap: Hamilton achieves favorable positioning on a given lap.
Finishing positions: Alonso finishes ahead of Hamilton more often. However, Hamilton is superior where it counts most: podium placement. The difference in second-place finishes is what put Hamilton second and Alonso third in the drivers' standings for the 2007 season.
Points: Alonso outscored Hamilton in more races, but Hamilton has the more favorable summary statistics.
Final Comparison:
Overall: Hamilton was the better driver in the 2007 season, although the two were very well matched. This is especially impressive since this was his rookie season, while Alonso had a few years of experience under his belt.
While Hamilton was victorious in each metric, it was a very evenly matched comparison. Both drivers performed at a very high caliber and finished in the top three of the drivers' standings for the 2007 season. Furthermore, as a team they dominated the season: the only team besides McLaren to win any races was Ferrari. It would be interesting to compare them again further down the line if they ever become teammates again.