
The Fight for McLaren Dominance: Lewis Hamilton vs Fernando Alonso¶

1. Project Overview¶

I obtained this data from Kaggle, which can be found here

We'll be focusing on the main race on Sunday.

In this project, I use different metrics to compare the two McLaren drivers in the 2007 season: Fernando Alonso (from Spain) and Lewis Hamilton (from the UK). A two-time world champion, Alonso has driven in F1 since 2001 and is known for extracting strong performances from poorly engineered cars. Conversely, this is Hamilton's first season in Formula 1; he previously raced in GP2 (a former F1 feeder series).

Hamilton (22 years old in 2007) is younger and less experienced than Alonso (26 years old in 2007), so we would expect Alonso to perform better.

The metrics we'll be using to evaluate each driver will be their ability to:

Start towards the front of the pack (good grid/starting position).
Maintain or improve their position on each lap of a given race.
Win races.
Win points.

Wins and points will be weighted more heavily since these directly dictate a driver's standing in a season. However, grid and lap positioning heavily influence these two factors and can help uncover any nuances between drivers who have a similar number of wins/points.

In [1]:

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<span style="font-weight:bold">
NOTE: The raw code for this notebook is by default hidden for easier reading.

To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.</span>''')

Out[1]:

NOTE: The raw code for this notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

2. Data Cleaning¶

Data cleaning mostly consisted of flagging outliers, setting placeholders for missing values, checking consistency between related variables, and renaming columns. After the files were cleaned, I used SQL to join relevant fields from different tables and then made some final changes prior to this analysis.
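The cleaning itself is not shown in this notebook, so the following is only a minimal sketch of what those four steps could look like in pandas. The column names, the -1 placeholder, and the 0 to 33 grid range are all assumptions made for illustration, not the values actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; column names and values are illustrative only.
raw = pd.DataFrame({
    'driverName': ['Lewis Hamilton', 'Fernando Alonso', 'Lewis Hamilton'],
    'grid': [1, 2, 45],
    'position': [1.0, np.nan, 3.0],
    'points': [10.0, 0.0, 6.0],
})

# Rename columns to the short names used in the analysis.
clean = raw.rename(columns={'driverName': 'dname'})

# Flag (rather than drop) grid values outside an assumed plausible range.
clean['grid_outlier'] = ~clean['grid'].between(0, 33)

# Use -1 as an assumed placeholder for a missing finishing position.
clean['position'] = clean['position'].fillna(-1).astype('int64')

# Consistency check: a driver without a finishing position should have no points.
clean['points_consistent'] = ~((clean['position'] == -1) & (clean['points'] > 0))

print(clean)
```

Flagging instead of deleting keeps every race in the table, which matters later when counts are divided by the number of races in the season.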

3. Preparation for Analysis¶

We'll start out by importing the relevant libraries and data. Please note you won't see anything in this section unless you toggle on the raw code.

In [2]:

# import libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# load up datasets
re = pd.read_pickle('final_data/agg_results')
ham_2007 = pd.read_pickle('final_data/ham_results')
ham_2007_lap = pd.read_pickle('final_data/lap_times')
lap_all = pd.read_pickle('final_data/lap_times_all')
ham_quali_2007 = pd.read_pickle('final_data/qualifying')

4. Grid Positioning¶

Right before the actual race begins, all the drivers line up along the track. The order of this grid is determined by each driver's performance in qualifying. We consider one driver to have outperformed the other in a given race if his grid position is ahead of his teammate's.

In [3]:

# explore relationship between qualifying/grid positioning and wins/podiums/positions on aggregate level
re_pos = re[['race', 'dname', 'position', 'grid']].copy()
re_pos['win'] = (re_pos.position == 1).astype('int64')
re_win_agg = re_pos.groupby(['grid'], as_index=False)['win'].sum()
# drop rows with a grid value of 0 or less so only real starting slots are plotted
re_win_agg = re_win_agg.loc[re_win_agg['grid'] > 0]

fig = px.line(re_win_agg, x='grid', y='win', range_x=[0, 33], labels={'win': 'Number of Wins', 'grid': 'Grid Position'},
             title='Relationship of Starting Position and Wins (1950-2017)')
fig.show()

In the graph above, we see that final results are heavily influenced by grid (starting) position: the closer to first a driver starts, the more races are won from that slot.
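The plot counts wins per grid slot. Slots deep in the field do not exist in every race, so a win rate (wins divided by starts from that slot) can be a fairer comparison. Here is a minimal sketch of that calculation; the rows are made up for illustration and are not taken from the Kaggle data.

```python
import pandas as pd

# Illustrative results for three made-up races plus one row with grid == 0.
results = pd.DataFrame({
    'grid':     [1, 2, 3, 1, 2, 3, 1, 2, 3, 0],
    'position': [1, 2, 3, 2, 1, 3, 1, 3, 2, 5],
})
results['win'] = (results['position'] == 1).astype('int64')

# Keep only real grid slots, then count wins and starts per slot.
on_grid = results[results['grid'] > 0]
by_grid = on_grid.groupby('grid')['win'].agg(wins='sum', starts='count')
by_grid['win_rate'] = by_grid['wins'] / by_grid['starts']

print(by_grid)
```

On the real `re` dataframe the same `groupby` would apply directly, since it already carries `grid` and `position` columns.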

From this point forward, all of our analysis will focus on the 2007 season.

In [4]:

# make pivoted version for easier analysis (one column per driver for each metric)
ham_2007_piv = ham_2007.copy()
ham_2007_piv['race_date'] = ham_2007_piv['race_date'].astype('str')
ham_2007_piv['race'] = ham_2007_piv['race_date'] + '_' + ham_2007_piv['race']
ham_2007_piv = ham_2007_piv.pivot(index='race', columns='dname', values=['grid', 'position', 'points', 'fastestLapSpeed',
                                                                'fastestLapTime']).reset_index()

# split the combined key back into race date and race name
ham_2007_piv['race'] = ham_2007_piv['race'].str.split('_', n=1, expand=False)
ham_2007_piv['race_date'] = pd.to_datetime(ham_2007_piv['race'].str[0])
ham_2007_piv['race_'] = ham_2007_piv['race'].str[1]
ham_2007_piv = ham_2007_piv[['race_date', 'race_', 'grid', 'position', 'points', 'fastestLapSpeed', 'fastestLapTime']]
ham_2007_piv.rename(columns={'race_':'race'}, inplace=True)
ham_2007_piv.sort_values(by=['race_date', 'race'], inplace=True)

# create dataframe focusing on positioning
ham_2007_piv_pos = ham_2007_piv[['race_date', 'race', 'position', 'grid']].copy()
ham_2007_piv_pos['grid'] = ham_2007_piv_pos['grid'].astype('float')
ham_2007_piv_pos['position'] = ham_2007_piv_pos['position'].astype('float')

# number of positions behind first place
ham_2007_piv_pos['alon_grid_from1'] = ham_2007_piv_pos['grid']['Fernando Alonso'] - 1
ham_2007_piv_pos['ham_grid_from1'] = ham_2007_piv_pos['grid']['Lewis Hamilton'] - 1

In [5]:

fig = go.Figure()

fig.add_trace(go.Bar(x=ham_2007_piv_pos.race, y=ham_2007_piv_pos.alon_grid_from1,
                    name='Alonso', marker_color='#1C8356'))

fig.add_trace(go.Bar(x=ham_2007_piv_pos.race, y=ham_2007_piv_pos.ham_grid_from1,
                    name='Hamilton', marker_color='#2E91E5'))

fig.update_layout(barmode='group', xaxis_tickangle=-45,
                  yaxis=dict(title='Number of Positions'),
                  xaxis=dict(title='Race'),
                  title='Number of Positions Each Driver Began Behind 1st Place (2007 Season)')
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()

The graph above illustrates the starting position of Alonso and Hamilton for each race in the 2007 season; the shorter the bar, the closer the driver started to first place.

We also observe the following:

Alonso always started in the top 4 (at most three positions behind first), with the exception of the 2007 French Grand Prix and the 2007 Hungarian Grand Prix.
Hamilton always started in the top 4 as well, besides the 2007 European Grand Prix.

Hamilton succeeded in strengthening his chance of finishing in a podium position more frequently than his McLaren counterpart.

In [6]:

# plot number of times Hamilton began ahead of Alonso and vice-versa
grid_comp = ham_2007[['race', 'dname', 'grid']].copy()
# each race has two rows (one per driver): diff() fills the second row, diff(-1) the first
grid_comp['grid_dif_ham'] = grid_comp.groupby('race')['grid'].diff()
grid_comp['grid_dif_alon'] = grid_comp.groupby('race')['grid'].diff(-1)
# a negative difference means the driver on that row started ahead of his teammate
grid_comp['ham_ahead'] = (grid_comp.grid_dif_ham < 0).astype('int64')
grid_comp['alon_ahead'] = (grid_comp.grid_dif_alon < 0).astype('int64')

# aggregate number of times started ahead of teammate
grid_comp = grid_comp.groupby(['dname'])[['ham_ahead', 'alon_ahead']].sum()
grid_comp.reset_index(inplace=True)

# consolidate into one column since only one of the two drivers is going to be ahead at a time
grid_comp['ahead'] = grid_comp['ham_ahead'] + grid_comp['alon_ahead']
grid_comp = grid_comp[['dname', 'ahead']]
grid_comp['pct_ahead'] = grid_comp['ahead']/17

In [7]:

fig = go.Figure(data=[go.Pie(labels=grid_comp.dname, values=grid_comp.pct_ahead, textinfo='label+percent',
                            insidetextorientation='radial', hole=.3)])

colors=['#1C8356', '#2E91E5']
fig.update_traces(marker=dict(colors=colors))
fig.update_layout(title_text='Percentage of Races Where Driver Began Ahead of Teammate (2007 Season)')

fig.show()

Hamilton started closer to first place in 10 of the 17 races in the 2007 season, making him the better driver in this first metric. Given the trend we saw before, we would expect Hamilton to have better finishing positions and subsequently gain more points.
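As a cross-check on the count above, the same comparison can be made by pivoting to one column per driver and asking which column holds the smaller grid number. This is only a sketch of an alternative approach, not the method used in the cells above, and the three races below are illustrative rather than real 2007 rows.

```python
import pandas as pd

# Illustrative rows only; the notebook itself reads these from ham_2007.
grid = pd.DataFrame({
    'race': ['Race A', 'Race A', 'Race B', 'Race B', 'Race C', 'Race C'],
    'dname': ['Fernando Alonso', 'Lewis Hamilton'] * 3,
    'grid': [2, 4, 2, 1, 10, 2],
})

# One row per race, one column per driver.
wide = grid.pivot(index='race', columns='dname', values='grid')

# The driver with the smaller grid number started ahead.
ahead = wide.idxmin(axis=1)
counts = ahead.value_counts()
share = counts / len(wide)

print(counts)
print(share)
```

Dividing by `len(wide)` instead of a hard-coded 17 keeps the share correct if a race is ever missing from the data.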

5. Lap by Lap Positioning¶

This next stage of the analysis looks at the position each driver tended to hold on each lap and how much that position varied. As we touched upon before, a better driver will produce consistent results and spend more time towards the front.

In [8]:

fig = px.violin(ham_2007_lap, y='position', x='dname', box=True, points='all', color='dname',
                color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
                labels={'dname': 'Driver', 'position':'Position'},
                title='Distribution of Positions in Each Lap Raced (2007 Season)')
fig['layout']['yaxis']['autorange'] = "reversed"
fig.show()

The graph presents the following:

Hamilton spent more time towards the front than his teammate: his violin is wider than Alonso's at the front positions.
Hamilton's positioning was centered on second place while Alonso's median positioning lingered in third place.
Alonso exhibited more consistency: Hamilton's violin plot extends farther towards the back of the field, while Alonso's plot doesn't reach such extreme values.

Focusing on consistency, we'll look at the standard deviation of each driver's position within each race. The smaller the standard deviation, the less their position varied. Since both Hamilton and Alonso (nearly) always began towards the front of the pack, they only had a few positions to gain while simultaneously having many positions to lose. Therefore, any improvement will only marginally affect the standard deviation, while a loss in position will likely have a large impact.
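To make that asymmetry concrete, here is a small worked example with made-up lap positions for a driver who starts in 2nd place. It uses the sample standard deviation from Python's `statistics` module, which matches the default of pandas' `.std()`.

```python
from statistics import stdev

# Hypothetical ten-lap stints for a driver starting in 2nd place.
steady = [2] * 10                # holds 2nd for every lap
gains_one = [2] * 5 + [1] * 5    # moves up to 1st halfway through
loses_five = [2] * 5 + [7] * 5   # drops to 7th halfway through

sd_steady = stdev(steady)
sd_gain = stdev(gains_one)
sd_loss = stdev(loses_five)

print(round(sd_steady, 2), round(sd_gain, 2), round(sd_loss, 2))
# → 0.0 0.53 2.64
```

From 2nd place the best possible gain is a single position, so the standard deviation can rise by at most about half a position from gains alone, while one bad stint inflates it five times as much.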

In [9]:

# look at difference in SD - smaller one means more consistent
ham_lap_stddev = ham_2007_lap.groupby(['race_date', 'race', 'dname'])['position'].std()
ham_lap_stddev = ham_lap_stddev.unstack()
ham_lap_stddev.reset_index(inplace=True)
ham_lap_stddev['dif'] = ham_lap_stddev['Lewis Hamilton'] - ham_lap_stddev['Fernando Alonso']

# add column to indicate which driver was more consistent
ham_lap_stddev['better'] = np.where(ham_lap_stddev.dif < 0, 'Hamilton More Consistent',
         np.where(ham_lap_stddev.dif > 0, 'Alonso More Consistent', 'Equally Consistent'))

# function below sets the color based on whether marker is above/below y = 0 line
def set_color(x):
    if x < 0:
        return "#2E91E5"  # below the line: Hamilton more consistent
    elif x > 0:
        return "#1C8356"  # above the line: Alonso more consistent
    else:
        return "red"      # on the line: equal standard deviations


fig = go.Figure(data=go.Scatter(x=ham_lap_stddev.race, y=ham_lap_stddev.dif, mode='markers',
                                marker_symbol='star-square', text=ham_lap_stddev.better))
fig.update_traces(marker=dict(size=12, color=list(map(set_color, ham_lap_stddev.dif)),
                              line=dict(width=2,
                                        color='Black')),
                  selector=dict(mode='markers'))

fig.update_layout(title='Comparison of Consistency Maintained Throughout Races (2007 Season)', xaxis=dict(title='Race'),
                  yaxis=dict(title='Difference in Standard Deviation'))

fig.add_shape(type='line', x0=-1, y0=0, x1=17, y1=0, line=dict(color='Black', dash='dot'),)

fig.show()

Breaking this all down we see that:

Hamilton exhibited more consistency in 9 out of 17 races, signified by the blue markers below the dotted line.
Alonso had a smaller standard deviation in 7 out of 17 races, indicated by the green markers above the dotted line.
Both drivers had the same standard deviation in Monaco, shown by the one red marker.

While this plot illustrates that Alonso's position fluctuated more than Hamilton's in more races, the violin plot showed Alonso to be the more dependable driver over the season as a whole. More importantly, Hamilton drove at the front of the pack for considerably more laps, which more heavily influences points and final position. As a result, Hamilton outperformed Alonso again and so far is the stronger driver this season.

6a. End of Race - Wins¶

As we discussed earlier, wins are one of the easiest ways to decide which driver performed better in a season.

Hamilton came in first in:

United States Grand Prix
Canadian Grand Prix
Hungarian Grand Prix
Japanese Grand Prix

Alonso came in first in:

European Grand Prix
Italian Grand Prix
Monaco Grand Prix
Malaysian Grand Prix

Surprisingly, both McLaren drivers ended up with the same number of wins at the end of the season.

We'll take a closer look at Hamilton's and Alonso's finishing positions in order to distinguish the two drivers.

In [10]:

# number of positions each driver finished behind first
ham_2007_pos_piv = ham_2007_piv[['race', 'position']].copy()
ham_2007_pos_piv['position'] = ham_2007_pos_piv['position'].astype('int64')
ham_2007_pos_piv['ham_fin1'] = ham_2007_pos_piv['position']['Lewis Hamilton'] - 1
ham_2007_pos_piv['alon_fin1'] = ham_2007_pos_piv['position']['Fernando Alonso'] - 1

fig = go.Figure()

fig.add_trace(go.Bar(x=ham_2007_pos_piv.race, y=ham_2007_pos_piv.alon_fin1,
                    name='Alonso', marker_color='#1C8356'))

fig.add_trace(go.Bar(x=ham_2007_pos_piv.race, y=ham_2007_pos_piv.ham_fin1,
                    name='Hamilton', marker_color='#2E91E5'))

fig.update_layout(barmode='group', xaxis_tickangle=-45,
                  yaxis=dict(title='Number of Positions'),
                  xaxis=dict(title='Race'),
                  title='Number of Positions Each Driver Finished Behind 1st Place (2007 Season)')

fig.update_layout(
    yaxis = dict(
        tickmode = 'linear',
        tick0 = 0,
        dtick = 2
    )
)

fig['layout']['yaxis']['autorange'] = "reversed"

fig.show()

The chart above exhibits the following:

Hamilton finished in a podium position in 12 of the 17 races, shown by the blue bars with a height of two or less.
Hamilton finished in 1st place 4 times, 2nd place 5 times, and 3rd place 3 times.
Alonso also had 12 races where he finished in the top 3.
Alonso finished in 1st 4 times, 2nd 4 times, and 3rd 4 times.

While these two produce nearly identical statistics, Hamilton slightly edges out Alonso by finishing in second place one more time than his teammate.

In [11]:

# difference in positions
ham_2007_pos_piv['end_dif'] = ham_2007_pos_piv['position']['Fernando Alonso'] - ham_2007_pos_piv['position']['Lewis Hamilton']

# add column to indicate which driver finished ahead
ham_2007_pos_piv['Leading_Driver'] = np.where(ham_2007_pos_piv.end_dif < 0 , 'Alonso', 'Hamilton')

fig = px.bar(ham_2007_pos_piv, x='race', y='end_dif',
              labels={'race': 'Race', 'end_dif': 'Difference'},
             color='end_dif', hover_data=['race', 'Leading_Driver'],
             title='Number of Positions Hamilton Finished Ahead of Alonso (2007 Season)')
fig.show()

In [12]:

# plot number of times Hamilton finished ahead of Alonso and vice-versa
end_comp = ham_2007[['race', 'dname', 'position']].copy()
end_comp['end_dif_ham'] = end_comp.groupby('race')['position'].diff()
end_comp['end_dif_alon'] = end_comp.groupby('race')['position'].diff(-1)
end_comp['ham_ahead'] = (end_comp.end_dif_ham < 0).astype('int64')
end_comp['alon_ahead'] = (end_comp.end_dif_alon < 0).astype('int64')

# aggregate number of times finished ahead of teammate
end_comp = end_comp.groupby(['dname'])[['ham_ahead', 'alon_ahead']].sum()
end_comp.reset_index(inplace=True)

# consolidate into one column since only one of the two drivers is going to be ahead at a time
end_comp['ahead'] = end_comp['ham_ahead'] + end_comp['alon_ahead']
end_comp = end_comp[['dname', 'ahead']]
end_comp['pct_ahead'] = end_comp['ahead']/17

fig = go.Figure(data=[go.Pie(labels=end_comp.dname, values=end_comp.pct_ahead, textinfo='label+percent',
                            insidetextorientation='radial', hole=.3)])

colors=['#1C8356', '#2E91E5']
fig.update_traces(marker=dict(colors=colors))
fig.update_layout(title_text='Percentage of Races Where Driver Finished Ahead of Teammate (2007 Season)')

fig.show()

In the two plots above, we notice that:

Hamilton finished ahead of Alonso in 7 of the 17 races as seen in the pie chart above.
Alonso typically finished ahead of his teammate by only a few positions, as shown by the short bars below the x-axis in the bar chart.
Alonso held only a one-position advantage in 60% of the races where he did finish better than his teammate.
Conversely, Hamilton tended to put more space between himself and Alonso when he was the leading McLaren driver.

Evidently, we have favorable findings for both drivers.

In [13]:

# box plot of finishing positions for Alonso/Hamilton - median seems to equal 75th percentile for Hamilton
ham_2007_pos = ham_2007[['race_date', 'race', 'dname', 'grid', 'position']].copy()
ham_2007_pos['grid'] = ham_2007_pos['grid'].astype('int64')
ham_2007_pos['position'] = ham_2007_pos['position'].astype('int64')

fig = px.box(ham_2007_pos, y='dname', x='position', color='dname',
            color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
            labels={'dname': 'Driver', 'position': 'Ending Position'},
             title='Distribution of Finishing Position (2007 Season)')
fig.show()

The box plots provide these conclusions:

Hamilton's median finishing position is 2nd place, better than Alonso's value of 3rd place.
Hamilton's box plot is skewed right: he had more races where he finished close to first place, but a few where he finished unusually far behind.
Alonso's box plot is also slightly skewed right, though less so than Hamilton's. This means that the number of races where his ending position was towards the front was more evenly split with the number of races where he placed farther back.

While Alonso achieved better placement than his teammate in more races, he usually finished ahead by only a single position. However, when it comes to podium placement, the most important component of final position, Hamilton holds the crown. Furthermore, Hamilton tended to finish in better positions when looking at descriptive statistics. While the comparison in this metric is much closer than the previous ones, Hamilton portrays himself as the superior driver in terms of final positioning.

6b. End of Race - Points Earned¶

As mentioned earlier, a driver's points determine where they place in the overall standings at the end of a season. While positioning and points are directly linked, we can still explore the subtleties to draw further distinctions between Alonso and Hamilton.

In [14]:

# interactive graph - number of points
ham_2007_finish = ham_2007[['race_date', 'race', 'dname', 'position', 'points']].copy()
ham_2007_finish['win'] = (ham_2007_finish.position == 1).astype('int64')
ham_2007_finish['sum_pts'] = ham_2007_finish.groupby(['dname'])['points'].cumsum()
fig = px.bar(ham_2007_finish, x='sum_pts', y='dname', animation_frame='race', color='dname',
            labels={'dname': 'Driver', 'sum_pts': 'Points Earned'},
            color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'})

fig.update_traces(dict(marker_line_width=1, marker_line_color="black"))
fig.update_layout(xaxis=dict(range=[0, 120]))
fig.update_layout(title_text="Points Earned By Each Driver (2007 Season)")

# add buttons
fig.update_layout(
    updatemenus = [
        {
            'buttons': [
                {
                    'args': [None, {'frame': {'duration': 800, 'redraw': False},
                                   'fromcurrent': True, 'transition': {'duration': 800,
                                                                      'easing': 'linear'}}],
                    'label': 'Play',
                    'method': 'animate'
                },
                {
                    'args': [[None], {'frame': {'duration': 0, 'redraw': False},
                                     'mode': 'immediate',
                                     'transition': {'duration':0}}],
                    'label': 'Pause',
                    'method': 'animate'
                }
            ]
            
        }
    ]
)                
                  
fig.show()

At the end of the season, both drivers finished with 109 points. As a result, the two most decisive factors in determining the better driver fail to clearly distinguish which driver outperforms the other.
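As a quick arithmetic check, the podium tallies from section 6a can be converted to points. The sketch below assumes the 2007 scoring scale of 10-8-6-5-4-3-2-1 for the top eight finishers, which is background knowledge rather than something stored in the notebook's data; the podium counts and the 109-point totals come from the text above.

```python
# Assumed 2007 points scale for the top eight finishers.
POINTS_2007 = {1: 10, 2: 8, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1}

# Podium tallies as reported in section 6a: {finishing place: number of races}.
podiums = {
    'Lewis Hamilton': {1: 4, 2: 5, 3: 3},
    'Fernando Alonso': {1: 4, 2: 4, 3: 4},
}
SEASON_TOTAL = 109  # points each driver finished with

podium_points = {}
other_points = {}
for driver, tally in podiums.items():
    podium_points[driver] = sum(POINTS_2007[place] * n for place, n in tally.items())
    other_points[driver] = SEASON_TOTAL - podium_points[driver]
    print(driver, podium_points[driver], other_points[driver])
# → Lewis Hamilton 98 11
# → Fernando Alonso 96 13
```

Under that assumption Hamilton's extra second place was worth two more podium points, and Alonso made up the difference in the races where neither reached the podium.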

In [15]:

# explore difference in points per race
ham_2007_pts_piv = ham_2007_piv[['race', 'points']].copy()
ham_2007_pts_piv['points'] = ham_2007_pts_piv['points'].astype('float')
ham_2007_pts_piv['dif'] = ham_2007_pts_piv['points']['Lewis Hamilton'] - ham_2007_pts_piv['points']['Fernando Alonso']

ham_2007_pts_piv['Leader'] = np.where(ham_2007_pts_piv.dif < 0 , 'Alonso', 'Hamilton')

fig = px.bar(ham_2007_pts_piv, x='race', y='dif',
              labels={'race': 'Race', 'dif': 'Difference'},
             color='dif', hover_data=['race', 'Leader'],
             title='Difference in Points (2007 Season)')
fig.show()

Alonso scored more points in 10 races, which directly corresponds to the 10 races where he finished ahead of Hamilton. Once again, when Alonso scored more, it was usually only by a couple of points. Meanwhile, Hamilton created a larger gap when he scored more points.

In [16]:

# box plot of points for Alonso/Hamilton - median seems to equal 75th percentile for Hamilton
ham_2007_pts = ham_2007[['race', 'dname', 'points']].copy()
fig = px.box(ham_2007_pts, y='dname', x='points', color='dname',
            color_discrete_map={'Fernando Alonso': '#1C8356', 'Lewis Hamilton': '#2E91E5'},
            labels={'dname': 'Driver', 'points': 'Points'}, title='Distribution of Points (2007 Season)')
fig.show()

These distributions highlight that:

The two McLaren drivers scored the same number of points but Hamilton's median points were higher than Alonso's.
Hamilton frequently scored on the higher end of the points spectrum but had a few races where he scored few or no points, dragging his average down. This is evident from the left skew of his box plot.
Alonso's box plot tells the opposite story. While his skewness is less extreme than Hamilton's, it still indicates that Alonso repeatedly scored on the lower end of the points spectrum. Yet he still had a few races where he amassed a large number of points.
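The skew described above shows up as a gap between the median and the mean. The sketch below rebuilds each driver's per-race points from the podium tallies in section 6a plus the non-podium results. Those non-podium values are an assumption reconstructed from the published 2007 results, not read from `ham_2007`, so they are worth checking against the data before relying on them.

```python
from statistics import mean, median

# Podium points follow the tallies in section 6a (10, 8 and 6 points per place).
# The trailing non-podium values are assumed, not taken from the notebook's data.
hamilton = [10] * 4 + [8] * 5 + [6] * 3 + [5, 4, 2, 0, 0]
alonso = [10] * 4 + [8] * 4 + [6] * 4 + [5, 4, 2, 2, 0]

summary = {}
for name, pts in {'Lewis Hamilton': hamilton, 'Fernando Alonso': alonso}.items():
    summary[name] = {
        'races': len(pts),
        'total': sum(pts),
        'median': median(pts),
        'mean': round(mean(pts), 2),
    }
    print(name, summary[name])
```

Both lists sum to 109 over 17 races, so the means are identical. Hamilton's median sits above the mean (left skew) while Alonso's sits slightly below it (right skew), which matches the reading of the box plots.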

In conclusion, we derive similar insights here as we did in the finishing position comparison since the two go hand in hand. Hamilton barely beats Alonso in this criterion too.

7. Putting it All Together¶

Putting all the metrics together, we can sum up what we found as the following:

Grid positioning: Hamilton is the favorite in this category.
Driving positioning lap-to-lap: Hamilton achieves favorable positioning on a given lap.
Finishing positions: Alonso finishes ahead of Hamilton more often. However, Hamilton is superior where it counts most: podium placement. The difference in 2nd place finishes is what put Hamilton second and Alonso third in the drivers' standings for the 2007 season.
Points: Alonso has more races with more points but Hamilton has more favorable statistics.
Final Comparison:
- Lewis Hamilton: ranks higher in 4 of the metrics we looked at.
- Fernando Alonso: ranks higher in 0 of the metrics we looked at.
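The tie-break mentioned under finishing positions can be sketched as a "countback": drivers level on points are compared on number of wins, then second places, then third places, and so on. The helper name `countback_key` is hypothetical, and the tallies are the ones reported in section 6a.

```python
def countback_key(finish_counts, max_place=3):
    """Return (wins, seconds, thirds, ...) so that tuples compare place by place."""
    return tuple(finish_counts.get(place, 0) for place in range(1, max_place + 1))


# Both drivers ended on 109 points; tallies are {finishing place: number of races}.
tied_drivers = {
    'Fernando Alonso': {1: 4, 2: 4, 3: 4},
    'Lewis Hamilton': {1: 4, 2: 5, 3: 3},
}

# Python compares tuples element by element, which is exactly the countback order.
ranking = sorted(tied_drivers, key=lambda d: countback_key(tied_drivers[d]), reverse=True)
print(ranking)
# → ['Lewis Hamilton', 'Fernando Alonso']
```

The wins are level at four each, so the comparison falls through to second places, where Hamilton's five beat Alonso's four.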

Overall: Hamilton was the better driver in the 2007 season, although the two were very well matched. This is especially impressive since this was his rookie season while Alonso had a few years of experience under his belt.

While Hamilton came out ahead in each metric, the margins were narrow. Both drivers performed at a very high caliber and finished in the top 3 of the drivers' standings for the 2007 season. Furthermore, as a team they ran at the front all season: Ferrari was the only team besides McLaren to win any races. It would be interesting to compare them again further down the line if they ever become teammates again.