Tennis & Data Science

Short description:
Tennis & Data Science is a Foundley project mentored by Mladen Jovanović. The main goals of the project are data exploration, preprocessing, feature extraction and visualization. Mladen provided the dataset "03_tennisai_data.csv", which contains point-by-point information about the match between Novak Đoković and Roger Federer in the 2019 Wimbledon men's singles final.
In the following cells, the reader can find the results I found interesting, general statistics of the match, and the code used to process and visualize the data.

Importing necessary libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

Loading data

In [2]:
df = pd.read_csv('03_tennisai_data.csv')

Dataset preview

In [3]:
print("Dimensions of the dataset: ", df.shape)
df.head()
Dimensions of the dataset:  (422, 52)
Out[3]:
match_id frame_pos p1 p1_sets p1_games p1_points p1_set1 p1_set2 p1_set3 p1_set4 ... lob rally_length rally_1-4 rally_5-8 rally_9+ expected_ballhit_count detected_ballhit_count game_id point_description strokes_timestamps
0 3 350 djokovic 0 0 0 -1 -1 -1 -1 ... 0 1 1 0 0 1 1 0 1st serve wide, ace. [26.192]
1 3 975 djokovic 0 0 0 -1 -1 -1 -1 ... 0 2 1 0 0 3 3 0 1st serve wide, fault (net). 2nd serve to body... [41.239, 49.412, 50.341]
2 3 1425 djokovic 0 0 0 -1 -1 -1 -1 ... 0 3 1 0 0 3 3 0 1st serve wide; forehand return down the middl... [65.945, 66.688, 67.802]
3 3 1850 djokovic 0 0 15 -1 -1 -1 -1 ... 0 1 1 0 0 1 2 0 1st serve down the T, ace. [85.264, 86.378]
4 3 2275 djokovic 0 0 15 -1 -1 -1 -1 ... 0 3 1 0 0 4 4 0 1st serve wide, fault (net). 2nd serve down th... [101.053, 109.97, 110.899, 112.385]

5 rows × 52 columns

The dataset consists of 422 rows, one for each point played in the match, and 52 columns that store the players' names, the point winner, strokes, and the current game, set and point score, among other things. Every row also contains a point description column that describes the flow of that specific point in detail.

Check for NaNs

In [4]:
# Check for NaN values in dataset
print("There are%s NaNs in the dataset" %(" no" if not df.isnull().values.any() else ""))
There are no NaNs in the dataset
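
Had the check reported missing values, a per-column count would show where they sit. The sketch below uses a small made-up frame (the column names are borrowed from the dataset, the values are not) rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy frame, NOT the real dataset: two values are deliberately missing
toy = pd.DataFrame({
    "rally_length": [1, 2, np.nan, 3],
    "point_duration": [5.0, np.nan, 12.5, 8.0],
    "server": ["djokovic", "federer", "djokovic", "federer"],
})

# Number of missing values per column, keeping only columns that have any
nan_counts = toy.isnull().sum()
nan_counts = nan_counts[nan_counts > 0]
print(nan_counts)
```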

Match result

In [5]:
# Evaluate final result for every set
p1_set_list = [df['p1_set'+str(i)].iloc[-1] for i in range(1,6) if df['p1_set'+str(i)].iloc[-1] != -1]
p2_set_list = [df['p2_set'+str(i)].iloc[-1] for i in range(1,6) if df['p2_set'+str(i)].iloc[-1] != -1]

# Evaluate set score for final set
if(df['p1_sets'].iloc[-1] == df['p2_sets'].iloc[-1]):
    if(df['matchpoint'].iloc[-1]==1):
        p1_set_list.append(df['p1_games'].iloc[-1]+1)
        p2_set_list.append(df['p2_games'].iloc[-1])
    elif(df['matchpoint'].iloc[-1]==2):
        p1_set_list.append(df['p1_games'].iloc[-1])
        p2_set_list.append(df['p2_games'].iloc[-1]+1)
    else:
        raise ValueError("There is no winner!")
        
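# Match winner
player1_name = df['p1'].iloc[0].capitalize()
player2_name = df['p2'].iloc[0].capitalize()
player_name = [player1_name, player2_name]
# Compare the number of sets won (comparing the lists directly would be lexicographic)
p1_sets_won = sum(a > b for a, b in zip(p1_set_list, p2_set_list))
p2_sets_won = sum(b > a for a, b in zip(p1_set_list, p2_set_list))
match_winner = player1_name if(p1_sets_won > p2_sets_won) else player2_name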

# Dictionary for match result data
match_result = {}

# Adding sets in dictionary
for num_set in range(len(p1_set_list)):
    match_result['Set ' + str(num_set+1)] = [p1_set_list[num_set], p2_set_list[num_set]]
    
# Adding winner in dictionary
if(match_winner == player1_name):
    match_result['Winner'] = [True, False]
else:
    match_result['Winner'] = [False, True]
    
match_result_df = pd.DataFrame(data=match_result)
match_result_df.index = [player1_name, player2_name]
match_result_df.head()
Out[5]:
          Set 1  Set 2  Set 3  Set 4  Set 5  Winner
Djokovic      7      1      7      4     13    True
Federer       6      6      6      6     12   False

The set and game columns for players 1 and 2 were used to track the score up to the final row, where the 'matchpoint' column was used to determine the winner.
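
The winner of a match is the player who won more sets, which is not the same as comparing the two score lists directly (Python compares lists element by element from the left). A minimal sketch, using the set scores from the table above; the helper name `sets_won` is made up for illustration:

```python
def sets_won(p1_games, p2_games):
    """Return (sets won by player 1, sets won by player 2)."""
    p1 = sum(a > b for a, b in zip(p1_games, p2_games))
    p2 = sum(b > a for a, b in zip(p1_games, p2_games))
    return p1, p2

# Games per set, copied from the match result table
djokovic = [7, 1, 7, 4, 13]
federer = [6, 6, 6, 6, 12]

p1_sets, p2_sets = sets_won(djokovic, federer)
winner = "Djokovic" if p1_sets > p2_sets else "Federer"
print(p1_sets, p2_sets, winner)  # 3 2 Djokovic
```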

Serve analysis

Overall serve analysis

In [6]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Service overall', fontweight='bold')

for i,p in zip(range(2),range(1,3)):
    # Points by serve
    serve_df = df.loc[(df['server']==player_name[i].lower())]
    
    # Number of services
    N_serve = serve_df.shape[0]
    
    # Aces
    N_aces = serve_df.loc[serve_df['ace'] == p].shape[0]
    
    # Double faults
    N_double_f = serve_df.loc[serve_df['double_fault'] == p].shape[0]
    # Single faults
    N_single_f = serve_df.loc[serve_df['2nd_serve'] == p].shape[0]
    
    values = [
        N_aces,
        N_single_f,
        N_double_f, 
        serve_df.shape[0] - (N_aces+N_single_f+N_double_f)
    ]
    names = [
        'Aces',
        'Single faults',
        'Double faults',
        'Other'
    ]
    
    ax[i].pie(values,
              labels=names,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[i].set_title(player_name[i])

plt.show()

From the pie charts above, Federer appears much more confident on his serve: compared with Đoković, he hit almost three times as many aces and roughly half as many double faults among all his serves.

Service by sets

In [7]:
# Number of sets
N_sets = len(p1_set_list)

# List for important data
aces_set = []
first_fault_set = []
second_fault_set = []
points_won_ratio = []
point_duration = []

for i,p in zip(range(2),range(1,3)):
    # Points by serve
    serve_df = df.loc[(df['server']==player_name[i].lower())]
    
    
    aces_set_ = []
    first_fault_set_ = []
    second_fault_set_ = []
    points_won_ratio_ = []
    point_duration_ = []
    
    for s in range(1, N_sets+1):
        # Current set
        if(s == 1):
            set_df = serve_df.loc[serve_df['p'+str(p)+'_set'+str(s)] == -1]
        else:
            set_df = serve_df.loc[(serve_df['p'+str(p)+'_set'+str(s)] == -1) &
                                  (serve_df['p'+str(p)+'_set'+str(s-1)] != -1)]
        
        # Aces in current set
        N_ace = set_df.loc[set_df['ace'] == p].shape[0]
        
        # Double faults
        N_double_fault = set_df.loc[set_df['double_fault'] == p].shape[0]
        # Single faults
        N_single_fault = set_df.loc[set_df['2nd_serve'] == p].shape[0]
        
        # Percentage of service points won by the server
        N_won_ratio = 100*set_df.loc[(set_df['pt_won_by'] == p)].shape[0]/set_df.shape[0]
        
        # Append values to list
        aces_set_.append(N_ace)
        first_fault_set_.append(N_single_fault)
        second_fault_set_.append(N_double_fault)
        points_won_ratio_.append(N_won_ratio)
        
        # Average point duration
        point_duration_.append(np.mean(set_df['point_duration']))
        
    aces_set.append(aces_set_)
    first_fault_set.append(first_fault_set_)
    second_fault_set.append(second_fault_set_)
    points_won_ratio.append(points_won_ratio_)
    point_duration.append(point_duration_)
    
    
# x axis
sets = np.arange(1, N_sets+1)
set_names = tuple(['Set '+str(s) for s in sets])

# Aces per sets
plt.figure(figsize=(12,6))
plt.bar(sets-0.1, aces_set[0], width=0.2, color='r', align='center')
plt.bar(sets+0.1, aces_set[1], width=0.2, color='g', align='center')
plt.ylabel('Number of aces')
plt.xticks(sets, set_names)
plt.title('Aces per set')
plt.legend(player_name)
plt.grid(True)
plt.show()
    
# Faults per sets
plt.figure(figsize=(12,6))
plt.bar(sets-0.15, first_fault_set[0], width=0.1, color='y', align='center')
plt.bar(sets-0.05, first_fault_set[1], width=0.1, color='r', align='center')
plt.bar(sets+0.05, second_fault_set[0], width=0.1, color='g', align='center')
plt.bar(sets+0.15, second_fault_set[1], width=0.1, color='b', align='center')
plt.xticks(sets, set_names)
plt.ylabel('Service faults')
plt.title('Service faults per set')
plt.legend([player_name[0]+'\'s single fault', player_name[1]+'\'s single fault',
            player_name[0]+'\'s double fault', player_name[1]+'\'s double fault'])
plt.grid(True)
plt.show()


# Percentage of points won per set
plt.figure(figsize=(12,6))
plt.bar(sets-0.1, points_won_ratio[0], width=0.2, color='m', align='center')
plt.bar(sets+0.1, points_won_ratio[1], width=0.2, color='c', align='center')
plt.ylabel('Points won by server [%]')
plt.xticks(sets, set_names)
plt.ylim([0, 100])
plt.title('Server\'s points won per set')
plt.legend(player_name)
plt.grid(True)
plt.show()

# Average point duration per set
plt.figure(figsize=(12,6))
plt.bar(sets-0.1, point_duration[0], width=0.2, color='g', align='center')
plt.bar(sets+0.1, point_duration[1], width=0.2, color='y', align='center')
plt.ylabel('Average point duration [s]')
plt.xticks(sets, set_names)
plt.title('Average point duration on serve per set')
plt.legend(player_name)
plt.grid(True)
plt.show()

The number of Đoković's aces seems to drop during the match, while Federer was slightly more aggressive on average and dominated the last set, hitting six times as many aces as his opponent. Interestingly, judging by the number of aces and faults per set, Federer appears to have lost sets in which he served much better than Đoković (more aces and fewer faults). Looking at the percentage of points won by the server and the average duration of points on each player's service, the players were fairly evenly matched in every set except the second. In that set Đoković lost 70% of the points on his service, but with an average point duration of nearly a minute, so the result of the set (1-6) does not mean that Đoković "gave up" on it.
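
The set a point belongs to can also be derived in one step: a set column holds -1 until that set is finished, so the number of finished sets plus one gives the set being played. A sketch on made-up rows, assuming the `p1_set1` ... `p1_set5` columns behave as in the cell above:

```python
import pandas as pd

set_cols = [f"p1_set{i}" for i in range(1, 6)]

# Toy rows, NOT real data: -1 marks a set that has not been completed yet
toy = pd.DataFrame(
    [
        [-1, -1, -1, -1, -1],  # first set in progress
        [7, -1, -1, -1, -1],   # second set in progress
        [7, 1, -1, -1, -1],    # third set in progress
        [7, 1, 7, 4, -1],      # fifth set in progress
    ],
    columns=set_cols,
)

# Completed sets + 1 = set currently being played
toy["set_no"] = (toy[set_cols] != -1).sum(axis=1) + 1
print(toy["set_no"].tolist())  # [1, 2, 3, 5]
```

With such a column in place, per-set statistics reduce to a `groupby("set_no")`.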

First and second service

In [8]:
fs_pts = []
ss_pts = []

for i,p in zip(range(2),range(1,3)):
    # Points by serve
    serve_df = df.loc[df['server']==player_name[i].lower()]
    
    # First service data
    fs_df = serve_df.loc[serve_df['2nd_serve']==0]
    
    # Second service data
    ss_df = serve_df.loc[serve_df['2nd_serve']==p]
    
    # First service won
    Nfs_won = fs_df.loc[fs_df['pt_won_by'] == p].shape[0]
    # First service lost
    Nfs_lost = fs_df.shape[0] - Nfs_won
    fs_pts.append([Nfs_won, Nfs_lost])
    
    # Second service won
    Nss_won = ss_df.loc[ss_df['pt_won_by'] == p].shape[0]
    # Second service lost
    Nss_lost = ss_df.shape[0] - Nss_won
    ss_pts.append([Nss_won, Nss_lost])
    
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8,9), dpi=120, facecolor='white')
ax = axes.ravel()
plt.suptitle('Points won in first and second service', fontweight='bold')

for i in range(2):
    ax[i*2].pie([fs_pts[i][0], fs_pts[i][1]],
              labels=['Server', 'Opponent'],
                labeldistance=1.15,
                wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
                autopct = '%.2f%%')
    ax[i*2].set_title(player_name[i]+'\'s first serve')
    
    ax[i*2+1].pie([ss_pts[i][0], ss_pts[i][1]],
                labels=['Server', 'Opponent'],
                labeldistance=1.15,
                wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
                autopct = '%.2f%%')
    ax[i*2+1].set_title(player_name[i]+'\'s second serve')

plt.show()

The service number turned out to have a great influence on the point outcome. Both players won around 75% of the points on their first service, but only around 50% when the first service was a fault. On his second service, Đoković even won fewer points than his opponent.
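
The same percentages can be computed without the pie charts. The sketch below uses a made-up frame and assumes the encoding used in the cells above: `2nd_serve` is 0 when the first service was in and otherwise holds the server's number, and `pt_won_by` holds the number of the player who won the point. For simplicity, every toy point is served by player 1.

```python
import pandas as pd

# Toy points served by player 1, NOT real data
toy = pd.DataFrame({
    "2nd_serve": [0, 0, 0, 0, 1, 1, 1, 1],
    "pt_won_by": [1, 1, 1, 2, 1, 2, 2, 1],
})

server = 1
toy["serve"] = toy["2nd_serve"].map({0: "first", server: "second"})
toy["server_won"] = toy["pt_won_by"] == server

# Share of points won by the server, split by serve number
win_pct = toy.groupby("serve")["server_won"].mean() * 100
print(win_pct)
```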

Players service type and direction analysis

The point description column makes it possible to go through every type of service played in the match. In the following cells, all service types, including faults, are explored.
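
The serve part of a description follows a regular pattern: the serve number, the direction and, optionally, an outcome after a comma. A sketch of parsing it with a regular expression rather than fixed string offsets follows; `parse_serves` is a hypothetical helper, and the second example string is made up in the style of the dataset.

```python
import re

# "<1st|2nd> serve <direction>[, <outcome>]"
SERVE_RE = re.compile(r"(1st|2nd) serve ([^,.;]+)(?:, ([^.;]+))?")

def parse_serves(point_description):
    """Return a list of (serve number, direction, outcome) tuples."""
    # Everything before the first ';' describes the serve(s)
    serve_part = point_description.split(";")[0]
    return [
        (number, direction.strip(), outcome or "regular")
        for number, direction, outcome in SERVE_RE.findall(serve_part)
    ]

print(parse_serves("1st serve wide, ace."))
print(parse_serves("1st serve wide, fault (net). 2nd serve to body; forehand return"))
```

A serve without an explicit outcome is labelled "regular" here, mirroring the column name used in the tables below.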

In [9]:
# Possible serve outcomes (columns) and serve directions (rows)
outcomes = ['ace', 'service winner', 'fault (long)', 'fault (wide)', 'fault (net)', 'fault (wide and long)', 'regular']
serve_directions = ['wide', 'down the T', 'to body']

def count_serves(serve_df):
    """Count first and second serves by direction and outcome."""
    fs = pd.DataFrame(0, columns=outcomes, index=serve_directions, dtype=int)
    ss = pd.DataFrame(0, columns=outcomes, index=serve_directions, dtype=int)

    for description in serve_df['point_description']:
        # Serve info precedes the first ';', individual serves are separated by '.'
        for desc in description.split(';')[0].split('.'):
            if('1st' in desc):
                # Skip the prefix '1st serve '
                table, prefix_len = fs, 10
            elif('2nd' in desc):
                # Skip the prefix ' 2nd serve '
                table, prefix_len = ss, 11
            else:
                continue

            desc_list = desc[prefix_len:].split(',')
            outcome = 'regular' if(len(desc_list) == 1) else desc_list[1][1:]
            table.loc[desc_list[0], outcome] += 1

    return fs, ss

# Player 1
fs_df1, ss_df1 = count_serves(df.loc[df['server']==player_name[0].lower()])

# Player 2
fs_df2, ss_df2 = count_serves(df.loc[df['server']==player_name[1].lower()])

First player's first service stats

In [10]:
fs_df1.head()
Out[10]:
            ace  service winner  fault (long)  fault (wide)  fault (net)  fault (wide and long)  regular
wide          4               2            10             9           21                      3       52
down the T    5               0             9             4           16                      1       53
to body       0               0             6             0            4                      0       20

First player's second service stats

In [11]:
ss_df1.head()
Out[11]:
            ace  service winner  fault (long)  fault (wide)  fault (net)  fault (wide and long)  regular
wide          0               0             1             1            1                      0       22
down the T    1               0             0             0            1                      0       19
to body       0               0             4             0            1                      0       32

Second player's first service stats

In [12]:
fs_df2.head()
Out[12]:
            ace  service winner  fault (long)  fault (wide)  fault (net)  fault (wide and long)  regular
wide         10               1             4            18           18                      4       59
down the T   15               0            11             9           11                      0       40
to body       0               0             0             0            1                      0        2

Second player's second service stats

In [13]:
ss_df2.head()
Out[13]:
            ace  service winner  fault (long)  fault (wide)  fault (net)  fault (wide and long)  regular
wide          0               0             0             0            1                      0       32
down the T    0               0             4             0            0                      0       17
to body       0               0             0             0            1                      0       21

Service direction

In [14]:
wide = [fs_df1.loc['wide'].to_numpy().sum() + ss_df1.loc['wide'].to_numpy().sum(),
        fs_df2.loc['wide'].to_numpy().sum() + ss_df2.loc['wide'].to_numpy().sum()]

down_the_T = [fs_df1.loc['down the T'].to_numpy().sum() + ss_df1.loc['down the T'].to_numpy().sum(),
              fs_df2.loc['down the T'].to_numpy().sum() + ss_df2.loc['down the T'].to_numpy().sum()]
to_body = [fs_df1.loc['to body'].to_numpy().sum() + ss_df1.loc['to body'].to_numpy().sum(),
           fs_df2.loc['to body'].to_numpy().sum() + ss_df2.loc['to body'].to_numpy().sum()]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Service direction', fontweight='bold')

for i in range(2):  
    values = [wide[i], down_the_T[i], to_body[i]]
    names = ['Wide','Down the T','To body']
    
    ax[i].pie(values,
              labels=names,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[i].set_title(player_name[i])
plt.show()

Both Đoković and Federer prefer wide services over the other directions, and Federer served to the body nearly three times less often than Đoković.
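
The percentages in the chart follow directly from the four tables above. As a check, the sketch below re-adds the row totals by hand (first plus second services, all outcomes) and converts them to shares:

```python
# Row totals of the four serve tables above (first + second services)
direction_totals = {
    "Djokovic": {"wide": 101 + 25, "down the T": 88 + 21, "to body": 30 + 37},
    "Federer": {"wide": 114 + 33, "down the T": 86 + 21, "to body": 3 + 22},
}

shares = {}
for player, totals in direction_totals.items():
    n_serves = sum(totals.values())
    # Percentage of all serves hit in each direction, rounded to one decimal
    shares[player] = {d: round(100 * n / n_serves, 1) for d, n in totals.items()}
    print(player, n_serves, shares[player])
```

In absolute numbers, Đoković's 67 services to the body against Federer's 25 is a ratio of about 2.7.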

First and second service direction

In [15]:
# Player 1
wide = [fs_df1.loc['wide'].to_numpy().sum(), ss_df1.loc['wide'].to_numpy().sum()]
down_the_T = [fs_df1.loc['down the T'].to_numpy().sum(), ss_df1.loc['down the T'].to_numpy().sum()]
to_body = [fs_df1.loc['to body'].to_numpy().sum(), ss_df1.loc['to body'].to_numpy().sum()]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle(player_name[0] + '\'s service direction (first & second)', fontweight='bold')
for i in range(2):  
    values = [wide[i], down_the_T[i], to_body[i]]
    names = ['Wide','Down the T','To body']
    ax[i].pie(values,
              labels=names,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    if(i==0):
        ax[i].set_title('First service')
    else:
        ax[i].set_title('Second service')
plt.show()


# Player 2
wide = [fs_df2.loc['wide'].to_numpy().sum(), ss_df2.loc['wide'].to_numpy().sum()]
down_the_T = [fs_df2.loc['down the T'].to_numpy().sum(), ss_df2.loc['down the T'].to_numpy().sum()]
to_body = [fs_df2.loc['to body'].to_numpy().sum(), ss_df2.loc['to body'].to_numpy().sum()]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle(player_name[1] + '\'s service direction (first & second service)', fontweight='bold')
for i in range(2):  
    values = [wide[i], down_the_T[i], to_body[i]]
    names = ['Wide','Down the T','To body']
    ax[i].pie(values,
              labels=names,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    if(i==0):
        ax[i].set_title('First service')
    else:
        ax[i].set_title('Second service')
plt.show()

Only 1.48% of Federer's first services went to the opponent's body. On the second service, both players served to the body much more often (possibly to avoid mistakes).

Aces direction

In [16]:
directions = ['wide','down the T','to body']
values1 = []
values2 = []
for direction in directions:
    # Aces per direction: first player (values1) and second player (values2)
    values1.append(fs_df1.loc[direction, 'ace'] + ss_df1.loc[direction, 'ace'])
    values2.append(fs_df2.loc[direction, 'ace'] + ss_df2.loc[direction, 'ace'])
values = [values1, values2]
                    
positions = np.arange(3)

# Aces types
plt.figure(figsize=(12,6))
plt.bar(positions-0.1, values[0], width=0.2, color='g', align='center')
plt.bar(positions+0.1, values[1], width=0.2, color='y', align='center')
plt.ylabel('Number of aces')
plt.xticks(positions, directions)
plt.title('Aces by direction')
plt.legend(player_name)
plt.grid(True)
plt.show()

Although most services by both players were wide, down the T was the direction in which both players hit the most aces. It was fair to assume that there were no aces directed at the body, but it does not hurt to check, just to be sure :).

Fault direction

In [17]:
faults = ['fault (long)','fault (wide)','fault (net)','fault (wide and long)']
values1 = []
values2 = []
for fault in faults:
    values1.append(fs_df1[fault].to_numpy().sum() + ss_df1[fault].to_numpy().sum())
    values2.append(fs_df2[fault].to_numpy().sum() + ss_df2[fault].to_numpy().sum())
values = [values1, values2]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Fault direction', fontweight='bold')
for i in range(2):  
    ax[i].pie(values[i],
              labels=faults,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[i].set_title(player_name[i]+'\'s service')
plt.show()
    

The most common service fault is clearly the net fault. Bearing in mind that most services were wide, Đoković made relatively few wide faults, about half as many as Federer.

Service outcome

In [18]:
names = ['ace', 'service winner', 'fault (long)', 'fault (wide)', 'fault (net)', 'fault (wide and long)', 'regular']
values1 = []
values2 = []
for name in names:
    values1.append(fs_df1[name].to_numpy().sum() + ss_df1[name].to_numpy().sum())
    values2.append(fs_df2[name].to_numpy().sum() + ss_df2[name].to_numpy().sum())
values = [values1, values2]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Service outcome', fontweight='bold')
for i in range(2):  
    ax[i].pie(values[i],
              labels=names,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[i].set_title(player_name[i]+'\'s service')
plt.show()

From the pie charts above, we can tentatively conclude that Federer's service is a bit more attacking, with a significantly larger number of aces, but also with a substantial percentage of faults.

Drop shot

In [19]:
drop_shots = [df.loc[df["drop_shot"]==1].shape[0], df.loc[df["drop_shot"]==2].shape[0]]
position = 0

# Drop shots visualization
plt.figure(figsize=(12,6))
plt.bar(position-0.5, drop_shots[0], width=1, color='b', align='center')
plt.bar(position+0.5, drop_shots[1], width=1, color='y', align='center')
plt.ylabel('Number of drop shots')
plt.xticks([-0.5, 0.5], (player_name[0], player_name[1]))
plt.xlim([-5,5])
plt.title('Number of drop shots per player')
plt.legend(player_name)
plt.grid(which='major', axis='y')
plt.show()

Since we analyse just one match, it cannot be said whether Đoković avoids drop shots in general or whether the drop shot is Federer's preferred weapon, but the difference in tactics is obvious in this match.

Lobs

In [20]:
lobs = [df.loc[df["lob"]==1].shape[0], df.loc[df["lob"]==2].shape[0]]
position = 0

# Lobs visualization
plt.figure(figsize=(12,6))
plt.bar(position-0.5, lobs[0], width=1, color='g', align='center')
plt.bar(position+0.5, lobs[1], width=1, color='m', align='center')
plt.ylabel('Number of lobs')
plt.xticks([-0.5, 0.5], (player_name[0], player_name[1]))
plt.xlim([-5,5])
plt.title('Number of lobs per player')
plt.legend(player_name)
plt.grid(which='major', axis='y')
plt.show()

Compared with Federer, Đoković played a few more lobs and far fewer drop shots to draw his opponent to the net.

Stroke analysis

The point description hides many more details than just the serve. The most important among them are the strokes. The following analysis shows how strokes, such as forehands and backhands, differ between the two players.
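
Strokes can be attributed to players because they alternate: the server hits the serve, the receiver the return, and so on. A compact sketch of that attribution using `collections.Counter`, on a made-up description written in the style of the dataset (the helper `tally_strokes` is hypothetical):

```python
from collections import Counter

def tally_strokes(point_description, server):
    """Count forehands and backhands per player (0 or 1) for one point."""
    counts = [Counter(), Counter()]
    # Segment 0 is the serve (hit by the server); players alternate afterwards
    for k, segment in enumerate(point_description.split(";")):
        player = (server + k) % 2
        for stroke in ("forehand", "backhand"):
            if stroke in segment:
                counts[player][stroke] += 1
    return counts

# Made-up example: player 0 serves, player 1 returns, player 0 hits a winner
example = "1st serve wide; backhand return crosscourt; forehand down the line, winner."
counts = tally_strokes(example, server=0)
print(counts)
```

A `Counter` returns 0 for keys it has not seen, so no explicit "does the key exist" check is needed before incrementing.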

In [21]:
# Forehand outcome counts, one dictionary per player
forehand = [{},{}]
# Backhand outcome counts, one dictionary per player
backhand = [{},{}]

for player in range(2):
    
    # Extract data for current server
    player_on_service = df.loc[df["server"] == player_name[player].lower()]

    # Offset chosen so that itr%2 is the index of the player hitting each stroke (the server hits first)
    itr_init = 1 + player
    
    # Going through every point description
    for i in range(player_on_service.shape[0]):
        
        # Extract info about service
        point_desc = player_on_service['point_description'].iloc[i].split(';')

        # Iterator helps detecting current player
        itr = itr_init
        
        for desc in point_desc:
            itr += 1

            # Remove phrase 'return' if exists
            desc = desc.replace('return ','')

            # Check if forehand is played
            if('forehand' in desc):
                # Remove rally length
                stroke_desc = desc.split('.')[0]

                # Remove 'forehand ' from description
                stroke_desc = stroke_desc.replace(' forehand ','')

                # Check for winner/error
                if(len(stroke_desc.split(','))==2):
                    # Stroke type
                    stroke_type = stroke_desc.split(',')[0]
                    # Finish type
                    point_end = stroke_desc.split(',')[1]

                    # Check for winner
                    if('winner' in point_end):
                        # Check if key exists
                        if('winner' in forehand[itr%2].keys()):
                            forehand[itr%2]['winner'] += 1
                        else:
                            forehand[itr%2]['winner'] = 1  

                    # Check for error
                    elif('error' in point_end):
                        # Error description
                        error_type = point_end[1:]

                        # Check if key exists
                        if(error_type in forehand[itr%2].keys()):
                            forehand[itr%2][error_type] += 1
                        else:
                            forehand[itr%2][error_type] = 1
                    else:
                        raise ValueError("New type came up on the end of the point: %s" %(point_end))
                else:
                    if('rally' in forehand[itr%2].keys()):
                        forehand[itr%2]['rally'] += 1
                    else:
                        forehand[itr%2]['rally'] = 1
                        
            # Check if backhand is played
            if('backhand' in desc):
                # Remove rally length
                stroke_desc = desc.split('.')[0]

                # Remove 'backhand ' from description
                stroke_desc = stroke_desc.replace(' backhand ','')

                # Check for winner/error
                if(len(stroke_desc.split(','))==2):
                    # Stroke type
                    stroke_type = stroke_desc.split(',')[0]
                    # Finish type
                    point_end = stroke_desc.split(',')[1]

                    # Check for winner
                    if('winner' in point_end):
                        # Check if key exists
                        if('winner' in backhand[itr%2].keys()):
                            backhand[itr%2]['winner'] += 1
                        else:
                            backhand[itr%2]['winner'] = 1  

                    # Check for error
                    elif('error' in point_end):
                        # Error description
                        error_type = point_end[1:]

                        # Check if key exists
                        if(error_type in backhand[itr%2].keys()):
                            backhand[itr%2][error_type] += 1
                        else:
                            backhand[itr%2][error_type] = 1
                    else:
                        raise ValueError("New type came up on the end of the point: %s" %(point_end))
                else:
                    if('rally' in backhand[itr%2].keys()):
                        backhand[itr%2]['rally'] += 1
                    else:
                        backhand[itr%2]['rally'] = 1
            

Strokes in general

In [22]:
# Strokes in general
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
    ax = axes.ravel()
    plt.suptitle(player_name[i]+'\'s strokes', fontweight='bold')
    values = [list(forehand[i].values()), list(backhand[i].values())]
    labels = [list(forehand[i].keys()), list(backhand[i].keys())]
    titles = ['Forehand', 'Backhand']
    for j in range(2):  
        ax[j].pie(values[j],
                  labels=labels[j],
                  labeldistance=1.15,
                  wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
                  autopct = '%.2f%%')
        ax[j].set_title(titles[j])
    plt.show()

Federer's forehand has the highest error rate and the greatest percentage of winners, while Đoković's backhand has the smallest percentage of errors and winners. This makes them, respectively, the most attacking and the most stable strokes of the two players.

Number of stroke types

In [23]:
positions = np.array([-.5, .5])
winner = [forehand[i]['winner']+backhand[i]['winner'] for i in range(2)]
forced = [forehand[i]['forced error']+backhand[i]['forced error'] for i in range(2)]
unforced = [forehand[i]['unforced error']+backhand[i]['unforced error'] for i in range(2)]
rally = [forehand[i]['rally']+backhand[i]['rally'] for i in range(2)]

# Stroke types per player
plt.figure(figsize=(12,6))
plt.bar(positions-0.2, rally, width=0.2, color='c', align='center')
plt.bar(positions, winner, width=0.2, color='y', align='center')
plt.bar(positions+0.2, forced, width=0.2, color='g', align='center')
plt.bar(positions+0.2, unforced, bottom=forced, width=0.2, color='m', align='center')
plt.ylabel('Number of strokes')
plt.xticks(positions, player_name)
plt.xlabel('Player')
plt.title('Stroke types per player')
plt.legend(['Rally', 'Winner', 'Forced error','Unforced error'])
plt.grid(which='major', axis='y')
plt.show()

Looking at the absolute number of strokes (not just their relative shares), Federer's strokes are more likely to end as winners or errors than Đoković's.

Stroke errors

In [24]:
labels = ['Forehand unforced error','Forehand forced error',
          'Backhand unforced error','Backhand forced error']
values = [[forehand[0]['unforced error'], forehand[0]['forced error'],
          backhand[0]['unforced error'], backhand[0]['forced error']],
         [forehand[1]['unforced error'], forehand[1]['forced error'],
          backhand[1]['unforced error'], backhand[1]['forced error']]]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Stroke error', fontweight='bold')
for j in range(2):  
    ax[j].pie(values[j],
              labels=labels,
              labeldistance=1.01,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[j].set_title(player_name[j])
plt.show()

The numbers of forced and unforced errors can also be compared as an overall stroke error analysis. Federer's unforced backhand error is the rarest mistake of the match and his unforced forehand error the most common one. Đoković's errors, on the other hand, are quite balanced.

Point ending

In [25]:
labels = ['Error by opponent', 'Winner', 'Ace', 'Double fault']
values = [[],[]]
# Points won by first player
values[0].append(df.loc[df['error']==2].shape[0])
values[0].append(df.loc[df['winner']==1].shape[0])
values[0].append(df.loc[df['ace']==1].shape[0])
values[0].append(df.loc[df['double_fault']==2].shape[0])
# Points won by second player
values[1].append(df.loc[df['error']==1].shape[0])
values[1].append(df.loc[df['winner']==2].shape[0])
values[1].append(df.loc[df['ace']==2].shape[0])
values[1].append(df.loc[df['double_fault']==1].shape[0])

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), facecolor='white')
ax = axes.ravel()
plt.suptitle('Point ending', fontweight='bold')
for j in range(2):  
    ax[j].pie(values[j],
              labels=labels,
              labeldistance=1.15,
              wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
              autopct = '%.2f%%')
    ax[j].set_title(player_name[j]+'\'s winning points')

plt.show()

The analysis started with how points begin (service), continued with the "body" of the point (strokes), and ends with how points finish. Đoković relied on his opponent's mistakes for 75% of the points he won, followed by winners, which account for one fifth of them. Federer, on the other hand, did not count on his opponent's lack of preparedness or endurance: Đoković's mistakes gave him around 60% of his points, with winners accounting for around 30% and aces for over 11%. The difference may not seem large in the chart above, but Đoković won only 25% of his points "on his own", compared with Federer's 41%.
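
The share of points a player won through his own winners and aces, as opposed to the opponent's errors and double faults, takes one line to compute. A sketch with made-up counts (not taken from the dataset); `own_share` is a hypothetical helper:

```python
def own_share(endings):
    """Percentage of won points that ended with the player's winner or ace."""
    total = sum(endings.values())
    return 100 * (endings["winner"] + endings["ace"]) / total

# Hypothetical counts of how one player's won points ended
endings = {"error by opponent": 60, "winner": 25, "ace": 10, "double fault": 5}
print(own_share(endings))  # 35.0
```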

Comparing point's winner and point's rally length

In [26]:
points_won = [[],[]]
max_rally = np.amax(df['rally_length'].to_numpy())+1
for i in range(2, max_rally):
    for j in range(2):
        points_won[j].append(df.loc[(df['rally_length']==i) & (df['pt_won_by']==(j+1))].shape[0])
    
positions = np.arange(2, max_rally)

plt.figure(figsize=(12,6))
plt.bar(positions-0.1, points_won[0], width=0.2, color='g', align='center')
plt.bar(positions+0.1, points_won[1], width=0.2, color='y', align='center')
plt.ylabel('Number of points won')
plt.xlabel('Rally length')
plt.xticks(positions, positions)
plt.title('Winning points by rally length')
plt.legend(player_name)
plt.grid(which='major', axis='y')
plt.show()

In short rallies Federer won more points, but in medium and long rallies Đoković performed better.
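
To make "short", "medium" and "long" concrete, rallies can be grouped into the same bins the dataset already uses for its `rally_1-4`, `rally_5-8` and `rally_9+` columns. A sketch on made-up points, where each tuple is (rally length, point winner):

```python
from collections import Counter

def rally_bin(length):
    """Map a rally length to one of the dataset's three bins."""
    if length <= 4:
        return "1-4"
    if length <= 8:
        return "5-8"
    return "9+"

# Toy (rally_length, pt_won_by) pairs, NOT real data
points = [(1, 2), (3, 2), (4, 1), (6, 1), (7, 2), (9, 1), (12, 1)]

# Points won per (bin, player)
wins = Counter((rally_bin(length), winner) for length, winner in points)
for key in sorted(wins):
    print(key, wins[key])
```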

Distribution of points won with respect to point duration

In [27]:
duration = []
for i in range(1,3):
    duration.append(df.loc[df['pt_won_by'] == i]['point_duration'])

plt.figure(figsize=(12,8))
sn.kdeplot(duration[0], label=player_name[0])
sn.kdeplot(duration[1], label=player_name[1])
plt.ylabel('Density')
plt.xlabel('Point duration [s]')
plt.xlim([0,np.amax(df['point_duration'])])
plt.title('Distribution of points won with respect to point duration')
plt.grid(True)
plt.show()

Both players had a similar distribution of points won with respect to point duration. As a rough estimate, points of up to 60 seconds are the ones Đoković can turn in his favour, while longer points are more Federer's "cup of tea".

Points duration

In [28]:
match_dur = df['vid_second'].iloc[-1] - df['vid_second'].iloc[0] + df['point_duration'].iloc[-1]
h = match_dur//3600
m = (match_dur%3600)//60
s = (match_dur%3600)%60
print('Match duration is: %dh:%dm:%ds' %(h,m,s))

# Points duration
fig, axes = plt.subplots(ncols=2, figsize=(18,6))
ax = axes.ravel()
plt.suptitle("Point duration")

# Point duration against the time at which the point was played
ax[0].scatter(df['vid_second'],df['point_duration'])
ax[0].set_xlabel("Timestamp of the match [s]")
ax[0].set_ylabel("Point duration [s]")
ax[0].set_title('Point duration during the match')
ax[0].grid(True)

# Box plot of point duration
sn.boxplot(ax=ax[1], x=df['point_duration'])

plt.show()
Match duration is: 4h:56m:26s

The median point duration is about 40 seconds, and the three longest points, of around 5 minutes each, were played in the middle of the match. One quarter of all points ended in less than 25 seconds, and over one quarter lasted more than 50 seconds. With a duration of nearly 5 hours, this is without a doubt one of the longest, if not the longest, Wimbledon finals.
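
The quartiles quoted above are read off the box plot; they can also be computed directly. A sketch with the standard library on made-up durations (with the real data, `df['point_duration'].quantile([0.25, 0.5, 0.75])` would give the corresponding values):

```python
import statistics

# Toy point durations in seconds, NOT real data
durations = [10, 20, 25, 30, 40, 45, 50, 60, 300]

# Quartiles (25th, 50th and 75th percentiles)
q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")

# Points more than 1.5 * IQR above Q3 are drawn as outliers in a box plot
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [d for d in durations if d > upper_fence]
print(q1, median, q3, outliers)
```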

Rally length

In [29]:
# Rally length
fig, axes = plt.subplots(ncols=2, figsize=(18,6))
ax = axes.ravel()
plt.suptitle("Rally length")

# Rally length against the time at which the point was played
ax[0].scatter(df['vid_second'],df['rally_length'])
ax[0].set_xlabel("Timestamp of the match [s]")
ax[0].set_ylabel("Rally length")
ax[0].set_title('Rally length during the match')
ax[0].grid(True)

# Box plot of rally length
sn.boxplot(ax=ax[1], x=df['rally_length'])

plt.show()