Load libraries and API key
import requests, json
import numpy as np
from src import API_io
import importlib
import pandas as pd
working_dir = 'C:\\Users\\Me\\Documents\\GitHub\\lolML'
with open(working_dir + '\\api_key.txt', 'r') as api_file:
    api_key = api_file.read()
Load featured games, and get a list of summoner_names
featured_json = API_io.load_featured_games(api_key) # load json of featured games
featured_game_ids = [x['gameId'] for x in featured_json ] # use list comprehension to get featured games; don't use this
Make a list of summoner names and summoner IDs from the featured JSON
summoner_names, summoner_IDs = API_io.get_summoners_IDs_from_featured_games(featured_json, api_key)
summoner_names[:5]
Make a list of summoner ID URLs to query RITO with, and then query them (rate limited to one query per 1.2 seconds to avoid overloading the API).
summoner_urls = [API_io.make_matchlist_url_summoner_ID(x, True, True, api_key) for x in summoner_IDs]
summoner_urls[:2]
match_histories = [API_io.get_limited_request(x) for x in summoner_urls ]
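get_limited_request comes from src.API_io, and its implementation isn't shown in this notebook. As a rough sketch, a rate-limited GET helper could look something like this (the function name, sleep interval, and error handling are assumptions, not the actual src code):

import time, requests
def rate_limited_get(url, wait=1.2):
    # hypothetical stand-in for API_io.get_limited_request
    time.sleep(wait)  # stay under Riot's rate limit (~0.83 requests / second)
    response = requests.get(url)
    response.raise_for_status()  # surface HTTP errors such as 429 throttling
    return response.json()

With a helper like that, the list comprehension above would work the same way, just one request every 1.2 seconds.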
Extract the match IDs from the match history JSONs
match_IDs = np.empty(0, dtype=int)
for cur_matches in match_histories:
    match_IDs = np.append(match_IDs, [x['matchId'] for x in cur_matches['matches']])
pd.Series(match_IDs).to_csv('Match IDs.csv')
match_IDs.shape
match_df = pd.read_csv('Match IDs.csv', header =None)
match_IDs = match_df[1]
match_IDs = match_IDs.unique()
match_IDs.shape
(27928,)
Make a list of match URLs, and then use requests to query them; again, this is rate limited.
# make urls for loading
match_urls = [API_io.make_match_info_url(x, True, api_key) for x in match_IDs] # True flag means we get the timeline
match_urls[:2]
['https://na.api.pvp.net/api/lol/na/v2.2/match/1955239698?includeTimeline=true&api_key=0da3703d-7bf5-4e72-96cd-5062b28720d7', 'https://na.api.pvp.net/api/lol/na/v2.2/match/1954974642?includeTimeline=true&api_key=0da3703d-7bf5-4e72-96cd-5062b28720d7']
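make_match_info_url is also defined in src.API_io. Judging from the URLs printed above, a hedged sketch of such a URL builder might look like this (the region handling and default values are assumptions beyond what the output shows):

# hypothetical sketch of a match-info URL builder, inferred from the printed URLs above
def make_match_info_url_sketch(match_id, include_timeline, api_key, region='na'):
    base = 'https://{0}.api.pvp.net/api/lol/{0}/v2.2/match/{1}'.format(region, match_id)
    timeline_flag = 'true' if include_timeline else 'false'
    return base + '?includeTimeline=' + timeline_flag + '&api_key=' + api_key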
import time
import sys
match_range = np.arange(2000,2010)
# this for loop is ugly; used list comprehension previously, but rate limit was fluky
full_match_info = np.empty(0)
for cur_match in match_range:
    time.sleep(1.2)  # RIOT API is throttled to ~0.83 requests / second
    try:
        full_match_info = np.append(full_match_info, requests.get(match_urls[cur_match]).json())
    except requests.exceptions.HTTPError as e:
        print('Error: ' + str(e) + ' in game ' + str(match_IDs[cur_match]))
    except:
        err = sys.exc_info()[0]
        print('Error: ' + str(err) + ' in game ' + str(match_IDs[cur_match]))
Save to a JSON or .pickle so we don't have to query again.
with open('full matchinfo.json', 'w') as out_file:
    json.dump(full_match_info.tolist(), out_file)
# saving as a pickle file saves ~40% of the space
import pickle
with open('full match info.pickle', 'wb') as pickle_file:
    pickle.dump(full_match_info, pickle_file)
# load from JSON
import json
with open('games 6000-8000.json') as json_file:
    scraped_matches = json.load(json_file)
scraped_matches = np.array(scraped_matches)
# loading the pickle in
import pickle
with open('full match info.pickle', 'rb') as pickle_file:
    full_match_info = pickle.load(pickle_file)
import matplotlib.pyplot as plt
import src.plotting as lol_plt
%matplotlib inline
Plot the length of games
game_lengths = np.array([np.size(x['timeline']['frames']) for x in full_match_info])  # timelines have roughly one frame per minute, so the frame count approximates game length in minutes
plt.hist(game_lengths, bins = 50);
plt.xlabel('Game length (min)', fontsize = 18)
plt.ylabel('# Games', fontsize = 18)
lol_plt.prettify_axes(plt.gca())
Some games don't even last twenty minutes! There is also a large spike of games ending around 20 minutes due to surrenders. When we create features, the feature calculator will have to consider game length.
Create features for the classifier; for now, just starting with simple stuff like first blood, first tower, and first dragon.
from src import feature_calc
importlib.reload(feature_calc)
games_df = feature_calc.calc_features_all_matches(full_match_info[:100], 20)
games_df.head(3)
matchId | first_dragon | blue_dragons | red_dragons | drag_diff | first_baron | blue_barons | red_barons | first_tower | blue_towers | red_towers | ... | blue_3 | blue_4 | red_0 | red_1 | red_2 | red_3 | red_4 | surrender | game_length | winner
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1955239698 | 1 | 1 | 1 | 0 | -1 | 0 | 0 | 1 | 3 | 2 | ... | 59 | 429 | 82 | 1 | 2 | 86 | 201 | 1 | 37 | 1
1954974642 | 0 | 0 | 2 | -2 | -1 | 0 | 0 | 0 | 1 | 3 | ... | 20 | 22 | 56 | 432 | 67 | 101 | 39 | 1 | 32 | 0
1950969271 | 0 | 0 | 1 | -1 | -1 | 0 | 0 | 0 | 2 | 2 | ... | 60 | 267 | 201 | 238 | 223 | 119 | 150 | 0 | 44 | 0
3 rows × 34 columns
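calc_features_all_matches lives in src.feature_calc, and its internals aren't reproduced here. To illustrate where a feature like gold_diff could come from, here is a hedged sketch of reading the total-gold difference at a given minute straight from a raw match timeline (the field names follow the Riot v2.2 timeline JSON; treat the participant-to-team mapping as an assumption):

# hedged sketch only -- not the actual feature_calc implementation
def gold_diff_at_minute(match, minute):
    frames = match['timeline']['frames']           # roughly one frame per minute of game time
    frame = frames[min(minute, len(frames) - 1)]   # clamp for games shorter than the requested minute
    blue_gold = sum(p['totalGold'] for p in frame['participantFrames'].values()
                    if p['participantId'] <= 5)    # participants 1-5 are usually the blue team
    red_gold = sum(p['totalGold'] for p in frame['participantFrames'].values()
                   if p['participantId'] > 5)
    return blue_gold - red_gold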
games_df.dtypes
first_dragon     category
blue_dragons      float64
red_dragons       float64
drag_diff         float64
first_baron      category
blue_barons       float64
red_barons        float64
first_tower      category
blue_towers       float64
red_towers        float64
tower_diff        float64
first_inhib      category
blue_inhibs       float64
red_inhibs        float64
first_blood      category
gold_diff         float64
blue_kills        float64
red_kills         float64
blue_share        float64
red_share         float64
kills_diff        float64
blue_0            float64
blue_1            float64
blue_2            float64
blue_3            float64
blue_4            float64
red_0             float64
red_1             float64
red_2             float64
red_3             float64
red_4             float64
surrender        category
game_length       float64
winner           category
dtype: object
count, bins, _ = plt.hist(games_df['gold_diff'] / 1000, bins = 50)
plt.ylabel('# Games', fontsize = 18)
plt.xlabel('Gold difference (thousands)', fontsize = 18)
lol_plt.prettify_axes(plt.gca())
bins = np.arange(40)
kills_fig = plt.figure()
plt.hist(games_df['blue_kills'] , bins = bins, color = 'blue')
plt.hist(games_df['red_kills'] , bins = bins, color = 'red', alpha = 0.5)
plt.ylabel('# Games', fontsize = 18)
plt.xlabel('Kills', fontsize = 18)
lol_plt.prettify_axes(plt.gca())
#plt.gca().set_xticklabels(bins / 1000, rotation = 90)
# load sklearn package
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
# variables for classifiers
col_names = games_df.columns
train_col = np.array([x for x in games_df.columns if x not in
                      ['winner', 'game_length', 'blue_0', 'blue_1', 'blue_2', 'blue_3', 'blue_4',
                       'red_0', 'red_1', 'red_2', 'red_3', 'red_4']])
num_features = np.size(train_col)
print(train_col)
['first_dragon' 'blue_dragons' 'red_dragons' 'drag_diff' 'first_baron' 'blue_barons' 'red_barons' 'first_tower' 'blue_towers' 'red_towers' 'tower_diff' 'first_inhib' 'blue_inhibs' 'red_inhibs' 'first_blood' 'gold_diff' 'blue_kills' 'red_kills' 'blue_share' 'red_share' 'kills_diff' 'surrender']
First, let's see how well each feature predicts the winner on its own.
gnb = GaussianNB()
def quick_score(games_df, col_index):
    feature = train_col[col_index]  # select the feature column by name
    gnb.fit(games_df[[feature]], games_df['winner'])
    return gnb.score(games_df[[feature]], games_df['winner'])
[quick_score(games_df, x) for x in np.arange(num_features-1)]
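The raw list above is just numbers; pairing each score with its feature name makes it easier to read (a small optional snippet using the same quick_score helper):

# pair each single-feature score with its column name and sort, best first
single_scores = [quick_score(games_df, x) for x in np.arange(num_features-1)]
for name, score in sorted(zip(train_col, single_scores), key=lambda pair: -pair[1]):
    print('{}: {:.3f}'.format(name, score))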
First dragon and first tower are both pretty meaningful, but first blood isn't. The most important thing, though, is gold.
Now let's use machine learning to look at everything together. Let's do a 10-fold cross-validation on the data, and see what the average score is.
scores = cross_validation.cross_val_score(gnb, games_df[train_col], games_df['winner'], cv=10)
print(np.mean(scores))
The full predictor is not that much more informative than the individual features! What if we try selecting only the most informative features?
Some sample code using scikit-learn's built-in recursive feature elimination with cross-validation. I never waited long enough for this to finish!
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=3, scoring='accuracy')
# this takes a long time to run
rfecv.fit(timelines_df[-1][train_col], timelines_df[-1]['winner'])
Start with a large random forest to get a sense of which features are important at different timepoints. First, let's get a dataframe for each timepoint.
timeline_end = 50
time_indices = np.arange(5, timeline_end, 5)
timelines_df = [feature_calc.calc_features_all_matches(full_match_info, x) for x in time_indices]
print([x.shape[0] for x in timelines_df])
[5995, 5991, 5948, 5800, 4684, 3471, 2019, 912, 294]
big_forest = RandomForestClassifier(n_jobs = 3, n_estimators = 100, max_features = 'sqrt')
importances = np.zeros([len(timelines_df), np.size(train_col)])
std = np.zeros([len(timelines_df), np.size(train_col)])  # separate arrays; a chained assignment would alias them
for i, cur_df in enumerate(timelines_df):
    big_forest.fit(cur_df[train_col], cur_df['winner'])
    importances[i] = big_forest.feature_importances_
    std[i] = np.std([tree.feature_importances_ for tree in big_forest.estimators_], axis=0)
indices_at_20 = np.argsort(importances[3])[::-1]
indices_at_35 = np.argsort(importances[6])[::-1]
#for f in range(10):
# print("%d. feature %s (%f)" % (f + 1, str(train_col[indices[f]]), importances[indices[f]]))
feature_fig = plt.figure(figsize = [12,9])
plt.imshow(importances[:,indices_at_20], interpolation = 'none', extent = [0, num_features*4, 47.5, 2.5])
plt.ylabel('Time (min)', fontsize = 18)
plt.xticks(np.arange(0, num_features*4, 4)+2, indices_at_20)
x_tick_labels = [str(x) for x in train_col[indices_at_20]]
plt.gca().set_xticklabels(x_tick_labels, rotation = 90)
lol_plt.prettify_axes(plt.gca())
plt.gca().yaxis.set_ticks(time_indices);
plt.colorbar();
Gold differential is the most important feature at all timepoints, followed by kill differential and tower differential. Dragon differential comes in late, behind the total number of towers, at #6. Barons and inhibitors are not particularly informative at later timepoints, probably because they are a reflection of gold and kill differentials.
plt.figure(figsize = [9, 6])
plt.plot(importances[3,indices_at_20], label = 'Importance at 20', linewidth = 2)
plt.plot(importances[6,indices_at_20], label = 'Importance at 35', linewidth = 2)
plt.ylabel('Importance', fontsize = 18)
plt.xticks(np.arange(num_features))
x_tick_labels = [str(x) for x in train_col[indices_at_20]]
plt.gca().set_xticklabels(x_tick_labels, rotation = 90)
plt.legend(frameon = False, fontsize = 16)
lol_plt.prettify_axes(plt.gca())
Let's extract the important columns for future analyses
import_at_20 = train_col[indices_at_20[:10]]
import_at_35 = train_col[indices_at_35[:10]]
important_col = np.unique(np.append(import_at_20, import_at_35))
important_col
array(['blue_barons', 'blue_dragons', 'blue_inhibs', 'blue_kills', 'blue_towers', 'drag_diff', 'first_inhib', 'gold_diff', 'kills_diff', 'red_barons', 'red_dragons', 'red_inhibs', 'red_kills', 'red_towers', 'tower_diff'], dtype='<U12')
In a previous version of this analysis, I found that the carry share was predictive. It does not really fall out of the new analysis, though. While investigating this metric, I found something interesting, which I have kept here. First, let's plot the distribution of carry share.
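How blue_share is actually computed is in feature_calc; from the axis labels below I am reading it as the top killer's fraction of the blue team's kills, so treat that definition as an assumption. A toy illustration with made-up numbers:

# hedged illustration of the carry-share idea, not the actual feature_calc code
blue_team_kills = [12, 5, 3, 2, 1]  # hypothetical kills for each blue player
blue_share_example = max(blue_team_kills) / sum(blue_team_kills)
print('Carry share: {:.2f}'.format(blue_share_example))  # 12/23 -> 0.52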
games_df['blue_share'].hist(bins = np.arange(0, 1, 0.05))
lol_plt.prettify_axes(plt.gca())
plt.xlabel('Blue carry\'s share of kills', fontsize = 18)
print('Median blue carry share: {:.2f}\nStd blue carry share: {:.2f}'.format(games_df['blue_share'].median(), games_df['blue_share'].std()))
Median blue carry share: 0.40
Std blue carry share: 0.18
The median carry share is 0.4, with a standard deviation of ~0.2. Now let's plot the win percentage against the carry share.
games_df = timelines_df[6]  # use the 35-minute dataframe
share_indices = np.arange(0, 1.0+0.1, 0.1)
win_percent_by_share = games_df['winner'].astype(int).groupby(pd.cut(games_df["blue_share"], share_indices)).mean()
plt.plot(share_indices[:-1], win_percent_by_share, 'o')
plt.xlim([0, 1])
plt.xlabel('Blue carry\'s share of kills', fontsize = 18)
plt.ylabel('Win percentage', fontsize = 18)
lol_plt.prettify_axes(plt.gca())
The win percentage goes down as the carry share increases. This means that teams whose kills are concentrated on one player are less likely to win. If you get far ahead in your lane, it's less important to keep your lane opponent down than it is to get the rest of your team fed.
Now that we have features that are important, we can answer other questions.
rfc = RandomForestClassifier(n_jobs = 2, n_estimators = 20, max_features = 'sqrt')
def cross_validate_df(cur_df):
    return cross_validation.cross_val_score(rfc, cur_df[important_col], cur_df['winner'], cv=5, n_jobs = 2)
scores_list = [cross_validate_df(x) for x in timelines_df]
plt.plot(time_indices, np.mean(scores_list, 1))
plt.ylim( 0.5, 1)
plt.xlabel('Time (min)', fontsize = 18)
plt.ylabel('Prediction accuracy', fontsize = 18)
plt.xticks(time_indices)
lol_plt.prettify_axes(plt.gca())
The game does become easier to predict over time, as the model has more information to work with. However, once you reach 30 minutes, the model loses accuracy. This is probably because, as the game enters the late phase, gold matters less and objectives matter more, and a single decisive teamfight can swing the game either way.
First, separate the games into those that were surrendered early and those that were at least semi-close.
surrender_at_20_df = timelines_df[3].query('(surrender == 1) & (game_length <=25)')
good_games_df = timelines_df[3].query('(surrender == 0) | (game_length >25)')
surrender_at_20_df[['surrender', 'game_length']].head(2)
matchId | surrender | game_length
---|---|---
1947508622 | 1 | 22
1947379227 | 1 | 25
gold_bins = np.arange(-20000, 20000, 1000)
plt.hist(np.array(surrender_at_20_df['gold_diff']), bins=gold_bins)
plt.xlabel('Gold difference (thousands)', fontsize = 18)
plt.ylabel('# Games', fontsize = 18)
plt.gca().set_xticklabels(np.arange(-20, 21, 5))
lol_plt.prettify_axes(plt.gca())
Since these are stomps, the gold difference is bimodal, with peaks at large leads for one side or the other. However, there are a few games with small gold leads. First, let's see how accurate the random forest is for these stomps. (It should be high!)
surrender_forest = RandomForestClassifier(n_jobs = 2, n_estimators = 10, max_features = 'sqrt')
surrender_scores = cross_validation.cross_val_score(surrender_forest, surrender_at_20_df[important_col], surrender_at_20_df['winner'], cv=10)
print('Forest mean accuracy for surrendered games: {:.2f}'.format(np.mean(surrender_scores)))
Forest mean accuracy for surrendered games: 0.99
Pretty damn good! Ok, what happens if we train the model on the "close" games, and use it to predict the probability of winning surrendered ones?
close_forest = RandomForestClassifier(n_jobs = 3, n_estimators = 20, max_features = 'sqrt')
close_forest.fit(good_games_df[important_col], good_games_df['winner'])
cross_score = np.mean(np.max(close_forest.predict_proba(surrender_at_20_df[important_col]), axis = 1))
print('A forest trained on non-surrender games predicts the team that won would do so with {:.2f} probability'.format(cross_score))
A forest trained on non-surrender games predicts the team that won would do so with 0.93 probability
The random forest trained on close games actually gives the losing team in games surrendered at 20 a ~7% chance of coming back and winning!
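Equivalently, you can look directly at the probability the forest assigns to the less likely outcome in each surrendered game; averaged over games it should land around 0.07 (loser_prob is just a name introduced here):

# probability assigned to the less likely (losing) side in each surrendered game
loser_prob = 1 - np.max(close_forest.predict_proba(surrender_at_20_df[important_col]), axis=1)
print('Mean predicted comeback probability: {:.2f}'.format(np.mean(loser_prob)))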
forest_sizes = np.arange(5, 205, 25)
def score_by_size(cur_size):
    size_forest = RandomForestClassifier(n_jobs = 4, n_estimators = cur_size, max_features = 'sqrt')
    return cross_validation.cross_val_score(size_forest, timelines_df[6][important_col], timelines_df[6]['winner'], cv=5, n_jobs = 4)
size_scores = list(map(score_by_size, forest_sizes))
plt.plot(forest_sizes, np.mean(size_scores, 1))
lol_plt.prettify_axes(plt.gca())