In League of Legends, an online battle arena game in which players choose a character (a champion) with unique abilities and fight together in matches against an opposing team. It is very popular, and has a competitive player base.
The mechanics of the game aren't important for understanding this demo, but one mechanic is of interest: before each match, each team can choose to ban up to three champion types from the game. Teams may do this because they perceive particular champions as broken (unfair to use), or to give themselves an advantage in the match. However, it's generally not established what, if any, advantage banning a character gives to a team. We seek to ask questions on this topic here.
Alyssa Adams has pulled data for each match from Riot Games' public API. For each match, we have a directory named with the match's unique ID, and this contains a couple things. For example:
%ls matches/2353211982
This match contains a CSV file giving some of the most interesting match data to Alyssa, which we can parse using pandas
:
import pandas as pd
pd.read_csv('matches/2353211982/match.csv', index_col=0)
Each row corresponds to a given player in the match, and gives information such as what champion they used, which side they were on, how many kills, deaths, etc., and whether their team won.
In addition to this file, we also have a funny looking Treant.*.json
file. This is a datreant
state file, and it's pure JSON:
%cat matches/2353211982/Treant.c7633606-6432-4cb8-a673-1faa37f3a56b.json
We've used this to store information about the match as a whole, not player-specific data. This contains several categories
(key-value pairs), giving things like the map, the game type, the duration, and, importantly for this study, which bans were chosen by each team. We could use any old JSON parser to use this file, such as the built-in json.load
function:
import json
with open('matches/2353211982/Treant.c7633606-6432-4cb8-a673-1faa37f3a56b.json', 'r') as f:
match = json.load(f)
match
But instead we'll use this in a more refined way, using datreant
:
import datreant.core as dtr
match = dtr.Treant('matches/2353211982')
match
This file is a datreant
state file, which stores the state for a Treant
. This is a library that makes it easy to treat directories in the filesystem as logical units that can be introspected easily and tagged with metadata:
match.draw()
match['match.csv'].read()
match.categories
match.categories['Map']
Having our match data in this form lets us ask lots of questions with very little friction. If you're familiar with relational databases, in a way it lets one work with directories as if they were part of a relational database. And the files inside these directories can be anything, not just tables like we have here.
Treant
s allow introspection into their own trees to any level of detail required, but we can do powerful things with them in aggregate. Let's gather all our matches up into a Bundle
:
matches = dtr.discover('matches/')
len(matches)
This Bundle
has a few thousand Treant
s. It's important to remember that a Treant
is just a directory in the filesystem with a state file, which serves as:
Treant
, andThe Treant
object (that is, the Python object in memory) is itself very light. It only really amounts to a reference to the state file, which gets parsed on access of things like tags
and categories
.
A Bundle
can be used to work with a collection of Treant
s as if it was a single logical unit:
matches[:5].draw()
We can, for example, get back any category keys present among these Treant
s:
matches.categories.any.keys()
And perhaps get back the values for one:
matches[:10].categories['Mode']
For starters, we'll ask a simple question. If the champion Teemo is banned, does this make his team more likely to win? Does it make any difference? First, we need to extract matches in which Teemo is banned. Let's keep blue and red bans separate for now, since this might make answering the question easier later.
First, we'll get all the matches in which the blue team banned Teemo:
blueban = dtr.Bundle([matches.categories.groupby(ban)['Teemo'] for ban in ('Blue Ban 1', 'Blue Ban 2', 'Blue Ban 3')])
blueban
And we'll do the same for the red team:
redban = dtr.Bundle([matches.categories.groupby(ban)['Teemo'] for ban in ('Red Ban 1', 'Red Ban 2', 'Red Ban 3')])
redban
Because Bundle
s behave as ordered sets, if we wanted all the matches where Teemo was banned by at least one team, then we could get it with:
redban + blueban
And if we wanted all matches where Teemo was banned by both teams, we could do an intersection of the two:
redban & blueban
This is expected, because both teams can't ban the same champion. This is a nice check. :)
For red team bans, how many of those matches did the red team lose? Looking at what a given match CSV gives us:
df = pd.read_csv(redban[0]['match.csv'].abspath, index_col=0)
df
We could get this by checking if the word 'blue'
is in the 'Win'
column for rows where that column gives True
:
def blue_win(match):
df = pd.read_csv(match['match.csv'].abspath, index_col=0)
return 'blue' in df[df['Win'] == True]['Side'].iloc[0]
So we could then get back a list of bools telling us which matches blue won among those where the red team banned Teemo:
redban_blue_wins = redban.map(blue_win)
redban_blue_wins
Did red win more often than they lost? Here are the matches red won:
len(redban - redban[redban_blue_wins])
Here are the number of matches red lost:
len(redban[redban_blue_wins])
We used this list of bools as a boolean index for our redban
Bundle
, here.
We'll do the same thing for matches in which the blue team banned Teemo:
blueban_blue_wins = blueban.map(blue_win)
blueban_blue_wins
Did blue win more often than they lost? Here are the matches blue won:
len(blueban[blueban_blue_wins])
Here are the number of matches blue lost:
len(blueban - blueban[blueban_blue_wins])
Hmmm...maybe not a signal here, but the number of samples is pretty small anyhow to say anything.
We want to take to task some conventional wisdom that may be floating around the community about which characters you should ban if you want to increase your chances to win. Let's see if this holds water.
First we need to find the most banned character. Probably lots of ways to get at this, but one thing we can do is a groupby
on each ban slot as before, for each building a dictionary giving banned champions as keys and the number of matches they were banned in as values:
counts = pd.DataFrame({ban: {key: len(value) for key, value in matches.categories.groupby(ban).items()} for ban in ('Blue Ban 1', 'Blue Ban 2', 'Blue Ban 3', 'Red Ban 1', 'Red Ban 2', 'Red Ban 3')})
counts
We can then sum across columns and sort to get a ranking of banned champions across all matches:
counts.sum(axis=1).sort_values(ascending=False)
So we have our answer. The most banned is Rengar.
Out of curiosity, what's the distribution of bans look like across champions?
%matplotlib inline
counts.sum(axis=1).sort_values(ascending=False).plot()
We'll do the same analysis we did for Teemo, but now for Rengar.
blueban = dtr.Bundle([matches.categories.groupby(ban)['Rengar'] for ban in ('Blue Ban 1', 'Blue Ban 2', 'Blue Ban 3')])
len(blueban)
redban = dtr.Bundle([matches.categories.groupby(ban)['Rengar'] for ban in ('Red Ban 1', 'Red Ban 2', 'Red Ban 3')])
len(redban)
For red team bans, how many of those matches did the red team lose?
Did red win more often than they lost? Here are the matches red won:
len(redban - redban[redban.map(blue_win)])
Here are the number of matches red lost:
len(redban[redban.map(blue_win)])
Did blue win more often than they lost? Here are the matches blue won:
len(blueban[blueban.map(blue_win)])
Here are the number of matches blue lost:
len(blueban - blueban[blueban.map(blue_win)])
Hmmm...maybe not a signal here...might not be much point overall to banning Rengar. However, it may be the case that if you ban Rengar and perhaps another character this improves your chances? That's an interesting question for another time. Could it also be the case that banning Rengar levels the playing field?
We already know that banning Rengar doesn't seem to make your team more likely to win, but does not banning him make a team with Rengar more likely to win? Is he even picked by a player?
First we need matches where Rengar was not banned. We'll iterate through all ban slots, grab out the matches where the banned champion was Rengar, then combine these Bundle
s into a single one. We can then subtract this Bundle
from all the matches to get only those where Rengar was not banned:
rengar_not_banned = matches - dtr.Bundle([matches.categories.groupby(ban)['Rengar'] for ban in ('Blue Ban 1', 'Blue Ban 2', 'Blue Ban 3', 'Red Ban 1', 'Red Ban 2', 'Red Ban 3')])
len(rengar_not_banned)
So now, which matches was Rengar present in? We can get this like we did before, by checking each match CSV. this time, we just want to know if 'Rengar'
shows up in 'Champion'
column:
def has_rengar(match):
df = pd.read_csv(match['match.csv'].abspath, index_col=0)
return len(df[df['Champion'] == 'Rengar']) == 1
matches_has_rengar = rengar_not_banned[rengar_not_banned.map(has_rengar)]
len(matches_has_rengar)
So, if a team has a Rengar, did they also tend to win? We can check the CSV files again to see if the player using Rengar was also on the winning team:
def rengar_won(match):
df = pd.read_csv(match['match.csv'].abspath, index_col=0)
return bool(df[df['Champion'] == 'Rengar']['Win'].iloc[0])
Number of wins:
len(matches_has_rengar[matches_has_rengar.map(rengar_won, processes=4)])
Number of losses:
len(matches_has_rengar - matches_has_rengar[matches_has_rengar.map(rengar_won, processes=4)])
Doesn't really look like he confers too much of an advantage, but perhaps a slight one?
There are tons of questions we can ask, and even complex ones are surprisingly easy to do so given the way we've structured our data, and how datreant
allows us to work with it. Here's a smattering:
We can do a similar analysis as we did for Rengar on all champions, and if it's the case that having a particular champion on your team results in more wins on average, we can begin drilling down into more questions to establish why. First let's see what we get.
We'll gather all dataframes into a single, big dataframe with an extra index giving the uuid of the Treant
for unique identification. This will make writing code that gets the number of wins/losses for a particular champion faster, since we can operate on a single DataFrame
instead of ~5000:
match_dfs = [pd.read_csv(match['match.csv'].abspath, index_col=0) for match in matches]
bigdf = pd.concat(match_dfs, keys=matches.uuids, names=['uuids'])
bigdf
We'll then write two quick-and-dirty functions (probably better to combine them, but whatever for now), that tell us how many wins a champion saw or how many losses:
def match_result_champion(champion, win=True):
df = bigdf[bigdf['Win'] == win]
df = df[df['Champion'] == champion]
return len(df)
For example:
match_result_champion('Rengar', win=True)
Let's get a full list of the available champions from our dataframe:
champions = list(bigdf['Champion'].unique())
len(champions)
Now we can get the number of wins for every champion:
champion_won = {champion: match_result_champion(champion, win=True) for champion in champions}
...and the number of losses:
champion_lost = {champion: match_result_champion(champion, win=False) for champion in champions}
So now we can build pandas
Series objects directly from these, for easy statistics work:
num_wins = pd.Series(champion_won)
num_lost = pd.Series(champion_lost)
And perhaps we can see which characters correlate with more wins (or losses).
percent_win = ((num_wins - num_lost)*100.0 / (num_wins + num_lost)).sort_values(ascending=False)
percent_win
ax = percent_win.plot.bar(figsize=(40, 8))
for item in ax.get_xticklabels():
item.set_rotation(60)
From here we can start to ask why some champions appear advantageous from these match data.