## World Cup Learning

Here I try to predict FIFA World Cup match results, based on the knowledge of previous matches from the cups since 1950.

I'll use an MLP neural network classifier. My inputs will be the past matches (replacing each team name with a lot of stats from both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins).

I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). A lot of extra useful things are implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier.

In [1]:
from random import random

from IPython.display import SVG
import pygal

from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError

from utils import get_matches, get_team_stats, extract_samples, normalize, split_samples, graph_teams_stat_bars, graph_matches_results_scatter


## Configs

In [2]:
# the features I will feed to the classifier as input data.
input_features = ['year',
'matches_won_percent',
'podium_score_yearly',
'matches_won_percent_2',
'podium_score_yearly_2',]

# the feature giving the result the classifier must learn to predict (I recommend always using 'winner')
output_feature = 'winner'

# used to avoid including tied matches in the learning process. I found this greatly improves the classifier accuracy.
# I know there will be some ties, but I'm willing to fail on those and have better accuracy with all the rest.
# at this point, this code will break if you set it to False, because the network uses a sigmoid function with a
# threshold for output, so it is able to distinguish only 2 kinds of results.
exclude_ties = True

# used to duplicate matches data, reversing the teams (team1->team2, and vice versa).
# This helps with visualizations, and also improves the precision of the predictions by avoiding a dependence on the
# order of the teams in the input.
duplicate_with_reversed = True

In [3]:
def show(graph):
    '''Small utility to display pygal graphs'''
    return SVG(graph.render())


## Team stats

First we need the team stats. We can't feed the classifier inputs like ('Argentina', 'Brazil'); we need to give it numbers. And not just any numbers such as ids, but numbers that could be somewhat related to the result of the matches.

For example: the percentage of matches won by each team is something that could have an impact on the result, so that stat is a very good candidate.

We just calculate a lot of stats per team, and later we will decide which ones to use.

In [4]:
team_stats = get_team_stats()
team_stats

Out[4]:
matches_played matches_won years_played podium_score cups_won matches_won_percent podium_score_yearly cups_won_yearly
team
Brazil 89 63 16 102 5 70.786517 6.375000 0.312500
Canada 3 0 1 0 0 0.000000 0.000000 0.000000
Serbia and Montenegro 3 0 1 0 0 0.000000 0.000000 0.000000
Kuwait 3 0 1 0 0 0.000000 0.000000 0.000000
Scotland 23 4 8 0 0 17.391304 0.000000 0.000000
Costa Rica 10 3 3 0 0 30.000000 0.000000 0.000000
Ivory Coast 6 2 2 0 0 33.333333 0.000000 0.000000
Wales 5 1 1 0 0 20.000000 0.000000 0.000000
Argentina 64 33 13 40 2 51.562500 3.076923 0.153846
Bolivia 4 0 2 0 0 0.000000 0.000000 0.000000
Cameroon 20 4 6 0 0 20.000000 0.000000 0.000000
Ecuador 7 3 2 0 0 42.857143 0.000000 0.000000
Ghana 9 4 2 0 0 44.444444 0.000000 0.000000
Saudi Arabia 13 2 4 0 0 15.384615 0.000000 0.000000
Australia 10 2 3 0 0 20.000000 0.000000 0.000000
Iran 9 1 3 0 0 11.111111 0.000000 0.000000
Algeria 9 2 3 0 0 22.222222 0.000000 0.000000
El Salvador 6 0 2 0 0 0.000000 0.000000 0.000000
Republic of Ireland 13 2 3 0 0 15.384615 0.000000 0.000000
Slovenia 6 1 2 0 0 16.666667 0.000000 0.000000
Chile 26 7 7 4 0 26.923077 0.571429 0.000000
Belgium 32 10 8 2 0 31.250000 0.250000 0.000000
Haiti 3 0 1 0 0 0.000000 0.000000 0.000000
Iraq 3 0 1 0 0 0.000000 0.000000 0.000000
Spain 53 27 12 18 1 50.943396 1.500000 0.083333
China PR 3 0 1 0 0 0.000000 0.000000 0.000000
Netherlands 41 22 7 26 0 53.658537 3.714286 0.000000
Denmark 16 8 4 0 0 50.000000 0.000000 0.000000
Poland 30 15 6 8 0 50.000000 1.333333 0.000000
Morocco 13 2 4 0 0 15.384615 0.000000 0.000000
Croatia 13 6 3 4 0 46.153846 1.333333 0.000000
Switzerland 24 7 7 0 0 29.166667 0.000000 0.000000
Honduras 6 0 2 0 0 0.000000 0.000000 0.000000
New Zealand 6 0 2 0 0 0.000000 0.000000 0.000000
Jamaica 3 1 1 0 0 33.333333 0.000000 0.000000
England 59 26 13 18 1 44.067797 1.384615 0.076923
Uruguay 43 14 10 22 1 32.558140 2.200000 0.100000
United Arab Emirates 3 0 1 0 0 0.000000 0.000000 0.000000
South Africa 9 2 3 0 0 22.222222 0.000000 0.000000
Egypt 3 0 1 0 0 0.000000 0.000000 0.000000
Colombia 13 3 4 0 0 23.076923 0.000000 0.000000
South Korea 28 5 8 2 0 17.857143 0.250000 0.000000
Turkey 10 5 2 4 0 50.000000 2.000000 0.000000
Italy 71 36 15 54 2 50.704225 3.600000 0.133333
Czech Republic 3 1 1 0 0 33.333333 0.000000 0.000000
France 48 23 10 34 1 47.916667 3.400000 0.100000
Slovakia 4 1 1 0 0 25.000000 0.000000 0.000000
Peru 13 4 3 0 0 30.769231 0.000000 0.000000
Norway 7 2 2 0 0 28.571429 0.000000 0.000000
Nigeria 14 4 4 0 0 28.571429 0.000000 0.000000
Israel 3 0 1 0 0 0.000000 0.000000 0.000000
Zaire 3 0 1 0 0 0.000000 0.000000 0.000000
Czechoslovakia 23 7 6 8 0 30.434783 1.333333 0.000000
Austria 25 10 6 4 0 40.000000 0.666667 0.000000
Togo 3 0 1 0 0 0.000000 0.000000 0.000000
Germany 98 59 15 94 3 60.204082 6.266667 0.200000
Ukraine 5 2 1 0 0 40.000000 0.000000 0.000000
Northern Ireland 13 3 3 0 0 23.076923 0.000000 0.000000
United States 25 5 7 0 0 20.000000 0.000000 0.000000
Trinidad and Tobago 3 0 1 0 0 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ...

76 rows × 8 columns
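
The three derived columns in the table above look like simple ratios of the raw counts. The sketch below recomputes them for Brazil's row; the formulas are my assumption (inferred from the numbers shown, not taken from utils.py), and `derived_stats` is a hypothetical helper.

```python
def derived_stats(matches_played, matches_won, years_played, podium_score, cups_won):
    """Recompute the derived columns from the raw counts (assumed formulas)."""
    return {
        # percentage of the played matches that were won
        'matches_won_percent': 100.0 * matches_won / matches_played,
        # podium points averaged over the cups the team played
        'podium_score_yearly': float(podium_score) / years_played,
        # cups won averaged over the cups the team played
        'cups_won_yearly': float(cups_won) / years_played,
    }

# Brazil's row: 89 matches played, 63 won, 16 cups played,
# podium score 102, 5 cups won.
brazil = derived_stats(89, 63, 16, 102, 5)
print(brazil)
```

The values match the table (70.786517, 6.375 and 0.3125), which suggests the "yearly" columns are per cup played rather than per calendar year.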

Let's visualize some of those stats, just because it helps paint a bigger picture of how good the teams are.

(you can hover with your mouse over the '...' on the x axis to see the team name)

In [5]:
show(graph_teams_stat_bars(team_stats, 'matches_won_percent'))

Out[5]:

Podium score is an invented measure of how good the teams are, based on the top 4 teams from each cup. The first team receives 8 points, the second 4, the third 2, and the fourth 1. All the rest receive 0 points. As you can see, the scoring is exponential, because each position implies an exponentially bigger number of matches won than the next one.
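
As a rough illustration of that scoring rule, here is a minimal sketch. The point values come from the description above; the function names and the input format (a list of final positions, one per cup played) are assumptions of mine and not how utils.py necessarily does it.

```python
# Points per final position: 8, 4, 2, 1, and 0 for everyone else.
PODIUM_POINTS = {1: 8, 2: 4, 3: 2, 4: 1}

def podium_score(final_positions):
    """Sum the podium points over a team's final positions (one per cup played)."""
    return sum(PODIUM_POINTS.get(position, 0) for position in final_positions)

def podium_score_yearly(final_positions):
    """Average podium points per cup played."""
    return float(podium_score(final_positions)) / len(final_positions)

# Hypothetical team: champion once, third once, out of the top four twice.
example_positions = [1, 3, 9, 17]
print(podium_score(example_positions))         # 8 + 2 = 10
print(podium_score_yearly(example_positions))  # 10 / 4 = 2.5
```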

In [6]:
show(graph_teams_stat_bars(team_stats, 'podium_score_yearly'))

Out[6]:

## Matches

Now we need to get the matches data, including the "reversed" duplication of matches, and adding the team stats in each match.

In [7]:
matches = get_matches(with_team_stats=True,
duplicate_with_reversed=duplicate_with_reversed,
exclude_ties=exclude_ties)

matches

Out[7]:
score1 score2 team1 team2 year score_diff winner matches_played matches_won years_played podium_score cups_won matches_won_percent podium_score_yearly cups_won_yearly matches_played_2 matches_won_2 years_played_2 podium_score_2 cups_won_2
0 4 0 Brazil Mexico 1950 4 1 89 63 16 102 5 70.786517 6.375000 0.312500 46 12 13 0 0 ...
1 3 0 Yugoslavia Switzerland 1950 3 1 34 14 8 2 0 41.176471 0.250000 0.000000 24 7 7 0 0 ...
3 4 1 Yugoslavia Mexico 1950 3 1 34 14 8 2 0 41.176471 0.250000 0.000000 46 12 13 0 0 ...
4 2 0 Brazil Yugoslavia 1950 2 1 89 63 16 102 5 70.786517 6.375000 0.312500 34 14 8 2 0 ...
5 2 1 Switzerland Mexico 1950 1 1 24 7 7 0 0 29.166667 0.000000 0.000000 46 12 13 0 0 ...
6 2 0 England Chile 1950 2 1 59 26 13 18 1 44.067797 1.384615 0.076923 26 7 7 4 0 ...
7 3 1 Spain United States 1950 2 1 53 27 12 18 1 50.943396 1.500000 0.083333 25 5 7 0 0 ...
8 2 0 Spain Chile 1950 2 1 53 27 12 18 1 50.943396 1.500000 0.083333 26 7 7 4 0 ...
9 1 0 United States England 1950 1 1 25 5 7 0 0 20.000000 0.000000 0.000000 59 26 13 18 1 ...
10 1 0 Spain England 1950 1 1 53 27 12 18 1 50.943396 1.500000 0.083333 59 26 13 18 1 ...
11 5 2 Chile United States 1950 3 1 26 7 7 4 0 26.923077 0.571429 0.000000 25 5 7 0 0 ...
12 3 2 Sweden Italy 1950 1 1 41 14 9 16 0 34.146341 1.777778 0.000000 71 36 15 54 2 ...
14 2 0 Italy Paraguay 1950 2 1 71 36 15 54 2 50.704225 3.600000 0.133333 25 6 7 0 0 ...
15 8 0 Uruguay Bolivia 1950 8 1 43 14 10 22 1 32.558140 2.200000 0.100000 4 0 2 0 0 ...
17 7 1 Brazil Sweden 1950 6 1 89 63 16 102 5 70.786517 6.375000 0.312500 41 14 9 16 0 ...
18 6 1 Brazil Spain 1950 5 1 89 63 16 102 5 70.786517 6.375000 0.312500 53 27 12 18 1 ...
19 3 2 Uruguay Sweden 1950 1 1 43 14 10 22 1 32.558140 2.200000 0.100000 41 14 9 16 0 ...
20 3 1 Sweden Spain 1950 2 1 41 14 9 16 0 34.146341 1.777778 0.000000 53 27 12 18 1 ...
21 2 1 Uruguay Brazil 1950 1 1 43 14 10 22 1 32.558140 2.200000 0.100000 89 63 16 102 5 ...
22 5 0 Brazil Mexico 1954 5 1 89 63 16 102 5 70.786517 6.375000 0.312500 46 12 13 0 0 ...
23 1 0 Yugoslavia France 1954 1 1 34 14 8 2 0 41.176471 0.250000 0.000000 48 23 10 34 1 ...
25 3 2 France Mexico 1954 1 1 48 23 10 34 1 47.916667 3.400000 0.100000 46 12 13 0 0 ...
26 4 1 Germany Turkey 1954 3 1 98 59 15 94 3 60.204082 6.266667 0.200000 10 5 2 4 0 ...
27 9 0 Hungary South Korea 1954 9 1 26 11 7 8 0 42.307692 1.142857 0.000000 28 5 8 2 0 ...
28 8 3 Hungary Germany 1954 5 1 26 11 7 8 0 42.307692 1.142857 0.000000 98 59 15 94 3 ...
29 7 0 Turkey South Korea 1954 7 1 10 5 2 4 0 50.000000 2.000000 0.000000 28 5 8 2 0 ...
30 7 2 Germany Turkey 1954 5 1 98 59 15 94 3 60.204082 6.266667 0.200000 10 5 2 4 0 ...
31 2 0 Uruguay Czechoslovakia 1954 2 1 43 14 10 22 1 32.558140 2.200000 0.100000 23 7 6 8 0 ...
32 1 0 Austria Scotland 1954 1 1 25 10 6 4 0 40.000000 0.666667 0.000000 23 4 8 0 0 ...
33 7 0 Uruguay Scotland 1954 7 1 43 14 10 22 1 32.558140 2.200000 0.100000 23 4 8 0 0 ...
34 5 0 Austria Czechoslovakia 1954 5 1 25 10 6 4 0 40.000000 0.666667 0.000000 23 7 6 8 0 ...
35 2 1 Switzerland Italy 1954 1 1 24 7 7 0 0 29.166667 0.000000 0.000000 71 36 15 54 2 ...
37 4 1 Italy Belgium 1954 3 1 71 36 15 54 2 50.704225 3.600000 0.133333 32 10 8 2 0 ...
38 2 0 England Switzerland 1954 2 1 59 26 13 18 1 44.067797 1.384615 0.076923 24 7 7 0 0 ...
39 4 1 Switzerland Italy 1954 3 1 24 7 7 0 0 29.166667 0.000000 0.000000 71 36 15 54 2 ...
40 7 5 Austria Switzerland 1954 2 1 25 10 6 4 0 40.000000 0.666667 0.000000 24 7 7 0 0 ...
41 4 2 Uruguay England 1954 2 1 43 14 10 22 1 32.558140 2.200000 0.100000 59 26 13 18 1 ...
42 2 4 Brazil Hungary 1954 -2 2 89 63 16 102 5 70.786517 6.375000 0.312500 26 11 7 8 0 ...
43 0 2 Yugoslavia Germany 1954 -2 2 34 14 8 2 0 41.176471 0.250000 0.000000 98 59 15 94 3 ...
44 4 2 Hungary Uruguay 1954 2 1 26 11 7 8 0 42.307692 1.142857 0.000000 43 14 10 22 1 ...
45 6 1 Germany Austria 1954 5 1 98 59 15 94 3 60.204082 6.266667 0.200000 25 10 6 4 0 ...
46 1 3 Uruguay Austria 1954 -2 2 43 14 10 22 1 32.558140 2.200000 0.100000 25 10 6 4 0 ...
47 2 3 Hungary Germany 1954 -1 2 26 11 7 8 0 42.307692 1.142857 0.000000 98 59 15 94 3 ...
48 3 1 Germany Argentina 1958 2 1 98 59 15 94 3 60.204082 6.266667 0.200000 64 33 13 40 2 ...
49 1 0 Northern Ireland Czechoslovakia 1958 1 1 13 3 3 0 0 23.076923 0.000000 0.000000 23 7 6 8 0 ...
50 3 1 Argentina Northern Ireland 1958 2 1 64 33 13 40 2 51.562500 3.076923 0.153846 13 3 3 0 0 ...
53 6 1 Czechoslovakia Argentina 1958 5 1 23 7 6 8 0 30.434783 1.333333 0.000000 64 33 13 40 2 ...
54 2 1 Northern Ireland Czechoslovakia 1958 1 1 13 3 3 0 0 23.076923 0.000000 0.000000 23 7 6 8 0 ...
55 7 3 France Paraguay 1958 4 1 48 23 10 34 1 47.916667 3.400000 0.100000 25 6 7 0 0 ...
57 3 2 Yugoslavia France 1958 1 1 34 14 8 2 0 41.176471 0.250000 0.000000 48 23 10 34 1 ...
58 3 2 Paraguay Scotland 1958 1 1 25 6 7 0 0 24.000000 0.000000 0.000000 23 4 8 0 0 ...
59 2 1 France Scotland 1958 1 1 48 23 10 34 1 47.916667 3.400000 0.100000 23 4 8 0 0 ...
61 3 0 Sweden Mexico 1958 3 1 41 14 9 16 0 34.146341 1.777778 0.000000 46 12 13 0 0 ...
64 2 1 Sweden Hungary 1958 1 1 41 14 9 16 0 34.146341 1.777778 0.000000 26 11 7 8 0 ...
66 4 0 Hungary Mexico 1958 4 1 26 11 7 8 0 42.307692 1.142857 0.000000 46 12 13 0 0 ...
67 2 1 Wales Hungary 1958 1 1 5 1 1 0 0 20.000000 0.000000 0.000000 26 11 7 8 0 ...
68 3 0 Brazil Austria 1958 3 1 89 63 16 102 5 70.786517 6.375000 0.312500 25 10 6 4 0 ...
71 2 0 Russia Austria 1958 2 1 37 17 9 2 0 45.945946 0.222222 0.000000 25 10 6 4 0 ...
73 2 0 Brazil Russia 1958 2 1 89 63 16 102 5 70.786517 6.375000 0.312500 37 17 9 2 0 ...
74 1 0 Russia England 1958 1 1 37 17 9 2 0 45.945946 0.222222 0.000000 59 26 13 18 1 ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

1100 rows × 23 columns
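
To make the "reversed" duplication concrete, here is a small sketch of the idea using plain dicts instead of a DataFrame. The column names come from the table above, but `reverse_match` and `duplicate_with_reversed_matches` are hypothetical helpers, and the real implementation in utils.py would also have to swap the per-team stat columns (for example `matches_won_percent` with `matches_won_percent_2`).

```python
def reverse_match(match):
    """Return a copy of the match seen from the other team's point of view."""
    reversed_match = dict(match)
    reversed_match['team1'], reversed_match['team2'] = match['team2'], match['team1']
    reversed_match['score1'], reversed_match['score2'] = match['score2'], match['score1']
    reversed_match['score_diff'] = -match['score_diff']
    # a tie (0) stays a tie; 1 and 2 swap places
    reversed_match['winner'] = {0: 0, 1: 2, 2: 1}[match['winner']]
    return reversed_match

def duplicate_with_reversed_matches(matches):
    """Return the original matches followed by their reversed copies."""
    return matches + [reverse_match(match) for match in matches]

original = [{'team1': 'Brazil', 'team2': 'Mexico', 'score1': 4, 'score2': 0,
             'score_diff': 4, 'winner': 1}]
doubled = duplicate_with_reversed_matches(original)
print(doubled[1])
```

After the duplication, every match appears once with each team in the `team1` slot, so both outcomes (1 and 2) are equally frequent in the data.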

Can the results be classified? Can we see a pattern, some kind of grouping of results based on the stats of both teams?

Let's try visualizing two of the most interesting ones: matches won percent, and podium score yearly (mean).

In [8]:
show(graph_matches_results_scatter(matches, 'matches_won_percent', 'matches_won_percent_2'))

Out[8]:
In [9]:
show(graph_matches_results_scatter(matches, 'podium_score_yearly', 'podium_score_yearly_2'))

Out[9]:

Before any conclusions: there is more there than what you can see with your eyes. At some locations there could be more than one point, and you only see the one on top.

The first graph tells us something that most people already expect: there is a slight tendency in the results, and the team with the better percentage of matches won tends to win. The second graph also shows a similar relation between podium score yearly and the result, even if it's not visible to the eye because of the overlapping dots.

But remember, the classifier can learn a lot more than just those simple relations based on the info we give to it. These graphs were just a screening to confirm some basic intuitions.

## Learn

Ok, now we have everything we need. Let's feed the selected input features to the neural network classifier, and let it learn.

We have to normalize the data; otherwise the features with larger numeric values (such as the year) would carry a disproportionate weight in the prediction.
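
The `normalize` helper lives in utils.py and isn't shown here, so the following is only a sketch of one common way to do it: standardizing each column to zero mean and unit standard deviation. The function names and the toy values are assumptions for illustration.

```python
def fit_standardizer(rows):
    """Compute the per-column mean and standard deviation of a list of rows."""
    n_rows = float(len(rows))
    columns = list(zip(*rows))
    means = [sum(column) / n_rows for column in columns]
    stds = []
    for column, mean in zip(columns, means):
        variance = sum((value - mean) ** 2 for value in column) / n_rows
        # fall back to 1.0 so constant columns don't cause a division by zero
        stds.append(variance ** 0.5 or 1.0)
    return means, stds

def standardize(row, means, stds):
    """Scale one row with previously fitted means and standard deviations."""
    return [(value - mean) / std for value, mean, std in zip(row, means, stds)]

# toy samples: (year, matches_won_percent)
samples = [[1950, 70.8], [1990, 20.0], [2010, 50.0]]
means, stds = fit_standardizer(samples)
scaled = [standardize(row, means, stds) for row in samples]
print(scaled)
```

Keeping `means` and `stds` around matters: new inputs at prediction time have to be scaled with the same values that were fitted on the training data.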

Also, we use a percentage of the inputs to train, but keep the rest "hidden": we don't let the classifier see them while learning. After the training we use those hidden inputs to "test" the ability of the classifier to predict data it has never seen before (and for which we already know the correct answer).
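
A minimal sketch of such a split, assuming a random shuffle and a 25% test share (both assumptions of mine; the actual `split_samples` in utils.py may work differently). The return order mirrors how `split_samples` is used in the next cell.

```python
import random

def split_train_test(inputs, outputs, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into train and test sets."""
    indexes = list(range(len(inputs)))
    random.Random(seed).shuffle(indexes)  # seeded, so the split is repeatable

    n_test = int(len(indexes) * test_fraction)
    test_indexes, train_indexes = indexes[:n_test], indexes[n_test:]

    return ([inputs[i] for i in train_indexes],
            [outputs[i] for i in train_indexes],
            [inputs[i] for i in test_indexes],
            [outputs[i] for i in test_indexes])

# toy data: 100 samples with a single feature each
toy_inputs = [[i] for i in range(100)]
toy_outputs = [1 + i % 2 for i in range(100)]
train_in, train_out, test_in, test_out = split_train_test(toy_inputs, toy_outputs)
print(len(train_in), len(test_in))
```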

In [10]:
inputs, outputs = extract_samples(matches,
input_features,
output_feature)

normalizer, inputs = normalize(inputs)

train_inputs, train_outputs, test_inputs, test_outputs = split_samples(inputs, outputs)

n = buildNetwork(len(input_features),
10 * len(input_features),
10 * len(input_features),
1,
outclass=SigmoidLayer,
bias=True)


To be able to evaluate the results and show progress during the learning cycle, we need these two functions, which help us calculate how well the network can predict the results of the matches used to learn, and of the matches it doesn't know.

In [11]:
def neural_result(input):
    """Call the neural network, and translate its output to a match result."""
    # the single sigmoid output is between 0 and 1: 0.5 or more means team2 wins
    n_output = n.activate(input)
    if n_output >= 0.5:
        return 2
    else:
        return 1

def test_network():
    """Print the accuracy (in percent) on the train and test sets."""
    print (100 - percentError(map(neural_result, train_inputs), train_outputs),
           100 - percentError(map(neural_result, test_inputs), test_outputs))
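
For readers without pybrain at hand, the accuracy printed by `test_network` boils down to counting mismatches. The following pure-Python sketch mimics what I understand `percentError` to do (percentage of predictions that differ from the expected values); `percent_error` and `accuracy` are hypothetical stand-ins, not pybrain functions.

```python
def percent_error(predicted, expected):
    """Percentage of positions where the prediction differs from the expected value."""
    predicted, expected = list(predicted), list(expected)
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return 100.0 * wrong / len(predicted)

def accuracy(predicted, expected):
    """Percentage of correct predictions."""
    return 100.0 - percent_error(predicted, expected)

# toy example: one wrong prediction out of four
example_predicted = [1, 2, 2, 1]
example_expected = [1, 2, 1, 1]
print(percent_error(example_predicted, example_expected))  # 25.0
print(accuracy(example_predicted, example_expected))       # 75.0
```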


Create a train set (a kind of dataset that pybrain uses to train neural networks), and display initial accuracy on both sets (train and test).

In [12]:
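train_set = ClassificationDataSet(len(input_features))

for i, input_line in enumerate(train_inputs):
    # the network has a single sigmoid output, so the winner (1 or 2)
    # is stored as 0 or 1, matching the threshold used in neural_result
    train_set.addSample(input_line, [train_outputs[i] - 1])

trainer = BackpropTrainer(n, dataset=train_set, momentum=0.5, weightdecay=0.0)

train_set.assignClasses()

test_network()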

(50.78979343863912, 54.51263537906137)


Train the network for a given number of iterations. You can re-run this step many times, and it will keep learning, but as you know, if you train too much you can end up overfitting the training data (this is visible when the test set accuracy starts to decrease).

In [13]:
for i in range(20):
    trainer.train()
    test_network()

(72.17496962332928, 77.9783393501805)
(73.02551640340218, 75.09025270758123)
(73.63304981773997, 75.81227436823104)
(73.63304981773997, 75.45126353790613)
(73.87606318347508, 74.72924187725631)
(74.24058323207777, 74.0072202166065)
(74.36208991494533, 74.36823104693141)
(73.87606318347508, 76.17328519855596)
(74.48359659781288, 75.09025270758123)
(73.99756986634264, 75.45126353790613)
(73.2685297691373, 72.20216606498195)
(74.726609963548, 74.36823104693141)
(74.726609963548, 74.72924187725631)
(74.24058323207777, 75.81227436823104)
(74.60510328068044, 75.09025270758123)
(74.96962332928311, 74.72924187725631)
(74.726609963548, 75.09025270758123)
(74.24058323207777, 72.92418772563177)
(74.36208991494533, 74.72924187725631)
(74.60510328068044, 76.17328519855596)
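
One way to act on the overfitting remark is to keep track of the iteration with the best test accuracy and stop once it hasn't improved for a while. This is only a sketch: `best_iteration`, `should_stop` and the `patience` parameter are hypothetical names, and strictly speaking a separate validation set should be used for this decision rather than the test set.

```python
def best_iteration(test_accuracies):
    """Return (index, accuracy) of the iteration with the highest test accuracy."""
    best_index = max(range(len(test_accuracies)), key=lambda i: test_accuracies[i])
    return best_index, test_accuracies[best_index]

def should_stop(test_accuracies, patience=5):
    """True when the best score is at least `patience` iterations old."""
    best_index, _ = best_iteration(test_accuracies)
    return len(test_accuracies) - 1 - best_index >= patience

# the first six test accuracies from the run above, rounded
history = [77.98, 75.09, 75.81, 75.45, 74.73, 74.01]
print(best_iteration(history))  # (0, 77.98)
print(should_stop(history))     # True: no improvement in the last 5 iterations
```

In this run the scores mostly hover around 75% with no clear downward trend, so the fluctuations look more like noise than overfitting.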


The closer this score is to 100%, the better the classifier is doing its predictions. A score of 100 would mean the classifier always predicts the exact real result, something impossible.

Something around 75% sounds impressive, but keep in mind that just tossing a coin will get you 50%, because with ties excluded there are only two possible results. So this sits in the middle between tossing a coin and having a time machine.

## Predict

With the classifier already trained, we can start making predictions. But we need a little function able to translate inputs like this: (2014, 'Argentina', 'Brazil'), to the numeric inputs the classifier expects (based on the input features).

This function does the conversion, also normalizes the data with the same normalizer used before, and then just asks the classifier for the prediction.

In [14]:
def predict(year, team1, team2):
    """Predict the winner of a match, returning the team name (or 'tie')."""
    inputs = []

    for feature in input_features:
        # features ending in '_2' are stats of the second team
        from_team_2 = '_2' in feature
        feature = feature.replace('_2', '')

        if feature in team_stats.columns.values:
            team = team2 if from_team_2 else team1
            value = team_stats.loc[team, feature]
        elif feature == 'year':
            value = year
        else:
            raise ValueError("Don't know where to get feature: " + feature)

        inputs.append(value)

    inputs = normalizer.transform(inputs)
    result = neural_result(inputs)

    if result == 0:
        return 'tie'
    elif result == 1:
        return team1
    elif result == 2:
        return team2
    else:
        return 'Unknown result: ' + str(result)
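
The `'_2' in feature` check above works for the current feature names, but it would misfire on any name that contains `_2` somewhere in the middle. A slightly stricter variant only looks at the suffix; `split_feature` is a hypothetical helper, shown as a sketch and not as a drop-in replacement.

```python
def split_feature(feature):
    """Return (base_feature, team_number) for an input feature name."""
    if feature.endswith('_2'):
        return feature[:-2], 2
    return feature, 1

for name in ['year', 'matches_won_percent', 'podium_score_yearly_2']:
    print(name, split_feature(name))
```

Features without the suffix, including `year`, come back with team number 1; the caller still has to treat `year` separately, as `predict` does.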


## Some predictions about the past, compared to real results:

Even though we know those results and some of them were used to train, that doesn't guarantee the real result is what the classifier will predict.

In [15]:
predict(1950, 'Mexico', 'Brazil')  # real result: 4-0 wins Brazil

Out[15]:
'Brazil'
In [16]:
predict(1990, 'United Arab Emirates', 'Colombia')  # real result: 2-0 wins Colombia

Out[16]:
'Colombia'
In [17]:
predict(2002, 'South Africa', 'Spain')  # real result: 2-3 wins Spain

Out[17]:
'Spain'
In [18]:
predict(2010, 'Japan', 'Cameroon')  # real result: 1-0 wins Japan

Out[18]:
'Japan'

## Some predictions about the future:

(at least these were "future" at the moment of programming)

In [19]:
predict(2014, 'Argentina', 'Brazil')

Out[19]:
'Argentina'
In [20]:
predict(2014, 'Spain', 'Haiti')

Out[20]:
'Spain'
In [21]:
predict(2014, 'Russia', 'Germany')

Out[21]:
'Germany'
In [22]:
predict(2014, 'Russia', 'Russia')

Out[22]:
'Russia'