World Cup Learning¶

Here I try to predict the results of FIFA World Cup matches, based on the knowledge of previous matches from the cups since 1950.

I'll use an MLP neural network classifier. My inputs will be the past matches (replacing each team name with a set of stats for both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins).
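As a minimal sketch of that output encoding, assuming only the two final scores of a match are known, the label could be derived like this (`match_result` is a hypothetical helper for illustration, not something taken from utils.py):

```python
def match_result(score1, score2):
    """Encode a match result: 0 = tie, 1 = team1 wins, 2 = team2 wins."""
    if score1 == score2:
        return 0
    return 1 if score1 > score2 else 2


# Brazil 4-0 Mexico (1950): team1 scored more goals, so the label is 1.
print(match_result(4, 0))
```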

I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). A lot of extra useful things are implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier.

In [1]:

from random import random

from IPython.display import SVG
import pygal

from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError

from utils import (get_matches, get_team_stats, extract_samples, normalize,
                   split_samples, graph_teams_stat_bars,
                   graph_matches_results_scatter)

Configs¶

In [2]:

# the features I will feed to the classifier as input data.
input_features = ['year',
                  'matches_won_percent',
                  'podium_score_yearly',
                  'matches_won_percent_2',
                  'podium_score_yearly_2',]

# the feature giving the result the classifier must learn to predict (I recommend always using 'winner')
output_feature = 'winner'

# used to avoid including tied matches in the learning process. I found this greatly improves the classifier accuracy.
# I know there will be some ties, but I'm willing to fail on those and have better accuracy with all the rest.
# at this point, this code will break if you set it to False, because the network uses a sigmoid function with a
# threshold for output, so it is able to distinguish only 2 kinds of results.
exclude_ties = True

# used to duplicate matches data, reversing the teams (team1->team2, and vice versa).
# This helps with visualizations, and also improves the precision of the predictions by avoiding a dependence on the
# order of the teams in the input.
duplicate_with_reversed = True

In [3]:

def show(graph):
    '''Small utility to display pygal graphs'''
    return SVG(graph.render())

Team stats¶

First we need the team stats. We can't feed the classifier inputs like ('Argentina', 'Brazil'); we need to give it numbers. And not just any numbers, such as ids, but numbers that could be somewhat related to the result of the matches.

For example: the percentage of matches won by each team is something that could have an impact on the result, so that stat is a very good candidate.

We just calculate a lot of stats per team, and later we will decide which ones to use.

In [4]:

team_stats = get_team_stats()
team_stats

Out[4]:

	matches_played	matches_won	years_played	podium_score	cups_won	matches_won_percent	podium_score_yearly	cups_won_yearly
team
Brazil	89	63	16	102	5	70.786517	6.375000	0.312500
Canada	3	0	1	0	0	0.000000	0.000000	0.000000
Serbia and Montenegro	3	0	1	0	0	0.000000	0.000000	0.000000
Kuwait	3	0	1	0	0	0.000000	0.000000	0.000000
Scotland	23	4	8	0	0	17.391304	0.000000	0.000000
Costa Rica	10	3	3	0	0	30.000000	0.000000	0.000000
Ivory Coast	6	2	2	0	0	33.333333	0.000000	0.000000
Wales	5	1	1	0	0	20.000000	0.000000	0.000000
Argentina	64	33	13	40	2	51.562500	3.076923	0.153846
Bolivia	4	0	2	0	0	0.000000	0.000000	0.000000
Cameroon	20	4	6	0	0	20.000000	0.000000	0.000000
Ecuador	7	3	2	0	0	42.857143	0.000000	0.000000
Ghana	9	4	2	0	0	44.444444	0.000000	0.000000
Saudi Arabia	13	2	4	0	0	15.384615	0.000000	0.000000
Australia	10	2	3	0	0	20.000000	0.000000	0.000000
Iran	9	1	3	0	0	11.111111	0.000000	0.000000
Algeria	9	2	3	0	0	22.222222	0.000000	0.000000
El Salvador	6	0	2	0	0	0.000000	0.000000	0.000000
Republic of Ireland	13	2	3	0	0	15.384615	0.000000	0.000000
Slovenia	6	1	2	0	0	16.666667	0.000000	0.000000
Chile	26	7	7	4	0	26.923077	0.571429	0.000000
Belgium	32	10	8	2	0	31.250000	0.250000	0.000000
Haiti	3	0	1	0	0	0.000000	0.000000	0.000000
Iraq	3	0	1	0	0	0.000000	0.000000	0.000000
Spain	53	27	12	18	1	50.943396	1.500000	0.083333
China PR	3	0	1	0	0	0.000000	0.000000	0.000000
Netherlands	41	22	7	26	0	53.658537	3.714286	0.000000
Denmark	16	8	4	0	0	50.000000	0.000000	0.000000
Poland	30	15	6	8	0	50.000000	1.333333	0.000000
Morocco	13	2	4	0	0	15.384615	0.000000	0.000000
Croatia	13	6	3	4	0	46.153846	1.333333	0.000000
Switzerland	24	7	7	0	0	29.166667	0.000000	0.000000
Honduras	6	0	2	0	0	0.000000	0.000000	0.000000
New Zealand	6	0	2	0	0	0.000000	0.000000	0.000000
Jamaica	3	1	1	0	0	33.333333	0.000000	0.000000
England	59	26	13	18	1	44.067797	1.384615	0.076923
Uruguay	43	14	10	22	1	32.558140	2.200000	0.100000
United Arab Emirates	3	0	1	0	0	0.000000	0.000000	0.000000
South Africa	9	2	3	0	0	22.222222	0.000000	0.000000
Egypt	3	0	1	0	0	0.000000	0.000000	0.000000
Colombia	13	3	4	0	0	23.076923	0.000000	0.000000
South Korea	28	5	8	2	0	17.857143	0.250000	0.000000
Turkey	10	5	2	4	0	50.000000	2.000000	0.000000
Italy	71	36	15	54	2	50.704225	3.600000	0.133333
Czech Republic	3	1	1	0	0	33.333333	0.000000	0.000000
France	48	23	10	34	1	47.916667	3.400000	0.100000
Slovakia	4	1	1	0	0	25.000000	0.000000	0.000000
Peru	13	4	3	0	0	30.769231	0.000000	0.000000
Norway	7	2	2	0	0	28.571429	0.000000	0.000000
Nigeria	14	4	4	0	0	28.571429	0.000000	0.000000
Israel	3	0	1	0	0	0.000000	0.000000	0.000000
Zaire	3	0	1	0	0	0.000000	0.000000	0.000000
Czechoslovakia	23	7	6	8	0	30.434783	1.333333	0.000000
Austria	25	10	6	4	0	40.000000	0.666667	0.000000
Togo	3	0	1	0	0	0.000000	0.000000	0.000000
Germany	98	59	15	94	3	60.204082	6.266667	0.200000
Ukraine	5	2	1	0	0	40.000000	0.000000	0.000000
Northern Ireland	13	3	3	0	0	23.076923	0.000000	0.000000
United States	25	5	7	0	0	20.000000	0.000000	0.000000
Trinidad and Tobago	3	0	1	0	0	0.000000	0.000000	0.000000
	...	...	...	...	...	...	...	...

76 rows × 8 columns
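The derived columns in the table above can be reproduced by hand. This is only a sketch of the arithmetic, assuming the percentage is wins over matches played and the "yearly" columns are totals divided by the number of cups played; the helper names are hypothetical and the real computation lives in utils.py.

```python
def matches_won_percent(matches_won, matches_played):
    """Percentage of played matches that the team won (0 if it never played)."""
    if matches_played == 0:
        return 0.0
    return 100.0 * matches_won / matches_played


def per_year(total, years_played):
    """Mean of a cumulative stat over the cups the team played."""
    return float(total) / years_played


# Brazil's row: 63 wins in 89 matches, 5 cups won in 16 cups played.
print(round(matches_won_percent(63, 89), 6))
print(per_year(5, 16))
```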

Let's visualize some of those stats, just because it helps paint a bigger picture of how good the teams are.

(you can hover with your mouse over the '...' on the x axis to see the team name)

In [5]:

show(graph_teams_stat_bars(team_stats, 'matches_won_percent'))

Out[5]:

Podium score is an invented measure of how good the teams are, based on the top 4 teams of each cup. The first team receives 8 points, the second 4, the third 2, and the fourth 1. All the rest receive 0 points. As you can see, the scoring is exponential, because each position implies an exponentially bigger number of matches won than the next one.
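A small sketch of that scoring rule, following the point values in the text. The names `PODIUM_POINTS` and `podium_score` are made up for illustration, the example team is hypothetical, and the exact scale used in utils.py may differ.

```python
# Points awarded for the final position in a cup (1st to 4th); others get 0.
PODIUM_POINTS = {1: 8, 2: 4, 3: 2, 4: 1}


def podium_score(final_positions):
    """Sum the podium points over all the cups a team played."""
    return sum(PODIUM_POINTS.get(position, 0) for position in final_positions)


# Hypothetical team: champion once, runner-up once, and 9th once.
positions = [1, 2, 9]
score = podium_score(positions)
# Total score, and the yearly mean (score divided by cups played).
print(score, score / float(len(positions)))
```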

In [6]:

show(graph_teams_stat_bars(team_stats, 'podium_score_yearly'))

Out[6]:

Matches¶

Now we need to get the matches data, including the "reversed" duplication of matches, and adding the team stats in each match.

In [7]:

matches = get_matches(with_team_stats=True,
                      duplicate_with_reversed=duplicate_with_reversed,
                      exclude_ties=exclude_ties)
        
matches

Out[7]:

	score1	score2	team1	team2	year	score_diff	winner	matches_played	matches_won	years_played	podium_score	cups_won	matches_won_percent	podium_score_yearly	cups_won_yearly	matches_played_2	matches_won_2	years_played_2	podium_score_2	cups_won_2
0	4	0	Brazil	Mexico	1950	4	1	89	63	16	102	5	70.786517	6.375000	0.312500	46	12	13	0	0	...
1	3	0	Yugoslavia	Switzerland	1950	3	1	34	14	8	2	0	41.176471	0.250000	0.000000	24	7	7	0	0	...
3	4	1	Yugoslavia	Mexico	1950	3	1	34	14	8	2	0	41.176471	0.250000	0.000000	46	12	13	0	0	...
4	2	0	Brazil	Yugoslavia	1950	2	1	89	63	16	102	5	70.786517	6.375000	0.312500	34	14	8	2	0	...
5	2	1	Switzerland	Mexico	1950	1	1	24	7	7	0	0	29.166667	0.000000	0.000000	46	12	13	0	0	...
6	2	0	England	Chile	1950	2	1	59	26	13	18	1	44.067797	1.384615	0.076923	26	7	7	4	0	...
7	3	1	Spain	United States	1950	2	1	53	27	12	18	1	50.943396	1.500000	0.083333	25	5	7	0	0	...
8	2	0	Spain	Chile	1950	2	1	53	27	12	18	1	50.943396	1.500000	0.083333	26	7	7	4	0	...
9	1	0	United States	England	1950	1	1	25	5	7	0	0	20.000000	0.000000	0.000000	59	26	13	18	1	...
10	1	0	Spain	England	1950	1	1	53	27	12	18	1	50.943396	1.500000	0.083333	59	26	13	18	1	...
11	5	2	Chile	United States	1950	3	1	26	7	7	4	0	26.923077	0.571429	0.000000	25	5	7	0	0	...
12	3	2	Sweden	Italy	1950	1	1	41	14	9	16	0	34.146341	1.777778	0.000000	71	36	15	54	2	...
14	2	0	Italy	Paraguay	1950	2	1	71	36	15	54	2	50.704225	3.600000	0.133333	25	6	7	0	0	...
15	8	0	Uruguay	Bolivia	1950	8	1	43	14	10	22	1	32.558140	2.200000	0.100000	4	0	2	0	0	...
17	7	1	Brazil	Sweden	1950	6	1	89	63	16	102	5	70.786517	6.375000	0.312500	41	14	9	16	0	...
18	6	1	Brazil	Spain	1950	5	1	89	63	16	102	5	70.786517	6.375000	0.312500	53	27	12	18	1	...
19	3	2	Uruguay	Sweden	1950	1	1	43	14	10	22	1	32.558140	2.200000	0.100000	41	14	9	16	0	...
20	3	1	Sweden	Spain	1950	2	1	41	14	9	16	0	34.146341	1.777778	0.000000	53	27	12	18	1	...
21	2	1	Uruguay	Brazil	1950	1	1	43	14	10	22	1	32.558140	2.200000	0.100000	89	63	16	102	5	...
22	5	0	Brazil	Mexico	1954	5	1	89	63	16	102	5	70.786517	6.375000	0.312500	46	12	13	0	0	...
23	1	0	Yugoslavia	France	1954	1	1	34	14	8	2	0	41.176471	0.250000	0.000000	48	23	10	34	1	...
25	3	2	France	Mexico	1954	1	1	48	23	10	34	1	47.916667	3.400000	0.100000	46	12	13	0	0	...
26	4	1	Germany	Turkey	1954	3	1	98	59	15	94	3	60.204082	6.266667	0.200000	10	5	2	4	0	...
27	9	0	Hungary	South Korea	1954	9	1	26	11	7	8	0	42.307692	1.142857	0.000000	28	5	8	2	0	...
28	8	3	Hungary	Germany	1954	5	1	26	11	7	8	0	42.307692	1.142857	0.000000	98	59	15	94	3	...
29	7	0	Turkey	South Korea	1954	7	1	10	5	2	4	0	50.000000	2.000000	0.000000	28	5	8	2	0	...
30	7	2	Germany	Turkey	1954	5	1	98	59	15	94	3	60.204082	6.266667	0.200000	10	5	2	4	0	...
31	2	0	Uruguay	Czechoslovakia	1954	2	1	43	14	10	22	1	32.558140	2.200000	0.100000	23	7	6	8	0	...
32	1	0	Austria	Scotland	1954	1	1	25	10	6	4	0	40.000000	0.666667	0.000000	23	4	8	0	0	...
33	7	0	Uruguay	Scotland	1954	7	1	43	14	10	22	1	32.558140	2.200000	0.100000	23	4	8	0	0	...
34	5	0	Austria	Czechoslovakia	1954	5	1	25	10	6	4	0	40.000000	0.666667	0.000000	23	7	6	8	0	...
35	2	1	Switzerland	Italy	1954	1	1	24	7	7	0	0	29.166667	0.000000	0.000000	71	36	15	54	2	...
37	4	1	Italy	Belgium	1954	3	1	71	36	15	54	2	50.704225	3.600000	0.133333	32	10	8	2	0	...
38	2	0	England	Switzerland	1954	2	1	59	26	13	18	1	44.067797	1.384615	0.076923	24	7	7	0	0	...
39	4	1	Switzerland	Italy	1954	3	1	24	7	7	0	0	29.166667	0.000000	0.000000	71	36	15	54	2	...
40	7	5	Austria	Switzerland	1954	2	1	25	10	6	4	0	40.000000	0.666667	0.000000	24	7	7	0	0	...
41	4	2	Uruguay	England	1954	2	1	43	14	10	22	1	32.558140	2.200000	0.100000	59	26	13	18	1	...
42	2	4	Brazil	Hungary	1954	-2	2	89	63	16	102	5	70.786517	6.375000	0.312500	26	11	7	8	0	...
43	0	2	Yugoslavia	Germany	1954	-2	2	34	14	8	2	0	41.176471	0.250000	0.000000	98	59	15	94	3	...
44	4	2	Hungary	Uruguay	1954	2	1	26	11	7	8	0	42.307692	1.142857	0.000000	43	14	10	22	1	...
45	6	1	Germany	Austria	1954	5	1	98	59	15	94	3	60.204082	6.266667	0.200000	25	10	6	4	0	...
46	1	3	Uruguay	Austria	1954	-2	2	43	14	10	22	1	32.558140	2.200000	0.100000	25	10	6	4	0	...
47	2	3	Hungary	Germany	1954	-1	2	26	11	7	8	0	42.307692	1.142857	0.000000	98	59	15	94	3	...
48	3	1	Germany	Argentina	1958	2	1	98	59	15	94	3	60.204082	6.266667	0.200000	64	33	13	40	2	...
49	1	0	Northern Ireland	Czechoslovakia	1958	1	1	13	3	3	0	0	23.076923	0.000000	0.000000	23	7	6	8	0	...
50	3	1	Argentina	Northern Ireland	1958	2	1	64	33	13	40	2	51.562500	3.076923	0.153846	13	3	3	0	0	...
53	6	1	Czechoslovakia	Argentina	1958	5	1	23	7	6	8	0	30.434783	1.333333	0.000000	64	33	13	40	2	...
54	2	1	Northern Ireland	Czechoslovakia	1958	1	1	13	3	3	0	0	23.076923	0.000000	0.000000	23	7	6	8	0	...
55	7	3	France	Paraguay	1958	4	1	48	23	10	34	1	47.916667	3.400000	0.100000	25	6	7	0	0	...
57	3	2	Yugoslavia	France	1958	1	1	34	14	8	2	0	41.176471	0.250000	0.000000	48	23	10	34	1	...
58	3	2	Paraguay	Scotland	1958	1	1	25	6	7	0	0	24.000000	0.000000	0.000000	23	4	8	0	0	...
59	2	1	France	Scotland	1958	1	1	48	23	10	34	1	47.916667	3.400000	0.100000	23	4	8	0	0	...
61	3	0	Sweden	Mexico	1958	3	1	41	14	9	16	0	34.146341	1.777778	0.000000	46	12	13	0	0	...
64	2	1	Sweden	Hungary	1958	1	1	41	14	9	16	0	34.146341	1.777778	0.000000	26	11	7	8	0	...
66	4	0	Hungary	Mexico	1958	4	1	26	11	7	8	0	42.307692	1.142857	0.000000	46	12	13	0	0	...
67	2	1	Wales	Hungary	1958	1	1	5	1	1	0	0	20.000000	0.000000	0.000000	26	11	7	8	0	...
68	3	0	Brazil	Austria	1958	3	1	89	63	16	102	5	70.786517	6.375000	0.312500	25	10	6	4	0	...
71	2	0	Russia	Austria	1958	2	1	37	17	9	2	0	45.945946	0.222222	0.000000	25	10	6	4	0	...
73	2	0	Brazil	Russia	1958	2	1	89	63	16	102	5	70.786517	6.375000	0.312500	37	17	9	2	0	...
74	1	0	Russia	England	1958	1	1	37	17	9	2	0	45.945946	0.222222	0.000000	59	26	13	18	1	...
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

1100 rows × 23 columns
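The "reversed" duplication can be sketched with plain dictionaries. This is an assumption about what `get_matches` does internally, not its actual code: every match is kept, and a mirrored copy is added where the teams, the scores and the winner are swapped.

```python
def reverse_match(match):
    """Return a copy of the match as seen from the other team's side."""
    swapped_winner = {0: 0, 1: 2, 2: 1}  # a tie stays a tie
    return {
        'team1': match['team2'],
        'team2': match['team1'],
        'score1': match['score2'],
        'score2': match['score1'],
        'year': match['year'],
        'score_diff': -match['score_diff'],
        'winner': swapped_winner[match['winner']],
    }


original = {'team1': 'Brazil', 'team2': 'Mexico', 'score1': 4, 'score2': 0,
            'year': 1950, 'score_diff': 4, 'winner': 1}
matches_sketch = [original, reverse_match(original)]
print(matches_sketch[1])
```

With the real data, the team stats columns (and their `_2` counterparts) would have to be swapped in the same way.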

Can the results be classified? Can we see a pattern, some kind of grouping of results based on the stats of both teams?

Let's try visualizing two of the most interesting ones: matches won percent, and podium score yearly (mean).

In [8]:

show(graph_matches_results_scatter(matches, 'matches_won_percent', 'matches_won_percent_2'))

Out[8]:

In [9]:

show(graph_matches_results_scatter(matches, 'podium_score_yearly', 'podium_score_yearly_2'))

Out[9]:

Before drawing any conclusions: there is more there than what you can see with your eyes. At some locations there could be more than one point, and you only see the one on top.

The first graph tells us something that most people already expect: there is a small tendency in the results, where the team with the better matches won percent tends to win. The second graph also shows a similar relation between podium score yearly and the result, even if it's not visible to the eye because of the overlapping dots.

But remember, the classifier can learn a lot more than just those simple relations based on the info we give to it. These graphs were just a screening to confirm some basic intuitions.

Learn¶

Ok, now we have everything we need. Let's feed the selected input features to the neural network classifier, and let it learn.

We have to normalize the data, otherwise the features with larger values (such as the year) will impose a greater weight on the prediction.
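To illustrate the idea, here is a sketch that assumes the normalization is a standardization (zero mean, unit standard deviation per feature). The `normalize` helper from utils.py may use a different scheme, and `standardize` is a hypothetical name.

```python
def standardize(column):
    """Scale a list of numbers to zero mean and unit standard deviation."""
    mean = sum(column) / float(len(column))
    variance = sum((x - mean) ** 2 for x in column) / float(len(column))
    std = variance ** 0.5 or 1.0  # avoid dividing by zero on constant columns
    return [(x - mean) / std for x in column]


years = [1950, 1954, 1958]            # large raw values
won_percent = [70.79, 41.18, 29.17]   # much smaller raw values
# After scaling, both features live in a comparable range around 0.
print(standardize(years))
print(standardize(won_percent))
```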

Also, we use a percentage of the inputs to train, but keep the rest "hidden": we don't let the classifier see them while learning. After the training we use those hidden inputs to "test" the ability of the classifier to predict data it has never seen before (and for which we already know the correct answer).
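A train/test split like that can be sketched in a few lines. The function name, the 75% train fraction and the fixed seed are assumptions made for this example; the actual `split_samples` in utils.py may split differently.

```python
import random


def split_train_test(inputs, outputs, train_fraction=0.75, seed=0):
    """Shuffle the samples and split them into train and test parts."""
    indexes = list(range(len(inputs)))
    random.Random(seed).shuffle(indexes)
    cut = int(len(indexes) * train_fraction)
    train, test = indexes[:cut], indexes[cut:]
    return ([inputs[i] for i in train], [outputs[i] for i in train],
            [inputs[i] for i in test], [outputs[i] for i in test])


sample_inputs = [[i, i * 2] for i in range(8)]   # hypothetical feature rows
sample_outputs = [1, 2, 1, 2, 1, 2, 1, 2]        # hypothetical winners
tr_in, tr_out, te_in, te_out = split_train_test(sample_inputs, sample_outputs)
print(len(tr_in), len(te_in))
```

Shuffling the indexes, rather than the two lists separately, keeps every input paired with its own output.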

In [10]:

inputs, outputs = extract_samples(matches,
                                  input_features,
                                  output_feature)

normalizer, inputs = normalize(inputs)

train_inputs, train_outputs, test_inputs, test_outputs = split_samples(inputs, outputs)

n = buildNetwork(len(input_features),
                 10 * len(input_features),
                 10 * len(input_features),
                 1,
                 outclass=SigmoidLayer,
                 bias=True)

To be able to evaluate the results and show progress during the learning cycle, we need these two functions, which help us calculate how well the network can predict the results of the matches used to learn, and of the matches it doesn't know.

In [11]:

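def neural_result(sample):
    """Call the neural network, and translate its output to a match result."""
    n_output = n.activate(sample)
    # the sigmoid output is between 0 and 1: 0.5 or more means team2 wins
    if n_output >= 0.5:
        return 2
    else:
        return 1

def test_network():
    """Print the accuracy (in %) on the train and test sets."""
    # lists instead of map(), and a single tuple passed to print(), so this
    # behaves the same under Python 2 and Python 3
    train_predictions = [neural_result(sample) for sample in train_inputs]
    test_predictions = [neural_result(sample) for sample in test_inputs]
    print((100 - percentError(train_predictions, train_outputs),
           100 - percentError(test_predictions, test_outputs)))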

Create a train set (a kind of dataset that pybrain uses to train neural networks), and display initial accuracy on both sets (train and test).

In [12]:

train_set = ClassificationDataSet(len(input_features))

for i, input_line in enumerate(train_inputs):
    train_set.addSample(train_inputs[i], [train_outputs[i] - 1])

trainer = BackpropTrainer(n, dataset=train_set, momentum=0.5, weightdecay=0.0)

train_set.assignClasses()

test_network()

(50.78979343863912, 54.51263537906137)
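The `- 1` used when adding samples maps the winner labels 1 and 2 to the targets 0 and 1, which is the range a sigmoid output can cover. A plain Python sketch of that round trip, with hypothetical helper names and no pybrain involved:

```python
import math


def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def winner_to_target(winner):
    """Map the labels 1 / 2 to the targets 0 / 1 used for training."""
    return winner - 1


def output_to_winner(output, threshold=0.5):
    """Map a sigmoid output back to a winner label (1 or 2)."""
    return 2 if output >= threshold else 1


for raw in (-2.0, 0.0, 2.0):
    print(raw, round(sigmoid(raw), 3), output_to_winner(sigmoid(raw)))
```

This also shows why ties cannot be represented: a single thresholded output only separates two classes.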

Train the network for a given number of iterations. You can re-run this step many times, and it will keep learning, but as you know, if you train too much you can end up overfitting the training data (this is visible when the test set accuracy starts to decrease).

In [13]:

for i in range(20):
    trainer.train()
    test_network()

(72.17496962332928, 77.9783393501805)
(73.02551640340218, 75.09025270758123)
(73.63304981773997, 75.81227436823104)
(73.63304981773997, 75.45126353790613)
(73.87606318347508, 74.72924187725631)
(74.24058323207777, 74.0072202166065)
(74.36208991494533, 74.36823104693141)
(73.87606318347508, 76.17328519855596)
(74.48359659781288, 75.09025270758123)
(73.99756986634264, 75.45126353790613)
(73.2685297691373, 72.20216606498195)
(74.726609963548, 74.36823104693141)
(74.726609963548, 74.72924187725631)
(74.24058323207777, 75.81227436823104)
(74.60510328068044, 75.09025270758123)
(74.96962332928311, 74.72924187725631)
(74.726609963548, 75.09025270758123)
(74.24058323207777, 72.92418772563177)
(74.36208991494533, 74.72924187725631)
(74.60510328068044, 76.17328519855596)
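One way to act on the overfitting warning is to record the pairs printed above and look for the iteration with the best test accuracy. The sketch below is only an illustration with a hypothetical helper. Note that choosing a stopping point with the test set makes the reported test score optimistic; a separate validation set would be a cleaner choice.

```python
def best_iteration(history):
    """Return (iteration, test_accuracy) for the highest test accuracy."""
    test_scores = [test for _train, test in history]
    best = max(range(len(test_scores)), key=test_scores.__getitem__)
    return best + 1, test_scores[best]  # iterations are counted from 1


# First three (train, test) pairs printed above, rounded to two decimals.
history = [(72.17, 77.98), (73.03, 75.09), (73.63, 75.81)]
print(best_iteration(history))
```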

The closer this score is to 100%, the better the classifier's predictions. A score of 100 means the classifier always predicts the exact real result, which is impossible in practice.

Something around 75% sounds impressive, but in fact it is not that good. It's pretty good, but consider that just flipping a coin will get you 50%. So this sits halfway between flipping a coin and having a time machine.
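To make the score concrete, here is a plain Python stand-in for the accuracy computation (it does not use pybrain's `percentError`; the helper name and the sample data are hypothetical). It also shows a trivial baseline: when team1 and team2 wins are equally frequent, which the reversed duplication should produce, always answering "team1" gets the same 50% as the coin.

```python
def accuracy_percent(predictions, targets):
    """Percentage of predictions that match the known results."""
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return 100.0 * hits / len(targets)


targets = [1, 1, 2, 1, 2, 2, 1, 2]       # hypothetical known winners
predictions = [1, 1, 2, 2, 2, 1, 1, 2]   # hypothetical classifier answers
always_team1 = [1] * len(targets)        # baseline that ignores the inputs

print(accuracy_percent(predictions, targets))
print(accuracy_percent(always_team1, targets))
```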

Predict¶

With the classifier already trained, we can start making predictions. But we need a little function able to translate inputs like this: (2014, 'Argentina', 'Brazil'), to the numeric inputs the classifier expects (based on the input features).

This function does the conversion, also normalizes the data with the same normalizer used before, and then just asks the classifier for the prediction.

In [14]:

def predict(year, team1, team2):
    inputs = []
    
    for feature in input_features:
        from_team_2 = '_2' in feature
        feature = feature.replace('_2', '')
        
        if feature in team_stats.columns.values:
            team = team2 if from_team_2 else team1
            value = team_stats.loc[team, feature]
        elif feature == 'year':
            value = year
        else:
            raise ValueError("Don't know where to get feature: " + feature)
            
        inputs.append(value)
        
    inputs = normalizer.transform(inputs)
    result = neural_result(inputs)
    
    if result == 0:
        return 'tie'
    elif result == 1:
        return team1
    elif result == 2:
        return team2
    else:
        return 'Unknown result: ' + str(result)

Some predictions about the past, compared to real results:¶

Even though we know those results, and some of them were used to train, that doesn't guarantee the classifier will predict the real result.

In [15]:

predict(1950, 'Mexico', 'Brazil')  # real result: Brazil won 4-0

Out[15]:

'Brazil'

In [16]:

predict(1990, 'United Arab Emirates', 'Colombia')  # real result: Colombia won 2-0

Out[16]:

'Colombia'

In [17]:

predict(2002, 'South Africa', 'Spain')  # real result: Spain won 3-2

Out[17]:

'Spain'

In [18]:

predict(2010, 'Japan', 'Cameroon')  # real result: Japan won 1-0

Out[18]:

'Japan'

Some predictions about the future:¶

(at least these were "future" at the moment of programming)

In [19]:

predict(2014, 'Argentina', 'Brazil')

Out[19]:

'Argentina'

In [20]:

predict(2014, 'Spain', 'Haiti')

Out[20]:

'Spain'

In [21]:

predict(2014, 'Russia', 'Germany')

Out[21]:

'Germany'

In [22]:

predict(2014, 'Russia', 'Russia')

Out[22]:

'Russia'