Predicting the World Cup with the Google Cloud Platform¶

This notebook builds a machine learning model that can be used to predict professional soccer games. The notebook was created for the "Predicting the Future with the Google Cloud Platform" talk at Google I/O 2014 by Jordan Tigani and Felipe Hoffa. A link to the presentation is here: https://www.youtube.com/watch?v=YyvvxFeADh8

Once the machine learning model is built, we use it to predict outcomes in the World Cup. If you are seeing this after the World Cup is over, you can use it to predict hypothetical matchups (how would the 2010 World Cup winners do against the current champions?). You can also see how different strategies would affect prediction outcomes. Maybe you'd like to add player salary data and see how that affects predictions (likely it will help a lot). Maybe you'd like to try Poisson regression instead of logistic regression. Or maybe you'd like to try data transformation techniques like whitening or PCA.

The model uses Logistic Regression, built from touch-by-touch data about three different soccer leagues (English Premier League, Spanish La Liga, and American Major League Soccer) over multiple seasons. Because the data is licensed, only the aggregated statistics about those games are available. (If you have ideas of other statistics you'd like to see, create a new issue in the https://github.com/GoogleCloudPlatform/ipython-soccer-predictions GitHub repo and we'll see what we can do.) The match_stats.py file shows the raw queries that were used to generate the stats.

Four Python files are used by this notebook. They must be on the Python path. These are:

match_stats: Reads the match statistics from BigQuery. Because we are using the pre-aggregated data, most of the code here is disabled, but it is kept in order to show the data transformations that are done from the raw data in order to build the stats.
features: Turns raw statistics into features that get fed into the machine learning model. These features combine statistics from the trailing N games to predict the next game.
world_cup: Helper methods for cleaning the data and building and running the logistic regression model.
power: Computes a "power" statistic over a number of teams who have played against each other, attempting to come up with a ranking.
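
If the four files are not already importable, one way to put them on the path is sketched below. This is a minimal sketch and not part of the original notebook: add_module_dir is a hypothetical helper, and using the current working directory is an assumption about where the files live.

```python
import os
import sys


def add_module_dir(directory):
    """Put `directory` at the front of sys.path if it is not already there."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Assumption: match_stats.py, features.py, world_cup.py and power.py sit in
# the current working directory. Point this at their real location instead.
module_dir = add_module_dir(os.getcwd())
```

After this runs, the import statements in the cells below should be able to find the modules, provided the directory guess is right.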

Setting up BigQuery Authentication¶

Since we're providing this notebook as part of a Docker image that can be run on Google Compute Engine, we'll override the authorization used in the Pandas BigQuery connector to use GCE auth. This will mean that you don't have to do any authorization on your own. You must, however, have the BigQuery API enabled in your Google Cloud Project (https://console.developers.google.com). Because the data sizes (after aggregation) are quite small, you may not need to enable billing.

In [1]:

from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq

def GetMetadata(path):
  import urllib2
  BASE_PATH = 'http://metadata/computeMetadata/v1/'
  request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
  return urllib2.urlopen(request).read()

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')

client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))

gbq._authenticate = lambda: client

Verifying setup¶

Load the required modules and run a quick BigQuery query. This tests that authentication and the pandas BigQuery connector are working correctly.

In [1]:

from pandas.io import gbq

# Import the four python modules that we use.
import match_stats
import features
import world_cup
import power
query = "SELECT * FROM (%(summary_query)s) LIMIT 1" % {
    'summary_query': match_stats.team_game_summary_query()}
gbq.read_gbq(query)

Waiting on bqjob_r337c26ad9dfd06bd_00000147233b4372_1 ... (0s) Current status: DONE

Out[1]:

	matchid	teamid	passes	bad_passes	pass_ratio	corners	fouls	shots	cards	pass_80	pass_70	timestamp	goals	is_home	team_name	competitionid	seasonid	expected_goals	on_target	length
0	731825	838	2.111111	0.912698	0.696335	0.007937	0.119048	0.087302	4	0.02381	0.063492	1404604038607	0	0	Costa Rica	4	2013	0.254178	0.007937	126

1 rows × 20 columns

Building features¶

The cell below returns a pandas DataFrame that contains the features that will be used to build a model.

The features query will read from the game summary table that has prepared per-game statistics that will be used to predict outcomes. The data has been aggregated from touch-by-touch data from Opta. However, since that data is not public, we use these prepared statistics instead of the raw data.

In order to predict a game, we look at the previous N games of history for each team, where N is defined here as history_size.

In [3]:

import features
reload(features)

# Sets the history size. This is how far back we will look before each game to aggregate statistics
# to predict the next game. For example, a history size of 5 will look at the previous 5 games played
# by a particular team in order to predict the next game.
history_size = 6

game_summaries = features.get_game_summaries()
data = features.get_features(history_size)

Waiting on bqjob_rdad0a47a2ca5106_0000014722ccfb80_2 ... (0s) Current status: DONE   
Waiting on bqjob_r61ce926a2f57863e_0000014722cd178a_3 ... (0s) Current status: DONE

The features include rollups from the last K games (K is the history_size set above). Most of them are averages that are computed per minute of game time. Per-minute stats are used so that we can normalize for World Cup games that go into extra time.
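
As a rough illustration of the rollup idea (not the actual code in features.py, which appears to build the features with BigQuery queries), the sketch below converts raw counts to per-minute rates and then averages the K games before each game. The helper names and the sample numbers are made up, and returning None when a team has fewer than K previous games is an assumption.

```python
def per_minute(count, minutes):
    """Normalize a raw count by the length of the game in minutes."""
    return count / float(minutes)


def trailing_average(values, history_size):
    """For each game, average the `history_size` games played before it.

    The current game is excluded, since it is the one being predicted.
    Games without enough history get None (an assumption of this sketch).
    """
    averages = []
    for i in range(len(values)):
        if i < history_size:
            averages.append(None)
        else:
            window = values[i - history_size:i]
            averages.append(sum(window) / float(len(window)))
    return averages


# Made-up data: (shots, minutes played) for five consecutive games.
# The second game went to extra time, so it lasted 120 minutes.
games = [(9, 90), (12, 120), (18, 90), (6, 90), (9, 90)]
shots_per_minute = [per_minute(shots, minutes) for shots, minutes in games]
history = trailing_average(shots_per_minute, 3)
```

Note how the 120-minute game ends up with the same per-minute shot rate as the first 90-minute game, even though its raw shot count is higher.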

Feature columns:¶

The following columns are the features that will be used to build the prediction model:

is_home: Whether a team is playing at home or away. This turns out to be a big deal in soccer.
avg_points: Average number of points (3 for a win, 1 for a draw, 0 for a loss) earned in the last K games.
avg_goals: Average number of goals scored in the last K games.
op_avg_goals: Average number of goals scored by the opponent in the last K games.
pass_{70/80}: Number of completed passes per minute in the attacking 30%/20% of the field.
op_pass_{70/80}: Number of completed passes by the opponent in their attacking 30%/20% of the field.
expected_goals: Average number of expected goals in the last K games, where expected goals is computed based on the number of shots taken and their distance from the goal.
passes: Number of passes completed per minute.
bad_passes: Number of passes that didn't complete successfully per minute.
pass_ratio: Percentage of completed passes.
corners: Number of corner kicks awarded per minute.
fouls: Number of fouls committed per minute.
cards: Number of cards received (red or yellow) per game.
shots: Number of shots taken per minute.
op_*: Statistics about the opponent in the historical games. This is not the opponent shown in op_team_name; instead, these stats show how the primary team's opponents have fared against them. For example, op_corners is how many corners the team's opponents have been awarded per minute.
*_op_ratio: Ratio of a team's statistics to their opponents.

Non-feature columns:¶

The following columns are included as metadata about the match:

matchid: Unique id for the match
teamid: Unique id for the team whose historical statistics we're looking at.
op_teamid: Unique id for the opposing team. None of these statistics reflect this opponent.
team_name: Name of the team whose historical statistics we're looking at.
op_team_name: Name of the opposing team.
timestamp: Time at which the game was played.
competitionid: Unique id for the competition (separates MLS from the FIFA World Cup from EPL).

Target columns:¶

The following columns are target variables that we will be attempting to predict. These columns must be dropped before any prediction is done, but are useful when building a model. The models that we will build below will just try to predict outcome (points) but other models may choose to predict goals, which is why they are also included here.

points: The outcome of the game. 3 points for a win, 1 point for a draw, 0 for a loss. (Points are not goals!)
goals: The number of goals the team referenced by teamid scored.
op_goals: The number of goals the team referenced by op_teamid scored.
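
The points column follows the usual league scoring. A minimal sketch of that mapping, using a hypothetical helper points_for that is not part of the notebook's modules:

```python
def points_for(goals, op_goals):
    """Return league points for a team given both teams' goal counts."""
    if goals > op_goals:
        return 3  # win
    if goals == op_goals:
        return 1  # draw
    return 0      # loss


# A 2-1 win, a 0-0 draw and a 0-3 loss.
examples = [points_for(2, 1), points_for(0, 0), points_for(0, 3)]
```

The Netherlands vs Argentina record shown in Out[4] below has goals 0 and op_goals 0, which gives 1 point under this rule.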

In [4]:

# Partition the world cup data and the club data. We're only going to train our model using club data.

club_data = data[data['competitionid'] != 4]
# Show the features for the latest game in competition id 4, which is the world cup.
data[data['competitionid'] == 4].iloc[0]

Out[4]:

matchid                                  731828
teamid                                      366
op_teamid                                   632
competitionid                                 4
seasonid                                   2013
is_home                                       0
team_name                           Netherlands
op_team_name                          Argentina
timestamp            2014-07-09 21:00:00.000000
goals                                         0
op_goals                                      0
points                                        1
avg_points                             2.166667
avg_goals                                     2
op_avg_goals                          0.8333333
pass_70                                0.412262
pass_80                               0.1391892
op_pass_70                            0.3897345
op_pass_80                             0.114534
expected_goals                         1.799292
op_expected_goals                     0.7054955
passes                                 3.518422
bad_passes                             1.014758
pass_ratio                            0.7588293
corners                              0.04906867
fouls                                 0.1302936
cards                                  2.666667
shots                                 0.1469179
op_passes                              4.158118
op_bad_passes                          1.018166
op_corners                           0.04081354
op_fouls                              0.1938453
op_cards                                    2.5
op_shots                              0.1107791
goals_op_ratio                             1.75
shots_op_ratio                         1.428914
pass_op_ratio                         0.9701803
Name: 0, dtype: object

Compute the crosstabs for goals scored vs outcomes. In this data, every team that scored more than 5 goals won, and a team that scored no goals lost about 73% of the time (the rest of those games were ties).

In [5]:

import pandas as pd
pd.crosstab(
    club_data['goals'], 
    club_data.replace(
        {'points': {
            0: 'lose', 1: 'tie', 3: 'win'}})['points'])

Out[5]:

points	lose	tie	win
goals
0	727	267	0
1	477	394	320
2	131	205	500
3	21	40	314
4	2	6	148
5	0	2	65
6	0	0	12
7	0	0	6
8	0	0	1
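
The share of each outcome per goal count can be read off the table by dividing each count by its row total. The sketch below does that with the counts copied from Out[5]; outcome_rates is a hypothetical helper.

```python
# Counts copied from the crosstab above: goals -> (lose, tie, win).
crosstab = {
    0: (727, 267, 0),
    1: (477, 394, 320),
    2: (131, 205, 500),
    3: (21, 40, 314),
    4: (2, 6, 148),
    5: (0, 2, 65),
    6: (0, 0, 12),
    7: (0, 0, 6),
    8: (0, 0, 1),
}


def outcome_rates(counts):
    """Turn a row of (lose, tie, win) counts into fractions of the row total."""
    total = float(sum(counts))
    return tuple(count / total for count in counts)


rates = dict((goals, outcome_rates(counts)) for goals, counts in crosstab.items())
lose_rate_when_scoreless = rates[0][0]  # 727 / 994, about 0.73
```

Newer pandas versions can produce the same normalized table directly by passing normalize='index' to pd.crosstab.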

Training the model¶

We're going to train a logistic regression model based on the club data only. This will use an external code file world_cup.py to build the model.

The output of this cell will be a logistic regression model and a test set that we can use to test how good we are at predicting outcomes. The cell will also print the R-squared value for the regression (a pseudo R-squared, read from model.prsquared). This is a measure of how well the model fits the data (higher is better).

In [6]:

import world_cup
reload(world_cup)
import match_stats
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Don't train on games that ended in a draw, since they have less signal.
train = club_data.loc[club_data['points'] != 1]
# train = club_data

(model, test) = world_cup.train_model(
     train, match_stats.get_non_feature_columns())
print "\nRsquared: %0.03g" % model.prsquared

Rsquared: 0.164
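
The prsquared attribute and the solver messages later in this notebook suggest that world_cup.train_model wraps a statsmodels logistic regression, although the source of world_cup.py is not shown here. Under that assumption, the printed value would be McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model with that of a model that always predicts the base rate. A small sketch with made-up labels and probabilities:

```python
import math


def log_likelihood(labels, probabilities):
    """Log-likelihood of binary labels under predicted win probabilities."""
    return sum(math.log(p) if y else math.log(1.0 - p)
               for y, p in zip(labels, probabilities))


def mcfadden_pseudo_r2(labels, probabilities):
    """1 - (model log-likelihood / intercept-only log-likelihood)."""
    base_rate = sum(labels) / float(len(labels))
    ll_model = log_likelihood(labels, probabilities)
    ll_null = log_likelihood(labels, [base_rate] * len(labels))
    return 1.0 - ll_model / ll_null


# Made-up example: two wins and two losses, with the model leaning the
# right way on all four games.
labels = [1, 1, 0, 0]
probabilities = [0.8, 0.6, 0.4, 0.2]
pseudo_r2 = mcfadden_pseudo_r2(labels, probabilities)
```

A model that only ever predicts the base rate scores 0 on this measure, and values well below 1 are normal even for useful models, which helps put 0.164 in context.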

Picking important features¶

The logistic regression model is built using regularization; this means that it penalizes complex models. It has the side effect of helping us with feature selection. Features that are not important will be dropped out of the model completely.

We can divide the features into three buckets:

Positive features: These features mean that a team is more likely to win.
Negative features: These features mean that a team is less likely to win.
Dropped features: These features aren't important, and if we included them in the model, we'd probably be overfitting.

In [7]:

import numpy as np

def print_params(model, limit=None):
    # Work on a copy so the model's own parameters are left untouched.
    params = model.params.copy()
    params.sort(ascending=False)
    del params['intercept']

    if not limit:
        limit = len(params)

    # Coefficients are shown as exp(coef) - 1, the relative change in the odds.
    print("Positive features")
    print np.exp(params[[param > 0.001 for param in params]]).sub(1)[:limit]

    print("\nDropped features")
    print params[[param == 0.0 for param in params]][:limit]

    print("\nNegative features")
    params.sort(ascending=True)
    print np.exp(params[[param < -0.001 for param in params]]).sub(1)[:limit]

print_params(model, 10)

Positive features
is_home                  0.712618
pass_70                  0.215699
opp_op_expected_goals    0.198712
opp_op_corners           0.180812
shots                    0.146956
opp_bad_passes           0.145576
op_passes                0.091629
expected_goals           0.079620
avg_points               0.075306
fouls                    0.047963
dtype: float64

Dropped features
op_avg_goals         0
goals_op_ratio       0
op_cards             0
op_bad_passes        0
op_shots             0
corners              0
cards                0
opp_pass_op_ratio    0
pass_ratio           0
passes               0
dtype: float64

Negative features
opp_pass_70          -0.177428
op_expected_goals    -0.165771
op_corners           -0.153125
opp_shots            -0.128127
bad_passes           -0.127077
opp_op_passes        -0.083938
opp_expected_goals   -0.073748
opp_avg_points       -0.070032
opp_fouls            -0.045768
opp_avg_goals        -0.020472
dtype: float64
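
print_params shows exp(coefficient) - 1 rather than the raw coefficient. Under the standard reading of logistic regression coefficients, that is the relative change in the odds of winning for a one-unit increase in the feature (in whatever units the model was trained on, which may be rescaled; that part is an assumption). The sketch converts between the two forms using values from the output above; the helper names are hypothetical.

```python
import math


def odds_change(coefficient):
    """Relative change in the odds for a one-unit increase in the feature."""
    return math.exp(coefficient) - 1.0


def coefficient_from_change(change):
    """Invert odds_change: recover the raw logistic regression coefficient."""
    return math.log(1.0 + change)


# Values printed by print_params(model, 10) above.
is_home_change = 0.712618
opp_pass_70_change = -0.177428

# Raw coefficient behind the is_home entry.
is_home_coef = coefficient_from_change(is_home_change)

# Odds multipliers: above 1 helps the team, below 1 hurts it.
is_home_multiplier = 1.0 + is_home_change
opp_pass_70_multiplier = 1.0 + opp_pass_70_change
```

Read this way, playing at home multiplies the odds of winning by about 1.71, while a one-unit increase in opp_pass_70 multiplies them by about 0.82.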

Predicting wins in club data¶

This cell uses the test set (which was not used during the creation of the model) to predict outcomes. We can look at a few of the predictions to see how well we did. We'll show 5 each from two buckets: cases where we got it right, and cases where we got it wrong. We can see if these make sense. When we display these, the home team is always on the left.

For example, it might show that we predicted Manchester United playing at home beating Sunderland. This is completely reasonable and we'd expect that the outcome would be 3 points (a victory).

The columns of the output are:

team_name: Home team
op_team_name: Away team
predicted: The percentage chance that we believe the home team will win.
points: What actually happened. 3 points for a win, 1 point for a draw, 0 points for a loss.

In [8]:

reload(world_cup)
results = world_cup.predict_model(model, test, 
    match_stats.get_non_feature_columns())

predictions = world_cup.extract_predictions(
    results.copy(), results['predicted'])

print 'Correct predictions:'
predictions[(predictions['predicted'] > 50) & (predictions['points'] == 3)][:5]

Correct predictions:

Out[8]:

	team_name	op_team_name	predicted	expected	winner	points
5	Vancouver Whitecaps	Portland Timbers	50.746754	Vancouver Whitecaps	Vancouver Whitecaps	3
23	Sporting Kansas City	Montreal Impact	71.255427	Sporting Kansas City	Sporting Kansas City	3
49	Real Madrid	Real Sociedad	70.565179	Real Madrid	Real Madrid	3
59	Real Betis	Levante	57.020318	Real Betis	Real Betis	3
65	Seattle Sounders FC	Montreal Impact	53.362012	Seattle Sounders FC	Seattle Sounders FC	3

In [9]:

print '\nIncorrect predictions:'
predictions[(predictions['predicted'] > 50) & (predictions['points'] < 3)][:5]

Incorrect predictions:

Out[9]:

	team_name	op_team_name	predicted	expected	winner
8	Sporting Kansas City	D.C. United	52.268257	Sporting Kansas City	D.C. United
17	Celta de Vigo	Valencia CF	53.402876	Celta de Vigo	Valencia CF
19	Real Madrid	Celta de Vigo	69.646704	Real Madrid	Celta de Vigo
28	Atlético de Madrid	Levante	63.517874	Atlético de Madrid	Levante
29	LA Galaxy	Colorado Rapids	55.278595	LA Galaxy	Colorado Rapids
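
The expected column appears to name the home team when predicted is above 50 and the away team otherwise. A sketch of that rule, assuming a strict 50% cutoff (the actual logic lives in world_cup.extract_predictions and is not shown here):

```python
def expected_winner(team_name, op_team_name, predicted):
    """Pick the team we expect to win from the home team's win percentage."""
    # Assumption: exactly 50 goes to the away team, mirroring the
    # `predicted > 50` filters used in the cells above.
    return team_name if predicted > 50.0 else op_team_name


# Rows taken from the outputs in this notebook.
home_pick = expected_winner('Real Madrid', 'Real Sociedad', 70.565179)
away_pick = expected_winner('Netherlands', 'Argentina', 45.187866)
```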

Validating our predictions¶

Next, we want to actually quantify how good our predictions are. We can compute the lift ("How much better are we doing than random chance?"), AUC (the area under the ROC curve) and plot the ROC curve. AUC is arguably the most interesting number; it ranges between 0.5 (your model is no better than dumb luck) and 1.0 (perfect prediction).

In [10]:

import pylab as pl
# Compute a baseline, which is the percentage of overall outcomes that are actually wins.
# (remember in soccer we can have draws too).
baseline = (sum([yval == 3 for yval in club_data['points']]) 
            * 1.0 / len(club_data))
y = [yval == 3 for yval in test['points']]
world_cup.validate(3, y, results['predicted'], baseline, 
                   compute_auc=True)
pl.show()

(3) Lift: 1.45 Auc: 0.745
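
For intuition about AUC: it equals the probability that a randomly chosen win gets a higher predicted score than a randomly chosen non-win, with ties counted as half. The brute-force sketch below computes it that way on made-up data; world_cup.validate presumably uses a library routine instead, which is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every (win, non-win) pair."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the actual win was ranked higher
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up example: three wins and two non-wins.
labels = [True, True, False, True, False]
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(labels, scores)  # 5 of the 6 pairs are ordered correctly
```

The pairwise loop is quadratic, so it is only meant for small examples like this one.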

Need.... more .... power!¶

One thing that is missing, if you're predicting the next game based on the previous few games, is that some teams may have just played a really tough schedule, while other teams have played against much weaker competition.

We can solve for schedule difficulty by running another regression; this one computes a power ranking, similar to the FIFA/Coca-Cola World Ranking for international soccer teams (there are power rankings for other sports, like American college football, that may be familiar).

Once we compute the power ranking (which creates a stack ranking of all of the teams), we can add that power ranking as a feature to our model, then rebuild it and re-validate it. The regression essentially automates the process of looking at relationships like "Well, team A beat team B and team B beat team C, so A is probably better than C".

The output here will show the power ranking for various teams. This can be useful to spot check the ranking, since if we rank Wigan at 1.0 and Chelsea at 0.0, something is likely wrong.

Note that because there isn't a strict ordering to the data (if team A beats team B and team B beats team C, sometimes team C will then beat team A) we sometimes fail to assign ordering to all of the teams (especially where the data is sparse). For teams that we can't rank, we put them in the middle (0.5).

Additionally, because the rankings for international teams are noisy and sparse, we chunk the rankings into quartiles. So teams that have been ranked will show up as 0, .33, .66, or 1.0.

Once we add this to the model, the performance generally improves significantly.
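
The quartile chunking and the 0.5 default for unranked teams can be sketched as follows. This is only an illustration: the real logic is in power.py, which is not shown here, so the helper names and the choice to bucket by rank rather than by raw value are assumptions. The sample values are a subset of the power numbers printed later for the World Cup teams.

```python
def chunk_power(power_by_team, levels=(0.0, 0.33, 0.66, 1.0)):
    """Replace each team's power value with the level of its rank quartile."""
    ranked = sorted(power_by_team, key=power_by_team.get)
    chunked = {}
    for position, team in enumerate(ranked):
        # Integer arithmetic maps positions 0..n-1 onto buckets 0..3.
        bucket = position * len(levels) // len(ranked)
        chunked[team] = levels[bucket]
    return chunked


def power_for(team, chunked, default=0.5):
    """Teams we could not rank are placed in the middle."""
    return chunked.get(team, default)


raw_power = {
    'Australia': 0.000, 'USA': 0.017, 'Nigeria': 0.204, 'Costa Rica': 0.254,
    'Mexico': 0.409, 'France': 0.648, 'Brazil': 0.898, 'Germany': 1.000,
}
chunked = chunk_power(raw_power)
unranked = power_for('Iran', chunked)  # not in raw_power, so it gets 0.5
```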

In [11]:

import math
import power
reload(power)
reload(world_cup)
def points_to_sgn(p):
  if p > 0.1: return 1.0
  elif p < -0.1: return -1.0
  else: return 0.0
power_cols = [
  ('points', points_to_sgn, 'points'),
]

power_data = power.add_power(club_data, game_summaries, power_cols)
power_train = power_data.loc[power_data['points'] != 1]

# power_train = power_data
(power_model, power_test) = world_cup.train_model(
    power_train, match_stats.get_non_feature_columns())
print "\nRsquared: %0.03g, Power Coef %0.03g" % (
    power_model.prsquared, 
    math.exp(power_model.params['power_points']))

power_results = world_cup.predict_model(power_model, power_test, 
    match_stats.get_non_feature_columns())
power_y = [yval == 3 for yval in power_test['points']]
world_cup.validate(3, power_y, power_results['predicted'], baseline, 
                   compute_auc=True, quiet=False)

pl.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
# Add the old model to the graph
world_cup.validate('old', y, results['predicted'], baseline, 
                   compute_auc=True, quiet=True)
pl.legend(loc="lower right")
pl.show()

print_params(power_model, 8)

New season 2014
New season 2013
QC check did not pass for 19 out of 20 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Could not trim params automatically due to failed QC check.  Trimming using trim_mode == 'size' will still work.
New season 2013
New season 2012
QC check did not pass for 24 out of 24 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Could not trim params automatically due to failed QC check.  Trimming using trim_mode == 'size' will still work.
New season 2012
New season 2011
QC check did not pass for 24 out of 24 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Could not trim params automatically due to failed QC check.  Trimming using trim_mode == 'size' will still work.
[u'Blackburn Rovers: 0.000', u'Real Betis: 0.000', u'D.C. United: 0.000', u'Celta de Vigo: 0.004', u'Deportivo de La Coru\xf1a: 0.009', u'Wolverhampton Wanderers: 0.021', u'Reading: 0.022', u'Real Zaragoza: 0.026', u'Real Valladolid: 0.044', u'Granada CF: 0.062', u'Queens Park Rangers: 0.073', u'Mallorca: 0.089', u'Aston Villa: 0.092', u'Bolton Wanderers: 0.102', u'Osasuna: 0.109', u'Espanyol: 0.112', u'Wigan Athletic: 0.124', u'Sunderland: 0.130', u'Rayo Vallecano: 0.138', u'Almer\xeda: 0.145', u'Levante: 0.148', u'Elche: 0.154', u'Getafe: 0.170', u'Swansea City: 0.192', u'Southampton: 0.197', u'Norwich City: 0.206', u'Toronto FC: 0.211', u'Chivas USA: 0.218', u'West Ham United: 0.220', u'West Bromwich Albion: 0.224', u'Villarreal: 0.231', u'Stoke City: 0.255', u'Fulham: 0.274', u'Valencia: 0.296', u'Valencia CF: 0.296', u'M\xe1laga: 0.305', u'Newcastle United: 0.342', u'Sevilla: 0.365', u'Columbus Crew: 0.366', u'Athletic Club: 0.386', u'Liverpool: 0.397', u'Everton: 0.417', u'Philadelphia Union: 0.466', u'Montreal Impact: 0.470', u'Chelsea: 0.530', u'Real Sociedad: 0.535', u'Tottenham Hotspur: 0.551', u'Arsenal: 0.592', u'Houston Dynamo: 0.593', u'FC Dallas: 0.612', u'Chicago Fire: 0.612', u'Vancouver Whitecaps: 0.615', u'San Jose Earthquakes: 0.632', u'New England Revolution: 0.634', u'Atl\xe9tico de Madrid: 0.672', u'Colorado Rapids: 0.743', u'Barcelona: 0.759', u'Seattle Sounders FC: 0.781', u'New York Red Bulls: 0.814', u'Sporting Kansas City: 0.854', u'LA Galaxy: 0.882', u'Real Salt Lake: 0.922', u'Manchester City: 0.928', u'Real Madrid: 1.000', u'Manchester United: 1.000', u'Portland Timbers: 1.000']

Rsquared: 0.238, Power Coef 2.22
(3) Lift: 1.48 Auc: 0.762
    Base: 0.375 Acc: 0.682 P(1|t): 0.742 P(0|f): 0.646
    Fp/Fn/Tp/Tn p/n/c: 100/228/288/416 516/516/1032
(old) Lift: 1.45 Auc: 0.745

Positive features
power_points      1.222950
is_home           0.692184
pass_70           0.178619
op_passes         0.140863
fouls             0.138612
opp_op_corners    0.122122
opp_avg_points    0.055252
opp_op_fouls      0.039738
dtype: float64

Dropped features
avg_goals            0
op_bad_passes        0
corners              0
op_shots             0
op_cards             0
opp_pass_op_ratio    0
pass_ratio           0
passes               0
dtype: float64

Negative features
opp_power_points   -0.550147
opp_pass_70        -0.151549
opp_op_passes      -0.123470
opp_fouls          -0.121738
op_corners         -0.108831
avg_points         -0.052359
op_fouls           -0.038220
bad_passes         -0.028956
dtype: float64
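
Several of the summary numbers printed by world_cup.validate can be reproduced from the Fp/Fn/Tp/Tn counts in the output above. The sketch below does the arithmetic. Reading P(1|t) as precision and P(0|f) as the negative predictive value is an inference from the numbers, and the lift formula shown is just one reading that reproduces the printed 1.48; the real definition is in world_cup.py.

```python
# Counts from the "Fp/Fn/Tp/Tn" line of the power model's output.
fp, fn, tp, tn = 100, 228, 288, 416

total = fp + fn + tp + tn          # 1032 test games
actual_wins = tp + fn              # 516
actual_losses = tn + fp            # 516

accuracy = (tp + tn) / float(total)               # about 0.682, printed as Acc
precision = tp / float(tp + fp)                   # about 0.742, printed as P(1|t)
negative_predictive_value = tn / float(tn + fn)   # about 0.646, printed as P(0|f)

# Assumption: lift compares precision with the share of wins in the test set.
test_win_rate = actual_wins / float(total)        # 0.5, since draws were dropped
lift = precision / test_win_rate                  # about 1.48
```

The test set has a 50% win rate because drawn games were removed before training, which is why it differs from the 0.375 baseline computed over all club games.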

On to the world cup!¶

Now that we've got a model that we like, let's look at predicting the world cup. We can build the same statistics (features) for the world cup games that we did for the club games. In this case, however, we don't have the targets; that is, we don't know who won (for some of the previous games, we do know who won, but let's predict them all equally as if we didn't know).

features.get_wc_features() will build and return features from the World Cup games.

In [12]:

import world_cup
import features
reload(match_stats)
reload(features)
reload(world_cup)

wc_data = world_cup.prepare_data(features.get_wc_features(history_size))
wc_labeled = world_cup.prepare_data(features.get_features(history_size))
wc_labeled = wc_labeled[wc_labeled['competitionid'] == 4]
wc_power_train = game_summaries[game_summaries['competitionid'] == 4].copy()

Waiting on bqjob_r771c340a8483b8a6_0000014722cd55df_4 ... (0s) Current status: DONE   
Waiting on bqjob_r5df9ca3d043b572b_0000014722cd5dbe_5 ... (0s) Current status: DONE

Predicting the world cup¶

Once we have the model and the features, we can start predicting.

Home Team Advantage¶

There are a couple of differences between the World Cup and club data. For one, while home team advantage is important in club games, who is really at home? Is it only Brazil? What about other South American teams? Some models give the 'is home' status to only Brazil; others give partial status to other teams from the same continent, since historical data shows that teams from the host's continent tend to outperform.

We use a slightly modified model that is, however, somewhat subjective. We assign a value to is_home between 0.0 and 1.0 depending on the fan support (both numbers and enthusiasm) that a team enjoys. This is a result of noticing, in the early rounds, that the teams that had the more enthusiastic supporters did better. For example, Chile's fans were deafening in support of their team, but Spain's fans barely showed up (Chile upset Spain 2-0). There were a number of other cases like this; many involving South American sides, but many involving other teams that had sent a lot of supporters (Mexico, for example). Some teams, like the USA, had a lot of fans, but they were more reserved, so they got a lower score. This factor was set based on first-hand reports from the group games.

In [13]:

import pandas as pd
wc_home = pd.read_csv('wc_home.csv')

def add_home_override(df, home_map):
  for ii in xrange(len(df)):
    team = df.iloc[ii]['teamid']
    if team in home_map:
        df['is_home'].iloc[ii] = home_map[team]
    else:
        # If we don't know, assume not at home.
        df['is_home'].iloc[ii] = 0.0
        
home_override = {}
for ii in xrange(len(wc_home)):
    row = wc_home.iloc[ii]
    home_override[row['teamid']] = row['is_home']

# Add home team overrides.
add_home_override(wc_data, home_override)    

World Cup Power Rankings¶

The lattice of teams playing each other in the World Cup is pretty sparse. Many teams haven't played each other for decades. Many European teams rarely play South American ones, and even more rarely play Asian ones. We can use the same technique as we did for the club games, but we have to be prepared for failure.

We'll output the power rankings from the previous games. We should eyeball them to make sure they make sense.

In [14]:

# When training power data, since the games span multiple competitions, just set is_home to 0.5
# Otherwise when we looked at games from the 2010 world cup, we'd think Brazil was still at
# home instead of South Africa.
wc_power_train['is_home'] = 0.5
wc_power_data = power.add_power(wc_data, wc_power_train, power_cols)

wc_results = world_cup.predict_model(power_model, wc_power_data, 
    match_stats.get_non_feature_columns())

New season 2013
New season 2009
New season 6
QC check did not pass for 45 out of 50 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Could not trim params automatically due to failed QC check.  Trimming using trim_mode == 'size' will still work.
[u'Australia: 0.000', u'USA: 0.017', u'Nigeria: 0.204', u"C\xf4te d'Ivoire: 0.244", u'Costa Rica: 0.254', u'Algeria: 0.267', u'Paraguay: 0.277', u'Greece: 0.284', u'Switzerland: 0.291', u'Ecuador: 0.342', u'Uruguay: 0.367', u'Japan: 0.406', u'Mexico: 0.409', u'Chile: 0.413', u'England: 0.460', u'Portugal: 0.487', u'Ghana: 0.519', u'France: 0.648', u'Spain: 0.736', u'Argentina: 0.793', u'Italy: 0.798', u'Brazil: 0.898', u'Netherlands: 0.918', u'Germany: 1.000']

Predicting games¶

Now's the moment we've been waiting for. Let's predict some World Cup games. Let's start with predicting the ones that have already happened.

We will output 4 columns:

team_name: Team we're predicting
op_team_name: Team that the team we're predicting is playing against
predicted: Percentage chance (we believe) that the team will win.
points: If the game has been played, what actually happened (if the game hasn't been played, we'll show a NaN here). 3 points is a win, 1 point is a draw, 0 points is a loss. Note that games in the knockout phase that went to penalty kicks are marked as a draw.

But wait! These predictions are different from the ones you published!

There are three reasons why the prediction numbers might be different from the numbers you may have seen as published predictions:

We've updated our code several times to fix bugs and improve accuracy. Our original model, for example, didn't account for extra time causing inflated statistics.
Model building is non-deterministic. Since we pick a random subset of the data to use as our training set, the results will change from run to run. Sometimes fairly significantly.
When we predicted the round of 16, we used the trailing 3 games to predict (since each team had played 3 games in the current world cup). For the quarterfinals, we used the trailing 4 games; for the semis, 5, and for the finals, we used all 6. The code below will predict based on the last 6 games; for many teams, we don't have 6 games of history, and even if we do, that history will be from previous world cups. To see a more apples-to-apples comparison, set the history_size variable to 3 and rerun the notebook.
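
On the second point: if you want repeatable numbers, one option is to seed the random split. How world_cup.train_model picks its training subset is not shown here, so the sketch below is a generic illustration with a hypothetical split_train_test helper and an arbitrary 50/50 split.

```python
import random


def split_train_test(rows, test_fraction=0.5, seed=None):
    """Shuffle rows and split them; the same seed always gives the same split."""
    rng = random.Random(seed)   # private generator, leaves global state alone
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Two runs with the same seed produce identical train and test sets.
train_a, test_a = split_train_test(range(10), seed=42)
train_b, test_b = split_train_test(range(10), seed=42)
```

With seed=None the generator is seeded from system entropy, so the split (and therefore the fitted model) will generally differ from run to run, which matches the behavior described above.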

In [15]:

pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

wc_with_points = wc_power_data.copy()
wc_with_points.index = pd.Index(
    zip(wc_with_points['matchid'], wc_with_points['teamid']))
wc_labeled.index = pd.Index(
    zip(wc_labeled['matchid'], wc_labeled['teamid']))
wc_with_points['points'] = wc_labeled['points']

wc_pred = world_cup.extract_predictions(wc_with_points, 
                                        wc_results['predicted'])

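# reindex returns a reversed copy rather than modifying wc_pred in place.
# The result is not assigned here, so wc_pred keeps its original order
# (most recent game first, as shown in the output below).
wc_pred.reindex(index=wc_pred.index[::-1])
# Show our predictions for the games that have already happened.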
wc_pred[wc_pred['points'] >= 0.0]

Out[15]:

	team_name	op_team_name	predicted	expected	winner	points
2	Netherlands	Argentina	45.187866	Argentina	draw	1
3	Germany	Brazil	46.087340	Brazil	Germany	3
4	Costa Rica	Netherlands	20.699193	Netherlands	draw	1
5	Germany	France	70.918295	Germany	Germany	3
6	Switzerland	Argentina	13.489369	Argentina	Argentina	0
7	Algeria	Germany	5.281820	Germany	Germany	0
8	Nigeria	France	6.469553	France	France	0
9	Greece	Costa Rica	44.465888	Costa Rica	draw	1
10	Mexico	Netherlands	25.852847	Netherlands	Netherlands	0
11	Chile	Brazil	29.159847	Brazil	draw	1
12	Germany	USA	92.599808	Germany	Germany	3
13	Ghana	Portugal	49.394952	Portugal	Portugal	0
14	France	Ecuador	84.520092	France	draw	1
15	Uruguay	Italy	24.142942	Italy	Uruguay	3
16	Spain	Australia	92.766946	Spain	Spain	3
17	Chile	Netherlands	36.429691	Netherlands	Netherlands	0
18	Portugal	USA	73.875672	Portugal	draw	1
19	Ghana	Germany	14.933490	Germany	draw	1
20	France	Switzerland	84.456472	France	France	3
21	England	Uruguay	61.402920	England	Uruguay	0
22	Netherlands	Australia	90.897030	Netherlands	Netherlands	3
23	Mexico	Brazil	18.745963	Brazil	draw	1
24	USA	Ghana	29.331909	Ghana	USA	3
25	Portugal	Germany	15.955012	Germany	Germany	0
26	Japan	Côte d'Ivoire	41.934990	Côte d'Ivoire	Côte d'Ivoire	0
27	Italy	England	73.812329	Italy	Italy	3
28	Netherlands	Spain	49.682892	Spain	Netherlands	3
29	Spain	Netherlands	53.295771	Spain	Spain	3
30	Germany	Uruguay	78.982270	Germany	Germany	3
31	Spain	Germany	41.830024	Germany	Spain	3
32	Spain	Paraguay	89.878445	Spain	Spain	3
33	Germany	Argentina	45.202720	Argentina	Germany	3
34	Brazil	Netherlands	63.908731	Brazil	Netherlands	0
35	Portugal	Spain	20.960764	Spain	Spain	0
36	Japan	Paraguay	63.519259	Japan	draw	1
37	Mexico	Argentina	27.152356	Argentina	Argentina	0
38	England	Germany	17.248992	Germany	Germany	0
39	Ghana	USA	70.933824	Ghana	Ghana	3
40	Brazil	Portugal	88.601369	Brazil	draw	1
41	Germany	Ghana	89.953475	Germany	Germany	3
42	France	Italy	39.809571	Italy	draw	1
43	Portugal	Germany	12.435939	Germany	Germany	0
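
A rough way to score the table above is to ask how often the model's pick matched the actual winner. The sketch below is an illustration, not part of the original notebook. It retypes a handful of rows by hand, assumes `predicted` is the win chance (in percent) for `team_name`, and assumes `expected` is simply whichever side that number favors, which is consistent with every row shown. The names `sample`, `decided`, and `hit_rate` are made up for this example.

```python
import pandas as pd

# A few rows copied by hand from the table above (illustrative subset only).
rows = [
    ("Netherlands", "Argentina", 45.187866, "draw"),
    ("Germany", "Brazil", 46.087340, "Germany"),
    ("Germany", "France", 70.918295, "Germany"),
    ("Switzerland", "Argentina", 13.489369, "Argentina"),
    ("Uruguay", "Italy", 24.142942, "Uruguay"),
    ("Spain", "Australia", 92.766946, "Spain"),
]
sample = pd.DataFrame(
    rows, columns=["team_name", "op_team_name", "predicted", "winner"])

# Assumed rule: pick team_name when its predicted win chance exceeds 50%,
# otherwise pick the opponent.
favors_team = sample["predicted"] > 50.0
sample["expected"] = sample["team_name"].where(
    favors_team, sample["op_team_name"])

# The expected column always names a team, so drawn games are set aside.
decided = sample[sample["winner"] != "draw"]
hit_rate = (decided["expected"] == decided["winner"]).mean()
print(f"{len(decided)} decided games, hit rate {hit_rate:.2f}")
```

Running the same tally over the full `wc_pred` frame, rather than this six-row subset, would give the figure that actually matters.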

Let's look at the stats for the two teams in the final. We can compare them by eye to see which one we think will win:

In [16]:

final = wc_power_data[wc_power_data['matchid'] == '731830']
final

Out[16]:

	matchid	teamid	op_teamid	competitionid	seasonid	is_home	team_name	op_team_name	timestamp	avg_points	avg_goals	op_avg_goals	pass_70	pass_80	op_pass_70	op_pass_80	expected_goals	op_expected_goals	passes	bad_passes	pass_ratio	corners	fouls	cards	shots	op_passes	op_bad_passes	op_corners	op_fouls	op_cards	op_shots	goals_op_ratio	shots_op_ratio	pass_op_ratio	power_points
0	731830	632	357	4	2013	0.7	Argentina	Germany	2014-07-13 20:00:00.000000	2.666667	1.333333	0.500000	0.531302	0.173338	0.410517	0.141505	1.27091	0.703133	4.643395	0.954973	0.823198	0.070890	0.162126	1.000000	0.163520	3.179610	0.924281	0.047631	0.096118	1.833333	0.118528	1.083333	1.475854	1.110942	0.792664
1	731830	357	632	4	2013	0.2	Germany	Argentina	2014-07-13 20:00:00.000000	2.666667	2.833333	0.666667	0.808822	0.248188	0.427337	0.166366	2.11032	1.000115	5.643403	1.030747	0.837985	0.053194	0.152780	0.666667	0.147986	3.623266	1.008787	0.047043	0.122868	1.166667	0.132571	2.666667	1.525363	1.086513	1.000000
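
With this many columns per row, the two finalists are easier to compare side by side. A minimal sketch, using a hand-copied subset of the columns above instead of the real `wc_power_data` frame; `final_subset` and `side_by_side` are names invented for the example:

```python
import pandas as pd

# Hand-copied subset of the two rows above; the real frame has many more columns.
final_subset = pd.DataFrame({
    "team_name": ["Argentina", "Germany"],
    "avg_points": [2.666667, 2.666667],
    "avg_goals": [1.333333, 2.833333],
    "op_avg_goals": [0.500000, 0.666667],
    "expected_goals": [1.27091, 2.11032],
    "power_points": [0.792664, 1.000000],
})

# Transpose so each team becomes a column and each statistic a row.
side_by_side = final_subset.set_index("team_name").T

# Positive values mean Germany has the higher number for that statistic.
side_by_side["Germany - Argentina"] = (
    side_by_side["Germany"] - side_by_side["Argentina"])
print(side_by_side)
```

In the notebook itself, something like `final.set_index('team_name').T` should give the same layout for every column, though non-numeric columns would need to be dropped before taking differences.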

Now let's look at the individual games behind those aggregate stats:

In [17]:

op = game_summaries

def countryStats(d, name):
  # Keep only the rows where the given country is the primary team.
  pred = d['team_name'] == name
  return d[pred]

fr = countryStats(op, 'France')
ge = countryStats(op, 'Germany')
ar = countryStats(op, 'Argentina')
br = countryStats(op, 'Brazil')
ne = countryStats(op, 'Netherlands')
ge[:6]  # Germany's six most recent games

Out[17]:

	matchid	teamid	passes	bad_passes	pass_ratio	corners	fouls	cards	goals	shots	is_home	team_name	pass_80	pass_70	expected_goals	on_target	length	op_teamid	op_passes	op_bad_passes	op_pass_ratio	op_corners	op_fouls	op_cards	op_goals	op_shots	op_team_name	op_pass_80	op_pass_70	op_expected_goals	op_on_target	competitionid	seasonid	shots_op_ratio	goals_op_ratio	pass_op_ratio	points	timestamp
3	731827	357	5.098901	0.989011	0.836036	0.054945	0.120879	0	7	0.153846	0	Germany	0.384615	0.912088	3.358086	0.109890	91	614	4.494505	0.934066	0.826263	0.076923	0.153846	1	1	0.197802	Brazil	0.318681	0.736264	1.846013	0.087912	4	2013	0.777778	7	1.011828	3	1404859864586
29	731824	357	3.670213	1.223404	0.748373	0.031915	0.159574	2	1	0.095745	0	Germany	0.095745	0.489362	0.937786	0.031915	94	368	3.627660	1.170213	0.754425	0.053191	0.191489	0	0	0.138298	France	0.234043	0.382979	1.199728	0.053191	4	2013	0.692308	1	0.991978	3	1404500184917
38	731820	357	6.000000	1.147541	0.838488	0.081967	0.163934	1	2	0.229508	1	Germany	0.213115	0.713115	3.414730	0.098361	122	1215	2.467213	1.254098	0.661538	0.032787	0.090164	1	1	0.081967	Algeria	0.057377	0.172131	1.044178	0.032787	4	2013	2.800000	2	1.267482	3	1404171338462
74	731811	357	7.574468	0.861702	0.896725	0.031915	0.159574	1	1	0.138298	0	Germany	0.308511	1.180851	1.156981	0.063830	94	596	3.526882	0.795699	0.813896	0.021505	0.096774	2	0	0.043011	USA	0.118280	0.376344	0.000000	0.000000	4	2013	3.215426	1	1.101769	3	1403808875580
111	731795	357	5.989362	1.127660	0.840299	0.074468	0.180851	0	2	0.127660	1	Germany	0.223404	0.755319	1.511423	0.042553	94	1219	3.425532	1.074468	0.759434	0.031915	0.117021	1	2	0.191489	Ghana	0.148936	0.478723	1.337382	0.063830	4	2013	0.666667	1	1.106480	1	1403387662729
143	731779	357	5.527473	0.835165	0.867990	0.043956	0.131868	0	4	0.142857	1	Germany	0.263736	0.802198	2.282916	0.054945	91	359	4.197802	0.824176	0.834973	0.065934	0.087912	2	0	0.142857	Portugal	0.120879	0.417582	0.573390	0.032967	4	2013	1.000000	4	1.039543	3	1402944761781
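
The per-game rows above line up with the aggregate stats shown for the final: averaging Germany's `points`, `goals`, and `op_goals` over these six games reproduces the `avg_points`, `avg_goals`, and `op_avg_goals` values in Germany's row of the earlier table. The sketch below checks that arithmetic on hand-copied numbers. It assumes a plain, unweighted mean over a six-game window; the `features` module may compute its features differently, and `germany_recent` and `trailing` are names made up for this example.

```python
import pandas as pd

# Germany's six most recent games, copied by hand from the table above.
germany_recent = pd.DataFrame({
    "op_team_name": ["Brazil", "France", "Algeria", "USA", "Ghana", "Portugal"],
    "goals":    [7, 1, 2, 1, 2, 4],
    "op_goals": [1, 0, 1, 0, 2, 0],
    "points":   [3, 3, 3, 3, 1, 3],
})

# Unweighted means over the window (an assumption about how the
# aggregates were built, checked only against the numbers shown).
trailing = germany_recent[["points", "goals", "op_goals"]].mean()
print(trailing.round(6))
# points      2.666667
# goals       2.833333
# op_goals    0.666667
```

In the notebook, the equivalent check would be along the lines of `ge[:6][['points', 'goals', 'op_goals']].mean()`.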

OK, now that we've looked at the data every which way, let's predict the results of the two remaining games, the final and the third-place match:

In [18]:

wc_pred[~(wc_pred['points'] >= 0)][[
    'team_name', 'op_team_name', 'predicted']]

Out[18]:

	team_name	op_team_name	predicted
0	Argentina	Germany	43.224980
1	Netherlands	Brazil	37.168067
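
The filter in the cell above presumably works because unplayed games have no `points` value yet: any comparison with NaN is False, so negating `points >= 0` keeps exactly the rows that are missing a result. The sketch below shows the same selection written with `isna()` on a small hypothetical frame (`demo`, with rows copied from this notebook's output). It also derives the pick implied by each number, assuming `predicted` is the win chance in percent for `team_name`.

```python
import pandas as pd

demo = pd.DataFrame({
    "team_name": ["Argentina", "Netherlands", "Netherlands"],
    "op_team_name": ["Germany", "Brazil", "Argentina"],
    "predicted": [43.224980, 37.168067, 45.187866],
    # NaN marks a game with no result yet (assumed encoding).
    "points": [float("nan"), float("nan"), 1.0],
})

unplayed_a = demo[~(demo["points"] >= 0)]  # the notebook's form
unplayed_b = demo[demo["points"].isna()]   # same rows when points is never negative
assert unplayed_a.equals(unplayed_b)

# Assumed rule: team_name is the pick only when its chance exceeds 50%.
picks = unplayed_b["team_name"].where(
    unplayed_b["predicted"] > 50.0, unplayed_b["op_team_name"])
print(picks.tolist())
```

Under that reading, both predictions above favor the opponent listed in `op_team_name`: Germany in the final and Brazil in the third-place match.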

In [18]: