Using Machine Learning To Pick Your Lottery Numbers

Disclaimer : I'm not a lottery player... This notebook will not make you rich, as on average, you lose money when you play.

The pitch

It is well known that people are not creative when they choose their lottery numbers : they pick their birth dates, draw particular shapes on the grid (lines, crosses, ...), etc.

This is not very smart. If you win and many other players chose the same numbers, the prize is split and you win less money (on average) :(

For example, in the USA on January 11th, 2011, there was an extraordinary number of winners. Why ? Because the drawn combination was 4 – 8 – 15 – 25 – 47 / 42. It turns out that these numbers are close to the ones that appear in an episode of the TV series "Lost".

In [100]:
from IPython.display import YouTubeVideo
YouTubeVideo("65FLP1ChetI")
Out[100]:

How to choose your lottery numbers ? "Not all combinations are born equal" [0]

To optimize your expected earnings, you should play combinations that are not chosen by any other player. By doing this, if you win, you are likely to win more.

However, the problem is to know in advance which numbers or combinations are rarely played. This information is known by the lottery organizers, but it is kept secret.
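To make the intuition concrete, here is a purely illustrative sketch (the jackpot amount and co-winner counts are hypothetical, not real lottery figures) of how sharing the prize dilutes the payout :

In [ ]:
# Hypothetical example : a 10 million jackpot shared among everyone who played the winning combination
jackpot = 10000000.0
for other_winners in [0, 1, 9, 178]:   # 178 : roughly the "1 2 3 4 5" crowd we will meet below
    print("co-winners : {:>3} -> my share : {:>10,.0f}".format(other_winners, jackpot / (1 + other_winners)))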

The goal of this notebook is to explore whether Machine Learning can help us discriminate "human generated" combinations from "machine generated" (a.k.a. random) combinations.

The most difficult thing about Big Data ? Getting the data

The data comes from OneWinner.Me, a crowdsourcing platform where players can submit their lottery combinations to check their originality. So we have in hand a set of several thousand "human generated" lottery combinations : this is the starting point of our study.

What you will learn in this notebook

In this notebook, we will follow a simplified, but typical Data Science process :

  • Data acquisition
  • Data ingestion / cleaning / wrangling (yes, "data hygienist" really is a job ...)
  • Data visualization / analysis
  • Feature engineering
  • Machine learning modeling
    • model building
    • model performance assessment
    • model deployment
  • Data story telling
    • presentation
    • dissemination

You will also discover the use of some essential Data Science tools :

  • pandas : the Python library for Data manipulation
  • scikit-learn : the Python library for Machine Learning
  • matplotlib : the Python Dataviz library
  • IPython Notebook : the web-based Python interactive computational environment

And you will see that Data Science can find applications in unexpected areas ... like gambling games !

Let's start the heavy lifting

"Machine Learning without learning the Machinery" [1]

STEP 1 : Import the Data

In [4]:
from __future__ import print_function
import pandas as pd
import numpy as np

# Read "human generated" combinations, and keep only numbers (not stars in this tutorial)
humandata = pd.read_csv("C:/mesfichies/OneWinner.Me/MachineLearning/challenge/OneWinnerMe-train.csv", sep=";")
humandata = humandata[['c1', 'c2', 'c3', 'c4', 'c5']]
humandatalen = humandata.shape[0]

# Create a set of random combinations, with the same size as the "human generated" dataset
randomdata = []
for i in range(humandatalen):
    randomdata.append(sorted(np.random.choice(50, 5, replace=False) + 1))
randomdata = np.array(randomdata)
randomdata = pd.DataFrame(randomdata, 
                          index = [i + humandatalen for i in range(humandatalen)], 
                          columns=['c1', 'c2', 'c3', 'c4', 'c5'])
randomdatalen = randomdata.shape[0]

# Concatenate the 2 datasets : human + random
alldata = humandata.append(randomdata)

print("Number of human euromillions combinations : ", humandatalen)
print("Number of random euromillions combinations : ", randomdatalen)
print(randomdata.head(10))
Number of human euromillions combinations :  19437
Number of random euromillions combinations :  19437
       c1  c2  c3  c4  c5
19437   2   7   8  19  42
19438   3   7  16  41  46
19439   6   7  20  31  34
19440   8  44  47  48  50
19441   9  12  22  29  38
19442  26  33  36  45  50
19443   1  18  22  23  40
19444  23  24  39  44  47
19445   9  19  27  31  40
19446  16  35  37  41  47

[10 rows x 5 columns]

STEP 2 : Get some insights

"Regardless of the amount of data we have, we only have two eyeballs and one brain" [2]

First stats

In [5]:
print(humandata.describe())
print(randomdata.describe())
                 c1            c2            c3            c4            c5
count  19437.000000  19437.000000  19437.000000  19437.000000  19437.000000
mean       9.415548     17.296445     25.230643     33.295673     41.225498
std        8.453846      9.282092      9.614768      9.589071      8.803703
min        1.000000      2.000000      3.000000      4.000000      5.000000
25%        3.000000     10.000000     18.000000     28.000000     38.000000
50%        7.000000     16.000000     25.000000     35.000000     44.000000
75%       13.000000     23.000000     32.000000     41.000000     48.000000
max       46.000000     47.000000     48.000000     49.000000     50.000000

[8 rows x 5 columns]
                 c1            c2            c3            c4            c5
count  19437.000000  19437.000000  19437.000000  19437.000000  19437.000000
mean       8.521685     16.974482     25.471163     33.993518     42.498945
std        6.741020      8.527866      9.062766      8.515633      6.733166
min        1.000000      2.000000      3.000000      6.000000      7.000000
25%        3.000000     10.000000     19.000000     28.000000     39.000000
50%        7.000000     16.000000     26.000000     35.000000     44.000000
75%       12.000000     23.000000     32.000000     41.000000     48.000000
max       44.000000     46.000000     48.000000     49.000000     50.000000

[8 rows x 5 columns]

From the stats above, we see that the statistical properties of the two datasets (human and random) are slightly different.
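To check that this difference is more than noise, we can run a two-sample Kolmogorov-Smirnov test on each sorted position. This is a minimal sketch, assuming scipy is installed, and reusing the humandata and randomdata DataFrames from above :

In [ ]:
from scipy.stats import ks_2samp

# Compare the distribution of each column (c1 ... c5) between the two datasets.
# A small p-value suggests the human and random values are not drawn from the same distribution.
for col in ['c1', 'c2', 'c3', 'c4', 'c5']:
    stat, pvalue = ks_2samp(humandata[col].values, randomdata[col].values)
    print("{} : KS statistic = {:.3f}, p-value = {:.3g}".format(col, stat, pvalue))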

Which combinations are played, and how many times ?

In [6]:
humangroup = humandata.groupby(['c1', 'c2', 'c3', 'c4', 'c5'])
groupsize = humangroup.size(); groupsize.sort(ascending=False)
print(groupsize)
c1  c2  c3  c4  c5
1   2   3   4   5     179
46  47  48  49  50     89
15  23  28  30  43     54
1   4   5   8   13     38
11  26  30  34  44     35
2   20  26  31  48     30
1   5   28  46  50     29
41  42  43  44  45     28
1   5   23  46  50     27
6   7   8   9   10     26
1   7   13  19  25     22
36  37  38  49  50     20
26  27  28  29  30     20
31  32  33  34  35     20
36  37  38  39  40     19
...
10  19  30  42  46    1
        29  40  49    1
            38  43    1
        28  33  47    1
            31  41    1
        27  34  50    1
            33  36    1
            29  34    1
        26  38  44    1
        24  42  46    1
                45    1
            41  45    1
            38  45    1
            36  45    1
7   14  23  33  45    1
Length: 16525, dtype: int64

The "1 2 3 4 5" combination is really a rock star .... The second most played combination is "46 47 48 49 50". Note also "1 7 13 19 25", wich is a diagonal on the grid

What are the most clicked numbers ?

In [7]:
import matplotlib
import matplotlib.pyplot as plt

freqs = []
for val in range(50):
    count = ((humandata['c1'] == val+1).sum() + 
             (humandata['c2'] == val+1).sum() + 
             (humandata['c3'] == val+1).sum() + 
             (humandata['c4'] == val+1).sum() + 
             (humandata['c5'] == val+1).sum())
    freqs.append(count)

ax = plt.gca(); ax.invert_yaxis()
plt.gcf().set_size_inches(4, 5)
heatmap = plt.pcolor(np.reshape(np.array(freqs), (10, 5)), cmap=matplotlib.cm.Blues)
In [8]:
from IPython.core.display import Image 
Image(filename='C:/mesfichies/OneWinner.Me/MachineLearning/grid.JPG') 
Out[8]:

From the heatmap above, we can see that people like the center of the grid, and fear the sides ...
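We can put a number on this preference by comparing the average click count of border numbers against interior numbers. A minimal sketch, reusing the freqs list computed above; the border set below is derived from the 10 x 5 grid layout (it matches the edge list defined later in this notebook) :

In [ ]:
# Border of the 10-row x 5-column grid : first / last rows and first / last columns
border = set(range(1, 6)) | set(range(46, 51)) | set(range(1, 51, 5)) | set(range(5, 51, 5))
interior = set(range(1, 51)) - border

freqs_arr = np.array(freqs)
print("average clicks on a border number    : {:.0f}".format(freqs_arr[[n - 1 for n in border]].mean()))
print("average clicks on an interior number : {:.0f}".format(freqs_arr[[n - 1 for n in interior]].mean()))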

STEP 3 : Create the features

"If people could see in high dimensions, machine learning would not be necessary" [3]

- Features #1, #2, #3, #4, #5 : simply the c1, ... c5 numbers

- Features #6 and #7 : how many of the combination's numbers are under 31 and under 12 (a natural proxy for birth dates : days go up to 31 and months up to 12)

In [9]:
def is_under(df, c):
    return ((df['c1'] <= c).astype(int) + 
            (df['c2'] <= c).astype(int) + 
            (df['c3'] <= c).astype(int) + 
            (df['c4'] <= c).astype(int) + 
            (df['c5'] <= c).astype(int))

usedfeatures = []
alldata['nb_under_31'] = is_under(alldata, 31)
alldata['nb_under_12'] = is_under(alldata, 12)
print("it works !")
it works !
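Before moving on, a quick sanity check (a hedged sketch reusing the columns just added) of whether the birth-date intuition actually shows up in the data :

In [ ]:
# Compare the average count of small numbers between human and random combinations
for col in ['nb_under_31', 'nb_under_12']:
    print("{} -> human mean : {:.2f}, random mean : {:.2f}".format(
        col, alldata[col][humandata.index].mean(), alldata[col][randomdata.index].mean()))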

- Feature #8 : how many numbers are on the grid edge

In [10]:
def nb_in_edge(df):
    return ((df['c1'].isin(edge)).astype(int) + 
            (df['c2'].isin(edge)).astype(int) + 
            (df['c3'].isin(edge)).astype(int) + 
            (df['c4'].isin(edge)).astype(int) + 
            (df['c5'].isin(edge)).astype(int))
edge = [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 46, 47, 48, 49, 6, 11, 16, 21, 26, 31, 36, 41]
alldata['nb_in_edge'] = nb_in_edge(alldata)
print("well done !")
well done !
In [11]:
import random
# Human
rowsh = random.sample(humandata.index, 1000)
# Random
rowsr = random.sample(randomdata.index, 1000)

xh=[i + random.uniform(-0.2, 0.2) for i in alldata.ix[rowsh]['nb_in_edge']]
yh=[i + random.uniform(-0.2, 0.2) for i in alldata.ix[rowsh]['nb_under_12']]

xr=[i + random.uniform(-0.2, 0.2) for i in alldata.ix[rowsr]['nb_in_edge']]
yr=[i + random.uniform(-0.2, 0.2) for i in alldata.ix[rowsr]['nb_under_12']]

plt.figure() # make separate figure

plt.subplot(1, 2, 1)   # creates the first subplot
plt.plot(xh, yh, 'gx'); plt.axis([-1, 6, -1, 6])
plt.title('Human'); plt.xlabel('nb_in_edge'); plt.ylabel('nb_under_12')

plt.subplot(1, 2, 2)   # creates the second subplot
plt.plot(xr, yr, 'rx'); plt.axis([-1, 6, -1, 6])
plt.title('Random'); plt.xlabel('nb_in_edge'); plt.ylabel('nb_under_12')

plt.subplots_adjust(right=2)

plt.show()

From the scatter plot above, we can see that the feature distributions are not the same : the human distribution is more spread out than the random one.

In [12]:
plt.title('Human'); plt.xlabel('nb_in_edge'); plt.axis([-4, 8, 0, 0.5])
alldata.ix[rowsh]['nb_in_edge'].hist(bins=6, alpha=0.6, color='g', normed=True)
alldata.ix[rowsh]['nb_in_edge'].plot(kind='kde', style='k--')
Out[12]:
<matplotlib.axes.AxesSubplot at 0x37ae110>
In [13]:
plt.title('Random'); plt.xlabel('nb_in_edge'); plt.axis([-4, 8, 0, 0.5])
alldata.ix[rowsr]['nb_in_edge'].hist(bins=6, alpha=0.6, color='r', normed=True)
alldata.ix[rowsr]['nb_in_edge'].plot(kind='kde', style='k--')
Out[13]:
<matplotlib.axes.AxesSubplot at 0x33f9d10>

Above, we can see that the random distribution is sharper (more peaked) than the human distribution.
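To back this up with a number, we can compare the spread of nb_in_edge in the two groups. A minimal sketch reusing the rowsh and rowsr samples drawn above :

In [ ]:
# Standard deviation of nb_in_edge : a larger value means a more stretched, less peaked distribution
print("std of nb_in_edge, human sample  : {:.2f}".format(alldata.ix[rowsh]['nb_in_edge'].std()))
print("std of nb_in_edge, random sample : {:.2f}".format(alldata.ix[rowsr]['nb_in_edge'].std()))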

- Features #9, #10, #11, #12, #13 : how many numbers are in columns 1, 2, 3, 4, 5

In [14]:
# Features #9 to #13 : how many numbers are in columns 1, 2, 3, 4, 5
col1 = [1 + 5*i for i in range(10)]
col2 = [2 + 5*i for i in range(10)]
col3 = [3 + 5*i for i in range(10)]
col4 = [4 + 5*i for i in range(10)]
col5 = [5 + 5*i for i in range(10)]

def nb_in_col(df, col):
    res = ((df['c1'].isin(col)).astype(int) + 
           (df['c2'].isin(col)).astype(int) +
           (df['c3'].isin(col)).astype(int) +
           (df['c4'].isin(col)).astype(int) +
           (df['c5'].isin(col)).astype(int))
    return res

alldata['nb_in_col1'] = nb_in_col(alldata, col1) 
alldata['nb_in_col2'] = nb_in_col(alldata, col2)
alldata['nb_in_col3'] = nb_in_col(alldata, col3)
alldata['nb_in_col4'] = nb_in_col(alldata, col4)
alldata['nb_in_col5'] = nb_in_col(alldata, col5)

print("so cool ...")
so cool ...

- Feature #14 : how many numbers are even

In [15]:
even = [2*i + 2 for i in range(25)]
def nb_even(df):
    return ((df['c1'].isin(even)).astype(int) + 
            (df['c2'].isin(even)).astype(int) +
            (df['c3'].isin(even)).astype(int) +
            (df['c4'].isin(even)).astype(int) +
            (df['c5'].isin(even)).astype(int))
alldata['nb_even'] = nb_even(alldata)

print("finished !")
finished !

- Feature #15 : how spread out the numbers are (sum of squared gaps between consecutive numbers)

In [16]:
def sum_diff(df):
    return ((df.c2 - df.c1)**2 +
            (df.c3 - df.c2)**2 +
            (df.c4 - df.c3)**2 +
            (df.c5 - df.c4)**2)
alldata['sum_diff'] = sum_diff(alldata)

print("done !")
done !
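To see what this feature captures, here is a quick worked example : a combination of consecutive numbers gets a very small sum_diff, while a widely spaced one gets a large value (the two combinations below are arbitrary illustrations).

In [ ]:
# Worked example : sum of squared gaps for a "tight" vs a "spread out" combination
examples = pd.DataFrame([[1, 2, 3, 4, 5],
                         [3, 14, 27, 38, 50]],
                        columns=['c1', 'c2', 'c3', 'c4', 'c5'])
print(sum_diff(examples))
# 1  2  3  4  5  -> 1^2 + 1^2 + 1^2 + 1^2 = 4
# 3 14 27 38 50  -> 11^2 + 13^2 + 11^2 + 12^2 = 555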

Now let's have a look at the features

In [17]:
alldata.head(10)
Out[17]:
c1 c2 c3 c4 c5 nb_under_31 nb_under_12 nb_in_edge nb_in_col1 nb_in_col2 nb_in_col3 nb_in_col4 nb_in_col5 nb_even sum_diff
0 5 10 15 20 25 5 2 5 0 0 0 0 5 2 100
1 10 15 20 25 30 5 1 5 0 0 0 0 5 3 100
2 15 20 25 30 35 4 0 5 0 0 0 0 5 2 100
3 30 35 40 45 50 1 0 5 0 0 0 0 5 3 100
4 30 34 38 42 46 1 0 2 1 1 1 1 1 5 64
5 26 32 38 44 50 1 0 2 1 1 1 1 1 5 144
6 1 2 3 4 5 5 5 5 1 1 1 1 1 2 4
7 6 7 8 9 10 5 5 2 1 1 1 1 1 3 4
8 11 12 13 14 15 5 2 2 1 1 1 1 1 2 4
9 16 17 18 19 20 5 0 2 1 1 1 1 1 3 4

10 rows × 15 columns

STEP 4 : Train the machine learning model

"Data Science is a simple game in which you throw data at a bunch of methods and in the end random forests always win" [4]

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

X = np.array(alldata)

# We will code with "0" the human combinations, and with "1" the random combinations
y = np.array([0] * humandatalen + [1] * randomdatalen)

# We split our data set in 2 parts :
# A "train set" (80% of the available data) : used to train the model
# A "test set" (20% of the available data) : used to evaluate the model performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# In this example we use a Random Forest model. Many others are available in the scikit-learn library
model = RandomForestClassifier(n_estimators = 100)

# We train the model on the "train set"
model.fit(X_train, y_train)
Out[18]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
In [19]:
# show which features have been detected as most important for the model
usedfeatures = alldata.columns.tolist()
idx = np.argsort(model.feature_importances_)
for i in range(len(model.feature_importances_)):
    print(usedfeatures[idx[i]], " : ", model.feature_importances_[idx[i]])
nb_under_12  :  0.0216610749755
nb_under_31  :  0.024301780995
nb_in_col1  :  0.0373718714573
nb_in_col5  :  0.038123452007
nb_in_edge  :  0.0397677738747
nb_in_col2  :  0.0420371508656
nb_in_col4  :  0.0421651400592
nb_in_col3  :  0.0432070759727
nb_even  :  0.0613390887568
c1  :  0.0848004234656
c5  :  0.0879936060574
c2  :  0.0991337814575
c3  :  0.101263258337
c4  :  0.101640258047
sum_diff  :  0.175194263672

This ranking shows that the "sum_diff" feature is the most important for the model, and "nb_under_12" the least important.
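Since sum_diff dominates the ranking, it is worth a quick look at how differently it behaves in the two classes. A minimal sketch comparing its mean value for human and random combinations, using the alldata DataFrame and the class ordering defined above :

In [ ]:
# Human combinations come first in alldata, random combinations after
human_mask = np.array([True] * humandatalen + [False] * randomdatalen)
print("mean sum_diff, human combinations  : {:.0f}".format(alldata['sum_diff'][human_mask].mean()))
print("mean sum_diff, random combinations : {:.0f}".format(alldata['sum_diff'][~human_mask].mean()))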

STEP 5 : Do prediction and analyze results

"A prediction is always a missed target" [5]

In [20]:
from sklearn import metrics

# Make predictions on the "test set"
preds_hard = model.predict(X_test)
preds_soft = model.predict_proba(X_test)

print("made the job")
made the job
In [21]:
# Compute the confusion matrix, to visualize the performance of the model
cm = metrics.confusion_matrix(y_test, preds_hard)
cm_ = pd.DataFrame(cm, 
                   index=['human real (0)', 'random real (1)'], 
                   columns=['human predicted (0)', 'random predicted (1)'])
cm_
Out[21]:
human predicted (0) random predicted (1)
human real (0) 2127 1711
random real (1) 1346 2591

2 rows × 2 columns

In [22]:
# Compute the F1 score, a metric measuring the model performance
f1score = metrics.f1_score(y_test, preds_hard)
print("f1 score : {:.2f} %".format(f1score * 100))
f1 score : 62.90 %
In [23]:
# Compute how many times we made a good guess
accuracy = float(cm[0,0] + cm[1,1]) / float(cm[0,0] + cm[0,1] + cm[1,0] + cm[1,1])
print("good guess : {:.2f} %".format(accuracy * 100))
good guess : 60.68 %

As you can see, we made a good guess in more than 50% of cases. This means that our model learned something, and is not just guessing at random !!!
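To make "better than random guessing" concrete, here is a hedged sketch comparing the model against a dummy baseline that ignores the features entirely (the two classes are balanced, so the baseline should sit around 50 %) :

In [ ]:
from sklearn.dummy import DummyClassifier

# A classifier that always predicts the most frequent class, ignoring the features
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print("baseline accuracy      : {:.2f} %".format(100.0 * metrics.accuracy_score(y_test, dummy.predict(X_test))))
print("random forest accuracy : {:.2f} %".format(100.0 * metrics.accuracy_score(y_test, preds_hard)))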

In [24]:
# Compute the ROC AUC score (another metric measuring the model performance)
print("ROC AUC score : {:.2f} ".format(metrics.roc_auc_score(y_test, preds_soft[:, 1])))
ROC AUC score : 0.65 
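As a visual counterpart of this number, we can plot the ROC curve from the predicted probabilities. A minimal sketch using metrics.roc_curve on the test set predictions computed above :

In [ ]:
# ROC curve of the random forest on the test set
fpr, tpr, _ = metrics.roc_curve(y_test, preds_soft[:, 1])
plt.plot(fpr, tpr, label='Random Forest')
plt.plot([0, 1], [0, 1], 'k--', label='chance')   # the diagonal corresponds to random guessing
plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
plt.title('ROC curve'); plt.legend(loc='lower right')
plt.show()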

STEP 6 : Let's do scoring !

There are Data Scientists practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp) [6]

In [25]:
mycomb = [[2,17,27,29,33], 
          [2,4,6,8,10], 
          [2,4,6,8,11], 
          [11,21,29,34,37], 
          [4,7,8,10,20], 
          [8,10,21,29,35], 
          [12,16,17,23,26],
          [12,16,17,23,31]]
mycombfeat = pd.DataFrame(mycomb, columns=['c1', 'c2', 'c3', 'c4', 'c5'])
mycombfeat['nb_under_31'] = is_under(mycombfeat, 31)
mycombfeat['nb_under_12'] = is_under(mycombfeat, 12)
mycombfeat['nb_in_edge'] = nb_in_edge(mycombfeat)
mycombfeat['nb_in_col1'] = nb_in_col(mycombfeat, col1) 
mycombfeat['nb_in_col2'] = nb_in_col(mycombfeat, col2)
mycombfeat['nb_in_col3'] = nb_in_col(mycombfeat, col3)
mycombfeat['nb_in_col4'] = nb_in_col(mycombfeat, col4)
mycombfeat['nb_in_col5'] = nb_in_col(mycombfeat, col5)
mycombfeat['nb_even'] = nb_even(mycombfeat)
mycombfeat['sum_diff'] = sum_diff(mycombfeat)

print("done. What's next ?")
done. What's next ?
In [26]:
mycombfeat
Out[26]:
c1 c2 c3 c4 c5 nb_under_31 nb_under_12 nb_in_edge nb_in_col1 nb_in_col2 nb_in_col3 nb_in_col4 nb_in_col5 nb_even sum_diff
0 2 17 27 29 33 4 1 1 0 3 1 1 0 1 345
1 2 4 6 8 10 5 5 4 1 1 1 1 1 5 16
2 2 4 6 8 11 5 5 4 2 1 1 1 0 4 21
3 11 21 29 34 37 3 1 2 2 1 0 2 0 1 198
4 4 7 8 10 20 5 4 3 0 1 1 1 2 4 114
5 8 10 21 29 35 4 2 3 1 0 1 1 2 2 225
6 12 16 17 23 26 5 1 2 2 2 1 0 0 3 62
7 12 16 17 23 31 5 1 2 2 2 1 0 0 2 117

8 rows × 15 columns

In [27]:
for i in range(len(mycomb)):
    print("Randomness score for combination ", 
          mycomb[i], 
          " : ", 
          int(100.0* model.predict_proba(mycombfeat.iloc[[i]])[:, 1]), " %")
Randomness score for combination  [2, 17, 27, 29, 33]  :  33  %
Randomness score for combination  [2, 4, 6, 8, 10]  :  0  %
Randomness score for combination  [2, 4, 6, 8, 11]  :  3  %
Randomness score for combination  [11, 21, 29, 34, 37]  :  54  %
Randomness score for combination  [4, 7, 8, 10, 20]  :  53  %
Randomness score for combination  [8, 10, 21, 29, 35]  :  63  %
Randomness score for combination  [12, 16, 17, 23, 26]  :  44  %
Randomness score for combination  [12, 16, 17, 23, 31]  :  39  %

Conclusions and next steps

With a very simple model and a basic feature set, we designed a scoring method to predict whether a given combination more likely comes from a human being or from a truly random machine.

This scoring framework will be pushed to live production on the OneWinner.Me platform.

There is definitely room for improvement to make the model more accurate. Here are some hints :

  • train and evaluate the classifier on a dataset without the "trivial" combinations (1 2 3 4 5, etc. ...)
  • try other classifiers, like Gradient Boosting Trees, SVM, logistic regression, ...
  • tune the classifiers' hyper-parameters
  • use the stars
  • create and evaluate new features
  • perform feature selection (maybe not all features are useful ?)
  • better assess model performance through K-Fold cross-validation (see the sketch after this list)
  • ...
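As an illustration of the cross-validation hint above, here is a minimal sketch using cross_val_score (imported from the same sklearn.cross_validation module used earlier in this notebook; recent scikit-learn versions expose it in sklearn.model_selection instead) :

In [ ]:
from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in newer versions

# 5-fold cross-validation of the same random forest on the full dataset :
# a more robust performance estimate than a single train / test split
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5, scoring='roc_auc')
print("ROC AUC per fold : ", np.round(scores, 3))
print("mean ROC AUC     : {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))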

Contact : @chris_bour Christophe Bourguignat