AKA Bolster got totally nerd sniped and this is my life now
Quite often, all these fancy machinations come well after the real work: asking awkward questions.
Have a look at this site, it looks like there is some kind of recommendation engine behind the scenes.
However, it is very difficult to reason about something that has a dozen categories and four sliding scales; we need to collect a lot of data and then work out whether we can find any hidden relationships.
How many queries would you have to make to fully explore a deterministic recommendation space?
Observe: there are 12 classes of 'interests', four 0-10 sliders, and each request responds with 6 recommended clubs.
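Before scraping anything, it's worth putting a number on that. One way to arrive at the roughly 4.79-trillion-state search space quoted later is to treat the category list as an ordered permutation of all 12 interests and each slider as contributing ten positions; both are modelling assumptions about how the engine treats its inputs.

```python
from math import factorial

# Assumption: category order matters (all 12 in some order) and each
# slider contributes ten distinct positions.
search_space = factorial(12) * 10 ** 4
print(f'{search_space:,}')  # 4,790,016,000,000
```

If category order doesn't matter to the engine, the real space is far smaller, but still vastly larger than anything we can exhaustively query.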
import numpy as np
import requests
from random import randint
import hashlib
from bs4 import BeautifulSoup
base = 'https://hookup-qubsu.org/home/GetResults'
categories = [
    "Activism",
    "Community",
    "Competing",
    "Culture",
    "Democracy",
    "Gaming",
    "Learning",
    "MakeFriends",
    "Network",
    "Outdoors",
    "Perform",
    "Stayactive"
]
def gen_q():
    # sample a handful of categories; the count is drawn from a normal
    # distribution centred on half the category list (clamped at zero)
    n = max(0, int(np.random.normal((len(categories) - 1) // 2)))
    c = list(np.random.permutation(categories)[:n])
    _c = [categories.index(k) for k in c]
    q = {
        "Categories": c,
        "Budget": str(randint(0, 10)),
        "Time": str(randint(0, 10)),
        "Travel": str(randint(0, 10)),
        "Joined": str(randint(0, 10))
    }
    h = hashlib.md5(str(q).encode('utf-8')).digest()  # query fingerprint
    return h, q, _c
def get_clubs(q):
    response = requests.post(base, data=q)
    content = response.content
    duration = response.elapsed.total_seconds()  # server response time
    s = BeautifulSoup(content, 'html.parser')
    clubs = [h.get_text() for h in s.select('div.answers > h2')]
    return clubs, duration
def get_random_result():
    h, q, _c = gen_q()
    q['Recommended'], q['Duration'] = get_clubs(q)
    return q
get_random_result()
{'Categories': ['Democracy', 'Activism', 'Learning', 'Outdoors', 'Gaming'], 'Budget': '3', 'Time': '3', 'Travel': '4', 'Joined': '3', 'Recommended': ["What's The Big Idea?", 'Amnesty', 'Motor Club', 'Handy Helpers', 'Alternative Dispute Resolution Society', 'Players Society'], 'Duration': 0.690516}
from time import time
from tqdm.auto import tqdm
s = time()
for _ in tqdm(range(10)):
    f = get_random_result()['Duration']
duration = (time()-s)
print(f'D:{duration}, M:{duration/10}')
D:3.6381988525390625, M:0.36381988525390624
import concurrent.futures
results = []
s = time()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(get_random_result) for _ in range(10)}
    for future in tqdm(concurrent.futures.as_completed(futures)):
        results.append(future.result())
duration = (time()-s)
print(f'D:{duration}, M:{duration/10}')
D:0.8237330913543701, M:0.08237330913543701
import concurrent.futures
s = time()
batch_size = int(1e4)
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    while True:  # run forever; interrupt the kernel when you have enough
        results = []  # NB: only the most recent completed batch is kept
        futures = {executor.submit(get_random_result) for _ in range(batch_size)}
        for future in tqdm(concurrent.futures.as_completed(futures), total=batch_size):
            results.append(future.result())
import pandas as pd
df=pd.DataFrame(results)
df
Categories | Budget | Time | Travel | Joined | Recommended | Duration | |
---|---|---|---|---|---|---|---|
0 | [Network, Stayactive, Community, Learning, Dem... | 9 | 5 | 9 | 5 | [Motor Club, Sign Language Society, Lawyers Wi... | 0.550702 |
1 | [Network, Community, Activism, Democracy, Lear... | 5 | 11 | 3 | 1 | [Tennis Club, Adventure Sports Clubs, Airsoft ... | 0.667156 |
2 | [Culture, Perform, Gaming, Activism, Stayactive] | 1 | 2 | 0 | 11 | [Art Society, Badminton Club, Dragonslayers, C... | 0.759456 |
3 | [Gaming, MakeFriends, Activism, Democracy, Out... | 5 | 5 | 10 | 8 | [Snooker & Pool Club] | 0.773687 |
4 | [Democracy, Learning, Outdoors, Network] | 11 | 0 | 2 | 8 | [Innovateher] | 0.758372 |
... | ... | ... | ... | ... | ... | ... | ... |
9995 | [Network, Culture, Outdoors, Community, Activism] | 2 | 1 | 2 | 3 | [Lacrosse Club, Academic Societies, Online Vol... | 0.715925 |
9996 | [Network, Perform, Learning, MakeFriends, Cult... | 10 | 10 | 3 | 0 | [Archery Club, Volunteering, Esports Society, ... | 0.752570 |
9997 | [Outdoors, Competing, MakeFriends, Stayactive,... | 9 | 7 | 5 | 9 | [University Air Squadron Society, Enactus Soci... | 0.759842 |
9998 | [Perform, Democracy, Culture, Community, Learn... | 3 | 1 | 4 | 1 | [Dance Club, Martial Arts & Combat Sports Club... | 0.766073 |
9999 | [MakeFriends, Outdoors, Gaming, Stayactive, De... | 7 | 8 | 9 | 6 | [Golf Club] | 0.770303 |
10000 rows × 7 columns
Now we have a much more complex data structure, with each record having both the 'categorical' interests that were queried and another set of categorical responses.
This is a very messy way of manipulating data; it would make much more sense for there to be a binary 'mask' for each observation.
There are clever ways of doing this using the preprocessing capabilities of a world-famous library we haven't quite dealt with yet, sklearn, so let's use this as an excuse.
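As a taste of that approach, sklearn's MultiLabelBinarizer turns lists of labels into exactly this kind of binary mask; a minimal sketch on made-up interest lists (not this dataset):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mask = mlb.fit_transform([['Gaming', 'Culture'], ['Gaming'], []])
print(mlb.classes_)  # columns are the sorted labels: ['Culture' 'Gaming']
print(mask)          # rows: [1 1], [0 1], [0 0]
```

Each row is one observation; each column answers "was this label present?".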
But First! Back up your data! (Even if the format is not cleaned!)
df.to_parquet('data/hookup_10k.pa.pq', engine='pyarrow')
df = pd.read_parquet('data/hookup_10k.pa.pq')
df.head()
Categories | Budget | Time | Travel | Joined | Recommended | Duration | |
---|---|---|---|---|---|---|---|
0 | [Network, Stayactive, Community, Learning, Dem... | 9 | 5 | 9 | 5 | [Motor Club, Sign Language Society, Lawyers Wi... | 0.550702 |
1 | [Network, Community, Activism, Democracy, Lear... | 5 | 11 | 3 | 1 | [Tennis Club, Adventure Sports Clubs, Airsoft ... | 0.667156 |
2 | [Culture, Perform, Gaming, Activism, Stayactive] | 1 | 2 | 0 | 11 | [Art Society, Badminton Club, Dragonslayers, C... | 0.759456 |
3 | [Gaming, MakeFriends, Activism, Democracy, Out... | 5 | 5 | 10 | 8 | [Snooker & Pool Club] | 0.773687 |
4 | [Democracy, Learning, Outdoors, Network] | 11 | 0 | 2 | 8 | [Innovateher] | 0.758372 |
While we've spent the majority of our time in pandas land for clarity and easy manipulation, the majority of projects in the classification and model-evaluation landscape operate on the underlying numpy arrays directly, with a rough convention: features go in a 2-D array X with one row per observation, and labels go in a 1-D array y with one entry per observation.
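A minimal sketch of that convention (the names are the community's, the values here are invented):

```python
import numpy as np

# X: one row per observation, one column per feature
X = np.array([[0, 1, 9],
              [1, 0, 3]])
# y: one label per observation
y = np.array([0, 1])

assert X.shape[0] == y.shape[0]  # rows of X line up with entries of y
```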
from collections import Counter  # this is a very cool module
clubs = Counter()
for _l in df['Recommended'].values:
    for _i in _l:
        clubs[_i] += 1
clubs.most_common()
[('Golf Club', 993), ('University Air Squadron Society', 949), ('Officer Training Corps Society', 942), ('Snooker & Pool Club', 916), ('Cricket Club', 885), ('Scout Network QUB', 754), ('Lawyers Without Borders Society', 738), ('Tennis Club', 715), ('Inspiring Leaders', 676), ('Rugby Club', 639), ('GAA Clubs ', 630), ('Equestrian Club', 628), ('Homework Clubs', 618), ('Nightline Society', 614), ('Soccer Club', 607), ('Hockey Club', 604), ('Basketball Club', 590), ('Innovateher', 589), ('Medical Societies', 585), ('Literific - Debating Society', 584), ('Alternative Dispute Resolution Society', 581), ('Music Society', 580), ('Musical Theatre Society', 577), ('Activist Societies', 570), ('RAG (Raise and Give)', 568), ('Enactus Society', 560), ('Athletics Club', 547), ('iLive Leadership Society', 543), ("QUB Dragons' Den", 539), ('Adventure Sports Clubs', 537), ('Electronic Music Society', 521), ('Netball Club', 513), ('Amnesty', 501), ('Mind Matters Society', 495), ('Yoga and Care Corner', 493), ('Airsoft Club', 490), ('Players Society', 486), ('Kpop Society', 480), ('Watersports Clubs', 479), ('Quiz Society', 477), ('Motor Club', 474), ('Feline Welfare Society', 469), ('Choral and Singing Society', 465), ('Dance Club', 464), ('St John Ambulance Society', 463), ('Juggling Club', 460), ("Writers' Society", 456), ('Become a Course Rep', 455), ('Student Action for Refugees Society', 454), ('Robotics Society', 447), ('Triathlon Club', 442), ('Visual Arts Society', 439), ('Photography Society', 438), ('Lacrosse Club', 436), ('Cheerleading Club', 433), ('Archery Club', 421), ('Belfast Marrow Society', 421), ('Student Managed Fund Society', 418), ('Political Societies', 414), ('Martial Arts & Combat Sports Clubs', 413), ('Online Volunteering', 411), ('Badminton Club', 409), ('Squash Club', 407), ('Become a Councillor', 405), ('Olympic Handball Club', 403), ('Chinese Lion Dance Society', 400), ('Trócaire Society', 397), ('Esports Society', 390), ('Volunteering', 387), ('Chess 
Club', 380), ("What's The Big Idea?", 378), ('Table Tennis Club', 376), ('LGBTQIA+ Society', 374), ('Sci-Fi and Fantasy Society', 373), ('Ultimate Frisbee Club', 372), ('Cavaliers in Need Society', 372), ('Unihoc-Floorball Club', 370), ('Art Society', 368), ('Traditional Crafts Society', 367), ('Dodgeball Club', 367), ('Aerial Sports Club', 363), ('Join a Student Group', 360), ('Volleyball Club', 359), ('Rowing Club', 358), ("Green at Queen's Society", 358), ('Film Society', 358), ('Faith-based Societies', 356), ('African and Caribbean Society', 352), ('Join the Climate Action Group', 347), ('Handy Helpers', 346), ('Trampoline Club', 339), ("Queen's Radio Society", 336), ('Cultural Societies', 327), ('Academic Societies', 326), ('Entrepreneurship ', 324), ('Vegan & Vegetarian Society', 319), ('Dragonslayers', 312), ('Inclusion Society', 310), ('Sign Language Society', 302)]
import plotly.express as px
club_recommended_n= pd.Series(clubs).sort_values()
px.bar(club_recommended_n - club_recommended_n.mean())
At first glance, this seems suspicious, but we definitely don't have enough information at the moment to clearly suggest that; out of a possible search space of 4,790,016,000,000, we've only made 10,000 queries...
(BTW, spot where the Inclusion Society ended up....)
Let's just do a quick sense check and see what our four-dimensional space 'looks like'.
Visualising something in 4D is pretty much the edge of our reasonable explainability as lowly humans (before we have to get clever); from a graph perspective we have X, Y, Z and Colour (or Size, but be careful about using both...).
But from a mathematical perspective, we can check things like the variance and standard deviation of each 'column' first, and then validate that there is no 'covariance' between metrics, i.e. that setting the 'Budget' slider to 9 does not imply that 'Time' moves in any particular direction, and so on.
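As a toy illustration of what that check looks for, two columns that move in exact opposition produce a strongly negative off-diagonal covariance term (a sketch, not this dataset):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])  # moves exactly opposite to a
cov = np.cov(a, b)          # 2x2 covariance matrix
print(cov[0, 1])            # -1.666...: strong negative covariance
```

Independently generated sliders should instead give off-diagonal terms near zero.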
First off, let's see if our distribution makes any sense whatsoever, or if it has ended up clustered in one area, which would indicate a problem with our query generation. We can use the descriptive-statistics method describe
and the nice and easy boxplot
to do a quick check on each value.
metrics_df = df[['Budget','Time','Travel','Joined']].astype(int)
metric_incidence = metrics_df\
    .groupby(['Budget','Time','Travel','Joined'])\
    .size().reset_index().rename(columns={0:'hits'})
metric_incidence
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 2 |
1 | 0 | 0 | 0 | 3 | 1 |
2 | 0 | 0 | 0 | 7 | 2 |
3 | 0 | 0 | 0 | 9 | 1 |
4 | 0 | 0 | 0 | 10 | 1 |
... | ... | ... | ... | ... | ... |
7979 | 11 | 11 | 11 | 4 | 1 |
7980 | 11 | 11 | 11 | 5 | 1 |
7981 | 11 | 11 | 11 | 8 | 1 |
7982 | 11 | 11 | 11 | 9 | 1 |
7983 | 11 | 11 | 11 | 10 | 1 |
7984 rows × 5 columns
metric_incidence.describe()
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
count | 7984.000000 | 7984.000000 | 7984.000000 | 7984.000000 | 7984.000000 |
mean | 5.531062 | 5.534945 | 5.471443 | 5.490857 | 1.252505 |
std | 3.449794 | 3.461502 | 3.449125 | 3.454457 | 0.514228 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 3.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 |
50% | 6.000000 | 6.000000 | 5.000000 | 5.000000 | 1.000000 |
75% | 9.000000 | 9.000000 | 8.000000 | 8.250000 | 1.000000 |
max | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 5.000000 |
metric_incidence.boxplot()
This is a fun time to introduce another tool; you can format dataframe presentation really simply. DOCS
metric_incidence.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
So now we can be fairly confident that our slider queries were spread over the distribution space fairly evenly, and that there do not appear to be any significant correlations between the query values that would affect the output distribution.
The below is a dirty, dirty hack presented with no explanation whatsoever; you give it a messy data frame and it tells you what order you should display those values in 3D plots.
from sklearn.ensemble import RandomForestClassifier
def optimal_feature_display_order(data: pd.DataFrame) -> list:
    """Use a Random Forest Classifier to identify the 'most changed' orientations to present labeled data"""
    clf = RandomForestClassifier(max_features=data.shape[1] - 1)
    clf.fit(data.values[:, 1:], data.values[:, 0])
    features = pd.Series(clf.feature_importances_, index=data.columns[1:])
    return features.sort_values(ascending=False).index.to_list()
optimal_feature_display_order(
    metric_incidence
)
['Travel', 'Joined', 'Time', 'hits']
px.scatter_3d(metric_incidence,x='Travel',y='Joined',z='Time', color='hits')
From this we can see that we've repeatedly hit the same coordinates a few times but have generally been well spread out.
Can we say the same about our categorical inputs somehow?
categories_df = df['Categories'].to_frame().copy()
for c in categories:
    categories_df[c] = categories_df['Categories'].apply(lambda l: c in l)
categories_df.drop('Categories', axis=1, inplace=True)
categories_df
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | True | False | False | True | False | True | False | True | False | False | True |
1 | True | True | False | False | True | False | True | False | True | False | False | False |
2 | True | False | False | True | False | True | False | False | False | False | True | True |
3 | True | False | False | False | True | True | False | True | False | True | False | False |
4 | False | False | False | False | True | False | True | False | True | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | True | True | False | True | False | False | False | False | True | True | False | False |
9996 | False | False | False | True | False | False | True | True | True | False | True | False |
9997 | False | False | True | False | True | False | False | True | True | True | False | True |
9998 | False | True | False | True | True | False | True | False | False | False | True | False |
9999 | False | False | False | False | True | True | False | True | False | True | False | True |
10000 rows × 12 columns
As we can see, because these new 'interest' categories are booleans, we get a different output from describe
based on the dtype.
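On a toy boolean series (a sketch, not this dataset), describe switches to the count/unique/top/freq summary rather than mean/std:

```python
import pandas as pd

s = pd.Series([True, False, True])
desc = s.describe()
print(desc)  # count/unique/top/freq, because the dtype isn't numeric
```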
categories_df.describe()
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
top | False | False | False | False | False | False | False | False | False | False | False | False |
freq | 6257 | 6253 | 6139 | 6194 | 6116 | 6248 | 6299 | 6266 | 6255 | 6226 | 6222 | 6342 |
But we can still perform a covariance check to see if anything stands out. (It doesn't!)
categories_df.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Activism | 0.234223 | -0.014652 | -0.011918 | -0.013360 | -0.013079 | -0.013639 | -0.014630 | -0.011865 | -0.010676 | -0.011162 | -0.012312 | -0.017521 |
Community | -0.014652 | 0.234323 | -0.013873 | -0.014112 | -0.014135 | -0.012989 | -0.014378 | -0.008814 | -0.015927 | -0.014213 | -0.015063 | -0.014067 |
Competing | -0.011918 | -0.013873 | 0.237050 | -0.008250 | -0.016263 | -0.015666 | -0.012697 | -0.013771 | -0.009795 | -0.013315 | -0.015570 | -0.012037 |
Culture | -0.013360 | -0.014112 | -0.008250 | 0.235767 | -0.012126 | -0.011802 | -0.015562 | -0.013817 | -0.015336 | -0.014440 | -0.013092 | -0.009524 |
Democracy | -0.013079 | -0.014135 | -0.016263 | -0.012126 | 0.237569 | -0.012229 | -0.011748 | -0.013730 | -0.016357 | -0.012883 | -0.011839 | -0.012178 |
Gaming | -0.013639 | -0.012989 | -0.015666 | -0.011802 | -0.012229 | 0.234448 | -0.011563 | -0.012501 | -0.013614 | -0.012502 | -0.009752 | -0.013850 |
Learning | -0.014630 | -0.014378 | -0.012697 | -0.015562 | -0.011748 | -0.011563 | 0.233149 | -0.011296 | -0.014704 | -0.013177 | -0.011225 | -0.014484 |
MakeFriends | -0.011865 | -0.008814 | -0.013771 | -0.013817 | -0.013730 | -0.012501 | -0.011296 | 0.233996 | -0.010339 | -0.015023 | -0.017172 | -0.011691 |
Network | -0.010676 | -0.015927 | -0.009795 | -0.015336 | -0.016357 | -0.013614 | -0.014704 | -0.010339 | 0.234273 | -0.010837 | -0.012987 | -0.011993 |
Outdoors | -0.011162 | -0.014213 | -0.013315 | -0.014440 | -0.012883 | -0.012502 | -0.013177 | -0.015023 | -0.010837 | 0.234993 | -0.014183 | -0.012554 |
Perform | -0.012312 | -0.015063 | -0.015570 | -0.013092 | -0.011839 | -0.009752 | -0.011225 | -0.017172 | -0.012987 | -0.014183 | 0.235091 | -0.015001 |
Stayactive | -0.017521 | -0.014067 | -0.012037 | -0.009524 | -0.012178 | -0.013850 | -0.014484 | -0.011691 | -0.011993 | -0.012554 | -0.015001 | 0.232014 |
We can even go crazy and combine the two record-sets and see if there are any cross-observation patterns!
We can do this because both dataframes still share the same index.
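A tiny sketch of why the shared index matters: concat with axis=1 aligns rows by index label, not by position (hypothetical frames):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[1, 0])  # same labels, different order
joined = pd.concat([a, b], axis=1)
print(joined.loc[0, 'y'])  # 4 -- matched by label, not by row position
```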
pd.concat([metrics_df, categories_df], join='inner', axis=1).astype(int).cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Budget | Time | Travel | Joined | Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Budget | 11.883027 | -0.050050 | 0.015763 | 0.153248 | 0.028350 | -0.005566 | -0.024122 | -0.013000 | -0.004841 | -0.009632 | 0.008878 | -0.013377 | -0.022962 | -0.002500 | -0.000212 | 0.021662 |
Time | -0.050050 | 12.041452 | 0.169866 | 0.181255 | 0.016080 | -0.003445 | -0.003506 | 0.007164 | 0.008412 | -0.004225 | 0.017923 | -0.014821 | 0.013368 | -0.009352 | -0.035178 | 0.006821 |
Travel | 0.015763 | 0.169866 | 11.907401 | 0.000879 | 0.009884 | 0.020202 | 0.019684 | -0.037504 | 0.000929 | 0.004071 | -0.020497 | -0.006706 | 0.009892 | -0.007137 | 0.010582 | 0.039677 |
Joined | 0.153248 | 0.181255 | 0.000879 | 11.837555 | -0.013373 | -0.024276 | -0.001735 | 0.013046 | 0.002903 | 0.004174 | -0.026151 | 0.038487 | 0.005228 | 0.006363 | -0.008541 | 0.022627 |
Activism | 0.028350 | 0.016080 | 0.009884 | -0.013373 | 0.234223 | -0.014652 | -0.011918 | -0.013360 | -0.013079 | -0.013639 | -0.014630 | -0.011865 | -0.010676 | -0.011162 | -0.012312 | -0.017521 |
Community | -0.005566 | -0.003445 | 0.020202 | -0.024276 | -0.014652 | 0.234323 | -0.013873 | -0.014112 | -0.014135 | -0.012989 | -0.014378 | -0.008814 | -0.015927 | -0.014213 | -0.015063 | -0.014067 |
Competing | -0.024122 | -0.003506 | 0.019684 | -0.001735 | -0.011918 | -0.013873 | 0.237050 | -0.008250 | -0.016263 | -0.015666 | -0.012697 | -0.013771 | -0.009795 | -0.013315 | -0.015570 | -0.012037 |
Culture | -0.013000 | 0.007164 | -0.037504 | 0.013046 | -0.013360 | -0.014112 | -0.008250 | 0.235767 | -0.012126 | -0.011802 | -0.015562 | -0.013817 | -0.015336 | -0.014440 | -0.013092 | -0.009524 |
Democracy | -0.004841 | 0.008412 | 0.000929 | 0.002903 | -0.013079 | -0.014135 | -0.016263 | -0.012126 | 0.237569 | -0.012229 | -0.011748 | -0.013730 | -0.016357 | -0.012883 | -0.011839 | -0.012178 |
Gaming | -0.009632 | -0.004225 | 0.004071 | 0.004174 | -0.013639 | -0.012989 | -0.015666 | -0.011802 | -0.012229 | 0.234448 | -0.011563 | -0.012501 | -0.013614 | -0.012502 | -0.009752 | -0.013850 |
Learning | 0.008878 | 0.017923 | -0.020497 | -0.026151 | -0.014630 | -0.014378 | -0.012697 | -0.015562 | -0.011748 | -0.011563 | 0.233149 | -0.011296 | -0.014704 | -0.013177 | -0.011225 | -0.014484 |
MakeFriends | -0.013377 | -0.014821 | -0.006706 | 0.038487 | -0.011865 | -0.008814 | -0.013771 | -0.013817 | -0.013730 | -0.012501 | -0.011296 | 0.233996 | -0.010339 | -0.015023 | -0.017172 | -0.011691 |
Network | -0.022962 | 0.013368 | 0.009892 | 0.005228 | -0.010676 | -0.015927 | -0.009795 | -0.015336 | -0.016357 | -0.013614 | -0.014704 | -0.010339 | 0.234273 | -0.010837 | -0.012987 | -0.011993 |
Outdoors | -0.002500 | -0.009352 | -0.007137 | 0.006363 | -0.011162 | -0.014213 | -0.013315 | -0.014440 | -0.012883 | -0.012502 | -0.013177 | -0.015023 | -0.010837 | 0.234993 | -0.014183 | -0.012554 |
Perform | -0.000212 | -0.035178 | 0.010582 | -0.008541 | -0.012312 | -0.015063 | -0.015570 | -0.013092 | -0.011839 | -0.009752 | -0.011225 | -0.017172 | -0.012987 | -0.014183 | 0.235091 | -0.015001 |
Stayactive | 0.021662 | 0.006821 | 0.039677 | 0.022627 | -0.017521 | -0.014067 | -0.012037 | -0.009524 | -0.012178 | -0.013850 | -0.014484 | -0.011691 | -0.011993 | -0.012554 | -0.015001 | 0.232014 |
From this we can actually see a few interesting things;
First, that covariance isn't directional: the matrix is symmetric, so the small positive covariance between 'Joined' and 'MakeFriends' tells us they co-occur slightly more often than chance, but not whether one drives the other.
Second, that mixing data types in this way is a very risky thing and should only be used to make yourself feel better about a dataset, not to derive any real conclusions from...
So we've confirmed our assumption that our query generator isn't doing something insane. So what, if anything, is the recommender doing?
Let's go back to our first 'output' graph; this was formatted to show the clubs that got 'more' or 'fewer' recommendations.
Note that we've established that we are asking random questions... so an initial hypothesis is that the answers are random; how can we test that?
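One rough back-of-the-envelope test (my sketch, not part of the original analysis): if every query returned 6 clubs drawn uniformly from the 99 on offer, each club's total count over 10,000 queries would be roughly binomial, and anything far outside a few standard deviations of the expectation starts to look non-random. In reality queries return *up to* 6 clubs, so treat this as a loose bound.

```python
import math

n_queries, n_clubs, recs_per_query = 10_000, 99, 6

# Under a uniform-random recommender, a given club appears in one query
# with probability ~6/99, so its count is ~Binomial(10_000, 6/99).
p = recs_per_query / n_clubs
expected = n_queries * p
sigma = math.sqrt(n_queries * p * (1 - p))

print(f'expected ~{expected:.0f}, 3-sigma upper bound ~{expected + 3 * sigma:.0f}')
# Golf Club's 993 recommendations sit far above that bound.
```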
px.bar(club_recommended_n - club_recommended_n.mean())
Let's start off with some maths; if you're interested, go nosy at the Central Limit Theorem.
Fundamentally, as you take more and more samples of a random variable, the distribution of the sample mean approaches a normal distribution centred on the true mean. For instance, below, if you stopped at 16 dice rolls you could be forgiven for suspecting that your die was rigged.
import matplotlib.pyplot as plt
f, axes = plt.subplots(5, 1, sharex=True, figsize=(16, 20))
random_dice = np.random.randint(1, 7, 100)  # a fair six-sided die rolls 1-6
random_dice_mean = random_dice.cumsum() / np.arange(1, len(random_dice) + 1)
for ax, i in zip(axes, [4, 8, 16, 32, 64]):
    pd.Series(random_dice_mean[:i]).hist(ax=ax)
    ax.set_ylabel(f'{i} rolls')
    ax.set_xlim(0, 6)
So, can we test our hypothesis that the recommender might just be random? Let's look at our diagram slightly differently.
px.histogram(club_recommended_n.to_frame('Recommendations'), range_x=(0,1000))
px.box(club_recommended_n.to_frame('Recommendations'), x='Recommendations', range_x=(0,1000))
club_recommended_n[club_recommended_n > 700]
Tennis Club                        715
Lawyers Without Borders Society    738
Scout Network QUB                  754
Cricket Club                       885
Snooker & Pool Club                916
Officer Training Corps Society     942
University Air Squadron Society    949
Golf Club                          993
dtype: int64
These clubs appear to be recommended significantly more often than the average.
This doesn't in any way disprove the hypothesis that the recommender is just a random generator; potentially we just haven't gathered enough data to get a 'feel' for the surface.
How about going back to our friendly 3D plot and seeing if there is anything strange about the requests that led to the recommendations of these clubs?
recommended_df = df['Recommended'].to_frame().copy()
for c in clubs.keys():
    recommended_df[c] = recommended_df['Recommended'].apply(lambda l: c in l)
recommended_df.drop('Recommended', axis=1, inplace=True)
recommended_df
Motor Club | Sign Language Society | Lawyers Without Borders Society | GAA Clubs | Triathlon Club | Martial Arts & Combat Sports Clubs | Tennis Club | Adventure Sports Clubs | Airsoft Club | Equestrian Club | ... | Volunteering | Traditional Crafts Society | Watersports Clubs | Yoga and Care Corner | Robotics Society | QUB Dragons' Den | Dodgeball Club | Belfast Marrow Society | Film Society | Aerial Sports Club | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | True | True | True | True | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
9996 | False | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
9997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
9998 | False | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | False |
9999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
10000 rows × 99 columns
top_clubs_idx = recommended_df[club_recommended_n[club_recommended_n > 700].keys()].any(axis=1) # Which rows produced any recommendation for these
top_clubs_idx
0        True
1        True
2       False
3        True
4       False
        ...
9995    False
9996    False
9997     True
9998    False
9999     True
Length: 10000, dtype: bool
top_clubs_incidence = metrics_df[top_clubs_idx]\
    .groupby(['Budget','Time','Travel','Joined'])\
    .size().reset_index().rename(columns={0:'hits'})
top_clubs_incidence
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 10 | 1 |
1 | 0 | 0 | 1 | 11 | 1 |
2 | 0 | 0 | 2 | 11 | 1 |
3 | 0 | 0 | 7 | 0 | 1 |
4 | 0 | 0 | 7 | 3 | 1 |
... | ... | ... | ... | ... | ... |
4250 | 11 | 11 | 10 | 11 | 1 |
4251 | 11 | 11 | 11 | 0 | 1 |
4252 | 11 | 11 | 11 | 1 | 1 |
4253 | 11 | 11 | 11 | 4 | 1 |
4254 | 11 | 11 | 11 | 8 | 1 |
4255 rows × 5 columns
top_clubs_incidence.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
Budget | 10.651911 | -0.311813 | -0.812403 | 0.305707 | -0.025428 |
Time | -0.311813 | 11.599528 | -3.038244 | -0.875595 | 0.100824 |
Travel | -0.812403 | -3.038244 | 10.913013 | -0.727112 | -0.051677 |
Joined | 0.305707 | -0.875595 | -0.727112 | 10.365902 | 0.000362 |
hits | -0.025428 | 0.100824 | -0.051677 | 0.000362 | 0.196289 |
Now that is a lot more significant than anything we've seen before, and would definitely indicate that these clubs are likely to be recommended across the board, except to people who rate themselves particularly low on any of the metrics.
The 3D plot doesn't massively help us here, but it does clearly show a significant blank space near the origin.
px.scatter_3d(top_clubs_incidence, x='Travel',y='Joined',z='Time', size='hits')
Check the website again. We've been using the labels 'Travel', 'Joined', 'Time', and 'Budget' for the slider values... Is that correct?
In this Section we've covered: