AKA Bolster got totally nerd sniped and this is my life now
Quite often, all these fancy machinations come well after the real work: asking awkward questions.
Have a look at this site, it looks like there is some kind of recommendation engine behind the scenes.
However, it is very difficult to reason about something that has a dozen categories and four sliding scales; we need to collect a lot of data and then work out whether we can find any hidden relationships.
How many queries would you have to make to fully explore a deterministic recommendation space?
Observe: there are 12 classes of 'interests', four 0-10 sliders, and each request responds with 6 recommended clubs.
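Before scraping anything, it's worth putting a number on that. One way to arrive at the roughly 4.79-trillion-state search space quoted later is to treat the category list as an ordered permutation of all 12 interests and each slider as contributing ten positions; both are modelling assumptions about how the engine treats its inputs.

```python
from math import factorial

# Assumption: category order matters (all 12 in some order) and each
# slider contributes ten distinct positions.
search_space = factorial(12) * 10 ** 4
print(f'{search_space:,}')  # 4,790,016,000,000
```

If category order doesn't matter to the engine, the real space is far smaller, but still vastly larger than anything we can exhaustively query.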
import numpy as np
import requests
from random import randint
import hashlib
from bs4 import BeautifulSoup
base = 'https://hookup-qubsu.org/home/GetResults'
categories = [
    "Activism",
    "Community",
    "Competing",
    "Culture",
    "Democracy",
    "Gaming",
    "Learning",
    "MakeFriends",
    "Network",
    "Outdoors",
    "Perform",
    "Stayactive"
]
def gen_q():
    # sample a handful of categories; the count is drawn from a normal
    # distribution centred on half the category list (clamped at zero)
    n = max(0, int(np.random.normal((len(categories) - 1) // 2)))
    c = list(np.random.permutation(categories)[:n])
    _c = [categories.index(k) for k in c]
    q = {
        "Categories": c,
        "Budget": str(randint(0, 10)),
        "Time": str(randint(0, 10)),
        "Travel": str(randint(0, 10)),
        "Joined": str(randint(0, 10))
    }
    h = hashlib.md5(str(q).encode('utf-8')).digest()  # query fingerprint
    return h, q, _c
def get_clubs(q):
    response = requests.post(base, data=q)
    content = response.content
    duration = response.elapsed.total_seconds()  # server response time
    s = BeautifulSoup(content, 'html.parser')
    clubs = [h.get_text() for h in s.select('div.answers > h2')]
    return clubs, duration
def get_random_result():
    h, q, _c = gen_q()
    q['Recommended'], q['Duration'] = get_clubs(q)
    return q
get_random_result()
{'Categories': ['Democracy', 'Activism', 'Learning', 'Outdoors', 'Gaming'], 'Budget': '3', 'Time': '3', 'Travel': '4', 'Joined': '3', 'Recommended': ["What's The Big Idea?", 'Amnesty', 'Motor Club', 'Handy Helpers', 'Alternative Dispute Resolution Society', 'Players Society'], 'Duration': 0.690516}
from time import time
from tqdm.auto import tqdm
s = time()
for _ in tqdm(range(10)):
    f = get_random_result()['Duration']
duration = (time()-s)
print(f'D:{duration}, M:{duration/10}')
D:3.6381988525390625, M:0.36381988525390624
import concurrent.futures
results = []
s = time()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(get_random_result) for _ in range(10)}
    for future in tqdm(concurrent.futures.as_completed(futures)):
        results.append(future.result())
duration = (time()-s)
print(f'D:{duration}, M:{duration/10}')
D:0.8237330913543701, M:0.08237330913543701
import concurrent.futures
s = time()
batch_size = int(1e4)
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    while True:  # run forever; interrupt the kernel when you have enough
        results = []  # NB: only the most recent completed batch is kept
        futures = {executor.submit(get_random_result) for _ in range(batch_size)}
        for future in tqdm(concurrent.futures.as_completed(futures), total=batch_size):
            results.append(future.result())
import pandas as pd
df=pd.DataFrame(results)
df
Categories | Budget | Time | Travel | Joined | Recommended | Duration | |
---|---|---|---|---|---|---|---|
0 | [Network, Stayactive, Community, Learning, Dem... | 9 | 5 | 9 | 5 | [Motor Club, Sign Language Society, Lawyers Wi... | 0.550702 |
1 | [Network, Community, Activism, Democracy, Lear... | 5 | 11 | 3 | 1 | [Tennis Club, Adventure Sports Clubs, Airsoft ... | 0.667156 |
2 | [Culture, Perform, Gaming, Activism, Stayactive] | 1 | 2 | 0 | 11 | [Art Society, Badminton Club, Dragonslayers, C... | 0.759456 |
3 | [Gaming, MakeFriends, Activism, Democracy, Out... | 5 | 5 | 10 | 8 | [Snooker & Pool Club] | 0.773687 |
4 | [Democracy, Learning, Outdoors, Network] | 11 | 0 | 2 | 8 | [Innovateher] | 0.758372 |
... | ... | ... | ... | ... | ... | ... | ... |
9995 | [Network, Culture, Outdoors, Community, Activism] | 2 | 1 | 2 | 3 | [Lacrosse Club, Academic Societies, Online Vol... | 0.715925 |
9996 | [Network, Perform, Learning, MakeFriends, Cult... | 10 | 10 | 3 | 0 | [Archery Club, Volunteering, Esports Society, ... | 0.752570 |
9997 | [Outdoors, Competing, MakeFriends, Stayactive,... | 9 | 7 | 5 | 9 | [University Air Squadron Society, Enactus Soci... | 0.759842 |
9998 | [Perform, Democracy, Culture, Community, Learn... | 3 | 1 | 4 | 1 | [Dance Club, Martial Arts & Combat Sports Club... | 0.766073 |
9999 | [MakeFriends, Outdoors, Gaming, Stayactive, De... | 7 | 8 | 9 | 6 | [Golf Club] | 0.770303 |
10000 rows × 7 columns
Now we have a much more complex data structure, with each record having both the 'categorical' interests that were queried and another set of categorical responses.
This is a very messy way of manipulating data; it would make much more sense for there to be a binary 'mask' for each observation.
There are clever ways of doing this using the preprocessing capabilities of a world-famous library we haven't quite dealt with yet, sklearn, so let's use this as an excuse.
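As a taste of that approach, sklearn's MultiLabelBinarizer turns lists of labels into exactly this kind of binary mask; a minimal sketch on made-up interest lists (not this dataset):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mask = mlb.fit_transform([['Gaming', 'Culture'], ['Gaming'], []])
print(mlb.classes_)  # columns are the sorted labels: ['Culture' 'Gaming']
print(mask)          # rows: [1 1], [0 1], [0 0]
```

Each row is one observation; each column answers "was this label present?".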
But First! Back up your data! (Even if the format is not cleaned!)
df.to_parquet('data/hookup_10k.pa.pq', engine='pyarrow')
df = pd.read_parquet('data/hookup_10k.pa.pq')
df.head()
Categories | Budget | Time | Travel | Joined | Recommended | Duration | |
---|---|---|---|---|---|---|---|
0 | [Network, Stayactive, Community, Learning, Dem... | 9 | 5 | 9 | 5 | [Motor Club, Sign Language Society, Lawyers Wi... | 0.550702 |
1 | [Network, Community, Activism, Democracy, Lear... | 5 | 11 | 3 | 1 | [Tennis Club, Adventure Sports Clubs, Airsoft ... | 0.667156 |
2 | [Culture, Perform, Gaming, Activism, Stayactive] | 1 | 2 | 0 | 11 | [Art Society, Badminton Club, Dragonslayers, C... | 0.759456 |
3 | [Gaming, MakeFriends, Activism, Democracy, Out... | 5 | 5 | 10 | 8 | [Snooker & Pool Club] | 0.773687 |
4 | [Democracy, Learning, Outdoors, Network] | 11 | 0 | 2 | 8 | [Innovateher] | 0.758372 |
While we've spent the majority of our time in pandas land for clarity and easy manipulation, the majority of projects in the classification and model-evaluation landscape operate on the underlying numpy arrays directly, with a rough convention: features go in a 2-D array X with one row per observation, and labels go in a 1-D array y with one entry per observation.
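A minimal sketch of that convention (the names are the community's, the values here are invented):

```python
import numpy as np

# X: one row per observation, one column per feature
X = np.array([[0, 1, 9],
              [1, 0, 3]])
# y: one label per observation
y = np.array([0, 1])

assert X.shape[0] == y.shape[0]  # rows of X line up with entries of y
```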
from collections import Counter  # this is a very cool module
clubs = Counter()
for _l in df['Recommended'].values:
    for _i in _l:
        clubs[_i] += 1
clubs.most_common()
[('Golf Club', 993), ('University Air Squadron Society', 949), ('Officer Training Corps Society', 942), ('Snooker & Pool Club', 916), ('Cricket Club', 885), ('Scout Network QUB', 754), ('Lawyers Without Borders Society', 738), ('Tennis Club', 715), ('Inspiring Leaders', 676), ('Rugby Club', 639), ('GAA Clubs ', 630), ('Equestrian Club', 628), ('Homework Clubs', 618), ('Nightline Society', 614), ('Soccer Club', 607), ('Hockey Club', 604), ('Basketball Club', 590), ('Innovateher', 589), ('Medical Societies', 585), ('Literific - Debating Society', 584), ('Alternative Dispute Resolution Society', 581), ('Music Society', 580), ('Musical Theatre Society', 577), ('Activist Societies', 570), ('RAG (Raise and Give)', 568), ('Enactus Society', 560), ('Athletics Club', 547), ('iLive Leadership Society', 543), ("QUB Dragons' Den", 539), ('Adventure Sports Clubs', 537), ('Electronic Music Society', 521), ('Netball Club', 513), ('Amnesty', 501), ('Mind Matters Society', 495), ('Yoga and Care Corner', 493), ('Airsoft Club', 490), ('Players Society', 486), ('Kpop Society', 480), ('Watersports Clubs', 479), ('Quiz Society', 477), ('Motor Club', 474), ('Feline Welfare Society', 469), ('Choral and Singing Society', 465), ('Dance Club', 464), ('St John Ambulance Society', 463), ('Juggling Club', 460), ("Writers' Society", 456), ('Become a Course Rep', 455), ('Student Action for Refugees Society', 454), ('Robotics Society', 447), ('Triathlon Club', 442), ('Visual Arts Society', 439), ('Photography Society', 438), ('Lacrosse Club', 436), ('Cheerleading Club', 433), ('Archery Club', 421), ('Belfast Marrow Society', 421), ('Student Managed Fund Society', 418), ('Political Societies', 414), ('Martial Arts & Combat Sports Clubs', 413), ('Online Volunteering', 411), ('Badminton Club', 409), ('Squash Club', 407), ('Become a Councillor', 405), ('Olympic Handball Club', 403), ('Chinese Lion Dance Society', 400), ('Trócaire Society', 397), ('Esports Society', 390), ('Volunteering', 387), ('Chess 
Club', 380), ("What's The Big Idea?", 378), ('Table Tennis Club', 376), ('LGBTQIA+ Society', 374), ('Sci-Fi and Fantasy Society', 373), ('Ultimate Frisbee Club', 372), ('Cavaliers in Need Society', 372), ('Unihoc-Floorball Club', 370), ('Art Society', 368), ('Traditional Crafts Society', 367), ('Dodgeball Club', 367), ('Aerial Sports Club', 363), ('Join a Student Group', 360), ('Volleyball Club', 359), ('Rowing Club', 358), ("Green at Queen's Society", 358), ('Film Society', 358), ('Faith-based Societies', 356), ('African and Caribbean Society', 352), ('Join the Climate Action Group', 347), ('Handy Helpers', 346), ('Trampoline Club', 339), ("Queen's Radio Society", 336), ('Cultural Societies', 327), ('Academic Societies', 326), ('Entrepreneurship ', 324), ('Vegan & Vegetarian Society', 319), ('Dragonslayers', 312), ('Inclusion Society', 310), ('Sign Language Society', 302)]
import plotly.express as px
club_recommended_n= pd.Series(clubs).sort_values()
px.bar(club_recommended_n - club_recommended_n.mean())
At first glance, this seems suspicious, but we definitely don't have enough information at the moment to clearly suggest that; out of a possible search space of 4,790,016,000,000, we've only made 10,000 queries...
(BTW, spot where the Inclusion Society ended up....)
Let's just do a quick sense check and see what our four-dimensional space 'looks like'.
Visualising something in 4D is pretty much the edge of our reasonable explainability as lowly humans (before we have to get clever); from a graph perspective we have X, Y, Z and Colour (or Size, but be careful about using both...).
But from a mathematical perspective, we can check things like the variance and standard deviation of each 'column' first, and then validate that there is no 'covariance' between metrics, i.e. that setting the 'Budget' slider to 9 does not imply that 'Time' moves in any particular direction, and so on.
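As a toy illustration of what that check looks for, two columns that move in exact opposition produce a strongly negative off-diagonal covariance term (a sketch, not this dataset):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])  # moves exactly opposite to a
cov = np.cov(a, b)          # 2x2 covariance matrix
print(cov[0, 1])            # -1.666...: strong negative covariance
```

Independently generated sliders should instead give off-diagonal terms near zero.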
First off, let's see if our distribution makes any sense whatsoever, or if it has ended up clustered in one area, which would indicate a problem with our query generation. We can use the descriptive-statistics method describe
and the nice and easy boxplot
to do a quick check on each value.
metrics_df = df[['Budget','Time','Travel','Joined']].astype(int)
metric_incidence = metrics_df\
    .groupby(['Budget','Time','Travel','Joined'])\
    .size().reset_index().rename(columns={0:'hits'})
metric_incidence
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 2 |
1 | 0 | 0 | 0 | 3 | 1 |
2 | 0 | 0 | 0 | 7 | 2 |
3 | 0 | 0 | 0 | 9 | 1 |
4 | 0 | 0 | 0 | 10 | 1 |
... | ... | ... | ... | ... | ... |
7979 | 11 | 11 | 11 | 4 | 1 |
7980 | 11 | 11 | 11 | 5 | 1 |
7981 | 11 | 11 | 11 | 8 | 1 |
7982 | 11 | 11 | 11 | 9 | 1 |
7983 | 11 | 11 | 11 | 10 | 1 |
7984 rows × 5 columns
metric_incidence.describe()
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
count | 7984.000000 | 7984.000000 | 7984.000000 | 7984.000000 | 7984.000000 |
mean | 5.531062 | 5.534945 | 5.471443 | 5.490857 | 1.252505 |
std | 3.449794 | 3.461502 | 3.449125 | 3.454457 | 0.514228 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 3.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 |
50% | 6.000000 | 6.000000 | 5.000000 | 5.000000 | 1.000000 |
75% | 9.000000 | 9.000000 | 8.000000 | 8.250000 | 1.000000 |
max | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 5.000000 |
metric_incidence.boxplot()
This is a fun time to introduce another tool; you can format dataframe presentation really simply. DOCS
metric_incidence.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
So now we can be fairly confident that our slider queries were spread over the distribution space fairly evenly, and that there do not appear to be any significant correlations between the query values that would affect the output distribution.
The below is a dirty, dirty hack presented with no explanation whatsoever; you give it a messy data frame and it tells you what order you should display those values in 3D plots.
from sklearn.ensemble import RandomForestClassifier
def optimal_feature_display_order(data: pd.DataFrame) -> list:
    """Use a Random Forest Classifier to identify the 'most changed' orientations to present labeled data"""
    clf = RandomForestClassifier(max_features=data.shape[1] - 1)
    clf.fit(data.values[:, 1:], data.values[:, 0])
    features = pd.Series(clf.feature_importances_, index=data.columns[1:])
    return features.sort_values(ascending=False).index.to_list()
optimal_feature_display_order(
    metric_incidence
)
['Travel', 'Joined', 'Time', 'hits']
px.scatter_3d(metric_incidence,x='Travel',y='Joined',z='Time', color='hits')
From this we can see that we've repeatedly hit the same coordinates a few times but have generally been well spread out.
Can we say the same about our categorical inputs somehow?
categories_df = df['Categories'].to_frame().copy()
for c in categories:
    categories_df[c] = categories_df['Categories'].apply(lambda l: c in l)
categories_df.drop('Categories', axis=1, inplace=True)
categories_df
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | True | False | False | True | False | True | False | True | False | False | True |
1 | True | True | False | False | True | False | True | False | True | False | False | False |
2 | True | False | False | True | False | True | False | False | False | False | True | True |
3 | True | False | False | False | True | True | False | True | False | True | False | False |
4 | False | False | False | False | True | False | True | False | True | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | True | True | False | True | False | False | False | False | True | True | False | False |
9996 | False | False | False | True | False | False | True | True | True | False | True | False |
9997 | False | False | True | False | True | False | False | True | True | True | False | True |
9998 | False | True | False | True | True | False | True | False | False | False | True | False |
9999 | False | False | False | False | True | True | False | True | False | True | False | True |
10000 rows × 12 columns
As we can see, because these new 'interest' categories are booleans, we get a different output from describe
based on the dtype.
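On a toy boolean series (a sketch, not this dataset), describe switches to the count/unique/top/freq summary rather than mean/std:

```python
import pandas as pd

s = pd.Series([True, False, True])
desc = s.describe()
print(desc)  # count/unique/top/freq, because the dtype isn't numeric
```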
categories_df.describe()
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
top | False | False | False | False | False | False | False | False | False | False | False | False |
freq | 6257 | 6253 | 6139 | 6194 | 6116 | 6248 | 6299 | 6266 | 6255 | 6226 | 6222 | 6342 |
But we can still perform a covariance check to see if anything stands out. (It doesn't!)
categories_df.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Activism | 0.234223 | -0.014652 | -0.011918 | -0.013360 | -0.013079 | -0.013639 | -0.014630 | -0.011865 | -0.010676 | -0.011162 | -0.012312 | -0.017521 |
Community | -0.014652 | 0.234323 | -0.013873 | -0.014112 | -0.014135 | -0.012989 | -0.014378 | -0.008814 | -0.015927 | -0.014213 | -0.015063 | -0.014067 |
Competing | -0.011918 | -0.013873 | 0.237050 | -0.008250 | -0.016263 | -0.015666 | -0.012697 | -0.013771 | -0.009795 | -0.013315 | -0.015570 | -0.012037 |
Culture | -0.013360 | -0.014112 | -0.008250 | 0.235767 | -0.012126 | -0.011802 | -0.015562 | -0.013817 | -0.015336 | -0.014440 | -0.013092 | -0.009524 |
Democracy | -0.013079 | -0.014135 | -0.016263 | -0.012126 | 0.237569 | -0.012229 | -0.011748 | -0.013730 | -0.016357 | -0.012883 | -0.011839 | -0.012178 |
Gaming | -0.013639 | -0.012989 | -0.015666 | -0.011802 | -0.012229 | 0.234448 | -0.011563 | -0.012501 | -0.013614 | -0.012502 | -0.009752 | -0.013850 |
Learning | -0.014630 | -0.014378 | -0.012697 | -0.015562 | -0.011748 | -0.011563 | 0.233149 | -0.011296 | -0.014704 | -0.013177 | -0.011225 | -0.014484 |
MakeFriends | -0.011865 | -0.008814 | -0.013771 | -0.013817 | -0.013730 | -0.012501 | -0.011296 | 0.233996 | -0.010339 | -0.015023 | -0.017172 | -0.011691 |
Network | -0.010676 | -0.015927 | -0.009795 | -0.015336 | -0.016357 | -0.013614 | -0.014704 | -0.010339 | 0.234273 | -0.010837 | -0.012987 | -0.011993 |
Outdoors | -0.011162 | -0.014213 | -0.013315 | -0.014440 | -0.012883 | -0.012502 | -0.013177 | -0.015023 | -0.010837 | 0.234993 | -0.014183 | -0.012554 |
Perform | -0.012312 | -0.015063 | -0.015570 | -0.013092 | -0.011839 | -0.009752 | -0.011225 | -0.017172 | -0.012987 | -0.014183 | 0.235091 | -0.015001 |
Stayactive | -0.017521 | -0.014067 | -0.012037 | -0.009524 | -0.012178 | -0.013850 | -0.014484 | -0.011691 | -0.011993 | -0.012554 | -0.015001 | 0.232014 |
We can even go crazy and combine the two record-sets and see if there are any cross-observation patterns!
We can do this because both dataframes still share the same index.
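A tiny sketch of why the shared index matters: concat with axis=1 aligns rows by index label, not by position (hypothetical frames):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[1, 0])  # same labels, different order
joined = pd.concat([a, b], axis=1)
print(joined.loc[0, 'y'])  # 4 -- matched by label, not by row position
```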
pd.concat([metrics_df, categories_df], join='inner', axis=1).astype(int).cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Budget | Time | Travel | Joined | Activism | Community | Competing | Culture | Democracy | Gaming | Learning | MakeFriends | Network | Outdoors | Perform | Stayactive | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Budget | 11.883027 | -0.050050 | 0.015763 | 0.153248 | 0.028350 | -0.005566 | -0.024122 | -0.013000 | -0.004841 | -0.009632 | 0.008878 | -0.013377 | -0.022962 | -0.002500 | -0.000212 | 0.021662 |
Time | -0.050050 | 12.041452 | 0.169866 | 0.181255 | 0.016080 | -0.003445 | -0.003506 | 0.007164 | 0.008412 | -0.004225 | 0.017923 | -0.014821 | 0.013368 | -0.009352 | -0.035178 | 0.006821 |
Travel | 0.015763 | 0.169866 | 11.907401 | 0.000879 | 0.009884 | 0.020202 | 0.019684 | -0.037504 | 0.000929 | 0.004071 | -0.020497 | -0.006706 | 0.009892 | -0.007137 | 0.010582 | 0.039677 |
Joined | 0.153248 | 0.181255 | 0.000879 | 11.837555 | -0.013373 | -0.024276 | -0.001735 | 0.013046 | 0.002903 | 0.004174 | -0.026151 | 0.038487 | 0.005228 | 0.006363 | -0.008541 | 0.022627 |
Activism | 0.028350 | 0.016080 | 0.009884 | -0.013373 | 0.234223 | -0.014652 | -0.011918 | -0.013360 | -0.013079 | -0.013639 | -0.014630 | -0.011865 | -0.010676 | -0.011162 | -0.012312 | -0.017521 |
Community | -0.005566 | -0.003445 | 0.020202 | -0.024276 | -0.014652 | 0.234323 | -0.013873 | -0.014112 | -0.014135 | -0.012989 | -0.014378 | -0.008814 | -0.015927 | -0.014213 | -0.015063 | -0.014067 |
Competing | -0.024122 | -0.003506 | 0.019684 | -0.001735 | -0.011918 | -0.013873 | 0.237050 | -0.008250 | -0.016263 | -0.015666 | -0.012697 | -0.013771 | -0.009795 | -0.013315 | -0.015570 | -0.012037 |
Culture | -0.013000 | 0.007164 | -0.037504 | 0.013046 | -0.013360 | -0.014112 | -0.008250 | 0.235767 | -0.012126 | -0.011802 | -0.015562 | -0.013817 | -0.015336 | -0.014440 | -0.013092 | -0.009524 |
Democracy | -0.004841 | 0.008412 | 0.000929 | 0.002903 | -0.013079 | -0.014135 | -0.016263 | -0.012126 | 0.237569 | -0.012229 | -0.011748 | -0.013730 | -0.016357 | -0.012883 | -0.011839 | -0.012178 |
Gaming | -0.009632 | -0.004225 | 0.004071 | 0.004174 | -0.013639 | -0.012989 | -0.015666 | -0.011802 | -0.012229 | 0.234448 | -0.011563 | -0.012501 | -0.013614 | -0.012502 | -0.009752 | -0.013850 |
Learning | 0.008878 | 0.017923 | -0.020497 | -0.026151 | -0.014630 | -0.014378 | -0.012697 | -0.015562 | -0.011748 | -0.011563 | 0.233149 | -0.011296 | -0.014704 | -0.013177 | -0.011225 | -0.014484 |
MakeFriends | -0.013377 | -0.014821 | -0.006706 | 0.038487 | -0.011865 | -0.008814 | -0.013771 | -0.013817 | -0.013730 | -0.012501 | -0.011296 | 0.233996 | -0.010339 | -0.015023 | -0.017172 | -0.011691 |
Network | -0.022962 | 0.013368 | 0.009892 | 0.005228 | -0.010676 | -0.015927 | -0.009795 | -0.015336 | -0.016357 | -0.013614 | -0.014704 | -0.010339 | 0.234273 | -0.010837 | -0.012987 | -0.011993 |
Outdoors | -0.002500 | -0.009352 | -0.007137 | 0.006363 | -0.011162 | -0.014213 | -0.013315 | -0.014440 | -0.012883 | -0.012502 | -0.013177 | -0.015023 | -0.010837 | 0.234993 | -0.014183 | -0.012554 |
Perform | -0.000212 | -0.035178 | 0.010582 | -0.008541 | -0.012312 | -0.015063 | -0.015570 | -0.013092 | -0.011839 | -0.009752 | -0.011225 | -0.017172 | -0.012987 | -0.014183 | 0.235091 | -0.015001 |
Stayactive | 0.021662 | 0.006821 | 0.039677 | 0.022627 | -0.017521 | -0.014067 | -0.012037 | -0.009524 | -0.012178 | -0.013850 | -0.014484 | -0.011691 | -0.011993 | -0.012554 | -0.015001 | 0.232014 |
From this we can actually see a few interesting things;
First, that covariance isn't directional: the matrix is symmetric, so the small positive covariance between 'Joined' and 'MakeFriends' tells us they co-occur slightly more often than chance, but not whether one drives the other.
Second, that mixing data types in this way is a very risky thing and should only be used to make yourself feel better about a dataset, not to derive any real conclusions from...
So we've confirmed our assumption that our query generator isn't doing something insane. So what, if anything, is the recommender doing?
Let's go back to our first 'output' graph; this was formatted to show the clubs that got 'more' or 'fewer' recommendations.
Note that we've established that we are asking random questions... so an initial hypothesis is that the answers are random; how can we test that?
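One rough back-of-the-envelope test (my sketch, not part of the original analysis): if every query returned 6 clubs drawn uniformly from the 99 on offer, each club's total count over 10,000 queries would be roughly binomial, and anything far outside a few standard deviations of the expectation starts to look non-random. In reality queries return *up to* 6 clubs, so treat this as a loose bound.

```python
import math

n_queries, n_clubs, recs_per_query = 10_000, 99, 6

# Under a uniform-random recommender, a given club appears in one query
# with probability ~6/99, so its count is ~Binomial(10_000, 6/99).
p = recs_per_query / n_clubs
expected = n_queries * p
sigma = math.sqrt(n_queries * p * (1 - p))

print(f'expected ~{expected:.0f}, 3-sigma upper bound ~{expected + 3 * sigma:.0f}')
# Golf Club's 993 recommendations sit far above that bound.
```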
px.bar(club_recommended_n - club_recommended_n.mean())
Let's start off with some maths; if you're interested, go nosy at the Central Limit Theorem.
Fundamentally, as you take more and more samples of a random variable, the distribution of the sample mean approaches a normal distribution centred on the true mean. For instance, below, if you stopped at 16 dice rolls you could be forgiven for suspecting that your die was rigged.
import matplotlib.pyplot as plt
f, axes = plt.subplots(5, 1, sharex=True, figsize=(16, 20))
random_dice = np.random.randint(1, 7, 100)  # a fair six-sided die rolls 1-6
random_dice_mean = random_dice.cumsum() / np.arange(1, len(random_dice) + 1)
for ax, i in zip(axes, [4, 8, 16, 32, 64]):
    pd.Series(random_dice_mean[:i]).hist(ax=ax)
    ax.set_ylabel(f'{i} rolls')
    ax.set_xlim(0, 6)
So, can we test our hypothesis that the recommender might just be random? Let's look at our diagram slightly differently.
px.histogram(club_recommended_n.to_frame('Recommendations'), range_x=(0,1000))
px.box(club_recommended_n.to_frame('Recommendations'), x='Recommendations', range_x=(0,1000))
club_recommended_n[club_recommended_n > 700]
Tennis Club                        715
Lawyers Without Borders Society    738
Scout Network QUB                  754
Cricket Club                       885
Snooker & Pool Club                916
Officer Training Corps Society     942
University Air Squadron Society    949
Golf Club                          993
dtype: int64
These clubs appear to be recommended significantly more often than the average.
This doesn't in any way disprove the hypothesis that the recommender is just a random generator; potentially we just haven't gathered enough data to get a 'feel' for the surface.
How about going back to our friendly 3D plot and seeing if there is anything strange about the requests that led to the recommendations of these clubs?
recommended_df = df['Recommended'].to_frame().copy()
for c in clubs.keys():
    recommended_df[c] = recommended_df['Recommended'].apply(lambda l: c in l)
recommended_df.drop('Recommended', axis=1, inplace=True)
recommended_df
Motor Club | Sign Language Society | Lawyers Without Borders Society | GAA Clubs | Triathlon Club | Martial Arts & Combat Sports Clubs | Tennis Club | Adventure Sports Clubs | Airsoft Club | Equestrian Club | ... | Volunteering | Traditional Crafts Society | Watersports Clubs | Yoga and Care Corner | Robotics Society | QUB Dragons' Den | Dodgeball Club | Belfast Marrow Society | Film Society | Aerial Sports Club | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | True | True | True | True | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
9996 | False | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
9997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
9998 | False | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | False |
9999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
10000 rows × 99 columns
top_clubs_idx = recommended_df[club_recommended_n[club_recommended_n > 700].keys()].any(axis=1) # Which rows produced any recommendation for these
top_clubs_idx
0        True
1        True
2       False
3        True
4       False
        ...
9995    False
9996    False
9997     True
9998    False
9999     True
Length: 10000, dtype: bool
top_clubs_incidence = metrics_df[top_clubs_idx]\
    .groupby(['Budget','Time','Travel','Joined'])\
    .size().reset_index().rename(columns={0:'hits'})
top_clubs_incidence
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 10 | 1 |
1 | 0 | 0 | 1 | 11 | 1 |
2 | 0 | 0 | 2 | 11 | 1 |
3 | 0 | 0 | 7 | 0 | 1 |
4 | 0 | 0 | 7 | 3 | 1 |
... | ... | ... | ... | ... | ... |
4250 | 11 | 11 | 10 | 11 | 1 |
4251 | 11 | 11 | 11 | 0 | 1 |
4252 | 11 | 11 | 11 | 1 | 1 |
4253 | 11 | 11 | 11 | 4 | 1 |
4254 | 11 | 11 | 11 | 8 | 1 |
4255 rows × 5 columns
top_clubs_incidence.cov()\
.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])
Budget | Time | Travel | Joined | hits | |
---|---|---|---|---|---|
Budget | 10.651911 | -0.311813 | -0.812403 | 0.305707 | -0.025428 |
Time | -0.311813 | 11.599528 | -3.038244 | -0.875595 | 0.100824 |
Travel | -0.812403 | -3.038244 | 10.913013 | -0.727112 | -0.051677 |
Joined | 0.305707 | -0.875595 | -0.727112 | 10.365902 | 0.000362 |
hits | -0.025428 | 0.100824 | -0.051677 | 0.000362 | 0.196289 |
Now that is a lot more significant than anything we've seen before, and would definitely indicate that these clubs are likely to be recommended across the board, except to people who rate themselves particularly low on any of the metrics.
The 3D plot doesn't massively help us here, but it does clearly show a significant blank space near the origin.
px.scatter_3d(top_clubs_incidence, x='Travel',y='Joined',z='Time', size='hits')
Check the website again. We've been using the labels 'Travel', 'Joined', 'Time', and 'Budget' for the slider values... Is that correct?
In this Section we've covered: