This article is a behind-the-scenes look at the inspection, cleaning, processing, analysis, interpretation, and modelling of the data. It is geared towards a technical audience. If you want a user-friendly summary, there is a slideshow here. See this project's readme for reproducibility information.
In this project, I will work with data from OKCupid, an online dating app. This dataset was provided to me by Codecademy as part of their "Data Science Career Path". In this project I seek to accomplish the following:
Let's begin with inspection of the data.
import pandas as pd
df = pd.read_csv('profiles.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):

# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | age | 59946 non-null | int64 |
1 | body_type | 54650 non-null | object |
2 | diet | 35551 non-null | object |
3 | drinks | 56961 non-null | object |
4 | drugs | 45866 non-null | object |
5 | education | 53318 non-null | object |
6 | essay0 | 54458 non-null | object |
7 | essay1 | 52374 non-null | object |
8 | essay2 | 50308 non-null | object |
9 | essay3 | 48470 non-null | object |
10 | essay4 | 49409 non-null | object |
11 | essay5 | 49096 non-null | object |
12 | essay6 | 46175 non-null | object |
13 | essay7 | 47495 non-null | object |
14 | essay8 | 40721 non-null | object |
15 | essay9 | 47343 non-null | object |
16 | ethnicity | 54266 non-null | object |
17 | height | 59943 non-null | float64 |
18 | income | 59946 non-null | int64 |
19 | job | 51748 non-null | object |
20 | last_online | 59946 non-null | object |
21 | location | 59946 non-null | object |
22 | offspring | 24385 non-null | object |
23 | orientation | 59946 non-null | object |
24 | pets | 40025 non-null | object |
25 | religion | 39720 non-null | object |
26 | sex | 59946 non-null | object |
27 | sign | 48890 non-null | object |
28 | smokes | 54434 non-null | object |
29 | speaks | 59896 non-null | object |
30 | status | 59946 non-null | object |

dtypes: float64(1), int64(2), object(28)
memory usage: 14.2+ MB
Let's see what these feature values actually look like:
pd.set_option("display.max_columns", None)
df.head()
| | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | essay4 | essay5 | essay6 | essay7 | essay8 | essay9 | ethnicity | height | income | job | last_online | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | a little extra | strictly anything | socially | never | working on college/university | about me:<br />\n<br />\ni would love to think... | currently working as an international agent fo... | making people laugh.<br />\nranting about a go... | the way i look. i am a six foot half asian, ha... | books:<br />\nabsurdistan, the republic, of mi... | food.<br />\nwater.<br />\ncell phone.<br />\n... | duality and humorous things | trying to find someone to hang out with. i am ... | i am new to california and looking for someone... | you want to be swept off your feet!<br />\nyou... | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | doesn’t have kids, but might want them | straight | likes dogs and likes cats | agnosticism and very serious about it | m | gemini | sometimes | english | single |
1 | 35 | average | mostly other | often | sometimes | working on space camp | i am a chef: this is what that means.<br />\n1... | dedicating everyday to being an unbelievable b... | being silly. having ridiculous amonts of fun w... | NaN | i am die hard christopher moore fan. i don't r... | delicious porkness in all of its glories.<br /... | NaN | NaN | i am very open and will share just about anyth... | NaN | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | doesn’t have kids, but might want them | straight | likes dogs and likes cats | agnosticism but not too serious about it | m | cancer | no | english (fluently), spanish (poorly), french (... | single |
2 | 38 | thin | anything | socially | NaN | graduated from masters program | i'm not ashamed of much, but writing public te... | i make nerdy software for musicians, artists, ... | improvising in different contexts. alternating... | my large jaw and large glasses are the physica... | okay this is where the cultural matrix gets so... | movement<br />\nconversation<br />\ncreation<b... | NaN | viewing. listening. dancing. talking. drinking... | when i was five years old, i was known as "the... | you are bright, open, intense, silly, ironic, ... | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | NaN | straight | has cats | NaN | m | pisces but it doesn’t matter | no | english, french, c++ | available |
3 | 23 | thin | vegetarian | socially | NaN | working on college/university | i work in a library and go to school. . . | reading things written by old dead people | playing synthesizers and organizing books acco... | socially awkward but i do my best | bataille, celine, beckett. . .<br />\nlynch, j... | NaN | cats and german philosophy | NaN | NaN | you feel so inclined. | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | doesn’t want kids | straight | likes cats | NaN | m | pisces | no | english, german (poorly) | single |
4 | 29 | athletic | NaN | socially | never | graduated from college/university | hey how's it going? currently vague on the pro... | work work work work + play | creating imagery to look at:<br />\nhttp://bag... | i smile a lot and my inquisitive nature | music: bands, rappers, musicians<br />\nat the... | NaN | NaN | NaN | NaN | NaN | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | NaN | straight | likes dogs and likes cats | NaN | m | aquarius | no | english | single |
To get a proper idea regarding these essays and columns containing text, let's print out a single user's data.
def show_user_data(df, index):
    '''Takes a row index and prints out all the features of that index'''
    for each, col in zip(df.iloc[index], df.columns):
        # ANSI escape codes render the column name in bold
        print('\033[1m' + col + ':' + '\033[0m' + ' ' + str(each))
show_user_data(df, 0)
age: 22 body_type: a little extra diet: strictly anything drinks: socially drugs: never education: working on college/university essay0: about me:<br /> <br /> i would love to think that i was some some kind of intellectual: either the dumbest smart guy, or the smartest dumb guy. can't say i can tell the difference. i love to talk about ideas and concepts. i forge odd metaphors instead of reciting cliches. like the simularities between a friend of mine's house and an underwater salt mine. my favorite word is salt by the way (weird choice i know). to me most things in life are better as metaphors. i seek to make myself a little better everyday, in some productively lazy way. got tired of tying my shoes. considered hiring a five year old, but would probably have to tie both of our shoes... decided to only wear leather shoes dress shoes.<br /> <br /> about you:<br /> <br /> you love to have really serious, really deep conversations about really silly stuff. you have to be willing to snap me out of a light hearted rant with a kiss. you don't have to be funny, but you have to be able to make me laugh. you should be able to bend spoons with your mind, and telepathically make me smile while i am still at work. you should love life, and be cool with just letting the wind blow. extra points for reading all this and guessing my favorite video game (no hints given yet). and lastly you have a good attention span. essay1: currently working as an international agent for a freight forwarding company. import, export, domestic you know the works.<br /> online classes and trying to better myself in my free time. perhaps a hours worth of a good book or a video game on a lazy sunday. essay2: making people laugh.<br /> ranting about a good salting.<br /> finding simplicity in complexity, and complexity in simplicity. essay3: the way i look. i am a six foot half asian, half caucasian mutt. it makes it tough not to notice me, and for me to blend in. 
essay4: books:<br /> absurdistan, the republic, of mice and men (only book that made me want to cry), catcher in the rye, the prince.<br /> <br /> movies:<br /> gladiator, operation valkyrie, the producers, down periscope.<br /> <br /> shows:<br /> the borgia, arrested development, game of thrones, monty python<br /> <br /> music:<br /> aesop rock, hail mary mallon, george thorogood and the delaware destroyers, felt<br /> <br /> food:<br /> i'm down for anything. essay5: food.<br /> water.<br /> cell phone.<br /> shelter. essay6: duality and humorous things essay7: trying to find someone to hang out with. i am down for anything except a club. essay8: i am new to california and looking for someone to wisper my secrets to. essay9: you want to be swept off your feet!<br /> you are tired of the norm.<br /> you want to catch a coffee or a bite.<br /> or if you want to talk philosophy. ethnicity: asian, white height: 75.0 income: -1 job: transportation last_online: 2012-06-28-20-30 location: south san francisco, california offspring: doesn’t have kids, but might want them orientation: straight pets: likes dogs and likes cats religion: agnosticism and very serious about it sex: m sign: gemini smokes: sometimes speaks: english status: single
To find out how long ago this sample was taken, let's look at the range of values of last_online.
print(df.last_online.min())
print(df.last_online.max())
2011-06-27-01-52 2012-07-01-08-57
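The span between these two timestamps can be worked out directly. The sketch below assumes the `last_online` strings follow a year-month-day-hour-minute layout, which is inferred from the values shown above rather than documented anywhere; `parse_last_online` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout, inferred from values such as '2012-06-28-20-30'.
LAST_ONLINE_FORMAT = "%Y-%m-%d-%H-%M"

def parse_last_online(value):
    """Parse a last_online string into a datetime (hypothetical helper)."""
    return datetime.strptime(value, LAST_ONLINE_FORMAT)

earliest = parse_last_online("2011-06-27-01-52")
latest = parse_last_online("2012-07-01-08-57")
span = latest - earliest

# Number of whole days covered by the sample
print(span.days)
```

Because every field is zero-padded, comparing the raw strings with `min()` and `max()` gives the same ordering as comparing the parsed dates.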
Finally, before doing anything to the data, let's see which features have missing values.
import missingno as msno
msno.matrix(df)
print('Missing values per column')
nulls = [(df[feature].isnull().sum(), feature) for feature in df.columns]
print(sorted(nulls))
Missing values per column [(0, 'age'), (0, 'income'), (0, 'last_online'), (0, 'location'), (0, 'orientation'), (0, 'sex'), (0, 'status'), (3, 'height'), (50, 'speaks'), (2985, 'drinks'), (5296, 'body_type'), (5488, 'essay0'), (5512, 'smokes'), (5680, 'ethnicity'), (6628, 'education'), (7572, 'essay1'), (8198, 'job'), (9638, 'essay2'), (10537, 'essay4'), (10850, 'essay5'), (11056, 'sign'), (11476, 'essay3'), (12451, 'essay7'), (12603, 'essay9'), (13771, 'essay6'), (14080, 'drugs'), (19225, 'essay8'), (19921, 'pets'), (20226, 'religion'), (24395, 'diet'), (35561, 'offspring')]
In this section we have learned the following:
The data has to be cleaned and preprocessed before it can be analyzed. Let's start with replacing the '-1' in the income field with NaN, the null value recognized by Pandas and NumPy.
import numpy as np
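# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
df['income'] = df['income'].replace(-1, np.nan)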
We can reinspect the missing values visualization to confirm the replacement of the null values.
msno.matrix(df)
print('Missing values per column')
nulls = [(df[feature].isnull().sum(), feature) for feature in df.columns]
print(sorted(nulls))
Missing values per column [(0, 'age'), (0, 'last_online'), (0, 'location'), (0, 'orientation'), (0, 'sex'), (0, 'status'), (3, 'height'), (50, 'speaks'), (2985, 'drinks'), (5296, 'body_type'), (5488, 'essay0'), (5512, 'smokes'), (5680, 'ethnicity'), (6628, 'education'), (7572, 'essay1'), (8198, 'job'), (9638, 'essay2'), (10537, 'essay4'), (10850, 'essay5'), (11056, 'sign'), (11476, 'essay3'), (12451, 'essay7'), (12603, 'essay9'), (13771, 'essay6'), (14080, 'drugs'), (19225, 'essay8'), (19921, 'pets'), (20226, 'religion'), (24395, 'diet'), (35561, 'offspring'), (48442, 'income')]
It appears that income is one of the pieces of information that people would least like to share. Later on, we could make a detailed comparison of the percentage of missing values of each column to evaluate "willingness of users to share information".
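As a rough sketch of that comparison, the share of missing values per column can be computed in one expression. The toy DataFrame below is made up for illustration and only stands in for the real profiles data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the profiles DataFrame; the values are invented.
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "income": [np.nan, 80000, np.nan, 20000],
    "diet": ["anything", None, None, None],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
pct_missing = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(pct_missing)
```

Sorting in descending order puts the fields users are least willing to fill in at the top.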
Let's move on to the HTML-formatted text data. Not only is it harder to read, it is also unsuitable for Natural Language Processing. Let's clean up the text using an HTML parser and a regular expression. I will first demonstrate the process on a single entry.
from bs4 import BeautifulSoup
import re
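# Create html_parser function
def html_parser(raw_html):
    '''Strips HTML markup from a string and returns the plain text'''
    raw_html = str(raw_html)
    soup = BeautifulSoup(raw_html, 'html.parser')
    soup_string = soup.get_text()
    # Remove leftover tag-like fragments one at a time; a greedy '<.*>' could also swallow the text between two fragments
    soup_string = re.sub(r'<[^>]*>', ' ', soup_string)
    return soup_string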
Before:
print(df.essay9[0])
you want to be swept off your feet!<br /> you are tired of the norm.<br /> you want to catch a coffee or a bite.<br /> or if you want to talk philosophy.
After:
print(html_parser(df.essay9[0]))
you want to be swept off your feet! you are tired of the norm. you want to catch a coffee or a bite. or if you want to talk philosophy.
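When a regular expression is used to strip tag-like fragments, the choice of pattern matters. The sketch below uses an invented sentence to compare a greedy pattern with one that stops at the first closing bracket; it illustrates the general behaviour of Python's `re` module and is not a claim about what appears in the essays.

```python
import re

# Invented example text containing two separate tags.
text = "i <b>love</b> hiking and <i>cooking</i>"

# Greedy: '.*' runs from the first '<' to the last '>' on the line.
greedy = re.sub(r"<.*>", " ", text)

# Bounded: '[^>]*' cannot cross a '>', so each tag is matched on its own.
bounded = re.sub(r"<[^>]*>", " ", text)

print(repr(greedy))
print(repr(bounded))
```

The greedy version discards the words between the tags, while the bounded version keeps them.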
Now let's apply this to all of the text in the data.
#Takes a few minutes - Parse all the raw HTML in the data
import warnings
warnings.filterwarnings('ignore')
object_cols = [series for series in df.columns if df[series].dtype == 'O']
for series in object_cols:
    df[series] = df[series].apply(lambda row: html_parser(row) if pd.notnull(row) else row)
# warnings.resetwarnings()
This step produces many warnings, mostly because users include YouTube URLs in their text. It's good to be aware of this, as the URLs could affect the analysis later.
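Silencing warnings globally also hides unrelated warnings raised later in the session. One possible alternative, sketched below with the standard library only, is to suppress them just for the noisy step. `noisy_parse` is a hypothetical stand-in for a parser that warns on URL-like input.

```python
import warnings

def noisy_parse(text):
    """Hypothetical stand-in for a parser that warns on URL-like input."""
    if text.startswith("http"):
        warnings.warn("input looks like a URL", UserWarning)
    return text.strip()

# Warnings are ignored only inside this block; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    cleaned = [noisy_parse(t) for t in ["http://example.com ", " hello "]]

print(cleaned)
```

With this pattern there is no need to remember to call `warnings.resetwarnings()` afterwards.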
In this section, we accomplished the following:
Machine learning models can perform better when given more columns, because each column offers another point of comparison. Creating additional columns, also called features, is known as feature engineering.
I'm going to back up the dataframe in its current form, so that even after modification and addition of columns, we can easily refer to the original data if necessary.
original_df_backup = df.copy(deep = True)
We can produce new features by splitting the existing ones. Some of our features are actually describing two variables that are potentially independent of each other. Later on, before we develop our model, we will investigate variable codependence by checking Pearson correlations.
def two_split_columns(delimiter, df, feature, col1_name, col2_name, swap = False, stringify_None = False, strip = False):
    '''splits a column into two. Takes a delimiter, DataFrame, and Series name'''
    new = df[feature].str.split(delimiter, n = 1, expand = True)
    if swap:
        # Single-word values land in column 0; move them to column 1 and mark the prefix as missing
        new[1] = new.apply(lambda row: row[0] if not(isinstance(row[1], str)) else row[1], axis = 1)
        new[0] = new.apply(lambda row: '[No Prefix]' if row[0] == row[1] else row[0], axis = 1)
    if stringify_None:
        new[1] = new[1].apply(lambda col: 'None' if col is None else col)
        new[0] = new[0].apply(lambda col: 'None' if col is None else col)
    if strip:
        new[1] = new[1].apply(lambda col: col.strip() if isinstance(col, str) else col)
        new[0] = new[0].apply(lambda col: col.strip() if isinstance(col, str) else col)
    df[col1_name]= new[0]
    df[col2_name]= new[1]
    return new
two_split_columns(" ", df, "diet", "diet_adherence", "diet_type", True)
two_split_columns(",", df, "location", "city", "state/country")
two_split_columns(",", df, "offspring", "offspring_want", "offspring_attitude", False, False)
##Religion
df['temp'] = df['religion'].apply(lambda col: col.replace("but", "and") if isinstance(col, str) else col)
two_split_columns("and", df, "temp", "religion_type", "religion_attitude", stringify_None = True, strip = True)
del df['temp']
## Sign
df['temp'] = df['sign'].apply(lambda col: col.replace("but", "and") if isinstance(col, str) else col)
two_split_columns("and", df, "temp", "sign_type", "sign_attitude", stringify_None = True, strip = True)
del df['temp']
Columns split in two:
Original Feature | New Feature 1 | New Feature 2 |
---|---|---|
diet | diet_adherence | diet_type |
location | city | state/country |
offspring | offspring_want | offspring_attitude |
religion | religion_type | religion_attitude |
sign | sign_type | sign_attitude |
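To make the splitting logic easier to follow, here is a small vectorised sketch of the 'diet' case on invented values. It is an alternative formulation under the assumption that single-word values describe the diet type, not the code used above.

```python
import pandas as pd

# Invented sample values; None represents a blank field.
toy = pd.Series(["strictly vegan", "mostly anything", "vegetarian", None])

# Split on the first space only, giving at most two parts.
parts = toy.str.split(" ", n=1, expand=True)

# Rows with a single word have a first part but no second part.
single_word = parts[1].isna() & parts[0].notna()

# Move the single word into the type column and flag the missing prefix.
diet_type = parts[1].where(~single_word, parts[0])
diet_adherence = parts[0].where(~single_word, "[No Prefix]")

print(pd.DataFrame({"diet_adherence": diet_adherence, "diet_type": diet_type}))
```

Blank fields stay missing in both new columns, which keeps them distinguishable from values that simply lack a prefix.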
There are really two kinds of pets among the values: dogs and cats. Let's proceed by splitting 'pets' into 'dogs' and 'cats'.
temp = two_split_columns("and", df, "pets", "dogs", "cats", False, False, strip = True)
temp[1] = temp.apply(lambda row: row[0] if "cat" in str(row[0]) else row[1], axis = 1)
temp[0] = temp.apply(lambda row: None if "cat" in str(row[0]) else row[0], axis = 1)
df["dogs"] = temp[0]
df["cats"] = temp[1]
Having a pet and liking a pet are not the same thing. Let's further split these columns into 'dog_preference', 'has_dogs', 'cat_preference', and 'has_cats', and then remove 'dogs' and 'cats'.
df['dog_preference'] = df.apply(lambda row: row['dogs'] if "like" in str(row['dogs']) else None, axis = 1)
df['cat_preference'] = df.apply(lambda row: row['cats'] if "like" in str(row['cats']) else None, axis = 1)
df['has_dogs'] = df.apply(lambda row: 1 if "has" in str(row['dogs']) else 0, axis = 1)
df['has_dogs'] = df.apply(lambda row: row['has_dogs'] if isinstance(row['pets'], str) else None, axis = 1)
df['has_cats'] = df.apply(lambda row: 1 if "has" in str(row['cats']) else 0, axis = 1)
df['has_cats'] = df.apply(lambda row: row['has_cats'] if isinstance(row['pets'], str) else None, axis = 1)
del df['dogs']
del df['cats']
df[['dog_preference', 'has_dogs', 'cat_preference', 'has_cats']].head()
| | dog_preference | has_dogs | cat_preference | has_cats |
---|---|---|---|---|
0 | likes dogs | 0.0 | likes cats | 0.0 |
1 | likes dogs | 0.0 | likes cats | 0.0 |
2 | None | 0.0 | None | 1.0 |
3 | None | 0.0 | likes cats | 0.0 |
4 | likes dogs | 0.0 | likes cats | 0.0 |
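The same decomposition can be expressed as a single parsing function. The sketch below is an alternative written in plain Python; `parse_pets` is a hypothetical name, and it assumes each clause has the form "verb animal" joined by " and ", as in the values shown above.

```python
def parse_pets(value):
    """Split a pets string such as 'likes dogs and has cats' into four fields."""
    if not isinstance(value, str):
        return None  # the field was left blank
    result = {"dog_preference": None, "has_dogs": 0,
              "cat_preference": None, "has_cats": 0}
    for clause in value.split(" and "):
        clause = clause.strip()
        verb, _, animal = clause.partition(" ")
        if animal.startswith("dog"):
            key = "dog"
        elif animal.startswith("cat"):
            key = "cat"
        else:
            continue  # unrecognised clause, skipped in this sketch
        if verb == "has":
            result[f"has_{key}s"] = 1
        else:
            result[f"{key}_preference"] = clause
    return result

print(parse_pets("likes dogs and has cats"))
```

Keeping the logic in one function makes it straightforward to test individual strings before applying it to the whole column.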
The 'speaks' column needs quite a bit of unpacking. There appear to be several different languages and different options for fluency. Before we start making new columns, let's get a better sense of what exactly our values are.
languages = df.speaks.value_counts()
languages = languages.index
languages = [each.split(',') for each in languages]
languages = [item for sublist in languages for item in sublist]
languages = set([each.strip() for each in languages])
temp = languages
languages = [each.split(" (") for each in languages]
languages = [each[0] for each in languages]
languages = sorted(set(languages))
print("There are {} different language options:".format(len(languages)))
print(languages)
print("There are 4 different fluency options per language:")
print(sorted(temp)[:4])
There are 77 different language options: ['afrikaans', 'albanian', 'ancient greek', 'arabic', 'armenian', 'basque', 'belarusan', 'bengali', 'breton', 'bulgarian', 'c++', 'catalan', 'cebuano', 'chechen', 'chinese', 'croatian', 'czech', 'danish', 'dutch', 'english', 'esperanto', 'estonian', 'farsi', 'finnish', 'french', 'frisian', 'georgian', 'german', 'greek', 'gujarati', 'hawaiian', 'hebrew', 'hindi', 'hungarian', 'icelandic', 'ilongo', 'indonesian', 'irish', 'italian', 'japanese', 'khmer', 'korean', 'latin', 'latvian', 'lisp', 'lithuanian', 'malay', 'maori', 'mongolian', 'norwegian', 'occitan', 'other', 'persian', 'polish', 'portuguese', 'romanian', 'rotuman', 'russian', 'sanskrit', 'sardinian', 'serbian', 'sign language', 'slovak', 'slovenian', 'spanish', 'swahili', 'swedish', 'tagalog', 'tamil', 'thai', 'tibetan', 'turkish', 'ukrainian', 'urdu', 'vietnamese', 'welsh', 'yiddish'] There are 4 different fluency options per language: ['afrikaans', 'afrikaans (fluently)', 'afrikaans (okay)', 'afrikaans (poorly)']
After isolating the terms, we find that the 'speaks' values contain 77 different languages and four different descriptors of language fluency. The ambiguity of the fluency options presents a dilemma: what is 'afrikaans' supposed to mean compared to 'afrikaans (okay)'? Because of this ambiguity, we will not make use of the fluency descriptors in our visualization. We will create a new column for each language containing a 1 if the language is spoken and a 0 if not. The result is called a sparse matrix: 'sparse' because it contains many more 0s than 1s.
#Takes a few minutes
for language in languages:
    df[language] = df.apply(lambda row: 1 if (language in str(row['speaks'])) else 0, axis = 1)
df[languages].head(3)
| | afrikaans | albanian | ancient greek | arabic | armenian | basque | belarusan | bengali | breton | bulgarian | c++ | catalan | cebuano | chechen | chinese | croatian | czech | danish | dutch | english | esperanto | estonian | farsi | finnish | french | frisian | georgian | german | greek | gujarati | hawaiian | hebrew | hindi | hungarian | icelandic | ilongo | indonesian | irish | italian | japanese | khmer | korean | latin | latvian | lisp | lithuanian | malay | maori | mongolian | norwegian | occitan | other | persian | polish | portuguese | romanian | rotuman | russian | sanskrit | sardinian | serbian | sign language | slovak | slovenian | spanish | swahili | swedish | tagalog | tamil | thai | tibetan | turkish | ukrainian | urdu | vietnamese | welsh | yiddish |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
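One caveat with a substring test is that one language name can sit inside another: both 'greek' and 'ancient greek' appear in the list above. The sketch below shows an exact-name alternative in plain Python; `spoken_languages` is a hypothetical helper and the profile string is invented.

```python
def spoken_languages(speaks):
    """Return the set of language names in a 'speaks' string, ignoring fluency."""
    if not isinstance(speaks, str):
        return set()
    # Each entry looks like 'english (fluently)' or just 'english'.
    return {entry.split(" (")[0].strip() for entry in speaks.split(",")}

profile = "english (fluently), ancient greek (poorly)"

# Substring test: matches because 'greek' is part of 'ancient greek'.
print("greek" in profile)

# Exact-name test: only 'english' and 'ancient greek' are present.
print("greek" in spoken_languages(profile))
```

Whether this distinction matters depends on how many profiles list the overlapping names, which is not examined here.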
Let's investigate the memory used by the sparse matrix, using a small helper function that prints the memory usage of a DataFrame.
#Credit where credit is due: Function from Dafni Sidiropoulou Velidou's article "Working with sparse data sets in pandas and sklearn"
BYTES_TO_MB_DIV = 0.000001
def print_memory_usage_of_data_frame(df):
    mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 3)
    print("Memory usage is " + str(mem) + " MB")
print_memory_usage_of_data_frame(df[languages])
Memory usage is 36.927 MB
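This figure can be sanity-checked with simple arithmetic. The sketch below assumes the 0/1 indicators are stored as 64-bit integers (8 bytes each), which is an assumption about the default dtype rather than something printed above, and it ignores the small overhead of the index.

```python
n_rows = 59946         # from df.info()
n_language_cols = 77   # number of language options found above
bytes_per_int64 = 8    # assumption: indicators stored as int64
bytes_per_uint8 = 1    # a dense 8-bit alternative, for comparison

BYTES_TO_MB_DIV = 0.000001

int64_bytes = n_rows * n_language_cols * bytes_per_int64
uint8_bytes = n_rows * n_language_cols * bytes_per_uint8

print(round(int64_bytes * BYTES_TO_MB_DIV, 3))  # dense int64 estimate in MB
print(round(uint8_bytes * BYTES_TO_MB_DIV, 3))  # dense uint8 estimate in MB
```

The int64 estimate agrees with the reported usage, and the uint8 estimate shows how much of the saving would come from the smaller dtype alone.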
Converting our sparse matrix to a pandas SparseArray reduces its memory usage and can shorten the time machine learning algorithms spend training on it.
temp = df[languages].copy()
# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
for (columnName, columnData) in temp.items():
    df[columnName] = pd.arrays.SparseArray(columnData.values, dtype='uint8')
print_memory_usage_of_data_frame(df[languages])
Memory usage is 0.553 MB
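The saving comes from storing only the entries that differ from a fill value. A minimal sketch on an invented ten-element column, assuming a fill value of 0:

```python
import numpy as np
import pandas as pd

# Invented indicator column: a single 1 among ten entries.
dense = pd.Series(np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype="uint8"))

# Only values different from fill_value are physically stored.
sparse = pd.Series(pd.arrays.SparseArray(dense.to_numpy(), fill_value=0))

# Fraction of entries that are actually stored
print(sparse.sparse.density)

# Converting back recovers the original values
print(sparse.sparse.to_dense().tolist())
```

The lower the density, the larger the benefit, which is why the language indicators compress so well.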
There are other ways to derive new features in addition to splitting.
temp = df['ethnicity'].apply(lambda col: col.split(",") if isinstance(col, str) else col)
df['num_ethnicities'] = temp.apply(lambda col: len(col) if isinstance(col, list) else col)
df['optional_%unfilled'] = ((original_df_backup.isnull().sum(axis = 1))/24) * 100
# Use np.nan explicitly: the meaning of replace(0, None) has differed between pandas versions
df['num_languages'] = df[languages].sum(axis = 1).replace(0, np.nan)
Added Columns:
New Feature | Description |
---|---|
num_ethnicities | Contains the number of ethnicities listed in 'ethnicity' |
optional_%unfilled | Percentage of optional fields unfilled |
num_languages | Count of languages spoken |
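The first two derived features can also be sketched in vectorised form. The toy DataFrame below is invented, and the percentage is taken over all of its columns, whereas the code above divides by a fixed count of optional fields.

```python
import pandas as pd

# Invented stand-in with two optional fields.
toy = pd.DataFrame({
    "ethnicity": ["asian, white", "white", None],
    "diet": ["anything", None, None],
})

# One more ethnicity than there are commas; blank fields stay missing.
num_ethnicities = toy["ethnicity"].str.count(",") + 1

# Share of fields left blank in each row, as a percentage.
pct_unfilled = toy.isnull().sum(axis=1) / toy.shape[1] * 100

print(num_ethnicities.tolist())
print(pct_unfilled.tolist())
```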
In this section, we accomplished the following:
Original Feature | New Feature 1 | New Feature 2 |
---|---|---|
diet | diet_adherence | diet_type |
location | city | state/country |
offspring | offspring_want | offspring_attitude |
religion | religion_type | religion_attitude |
sign | sign_type | sign_attitude |
New Feature | Description |
---|---|
num_ethnicities | Contains the number of ethnicities listed in 'ethnicity' |
optional_%unfilled | Percentage of optional fields unfilled |
num_languages | Count of languages spoken |
One of the best ways to explore data is to visualize it. Let's start by generating histograms and box plots of our numerical features.
import plotly.express as px
selection = ['age', 'height', 'income', 'optional_%unfilled']
fig = px.histogram(df, x = selection, marginal = 'box', histnorm = 'percent').update_traces(visible = False, showlegend=False)
fig.data[0].visible = True
fig.data[1].visible = True
def create_button(column):
    # Each feature has two traces (histogram and marginal box plot), so its visibility flag is appended twice
    temp = []
    for each in selection:
        temp.append(each == column)
        temp.append(each == column)
    return dict(
        args=[{"visible": temp}],
        label=column,
        method="update"
    )
fig.update_layout(
    title = "2012 OKCupid Profiles",
    updatemenus=[
        # Add a dropdown selector
        dict(
            buttons = [create_button(column) for column in selection],
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
        )
    ]
)
fig.show()
Use the dropdown selector to switch between features. Datapoints show more information on mouseover. The graph can be panned and zoomed.
Feature | Comment |
---|---|
Age | The median age is 30. The distribution is right-skewed. Most users are young and working adults. |
Height | Most heights range between 59 and 78 in (4.9 - 6.5 ft), with a median of 68 in (5.67 ft). The distribution appears normal. |
Income | The median income is \$50k. A quarter of all incomes are \$25k. Most income distributions are log-normal (if the ultra-wealthy are not included), and this one is no different. |
All Features Above | The box plots show that all of the distributions have outliers: there is a 4.5% group of \$1M earners, there are two people over 100 years old, and there is a height of 1 in. Below, we will inspect these data points (user profiles) to see what's really going on. |
optional_%unfilled | The y-axis shows the percentage of users and the x-axis shows the percentage of optional user information fields left unfilled. The interquartile range is 12% - 30%, meaning half of all users leave between 12% and 30% of optional fields unfilled. A quarter of users are below that range and a quarter are above it. This feature was engineered from the given raw data. |
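Box plots commonly flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a handful of invented ages using only the standard library; the quartile method chosen here is an assumption and may differ slightly from the one the plotting library uses.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Invented sample of ages, including one implausibly high value.
ages = [22, 23, 29, 30, 31, 35, 38, 110]
print(iqr_outliers(ages))
```

A flagged point is only a candidate for inspection; whether it is an error still has to be judged from the profile itself.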
Let's take a look at the profiles of our centenarian users.
display(df.iloc[2512:2513])
display(df.iloc[25324:25325])
age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | essay4 | essay5 | essay6 | essay7 | essay8 | essay9 | ethnicity | height | income | job | last_online | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status | diet_adherence | diet_type | city | state/country | offspring_want | offspring_attitude | religion_type | religion_attitude | sign_type | sign_attitude | dog_preference | cat_preference | has_dogs | has_cats | afrikaans | albanian | ancient greek | arabic | armenian | basque | belarusan | bengali | breton | bulgarian | c++ | catalan | cebuano | chechen | chinese | croatian | czech | danish | dutch | english | esperanto | estonian | farsi | finnish | french | frisian | georgian | german | greek | gujarati | hawaiian | hebrew | hindi | hungarian | icelandic | ilongo | indonesian | irish | italian | japanese | khmer | korean | latin | latvian | lisp | lithuanian | malay | maori | mongolian | norwegian | occitan | other | persian | polish | portuguese | romanian | rotuman | russian | sanskrit | sardinian | serbian | sign language | slovak | slovenian | spanish | swahili | swedish | tagalog | tamil | thai | tibetan | turkish | ukrainian | urdu | vietnamese | welsh | yiddish | num_ethnicities | optional_%unfilled | num_languages | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2512 | 110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 67.0 | NaN | NaN | 2012-06-27-22-16 | daly city, california | NaN | straight | NaN | NaN | f | NaN | NaN | english | single | NaN | NaN | daly city | california | NaN | NaN | NaN | NaN | NaN | NaN | None | None | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 95.833333 | 1 |
age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | essay4 | essay5 | essay6 | essay7 | essay8 | essay9 | ethnicity | height | income | job | last_online | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status | diet_adherence | diet_type | city | state/country | offspring_want | offspring_attitude | religion_type | religion_attitude | sign_type | sign_attitude | dog_preference | cat_preference | has_dogs | has_cats | afrikaans | albanian | ancient greek | arabic | armenian | basque | belarusan | bengali | breton | bulgarian | c++ | catalan | cebuano | chechen | chinese | croatian | czech | danish | dutch | english | esperanto | estonian | farsi | finnish | french | frisian | georgian | german | greek | gujarati | hawaiian | hebrew | hindi | hungarian | icelandic | ilongo | indonesian | irish | italian | japanese | khmer | korean | latin | latvian | lisp | lithuanian | malay | maori | mongolian | norwegian | occitan | other | persian | polish | portuguese | romanian | rotuman | russian | sanskrit | sardinian | serbian | sign language | slovak | slovenian | spanish | swahili | swedish | tagalog | tamil | thai | tibetan | turkish | ukrainian | urdu | vietnamese | welsh | yiddish | num_ethnicities | optional_%unfilled | num_languages | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25324 | 109 | athletic | mostly other | NaN | never | working on masters program | NaN | NaN | NaN | nothing | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 95.0 | NaN | student | 2012-06-30-18-18 | san francisco, california | might want kids | straight | NaN | other and somewhat serious about it | m | aquarius but it doesn’t matter | when drinking | english (okay) | available | mostly | other | san francisco | california | might want kids | None | other | somewhat serious about it | aquarius | it doesn’t matter | None | None | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 54.166667 | 1 |
The first profile has about 96% of its optional fields unfilled. The second has about 54% unfilled and a height of 95 in (roughly 7.9 ft). Do you think these profiles are reliable?
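One way to act on doubts like these is a simple plausibility check. The bounds in the sketch below are hypothetical thresholds chosen for illustration, not values derived from the data, and `is_plausible` is an invented helper name.

```python
# Hypothetical plausibility bounds; these are assumptions, not taken from the data.
AGE_RANGE = (18, 100)
HEIGHT_RANGE_IN = (36, 90)

def is_plausible(age, height_in):
    """Return True when both age and height fall inside the assumed bounds."""
    return (AGE_RANGE[0] <= age <= AGE_RANGE[1]
            and HEIGHT_RANGE_IN[0] <= height_in <= HEIGHT_RANGE_IN[1])

print(is_plausible(110, 67.0))  # age outside the assumed range
print(is_plausible(109, 95.0))  # age and height outside the assumed ranges
print(is_plausible(22, 75.0))   # inside both ranges

# Height of the second profile converted from inches to feet
print(round(95 / 12, 2))
```

Flagging rather than deleting such rows keeps the decision reversible if the thresholds turn out to be too strict.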
Below is an example of the kind of user profile which 'has a height of 1 inch'.
show_user_data(df, 45959)
age: 36 body_type: nan diet: nan drinks: very often drugs: never education: graduated from college/university essay0: i'm a transplant from southern california with roots in lovely oakland. i find myself bored quite a bit. hence why i'm here. i like whiskey and find the term hipsters laughable yet amusing. the two seem to go hand in hand around oaklands many dive bars. people often describe me as "intense" or "stoic". i find either of those words loaded and annoying. i have a gift for writing and although i take myself pretty seriously, i don't take this online stuff too seriously. i'm looking to find someone a little more burly than me who enjoys a good day holding hands on the beach, whale watching and a fantastic trip to the olive garden. i am nerdcore, nice smelling, and a unique snowflake essay1: living each day as if it'd be my last, and managing to hold down a day-job in the process. of course i'm going to make a crack at wanting a quaint craftsman home in rockridge; complete with a white picket fence, a loving wife (preferably not mail order), 2.5 kids, and a green prius parked in the driveway. with a pony... just kidding. essay2: writing, rambling, fixing anything thats breakable, photography. drinking you under the table. traveling the road less traveled. essay3: my otherwise perfect hair, my half-sleeve tattoo(s), my perfect teeth i suppose. oh and my rather dry and snarky sense of humor. essay4: i enjoy reading louis sachar, james burke, alan kaufman,hunter s. thompson, chuck palahniuk for starters. as far as music goes: i tend to listen to old and new punk with bouts involving the classics: frank sinatra tom waits frankie avalon the misfits crass love and rockets ugly ducklings beastie boys the vandals propaghandi minor threat sonic youth anything like old crusty-punk on stereophonic vinyl you get the idea. i noticed that its a very "hipster" thing to say that you don't own or even like tv's when i actually own 2 *and* tivo. 
hows that for going against the grain? netflix, hulu and rapidshare are my best friend. essay5: irish whiskey broadband internet my iphone a decent pair of good fitting jeans cheese my needy and probably neurotic cat "peanut" essay6: zombies, and ways not to get eaten by them. my next big adventure essay7: peering through the bottom of a whiskey glass. perpetually rearranging my itunes playlist. riding my iconic vintage peugeot fixed-gear conversion bicycle uphill on a pbr run. writing poetry in spiral patterns on napkins in various coffee shops and giving them to complete strangers. essay8: i can't stand online dating, or the stigma attached to it. essay9: if you're a ninja (with skills) or a pirate. you're well mannered. you're not a raging sociopath or a creeper. you like whiskey, you know your whiskeys. you dont type like dis. and if you picked up on the dry sarcasm , you should **definitely** message me. ethnicity: nan height: 1.0 income: 80000.0 job: science / tech / engineering last_online: 2012-06-28-14-18 location: oakland, california offspring: nan orientation: straight pets: likes dogs and has cats religion: judaism and laughing about it sex: m sign: leo and it’s fun to think about smokes: when drinking speaks: english (fluently), c++ (fluently), hebrew (poorly), other (poorly) status: single diet_adherence: nan diet_type: nan city: oakland state/country: california offspring_want: nan offspring_attitude: nan religion_type: judaism religion_attitude: laughing about it sign_type: leo sign_attitude: it’s fun to think about dog_preference: likes dogs cat_preference: None has_dogs: 0.0 has_cats: 1.0 afrikaans: 0 albanian: 0 ancient greek: 0 arabic: 0 armenian: 0 basque: 0 belarusan: 0 bengali: 0 breton: 0 bulgarian: 0 c++: 1 catalan: 0 cebuano: 0 chechen: 0 chinese: 0 croatian: 0 czech: 0 danish: 0 dutch: 0 english: 1 esperanto: 0 estonian: 0 farsi: 0 finnish: 0 french: 0 frisian: 0 georgian: 0 german: 0 greek: 0 gujarati: 0 hawaiian: 0 hebrew: 1 hindi: 0 
hungarian: 0 icelandic: 0 ilongo: 0 indonesian: 0 irish: 0 italian: 0 japanese: 0 khmer: 0 korean: 0 latin: 0 latvian: 0 lisp: 0 lithuanian: 0 malay: 0 maori: 0 mongolian: 0 norwegian: 0 occitan: 0 other: 1 persian: 0 polish: 0 portuguese: 0 romanian: 0 rotuman: 0 russian: 0 sanskrit: 0 sardinian: 0 serbian: 0 sign language: 0 slovak: 0 slovenian: 0 spanish: 0 swahili: 0 swedish: 0 tagalog: 0 tamil: 0 thai: 0 tibetan: 0 turkish: 0 ukrainian: 0 urdu: 0 vietnamese: 0 welsh: 0 yiddish: 0 num_ethnicities: nan optional_%unfilled: 16.666666666666664 num_languages: 4
Below is the profile of someone who has an income of $1M.
show_user_data(df, 27590)
age: 32 body_type: fit diet: anything drinks: socially drugs: never education: graduated from space camp essay0: update: okstupid removed my macbook "fun house mirror" effect self-portrait, so you will no longer be able to see how hideously disfigured i was in that photo. apparently, pictures of the back of your head are o.k., but pictures taken with a slightly distorting lens are not. is that how it works? i don't quite get it. mi hermana hablas espanol sehr gut. aber yo hablo espanol solamente en restorants. por ehemplo: "por favor, ocho tacos en dos platos para llevar. si, quatro y quatro. quatro de asada y quatro de carnitas. con todo. si, para llevar. yo already hablandoed that." update: i am no longer eating cheap-ass tacos. zdes' net russkoi klaviatury, tak chto, pridetsa popol'zovatsa transliteracijej. schas, posmotrim, chto napisal po-anglijskij, i perevozhu. mm, da. ja javno byl nemnogo nedovolen etim saitom kogda pridumal svoj username. mne nravyatsa sobaki. mne nravitsa snowboarding. ya tol'ko nedavno nachal chasto ezdit' snowboarding. v pozoproshlom sezone gde-to 10-12 dnei posnowboardal. a v proshlom sezone kak-to ne poluchilas' - vsego 4-5 dnei snowboardal. essay1: what's the point of winter? i mean in the mountains. i want to go snowboarding right now. essay2: nothing. i suck at everything. april 11th addendum: apparently my above opinion means i'm actually good at everything, which is what people have been telling me. but they're all incompetent (sp?). http://www.zenspider.com/rwd/thoughts/inept.html i enjoy long walks on the moon with my pet lap giraffe. essay3: is that i'm there. essay4: ich liebe gern nur rammstein. du hast. du hasst mich. du hasst mich! du hast mich gefragt. du hast mich gefragt. du hast mich gefragt, und ich hab nichts gesagt. that song is so deep. 
movies: universal soldier, big momma's house, norbit (or anything else with eddy murphy since about 1995), anything starring kurt russell (except captain ron and big trouble in little china). species viii. jurassic park iv (straight to video). the 6th element (sequel to the 5th element, also straight to video, and starring milla jovovich's little sister, milla jovovich). 7th sense (sequel to 6th sense). ocean's 2.35 x 10^3(starring 800 cgi clones each of brad pittt, georgi cluni, and milla jovovich's little sister, milla jovovich). best movie ever: jordi laforge, the movie. i don't think i can list all my favorites. the list is just going too long. anything with bill pullman, except spaceballs and independence day (wait... was there anything else?). etc.,etc.,et.,ce.t,c.,t music: i can't tell you because if i list anything first it will seem like that's what i probably like most and will seem to define me as a particular character in your pop-culture dictionary. i don't watch t.v. and don't care if you do. i don't feel self-important about it. but it also means i am much less affected by the minute-to-minute cultural updates in terminology and opinions and events that inform the masses. i might not have any idea who your are talking about a lot of the time. i just listen to music that i like. but i like [portishead], [led zeppelin],[bach], blues, blues rock, some jazz, electronic music... whatever is danceable and singable (i.e. more complex than a monotonous repetitive beat). books: haven't read much fiction lately. most fiction these days seems to be a sham. how many books were published last year? compare that to the number of books worth reading that were published in the 20th century. so, generally, if you avoid reading new books you are probably avoiding bad books. there are many proven masterpieces out there. i liked the trial (kafka)a lot. i like russian authors a lot (dead souls is one of the funniest books i've ever read). i love funny books. 
[catch-22], some kurt vonnegut. but lately i have been reading more nonfiction. and i much enjoyed the 10-15 [shakespeare] plays i've seen, though i've only read few. i read [illiad] and [odyssey] last year. they are now two of my favorite books. i tried to read the aeneid, but, i'm sorry, it just sucks. i think i managed to get through about 1/3 of it. the epithets are less prevalent. the repeated phrases, like, "dawn with her rose-red fingers..." that began many verses in homer and helped tie together his story are sorely lacking in virgil's [aeneid]. i thought that because it was translated by the same brilliant man, robert fagles, whose translations of homer i read, it would be just as good. apparently, a book's quality depends on more than just its translator. this is funny to me: . , . : , , - , , ! , ! , , , , . , - , . , ! . - , , , ! , , , , , . ! . , ! , , , , ! 1 , , , , . , , , , , , . , . essay5: 1.) the layer of apprehension that intercedes between reality and the perceiver. 2.) if perveiver isn't a word, then insert whatever word you like. c.) alphabet. essay6: thinking about thinking. essay7: may it never be typical. essay8: is the stuff i wrote in the other sections. isn't that obvious? essay9: you can't find any good reason not to. or ... if you want to help me finish off this johnnie walker, green label. i'm not a big hard alcohol fan (i.e. not an alcoholic), but this stuff is so good. and most times when it comes up, people think i'm making it (green label) up, which i can understand. i used to be really gullible when i was a kid. one kid in my class told me that there was a super mario bros. 34 in japan and described all kinds of awesome things you could do in it. japan is really futuristic, and video games come out there decades before they do here. the wii came out in the 1920's. so, i can understand why people might think i'm pulling their leg(s). but now they can just check the internet and see that i'm right. 
just don't check wikipedia, because i've modified a few articles where i thought fact was less instructive than fiction. never mind, you're too late. i already finished it. ethnicity: asian, middle eastern, black, native american, indian, pacific islander, hispanic / latin, white, other height: 70.0 income: 1000000.0 job: science / tech / engineering last_online: 2012-06-28-20-59 location: san francisco, california offspring: nan orientation: straight pets: likes dogs religion: nan sex: m sign: nan smokes: no speaks: english (fluently), russian (fluently), german (okay), french (okay), spanish (okay) status: single diet_adherence: [No Prefix] diet_type: anything city: san francisco state/country: california offspring_want: nan offspring_attitude: nan religion_type: nan religion_attitude: nan sign_type: nan sign_attitude: nan dog_preference: likes dogs cat_preference: None has_dogs: 0.0 has_cats: 0.0 afrikaans: 0 albanian: 0 ancient greek: 0 arabic: 0 armenian: 0 basque: 0 belarusan: 0 bengali: 0 breton: 0 bulgarian: 0 c++: 0 catalan: 0 cebuano: 0 chechen: 0 chinese: 0 croatian: 0 czech: 0 danish: 0 dutch: 0 english: 1 esperanto: 0 estonian: 0 farsi: 0 finnish: 0 french: 1 frisian: 0 georgian: 0 german: 1 greek: 0 gujarati: 0 hawaiian: 0 hebrew: 0 hindi: 0 hungarian: 0 icelandic: 0 ilongo: 0 indonesian: 0 irish: 0 italian: 0 japanese: 0 khmer: 0 korean: 0 latin: 0 latvian: 0 lisp: 0 lithuanian: 0 malay: 0 maori: 0 mongolian: 0 norwegian: 0 occitan: 0 other: 0 persian: 0 polish: 0 portuguese: 0 romanian: 0 rotuman: 0 russian: 1 sanskrit: 0 sardinian: 0 serbian: 0 sign language: 0 slovak: 0 slovenian: 0 spanish: 1 swahili: 0 swedish: 0 tagalog: 0 tamil: 0 thai: 0 tibetan: 0 turkish: 0 ukrainian: 0 urdu: 0 vietnamese: 0 welsh: 0 yiddish: 0 num_ethnicities: 9.0 optional_%unfilled: 12.5 num_languages: 5
Graduated from space camp, complains about OKCupid picture takedown in essay0, nine ethnicities, five languages, 5' 10" in height. Do you think this income is reliable?
We will remove outliers for age and height, keeping heights between 57 and 80 inches and ages up to 69. Outliers have the potential to greatly increase memory usage, variance, and training time. The current objective is to make a predictive model that works. If that objective is accomplished, we have the option of iterating on this project to incorporate outliers.
df = df[(df['height'] >= 57) & (df['height'] <= 80) & (df['age'] <= 69)]
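The effect of this filter is easy to check on a small frame. The following is a minimal sketch that assumes the same bounds as the line above; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the real dataset
toy = pd.DataFrame({
    "age": [36, 32, 110, 25],
    "height": [1.0, 70.0, 68.0, None],
})

# Same bounds as the filter above: 57 <= height <= 80 and age <= 69
mask = toy["height"].between(57, 80) & (toy["age"] <= 69)
filtered = toy[mask]

# Rows with a missing height fail the comparison and are dropped as well
n_removed = len(toy) - len(filtered)
print(f"Removed {n_removed} of {len(toy)} rows")
```

Note that a missing height never satisfies the comparison, so those profiles are removed together with the outliers.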
from ipywidgets import interact, fixed
import plotly.express as px
def make_treemap(df, feature, dropna):
    # Share of each category in percent (column names follow the older pandas reset_index() layout)
    counts_table = df[feature].value_counts(normalize = True, dropna = dropna).reset_index()
    counts_table.rename(columns={feature: 'Percent'}, inplace = True)
    counts_table['Percent'] = counts_table['Percent'] * 100
    if dropna:
        names = 'index'
    else:
        # NaN labels have to be converted to strings before they can be displayed
        names = counts_table['index'].apply(str)
    fig = px.treemap(
        counts_table,
        parents = [feature] * len(counts_table), #Need to define a tree root
        names = names,
        values = 'Percent',
        title = '2012 OKCupid Profiles',
        color = 'Percent',
        color_continuous_scale='Blues'
    )
    # Show each label with its percentage, rounded to two decimals
    fig.update_traces(marker_showscale=False, texttemplate='%{label}<br>%{value:.2f}%')
    fig.show()
essay_feats = ['essay0','essay1','essay2','essay3','essay4','essay5','essay6','essay7','essay8','essay9',]
numerical_feats = ['age', 'height', 'income']
drop_list = essay_feats + numerical_feats + ['last_online']
interact(make_treemap, df = fixed(df), feature = [each for each in df.columns if each not in drop_list], dropna = False)
An interactive tool for visualizing the categorical features as a treemap. The dropdown selector allows us to choose the feature to display, and the tickbox allows us to choose whether to include NaN values in the treemap.
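The percentage table behind the treemap can be built in a way that does not depend on which column names `reset_index()` produces in a given pandas version. This is a minimal sketch, not the notebook's own code; `percent_table`, the `category` column name, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def percent_table(series, dropna=False):
    """Share of each category in percent; missing values are counted when dropna=False."""
    shares = series.value_counts(normalize=True, dropna=dropna) * 100
    # Name the index and the value column explicitly instead of relying on defaults
    table = shares.rename_axis("category").reset_index(name="Percent")
    # Labels are converted to strings so that a missing value can be displayed too
    table["category"] = table["category"].apply(str)
    return table

toy = pd.Series(["socially", "socially", "often", np.nan], name="drinks")
table = percent_table(toy)
print(table)
```

Naming both columns explicitly keeps the later `names` and `values` arguments stable across pandas releases.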
print('**NaN represents unfilled/unshared optional information, \
signifying that a user has not shared the\n respective information with OKCupid and it\'s missing from their profile')
print('\033[1m'+'*Extrapolated or split from raw data variables'+'\033[0m')
import numpy as np
import matplotlib.pyplot as plt
drop = ['diet', 'diet_adherence', 'location', 'offspring_want', 'offspring_attitude', 'pets', 'religion', 'sign', 'speaks',
'religion_attitude', 'sign_attitude', 'optional_%unfilled',
'last_online'] + languages + ['age', 'height', 'income'] + essay_feats
for_plotting = df.drop(columns = drop).columns.to_list()
extrapolated_vars = ['diet_type', 'city', 'state/country', 'religion_type', 'sign_type',
'dog_preference', 'has_dogs', 'cat_preference', 'has_cats', 'num_ethnicities', 'num_languages']
def collapse_category(feat, threshold):
    '''Needed to ensure that only a mask would be used to mark categories to collapse,
    and to preserve the original data frame'''
    # Total number of rows, including missing values (named total to avoid shadowing the built-in sum)
    total = df[feat].value_counts(dropna = False).reset_index()[feat].sum()
    temp = df[feat].copy()
    # Categories whose share falls below the threshold
    mask1 = temp.value_counts()/total < threshold
    mask2 = temp.isin(mask1[mask1 == True].index)
    temp[mask2] = 'other'
    return temp.value_counts(dropna = False).reset_index()
for i in range(1, 12):
    plt.figure(figsize = (16, 9))
    j = 2*i
    k = j-2
    # Two donut charts per figure
    for num, feat in zip(range(1,3), for_plotting[k:j]):
        plt.subplot(1, 2, num)
        temp = collapse_category(feat, 0.01)
        labels = temp['index']
        theme = plt.get_cmap("tab20" if len(labels) > 10 else "tab10")
        plt.pie(x=temp[feat], autopct="%.1f%%", labels=labels, pctdistance=0.77,
                #radius = 1,
                colors = theme(np.arange(len(labels))),
                wedgeprops=dict(width=0.10,
                                edgecolor="k",
                                linewidth=0.7))
        text = feat + '*' if feat in extrapolated_vars else feat
        plt.text(0, 0, text,
                 horizontalalignment = 'center',
                 verticalalignment = 'center',
                 fontsize = 20)
    plt.show()
    plt.clf()
**NaN represents unfilled/unshared optional information, signifying that a user has not shared the
respective information with OKCupid and it's missing from their profile
*Extrapolated or split from raw data variables
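The pie charts fold every category with a share below 1% into a single 'other' slice. The idea can be sketched on a toy series as follows. This is an illustrative variant rather than the notebook's function: `collapse_rare` and its parameters are hypothetical names, and the share is computed against all rows, missing ones included.

```python
import pandas as pd

def collapse_rare(series, threshold=0.01, other_label="other"):
    """Replace categories whose share of all rows is below threshold."""
    # value_counts() skips NaN, while len(series) counts every row
    shares = series.value_counts() / len(series)
    rare = shares[shares < threshold].index
    # Keep values that are not rare; everything rare becomes other_label
    return series.where(~series.isin(rare), other_label)

toy = pd.Series(["a"] * 98 + ["b"] + ["c"])
counts = collapse_rare(toy, threshold=0.02).value_counts()
print(counts)
```

Working on a returned copy, as here, leaves the original column untouched, which is the same goal the docstring of `collapse_category` states.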
Bias:
The biggest indicator that this sample does not meet the statistical criteria for independent random sampling is the 'state/country' variable. OKCupid was founded in 2004. It's simply not possible that by 2012 all of their users would only be from the state of California. When you get a dataset, regardless of what you've been told about the quality of the sampling, always check for signs of bias. The process of checking for bias in a study or in a sample is sometimes called a "Risk of Bias Assessment".
Even though the sample looks like it's heavily biased, we will still draw inference about what is represented in the sample.
The charts tell us that the typical profile on OKCupid back in 2012 was:
It's possible that people misrepresent themselves on their profile, paint themselves more positively, and carefully omit negative information.
import seaborn as sns
lang_percent = []
# Share of users whose indicator column for the language equals 1
for language in languages:
    lang_percent.append((language, df[language].value_counts(normalize = True).loc[1]))
def sort_func(pair):
    return pair[1]
sorted_lang_percent = sorted(lang_percent, key = sort_func, reverse = True)
#Make barplot
plt.figure(figsize = (16, 9))
rects = sns.barplot(x = [each[1] for each in sorted_lang_percent[:10]],
                    y = [each[0].title() for each in sorted_lang_percent[:10]],
                    orient = 'h',
                    palette = "tab20")
#Add Data Labels outside of bar
rect_labels = []
for rect in rects.patches:
    yloc = rect.get_y() + rect.get_height() / 2
    label = plt.annotate(format(rect.get_width(), '.1%'), xy=(rect.get_width(), yloc), xytext=(4, 0),
                         textcoords="offset points", horizontalalignment='left', verticalalignment='center',
                         color='black', clip_on=True)
    rect_labels.append(label)
plt.title("Top 10 Languages")
plt.ylabel("Language")
plt.xlabel('% of users who speak it'.title())
plt.show()
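Because each language column is a 0/1 indicator, the share of speakers is simply the column mean. The sketch below shows that shortcut on made-up data; the `toy` frame is hypothetical. One practical difference from a `value_counts(...).loc[1]` lookup is that a language nobody speaks yields 0.0 instead of raising a `KeyError`.

```python
import pandas as pd

# Hypothetical indicator columns: 1 means the user lists the language
toy = pd.DataFrame({
    "english": [1, 1, 1, 1],
    "spanish": [1, 0, 0, 1],
    "lisp":    [0, 0, 0, 0],
})

# Mean of a 0/1 column is the fraction of rows equal to 1
share = toy.mean().sort_values(ascending=False)
top = share.head(2)
print(top)
```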
Everyone speaks English. Hilariously some people speak C++... why not Python?
sorted_nulls = sorted(nulls, reverse = True)
sample_size = 59946
#Make barplot
plt.figure(figsize = (16, 9))
rects = sns.barplot(x = [each[0]/sample_size for each in sorted_nulls[:23]],
                    y = [each[1].title() for each in sorted_nulls[:23]],
                    orient = 'h',
                    palette = "tab20")
#Add Data Labels outside of bar
rect_labels = []
for rect in rects.patches:
    yloc = rect.get_y() + rect.get_height() / 2
    label = plt.annotate(format(rect.get_width(), '.2%'), xy=(rect.get_width(), yloc), xytext=(4, 0),
                         textcoords="offset points", horizontalalignment='left', verticalalignment='center',
                         color='black', clip_on=True)
    rect_labels.append(label)
plt.title("Unfilled Optional Fields")
plt.ylabel("Fields")
plt.xlabel('% of users who did not give information'.title())
plt.show()
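The `nulls` list used above is built earlier in the notebook. As a rough sketch of how such shares could be computed directly from a frame, assuming nothing about the original helper, `isna().mean()` gives the fraction of missing values per column. The `toy` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "diet": ["vegan", np.nan, np.nan, np.nan],
    "pets": ["likes dogs", "has cats", np.nan, "likes cats"],
    "sex": ["m", "f", "m", "f"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
# Keep only the fields that have at least one gap
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

Dividing by the length of the current frame, which `mean()` does implicitly, keeps the percentages consistent if rows have been filtered out since the data was loaded.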
Disclaimer: My comments below are pure speculation and hypothesis
Feature | Hypothesis For Not Sharing Information |
---|---|
Income | If you're rich you don't want the IRS to know. If you're poor, you don't want potential matches to see that either |
Children | Finding out someone has kids can be a turn off and perceived as extra baggage |
Diet | People are afraid of being criticized for choosing to shun some foods |
Religion | Some people are xenophobic. As a result, some people hide their religion so as not to immediately turn away those xenophobes |
Pets | No idea why a third of users don't share this information |
Essays | There's a pattern where the higher up the chart you go, the higher the essay number. This is because the questions are presented to the user in a fixed order. Not all users have the patience to answer questions all the way to the last one. The reason why essay 8 stands out significantly more than the rest is because the question being asked is "Share something private", which of course is rather controversial considering you can't take back what you share online. See below section for the essay questions. |
Drugs | Drugs are illegal in some states |
The essay questions are a perfect candidate for Natural Language Processing (NLP) Topic Modelling. More specifically, we'll be using term frequency - inverse document frequency (tf-idf), a weighting scheme which first counts the occurrences of each word in an answer, then deprioritizes words that are common across all answers, such as "the". The expected result is to condense the essay answers into specific keywords, which may be visualized.
Let's create a copy of all of the essay answers, just in case we need to start over in the preprocessing for NLP.
nlp_essay_df = df[essay_feats].copy(deep = True)
Before we can apply tf-idf to our essay answers, we need to convert them to a suitable format. This process is called text preprocessing. The words of each answer will be converted to their root words (also called lemmas).
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from collections import Counter
import unicodedata
import string
# Lemmatizer class object - converts tokens to root words
lemmatizer = WordNetLemmatizer()
# Fetch nltk stopwords (requires the NLTK 'stopwords' data, e.g. via nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')
def preprocess_corpus(DataFrame, Series):
    # Remove accents function (keeps only ASCII letters and spaces)
    def remove_accents(data):
        return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters or x == " ")
    # Look at synonyms of the word if it's likely to be a noun, verb, adjective, or adverb
    def get_part_of_speech(word):
        # Fetch synonyms of word from wordnet database
        probable_part_of_speech = wn.synsets(word)
        # Count the number of nouns, verbs, adjectives, and adverbs among the synonyms
        pos_counts = Counter()
        pos_counts["n"] = len( [ item for item in probable_part_of_speech if item.pos()=="n"] )
        pos_counts["v"] = len( [ item for item in probable_part_of_speech if item.pos()=="v"] )
        pos_counts["a"] = len( [ item for item in probable_part_of_speech if item.pos()=="a"] )
        pos_counts["r"] = len( [ item for item in probable_part_of_speech if item.pos()=="r"] )
        # Get the speech type with the highest count and return it
        most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
        return most_likely_part_of_speech
    # Convert each word into its lemma(root word)
    def preprocess_text(text, min_word_length):
        # Tokenize
        tokens = nltk.word_tokenize(text)
        # Convert to lowercase
        tokens = [token.lower() for token in tokens]
        # Remove accents
        tokens = [remove_accents(token) for token in tokens]
        # Remove punctuations (str.translate needs a table built with str.maketrans)
        tokens = [str(token).translate(str.maketrans('', '', string.punctuation)) for token in tokens]
        tokens = [token for token in tokens if token != '']
        # Remove stopwords
        tokens = [token for token in tokens if token not in stopwords]
        # Remove short words
        tokens = [token for token in tokens if len(token) >= min_word_length]
        # Lemmatize
        lemmas = " ".join([lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokens])
        return lemmas
    # Convert Data Series to a List
    # list_corpus = DataFrame[Series].dropna().to_list()
    list_corpus = DataFrame[Series].to_list()
    # Remove linebreaks
    list_corpus = [" ".join(text.splitlines(False)) if isinstance(text, str) else ' ' for text in list_corpus ]
    # Remove '/' characters
    list_corpus = [text.replace('/', ' ') if isinstance(text, str) else ' ' for text in list_corpus]
    #Normalize
    processed_corpus = [preprocess_text(text, 3) if isinstance(text, str) else ' ' for text in list_corpus]
    return processed_corpus
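Two of the cleaning steps, accent removal and punctuation stripping, can be tried in isolation with only the standard library. The following is a minimal sketch; `strip_accents` and `punctuation_table` are illustrative names, and the sample strings are made up.

```python
import string
import unicodedata

def strip_accents(text):
    """Decompose accented characters and keep only ASCII letters and spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch in string.ascii_letters or ch == " ")

# str.translate expects a translation table, which str.maketrans builds;
# the third argument lists the characters to delete
punctuation_table = str.maketrans("", "", string.punctuation)

cleaned_accents = strip_accents("café olé")
cleaned_punct = "rock'n'roll!".translate(punctuation_table)
print(cleaned_accents)  # cafe ole
print(cleaned_punct)    # rocknroll
```

Since the accent filter keeps only ASCII letters and spaces, it also discards digits and punctuation, so the two steps overlap to a large extent.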
Now that we have our root words, let's apply tf-idf. It will assign a score to each root word. For each user's answer we will get the highest scoring word (keyword). This keyword represents what our model believes is the most significant word of a user's response.
from sklearn.feature_extraction.text import TfidfVectorizer
# Tfidf class object - Scores each word
vectorizer = TfidfVectorizer()
def tfidf_for_keywords(processed_corpus):
    '''Generates one keyword for each document in the corpus.
    Returns the keywords, the tf-idf score matrix, and the feature names.'''
    tfidf_scores = vectorizer.fit_transform(processed_corpus)
    # On scikit-learn 1.2 and later, use get_feature_names_out() instead
    feature_names = vectorizer.get_feature_names()
    # The keyword of a document is the term with the highest score in its row
    keywords = [feature_names[row.argmax()] for row in tfidf_scores]
    return keywords, tfidf_scores, feature_names
Let's iterate the whole process over all of our essay features, and let's collect only the most common keywords.
essay_vectors = []
essay_feature_names = []
def series_to_keywords(DataFrame, Series, n_top, export_vectors = False):
    processed_corpus = preprocess_corpus(DataFrame, Series)
    keywords, tfidf_scores, feature_names = tfidf_for_keywords(processed_corpus)
    essay_vectors.append(tfidf_scores)
    essay_feature_names.append(feature_names)
    # temp = DataFrame[Series].dropna().reset_index()
    temp = DataFrame[Series].reset_index()
    temp[str(Series) + '_keywords'] = keywords
    if export_vectors:
        # Save the n_top most frequent keywords for this essay
        temp[str(Series) + '_keywords'].value_counts().iloc[:n_top].to_csv(str(Series) + '_top_keys.csv')
# Training NLP Model commented out due to long training time
# for feature in essay_feats:
#     series_to_keywords(nlp_essay_df, feature, 40)
Let's generate wordclouds for each of the essay questions.
Developer Note: The code is embedded as an image because it was run in a separate Python 3.7 environment.
The level of insight gained from each wordcloud, using the existing model algorithm, is not the same. Some give sufficiently interesting and useful results. Some tell us more about what essay question is being asked rather than what the answers to those questions are. The results merit further filtering, tweaking, and refinement of the algorithm to give us better keywords. Further NLP modelling and analysis deserves to be discussed in its own lengthy separate article/notebook. We will not explore it further here.
In addition to the wordclouds, manual reading of several user responses was done to better interpret the tf-idf results.
Essay Number | Personal best guess of the question asked | Comment on tf-idf Result |
---|---|---|
essay 0 | Describe yourself | Users use this essay question to talk about what they like, what they love, and the qualities of the person they're looking for |
essay 1 | What are you currently doing? | A lot of the smaller words are clear answers to the question while the biggest words are a little bit more difficult to interpret or could be weighted to have a lower score |
essay 2 | What are you good at? | Users say they're good at listening and that they have a great smile and laugh |
essay 3 | Describe yourself physically | Users talk about their eyes, hair, smile, and height |
essay 4 | What are your favorite books, movies, tv shows, music, food, etc | Not a very meaningful result. More useful for guessing the essay question. |
essay 5 | You can't live without... | Users cannot live without their cellphone, money, gym, job, god, sports, fun. Not sure though what it means that 'good' is the top keyword |
essay 6 | What do you think about? | Users think about life and the future |
essay 7 | What is a typical Friday night for you? | Users are with their friends on Friday nights |
essay 8 | Share something private | Users say 'message/ask me about private things and I'll share it with you, but I won't share it here on my public profile'. From the previous section, we can also see that this is the least answered essay question. |
essay 9 | You would like me if... | Doesn't seem like a very meaningful result |
There is still additional NLP analysis that we will not explore here but can definitely be applied. For example, we can see in essay0 the top keywords are 'love' and 'like'. What do users really mean when they use the word 'love'? Is it 'making love' or 'looking for love'? Are they using 'love' and 'like' interchangeably?
Additional NLP:
In this section, we visualized the following:
From our Numerical Features we learned:
Feature | Comment |
---|---|
Age | The median age is 30. The distribution is right-skewed. Most users are young and working adults. |
Height | Most heights range between 59 and 78 in (4.9 - 6.5 ft) with a median of 68 in (5.67 ft). The distribution appears normal. |
Income | The median income is $50k. A quarter of all incomes are $25k. Most income distributions are log-normal (if the ultra-wealthy are not included), and this is no different. |
optional_%unfilled | Half of all users don't bother to fill 12% - 30% of optional fields. A quarter of users are below that range and a quarter of users are above that range. |
From our Categorical Features we learned the stereotypical profile is:
We also discovered that our sample has a high risk of bias. Although OKCupid was founded in 2004, in our sample, which dates from 2012, practically everyone lives in California, and about half of those users live in the city of San Francisco.
With the use of bar plots we learned:
From our Essay Features we learned:
Essay Number | Personal best guess of the question asked | Comment on tf-idf Result |
---|---|---|
essay 0 | Describe yourself | Users use this essay question to talk about what they like, what they love, and the qualities of the person they're looking for |
essay 1 | What are you currently doing? | A lot of the smaller words are clear answers to the question while the biggest words are a little bit more difficult to interpret or could be weighted to have a lower score |
essay 2 | What are you good at? | Users say they're good at listening and that they have a great smile and laugh |
essay 3 | Describe yourself physically | Users talk about their eyes, hair, smile, and height |
essay 4 | What are your favorite books, movies, tv shows, music, food, etc | Not a very meaningful result. More useful for guessing the essay question. |
essay 5 | You can't live without... | Users cannot live without their cellphone, money, gym, job, god, sports, fun. Not sure though what it means that 'good' is the top keyword |
essay 6 | What do you think about? | Users think about life and the future |
essay 7 | What is a typical Friday night for you? | Users are with their friends on Friday nights |
essay 8 | Share something private | Users say 'message/ask me about private things and I'll share it with you, but I won't share it here on my public profile'. From the previous section, we can also see that this is the least answered essay question. |
essay 9 | You would like me if... | Doesn't seem like a very meaningful result |
In this section, we made the decision to strip outliers of age and height:
Outliers have the potential to increase memory usage, variance, and training time. The current objective is to make a predictive model that works. If that objective is accomplished, we have the option of iterating on this project to incorporate outliers.
We need to do even more feature engineering before we go into Machine Learning. All our features have to be properly formatted and expanded/encoded. Any value that is not a number will not be understood by our ML training algorithm.
We will make a backup copy of our data at this point, so that we can reset back to this checkpoint if we want to undo any modifications. Let's also drop our languages sparse matrix. Previously we didn't include the fluency descriptors in our sparse matrix for our visualization. Later, when one-hot encoding, we will make a new sparse matrix for languages which contains the fluency descriptors.
expanded_df_backup = df.drop(columns = languages).copy(deep = True)
df = expanded_df_backup.copy(deep=True)
Let's convert last_online from a string to a datetime format and split it up.
df['last_online'] = pd.to_datetime(df.last_online, format = "%Y-%m-%d-%H-%M")
df['last_online_year'] = df.last_online.dt.year
df['last_online_month'] = df.last_online.dt.month
df['last_online_day'] = df.last_online.dt.day
df['last_online_hour'] = df.last_online.dt.hour
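The parsing step can be verified on a couple of timestamps in the same `YYYY-MM-DD-HH-MM` layout. This is a small sketch on made-up values; the `toy` frame and the `parts` column names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"last_online": ["2012-06-28-14-18", "2011-12-31-23-59"]})

# An explicit format avoids guessing and fails loudly on malformed strings
parsed = pd.to_datetime(toy["last_online"], format="%Y-%m-%d-%H-%M")

# The .dt accessor exposes the individual components
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "hour": parsed.dt.hour,
})
print(parts)
```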
Let's drop the columns that we have created splits from. If we need to use them as target variables for prediction, we can recover any of them from our backup dataframe. Let's drop income because 80% of its values are missing. Let's also drop num_ethnicities and optional_%unfilled.
df = df.drop(columns = ['diet', 'location', 'offspring', 'religion', 'sign', 'pets', 'income', 'last_online',
'num_ethnicities', 'optional_%unfilled'])
Let's apply one-hot encoding to our categorical variables. We will also encode a category to represent the null values of each feature, so that our ML model will include the user's decisions to share particular information when making predictions.
print('Before one-hot encoding:')
df.info()
Before one-hot encoding: <class 'pandas.core.frame.DataFrame'> Int64Index: 59811 entries, 0 to 59945 Data columns (total 42 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 59811 non-null int64 1 body_type 54538 non-null object 2 drinks 56846 non-null object 3 drugs 45755 non-null object 4 education 53211 non-null object 5 essay0 54351 non-null object 6 essay1 52275 non-null object 7 essay2 50210 non-null object 8 essay3 48375 non-null object 9 essay4 49318 non-null object 10 essay5 49010 non-null object 11 essay6 46085 non-null object 12 essay7 47404 non-null object 13 essay8 40636 non-null object 14 essay9 47245 non-null object 15 ethnicity 54159 non-null object 16 height 59811 non-null float64 17 job 51641 non-null object 18 orientation 59811 non-null object 19 sex 59811 non-null object 20 smokes 54320 non-null object 21 speaks 59761 non-null object 22 status 59811 non-null object 23 diet_adherence 35481 non-null object 24 diet_type 35481 non-null object 25 city 59811 non-null object 26 state/country 59811 non-null object 27 offspring_want 24334 non-null object 28 offspring_attitude 9711 non-null object 29 religion_type 39631 non-null object 30 religion_attitude 39631 non-null object 31 sign_type 48787 non-null object 32 sign_attitude 48787 non-null object 33 dog_preference 28880 non-null object 34 cat_preference 21293 non-null object 35 has_dogs 39931 non-null float64 36 has_cats 39931 non-null float64 37 num_languages 59811 non-null int64 38 last_online_year 59811 non-null int64 39 last_online_month 59811 non-null int64 40 last_online_day 59811 non-null int64 41 last_online_hour 59811 non-null int64 dtypes: float64(3), int64(6), object(33) memory usage: 19.6+ MB
df = pd.get_dummies(df, columns = ['body_type', 'drinks', 'drugs', 'education', 'job', 'orientation', 'sex',
'smokes', 'status', 'diet_adherence', 'diet_type', 'city', 'state/country',
'offspring_want', 'offspring_attitude', 'religion_type', 'religion_attitude',
'sign_type', 'sign_attitude', 'dog_preference', 'cat_preference', 'has_dogs',
'has_cats',],
sparse=True,
dummy_na=True)
df.rename(columns = {'sex_m':'sex'}, inplace = True)
df = df.drop(columns = ['sex_f'])
def string_dummies(df, series, sep):
    # Flag missing values explicitly before splitting the multi-valued strings
    if df[series].isna().any():
        df[series + '_nan'] = df[series].isna().astype('int64')
    temp_df = df[series].str.get_dummies(sep).add_prefix(str(series) + '_')
    df = pd.concat([df, temp_df], axis = 1)
    return df.drop(columns = [str(series)])
selection = ['ethnicity', 'speaks']
sep_list = [', ', ', ']
for series, sep in zip(selection, sep_list):
    df = string_dummies(df, series, sep)
print('After one-hot encoding:')
df.info()
After one-hot encoding: <class 'pandas.core.frame.DataFrame'> Int64Index: 59811 entries, 0 to 59945 Columns: 737 entries, age to speaks_yiddish (poorly) dtypes: Sparse[uint8, 0](407), float64(1), int64(319), object(10) memory usage: 157.5+ MB
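To make the two encoding paths above concrete, here is a minimal sketch on a hypothetical three-row frame (the values are invented for illustration and are not taken from the OKCupid data). It shows how `dummy_na=True` adds an explicit missing-value indicator for a single-valued column, and how `str.get_dummies` splits a multi-valued column on its separator.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "drinks": ["socially", None, "often"],
    "speaks": ["english, spanish", "english", None],
})

# Single-valued column: dummy_na=True adds an explicit "drinks_nan" indicator
single = pd.get_dummies(toy[["drinks"]], columns=["drinks"], dummy_na=True)

# Multi-valued column: split on the separator, one indicator per distinct value
multi = toy["speaks"].str.get_dummies(", ").add_prefix("speaks_")
# Missing rows come out as all zeros, so the null indicator is added by hand
multi["speaks_nan"] = toy["speaks"].isna().astype("int64")

print(single.columns.to_list())
print(multi.columns.to_list())
```

The manual `_nan` column in the second half mirrors what `string_dummies` does, since `str.get_dummies` has no `dummy_na` option of its own.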
Let's give our dataframe uniform datatypes. Ignore the datatype of the essays for now (the ten object columns).
df['height'] = df.height.astype('int64')
for each in df.select_dtypes(include=['int64']).columns.to_list():
    df[each] = pd.arrays.SparseArray(df[each].values, dtype='uint8')
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 59811 entries, 0 to 59945 Columns: 737 entries, age to speaks_yiddish (poorly) dtypes: Sparse[uint8, 0](727), object(10) memory usage: 14.3+ MB
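The drop in memory usage comes from storing only the non-zero entries of each indicator column. The sketch below illustrates this on a hypothetical column of 100,000 rows with roughly 1% ones; the size and density are assumptions chosen for illustration, not properties of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator column: 100,000 rows, roughly 1% ones
rng = np.random.default_rng(0)
dense = pd.Series((rng.random(100_000) < 0.01).astype("int64"))

# Same values stored sparsely as uint8; zeros (the fill value) are not stored
sparse = pd.Series(pd.arrays.SparseArray(dense.to_numpy(), dtype="uint8"))

print(dense.memory_usage(index=False), "bytes dense")
print(sparse.memory_usage(index=False), "bytes sparse")
print("density:", sparse.sparse.density)
```

One caveat worth checking before casting: `uint8` can only hold values from 0 to 255. That is safe for indicators, ages, and heights in inches, but larger values such as a four-digit year would not fit.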
We are now ready to move on to Machine Learning.
In this Section we accomplished the following:
We will evaluate and compare two different Machine Learning Models to predict gender.
Now is a good time to back up the dataframe.
sparse_essay_df = df.copy(deep = True)
sparse_essay_df.to_csv('ml_ready_data.csv')
def feature_selection_to_list(df, cat_selection, numeric_selection):
    # Collect every one-hot column whose name starts with a selected prefix
    categorical_feats = []
    for each in cat_selection:
        categorical_feats = categorical_feats + df.loc[:, df.columns.str.startswith(each)].columns.to_list()
    return categorical_feats + numeric_selection
from sklearn.linear_model import LogisticRegression
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import plotly.express as px
import plotly.graph_objects as go
def model_with_logistic_regression(feature_selection, target_selection, predictor_legend, use_csr_feats = False,
                                   swap_legend_colors = False):
    # Select predictors and target
    predictors = sparse_essay_df[feature_selection]
    feature_names = predictors.columns
    if use_csr_feats:
        # Keep the predictors sparse by converting them to a SciPy CSR matrix
        predictors = predictors.sparse.to_coo().tocsr()
    else:
        predictors = predictors.sparse.to_dense()
    target = sparse_essay_df[target_selection]
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.2)
    # Scale the feature data so it has mean = 0 and standard deviation = 1
    # (centering is skipped for CSR input, which StandardScaler cannot center)
    scaler = StandardScaler(with_mean = not use_csr_feats)
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Create and train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    print('Train score: ' + str(model.score(X_train, y_train)))
    print('Test score: ' + str(model.score(X_test, y_test)))
    # Analyze the predictor coefficients
    if swap_legend_colors:
        colors = [predictor_legend[1] if c > 0 else predictor_legend[0] for c in model.coef_[0]]
    else:
        colors = [predictor_legend[0] if c > 0 else predictor_legend[1] for c in model.coef_[0]]
    fig = px.bar(
        y=feature_names,
        x=abs(model.coef_[0]),
        color=colors,
        color_discrete_sequence=['red', 'blue'],
        labels=dict(x='Absolute linear coefficient', y='Predictor'),
        title='Top 20 predictors',
    )
    fig.update_yaxes(categoryorder = "total ascending", range=(len(feature_names) - 20.6, len(feature_names)))
    fig.show()
cat_selection = ['body_type', 'drinks', 'drugs', 'education', 'job', 'orientation',
'smokes', 'status', 'diet_adherence', 'diet_type', 'last_online',
'offspring_want', 'offspring_attitude', 'religion_type', 'religion_attitude',
'sign_type', 'sign_attitude', 'dog_preference', 'cat_preference', 'has_dogs',
'has_cats', 'ethnicity', 'speaks']
numeric_selection = ['age', 'height']
feature_selection = feature_selection_to_list(sparse_essay_df, cat_selection, numeric_selection)
predictor_legend = ['Male Predictor', 'Female Predictor']
model_with_logistic_regression(feature_selection, target_selection = 'sex', use_csr_feats = False, predictor_legend = predictor_legend)
Train score: 0.893976759739174 Test score: 0.8878207807406169
Height and body_type_curvy are our top predictors. This is probably because men are taller than women on average, and because women are far more likely than men to describe themselves as curvy.
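The way a coefficient's sign and size are read can be illustrated on synthetic data. The sketch below is not the model above: the data is generated under assumed distributions (class 1 is taller on average, class 0 is more often flagged "curvy", and a third feature is pure noise), purely to show that after standardization a positive coefficient pushes predictions towards class 1 and a negative one towards class 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# Hypothetical synthetic data: class 1 tends to be taller, class 0 more often "curvy"
y = rng.integers(0, 2, size=n)
height = rng.normal(65 + 5 * y, 3)
curvy = (rng.random(n) < np.where(y == 1, 0.02, 0.25)).astype(float)
noise = rng.normal(size=n)  # feature unrelated to the label

# Standardize so coefficient magnitudes are comparable across features
X = StandardScaler().fit_transform(np.column_stack([height, curvy, noise]))
clf = LogisticRegression().fit(X, y)

for name, coef in zip(["height", "curvy", "noise"], clf.coef_[0]):
    direction = "class 1" if coef > 0 else "class 0"
    print(f"{name:>6}: {coef:+.2f} (pushes towards {direction})")
```

Because every feature is scaled to unit variance first, comparing absolute coefficient values is a reasonable, if rough, way to rank predictors.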
AI Ethics: A model like this highlights the ethical considerations we must weigh when developing AI. For example, one of the predictors is job_computer / hardware / software. This could be misused to discriminate by gender, for instance by claiming that one gender is not fit for a technical computer job. Never be hasty to deploy a model that will be used on people.
from sklearn.tree import DecisionTreeClassifier
dec_tree_df = sparse_essay_df.copy()
cat_selection = ['body_type', 'drinks', 'drugs', 'education', 'orientation',
'smokes', 'status', 'diet_adherence', 'diet_type', 'job', 'last_online',
'offspring_want', 'offspring_attitude', 'religion_attitude', 'religion_type',
'sign_type', 'sign_attitude', 'dog_preference', 'cat_preference', 'has_dogs',
'has_cats', 'ethnicity', 'speaks']
numeric_selection = ['age', 'height']
feature_selection = feature_selection_to_list(dec_tree_df, cat_selection, numeric_selection)
target = dec_tree_df['sex']
predictors = dec_tree_df[feature_selection]
train_data, test_data, train_labels, test_labels = train_test_split(predictors, target)
scores = []
for i in range(1, 18):
    tree = DecisionTreeClassifier(max_depth = i)
    tree.fit(train_data, train_labels)
    score = tree.score(test_data, test_labels)
    scores.append(score)
plt.plot(range(1, 18), scores, marker='o')
plt.xlabel('Tree Depth')
plt.ylabel('Test Accuracy')
plt.show()
# scores[0] belongs to depth 1, so add 1 to convert the list index into a tree depth
print("The best result is attained when the Decision Tree Model is stopped at a tree depth of {}, providing a test accuracy of {:.2f}%".format(scores.index(max(scores)) + 1, max(scores)*100))
The best result is attained when the Decision Tree Model is stopped at a tree depth of 7, providing a test accuracy of 86.38%
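When sweeping over tree depths, it is easy to confuse a position in the score list with the depth it belongs to. The following sketch, run on synthetic data from `make_classification` (an assumption for illustration, not the dating data), keeps an explicit `depths` list so that the best score can be mapped back to its depth directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

depths = list(range(1, 18))
scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores.append(tree.score(X_test, y_test))

# Look the depth up in the same list that drove the loop, rather than using the raw index
best_depth = depths[scores.index(max(scores))]
print("Best depth:", best_depth, "accuracy:", max(scores))
```

Note that picking the depth on the test set makes the reported test accuracy slightly optimistic; a separate validation split or cross-validation would give a cleaner estimate.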
In this section we accomplished the following:
Clustering is a technique that groups similar data points together. Let's use it to group similar people together and recommend who you should date. The people in the same cluster as you are the people we will recommend.
We start by reviewing the features we have on hand. Which, among the features, would you want your date to have in common with you?
expanded_df_backup.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 59811 entries, 0 to 59945 Data columns (total 48 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 59811 non-null int64 1 body_type 54538 non-null object 2 diet 35481 non-null object 3 drinks 56846 non-null object 4 drugs 45755 non-null object 5 education 53211 non-null object 6 essay0 54351 non-null object 7 essay1 52275 non-null object 8 essay2 50210 non-null object 9 essay3 48375 non-null object 10 essay4 49318 non-null object 11 essay5 49010 non-null object 12 essay6 46085 non-null object 13 essay7 47404 non-null object 14 essay8 40636 non-null object 15 essay9 47245 non-null object 16 ethnicity 54159 non-null object 17 height 59811 non-null float64 18 income 11456 non-null float64 19 job 51641 non-null object 20 last_online 59811 non-null object 21 location 59811 non-null object 22 offspring 24334 non-null object 23 orientation 59811 non-null object 24 pets 39931 non-null object 25 religion 39631 non-null object 26 sex 59811 non-null object 27 sign 48787 non-null object 28 smokes 54320 non-null object 29 speaks 59761 non-null object 30 status 59811 non-null object 31 diet_adherence 35481 non-null object 32 diet_type 35481 non-null object 33 city 59811 non-null object 34 state/country 59811 non-null object 35 offspring_want 24334 non-null object 36 offspring_attitude 9711 non-null object 37 religion_type 39631 non-null object 38 religion_attitude 39631 non-null object 39 sign_type 48787 non-null object 40 sign_attitude 48787 non-null object 41 dog_preference 28880 non-null object 42 cat_preference 21293 non-null object 43 has_dogs 39931 non-null float64 44 has_cats 39931 non-null float64 45 num_ethnicities 54159 non-null float64 46 optional_%unfilled 59811 non-null float64 47 num_languages 59811 non-null int64 dtypes: float64(6), int64(2), object(40) memory usage: 22.4+ MB
After asking someone which features they would want to have in common with them, the chosen features are:
Let's isolate that subset of features.
clustering_df = sparse_essay_df.copy()
cat_selection = ['drugs', 'orientation',
'diet_adherence', 'diet_type','religion_attitude', 'dog_preference', 'cat_preference', 'has_dogs',
'has_cats']
numeric_selection = []
feature_selection = feature_selection_to_list(clustering_df, cat_selection, numeric_selection)
clustering_df = clustering_df[feature_selection]
#Conversion to Scipy csr_matrix
clustering_coo = clustering_df.sparse.to_coo()
clustering_csr = clustering_coo.tocsr()
clustering_csr.get_shape()
(59811, 37)
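As a small illustration of the conversion above, the sketch below builds a hypothetical four-row frame of sparse indicator columns (the column names are made up) and converts it to a SciPy CSR matrix. Note that `DataFrame.sparse.to_coo()` expects every column to have a sparse dtype.

```python
import pandas as pd

# Hypothetical sparse indicator columns, invented for illustration
toy = pd.DataFrame({
    "has_dogs_1.0": pd.arrays.SparseArray([1, 0, 0, 1], dtype="uint8"),
    "has_cats_1.0": pd.arrays.SparseArray([0, 0, 1, 0], dtype="uint8"),
})

coo = toy.sparse.to_coo()  # coordinate format: (row, column, value) triplets
csr = coo.tocsr()          # compressed sparse row: efficient row slicing and arithmetic

print(csr.shape, "with", csr.nnz, "stored values")
print(csr.toarray())
```

CSR is a common choice for feeding sparse data to scikit-learn estimators because row access is cheap.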
Now that we have our subset, let's search for the best number of clusters (k) to use in our model.
#Recommended setting for training Kmeans on Windows
import os
os.environ["OMP_NUM_THREADS"] = "1"
from sklearn.cluster import KMeans
execute = False
if execute:
    num_clusters = list(range(1,200))
    inertias= []
    for i in num_clusters:
        model = KMeans(n_clusters = i)
        model.fit(clustering_csr)
        inertias.append(model.inertia_)
    plt.plot(num_clusters, inertias, '-o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.show()
The Inertia vs. k graph below took 5 hours to produce (hence the default setting above, which skips execution of the code snippet). Inertia is the sum of squared distances from each point to the centroid of its cluster, so it measures how spread out the points of each cluster are around their centroid. A line has been drawn over the tail end of the graph so that we can clearly mark where the graph becomes linear. The point where the graph becomes linear is known as the elbow point, and it is the number of clusters that we should use.
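The elbow idea can be reproduced quickly on synthetic data. The sketch below uses `make_blobs` with four well-separated groups (an assumption for illustration; the real data has no known cluster count), checks by hand that inertia equals the sum of squared distances to the assigned centroids, and prints how much inertia drops at each step of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups, for illustration only
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia is the sum of squared distances from each point to its assigned centroid
km4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
manual = ((X - km4.cluster_centers_[km4.labels_]) ** 2).sum()
print("manual matches inertia_:", np.isclose(manual, km4.inertia_))

# Drop in inertia when moving from k-1 to k clusters; it flattens out after the elbow
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

On data like this the drops shrink sharply once k passes the true number of groups, which is the flattening that the elbow method looks for.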
k = 100 is where the linearity begins. Let's create the model we will use for clustering with k = 100.
model = KMeans(n_clusters = 100)
model.fit(clustering_csr)
labels = model.predict(clustering_csr)
Let's investigate the uniformity of our cluster distribution.
ser_labels = pd.Series(labels)
ser_labels_props = ser_labels.value_counts()
plt.figure(figsize = (16,9))
plt.pie(x=ser_labels_props.values, labels = ser_labels_props.index,
wedgeprops=dict(width=0.10,
edgecolor="k",
linewidth=0.7))
plt.text(0, 0, 'Cluster Distribution',
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 20)
plt.show()
The users are somewhat evenly distributed across the clusters.
Let's discover who you can date! Select from the dropdown lists to enter your information. Click the 'Run Interact' button to predict your cluster number and display the profile of a potential date. Your cluster contains users that are similar to you - someone you might want to date! Click Run to see another random profile from your cluster.
from ipywidgets import interact_manual
import random
def make_profile(sex, drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats, show_same_gender):
    user_info = [drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats]
    # Build a one-hot row with the same columns as the clustering data
    user_dict = {}
    for each in clustering_df.columns.to_list():
        user_dict[each] = [0]
    for each in user_info:
        user_dict[each] = [1]
    user_df = pd.DataFrame(data=user_dict, index = ['You'])
    sparse_user_df = user_df.copy()
    for each in sparse_user_df.columns.to_list():
        sparse_user_df[each] = pd.arrays.SparseArray(sparse_user_df[each].values, dtype='uint8')
    sparse_user_df = sparse_user_df.sparse.to_coo()
    sparse_user_df = sparse_user_df.tocsr()
    user_cluster_label = model.predict(sparse_user_df)
    user_cluster = np.where(labels == user_cluster_label)[0]
    print("Your cluster is {}.\nThere are {} users in your cluster.".format(user_cluster_label, len(user_cluster)))
    if show_same_gender == False:
        user_sex_dict = {'Male':'m', "Female":'f'}
        user_sex = user_sex_dict[sex]
        #Look for an opposite gender
        while user_sex == user_sex_dict[sex]:
            # randrange excludes the upper bound, so the index is always valid
            rand_index = random.randrange(len(user_cluster))
            rand_user = user_cluster[rand_index]
            user_sex = expanded_df_backup.iloc[rand_user]['sex']
    else:
        rand_index = random.randrange(len(user_cluster))
        rand_user = user_cluster[rand_index]
    print("Is this person dateable? Press Run again to see a new recommendation.\n")
    show_user_data(expanded_df_backup.iloc[:,0:31], rand_user)
interact_manual(
make_profile,
sex = ['Male', 'Female'],
drugs = feature_selection_to_list(clustering_df, ['drugs'], []),
orientation = feature_selection_to_list(clustering_df, ['orientation'], []),
diet_adherence = feature_selection_to_list(clustering_df, ['diet_adherence'], []),
diet_type = feature_selection_to_list(clustering_df, ['diet_type'], []),
religion_attitude = feature_selection_to_list(clustering_df, ['religion_attitude'], []),
dog_preference = feature_selection_to_list(clustering_df, ['dog_preference'], []),
cat_preference = feature_selection_to_list(clustering_df, ['cat_preference'], []),
has_dogs = feature_selection_to_list(clustering_df, ['has_dogs'], []),
has_cats = feature_selection_to_list(clustering_df, ['has_cats'], []),
show_same_gender = True
)
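The recommendation step boils down to three operations: encode the new user as a one-hot row with the same columns as the training data, predict that row's cluster, and draw a random member of that cluster. The sketch below shows this on a hypothetical four-column dataset; the column names, the 200 random rows, and k = 4 are all assumptions for illustration and do not come from the model above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical one-hot columns, invented for illustration
columns = ["drugs_never", "drugs_sometimes", "has_dogs_0.0", "has_dogs_1.0"]

# 200 random users, each with exactly one "drugs" and one "has_dogs" option set
rng = np.random.default_rng(0)
data = np.zeros((200, len(columns)), dtype="uint8")
data[np.arange(200), rng.integers(0, 2, size=200)] = 1
data[np.arange(200), 2 + rng.integers(0, 2, size=200)] = 1
train = pd.DataFrame(data, columns=columns)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train)
labels = model.labels_

# Encode the new user with the same columns, in the same order, as the training data
choices = ["drugs_never", "has_dogs_1.0"]
user = pd.DataFrame(0, index=["You"], columns=columns)
user.loc["You", choices] = 1

user_label = model.predict(user)[0]
members = np.flatnonzero(labels == user_label)  # row positions of users in the same cluster
pick = rng.choice(members)                      # uniform random member, always a valid position

print("Cluster", user_label, "has", len(members), "members; recommending row", pick)
```

Drawing with `rng.choice` (or `random.randrange`) avoids an off-by-one pitfall: `random.randint(a, b)` includes `b`, so `random.randint(0, len(members))` can return an index one past the end.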
In this section, we accomplished the following: