Analysis and visualization of a public OKCupid profile dataset using python and pandas

Author: Alessandro Giusti (web, email), Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI.

Discussion

After publication, this notebook received quite a lot of insightful comments on /r/python (link to post) and /r/okcupid (link to post).

Introduction

This document is an analysis of a public dataset of almost 60000 online dating profiles. The dataset has been published in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al., and its collection and distribution was explicitly allowed by OkCupid president and co-founder Christian Rudder. Using these data is therefore ethically and legally acceptable; this is in contrast to another recent release of a different OkCupid profile dataset, which was collected without permission and without anonymizing the data (more on the ethical issues in this Wired article).

Notebook setup

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='svg'
from IPython.display import display,HTML
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from prettypandas import PrettyPandas
sns.set_style("ticks")
sns.set_context(context="notebook",font_scale=1)

Dataset details

The data is available at this link. The codebook includes many details about the available fields. The dataset was collected by web scraping the OKCupid.com website on 2012/06/30, and includes almost 60k profiles of people within a 25 mile radius of San Francisco, who were online in the previous year (after 06/30/2011), with at least one profile picture.

The CSV contains a row (observation) for each profile. Let's have a look at the first 10 profiles, excluding the columns whose name contains the string "essay", which contain a lot of text and are not practical at the moment.

In [2]:
d=pd.read_csv("profiles.csv/profiles.csv")
print("The dataset contains {} records".format(len(d)))

m=d[d["sex"]=="m"] # male users
f=d[d["sex"]=="f"] # female users
print("{} males ({:.1%}), {} females ({:.1%})".format(
    len(m),len(m)/len(d),
    len(f),len(f)/len(d)))

PrettyPandas(d                                   # Prettyprints pandas dataframes
    .head(10)                                    # Sample the first 10 rows
    [[c for c in d.columns if "essay" not in c]] # Ignore columns with "essay" in the name (they are long)
)
The dataset contains 59946 records
35829 males (59.8%), 24117 females (40.2%)
Out[2]:
age body_type diet drinks drugs education ethnicity height income job last_online location offspring orientation pets religion sex sign smokes speaks status
0 22 a little extra strictly anything socially never working on college/university asian, white 75 -1 transportation 2012-06-28-20-30 south san francisco, california doesn’t have kids, but might want them straight likes dogs and likes cats agnosticism and very serious about it m gemini sometimes english single
1 35 average mostly other often sometimes working on space camp white 70 80000 hospitality / travel 2012-06-29-21-41 oakland, california doesn’t have kids, but might want them straight likes dogs and likes cats agnosticism but not too serious about it m cancer no english (fluently), spanish (poorly), french (poorly) single
2 38 thin anything socially nan graduated from masters program nan 68 -1 nan 2012-06-27-09-10 san francisco, california nan straight has cats nan m pisces but it doesn’t matter no english, french, c++ available
3 23 thin vegetarian socially nan working on college/university white 71 20000 student 2012-06-28-14-22 berkeley, california doesn’t want kids straight likes cats nan m pisces no english, german (poorly) single
4 29 athletic nan socially never graduated from college/university asian, black, other 66 -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california nan straight likes dogs and likes cats nan m aquarius no english single
5 29 average mostly anything socially nan graduated from college/university white 67 -1 computer / hardware / software 2012-06-29-19-18 san francisco, california doesn’t have kids, but might want them straight likes cats atheism m taurus no english (fluently), chinese (okay) single
6 32 fit strictly anything socially never graduated from college/university white, other 65 -1 nan 2012-06-25-20-45 san francisco, california nan straight likes dogs and likes cats nan f virgo nan english single
7 31 average mostly anything socially never graduated from college/university white 65 -1 artistic / musical / writer 2012-06-29-12-30 san francisco, california doesn’t have kids, but wants them straight likes dogs and likes cats christianity f sagittarius no english, spanish (okay) single
8 24 nan strictly anything socially nan graduated from college/university white 67 -1 nan 2012-06-29-23-39 belvedere tiburon, california doesn’t have kids straight likes dogs and likes cats christianity but not too serious about it f gemini but it doesn’t matter when drinking english single
9 37 athletic mostly anything not at all never working on two-year college white 65 -1 student 2012-06-28-21-08 san mateo, california nan straight likes dogs and likes cats atheism and laughing about it m cancer but it doesn’t matter no english (fluently) single

The dataset contains many interesting information about the individual users, but this is just a fraction of the data that OKCupid has collected. The OKCupid data team reports in the OKTrends blog insightful (and entertaining) studies which make use of additional information such as for example:

Age distribution

In [3]:
print("Age statistics:\n{}".format(d["age"].describe()))
print()
print("There are {} users older than 80".format((d["age"]>80).sum()))
Age statistics:
count    59946.000000
mean        32.340290
std          9.452779
min         18.000000
25%         26.000000
50%         30.000000
75%         37.000000
max        110.000000
Name: age, dtype: float64

There are 2 users older than 80

Find the age outliers

Apparently we have one 110-year-old user, and only another one over-80. They might be outliers, let's have a look at their data.

In [4]:
PrettyPandas(d[d["age"]>80])
Out[4]:
age body_type diet drinks drugs education essay0 essay1 essay2 essay3 essay4 essay5 essay6 essay7 essay8 essay9 ethnicity height income job last_online location offspring orientation pets religion sex sign smokes speaks status
2512 110 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 67 -1 nan 2012-06-27-22-16 daly city, california nan straight nan nan f nan nan english single
25324 109 athletic mostly other nan never working on masters program nan nan nan nothing nan nan nan nan nan nan nan 95 -1 student 2012-06-30-18-18 san francisco, california might want kids straight nan other and somewhat serious about it m aquarius but it doesn’t matter when drinking english (okay) available

Let's assume the 110-year-old lady and the athletic 109-year-old gentleman (who's working on a masters program) are outliers: we get rid of them so the following plots look better. They didn't say much else about themselves, anyway.

In [5]:
d=d[d["age"]<=80] # sorry guys

print("The dataset now contains {} records".format(len(d)))

m=d[d["sex"]=="m"] # male users
f=d[d["sex"]=="f"] # female users
print("{} males ({:.1%}), {} females ({:.1%})".format(
    len(m),len(m)/len(d),
    len(f),len(f)/len(d)))

print()
print("Age statistics:\n{}".format(d["age"].describe()))
The dataset now contains 59944 records
35828 males (59.8%), 24116 females (40.2%)

Age statistics:
count    59944.000000
mean        32.337715
std          9.442423
min         18.000000
25%         26.000000
50%         30.000000
75%         37.000000
max         69.000000
Name: age, dtype: float64

Draw age histograms for male and female users

In [6]:
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)
sns.distplot(m["age"], ax=ax1,
             bins=range(d["age"].min(),d["age"].max()),
             kde=False,
             color="b")
ax1.set_title("Age distribution for males")
sns.distplot(f["age"], ax=ax2,
             bins=range(d["age"].min(),d["age"].max()),
             kde=False,
             color="r")
ax2.set_title("Age distribution for females")
ax1.set_ylabel("Number of users in age group")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

Note that both distributions are right-skewed. Then, as is often (but not always!) the case, the mean is larger than the median.

In [7]:
print("Mean and median age for males:   {:.2f}, {:.2f}".format(m["age"].mean(),m["age"].median()))
print("Mean and median age for females: {:.2f}, {:.2f}".format(f["age"].mean(),f["age"].median()))
Mean and median age for males:   32.02, 30.00
Mean and median age for females: 32.82, 30.00

Females seem to be on average slightly older than males. Let's compare the age distributions in a single plot

In [8]:
fig,(ax1,ax2) = plt.subplots(nrows=2,figsize=(10,6),sharex=True)

# Plot the age distributions of males and females on the same axis
sns.distplot(m["age"], ax=ax1,
             bins=range(d["age"].min(),d["age"].max()),
             kde=False,
             color="b",
             label="males")
sns.distplot(f["age"], ax=ax1,
             bins=range(d["age"].min(),d["age"].max()),
             kde=False,
             color="r",
             label="females")
ax1.set_ylabel("Number of users in age group")
ax1.set_xlabel("")
ax1.legend()

# Compute the fraction of males for every age value
fraction_of_males=(m["age"].value_counts()/d["age"].value_counts())
# Ignore values computed from age groups in which we have less than 100 total users (else estimates are too unstable)
fraction_of_males[d["age"].value_counts()<100]=None
barlist=ax2.bar(left=fraction_of_males.index,
        height=fraction_of_males*100-50,
        bottom=50, width=1, color="gray")
for bar,frac in zip(barlist,fraction_of_males):
    bar.set_color("b" if frac>.5 else "r")
    bar.set_alpha(0.4)
ax2.set_xlim([18,70])
ax2.set_xlabel("age")
ax2.set_ylabel("percentage of males in age group")
ax2.axhline(y=50,color="k")

for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

Over-60 users are not many, but in this group there are significantly more females than males. This may be explained by the fact that, in this age group, there are more females than males in the general population.

USA by Sex and Age 2015-07-01 Picture By Delphi234 (Own work) [CC0], via Wikimedia Commons.

On this topic, also see page 5 of the 2010 US Census Brief on Age and Sex Composition. The ratio between males and females in a given population is called Sex Ratio.

Study height distribution and compare with official data from the US Centers of Disease Control and Prevention (CDC)

We first plot the height distribution for males and females in the whole dataset

In [9]:
fig,(ax,ax2) = plt.subplots(nrows=2,sharex=True,figsize=(6,6),gridspec_kw={'height_ratios':[2,1]})

# Plot histograms of height
bins=range(55,80)
sns.distplot(m["height"].dropna(), ax=ax,
             bins=bins,
             kde=False,
             color="b",
             label="males")
sns.distplot(f["height"].dropna(), ax=ax,
             bins=bins,
             kde=False,
             color="r",
             label="females")
ax.legend(loc="upper left")
ax.set_xlabel("")
ax.set_ylabel("Number of users with given height")
ax.set_title("height distribution of male and female users");

# Make aligned boxplots
sns.boxplot(data=d,y="sex",x="height",orient="h",ax=ax2,palette={"m":"b","f":"r"})
plt.setp(ax2.artists, alpha=.5)
ax2.set_xlim([min(bins),max(bins)])
ax2.set_xlabel("Self-reported height [inches]")

sns.despine(ax=ax)
fig.tight_layout()

Males are obviously taller than females, and the two distributions make sense.

How does this compare with general population data? Are OkCupid users maybe cheating and overreporting their height?

The CDC publishes growth charts, which contain height data for the general US population. The dataset reports statistics (3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 97th percentiles) for stature for different ages from 2 to 20 years. This (and more) data is plotted by the CDC in these beautiful charts.

In [10]:
# Import a CSV file for growth chart data
cdc=pd.read_csv("https://www.cdc.gov/growthcharts/data/zscore/statage.csv")

print("First 10 rows of raw cdc growth chart data")
display(PrettyPandas(cdc.head(10)))
First 10 rows of raw cdc growth chart data
Sex Agemos L M S P3 P5 P10 P25 P50 P75 P90 P95 P97
0 1 24 0.941524 86.4522 0.0403215 79.9108 80.7298 81.9917 84.1029 86.4522 88.8052 90.9262 92.1969 93.0227
1 1 24.5 1.00721 86.8616 0.0403956 80.2604 81.0887 82.364 84.4947 86.8616 89.228 91.3575 92.6318 93.4592
2 1 25.5 0.837251 87.6525 0.0405775 81.0053 81.8345 83.1139 85.2589 87.6525 90.0568 92.2297 93.5341 94.3828
3 1 26.5 0.681493 88.4233 0.0407231 81.7342 82.5641 83.8472 86.0052 88.4233 90.8626 93.0761 94.4088 95.2776
4 1 27.5 0.53878 89.1755 0.0408332 82.4485 83.279 84.5653 86.7351 89.1755 91.6471 93.8983 95.2575 96.1451
5 1 28.5 0.407697 89.9104 0.0409091 83.1494 83.9805 85.2696 87.4498 89.9104 92.4116 94.6976 96.0815 96.9866
6 1 29.5 0.286762 90.6291 0.0409524 83.8382 84.6695 85.961 88.1503 90.6291 93.1572 95.4752 96.882 97.8035
7 1 30.5 0.174489 91.3324 0.0409653 84.5156 85.3469 86.6403 88.8375 91.3324 93.885 96.2324 97.6603 98.5969
8 1 31.5 0.0694445 92.0213 0.04095 85.1824 86.0136 87.3082 89.512 92.0213 94.5959 96.9702 98.4176 99.3683
9 1 32.5 -0.0297206 92.6964 0.0409087 85.8393 86.67 87.9654 90.1746 92.6964 95.2908 97.6898 99.1551 100.119
In [11]:
# Adjust the data to fit our format
cdc["Age"]=cdc["Agemos"]/12 # convert age in months to age in fractional years
cdc["Sex"]=cdc["Sex"].replace({1:"m",2:"f"}) # align to our convention
percentiles=[3,5,10,25,50,75,90,95,97]
percentile_columns=["P"+str(p) for p in percentiles] # names of percentile columns
cdc[percentile_columns]=cdc[percentile_columns]*0.393701 # convert percentile columns from centimeters to inches (ugh)
cdc20=cdc[cdc["Age"]==20].set_index("Sex") # Select the two rows corresponding to 20-year-olds (males and females)

print("Height Percentiles for 20-year-old US population [inches]")
display(PrettyPandas(cdc20[percentile_columns],precision=4))
Height Percentiles for 20-year-old US population [inches]
P3 P5 P10 P25 P50 P75 P90 P95 P97
Sex
m 64.3 64.98 66.01 67.73 69.63 71.52 73.21 74.22 74.88
f 59.49 60.1 61.03 62.58 64.31 66.02 67.56 68.48 69.08

Let's compare the stats for reported heights of our 20-year-olds to the CDC stats for 20-year-olds.

Note that OKCupid height data are integers, which also causes all percentiles to be integer values. To fix this, we jitter the data by $\pm0.5$ inches by adding random uniformly distributed noise in the range $[-0.5,+0.5]$ (which won't affect the mean, but will smooth the percentiles). This makes sense if we assume that users reported their height rounded to the nearest inch.

In [12]:
mheights=m.loc[m["age"]==20,"height"] # heights of 20-year-old males
fheights=f.loc[f["age"]==20,"height"] # heights of 20-year-old females

# To smooth the computation of percentiles, jitter height data by adding
# uniformly distributed noise in the range [-0.5,+0.5]
mheightsj=mheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(mheights),))
fheightsj=fheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(fheights),))

# For each of the available percentiles in CDC data, compute the corresponding percentile from our 20-year-old users
stats=[]
for percentile,percentile_column in zip(percentiles,percentile_columns):
    stats.append({"sex":"m",
                  "percentile":percentile,
                  "CDC":cdc20.loc["m",percentile_column],
                  "users":mheightsj.quantile(percentile/100)})
    stats.append({"sex":"f",
                  "percentile":percentile,
                  "CDC":cdc20.loc["f",percentile_column],
                  "users":fheightsj.quantile(percentile/100)})
stats=pd.DataFrame(stats).set_index(["sex","percentile"]).sort_index()

# For each percentile, compute the gap between users and CDC
stats["gap"]=stats["users"]-stats["CDC"]

print("Height percentiles (in inches) for 20-year-old males")
display(PrettyPandas(stats.loc["m"],precision=4))

print("Height percentiles (in inches) for 20-year-old females")
display(PrettyPandas(stats.loc["f"],precision=4))
Height percentiles (in inches) for 20-year-old males
CDC users gap
percentile
3 64.3 64.89 0.589
5 64.98 65.56 0.5848
10 66.01 66.16 0.154
25 67.73 67.88 0.1547
50 69.63 70.19 0.567
75 71.52 72.26 0.7426
90 73.21 74.15 0.9413
95 74.22 75.47 1.241
97 74.88 76.09 1.209
Height percentiles (in inches) for 20-year-old females
CDC users gap
percentile
3 59.49 59.69 0.196
5 60.1 59.93 -0.1718
10 61.03 60.98 -0.04799
25 62.58 62.89 0.3094
50 64.31 64.85 0.5411
75 66.02 66.64 0.6133
90 67.56 68.46 0.896
95 68.48 69.46 0.9735
97 69.08 70.23 1.153
In [13]:
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,4))
stats.loc["m"][["users","CDC"]].plot.bar(ax=ax1,color=[[0.5,0.5,1],"k"],alpha=1,width=0.8,rot=0)
stats.loc["f"][["users","CDC"]].plot.bar(ax=ax2,color=[[1,0.5,0.5],"k"],alpha=1,width=0.8,rot=0)
ax1.set_ylim([64,77])
ax2.set_ylim([58,71])
ax1.set_ylabel("Height [inches]")
ax2.set_ylabel("Height [inches]")
ax1.set_title("Height percentiles in 20y-old male users vs CDC data")
ax2.set_title("Height percentiles in 20y-old female users vs CDC data")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

The height statistics of our user population matches remarkably well with CDC data. It looks like our users, males and females alike, may be slightly over-reporting their height (still, much less than one could expect), but there could be many other reasonable explanations for this gap. For example, the San-Francisco population may be taller than the general US population. This might be a starting point to further investigate the issue.

Another explanation may be that short users are more likely to not report their height; in this case, even though other users report their height honestly, our statistics (which ignores missing values) would be biased. This problem is well known to statisticians as data missing not at random (MNAR). May we be victims of this?

In [14]:
d["height"].isnull().sum()
Out[14]:
3

Apparently not, as we only have 3 users with missing height data, which could never explain the gap. Probably, height is a required information to sign up. The reason why a few users are missing this is unknown (maybe they signed up when the information was not compulsory?)

Note that this analysis on the whole population of OKCupid users found a larger tendency to height exaggeration.

Study how height changes with age

All our users are over-18. As shown in CDC plots, height does not increase much after 18, but we can still observe the very last stage of growth, just by fractions of an inch.

In [15]:
#%% Investigate heights vs sex vs age
g=d.groupby(["sex","age"])["height"].mean()
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,3))
ax1.plot(g["m"],color="b")
ax1.set_xlim(18,27)
ax1.set_ylim(69.5,71)
ax1.set(title="Average height vs age for males",
        ylabel="height",
        xlabel="age")
ax2.plot(g["f"],color="r")
ax2.set_xlim(18,27)
ax2.set_ylim(64,65.5)
ax2.set(title="Average height vs age for females",
        ylabel="height",
        xlabel="age");
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

We can also overlay CDC growth charts to the above plots, with minimal data wrangling.

We only have to massage our cdc data so that it is indexed by age (in integer years). For a given integer age (say 19 years) the original data has 12 rows corresponding to the ages of $(19 \cdot 12 + 0), (19 \cdot 12 + 1), \cdots, (19 \cdot 12 + 11)$ months. For a given integer age we average the corresponding percentiles of each of these rows.

In [16]:
cdc_m=cdc[cdc["Sex"]=="m"].groupby(np.floor(cdc["Age"]))[percentile_columns].mean()
cdc_f=cdc[cdc["Sex"]=="f"].groupby(np.floor(cdc["Age"]))[percentile_columns].mean()

# Result for males
display(PrettyPandas(cdc_m,precision=4))
P3 P5 P10 P25 P50 P75 P90 P95 P97
Age
2.0 32.99 33.31 33.82 34.68 35.65 36.64 37.55 38.1 38.46
3.0 35.94 36.3 36.85 37.78 38.84 39.93 40.92 41.52 41.92
4.0 38.28 38.7 39.33 40.39 41.57 42.74 43.8 44.43 44.84
5.0 40.54 41 41.7 42.87 44.16 45.44 46.58 47.26 47.7
6.0 42.83 43.31 44.06 45.3 46.69 48.08 49.33 50.08 50.56
7.0 45.09 45.59 46.37 47.68 49.16 50.65 52.01 52.82 53.36
8.0 47.17 47.7 48.52 49.9 51.47 53.07 54.53 55.42 56
9.0 48.98 49.54 50.42 51.91 53.58 55.29 56.85 57.8 58.42
10.0 50.61 51.21 52.15 53.74 55.54 57.36 59.02 60.03 60.69
11.0 52.34 52.98 53.97 55.65 57.55 59.49 61.26 62.34 63.04
12.0 54.47 55.15 56.21 58 60.01 62.06 63.93 65.06 65.8
13.0 57.02 57.78 58.94 60.87 62.99 65.11 67 68.12 68.85
14.0 59.62 60.43 61.65 63.64 65.79 67.87 69.69 70.76 71.44
15.0 61.7 62.49 63.68 65.61 67.68 69.68 71.42 72.44 73.09
16.0 63.03 63.76 64.89 66.72 68.7 70.63 72.33 73.33 73.97
17.0 63.74 64.44 65.51 67.27 69.2 71.1 72.79 73.79 74.43
18.0 64.1 64.77 65.82 67.55 69.45 71.34 73.03 74.03 74.68
19.0 64.26 64.93 65.96 67.68 69.58 71.47 73.16 74.17 74.83
20.0 64.3 64.98 66.01 67.73 69.63 71.52 73.21 74.22 74.88
In [17]:
# Compute average height per sex and age
g=d.groupby(["sex","age"])["height"].mean()

fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,5))
ax1.plot(g["m"],color="b",label="Mean of male OkCupid users")
ax1.plot(cdc_m["P75"],color="k",linestyle='dotted',label="75th percentile of male US Population")
ax1.plot(cdc_m["P50"],color="k",label="Median of male US Population")
ax1.plot(cdc_m["P25"],color="k",linestyle='dotted',label="25th percentile of male US Population")
ax1.fill_between(cdc_m.index,cdc_m["P25"],cdc_m["P75"],color="k",alpha=0.1,linewidth=0)


#ax1.legend(loc="lower right")
# Use direct labeling instead of a legend
x=cdc_m["P50"].index[-1]
ax1.text(x, g["m"].loc[:26].max(), " Mean of male OkCupid users", color="b",
         verticalalignment="bottom",fontsize="small")
ax1.text(x, cdc_m["P75"].iloc[-1]," 75th percentile of male US Population",
         verticalalignment="center",fontsize="small")
ax1.text(x, cdc_m["P50"].iloc[-1]," Median of male US Population",
         verticalalignment="center",fontsize="small")
ax1.text(x, cdc_m["P25"].iloc[-1]," 25th percentile of male US Population",
         verticalalignment="center",fontsize="small")

ax1.set_xlim(16,27)
ax1.set_ylim(67,72)
ax1.set(title="height vs age for males",
        ylabel="height [inches]",
        xlabel="age (rounded down for CDC data) [years]");
ax2.plot(g["f"],color="r",label="Mean of female OkCupid users")
ax2.plot(cdc_f["P75"],color="k",linestyle='dotted',label="75th percentile of female US Population")
ax2.plot(cdc_f["P50"],color="k",label="Median of female US Population")
ax2.plot(cdc_f["P25"],color="k",linestyle='dotted',label="25th percentile of female US Population")
ax2.fill_between(cdc_f.index,cdc_f["P25"],cdc_f["P75"],color="k",alpha=0.1,linewidth=0)

#ax2.legend(loc="lower right")
# Use direct labeling instead of a legend
x=cdc_f["P50"].index[-1]
ax2.text(x, g["f"].loc[:26].max(), " Mean of female OkCupid users", color="r",
         verticalalignment="bottom",fontsize="small")
ax2.text(x, cdc_f["P75"].iloc[-1]," 75th percentile of female US Population",
         verticalalignment="center",fontsize="small")
ax2.text(x, cdc_f["P50"].iloc[-1]," Median of female US Population",
         verticalalignment="center",fontsize="small")
ax2.text(x, cdc_f["P25"].iloc[-1]," 25th percentile of female US Population",
         verticalalignment="center",fontsize="small")

ax2.set_xlim(16,27)
ax2.set_ylim(62,67)
ax2.set(title="height vs age for females",
        ylabel="height [inches]",
        xlabel="age (rounded down for CDC data) [years]");
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

How do users self-report their body type?

An interesting categorical attribute body_type contains the self-reported body type of the user, chosen from a limited set of options.

In [14]:
fig,ax=plt.subplots(figsize=(6,5))
sns.countplot(y="body_type",hue="sex",
              order=d["body_type"].value_counts().sort_values(ascending=False).index,
              data=d,palette={"m":"b","f":"r"},alpha=0.5,ax=ax);
ax.set_title("Number of female and male users self-reporting each body type")
sns.despine(ax=ax)

In the plot above, males and females are two sub-groups of the population, whereas body_type is a categorical attribute. It is interesting to compare how users in each of the two sub-groups (i.e. males and females) are likely to use each of the available categorical values; this is normally done through contingency tables.

For example, regardless of how frequent these body types are in the general population, it's interesting to note that the terms curvy and full figured are disproportionally used by females, whereas a little extra is mostly used by males.

The function implemented below aims to visualize this, for a generic categorical attribute and a generic grouping of users. It accpets the following inputs:

  • a series with categorical data (in the example, the self-reported body type of each user)
  • two like-indexed boolean series g1 and g2 defining two subgroups of the population (in the example, whether each user is a male (True in g1) or a female (True in g2)). Note that, even though it is the case in the example, the two groups don't need to define partition of the population: they may intersect (an user may be in both g1 and g2) and they may not cover the whole population (an user may be in neither g1 nor g2). We also expect that the two subgroups may have very different sizes.
  • The function also accepts readable names and RGB (or matplotlib) colors assigned to each of the two groups.

For each unique value of series represented in the union of the two groups, the function visualizes how frequently the value is used by group g1 compared to how frequently it is used by group g2. We expect the output to be $0.5$ for values that were equally frequent in g1 and g2 (note that, since we deal with frequencies, this does not depend on the relative sizes of g1 and g2, nor on whether the frequency in both groups is very low or very high); we expect the output to be $0$ for values that were only seen in g2 and never seen in g1; we expect the output to be $1$ for values that were only seen in g1 and never seen in g2.

The function automatically discards values that appear less than 50 total times in the two groups, which would yield an unreliable estimate of the proportion.

In [15]:
import math

# Define visualization function
def compare_prevalence(series,g1,g2,g1name,g2name,g1color,g2color,ax):
    
    # for each categorical value represented in series, number of users in group g1 which have this value
    g1n=series.loc[g1].value_counts()
    # for each categorical value represented in series, number of users in group g2 which have this value
    g2n=series.loc[g2].value_counts()
    
    # join the two series in a single dataframe, filling 0 where indices don't match
    # (e.g. if a value represented in g1 did never appear in g2)
    df=pd.concat({"g1n":g1n,"g2n":g2n},axis=1).fillna(0)
    # df has one row for every distinct value of series in the union of g1 and g2
    
    # normalize the data
    df["g1f"]=df["g1n"]/(df["g1n"].sum()) # fraction of g1 users with each categorical value
    df["g2f"]=df["g2n"]/(df["g2n"].sum()) # fraction of g2 users with each categorical value
    
    assert(math.isclose(df["g1f"].sum(),1)) 
    assert(math.isclose(df["g2f"].sum(),1))
    
    # for each row of df, we now compute how frequent the value was in g1 compared to the frequency it had in g2.
    df["frac12"]=df["g1f"]/(df["g1f"]+df["g2f"])
    # we expect df["frac12"] to be 0.5 for values that were equally frequent in g1 and g2 (note that this does not depend on the size of g1 and g2)
    # we expect df["frac12"] to be 0 for values that were only seen in g2 and never seen in g1
    # we expect df["frac12"] to be 1 for values that were only seen in g1 and never seen in g2
    
    df=df[(df["g1n"]+df["g2n"])>=50] # exclude values which are too rare
    df=df.sort_values("frac12")
    
    # Draw the left bars
    ax.barh(bottom=range(len(df)),
                width=df["frac12"],
                left=0,
                height=1,
                align="center",
                color=g1color,alpha=1)
    # Draw the right bars
    ax.barh(bottom=range(len(df)),
                width=df["frac12"]-1,
                left=1,
                height=1,
                align="center",
                color=g2color,alpha=1)
    
    # Draw a faint vertical line for x=0.5
    ax.axvline(x=0.5,color="k",alpha=0.1,linewidth=5)
    ax.set(xlim=[0,1],
           ylim=[-1,len(df)-0.5],
           yticks=range(len(df)),
           yticklabels=df.index,
           xlabel="fraction of users",
           ylabel=series.name)
    
    ax.set_title("Relative prevalence of {} ($n={}$) vs {} ($n={}$)\nfor each value of {}".format(
                g1name,g1.sum(),g2name,g2.sum(),series.name),
                loc="left",fontdict={"fontsize":"medium"})
    ax.text(0.02,len(df)-1,g1name,verticalalignment="center",horizontalalignment="left",size="smaller",color="w")
    ax.text(0.98,0,g2name,verticalalignment="center",horizontalalignment="right",size="smaller",color="w")

    def color_for_frac(f):
        # Blend g1color and g2color according to f (convex linear combination):
        # 0 returns g1color, 1 returns g2color)
        ret=np.array(g1color)*f+np.array(g2color)*(1-f)
        if(np.linalg.norm(ret)>1):          # If the resulting rgb color is too bright for text,
            ret=(ret/np.linalg.norm(ret))*1 # rescale its brightness to dark (but keep hue)
        return ret
        
    for i,tl in enumerate(plt.gca().get_yticklabels()):
        tl.set_color(color_for_frac(df["frac12"].iloc[i]))
        
    sns.despine(ax=ax,left=True)

# Apply visualization function 
fig,ax = plt.subplots(figsize=(10,3))
compare_prevalence(
    series=d["body_type"],                          # Which categorical attribute?
    g1=d["sex"]=="m",      g2=d["sex"]=="f",        # Definition of the two groups
    g1name="male users",   g2name="female users",   # Names of the two groups
    g1color=[0.5,0.5,1.0], g2color=[1.0,0.5,0.5],   # Colors for the two groups
    ax=ax)
fig.tight_layout()

The plot above can be interpreted as follows. If we sample the same number of males and females from the population, and then we only keep those who describe themselves as "athletic", almost 73% of these will be male, and the remaining 27% will be female.

Similarly, if we consider users who describe themselves as "curvy", more than 98% of these will be female.

Note that the fractions do not depend on the prevalence of the two groups in the overall population, nor on the prevalence of each categorical body type value in the general population. We can therefore apply the same function to other group pairs, even when the two groups have very different sizes.

This plot is a less expressive variant of a well-known representation of contingency tables called mosaic plot, and closely resembles what here is called relative frequency segmented bar plot.

Study categorical attributes for different segments of the population

i.e., make a lot of plots like the one above, for different attributes and using different groupings.

Define a list of interesting group pairs

We now define a large amount of interesting pairs of groups that we want to compare. For each group in each pair, we define:

  • the human-readable name of the group;
  • an RGB color to represent the group (because eyecandy);
  • a boolean series indexed like d, which masks the users belonging to the group
In [16]:
groupss=[            #   Color RGB      Boolean series to index observations in d
    (("male users",      [0.5,0.5,1.0], d["sex"]=="m"),
     ("female users",    [1.0,0.5,0.5], d["sex"]=="f")),
    
    (("straight users",  [0.5,0.5,0.5], d["orientation"]=="straight"),
     ("gay/bi users",    [0.7,0.7,0.1], d["orientation"]!="straight")),
    
    (("young users",     [0.4,0.8,0.4], d["age"]<25),
     ("older users",     [0.4,0.4,0.5], d["age"]>35)),
    
    (("earns_little",    [0.4,0.4,0.4], d["income"]==20000),
     ("earns_alot",      [0.4,0.9,0.6], d["income"]>=50000)),
    # We purposefully exclude users not reporting their income (i.e.the majority). 20k is the lowest available tier
    # We also purposefully exclude intermediate tiers
    
    (("fit users",       [0.4,0.6,0.8], d["body_type"].isin(["average","fit","athletic","thin","skinny","jacked"])),
     ("unfit users",     [0.8,0.6,0.4], d["body_type"].isin(["a little extra","curvy","full figured","rather not say","overweight"]))),
    # We assume that users that do not say their body_type are unfit
    
    (("monoglots",       [0.7,0.7,0.7], d["speaks"].str.count(",")==0),
     ("polyglots",       [0.5,0.5,0.5], d["speaks"].str.count(",")>=2)),
   
    (("witty_programmers",     [0.4,0.7,0.4], d["speaks"].str.contains("\\b(?:c|lisp)\\b",na=False)),
     ("non witty_programmers", [0.5,0.5,0.6], ~(d["speaks"].str.contains("\\b(?:c|lisp)\\b",na=True)))),
    # a "witty programmer" is a nerd who, when asked which languages she/he speaks, thinks that "C++" or "perl" are reasonable answers.
    # in their defence, these are valid options in the okcupid interface (see codebook)
    # we compare them with everyone else (i.e. people who are probably not as nerd)

    (("it_lang_speakers",      [0.5,0.7,0.5], d["speaks"].str.contains("\\b(?:italian)\\b",na=False)),
     ("non it_lang_speakers",  [0.5,0.5,0.5], ~(d["speaks"].str.contains("\\b(?:italian)\\b",na=True)))),
   
    (("dead_lang_speakers",    [0.5,0.7,0.5], d["speaks"].str.contains("\\b(?:latin|ancient greek)\\b",na=False)),
     ("non dead_lang_speakers",[0.5,0.5,0.5], ~(d["speaks"].str.contains("\\b(?:latin|ancient greek)\\b",na=True)))),
    # a "dead_lang_speaker" is a nerd who, when asked which languages she/he speaks, thinks that "latin" or "ancient greek" are reasonable answers.
    # in their defence, these are valid options in the okcupid interface (see codebook)
    # we compare them with everyone else (i.e. people who are probably not as nerd)
    
    (("nonsmokers",      [0.6,0.6,1.0], d["smokes"]=="no"),
     ("smokers",         [0.3,0.3,0.3], d["smokes"]=="yes")),
    
    (("omnivores",       [0.8,0.4,0.4], d["diet"].str.split().str.get(-1).str.contains("anything",na=False)),
     ("veg(etari)ans",   [0.3,0.8,0.3], d["diet"].str.split().str.get(-1).str.contains("veg",na=False))),
   
    (("doesnt_want_kids",[0.4,0.4,0.4], d["offspring"].isin(["doesn&rsquo;t want kids","doesn&rsquo;t have kids, and doesn&rsquo;t want any"])),
     ("wants_kids",      [0.4,0.8,0.8], d["offspring"].isin(["doesn&rsquo;t have kids, but wants them","wants kids","has a kid, and wants more","has kids, and wants more"]))),
    # We sample the possible answers to the "offspring" question identifying those that explicitly say they want (more) kids
    # vs those that explicitly say they don't, regardless of whether they already have kids or not.
    
    (("cat lovers",      [1.0,0.6,0.6], d["pets"].str.contains("\\b(?:likes|has) cats",na=False)),
     ("cat haters",      [0.4,0.4,0.4], d["pets"].str.contains("\\bdislikes cats",na=False))),
    # A cat lover either has or likes cats; a cat hater dislikes cats.
   
    (("dog lovers",      [1.0,0.6,0.6], d["pets"].str.contains("\\b(?:likes|has) dogs",na=False)),
     ("dog haters",      [0.4,0.4,0.4], d["pets"].str.contains("\\bdislikes dogs",na=False)))
    # A dog lover either has or likes dogs; a dog hater dislikes dogs.
]


# We finally add a couple of religion_based groupings
atheists = d["religion"].str.contains("\\b(?:atheism|agnosticism)\\b").fillna(False)
religious = ~(d["religion"].str.contains("\\b(?:atheism|agnosticism)\\b").fillna(True))
religion_not_important = religious & d["religion"].str.contains(
    "\\b(?:but not too serious about it|and laughing about it)\\b").fillna(False)
religion_important = religious & d["religion"].str.contains(
    "\\b(?:and somewhat serious about it|and very serious about it)\\b").fillna(False)

groupss+=[
    (("atheists/agnostics", [0.4,0.5,0.4], atheists),
     ("religious",          [0.4,0.4,0.5], religious)),
    (("serious_religious",  [0.4,0.3,0.3], religion_important),
     ("laughing_religious", [0.6,0.6,0.5], religion_not_important))]

Define a list of interesting categorical attributes

Body type is not the only interesting categorical attribute. Below we add this and other attributes in a list, possibly after massaging the data in order to limit the amount of categories in each attribute.

Note that the name of each series is used to label the plots.

In [17]:
attrs=[]

attrs.append(d["body_type"])

attrs.append(d["job"])

location=d["location"].copy()
users_per_location=location.value_counts()
location.loc[~(location.isin(list(users_per_location[users_per_location>1000].index)))]="elsewhere"
attrs.append(location)
# There are too many unique locations; we rewrite all locations that appear less than 1k times as "elsewhere".

attrs.append(d["orientation"])

attrs.append(d["drinks"])
# Has potentially interesting categories beyond yes and no, like "desperately".

attrs.append(d["education"].str.replace("working on ","").str.replace("graduated from ",""))
# For each type of school, one may have (see codebook) one optional prefix in the set:
# {"working on ", "graduated from ", "dropped out of "}.  We remove the first two, so
# "working on high school", "graduated from high school", "high school" become the same category.
# "Dropped out of high school", however, will stay a separate category

cats_relationship=pd.Series("does not mention cats",index=d.index,name="cats_relationship")
for s in ["has cats","likes cats","dislikes cats"]:
    cats_relationship[d["pets"].str.contains("\\b"+s,na=False)]=s
attrs.append(cats_relationship)

dogs_relationship=pd.Series("does not mention dogs",index=d.index,name="dogs_relationship")
for s in ["has dogs","likes dogs","dislikes dogs"]:
    dogs_relationship[d["pets"].str.contains("\\b"+s,na=False)]=s
attrs.append(dogs_relationship)
    
attrs.append(d["religion"].str.split(" ").str.get(0).fillna("not entered"))
# We only keep the name of the religion (first token) and not the optional comment
# We consider as a separate value users that did not enter any religion (which is an interesting piece of information per se)

attrs.append(d["sign"].str.split(" ").str.get(0).fillna("not entered"))
# We only keep the name of the astrological sign (first token) and not the optional comment
# We consider as a separate value users that did not enter any sign (which is an interesting piece of information per se)

Actually generate and save a plot for each combination of group and attribute

In [26]:
%matplotlib
# we temporarily switch to the default non-inline backend;
# the inline backend crashes in the cell below because it won't close the figures before the cell is done executing.

def makesafefilename(filename):
    keepcharacters = (' ','.','_')
    return "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()

import pathlib
imgdir=pathlib.Path("plots")
for ((g1name,g1color,g1),(g2name,g2color,g2)) in groupss:
    html="{} (n={}) vs {} (n={})\n<ul>".format(g1name,g1.sum(),g2name,g2.sum())
    for attr in attrs:
        fig,ax = plt.subplots(figsize=(10,1.5+0.15*len(attr.unique())))
        compare_prevalence(attr,g1,g2,g1name,g2name,g1color,g2color,ax)
        fig.tight_layout()
        fn=imgdir/makesafefilename("plot_{}_{}_{}.svg".format(attr.name,g1name,g2name))
        fig.savefig(str(fn))

        html+='<li><a href="{}">For attribute {} ({} unique values)</a></li>'.format(
            str(fn),attr.name,len(attr.unique()))
        plt.close(fig)
    html+="</ul>"
    display(HTML(html))
Using matplotlib backend: Qt5Agg
In [18]:
%matplotlib inline
# back to inline plots 
In [28]:
# This cell is only needed to generate the markdown text with image links,
# which is reported and edited manually (deleting uninteresting plots) in the next cell
md=""
for ((g1name,g1color,g1),(g2name,g2color,g2)) in groupss:
    for attr in attrs:
        fn=imgdir/makesafefilename("plot_{}_{}_{}.svg".format(attr.name,g1name,g2name))
        md+='![{}]({})\n\n'.format(attr.name,str(fn))

Discussion on the generated plots

Below, we report some of the interesting plots we just generated.

Males vs Females

body_type As noted above: "curvy" is mostly used by females, understandably

job Predictably, males abound in stereotypically male-dominated environments.

education Note that those who "dropped out" of something -- especially lower education tiers -- are mostly males.

cats_relationship dogs_relationship Everybody likes cats/dogs, but females are most likely to actually own one.

religion

sign Note how, in this and other plots for the sign attribute, the actual sign has basically no effect; but not entering one's sign is an important piece of information which stands out.

Straight vs Gay/bisexual users

job

religion

sign

Young vs older users

Young users are defined as <=25 years old; older users are >=35 years old.

job

orientation

drinks

education

religion

Users that earn little vs users that earn a lot

We purposefully exclude users not reporting their income (i.e.the majority). Users that earn a little are those who selected 20k\$ as their annual income (it's the lowest available tier). Users that earn a lot chose 50k\$ or more as their annual income. We exclude users who report earnings in intermediate tiers.

job

education

religion

sign

Fit vs unfit users

Fit users reported one of the following body types: "average","fit","athletic","thin","skinny","jacked". Unfit users reported one of the following body types: "a little extra","curvy","full figured","rather not say","overweight".

job

education

religion

sign

Monoglots vs Polyglots

job

location

education

religion

sign Those who do not enter their sign are more likely to be monoglots. This is somewhat unexpected.

Witty programmers vs others

a "witty programmer" is a nerd who, when asked which languages she/he speaks, thinks that "C++" or "perl" are reasonable answers. In their defence, these are valid options in the okcupid interface (see codebook). We compare them with everyone else (i.e. people who are probably not as nerd)

job

drinks

education

Italian-speakers vs others

job

Speakers of dead languages vs others

A "dead_lang_speaker" is a nerd who, when asked which languages she/he speaks, thinks that "latin" or "ancient greek" are reasonable answers. In their defence, these are valid options in the okcupid interface (see codebook). We compare them with everyone else (i.e. people who are probably not as nerd)

job

drinks

education

Smokers vs nonsmokers

body_type

job

orientation

drinks

education

religion

sign

Omnivores vs vegetarians and vegans

body_type

job

orientation

drinks

religion

sign

Those who want kids vs those who don't

We sample the possible answers to the "offspring" question identifying those that explicitly say they want (more) kids vs those that explicitly say they don't, regardless of whether they already have kids or not.

job

orientation

drinks

cats_relationship

dogs_relationship

religion

Those who have or like cats vs those who dislike cats

Users not reporting their relationship with cats are excluded

job

drinks

education

dogs_relationship

religion

Those who have or like dogs vs those who dislike dogs

Users not reporting their relationship with dogs are excluded

job

cats_relationship

religion Islamic users tend to really dislike dogs; apparently, there are cultural reasons.

Atheists and agnostics vs those who identify with a religion

job

drinks

sign

Users who take their religion seriously vs users who don't

We exclude atheists/agnostics. Users who take their religion seriously are those who entered a religion and possibly added "and it's important". "Laughing_religious" users are those who entered a religion and added "but it's not important" or "and laughing about it".

job

religion

Analyzing essays

The data contains essays written by the users on the following topics:

  • essay0: My self summary
  • essay1: What I’m doing with my life
  • essay2: I’m really good at
  • essay3: The first thing people usually notice about me
  • essay4: Favorite books, movies, show, music, and food
  • essay5: The six things I could never do without
  • essay6: I spend a lot of time thinking about
  • essay7: On a typical Friday night I am
  • essay8: The most private thing I am willing to admit
  • essay9: You should message me if...

In the following, we concatenate all essays to a single string and ignore the different themes.

In [6]:
d["essays"]=""
for f in ["essay"+str(i) for i in range(10)]:
    d.loc[d[f].isnull(),f]=""
    d["essays"]=d["essays"]+" "+d[f]

Let's index and count all unique words in all essays.

In [7]:
import string
from collections import Counter
wordcounts=Counter()
for e in d["essays"].str.split():
    wordcounts.update([w.strip(string.whitespace+string.punctuation) for w in e])

The 20 most common words

In [21]:
for w,c in wordcounts.most_common(20):
    print(c,w)
1050302 
823608 i
714584 and
679436 a
626851 the
595608 to
360865 my
356508 of
296365 in
268116 br
196373 i'm
186633 you
175822 with
171168 for
166527 that
149093 is
141352 have
134013 like
132263 it
129212 but

Among tha 10k most common words, we filter those with at least 4 characters. We build a dataframe that contains a binary column for each of these and a row for each user. The value will be True where if the user's essay contains the word, False otherwise.

In [ ]:
import tqdm # a cool progress bar

# Let's consider the most common 10k words
words=[w for w,c in wordcounts.most_common(10000) if len(w)>=4 and w.isalpha()]

d_contains=pd.DataFrame(index=d.index)
# This operation takes a lot of time
for w in tqdm.tqdm(words):
    d_contains[w]=d["essays"].str.contains("\\b"+w+"\\b")

# These frequent words were part of the html and we should ignore them
d_contains=d_contains.drop("href",axis=1)
d_contains=d_contains.drop("ilink",axis=1)
d_contains.to_pickle("d_contains.pickle") # Save a cache
In [34]:
d_contains=pd.read_pickle("d_contains.pickle") # Read the cache, to avoid the hours of computation above
In [35]:
print("The dataset contains {} rows (users) and {} columns (words)".format(
        len(d_contains.index),len(d_contains.columns)))
The dataset contains 59944 rows (users) and 8091 columns (words)

Let's visualize the resulting dataframe.

We show the transpose of the dataframe, i.e. one row per word and one column per user. We only display 400 users (i.e. less than one hundreth of all users in the dataset), and 100 words (i.e. about 1/80th of all the frequent words we found)

In [36]:
fig,(ax1,ax2)=plt.subplots(nrows=2,sharex=True,figsize=(10,15))
sns.heatmap(d_contains.iloc[0:400,0:50].transpose(),
            ax=ax1,cbar=None)
sns.heatmap(d_contains.iloc[0:400,-50:-1].transpose(),
            ax=ax2,cbar=None)
ax1.set_title("Which of the first 400 users (columns) use which of the 50 most frequent words (rows)")
ax2.set_title("Which of the first 400 users (columns) use which of the 50 least frequent words (rows) among the "+str(d_contains.shape[1])+" most frequent ones")
for ax in (ax1,ax2):
    ax.set_xticks([])
    ax.set_xlabel("Users")
    ax.set_ylabel("User's essays contain word")
fig.tight_layout()