Subreddit Gender Ratios

This is an update of my previous analysis of subreddit gender ratios project. This time I have access to better data because /u/Stuck_In_the_Matrix has downloaded every publicly available Reddit comment and made them publicly available. Additionally, the dataset is now available in Google BigQuery allowing easy analysis.

The dataset includes users' flair like the Reddit flair API I used previously. Notably, the new dataset also allows us to generate a list of all submitters in a subreddit, which was extremely difficult previously.

Basically we will be deriving the gender for as many users as possible from their flair in various subreddits. Then we will use that set of users as a random sample of every other subbreddit and guessing the gender ratio.

Limitations

  • We have no idea if the combined set of sampled users are representative of all Reddit users. In fact, some subreddits in the results look over represented in the results because they share many users with one of the subreddits we used for deriving gender. We just use as many subreddits as we can for gender and hope it's good enough.

  • The number of sampled users is very small compared to the set of all commenters of each subreddit.

In my previous analysis, it was near impossible to get a list of all commenters or subscribers of a subreddit. Instead the list of users with flair in a given subreddit was used as a proxy. That had major issues because in many subreddits the commenters and set of flaired users are not always the same. For example, some subreddits only give flair to submitters, which biased the results significantly. With the new dataset of every Reddit comment, that is no longer an issue.

In other words, this analysis is only mostly terrible now instead of completely terrible. I really doubt the exact percentages calculated are correct, but the relative comparisons between them and rankings look plausible.

Data through the end of November 2017 is included in this analysis.

BigQuery analysis

We will be doing almost all the analysis in BigQuery using SQL directly.

I'm using Jupyter and IPython, so I'd like a nice interface to run BigQuery SQL statements. Let's quickly define %%bigquery as an IPython cell magic.

This is pretty hacky, I'm using tons of internal functions from pandas-gbq. This is mostly a modified version of run_query() that limits the retrieved results to a specified number of rows. It doesn't really bother handle the data types of the results either. We also add SQL syntax highlighting in the Jupyter notebook for our new magic.

If you reuse this code, change PROJECT_ID to your own project ID from the Google Cloud Platform Console. You will also have to enable billing on the project and ensure BigQuery is enabled. You can follow the quickstart in their documentation.

You can add the --verbose flag to the magic to get status updates as it runs.

Becauase my %%bigquery code is horribly hacky, make sure you have the exact version of pandas-gbq that I used:

pip3 install -e git+https://github.com/pydata/[email protected]#egg=pandas-gbq
In [1]:
PROJECT_ID = 'bburky'
In [2]:
import pandas
from pandas_gbq import gbq
from IPython.core.magic import register_cell_magic
from IPython.core.magic_arguments import (
    argument, magic_arguments, parse_argstring)
from google.auth.exceptions import RefreshError
from google.cloud.bigquery import QueryJobConfig
from concurrent.futures import TimeoutError
import itertools


@register_cell_magic
@magic_arguments()
@argument('-d', '--dialect', default='standard',
          choices=['standard', 'legacy'])
@argument('--writedisposition', default='WRITE_EMPTY',
          choices=['WRITE_TRUNCATE', 'WRITE_APPEND', 'WRITE_EMPTY'])
@argument('-v', '--verbose', action='store_true')
@argument('--limit', type=int, default=0,
          help='Limit rows or retreived results to this number')
@argument('--destinationtable',
          help='Save query results to destination table')
def bigquery(line, cell):
    args = parse_argstring(bigquery, line)
    query = cell

    connector = gbq.GbqConnector(
        PROJECT_ID, reauth=False, verbose=args.verbose,
        dialect=args.dialect, auth_local_webserver=False)

    job_config = {
        'query': {
            'writeDisposition': args.writedisposition,
        }
    }

    if args.destinationtable:
        dataset_id, table_id = args.destinationtable.rsplit('.', 1)
        job_config['query']['destinationTable'] = {
            'projectId': PROJECT_ID,
            'datasetId': dataset_id,
            'tableId': table_id,
        }

    connector._start_timer()
    try:
        connector._print('Requesting query... ', end="")
        query_reply = connector.client.query(
            query,
            job_config=QueryJobConfig.from_api_repr(job_config['query']))
        connector._print('ok.')
    except (RefreshError, ValueError):
        if connector.private_key:
            raise gbq.AccessDenied(
                "The service account credentials are not valid")
        else:
            raise gbq.AccessDenied(
                "The credentials have been revoked or expired, "
                "please re-run the application to re-authorize")
    except connector.http_error as ex:
        connector.process_http_error(ex)

    job_id = query_reply.job_id
    connector._print('Job ID: %s\nQuery running...' % job_id)

    while query_reply.state != 'DONE':
        connector.print_elapsed_seconds('  Elapsed', 's. Waiting...')

        timeout_ms = job_config['query'].get('timeoutMs')
        if timeout_ms and timeout_ms < connector.get_elapsed_seconds() * 1000:
            raise QueryTimeout('Query timeout: {} ms'.format(timeout_ms))

        timeout_sec = 1.0
        if timeout_ms:
            # Wait at most 1 second so we can show progress bar
            timeout_sec = min(1.0, timeout_ms / 1000.0)

        try:
            query_reply.result(timeout=timeout_sec)
        except TimeoutError:
            # Use our own timeout logic
            pass
        except connector.http_error as ex:
            connector.process_http_error(ex)

    if connector.verbose:
        if query_reply.cache_hit:
            connector._print('Query done.\nCache hit.\n')
        else:
            bytes_processed = query_reply.total_bytes_processed or 0
            bytes_billed = query_reply.total_bytes_billed or 0
            connector._print('Query done.\nProcessed: {} Billed: {}'.format(
                connector.sizeof_fmt(bytes_processed),
                connector.sizeof_fmt(bytes_billed)))
            connector._print('Standard price: ${:,.2f} USD\n'.format(
                bytes_billed * connector.query_price_for_TB))

        connector._print('Retrieving results...')

    try:
        rows_iter = query_reply.result()
    except connector.http_error as ex:
        connector.process_http_error(ex)
    if args.limit:
        connector._print('  Limiting results to {} rows'.format(args.limit))
        result_rows = list(itertools.islice(rows_iter, args.limit))
    else:
        result_rows = list(rows_iter)
    total_rows = rows_iter.total_rows
    schema = {
        'fields': [
            field.to_api_repr()
            for field in rows_iter.schema],
    }

    # print basic query stats
    connector._print('Got {} of {} rows.\n'.format(
        len(result_rows), total_rows))

    return gbq._parse_data(schema, result_rows)
In [3]:
%%javascript
IPython.CodeCell.options_default.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%bigquery/]};

First get a list of every user on reddit's flair in each subreddit. Save this to a userflair table. Limit the returned results to 10 rows to avoid downloading the entire table.

In [4]:
%%bigquery --limit=10 --destinationtable=flair.userflair
SELECT
  DISTINCT author,
  subreddit,
  author_flair_text,
  author_flair_css_class
FROM
  `fh-bigquery.reddit_comments.201*`
WHERE
  author_flair_text IS NOT NULL
  OR author_flair_css_class IS NOT NULL;
Out[4]:
author subreddit author_flair_text author_flair_css_class
0 SuperSonicMC friendsafari SonicBoom: 3926-5720-7723 [Fighting:Mienfoo,Sa... mienfoo sawk tyrogue fighting
1 Jakester5112 MLBTheShow Rating: 1513 empty-athletics
2 tommofrog csgobetting virtuspro
3 TentacleBird changemyview 1∆ points
4 StSob learndota2 Silencer silencer
5 ScradleyGymson pcmasterrace SenorOiram i-desk-steam
6 MikeWizowski Gunners liked Per before it was cool
7 Zerockk leagueoflegends twistedfate
8 SlipClutch MMA Team Cruz
9 WhaterBlue GlobalOffensive Legendary Eagle rank legendary-eagle

Male and Female Flair

Then, search for all potentially gendered flair. We're searching for flair css classes that contains the substrings "male" or "female" with a regex. Order the list by the number of users with each flair css class and limit the results to flair that occurs greater than 1000 times.

I don't claim to be the best at SQL, so some of these queries could be more optimal.

In [5]:
%%bigquery
SELECT
  subreddit,
  num_flair,
  gender,
  flair.author_flair_css_class
FROM (
  SELECT
    subreddit,
    author_flair_css_class,
    REGEXP_EXTRACT(
      LOWER(author_flair_css_class),
      '(?:fe)?male') AS gender,
    COUNT(*) AS num_flair
  FROM
    flair.userflair
  GROUP BY
    subreddit,
    author_flair_css_class,
    gender) AS flair
WHERE
  gender IS NOT NULL
  AND num_flair > 1000
ORDER BY
  num_flair DESC
Out[5]:
subreddit num_flair gender author_flair_css_class
0 AskMen 32676 male male
1 AskWomen 21870 female female
2 AskWomen 16886 male male
3 AskMen 12245 female female
4 AskMenOver30 5558 male male
5 AskWomenOver30 2477 female female
6 sexover30 1285 male male

After looking at the results, and manually checking how the the subreddits use flair, we can use flair from the following subreddits:

Pink and Blue

Some subreddits use pink/blue colors to indicate genders. Let's search the flair CSS classes for that too. We use a regular expression to easily search for the strings anywhere within the flair css class.

In [6]:
%%bigquery
SELECT
  subreddit,
  num_flair,
  CASE
    WHEN color = 'pink' THEN 'female'
    WHEN color = 'blue' THEN 'male'
  END AS gender,
  author_flair_css_class
FROM (
  SELECT
    subreddit,
    REGEXP_EXTRACT(
      LOWER(author_flair_css_class),
      'pink|blue') AS color,
    COUNT(*) AS num_flair,
    author_flair_css_class
  FROM
    flair.userflair
  GROUP BY
    subreddit,
    color,
    author_flair_css_class)
WHERE
  color IS NOT NULL
  AND num_flair > 1000
ORDER BY
  num_flair DESC;
Out[6]:
subreddit num_flair gender author_flair_css_class
0 tall 22745 male blue
1 Android 15075 male userBlue
2 short 6819 male blue
3 MechanicalKeyboards 6713 male mxblue
4 gainit 5847 male blue
5 CasualConversation 5304 male blue
6 bjj 5064 male blue-belt
7 tall 5026 female pink
8 weddingplanning 4845 male blue
9 Random_Acts_Of_Amazon 3706 female hotpink
10 Cubers 3242 male blue
11 gameswap 2949 male blueshroom
12 Agario 2632 male blue
13 gonewildcurvy 2524 female verifypink
14 weddingplanning 2506 female pink
15 gonewildaudio 2099 male verifyblackblue
16 UIUC 2085 male bluei
17 destiny2 1991 male blue
18 mylittlepony 1920 female pinkiepie
19 PS4 1877 male ps-blue
20 DebateAChristian 1857 male blue brackets
21 Agario 1830 female pink
22 Supernatural 1816 male blue
23 short 1781 female pink
24 gonewildaudio 1574 female verifyblackpink
25 ModelUSGov 1488 male Blue
26 mildlyinfuriating 1438 male Blue
27 ABDL 1436 male blue
28 bjj 1419 male blue-belt-1
29 curlyhair 1387 male blue2
30 MechanicalKeyboards 1337 male mxblue2
31 proED 1335 male blue
32 ACTrade 1308 female Pink
33 CasualConversation 1302 female pink
34 bjj 1257 male blue-belt-2
35 cemu 1219 male blackandblue
36 curlyhair 1201 female pink
37 MechanicalKeyboards 1139 male blue
38 Android 1111 female userPink
39 androidapps 1034 male bluegrey
40 PowerTV 1020 male blueflaircop

Only /r/tall and /r/short appear to have greater than 1000 users with pink or blue flair to indicate gender. Others subreddits use pink and blue flair, but the subreddit rules don't necessarily suggest they were gender related.

A/S/L?

Let's attempt to parse all flair text on Reddit as A/S/L strings. Parse the flair text using a regular expression and look for a substring containing two slashes around an "M" or "F", case insensitively, with optional whitespace. The M or F is extracted with a capturing group and is then converted to the string "male" or "female".

In [7]:
%%bigquery
SELECT
  subreddit,
  num_flair,
  CASE
    WHEN asl = 'm' THEN 'male'
    WHEN asl = 'f' THEN 'female'
  END AS gender,
  sample_flair
FROM (
  SELECT
    subreddit,
    REGEXP_EXTRACT(
        LOWER(author_flair_text),
        '^[^/]+/(m|f)/[^/]+$') AS asl,
    COUNT(*) AS num_flair,
    ANY_VALUE(author_flair_text) as sample_flair
  FROM
    flair.userflair
  GROUP BY
    subreddit,
    asl)
WHERE
  asl IS NOT NULL
  AND num_flair > 1000
ORDER BY
  num_flair DESC;
Out[7]:
subreddit num_flair gender sample_flair
0 OkCupid 10839 male 19/m/CO
1 keto 5454 male 21/M/5'8" | SW: 295 | CW: 235 | GW: 165
2 OkCupid 4282 female 28/F/PA
3 keto 4180 female 37/F/5'10" SW:252 CW:150 Goal Weight Achieved
4 childfree 2962 female 23/F/Furbabies 4 lyfe
5 xxketo 2356 female 26/F/5'3" | SW-156 | CW-148 | GW-120
6 childfree 2038 male 28/M/Alberta
7 LGBTeens 1657 male 18/M/NY
8 loseit 1326 female 26/F/5'6'' SW:248 CW:202 GW:154
9 Tinder 1261 male 24/M/S

The following subreddits have greater than 1000 male or female users with A/S/L flair:

27M

After looking more carefully at the A/S/L results, I noticed some subreddits use flair of the format "27M", "M 27" or just an "M" within a string of loosely structured text. We use \p{L} in the regex to not match any Unicode letters (otherwise things like "Bayern München" match the regex). Apostrophes must also be excluded from our delimiters or the "m" will be matched from the word "I'm". A period and dollar sign must be excluded because things like "1.71m" or "$1.5M" could be matched.

In [8]:
%%bigquery
SELECT *
FROM (
    SELECT
      subreddit,
      REGEXP_EXTRACT(
        LOWER(author_flair_text),
        "(?:^|[^\\p{L}0-9'\\.\\$])\\s*(?:\\d\\d)?\\s*(f|m)\\s*(?:\\d\\d)?\\s*(?:$|[^\\p{L}0-9'\\.])") AS asl,
      COUNT(*) AS num,
      ANY_VALUE(author_flair_text) as author_flair_text
    FROM
      flair.userflair
    GROUP BY
      subreddit,
      asl
    ORDER BY num DESC)
WHERE
    asl IS NOT NULL
    AND num > 1000;
Out[8]:
subreddit asl num author_flair_text
0 keto m 31974 40/M/6'2" | SW: 255 | CW: 184 | GW: 185
1 loseit f 27250 28F 5'7" | SW 212.6 | CW 202.6 | GW 147
2 keto f 22952 27/f/5'3|SW:198.5|CW:161.6|GW:135
3 OkCupid m 15235 22/M gay af
4 pokemontrades m 14531 2595-5042-4008 || Joshua (M)
5 loseit m 13804 M27/6'3"/SW 410 CW 360 GW 240
6 xxketo f 13730 F/27/5'4" | SW 161 | CW 150 | GW 125
7 medicalschool m 8100 M-1
8 proED f 7644 4'11| CW 104 | GW: 100| 20F
9 OkCupid f 6480 21/F/NY
10 childfree f 6343 32/F/furbabies
11 LGBTeens m 5192 16 | Gay | M | IL
12 CFB m 4903 Texas A&amp;M Aggies
13 childfree m 4101 29/M/Upstate NY and Single
14 Pokemongiveaway m 3195 0319-1418-1967 | Slaapzak (M)
15 spikes m 2820 S: In Flux | M: Living End
16 SVExchange m 2611 4270-3630-0138 || Ryan (UM, M, αS, X) || 2507
17 fatlogic f 2319 27F 5'3" | SW: 151 | CW: 125 | GW: 115
18 MechanicalKeyboards m 2238 93 Model M | QFR MX Blues
19 CompetitiveWoW m 2027 4/10M Windwalker
20 financialindependence m 1890 32m/37k :: 52m/37k/1m (2017 USD / 2037 ER)
21 rawdenim f 1889 N&amp;F E3 WG / UB 101 / OldBlueCo 14.5oz Extr...
22 Tinder m 1755 20/M
23 infj m 1705 INFJ, 31M
24 100DaysofKeto f 1484 31F | 5'3'' | SW: 151.2 | CW: 146.8 | GW1: 130
25 headphones m 1359 ATH M50, HiFiMAN RE-400
26 infj f 1323 INFJ, 33, f
27 LetItDie f 1224 30F
28 infertility f 1040 31F, Severe MFI, Secondary Infertitlity, TTC 1...
29 fatlogic m 1015 M | 6'0 | Sep 2014 - 350 | CW 223

This new regex should capture everything the A/S/L regex did, but will also accept things like "27M or "M". It has some false positives, but it actually works fairly well if we limit it to the following subredits:

Collect Users

Now that we have identified queries to find gendered flair, let's collect a list of all users and their identified gender into a table.

In [9]:
%%bigquery --limit=10 --destinationtable=flair.usersubredditgenders
SELECT DISTINCT *
FROM (
  -- male/female
  SELECT
    subreddit,
    author,
    REGEXP_EXTRACT(
      LOWER(author_flair_css_class),
      '(?:fe)?male') AS gender
  FROM
    flair.userflair
  WHERE
    subreddit IN (
      'AskMen',
      'AskWomen',
      'AskMenOver30',
      'AskWomenOver30',
      'sexover30')
  UNION ALL
  -- pink/blue
  SELECT
    subreddit,
    author,
    CASE
      WHEN author_flair_css_class = 'blue' THEN 'male'
      WHEN author_flair_css_class = 'pink' THEN 'female'
    END AS gender
  FROM
    flair.userflair
  WHERE
    subreddit IN (
      'tall',
      'short')
  UNION ALL
  -- A/S/L
  SELECT
    subreddit,
    author,
    CASE
      WHEN asl = 'm' THEN 'male'
      WHEN asl = 'f' THEN 'female'
    END AS gender
  FROM (
    SELECT
      subreddit,
      author,
      REGEXP_EXTRACT(
        LOWER(author_flair_text),
        "(?:^|[^\\p{L}0-9'\\.\\$])\\s*(?:\\d\\d)?\\s*(f|m)\\s*(?:\\d\\d)?\\s*(?:$|[^\\p{L}0-9'\\.])") AS asl
    FROM
      flair.userflair
    WHERE
      subreddit IN (
        'OkCupid',
        'keto',
        'childfree',
        'xxketo',
        'LGBTeens',
        'loseit',
        'Tinder',
        'proED',
        'fatlogic',
        'financialindependence',
        'infj',
        'infertility',
        '100DaysofKeto')) )
WHERE
  gender IS NOT NULL;
Out[9]:
subreddit author gender
0 infj AsianSwagger male
1 infj galifanasana male
2 infj P-mac male
3 infj Yoshimotso male
4 infj Thunder_54 male
5 infj TeenageMutant1 male
6 infj Shannester male
7 infj TheAmberbrew male
8 infj ejgamer13 male
9 infj NotVeryWellFriend male

For various reasons, users may report different genders over time or in different subreddits.

Let's create a new table with only users and their gender and omit users that report multiple genders, otherwise these users would be counted twice in our final analysis.

In [10]:
%%bigquery --limit=10 --destinationtable=flair.users
SELECT
  author,
  ANY_VALUE(gender) as gender
FROM (
  SELECT
    author,
    gender
  FROM
    flair.usersubredditgenders
  GROUP BY
    author,
    gender )
GROUP BY
  author
HAVING COUNT(*) = 1;
Out[10]:
author gender
0 fisherpriceman male
1 alltimeisrelative male
2 Betrubnis male
3 VoicelikeHoney male
4 Pathomann male
5 JohnDylena male
6 SpookyWagons male
7 FerrisTriangle male
8 Jerry132 male
9 Juswantedtono male

Analysis

Now we can do our analysis of subreddits using the genders of authors of comments.

We use the standard read_gbq() function from pandas-gbq now because we want its handling of types and indexes.

In [11]:
import pandas as pd
In [12]:
query="""
SELECT
  subreddit,
  subreddit_size,
  COUNT(CASE WHEN gender='male' THEN 1 END) AS male,
  COUNT(CASE WHEN gender='female' THEN 1 END) AS female
FROM
  flair.users
JOIN (
  SELECT DISTINCT
    subreddit,
    author,
    subreddit_size
  FROM
    `fh-bigquery.reddit_comments.201*`
  JOIN (
    SELECT
      subreddit,
      COUNT(*) AS subreddit_size
    FROM (
      SELECT DISTINCT
        subreddit,
        author
      FROM
        `fh-bigquery.reddit_comments.201*`)
    GROUP BY
      subreddit)
  USING
    (subreddit)
  JOIN
    flair.users
  USING
    (author))
USING
  (author)
GROUP BY
  subreddit,
  subreddit_size
ORDER BY
  subreddit_size DESC;"""
df = pd.read_gbq(
    query,
    PROJECT_ID,
    index_col='subreddit',
    verbose=False,
    dialect='standard')
df.head(20)
Out[12]:
subreddit_size male female
subreddit
AskReddit 6744464 86595 50572
funny 3689875 68965 30025
pics 3608386 68255 30453
gaming 2517109 57656 13114
videos 2448897 55836 17154
IAmA 2413052 58177 24672
todayilearned 2300584 59938 23739
WTF 2075080 53287 22445
aww 1956664 42377 26527
worldnews 1904960 48800 14189
gifs 1813010 44012 14308
AdviceAnimals 1765520 50610 20777
movies 1649322 44204 14338
news 1480259 40769 14347
Music 1465879 38535 12851
mildlyinteresting 1359446 34267 14909
politics 1337245 39622 11478
Showerthoughts 1330610 32930 12688
explainlikeimfive 1061782 35026 13598
leagueoflegends 1038240 13388 1825

Create columns for the ratio of male to female users.

In [13]:
df['male_ratio'] = df.male / (df.female + df.male)
df['female_ratio'] = df.female / (df.female + df.male)

Plotting results

Plot the gender ratio for the 100 subreddits with the most comments.

In [14]:
%matplotlib inline
from matplotlib.ticker import FuncFormatter
In [15]:
df.sort_values('subreddit_size', ascending=False, inplace=True)
plot = df[['male_ratio', 'female_ratio']].head(100)[::-1].plot(
    kind='barh',
    stacked=True,
    figsize=(8, 20),
    xlim=[0, 1],
    title='Subreddit Gender Ratios',
    legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.45, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')

Plot the gender ratio for the subreddits that are predominantly female. Only include subreddits that we have sampled at least 1000 users.

In [16]:
df_female = df[(df.male + df.female) > 1000][['male_ratio', 'female_ratio']]
plot = df_female.sort_values('female_ratio', ascending=False).head(100)[::-1].plot(
    kind='barh',
    stacked=True,
    figsize=(8, 20),
    xlim=[0, 1],
    title='Subreddit Gender Ratios of Predominantly Female Subreddits',
    legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.42, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')

Plot the gender ratio for the subreddits that are predominantly male. Only include subreddits that we have sampled at least 1000 users.

In [17]:
df_male = df[(df.male + df.female) > 1000][['male_ratio', 'female_ratio']]
plot = df_male.sort_values('male_ratio', ascending=False).head(100)[::-1].plot(
    kind='barh',
    stacked=True,
    figsize=(8, 20),
    xlim=[0, 1],
    title='Subreddit Gender Ratios of Predominantly Male Subreddits',
    legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.38, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')

Plot the total number of submitters and the number of users we actually sampled for each subreddit. Note that the number of male and female users barely even visible on the graph.

Or data is somewhat terrible. This is okay if our sampled users are somehow actually a perfect random sample. That isn't especially likely though.

In [18]:
df.sort_values('subreddit_size', ascending=False, inplace=True)
plot = df[['male', 'female', 'subreddit_size']].head(100)[::-1].plot(
    kind='barh',
    figsize=(8, 30),
    title='Subreddit Gender Ratios',
    legend=True,
)
plot.set_ylabel('Subreddit')
plot.set_xlabel('Submitters')
plot.get_xaxis().get_major_formatter().set_scientific(False)
plot.legend(loc='upper left', bbox_to_anchor=(-0.5, 1.005), fontsize=12)
plot.text(-0.45, -0.02, 'This is how bad the data is. I promise the total number of male and female users that we have data for are actually on this plot.', fontsize=10,
          horizontalalignment='left',
          verticalalignment='top',
          transform=plot.axes.transAxes)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')
plot.get_legend().get_texts()[2].set_text('Total Commenters')

Interactive chart

Now let's save the data to a CSV file so we can create an interactive chart using D3.js.

In [19]:
df[(df.female + df.male) > 100].sort_values('subreddit_size', ascending=False).to_csv('subreddit_genders.csv')

Explore the completed interactive chart.