Gender Guessing Methods

Now that we've got our author data set and inferences for genders, it's time to do some exploratory data analysis. I'm going to do this in Julia. I made two functions, importauthors() and getgenderprob(), that take the CSV files from write_names_to_file (found in xml parsing) and create Julia DataFrames.

In [2]:
include("../src/dataimport.jl")
WARNING: Method definition importauthors(String, String) in module Main at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:5 overwritten at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:5.
WARNING: Method definition getgenderprob(DataFrames.DataFrame, String, Symbol) in module Main at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:14 overwritten at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:14.
Out[2]:
getgenderprob (generic function with 1 method)
In [3]:
using DataFrames

bio = importauthors("../data/pubdata/bio.csv", "bio")
comp = importauthors("../data/pubdata/comp.csv", "comp")

comp[1:5, :]
Out[3]:
Row | ID | Date | Journal | Author_First_Name | Author_Last_Name | Author_Initials | Position | Title | Dataset
1 | 26605382 | 2015-11-24 | IEEE/ACM Trans Comput Biol Bioinform | Yufei | Huang | NA | first | Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012). | comp
2 | 26605382 | 2015-11-24 | IEEE/ACM Trans Comput Biol Bioinform | Xiaoning | Qian | NA | last | Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012). | comp
3 | 26605382 | 2015-11-24 | IEEE/ACM Trans Comput Biol Bioinform | Yidong | Chen | NA | second | Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012). | comp
4 | 26357062 | 2015-09-11 | IEEE/ACM Trans Comput Biol Bioinform | Haiying | Wang | NA | first | Organized Modularity in the Interactome: Evidence from the Analysis of Dynamic Organization in the Cell Cycle. | comp
5 | 26357062 | 2015-09-11 | IEEE/ACM Trans Comput Biol Bioinform | Huiru | Zheng | NA | last | Organized Modularity in the Interactome: Evidence from the Analysis of Dynamic Organization in the Cell Cycle. | comp

Now let's combine all the data - we can subset it again later. I'm also going to clear the bio and comp variables to free up some memory.

We'll also use the getgenderprob() function to add columns for the probability that the author is female (P) using the different APIs and the number of times that name showed up in the respective database, which gives us some sense of how certain we can be in the result (Count).

Finally, we'll use pool!, which makes the representation of factored data (data that has distinct rather than continuous values) a bit more efficient in memory (and will make queries faster later on).

In [4]:
alldata = vcat(bio, comp)
bio = 0
comp = 0

alldata[:izeP], alldata[:izeCount] = getgenderprob(
    alldata, "../data/genders/genderize_genders.json", :Author_First_Name)
alldata[:apiP], alldata[:apiCount] = getgenderprob(
    alldata, "../data/genders/genderAPI_genders.json", :Author_First_Name)

pool!(alldata)
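
To make the idea of pooling concrete, here is a minimal sketch in plain Julia (no DataFrames). It is not how pool! is implemented; pool_values is a hypothetical helper that only shows the principle: each distinct value is stored once, and every row keeps a small integer reference to it.

```julia
# Toy illustration of pooling. `pool_values` is a hypothetical helper,
# not part of DataFrames.
function pool_values(values::Vector{String})
    pool = unique(values)                           # distinct values, in first-seen order
    index = Dict(v => i for (i, v) in enumerate(pool))
    refs = UInt8[index[v] for v in values]          # one byte per row (assumes <= 255 distinct values)
    return pool, refs
end

dataset = ["bio", "bio", "comp", "bio", "comp"]
pool, refs = pool_values(dataset)

# The original column can be rebuilt from the pool and the references.
rebuilt = pool[refs]
```

A query such as "all rows where Dataset is comp" then only has to compare small integers rather than strings, which is roughly why pooled columns tend to be cheaper to store and to filter.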

In Julia, we can subset our dataframes pretty easily. For example, we can pull back out rows for our different datasets.

In [5]:
alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal

biodata = alldata[alldata[:Dataset] .== "bio", :] # get all columns for rows where the Dataset column is "bio"
compdata = alldata[alldata[:Dataset] .== "comp", :]

biodata[1:5, :] # get the first 5 rows, and all columns
Out[5]:
Row | ID | Date | Journal | Author_First_Name | Author_Last_Name | Author_Initials | Position | Title | Dataset | izeP | izeCount | apiP | apiCount
1 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suwit | Chotinun | NA | first | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.0 | 2 | 0.020000000000000018 | 306
2 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Prapas | Patchanee | NA | last | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.0 | 1 | 0.030000000000000027 | 58
3 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suvichai | Rojanasthien | NA | second | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | NA | 0 | NA | 0
4 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Pakpoom | Tadee | NA | penultimate | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.0 | 1 | 0.020000000000000018 | 116
5 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Fred | Unger | NA | other | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.020000000000000018 | 966 | 0.040000000000000036 | 49394

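Now we're going to take a look using the plotting package Plots, which allows us to use different plotting backends (here GR, selected with gr()). It is loaded through StatPlots, which builds on Plots and provides groupedbar. First, we need to reshape the data a little bit to make it easier to plot.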

In [6]:
using StatPlots
gr()

izemeans = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:izeP]))))
apimeans = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:apiP]))))

izemeans[:method] = "genderize"
apimeans[:method] = "genderAPI"

allmeans = vcat(izemeans, apimeans)
Out[6]:
Row | Dataset | MeanPF | method
1 | bio | 0.35246780033729036 | genderize
2 | comp | 0.3191075321651934 | genderize
3 | bio | 0.34048754243415824 | genderAPI
4 | comp | 0.29633429724211385 | genderAPI
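
The grouping step in the cell above boils down to "split the rows by dataset, drop the unscored names, and average what is left". A minimal sketch of that logic in plain current Julia follows; the rows are made up for illustration, and `missing` stands in for the NA values that the older DataFrames API used in this notebook shows.

```julia
using Statistics

# Hypothetical rows: (dataset, probability that the author is female).
# `missing` marks names the service could not score.
rows = [("bio", 0.9), ("bio", missing), ("bio", 0.1),
        ("comp", 0.2), ("comp", 0.4), ("comp", missing)]

# Collect the non-missing probabilities for each dataset.
groups = Dict{String, Vector{Float64}}()
for (dataset, p) in rows
    p === missing && continue                    # same role as dropna
    push!(get!(groups, dataset, Float64[]), p)
end

# One mean per dataset, analogous to the MeanPF column.
meanpf = Dict(k => mean(v) for (k, v) in groups)
```

Note that names without a guess are simply left out of the average, so MeanPF describes only the authors each service could score.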

The rather complicated expression below makes a matrix, with one column per dataset and one row per method, that can be passed to groupedbar to make the plot.

In [7]:
ys = hcat([allmeans[allmeans[:Dataset] .== x, :MeanPF] for x in levels(allmeans[:Dataset])]...)

groupedbar(ys, bar_position=:dodge, 
    ylims=(0,1), xticks=([1,2],["genderize", "genderAPI"]),
            lab=["Bio", "comp"],
            xlabel="Gender Calling Method",
            ylabel="Percent Female",
            title="Proportion of Female Authors")
Out[7]:
[Figure: grouped bar chart, "Proportion of Female Authors". x-axis: Gender Calling Method (genderize, genderAPI); y-axis: Percent Female (0.0 to 1.0); series: Bio, comp.]
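
The reshaping expression used for the plot is easier to follow on a toy example. The sketch below uses two plain vectors as a hypothetical stand-in for the Dataset and MeanPF columns (rounded, made-up values) and applies the same hcat-over-a-comprehension pattern.

```julia
# Hypothetical stand-in for the allmeans table: parallel vectors, one entry per row.
dataset = ["bio", "comp", "bio", "comp"]
meanpf  = [0.35, 0.32, 0.34, 0.30]

# For each dataset, pick out its values (one per method), then glue the
# resulting vectors together as columns of a matrix.
ys = hcat([meanpf[dataset .== d] for d in unique(dataset)]...)

# ys is 2×2: rows are methods (genderize, genderAPI), columns are datasets (bio, comp).
```

Each column of ys becomes one series (one bar color), and each row becomes one group on the x-axis, which is why the tick labels are the two methods.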
In [8]:
genderize_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:izeP])))
genderapi_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:apiP])))
Out[8]:
Row | Position | Dataset | x1
1 | first | bio | 0.37557475101349735
2 | first | comp | 0.316338658146965
3 | last | bio | 0.24458769214547696
4 | last | comp | 0.20729178007621116
5 | other | bio | 0.3680884734668387
6 | other | comp | 0.33070648163576216
7 | penultimate | bio | 0.2793692129629631
8 | penultimate | comp | 0.23617388670099979
9 | second | bio | 0.378590785907859
10 | second | comp | 0.32179097154072633
In [9]:
ys = hcat([genderize_byposition[genderize_byposition[:Dataset] .== x, :x1] for x in levels(genderize_byposition[:Dataset])]...)

groupedbar(ys, bar_position=:dodge, 
            ylims=(0,0.6), xticks=(1:5,levels(genderize_byposition[:Position])),
            lab=levels(genderize_byposition[:Dataset]),
            xlabel="Author Position",
            ylabel="Percent Female",
            title="By Position, Genderize.io")
Out[9]:
[Figure: grouped bar chart, "By Position, Genderize.io". x-axis: Author Position (first, last, other, penultimate, second); y-axis: Percent Female (ticks 0.0 to 0.5); series: bio, comp.]
In [10]:
ys = hcat([genderapi_byposition[genderapi_byposition[:Dataset] .== x, :x1] for x in levels(genderapi_byposition[:Dataset])]...)

groupedbar(ys, bar_position=:dodge, 
            ylims=(0,0.6), xticks=(1:5,levels(genderapi_byposition[:Position])),
            lab=levels(genderapi_byposition[:Dataset]),
            xlabel="Author Position",
            ylabel="Percent Female",
            title="By Position, GenderAPI")
Out[10]:
[Figure: grouped bar chart, "By Position, GenderAPI". x-axis: Author Position (first, last, other, penultimate, second); y-axis: Percent Female (ticks 0.0 to 0.5); series: bio, comp.]

The good news:

  • both of the gender inferences look pretty similar.
  • this recapitulates previously published data that:
    • Women are less likely to be authors than men
    • Women are less likely to be first authors than second authors
    • Women are less likely to be last authors than first authors

The bad news - this recapitulates previously published data that women are under-represented in biology publishing.

New finding: It seems to be worse in computational biology than in all of biology, though not by as much as I expected.

Methods discrepancies

Using genderize, it looks like women are better represented than when using genderAPI. Which one is better?

A couple of things to consider:

  1. how many of our names can the service guess?
  2. what proportion of authors can the service guess? (this is a different question)
  3. for names that can be guessed, how certain can we be that the gender assignment is correct?
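
The difference between questions 1 and 2 is easy to miss, so here is a small illustration with made-up data (the names and the set of "known" names are hypothetical, not taken from either service). A service that knows only a few very common names can cover a small share of distinct names but a large share of authors.

```julia
# Hypothetical author list: one entry per author, so names repeat.
authors = ["Maria", "Maria", "Maria", "John", "John", "Xiuying", "Q"]

# Hypothetical set of names that a service can score.
known = Set(["Maria", "John"])

# Question 1: fraction of distinct names the service can guess.
uniquenames = unique(authors)
name_coverage = count(in(known), uniquenames) / length(uniquenames)

# Question 2: fraction of author entries the service can guess.
author_coverage = count(in(known), authors) / length(authors)
```

Here the service covers 2 of 4 distinct names but 5 of 7 authors, because the names it knows are the frequent ones.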

To do this, we'll start by reshaping our dataframe to show the stats for each name, including the number of times they show up.

In [88]:
names = by(biodata, [:Author_First_Name, :izeP, :apiP], df -> DataFrame(
                                                    izeCount = mean(df[:izeCount]), 
                                                    apiCount = mean(df[:apiCount]),
                                                    Frequency = length(df[:izeCount])
                                                    )
            )

names[1:5, :]
Out[88]:
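Row | Author_First_Name | izeP | apiP | izeCount | apiCount | Frequency
1 | 'Azlin | NA | NA | 0.0 | 0.0 | 1
2 | A | 0.41000000000000003 | NA | 56.0 | 0.0 | 18676
3 | A-A | NA | NA | 0.0 | 0.0 | 1
4 | A-B | NA | NA | 0.0 | 0.0 | 5
5 | A-C | NA | NA | 0.0 | 0.0 | 15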

One difference that's immediately apparent is that Genderize has guesses for initials, while genderAPI doesn't.

In [90]:
initials = names[map(x->length(x), names[:Author_First_Name]) .== 1, :]

initials[1:26, :]
Out[90]:
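Row | Author_First_Name | izeP | apiP | izeCount | apiCount | Frequency
1 | A | 0.41000000000000003 | NA | 56.0 | 0.0 | 18676
2 | B | 0.35 | NA | 40.0 | 0.0 | 6719
3 | C | 0.26 | NA | 27.0 | 0.0 | 12175
4 | D | 0.20999999999999996 | NA | 43.0 | 0.0 | 10180
5 | E | 0.14 | NA | 21.0 | 0.0 | 7615
6 | F | 0.32999999999999996 | NA | 6.0 | 0.0 | 6046
7 | G | 0.17000000000000004 | NA | 23.0 | 0.0 | 8221
8 | H | 0.15000000000000002 | NA | 13.0 | 0.0 | 7365
9 | I | NA | NA | 0.0 | 0.0 | 3682
10 | J | 0.17000000000000004 | NA | 122.0 | 0.0 | 20916
11 | K | 0.37 | NA | 49.0 | 0.0 | 8427
12 | L | 0.58 | NA | 24.0 | 0.0 | 8543
13 | M | 0.26 | NA | 158.0 | 0.0 | 24032
14 | N | 0.43999999999999995 | NA | 9.0 | 0.0 | 5327
15 | O | 0.10999999999999999 | NA | 9.0 | 0.0 | 1690
16 | P | 0.32999999999999996 | NA | 6.0 | 0.0 | 9894
17 | Q | 0.0 | NA | 1.0 | 0.0 | 394
18 | R | 0.20999999999999996 | NA | 42.0 | 0.0 | 12106
19 | S | 0.32999999999999996 | NA | 42.0 | 0.0 | 15745
20 | T | 0.12 | NA | 73.0 | 0.0 | 7693
21 | U | 0.5 | NA | 2.0 | 0.0 | 752
22 | V | 0.75 | NA | 4.0 | 0.0 | 3662
23 | W | 0.29000000000000004 | NA | 7.0 | 0.0 | 4169
24 | X | NA | NA | 0.0 | 0.0 | 1147
25 | Y | NA | NA | 0.0 | 0.0 | 3854
26 | Z | 0.4 | NA | 5.0 | 0.0 | 1283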

Another thing that you might have noticed is that there are a lot of names that come through as initials, so this might make a pretty significant difference.

In [91]:
# 1. how many of our names can the service guess?

println("Gender-API:   $(length(names[names[:apiCount] .!= 0, :Author_First_Name]) / length(names[:Author_First_Name]))")
println("Genderize.io: $(length(names[names[:izeCount] .!= 0, :Author_First_Name]) / length(names[:Author_First_Name]))")
Gender-API:   0.5701339901214076
Genderize.io: 0.3754802093511987

So it looks like GenderAPI can guess a lot more of the unique names, but this doesn't take into consideration how many times each name shows up. Maybe Genderize has a lot more of the more frequent names.

In [92]:
# 2. what proportion of authors can the service guess (this is a different question)

println("Gender-API:   $(length(biodata[biodata[:apiCount] .!= 0, :Author_First_Name]) / length(biodata[:Author_First_Name]))")
println("Genderize.io: $(length(biodata[biodata[:izeCount] .!= 0, :Author_First_Name]) / length(biodata[:Author_First_Name]))")
Gender-API:   0.7339417358360227
Genderize.io: 0.8704293357223296

Here it seems that genderize has the upper hand. Then again, remember all of those names that are just initials. What if we take those out of the mix?

In [93]:
println("Gender-API:   $(length(biodata[(biodata[:apiCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :Author_First_Name]) / length(biodata[:Author_First_Name]))")
println("Genderize.io: $(length(biodata[(biodata[:izeCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :Author_First_Name]) / length(biodata[:Author_First_Name]))")
Gender-API:   0.7339255453495526
Genderize.io: 0.6890671041695899

It looks like most of genderize's advantage comes from the fact that it's guessing on initials. It's unclear whether we should include these names. There's evidence that women are more likely to use initials when publishing than men. At the same time, most of genderize's guesses for gender based on initials skew towards male. The combination of these things would lead me to expect genderize to underpredict female authorship, yet we saw above that the genderize guesses skew female compared to GenderAPI.

What happens to the genderize data when we drop the initials?

In [94]:
# Bio data, first author
println(mean(dropna(biodata[biodata[:Position] .== "first", :izeP])))
println(mean(dropna(biodata[(biodata[:Position] .== "first") & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeP])))
0.3817953047613131
0.40951634035036066
In [95]:
# Comp data, first author
println(mean(dropna(compdata[compdata[:Position] .== "first", :izeP])))
println(mean(dropna(compdata[(compdata[:Position] .== "first") & (map(x->length(x), compdata[:Author_First_Name]) .!= 1), :izeP])))
0.34289434889434894
0.3481651843200671
In [96]:
# Bio data, last author
println(mean(dropna(biodata[biodata[:Position] .== "last", :izeP])))
println(mean(dropna(biodata[(biodata[:Position] .== "last") & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeP])))
0.26790239032911917
0.26300646260127664
In [97]:
# Comp data, last author
println(mean(dropna(compdata[compdata[:Position] .== "last", :izeP])))
println(mean(dropna(compdata[(compdata[:Position] .== "last") & (map(x->length(x), compdata[:Author_First_Name]) .!= 1), :izeP])))
0.22832719728845252
0.2232962737496396

So, it doesn't look like leaving in the predictions based on initials is substantially swaying the predictions one way or the other. What else might explain the difference?

In [99]:
# 3. for names that can be guessed, how certain can we be that the gender assignment is correct?

println("Gender-API:            $(mean(biodata[biodata[:apiCount] .!= 0, :apiCount]))")
println("Genderize.io:          $(mean(biodata[biodata[:izeCount] .!= 0, :izeCount]))")

# Excluding initials
println("Genderize no initials: $(mean(biodata[(biodata[:izeCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeCount]))")
Gender-API:            40342.13820483596
Genderize.io:          1342.4970549088985
Genderize no initials: 1680.9153182435255
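
One rough way to see why these counts matter: if each database entry for a name is treated as an independent observation (a simplifying assumption made here for illustration, not something either service documents), the uncertainty of the estimated probability shrinks with the square root of the count. The helper stderr_prop below is hypothetical.

```julia
# Standard error of an estimated proportion p based on n observations,
# under the simplifying assumption of independent draws.
stderr_prop(p, n) = sqrt(p * (1 - p) / n)

se_small = stderr_prop(0.3, 10)       # a name seen 10 times
se_large = stderr_prop(0.3, 40_000)   # a name seen 40,000 times
```

Under that assumption, a guess backed by tens of thousands of entries is far more tightly constrained than one backed by a handful, even when both report the same probability.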
In [105]:
n = names[(names[:apiCount] .> 0) & (names[:izeCount] .> 0), :]
n[1:5, :]
Out[105]:
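Row | Author_First_Name | izeP | apiP | izeCount | apiCount | Frequency
1 | Aabha | 1.0 | 0.99 | 11.0 | 76.0 | 1
2 | Aabid | 0.0 | 0.020000000000000018 | 1.0 | 236.0 | 2
3 | Aad | 0.0 | 0.050000000000000044 | 2.0 | 660.0 | 26
4 | Aaditya | 0.0 | 0.010000000000000009 | 4.0 | 712.0 | 3
5 | Aafia | 1.0 | 1.0 | 1.0 | 22.0 | 1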
In [111]:
scatter(n, :Frequency, :izeCount, 
    lab="genderize.io", α=0.5,
    yaxis=("Count", :log10),
    xaxis=("Name Frequency", :log10))
scatter!(n, :Frequency, :apiCount, lab="genderAPI", α=0.5)
Out[111]:
[Figure: scatter plot on log-log axes. x-axis: Frequency (10^0 to 10^4); y-axis: apiCount (10^0 to 10^5).]