Assign country codes to affiliations

We implemented affiliation extraction in the pubmedpy Python package for both PubMed and PMC XML records. These methods extract a sequence of textual affiliations for each author. Although, ideally, each affiliation record would refer to one and only one research organization, sometimes journals deposit multiple affiliations in a single structured affiliation. For example, we extracted the following composite affiliation for all authors of PMC4147893:

'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.

We designed a method for extracting countries from affiliations that accommodates multiple countries. We relied on two Python utilities to extract countries from text: geotext and geopy.geocoders.Nominatim. The first, geotext, uses regular expressions to find mentions of places from the GeoNames database. In the above text, geotext detected four mentions of places in Germany: Saarland, Saarbrücken, Saarland, Germany. Whenever geotext identified two or more mentions of a country, we labeled the affiliation as including that country.

geopy.geocoders.Nominatim converts names and addresses to geographic coordinates using OpenStreetMap’s Nominatim service. We split textual affiliations by punctuation and, working through the segments in reverse order, took the first one that returned any Nominatim search results. For the above affiliation, the search order was “France”, “Paris”, “Laboratory of Computational and Quantitative Biology”, et cetera. Because Nominatim returns a match for “France”, no further queries are made. When a match was found, we extracted the country containing the location. This approach returns a single country for an affiliation when successful. When labeling affiliations with countries, we used the Nominatim result only when geotext returned no countries, or when geotext returned only singly mentioned countries and the Nominatim country was among them.
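
The search order described above can be reproduced without any network access. The following is a minimal sketch: the two regular expressions are copied from this notebook's code, while `search_order` is a hypothetical helper introduced only for illustration. It lists the segments in the order they would be sent to Nominatim and skips segments that look like email addresses.

```python
import re

# Patterns copied from the notebook's code
email_pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
split_pattern = re.compile(r"; |, |\. ")


def search_order(affiliation):
    """
    Hypothetical helper: return the segments of an affiliation in the
    order they would be queried (last segment first), dropping segments
    that look like email addresses. No geocoding is performed.
    """
    parts = split_pattern.split(affiliation)
    candidates = (part.rstrip(".") for part in reversed(parts))
    return [part for part in candidates if not email_pattern.fullmatch(part)]


# Tail of the composite affiliation quoted above
text = "CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France."
order = search_order(text)
print(order[:3])
# ['France', 'Paris', 'Laboratory of Computational and Quantitative Biology']

# An affiliation ending in an email address: the email segment is skipped
with_email = '"Cephalogenetics" Genetic Center, Athens, Greece. cyapi@med.uoa.gr.'
print(search_order(with_email))
# ['Greece', 'Athens', '"Cephalogenetics" Genetic Center']
```

In the real lookup, iteration stops at the first segment for which Nominatim returns a result, so only "France" would be queried for the first example.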

In [1]:

import pathlib
import functools
import re
import lzma

import backoff
import geotext
import geopy.geocoders
import jsonlines
import pandas
import ratelimit
import tqdm.notebook

In [2]:

# read PubMed Central affiliations
affil_pmc_df = pandas.read_csv("data/pmc/affiliations.tsv.xz", sep='\t')
assert affil_pmc_df.affiliation.notna().all()
affil_pmc_df.head(2)

Out[2]:

	pmcid	position	affiliation
0	PMC100321	1	1 University of Cologne, Institute of Genetics...
1	PMC100321	2	1 University of Cologne, Institute of Genetics...

In [3]:

# read PubMed affiliations
affil_pm_df = pandas.read_csv("data/pubmed/affiliations.tsv.xz", sep='\t')
assert affil_pm_df.affiliation.notna().all()
affil_pm_df.head(2)

Out[3]:

	pmid	position	affiliation
0	7477412	1	Dept. of Pathology, Cornell Medical College, N...
1	7479891	1	National Center for Human Genome Research, Nat...

In [4]:

# combine affiliations from PubMed Central and PubMed
affiliations = sorted(set(affil_pmc_df.affiliation) | set(affil_pm_df.affiliation))
print(f"{len(affiliations):,} unique affiliation strings")
affiliations[:5]

446,551 unique affiliation strings

Out[4]:

['"Athena" Research and Innovation Center, Athens, 15125, Greece.',
 '"Athena" Research and Innovation Center, Athens, 15125, Greece. kzagganas@uop.gr.',
 '"Biology of Spirochetes" Unit, Institut Pasteur, 28 Rue Du Docteur Roux, 75724, Paris Cedex 15, France, mathieu.picardeau@pasteur.fr.',
 '"Cephalogenetics" Genetic Center, Athens, Greece.',
 '"Cephalogenetics" Genetic Center, Athens, Greece. cyapi@med.uoa.gr.']

In [5]:

def get_countries_geotext(text):
    """
    Use the geotext package to get country mentions.
    The returned counts dict maps from country code
    to number of places located in that country in the text.

    For example, the following code detects GR twice, first for
    "Athens" and second for "Greece":

    ```
    >>> text = '"Athena" Research and Innovation Center, Athens, 15125, Greece.'
    >>> get_countries_geotext(text)
    {'GR': 2}
    ```

    See https://github.com/elyase/geotext
    """
    geo_text = geotext.GeoText(text)
    return dict(geo_text.country_mentions)

geopy.geocoders.options.default_user_agent = 'https://github.com/greenelab/iscb-diversity'
geolocator = geopy.geocoders.Nominatim(timeout=5)

@functools.lru_cache(maxsize=200_000)
@backoff.on_exception(backoff.expo, ratelimit.RateLimitException, max_tries=8)
@ratelimit.limits(calls=1, period=1)
def _geocode(text):
    """
    https://operations.osmfoundation.org/policies/nominatim/
    """
    return geolocator.geocode(text, addressdetails=True)

email_pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
split_pattern = re.compile(r"; |, |\. ")

def get_country_geocode(affiliation):
    """
    Look up country using OSM Nominatim.
    Splits affiliations text into parts by punctuation,
    searching for country matches in reverse. Parts that
    look like email addresses are discarded.
    """
    parts = split_pattern.split(affiliation)
    for part in reversed(parts):
        part = part.rstrip(".")
        if email_pattern.fullmatch(part):
            continue
        location = _geocode(part)
        if not location:
            continue
        try:
            return location.raw['address']['country_code'].upper()
        except KeyError:
            return None


def query_affiliation(affiliation: str):
    return dict(
        affiliation=affiliation,
        country_geocode=get_country_geocode(affiliation),
        countries_geotext=get_countries_geotext(affiliation),
    )
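
A note on the decorator order used for `_geocode` above: because `functools.lru_cache` is the outermost decorator, a repeated segment such as "France" is answered from the cache and never reaches the rate limiter or the remote service. The sketch below illustrates that behavior with a hypothetical stand-in, `fake_geocode`, which is not the Nominatim API and returns made-up lookups from a small dict.

```python
import functools

calls = []  # records each time the stand-in lookup actually runs


@functools.lru_cache(maxsize=None)
def fake_geocode(text):
    # Hypothetical stand-in for a remote lookup; not a real geocoder.
    calls.append(text)
    return {"France": "FR", "Germany": "DE"}.get(text)


segments = ["France", "Germany", "France", "France"]
codes = [fake_geocode(segment) for segment in segments]
print(codes)  # ['FR', 'DE', 'FR', 'FR']
print(calls)  # ['France', 'Germany']
```

Only two of the four lookups execute the function body; the other two are cache hits.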

In [6]:

# read already queried affiliations
path = pathlib.Path('data/affiliations/geocode.jsonl.xz')
lines = jsonlines.Reader(lzma.open(path, "rt")) if path.exists() else []
existing = {row['affiliation'] for row in lines}
new = sorted(set(affiliations) - existing)
print(f"{len(affiliations):,} total affiliations: {len(existing):,} already queried, {len(new):,} new")

446,551 total affiliations: 475,120 already queried, 0 new

In [7]:

# query new affiliations and append to JSON Lines file
with lzma.open(path, mode='at') as write_file:
    with jsonlines.Writer(write_file) as writer:
        for affiliation in tqdm.notebook.tqdm(new):
            result = query_affiliation(affiliation)
            writer.write(result)

In [8]:

# Read the jsonlines file
with jsonlines.Reader(lzma.open(path, "rt")) as reader:
    lines = list(reader)
# Show a single line
lines[10]

Out[8]:

{'affiliation': "'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.",
 'country_geocode': 'FR',
 'countries_geotext': {'DE': 4, 'FR': 4, 'US': 2},
 'countries': ['DE', 'FR', 'US']}

In [9]:

query_affiliation("Laboratory of Computational and Quantitative Biology, Paris, France")

Out[9]:

{'affiliation': 'Laboratory of Computational and Quantitative Biology, Paris, France',
 'country_geocode': 'FR',
 'countries_geotext': {'FR': 2}}

In [10]:

def get_consensus_countries(line: dict) -> list:
    """
    Get a list of countries resulting from a consensus algorithm
    between countries_geotext and country_geocode.
    """
    country_geotext: dict = line['countries_geotext']
    country_geocode: str = line['country_geocode']
    if not country_geotext:
        # geotext empty, so use geocode country or nothing
        return [country_geocode] if country_geocode else []
    countries_gte_2 = {
        country: count for country, count in
        country_geotext.items() if count >= 2
    }
    if countries_gte_2:
        # countries with multiple mentions according to geotext
        return list(countries_gte_2)
    if country_geocode and country_geocode in country_geotext:
        # geocode country matches a geotext country
        return [country_geocode]
    return list(country_geotext)
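
To make the branches of the consensus rule concrete, here is a small self-contained sketch. The function `consensus` is a compact restatement of `get_consensus_countries` written only for illustration, and every input except the DE/FR/US counts (taken from the example record in this notebook) is hypothetical.

```python
def consensus(countries_geotext, country_geocode):
    """Illustrative restatement of the consensus rule; inputs are plain values."""
    if not countries_geotext:
        # geotext found nothing: fall back to the geocode country, if any
        return [country_geocode] if country_geocode else []
    repeated = [c for c, n in countries_geotext.items() if n >= 2]
    if repeated:
        # keep every country that geotext mentioned at least twice
        return repeated
    if country_geocode and country_geocode in countries_geotext:
        # only single mentions: let the geocode country break the tie
        return [country_geocode]
    # no tie-breaker available: keep all singly mentioned countries
    return list(countries_geotext)


# Counts from the composite affiliation in this notebook
print(consensus({"DE": 4, "FR": 4, "US": 2}, "FR"))  # ['DE', 'FR', 'US']

# Hypothetical inputs exercising the remaining branches
print(consensus({}, "FR"))                  # ['FR']
print(consensus({}, None))                  # []
print(consensus({"GB": 1, "US": 1}, "US"))  # ['US']
print(consensus({"GB": 1, "US": 1}, "CA"))  # ['GB', 'US']
print(consensus({"GB": 1, "US": 2}, "GB"))  # ['US']
```

The last case shows that once any country reaches two mentions, singly mentioned countries are dropped even if the geocode result agrees with them.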

In [11]:

for line in lines:
    line["countries"] = get_consensus_countries(line)

In [12]:

lines[10]

Out[12]:

{'affiliation': "'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.",
 'country_geocode': 'FR',
 'countries_geotext': {'DE': 4, 'FR': 4, 'US': 2},
 'countries': ['DE', 'FR', 'US']}

In [13]:

with lzma.open(path, mode='wt') as write_file:
    with jsonlines.Writer(write_file) as writer:
        writer.write_all(lines)

In [14]:

affil_df = pandas.DataFrame(lines)
affil_df = (
    affil_df
    .explode("countries")
    .rename(columns={"countries": "country"})
    .dropna(subset=['country'])
    [["affiliation", "country"]]
    .drop_duplicates()  # for safety, duplicates not expected
)
affil_df.head(2)

Out[14]:

	affiliation	country
0	"Athena" Research and Innovation Center, Athen...	GR
1	"Athena" Research and Innovation Center, Athen...	GR

In [15]:

len(lines)

Out[15]:

In [16]:

affil_df.shape

Out[16]:

(488554, 2)

In [17]:

# Most common countries
affil_df.country.value_counts(normalize=True).head(20).map("{:.02%}".format)

Out[17]:

US    30.66%
CN    13.03%
GB     6.89%
DE     6.08%
FR     3.95%
CA     3.11%
IT     3.11%
JP     2.89%
ES     2.84%
AU     2.62%
NL     1.87%
IN     1.69%
KR     1.67%
BR     1.54%
CH     1.32%
SE     1.26%
TW     1.10%
BE     1.01%
DK     0.96%
SG     0.69%
Name: country, dtype: object

In [18]:

# save affiliations to a table
affil_df.to_csv("data/affiliations/countries.tsv.xz", sep='\t', index=False)

In [ ]: