Asura Enkhbayar, 15.06.2020
This report contains a brief summary of the methodology and a few descriptive statistics to assess it. Furthermore, some statistics and plots are generated for the citations and altmetrics gathered for the current data.
from pathlib import Path
from IPython.display import Markdown as md
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib_venn import venn3
import pandas as pd
import seaborn as sns
import numpy as np
from tracking_grants import references_f, articles_f, wos_f, altmetric_f
# Seaborn styles
sns.set_style("darkgrid")
# Matplotlib figure configuration fonts and figsizes
plt.rcParams.update({
    'font.family': 'sans-serif',
    'font.size': 16.0,
    'text.usetex': False,
    'figure.figsize': (11.69, 8.27)
})
# Color palette
cm = "Paired"
cp3 = sns.color_palette(cm, 3)
cp10 = sns.color_palette(cm, 10)
The input data always comes from the CDMRP database (https://cdmrp.army.mil/search.aspx). The data contains unstructured references and grant IDs for each article in the selected research program.
# Load references
refs = pd.read_csv(references_f, index_col="reference_id")
keywords = refs.program.unique().tolist()
mds = f"""
**Data from CDMRP database**
- Selected research programs: { ", ".join(keywords) }
- Total number of references found: { len(refs) }
- Unique references: { refs.reference.nunique() } => several references appear in multiple research programs and are affiliated with multiple grants
- Unique grant IDs: { refs.grant_id.nunique() }
"""
md(mds)
From here on, we use a reference matcher developed by Dominika Tkaczyk at Crossref. The algorithm is described in some detail in a blog series: https://www.crossref.org/blog/matchmaker-matchmaker-make-me-a-match/
The matcher is implemented in Java and no longer actively maintained, but it still works well. Dominika recommends up to 20 threads (already implemented); with some workarounds we could also add progress reporting for larger datasets, which is not supported at the moment.
In our case, I ran the 8620 unique references from our dataset through the matcher, which matched 6711 articles. Each matched article also comes with a matching score. The implementation applies its own thresholds and simply returns nulls for references that do not match well enough.
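The post-match filtering step can be sketched as follows; the `reference`/`DOI` column names and the inline frame are assumptions standing in for the matcher's actual output:

```python
import pandas as pd

# Hypothetical matcher output: one row per input reference,
# DOI left null when no candidate cleared the matching threshold.
matched = pd.DataFrame({
    "reference": ["ref A", "ref B", "ref C"],
    "DOI": ["10.1000/xyz", None, "10.1000/abc"],
})

# Keep only references that were matched to an article
articles = matched.dropna(subset=["DOI"])
print(len(articles))  # -> 2
```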
# Load matched articles
articles = pd.read_csv(articles_f)
articles.DOI = articles.DOI.str.lower()
mds = f"""
**Matching references with Crossref**
- Matched articles: { articles.DOI.nunique() }
- Articles in multiple research programs: { (articles.groupby("DOI").program.nunique()>1).sum() }
- Articles funded by multiple grants: { (articles.groupby("DOI").grant_id.nunique()>1).sum() }
"""
md(mds)
Once the references have been matched to articles in Crossref, we use the DOIs to retrieve metrics from the Web of Science and Altmetric.
# Load metrics from WoS
wos = pd.read_csv(wos_f, low_memory=False, index_col="DOI")
wos.columns = [x.lower() for x in wos.columns.tolist()]
wos.index = wos.index.str.lower()
wos = wos.rename(columns={'relative citation score':'citation_score'})
# Load metrics from Altmetric
altmetrics = pd.read_json(altmetric_f).T
# Filter out all articles without altmetrics
altmetrics = altmetrics[altmetrics.altmetric_id.notna()]
dates = ["last_updated", "published_on", "added_on"]
for d in dates:
    altmetrics[d] = pd.to_datetime(altmetrics[d], unit="s")
str_cols = ["pmid", "pmc", "altmetric_id", "doi", 'hollis_id', "arxiv_id"]
for c in str_cols:
    altmetrics[c] = altmetrics[c].astype(str)
metric_cols = {
'cited_by_posts_count': 'posts_count',
'cited_by_rh_count': 'research_highlight',
'cited_by_tweeters_count': 'twitter_accounts',
'cited_by_patents_count': 'patents',
'cited_by_msm_count': 'news_outlets',
'cited_by_feeds_count': 'blogs',
'cited_by_fbwalls_count': 'fb_pages',
'cited_by_qna_count': 'qna_count',
'cited_by_videos_count': 'videos',
'cited_by_peer_review_sites_count': 'peer_reviews',
'cited_by_weibo_count': 'weibo',
'cited_by_gplus_count': 'gplus',
'cited_by_rdts_count': 'reddit_threads',
'cited_by_policies_count': 'policies',
'cited_by_syllabi_count': 'syllabi',
'cited_by_linkedin_count': 'linkedin',
'cited_by_wikipedia_count': 'wikipedia',
}
altmetrics = altmetrics.rename(columns=metric_cols)
metric_cols = list(metric_cols.values())
altmetrics[metric_cols] = altmetrics[metric_cols].astype(float)
cols_to_keep = metric_cols + dates + str_cols + ['subjects', 'scopus_subjects']
altmetrics = altmetrics[cols_to_keep]
# Transform all DOIs to lowercase
altmetrics.index = altmetrics.index.str.lower()
metrics = articles.drop_duplicates().merge(altmetrics[metric_cols], left_on="DOI", right_index=True, how="left")
metrics = metrics.merge(wos[["citations", "citation_score"]], left_on="DOI", right_index=True, how="left")
# Replace metric values of 0 with NaN, i.e., treat zero counts as missing
metrics = metrics.replace(0.0, np.nan)
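Replacing zeros with NaN means that pandas' `count()` later tallies only articles with at least one event (e.g., "articles with at least 1 citation"). A minimal illustration with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 2.0, np.nan, 5.0], name="citations")
s = s.replace(0.0, np.nan)  # zero counts become missing values
print(s.count())  # -> 2, only articles with at least one citation
```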
mds = f"""
**Metrics from the Web of Science and Altmetric.com**
- Articles found in WoS: { len(wos) }
- Articles with at least 1 citation: { metrics.citations.count() }
- Articles found in Altmetric.com: { len(altmetrics) }
- Articles with tweets: { altmetrics.twitter_accounts.notna().sum() }
- Articles with FB mentions: { altmetrics.fb_pages.notna().sum() }
"""
md(mds)
all_articles = set(articles.DOI.unique().tolist())
articles_w_altm = set(altmetrics.index.tolist())
articles_w_cit = set(wos.index.tolist())
total = len(all_articles)
v = venn3([all_articles, articles_w_altm, articles_w_cit],
          set_labels=('', '', ''),
          subset_label_formatter=lambda x: "{:,} ({:.1f})".format(x, 100*x/total));
v.get_patch_by_id('100').set_color(cp3[0])
v.get_patch_by_id('110').set_color(np.add(cp3[0], cp3[1])/2)
v.get_patch_by_id('101').set_color(np.add(cp3[0], cp3[2])/2)
v.get_patch_by_id('111').set_color(np.add(np.add(cp3[1], cp3[0]), cp3[2]) / 3)
for text in v.set_labels:
    text.set_fontsize(10)
# for text in v.subset_labels:
# text.set_fontsize(12)
handles = []
labels = ["All articles", "Altmetric", "WoS"]
for l, c in zip(labels, [cp3[0], np.add(cp3[0], cp3[1])/2, np.add(cp3[0], cp3[2])/2]):
    handles.append(mpatches.Patch(color=c, label=l))
plt.legend(handles=handles);
# plt.gca().legend(handles=[v.get_patch_by_id('100'), v.get_patch_by_id('010'), v.get_patch_by_id('001')],
# , prop={'size': 12});
wos.groupby("discipline").size().div(len(wos)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in WoS disciplines (n={len(wos)})")
sns.despine()
pdf = pd.get_dummies(altmetrics['scopus_subjects'].apply(pd.Series).stack()).sum(level=0).melt()
pdf.groupby("variable")['value'].sum().div(len(altmetrics)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in Scopus subjects (n={len(altmetrics)})")
sns.despine()
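The `apply(pd.Series).stack()` + `get_dummies` idiom above expands a list-valued column into per-subject indicator counts. A sketch with toy data (subject names are made up; `groupby(level=0).sum()` is the non-deprecated equivalent of `sum(level=0)`):

```python
import pandas as pd

df = pd.DataFrame({"scopus_subjects": [["Medicine", "Biology"], ["Medicine"], []]})

# Expand each list into columns, stack into one long Series (dropping NaNs),
# one-hot encode the values, then sum the indicators back per original row.
stacked = df["scopus_subjects"].apply(pd.Series).stack()
dummies = pd.get_dummies(stacked).groupby(level=0).sum()

print(dummies.sum().to_dict())  # -> {'Biology': 1, 'Medicine': 2}
```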
pdf = metrics[metrics.columns.drop('citation_score')].count().div(len(metrics)/100).sort_values()
pdf.plot(kind="barh", color=cp3[0])
plt.xlabel("Coverage [%]")
count_order = pdf.index.tolist()
pdf = metrics[metrics.columns].reset_index().melt(id_vars="DOI")
sns.boxenplot(y="variable", x="value", data=pdf, order=count_order[::-1], k_depth="trustworthy", color=cp3[0])
plt.ylabel("")
plt.xlabel("Count")
plt.xscale("log")
ticks = [1,2,3,5,10,100,1000]
plt.xticks(ticks, ticks);
wos.groupby("discipline").citation_score.mean().to_frame("avg_citation_score")
| discipline | avg_citation_score |
|---|---|
| Biology | 2.366737 |
| Biomedical Research | 2.236923 |
| Chemistry | 1.434047 |
| Clinical Medicine | 2.030200 |
| Earth and Space | 0.982333 |
| Engineering and Technology | 2.124703 |
| Health | 0.787700 |
| Mathematics | 1.490500 |
| Physics | 1.435974 |
| Professional Fields | 1.083000 |
| Psychology | 0.115000 |
pdf = wos.melt(id_vars="discipline", value_vars="citation_score")
sns.boxenplot(y="discipline", x="value", data=pdf)
plt.xlim(0,15);
pdf = pd.DataFrame(index=range(2000,2020))
pdf['count'] = wos.groupby("year").size()
pdf['count'].plot(kind="bar", color=cp3[0]);
sns.despine()
plt.title("Articles indexed in WoS by years");
pdf = wos.replace(0, np.nan).melt(id_vars="year", value_vars="citations")
for y in range(int(pdf.year.min()), int(pdf.year.max()) + 1):
    if y not in pdf.year.values:
        pdf.loc[len(pdf) + 1] = [y, 'citations', np.nan]
sns.boxenplot(x="year", y="value", data=pdf, color=cp3[0])
plt.title("Letter-value plot of citation counts by year")
plt.ylim(1,2000)
plt.yscale("log")
plt.ylabel("")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
pdf = pd.DataFrame(index=range(2000,2020))
pdf['mean'] = wos.groupby("year").citations.mean()
pdf['mean'].plot(kind="bar", color=cp3[0])
sns.despine()
plt.title("Average citations for articles by year");
altmetrics['year'] = altmetrics.published_on.map(lambda x: str(x.year))
pdf = altmetrics.groupby("year").size().to_frame("count")
pdf.index = pdf.index.astype(float)
# Pad years without any indexed articles with NaN
for y in range(int(pdf.index.min()), int(pdf.index.max()) + 1):
    if y not in pdf.index:
        pdf.loc[y] = [np.nan]
pdf = pdf[pdf.index > 1996].sort_index()
pdf.plot(kind="bar", color=cp3[0]);
plt.title("Articles indexed in Altmetric.com by year");
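The year-padding loop above can also be expressed with `reindex`, which fills missing index entries with NaN in one step. A sketch with made-up counts:

```python
import pandas as pd

counts = pd.Series([5, 7], index=[2001, 2003], name="count")

# Years without any articles become NaN instead of silently disappearing
full = counts.reindex(range(2000, 2005))
print(int(full.isna().sum()))  # -> 3 (2000, 2002 and 2004 are missing)
```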
plt_metrics = ['twitter_accounts', 'fb_pages']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetric counts by year (Twitter accounts and Facebook pages)");
plt_metrics = ['news_outlets', 'wikipedia']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetric counts by year (news outlets and Wikipedia)");
df = refs.groupby("program").size().to_frame("total references")
df['found in crossref'] = articles.groupby("program").DOI.nunique()
df["found (%)"] = 100 * df['found in crossref'] / refs.groupby("program").size()
df.columns = ["References", "Found DOI", "Found (%)"]
df = df.sort_values("References")
df.round(2)
| program | References | Found DOI | Found (%) |
|---|---|---|---|
| NFRP | 208 | 197 | 94.71 |
| TSCRP | 208 | 197 | 94.71 |
| PRORP | 354 | 271 | 76.55 |
| PCRP | 8308 | 6246 | 75.18 |
cs = ['coci_citations', 'citations', 'posts_count']
x = metrics.groupby(["DOI", "program"])[cs].mean().reset_index().groupby("program").count()[cs]
x[cs] = x[cs].apply(lambda x: 100 * x / articles.groupby("program").DOI.nunique())
x.columns = ["COCI (Cov in %)", "WoS (Cov in %)", "Altmetric (Cov in %)"]
x.reindex(df.index).round(2)
| program | COCI (Cov in %) | WoS (Cov in %) | Altmetric (Cov in %) |
|---|---|---|---|
| NFRP | 97.97 | 66.50 | 59.39 |
| TSCRP | 97.97 | 66.50 | 59.39 |
| PRORP | 96.68 | 71.96 | 56.09 |
| PCRP | 96.13 | 36.60 | 49.70 |
df = metrics.groupby(["grant_id", "program"])[['coci_citations', 'citations', 'twitter_accounts', 'citation_score']].mean()
df['count'] = metrics.groupby(["grant_id", "program"]).size()
df = df.sort_values("count").reset_index()
df
| | grant_id | program | coci_citations | citations | twitter_accounts | citation_score | count |
|---|---|---|---|---|---|---|---|
| 0 | MP980015 | PCRP | 1.000000 | NaN | NaN | NaN | 1 |
| 1 | PC081249P1 | PCRP | 176.000000 | 163.000000 | NaN | 4.770000 | 1 |
| 2 | PC081249 | PCRP | 176.000000 | 163.000000 | NaN | 4.770000 | 1 |
| 3 | PC081246 | PCRP | 36.000000 | 30.000000 | NaN | 0.575000 | 1 |
| 4 | PC081176 | PCRP | 19.000000 | 12.000000 | 1.000000 | 0.672000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2239 | PC100473 | PCRP | 89.487805 | 71.633333 | 2.153846 | 3.444767 | 41 |
| 2240 | PC010267 | PCRP | 23.527273 | 20.500000 | NaN | 0.779500 | 57 |
| 2241 | PC051369 | PCRP | 43.629630 | 30.780000 | 11.272727 | 1.319580 | 83 |
| 2242 | PC081610 | PCRP | 73.276316 | 101.909091 | 4.750000 | 3.421500 | 97 |
| 2243 | PC021004 | PCRP | 133.000000 | 83.000000 | 1.200000 | 1.388000 | 136 |

2244 rows × 7 columns
# df['citation_score'] = 100*df['citation_score']
sns.scatterplot(x="citation_score", y="twitter_accounts", hue="program", size="count", data=df, sizes=(50,300), alpha=.5)
# plt.xlim(0, 500)
# plt.ylim(0, 100)
# plt.yscale("log")
# plt.xscale("log")
New data to get
Available data from DoD
Notes
For grants
Next Thing To Do: