Asura Enkhbayar, 15.06.2020
This report contains a brief summary of the methodology and a few descriptive statistics to assess it. Furthermore, some statistics and plots are generated for the citations and altmetrics gathered for the current data.
from pathlib import Path
from IPython.display import Markdown as md
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib_venn import venn3
import pandas as pd
import seaborn as sns
import numpy as np
from tracking_grants import references_f, articles_f, wos_f, altmetric_f
# Seaborn styles
sns.set_style("darkgrid")
# Matplotlib figure configuration fonts and figsizes
plt.rcParams.update({
    'font.family': 'sans-serif',
    'font.size': 16.0,
    'text.usetex': False,
    'figure.figsize': (11.69, 8.27)
})
# Color palette
cm = "Paired"
cp3 = sns.color_palette(cm, 3)
cp10 = sns.color_palette(cm, 10)
The input data always comes from the CDMRP database (https://cdmrp.army.mil/search.aspx). The data contains unstructured references and grant IDs for each article in the selected research program.
# Load references
refs = pd.read_csv(references_f, index_col="reference_id")
keywords = refs.program.unique().tolist()
mds = f"""
**Data from CDMRP database**
- Selected research programs: { ", ".join(keywords) }
- Total number of references found: { len(refs) }
- Unique references: { refs.reference.nunique() } => several references appear in multiple research programs and are affiliated with multiple grants
- Unique grant IDs: { refs.grant_id.nunique() }
"""
md(mds)
From here on, we use a reference matcher developed by Dominika Tkaczyk at Crossref. The algorithm is described in some detail in a blog series: https://www.crossref.org/blog/matchmaker-matchmaker-make-me-a-match/
The matcher is implemented in Java and no longer actively maintained, but it still works well. Dominika recommends up to 20 threads (already implemented); with some workarounds we could also add progress reporting for larger datasets, which is not supported at the moment.
In our case, I ran the 8620 unique references from our dataset through the matcher, which matched 6711 articles. Each matched article also comes with a matching score. The implementation applies its own thresholds and simply returns nulls for references that do not match well enough.
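The post-match filtering step can be sketched as follows; the `reference`/`DOI` column names and the inline frame are assumptions standing in for the matcher's actual output:

```python
import pandas as pd

# Hypothetical matcher output: one row per input reference,
# DOI left null when no candidate cleared the matching threshold.
matched = pd.DataFrame({
    "reference": ["ref A", "ref B", "ref C"],
    "DOI": ["10.1000/xyz", None, "10.1000/abc"],
})

# Keep only references that were matched to an article
articles = matched.dropna(subset=["DOI"])
print(len(articles))  # -> 2
```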
# Load matched articles
articles = pd.read_csv(articles_f)
articles.DOI = articles.DOI.str.lower()
mds = f"""
**Matching references with Crossref**
- Matched articles: { articles.DOI.nunique() }
- Articles in multiple research programs: { (articles.groupby("DOI").program.nunique()>1).sum() }
- Articles funded by multiple grants: { (articles.groupby("DOI").grant_id.nunique()>1).sum() }
"""
md(mds)
Once the references have been matched to articles in Crossref, we use the DOIs to retrieve metrics from the Web of Science and Altmetric.
# Load metrics from WoS
wos = pd.read_csv(wos_f, low_memory=False, index_col="DOI")
wos.columns = [x.lower() for x in wos.columns.tolist()]
wos.index = wos.index.str.lower()
wos = wos.rename(columns={'relative citation score':'citation_score'})
# Load metrics from Altmetric
altmetrics = pd.read_json(altmetric_f).T
# Filter out all articles without altmetrics
altmetrics = altmetrics[altmetrics.altmetric_id.notna()]
dates = ["last_updated", "published_on", "added_on"]
for d in dates:
    altmetrics[d] = pd.to_datetime(altmetrics[d], unit="s")
str_cols = ["pmid", "pmc", "altmetric_id", "doi", 'hollis_id', "arxiv_id"]
for c in str_cols:
    altmetrics[c] = altmetrics[c].astype(str)
metric_cols = {
'cited_by_posts_count': 'posts_count',
'cited_by_rh_count': 'research_highlight',
'cited_by_tweeters_count': 'twitter_accounts',
'cited_by_patents_count': 'patents',
'cited_by_msm_count': 'news_outlets',
'cited_by_feeds_count': 'blogs',
'cited_by_fbwalls_count': 'fb_pages',
'cited_by_qna_count': 'qna_count',
'cited_by_videos_count': 'videos',
'cited_by_peer_review_sites_count': 'peer_reviews',
'cited_by_weibo_count': 'weibo',
'cited_by_gplus_count': 'gplus',
'cited_by_rdts_count': 'reddit_threads',
'cited_by_policies_count': 'policies',
'cited_by_syllabi_count': 'syllabi',
'cited_by_linkedin_count': 'linkedin',
'cited_by_wikipedia_count': 'wikipedia',
}
altmetrics = altmetrics.rename(columns=metric_cols)
metric_cols = list(metric_cols.values())
altmetrics[metric_cols] = altmetrics[metric_cols].astype(float)
cols_to_keep = metric_cols + dates + str_cols + ['subjects', 'scopus_subjects']
altmetrics = altmetrics[cols_to_keep]
# Transform all DOIs to lowercase
altmetrics.index = altmetrics.index.str.lower()
metrics = articles.drop_duplicates().merge(altmetrics[metric_cols], left_on="DOI", right_index=True, how="left")
metrics = metrics.merge(wos[["citations", "citation_score"]], left_on="DOI", right_index=True, how="left")
# Replace metric values of 0 with NaN, i.e., treat zero counts as missing
metrics = metrics.replace(0.0, np.nan)
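Replacing zeros with NaN means that pandas' `count()` later tallies only articles with at least one event (e.g., "articles with at least 1 citation"). A minimal illustration with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 2.0, np.nan, 5.0], name="citations")
s = s.replace(0.0, np.nan)  # zero counts become missing values
print(s.count())  # -> 2, only articles with at least one citation
```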
mds = f"""
**Metrics from the Web of Science and Altmetric.com**
- Articles found in WoS: { len(wos) }
- Articles with at least 1 citation: { metrics.citations.count() }
- Articles found in Altmetric.com: { len(altmetrics) }
- Articles with tweets: { altmetrics.twitter_accounts.notna().sum() }
- Articles with FB mentions: { altmetrics.fb_pages.notna().sum() }
"""
md(mds)
all_articles = set(articles.DOI.unique().tolist())
articles_w_altm = set(altmetrics.index.tolist())
articles_w_cit = set(wos.index.tolist())
total = len(all_articles)
v = venn3([all_articles, articles_w_altm, articles_w_cit],
          set_labels=('', '', ''),
          subset_label_formatter=lambda x: "{:,} ({:.1f})".format(x, 100*x/total));
v.get_patch_by_id('100').set_color(cp3[0])
v.get_patch_by_id('110').set_color(np.add(cp3[0], cp3[1])/2)
v.get_patch_by_id('101').set_color(np.add(cp3[0], cp3[2])/2)
v.get_patch_by_id('111').set_color(np.add(np.add(cp3[1], cp3[0]), cp3[2]) / 3)
for text in v.set_labels:
    text.set_fontsize(10)
# for text in v.subset_labels:
# text.set_fontsize(12)
handles = []
labels = ["All articles", "Altmetric", "WoS"]
for l, c in zip(labels, [cp3[0], np.add(cp3[0], cp3[1])/2, np.add(cp3[0], cp3[2])/2]):
    handles.append(mpatches.Patch(color=c, label=l))
plt.legend(handles=handles);
# plt.gca().legend(handles=[v.get_patch_by_id('100'), v.get_patch_by_id('010'), v.get_patch_by_id('001')],
# , prop={'size': 12});
wos.groupby("discipline").size().div(len(wos)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in WoS disciplines (n={len(wos)})")
sns.despine()
pdf = pd.get_dummies(altmetrics['scopus_subjects'].apply(pd.Series).stack()).sum(level=0).melt()
pdf.groupby("variable")['value'].sum().div(len(altmetrics)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in Scopus subjects (n={len(altmetrics)})")
sns.despine()
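The `apply(pd.Series).stack()` + `get_dummies` idiom above expands a list-valued column into per-subject indicator counts. A sketch with toy data (subject names are made up; `groupby(level=0).sum()` is the non-deprecated equivalent of `sum(level=0)`):

```python
import pandas as pd

df = pd.DataFrame({"scopus_subjects": [["Medicine", "Biology"], ["Medicine"], []]})

# Expand each list into columns, stack into one long Series (dropping NaNs),
# one-hot encode the values, then sum the indicators back per original row.
stacked = df["scopus_subjects"].apply(pd.Series).stack()
dummies = pd.get_dummies(stacked).groupby(level=0).sum()

print(dummies.sum().to_dict())  # -> {'Biology': 1, 'Medicine': 2}
```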
pdf = metrics[metrics.columns.drop('citation_score')].count().div(len(metrics)/100).sort_values()
pdf.plot(kind="barh", color=cp3[0])
plt.xlabel("Coverage [%]")
count_order = pdf.index.tolist()
pdf = metrics[metrics.columns].reset_index().melt(id_vars="DOI")
sns.boxenplot(y="variable", x="value", data=pdf, order=count_order[::-1], k_depth="trustworthy", color=cp3[0])
plt.ylabel("")
plt.xlabel("Count")
plt.xscale("log")
ticks = [1,2,3,5,10,100,1000]
plt.xticks(ticks, ticks);
wos.groupby("discipline").citation_score.mean().to_frame("avg_citation_score")
| discipline | avg_citation_score |
|---|---|
| Biology | 2.366737 |
| Biomedical Research | 2.236923 |
| Chemistry | 1.434047 |
| Clinical Medicine | 2.030200 |
| Earth and Space | 0.982333 |
| Engineering and Technology | 2.124703 |
| Health | 0.787700 |
| Mathematics | 1.490500 |
| Physics | 1.435974 |
| Professional Fields | 1.083000 |
| Psychology | 0.115000 |
pdf = wos.melt(id_vars="discipline", value_vars="citation_score")
sns.boxenplot(y="discipline", x="value", data=pdf)
plt.xlim(0,15);
pdf = pd.DataFrame(index=range(2000,2020))
pdf['count'] = wos.groupby("year").size()
pdf['count'].plot(kind="bar", color=cp3[0]);
sns.despine()
plt.title("Articles indexed in WoS by years");
pdf = wos.replace(0, np.nan).melt(id_vars="year", value_vars="citations")
for y in range(int(pdf.year.min()), int(pdf.year.max()) + 1):
    if y not in pdf.year.values:
        pdf.loc[len(pdf) + 1] = [y, 'citations', np.nan]
sns.boxenplot(x="year", y="value", data=pdf, color=cp3[0])
plt.title("Letter-value plot of citation counts by year")
plt.ylim(1,2000)
plt.yscale("log")
plt.ylabel("")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
pdf = pd.DataFrame(index=range(2000,2020))
pdf['mean'] = wos.groupby("year").citations.mean()
pdf['mean'].plot(kind="bar", color=cp3[0])
sns.despine()
plt.title("Average citations for articles by year");
altmetrics['year'] = altmetrics.published_on.map(lambda x: str(x.year))
pdf = altmetrics.groupby("year").size().to_frame("count")
pdf.index = pdf.index.astype(float)
# Pad years without any indexed articles with NaN
for y in range(int(pdf.index.min()), int(pdf.index.max()) + 1):
    if y not in pdf.index:
        pdf.loc[y] = [np.nan]
pdf = pdf[pdf.index > 1996].sort_index()
pdf.plot(kind="bar", color=cp3[0]);
plt.title("Articles indexed in Altmetric.com by year");
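The year-padding loop above can also be expressed with `reindex`, which fills missing index entries with NaN in one step. A sketch with made-up counts:

```python
import pandas as pd

counts = pd.Series([5, 7], index=[2001, 2003], name="count")

# Years without any articles become NaN instead of silently disappearing
full = counts.reindex(range(2000, 2005))
print(int(full.isna().sum()))  # -> 3 (2000, 2002 and 2004 are missing)
```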
plt_metrics = ['twitter_accounts', 'fb_pages']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetric counts by year (Twitter accounts and Facebook pages)");
plt_metrics = ['news_outlets', 'wikipedia']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetric counts by year (news outlets and Wikipedia)");
df = refs.groupby("program").size().to_frame("total references")
df['found in crossref'] = articles.groupby("program").DOI.nunique()
df["found (%)"] = 100 * df['found in crossref'] / refs.groupby("program").size()
df.columns = ["References", "Found DOI", "Found (%)"]
df = df.sort_values("References")
df.round(2)
| program | References | Found DOI | Found (%) |
|---|---|---|---|
| NFRP | 208 | 197 | 94.71 |
| TSCRP | 208 | 197 | 94.71 |
| PRORP | 354 | 271 | 76.55 |
| PCRP | 8308 | 6246 | 75.18 |
cs = ['coci_citations', 'citations', 'posts_count']
x = metrics.groupby(["DOI", "program"])[cs].mean().reset_index().groupby("program").count()[cs]
x[cs] = x[cs].apply(lambda x: 100 * x / articles.groupby("program").DOI.nunique())
x.columns = ["COCI (Cov in %)", "WoS (Cov in %)", "Altmetric (Cov in %)"]
x.reindex(df.index).round(2)
| program | COCI (Cov in %) | WoS (Cov in %) | Altmetric (Cov in %) |
|---|---|---|---|
| NFRP | 97.97 | 66.50 | 59.39 |
| TSCRP | 97.97 | 66.50 | 59.39 |
| PRORP | 96.68 | 71.96 | 56.09 |
| PCRP | 96.13 | 36.60 | 49.70 |
df = metrics.groupby(["grant_id", "program"])[['coci_citations', 'citations', 'twitter_accounts', 'citation_score']].mean()
df['count'] = metrics.groupby(["grant_id", "program"]).size()
df = df.sort_values("count").reset_index()
df
| | grant_id | program | coci_citations | citations | twitter_accounts | citation_score | count |
|---|---|---|---|---|---|---|---|
| 0 | MP980015 | PCRP | 1.000000 | NaN | NaN | NaN | 1 |
| 1 | PC081249P1 | PCRP | 176.000000 | 163.000000 | NaN | 4.770000 | 1 |
| 2 | PC081249 | PCRP | 176.000000 | 163.000000 | NaN | 4.770000 | 1 |
| 3 | PC081246 | PCRP | 36.000000 | 30.000000 | NaN | 0.575000 | 1 |
| 4 | PC081176 | PCRP | 19.000000 | 12.000000 | 1.000000 | 0.672000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2239 | PC100473 | PCRP | 89.487805 | 71.633333 | 2.153846 | 3.444767 | 41 |
| 2240 | PC010267 | PCRP | 23.527273 | 20.500000 | NaN | 0.779500 | 57 |
| 2241 | PC051369 | PCRP | 43.629630 | 30.780000 | 11.272727 | 1.319580 | 83 |
| 2242 | PC081610 | PCRP | 73.276316 | 101.909091 | 4.750000 | 3.421500 | 97 |
| 2243 | PC021004 | PCRP | 133.000000 | 83.000000 | 1.200000 | 1.388000 | 136 |

2244 rows × 7 columns
# df['citation_score'] = 100*df['citation_score']
sns.scatterplot(x="citation_score", y="twitter_accounts", hue="program", size="count", data=df, sizes=(50,300), alpha=.5)
# plt.xlim(0, 500)
# plt.ylim(0, 100)
# plt.yscale("log")
# plt.xscale("log")
New data to get
Available data from DoD
Notes
For grants
Next Thing To Do: