This post is about what you can learn about scientific articles posted on the arXiv by using Natural Language Processing (NLP). Said differently: I had some questions about papers posted on the arXiv and used it as an excuse to teach myself the basics of NLP. We also look at citation counts and reveal the top cited paper of 2014!
The arXiv makes its data available via a simple API which allows you to download almost everything about an article short of its full text. For each article we can look up who has been citing it on Inspire. Combined, this is a powerful dataset that can answer some interesting questions: what are the most used words, can we auto-generate abstracts, can we summarise abstracts, and what is the most cited article of 2014?
Let's get going!
First some standard imports that we will need later. Some of them you might need to install but nothing too obscure:
import time
import urllib2
import datetime
from itertools import ifilter
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import bibtexparser
pd.set_option('mode.chained_assignment','warn')
%matplotlib inline
The `harvest` function queries the arXiv API for all articles modified between January 1st, 2010 and the end of 2014. One subtlety is worth noting: this query also returns articles created before 2010 if their entry was modified within that window.
The arXiv itself covers so many topics that it is organised into separate arxivs (I know, an unfortunate double use of the name arxiv), one for each topic. By default `harvest` will collect articles from the `physics:hep-ex` arxiv. This is because I am an experimental particle physicist. If you are into theory try `physics:hep-th`, or `stat` if you are a stats guru. The API gives you a full list of all sets of topics to explore.
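To see which sets exist, OAI-PMH (the protocol behind this API) defines a `ListSets` verb next to the `ListRecords` verb used in this post. The sketch below is written for Python 3, unlike the rest of the post, and only parses a small hand-written sample reply so that it runs offline. The sample XML and the helper name `parse_sets` are made up for illustration, and fetching the real list from `http://export.arxiv.org/oai2?verb=ListSets` assumes the endpoint still answers in this shape.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample in the shape of a ListSets reply (not real API output).
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>physics:hep-ex</setSpec><setName>High Energy Physics - Experiment</setName></set>
    <set><setSpec>physics:hep-th</setSpec><setName>High Energy Physics - Theory</setName></set>
  </ListSets>
</OAI-PMH>
"""


def parse_sets(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets reply."""
    root = ET.fromstring(xml_text)
    return [(node.findtext(OAI + "setSpec"), node.findtext(OAI + "setName"))
            for node in root.iter(OAI + "set")]


for spec, name in parse_sets(SAMPLE_LIST_SETS):
    print(spec, "-", name)
```

The `setSpec` value is what goes into the `set=` parameter of the query.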
If you do not care for the technicalities of how to scrape the data skip right ahead to the first factoid.
Most of `harvest` is pretty straightforward. The API returns a big XML document containing information about at most 1000 articles, which we can parse with `ElementTree` and store. If there are more than 1000 articles for a particular query we can get the rest using the `resumptionToken` in the XML. API access can be throttled, so on occasion the arXiv will reply with a 503 error asking us to retry later. The information we harvest is stored in a `pandas` dataframe.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

def harvest(arxiv="physics:hep-ex"):
    df = pd.DataFrame(columns=("title", "abstract", "categories", "created", "id", "doi"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = (base_url +
           "from=2010-01-01&until=2014-12-31&" +
           "metadataPrefix=arXiv&set=%s"%arxiv)

    while True:
        print "fetching", url
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print "Got 503. Retrying after {0:d} seconds.".format(to)
                time.sleep(to)
                continue
            else:
                raise

        xml = response.read()
        root = ET.fromstring(xml)

        for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
            arxiv_id = record.find(OAI+'header').find(OAI+'identifier')
            meta = record.find(OAI+'metadata')
            info = meta.find(ARXIV+"arXiv")
            created = info.find(ARXIV+"created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV+"categories").text
            # if there is more than one DOI use the first one
            # often the second one (if it exists at all) refers
            # to an erratum or similar
            doi = info.find(ARXIV+"doi")
            if doi is not None:
                doi = doi.text.split()[0]

            contents = {'title': info.find(ARXIV+"title").text,
                        'id': info.find(ARXIV+"id").text,#arxiv_id.text[4:],
                        'abstract': info.find(ARXIV+"abstract").text.strip(),
                        'created': created,
                        'categories': categories.split(),
                        'doi': doi,
                        }
            df = df.append(contents, ignore_index=True)

        # The list of articles returned by the API comes in chunks of
        # 1000 articles. The presence of a resumptionToken tells us that
        # there is more to be fetched.
        token = root.find(OAI+'ListRecords').find(OAI+"resumptionToken")
        if token is None or token.text is None:
            break
        else:
            url = base_url + "resumptionToken=%s"%(token.text)

    return df
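The parsing and paging logic inside `harvest` can be tried out offline, without touching the network. The following is a minimal sketch in Python 3 (the post itself uses Python 2) that handles a single page. The sample XML, the identifier, the token value and the helper name `parse_page` are all invented for illustration; only the element names and namespaces are taken from the code in this post.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# Hand-written page with one record and a resumption token (not real output).
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0000.0001</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.0001</id>
          <created>2012-05-01</created>
          <title>A made-up title</title>
          <categories>hep-ex hep-ph</categories>
          <abstract>  A made-up abstract.  </abstract>
        </arXiv>
      </metadata>
    </record>
    <resumptionToken>made-up-token|1001</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_page(xml_text):
    """Return (records, token) for one page of a ListRecords reply.

    `token` is None when there is nothing left to fetch.
    """
    root = ET.fromstring(xml_text)
    listing = root.find(OAI + "ListRecords")
    records = []
    for record in listing.findall(OAI + "record"):
        info = record.find(OAI + "metadata").find(ARXIV + "arXiv")
        doi = info.findtext(ARXIV + "doi")  # None if the record has no DOI
        records.append({
            "id": info.findtext(ARXIV + "id"),
            "title": info.findtext(ARXIV + "title"),
            "abstract": info.findtext(ARXIV + "abstract").strip(),
            "categories": info.findtext(ARXIV + "categories").split(),
            # keep only the first DOI, as harvest does
            "doi": doi.split()[0] if doi else None,
        })
    # A missing or empty token both mean "last page".
    token = listing.findtext(OAI + "resumptionToken") or None
    return records, token


records, token = parse_page(SAMPLE_PAGE)
```

A caller would keep requesting pages with `resumptionToken=<token>` until `token` comes back as `None`.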
Set `harvest` running and go chat with someone for a few minutes while it gathers the information about your articles.
df = harvest()
What does all that stuff we just downloaded look like? Here are the first five entries in the dataframe:
df.head()
| | title | abstract | categories | created | id | doi | cited_by |
---|---|---|---|---|---|---|---|
0 | Measurement of the Hadronic Form Factor in D0 ... | The shape of the hadronic form factor f+(q2) i... | [hep-ex] | 2007-03-31 | 0704.0020 | doi:10.1103/PhysRevD.76.052005 | [{u'slaccitation': u'%%CITATION = ARXIV:1411.3... |
1 | Measurement of B(D_S^+ --> ell^+ nu) and the D... | We examine e+e- --> Ds- Ds*+ and Ds*- Ds+ inte... | [hep-ex, hep-lat, hep-ph] | 2007-04-03 | 0704.0437 | 10.1103/PhysRevD.76.072002 | [{u'author': u'Yelton, John M.', u'journal': u... |
2 | A unified analysis of the reactor neutrino pro... | We present in this article a detailed quantita... | [hep-ex] | 2007-04-04 | 0704.0498 | 10.1088/1742-6596/110/8/082013 | [{u'author': u'Queval, Rachel', u'title': u'Ch... |
3 | Measurement of Decay Amplitudes of B -->(c cba... | We perform the first three-dimensional measure... | [hep-ex] | 2007-04-04 | 0704.0522 | 10.1103/PhysRevD.76.031102 | [{u'author': u'Giurgiu, Gavril', u'journal': u... |
4 | Measurement of the Decay Constant $f_D{_S^+}$ ... | We measure the decay constant fDs using the Ds... | [hep-ex, hep-lat, hep-ph] | 2007-04-04 | 0704.0629 | 10.1103/PhysRevLett.99.071802 | [{u'author': u'Jackson, Graham', u'type': u'ar... |
def bar_chart(items):
    """Make a bar chart showing the count associated with each key

    `items` is a list of (key, count) pairs.
    """
    width = 0.5
    ind = np.arange(len(items))
    fig, ax = plt.subplots(figsize=(8,8))
    rects1 = ax.bar(ind, zip(*items)[1], width, color='r')
    ax.set_xticks(ind+width)
    ax.set_xticklabels(zip(*items)[0])
    fig.autofmt_xdate()
    plt.show()

edits_per_year = Counter(df.created.map(lambda x: x.year))
bar_chart(edits_per_year.items())
new_articles = sum(edits_per_year[year] for year in (2010,2011,2012,2013,2014))
print "Unique arXiv IDs edited between 2010 and 2014:", len(df.id.unique())
print "of which %i entries were created in that time period."%(new_articles)
Unique arXiv IDs edited between 2010 and 2014: 16321
of which 11958 entries were created in that time period.
Here is our first factoid about the arXiv: there are about 16000 articles in `hep-ex` which were edited between the beginning of 2010 and the end of 2014, including about 12000 newly created articles. The other 4000 papers were created before 2010 and updated later. Amazing to see that papers created in 1994 were still being edited more than fifteen years later!
Let's take a look at those, maybe there is something interesting:
df[df.created<datetime.date(1995,1,1)]
| | title | abstract | categories | created | id | doi | cited_by |
---|---|---|---|---|---|---|---|
13366 | DUMAND and AMANDA: High Energy Neutrino Astrop... | The field of high energy neutrino astrophysics... | [astro-ph, hep-ex] | 1994-12-06 | astro-ph/9412019 | None | [{u'author': u'Al Samarai, Imen', u'type': u'a... |
13375 | Detection of nuclear recoils in prototype dark... | This work is part of an ongoing project to dev... | [cond-mat, hep-ex] | 1994-11-17 | cond-mat/9411072 | 10.1016/0168-9002(95)00036-4 | [{u'doi': u'10.1016/j.astropartphys.2004.06.00... |
15144 | Precise Measurement of the Left-Right Cross Se... | We present a precise measurement of the left-r... | [hep-ex, hep-ph] | 1994-04-27 | hep-ex/9404001 | 10.1103/PhysRevLett.73.25 | [{u'doi': u'10.1088/1742-6596/335/1/012078', u... |
15145 | An optimal method of moments to measure the ch... | Parity violation at LEP or SLC can be measured... | [hep-ex] | 1994-05-11 | hep-ex/9405002 | 10.1016/0168-9002(94)90847-8 | [] |
15146 | Observation of Anisotropic Event Shapes and Tr... | Event shapes for Au + Au collisions at 11.4 Ge... | [hep-ex] | 1994-05-13 | hep-ex/9405003 | 10.1103/PhysRevLett.73.2532 | [{u'primaryclass': u'nucl-th', u'author': u'Wa... |
15147 | Measurement of the Charged Multiplicity of $Z ... | Using an impact parameter tag to select an enr... | [hep-ex, hep-ph] | 1994-05-13 | hep-ex/9405004 | 10.1103/PhysRevLett.72.3145 | [{u'doi': u'10.1140/epjc/s2005-02424-5', u'pri... |
15148 | Evidence for Top Quark Production in $\bar{p}p... | We summarize a search for the top quark with t... | [hep-ex, hep-ph] | 1994-05-16 | hep-ex/9405005 | 10.1103/PhysRevLett.73.225 | [{u'primaryclass': u'hep-ex', u'author': u'Ger... |
15149 | Precise Determination of the Weak Mixing Angle... | In the 1993 SLC/SLD run, the SLD recorded 50,0... | [hep-ex] | 1994-05-20 | hep-ex/9405011 | None | [{u'doi': u'10.1016/0370-1573(95)00072-0', u'a... |
15150 | A Neural Network for Locating the Primary Vert... | Using simulated collider data for $p+p\rightar... | [hep-ex] | 1994-06-21 | hep-ex/9406003 | 10.1016/0168-9002(94)01133-8 | [] |
15151 | Semileptonic Branching Fraction of Charged and... | An examination of leptons in ${\Upsilon (4S)}$... | [hep-ex] | 1994-06-23 | hep-ex/9406004 | 10.1103/PhysRevLett.73.3503 | [{u'author': u'Ivarsson, Jenny', u'title': u'P... |
15152 | Measurement of the B -> D^* l nu Branching Fra... | We study the exclusive semileptonic B meson de... | [hep-ex] | 1994-06-24 | hep-ex/9406005 | 10.1103/PhysRevD.51.1014 | [{u'author': u'Borean, Cristiano', u'type': u'... |
15153 | Spin Asymmetry in Muon--Proton Deep Inelastic ... | We measured the spin asymmetry in the scatteri... | [hep-ex] | 1994-08-06 | hep-ex/9408001 | 10.1016/0370-2693(94)00968-6 | [{u'primaryclass': u'nucl-ex', u'author': u'Pa... |
15154 | Measurement of the polarization of Lambda0 Ant... | The polarization of Lambda0, AntiLambda0, Sigm... | [hep-ex] | 1994-09-16 | hep-ex/9409001 | 10.1007/BF01291194 | [{u'doi': u'10.1088/1742-6596/509/1/012056', u... |
15155 | Search for slowly moving magnetic monopoles | We report a search for slowly moving magnetic ... | [hep-ex] | 1994-10-05 | hep-ex/9410006 | 10.1016/0920-5632(94)90257-7 | [] |
15156 | Polarized Bhabha Scattering and a Precision Me... | We present the first measurement of the left-r... | [hep-ex] | 1994-10-10 | hep-ex/9410009 | 10.1103/PhysRevLett.74.2880 | [{u'author': u'Quast, Gunther', u'title': u'Me... |
15157 | A Measurement of the $D^{*\pm}$ Cross Section ... | We have measured the inclusive $D^{*\pm}$ prod... | [hep-ex] | 1994-11-29 | hep-ex/9411002 | 10.1103/PhysRevD.50.1879 | [{u'author': u'Ngac, An Bang', u'title': u'Mea... |
15158 | Measurement of the $D^{*\pm}$ Cross Section us... | The differential cross section of $d\sigma(e^+... | [hep-ex] | 1994-12-01 | hep-ex/9412001 | 10.1016/0370-2693(94)91515-6 | [{u'author': u'Ngac, An Bang', u'title': u'Mea... |
15159 | $K^0(\bar{K^0})$ Production in Two-Photon Proc... | We have carried out an inclusive measurement o... | [hep-ex] | 1994-12-05 | hep-ex/9412003 | 10.1016/0370-2693(94)90315-8 | [{u'doi': u'10.1016/S0370-2693(02)01769-0', u'... |
15160 | New Tagging Method of B Flavor of Neutral B Me... | In CP violation measurements in asymmetric B-f... | [hep-ex] | 1994-12-08 | hep-ex/9412005 | 10.1143/JPSJ.63.3542 | [{u'author': u'Foland, Andrew Dean', u'title':... |
15161 | Kinematic Evidence for Top Quark Pair Producti... | We present a study of $W+$multijet events that... | [hep-ex] | 1994-12-13 | hep-ex/9412009 | 10.1103/PhysRevD.51.4623 | [{u'author': u'Hinchliffe, Ian and Paige, FE a... |
15162 | Feasibility Study of Single-Photon Counting Us... | The fine-mesh phototube is one type of photode... | [hep-ex] | 1994-12-13 | hep-ex/9412010 | 10.1016/0168-9002(93)90749-8 | [{u'doi': u'10.1140/epjc/s10052-014-3026-9', u... |
15163 | Measurement of inclusive electron cross sectio... | We have studied open charm production in $\gam... | [hep-ex] | 1994-12-16 | hep-ex/9412011 | 10.1016/0370-2693(94)01349-7 | [{u'doi': u'10.1016/S0370-2693(02)01769-0', u'... |
15164 | Measurement of the forward-backward asymmetrie... | We have measured, with electron tagging, the f... | [hep-ex] | 1994-12-18 | hep-ex/9412012 | 10.1016/0370-2693(94)91310-2 | [{u'doi': u'10.1103/PhysRevD.65.053002', u'pri... |
15165 | J/psi,psi(2S) to mu+ mu- and B to J/psi,psi(2S... | This paper presents a measurement of J/psi,psi... | [hep-ex] | 1994-12-23 | hep-ex/9412013 | None | [{u'slaccitation': u'%%CITATION = ARXIV:1411.3... |
15166 | Measurement of inclusive particle spectra and ... | Inclusive momentum spectra are measured for al... | [hep-ex] | 1994-12-27 | hep-ex/9412015 | 10.1016/0370-2693(94)01685-6 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.2... |
15167 | Measurement of the Bs Meson Lifetime | The lifetime of the $B_s$ meson is measured us... | [hep-ex] | 1994-12-27 | hep-ex/9412017 | 10.1103/PhysRevLett.74.4988 | [{u'doi': u'10.1007/s00601-014-0871-x', u'prim... |
16101 | Exclusive Hadronic B Decays to Charm and Charm... | We have fully reconstructed decays of both B0 ... | [hep-ph, hep-ex] | 1994-03-15 | hep-ph/9403295 | 10.1103/PhysRevD.50.43 | [{u'author': u'Sabelli, Chiara', u'title': u'T... |
16102 | Observation of a New Charmed Strange Meson | Using the CLEO-II detector, we have obtained e... | [hep-ph, hep-ex] | 1994-03-21 | hep-ph/9403325 | 10.1103/PhysRevLett.72.1972 | [{u'slaccitation': u'%%CITATION = ARXIV:1410.5... |
16103 | Study of the Decay $\Lambda_c \to \Lambda l^+ ... | Using the CLEO II detector at CESR, we observe... | [hep-ph, hep-ex] | 1994-03-21 | hep-ph/9403326 | 10.1016/0370-2693(94)90295-X | [{u'doi': u'10.1140/epjc/s10052-014-3194-7', u... |
16104 | Precision Measurement of the $D_s^{*+}- D_s^+$... | We have measured the vector-pseudoscalar mass ... | [hep-ph, hep-ex] | 1994-03-21 | hep-ph/9403327 | 10.1103/PhysRevD.50.1884 | [{u'doi': u'10.1007/JHEP06(2013)065', u'primar... |
16105 | A Measurement of ${\cal B}(D_s \to \phi l^+ \n... | Using the CLEO~II detector at CESR, we have me... | [hep-ph, hep-ex] | 1994-03-21 | hep-ph/9403328 | 10.1016/0370-2693(94)90416-2 | [{u'doi': u'10.1103/RevModPhys.84.65', u'prima... |
16106 | Measurement of Cabibbo Suppressed Decays of th... | Branching ratios for the dominant Cabibbo-supp... | [hep-ph, hep-ex] | 1994-03-21 | hep-ph/9403329 | 10.1103/PhysRevLett.73.1079 | [{u'doi': u'10.1103/PhysRevD.87.073016', u'pri... |
16107 | Production and Decay of D_1(2420)^0 and D_2^*(... | We have investigated $D^{+}\pi^{-}$ and $D^{*+... | [hep-ph, hep-ex] | 1994-03-24 | hep-ph/9403359 | 10.1016/0370-2693(94)90968-7 | [{u'slaccitation': u'%%CITATION = ARXIV:1410.5... |
16108 | Two-Photon Production of Charged Pion and Kaon... | A measurement of the cross section for the com... | [hep-ph, hep-ex] | 1994-03-28 | hep-ph/9403379 | 10.1103/PhysRevD.50.3027 | [{u'slaccitation': u'%%CITATION = ARXIV:1307.0... |
16109 | Measurement of the Branching Fraction for D^+ ... | Using the CLEO-II detector at CESR we have mea... | [hep-ph, hep-ex] | 1994-03-28 | hep-ph/9403382 | 10.1103/PhysRevLett.72.2328 | [{u'doi': u'10.1103/RevModPhys.84.65', u'prima... |
16110 | Measurement of the Spin-Dependent Structure Fu... | We have measured the spin-dependent structure ... | [hep-ph, hep-ex] | 1994-04-15 | hep-ph/9404270 | 10.1016/0370-2693(94)90793-5 | [{u'doi': u'10.1103/PhysRevD.90.012009', u'pri... |
16111 | A Measurement of the Branching Fraction ${\cal... | Using data from the CLEO II detector at CESR, ... | [hep-ph, hep-ex] | 1994-04-20 | hep-ph/9404310 | 10.1103/PhysRevLett.72.3762 | [{u'primaryclass': u'hep-ex', u'author': u'Amh... |
16112 | Supersymmetry at the DiTevatron | We study the signals for supersymmetry at the ... | [hep-ph, hep-ex] | 1994-06-07 | hep-ph/9406248 | 10.1103/PhysRevD.50.5676 | [{u'doi': u'10.1103/PhysRevD.82.035009', u'pri... |
16113 | Detecting Tau Neutrino Oscillations at PeV Ene... | It is suggested that a large deep underocean (... | [hep-ph, astro-ph, hep-ex] | 1994-08-15 | hep-ph/9408296 | 10.1016/0927-6505(94)00043-3 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.1... |
After scrolling through the list I cannot spot a particular pattern to the edits. The list does, however, contain articles on interesting topics like evidence for the top quark (index 15148), PeV $\tau$ neutrinos (index 16113) and a determination of the weak mixing angle at SLD (index 15149).
Unfortunately Inspire does not provide a real API, so we have to scrape their webpages to get what we want. The `get_cites` function looks up the citations of an article by its `arxiv_id`. Having to make at least one HTTP request per article means this takes quite a while, so set it going and come back after a few hours. We process articles in chunks of 1000 to get some feedback and to be able to resume if something goes wrong:
def get_cites(arxiv_id):
    cites = []
    base_url = "http://inspirehep.net/search?p=refersto:%s&of=hx&rg=250&jrec=%i"
    offset = 1

    while True:
        print base_url%(arxiv_id, offset)
        response = urllib2.urlopen(base_url%(arxiv_id, offset))
        xml = response.read()
        soup = BeautifulSoup(xml)
        refs = "\n".join(cite.get_text() for cite in soup.findAll("pre"))
        bib_database = bibtexparser.loads(refs)
        if bib_database.entries:
            cites += bib_database.entries
            offset += 250
        else:
            break

    return cites

step = 1000
for N in range(0,17):
    print N
    cites = df['id'][N*step:(N+1)*step].map(get_cites)
    df.ix[N*step:(N+1)*step -1,'cited_by'] = cites
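The loop above hard-codes 17 chunks and, if it dies halfway, has to be restarted by hand from the right chunk. Here is a hedged sketch of the same idea in Python 3 that derives the number of chunks from the data and skips chunks that already finished. The names `process_in_chunks`, `done` and `fake_lookup` are hypothetical, and `fake_lookup` merely stands in for `get_cites` so that the example runs without any network access.

```python
def process_in_chunks(items, func, step, done=None):
    """Apply `func` to `items` chunk by chunk.

    `done` maps a chunk's start offset to its list of results. Chunks
    already present in `done` are skipped, so passing in the dictionary
    from an interrupted run resumes where that run stopped.
    """
    done = {} if done is None else done
    for start in range(0, len(items), step):
        if start in done:
            continue  # finished in an earlier run
        done[start] = [func(item) for item in items[start:start + step]]
    # Stitch the chunks back together in their original order.
    return [result for start in sorted(done) for result in done[start]]


calls = []

def fake_lookup(x):
    """Stand-in for a slow per-article lookup."""
    calls.append(x)
    return x * 2

# Pretend the first chunk (items 0 and 1) was completed before a crash.
partial = {0: [0, 2]}
results = process_in_chunks([0, 1, 2, 3, 4], fake_lookup, step=2, done=partial)
print(results)  # [0, 2, 4, 6, 8]
print(calls)    # [2, 3, 4]
```

Only the unfinished chunks trigger new lookups, which matters when each lookup is an HTTP request.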
After investing so much time to gather the raw data it is a good idea to store it locally so we do not have to scrape it all again later:
store = pd.HDFStore("/Users/thead/git/arxiv-experiments/hep-ex.h5")
#store['df'] = df
#df = store['df']
store.close()
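A common pattern for this kind of local cache is "load it if it is there, otherwise compute and save it". The sketch below illustrates that pattern in Python 3 using the standard library's `pickle` rather than the `HDFStore` used in this post, purely so that it runs without extra dependencies. The function names, the file name and the cached dictionary are all made up for illustration.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at `path`, computing and storing it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []

def slow_scrape():
    """Stand-in for hours of scraping."""
    calls.append(1)
    return {"hypothetical-id": 3}

cache_path = os.path.join(tempfile.mkdtemp(), "citations.pkl")
first = load_or_compute(cache_path, slow_scrape)   # runs slow_scrape
second = load_or_compute(cache_path, slow_scrape)  # read back from disk
```

As with any pickle file, only load caches you wrote yourself.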
Let's get to answering some questions. What are the ten most used words in `hep-ex` abstracts?
word_bag = " ".join(df.abstract.apply(lambda t: t.lower()))
Counter(word_bag.split()).most_common(n=10)
[('the', 161889), ('of', 79075), ('and', 54236), ('in', 41418), ('a', 38591), ('to', 36425), ('for', 25128), ('is', 23440), ('with', 23015), ('we', 22510)]
Not too enlightening: boring little words fill the top ten. These words are known as stopwords, and the `NLTK` library provides a list of them. So let's remove them, as well as basic mathematical symbols:
from nltk.corpus import stopwords
stops = [word for word in stopwords.words('english')]
stops += ["=", "->"]
words = filter(lambda w: w not in stops,
               word_bag.split())
top_twenty = Counter(words).most_common(n=20)
bar_chart(top_twenty)
Experimental physics is all about data after all! A shame that `model` beats `detector`, but that is probably inevitable as there are many more theoretical models than experimental detectors. The `higgs` boson beats the `neutrino`, and together they reign supreme over all the other particles.
Towards the bottom we have `measurements` and `measured`. These should probably be counted as one entry, together with `measurement`, `measuring`, etc. The easiest way to achieve this is to stem the words before counting. Stemming is the process of reducing derived words to their stem, for example:
import nltk.stem as stem
porter = stem.PorterStemmer()
for w in ("measurement", "measurements", "measured", "measure"):
    print w, "->", porter.stem(w)
measurement -> measur
measurements -> measur
measured -> measur
measure -> measur
As in this case, the stem does not have to be a real word itself. By stemming words before counting how often they occur, the entries for `measurements` and `measured` get added together. Using the stem of every word we get the following ranking:
word_stems = map(lambda w: (porter.stem(w),w), words)
stem2words = defaultdict(set)
# `word_stem` avoids shadowing the `stem` module imported above
for word_stem, word in word_stems:
    stem2words[word_stem].add(word)

top_twenty = Counter(w[0] for w in word_stems).most_common(n=20)
bar_chart(top_twenty)
# list all words which correspond to each top twenty stem
for word_stem, count in top_twenty:
    print word_stem, "<-", ", ".join(stem2words[word_stem])
measur <- measuring, measures, measurment, measurements, measure, measurable, measurably, measureable, measurability, measurement, measured
decay <- decayed, decays, decaying, decay
use <- use, used, useful, uses, usefulness, using
data <- data
mass <- masses, mass
model <- models, modeled, modelled, modelling, modeling, model
result <- resulted, resultant, resulting, results, result
energi <- energies, energy
product <- product, productive, productivity, productions, production, products
detector <- detector, detectors
neutrino <- neutrino, neutrinos
present <- presently, presented, presentations, presents, presentational, presenting, presentation, present
search <- searches, searchs, search, searching, searched
new <- new
studi <- studying, study, studied, studies
observ <- observational, observable, observation, observer, observes, observed, observe, observations, observables, observability, observing
standard <- standards, standardize, standardized, standard
higg <- higgs
cross <- crossing, crossings, crosses, crossed, cross
experi <- experiement, experiments, experiences, experiment, experience
Turns out experimental physics is all about measuring things. The stemming is not perfect, but good enough for now.
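For readers who want to play with the counting logic without installing NLTK, here is a small self-contained sketch in Python 3. It swaps in two toy pieces: a tiny hand-picked stopword set (taken from the top ten above) and a crude suffix stripper called `crude_stem`, which is emphatically not the Porter algorithm and will get many words wrong. The sample sentence is made up. Only the filter, stem, count and group steps mirror what was done above.

```python
from collections import Counter, defaultdict

# Toy stopword list; NLTK's English list is far longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "is", "with", "we"}


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in, NOT the Porter stemmer."""
    for suffix in ("ements", "ement", "ing", "ed", "es", "s", "e"):
        # Keep at least four characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word


text = ("We measured the decay of the Higgs and present "
        "measurements of decays in the data")

# 1. lower-case and drop stopwords
words = [w for w in text.lower().split() if w not in STOPWORDS]

# 2. count stems and remember which words map onto each stem
stem_counts = Counter(crude_stem(w) for w in words)
stem2words = defaultdict(set)
for w in words:
    stem2words[crude_stem(w)].add(w)

for word_stem, count in stem_counts.most_common(2):
    print(word_stem, count, "<-", ", ".join(sorted(stem2words[word_stem])))
```

Here `measured` and `measurements` end up in one entry, as do `decay` and `decays`.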
In science, citations are the currency used to measure the success of a paper. What does the distribution of citations look like then?
A simple question to ask is: how often are articles cited? As articles have to be read and understood before they can be cited, we only look at articles created before the beginning of 2014.
before_2014 = datetime.datetime(2014,1,1)
plt.hist(df[df.created<before_2014].cited_by.map(len),
         bins=200, normed=True, range=(0,200))
plt.xlabel("Number of citations")
plt.ylabel("Fraction")
This plot shows the fraction of articles cited zero, one, two, three, ... times. The single most likely number of citations for an article on `hep-ex` is zero! A whopping 13% of articles never get cited and nearly a third of articles are cited fewer than four times.
df['citation_count'] = df.cited_by.map(len)
df[df.created<before_2014]['citation_count'].describe()
count    14059.000000
mean        32.630486
std        100.773436
min          0.000000
25%          2.000000
50%         10.000000
75%         31.000000
max       4138.000000
Name: citation_count, dtype: float64
The average number of citations is about 33. The average is misleading for a steeply falling distribution like this; after all, we reach the 50th percentile (the median) at only 10 citations!
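To make the gap between mean and median concrete, here is a tiny Python 3 sketch. The citation counts are invented and have nothing to do with the `hep-ex` data; they are merely chosen so that one very highly cited paper dominates the mean. The helper `fraction_below` is a hypothetical name.

```python
from statistics import mean, median

# Invented citation counts: many small values and one huge outlier.
citations = [0, 0, 1, 2, 3, 5, 8, 10, 40, 931]


def fraction_below(values, threshold):
    """Fraction of entries strictly smaller than `threshold`."""
    return sum(1 for v in values if v < threshold) / len(values)


print("mean:  ", mean(citations))               # 100
print("median:", median(citations))             # 4.0
print("uncited:", fraction_below(citations, 1))  # 0.2
print("fewer than four:", fraction_below(citations, 4))  # 0.5
```

A single paper pulls the mean up to 100 while half of the papers have fewer than four citations, which is the same effect seen in the real distribution.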
The prize for most cited paper with a whopping 4138 citations goes to:
# idxmax returns an index label, so look it up with .loc
df.loc[df.citation_count.idxmax()]
title             New Generation of Parton Distributions with Un...
abstract          A new generation of parton distribution functi...
categories        [hep-ph, hep-ex]
created           2002-01-21 00:00:00
id                hep-ph/0201195
doi               10.1088/1126-6708/2002/07/012
cited_by          [{u'slaccitation': u'%%CITATION = ARXIV:1412.4...
citation_count    4138
Name: 15779, dtype: object
Read the paper yourself: New Generation of Parton Distributions with Uncertainties from Global QCD Analysis and see if you agree.
We can also easily compute the top ten papers. This is an interesting mix of articles. Numbers two and three are the papers by the ATLAS and CMS experiments reporting the discovery of the Higgs boson. While most of the papers in the top ten are older, these two were only published in 2012 and have already overtaken the top quark discovery, which was published in 1995! Curious fact: the ATLAS paper has ever so slightly more citations than the CMS one.
df.sort('citation_count', ascending=False).head(10)
| | title | abstract | categories | created | id | doi | cited_by | citation_count |
---|---|---|---|---|---|---|---|---|
15779 | New Generation of Parton Distributions with Un... | A new generation of parton distribution functi... | [hep-ph, hep-ex] | 2002-01-21 | hep-ph/0201195 | 10.1088/1126-6708/2002/07/012 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 4138 |
7371 | Observation of a new particle in the search fo... | A search for the Standard Model Higgs boson in... | [hep-ex] | 2012-07-31 | 1207.7214 | 10.1016/j.physletb.2012.08.020 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 3653 |
7372 | Observation of a new boson at a mass of 125 Ge... | Results are presented from searches for the st... | [hep-ex] | 2012-07-31 | 1207.7235 | 10.1016/j.physletb.2012.08.021 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 3592 |
15181 | Observation of Top Quark Production in Pbar-P ... | We establish the existence of the top quark us... | [hep-ex, hep-ph] | 1995-03-02 | hep-ex/9503002 | 10.1103/PhysRevLett.74.2626 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2519 |
16241 | Direct Evidence for Neutrino Flavor Transforma... | Observations of neutral current neutrino inter... | [nucl-ex, hep-ex] | 2002-04-21 | nucl-ex/0204008 | 10.1103/PhysRevLett.89.011301 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2425 |
15690 | HERWIG 6.5: an event generator for Hadron Emis... | HERWIG is a general-purpose Monte Carlo event ... | [hep-ph, hep-ex] | 2000-11-29 | hep-ph/0011363 | 10.1088/1126-6708/2001/01/010 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2324 |
14026 | First Results from KamLAND: Evidence for React... | KamLAND has been used to measure the flux of $... | [hep-ex, nucl-ex] | 2002-12-09 | hep-ex/0212021 | 10.1103/PhysRevLett.90.021802 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2235 |
16152 | A Supersymmetry Primer | I provide a pedagogical introduction to supers... | [hep-ph, hep-ex, hep-th] | 1997-09-15 | hep-ph/9709356 | None | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2160 |
16235 | Measurement of the rate of nu_e + d --> p + p ... | Solar neutrinos from the decay of $^8$B have b... | [nucl-ex, hep-ex] | 2001-06-18 | nucl-ex/0106015 | 10.1103/PhysRevLett.87.071301 | [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... | 2030 |
13678 | The BABAR Detector | BABAR, the detector for the SLAC PEP-II asymme... | [hep-ex] | 2001-05-16 | hep-ex/0105044 | 10.1016/S0168-9002(01)02012-5 | [{u'title': u'Measurement of the Partial Branc... | 1828 |
There are many more interesting things to be done with analysing the words used in abstracts as well as analysing who cites whom. This will be covered in the second part of this post as this one is already fairly lengthy.
Just one more thing, the top cited paper of 2014: First combination of Tevatron and LHC measurements of the top-quark mass celebrating collaboration across the globe:
# idxmax returns an index label, so look it up with .loc
df.loc[df[df.created>before_2014].citation_count.idxmax()]
title             First combination of Tevatron and LHC measurem...
abstract          We present a combination of measurements of th...
categories        [hep-ex]
created           2014-03-18 00:00:00
id                1403.4427
doi               None
cited_by          [{u'slaccitation': u'%%CITATION = ARXIV:1412.1...
citation_count    118
Name: 11494, dtype: object