import pandas as pd
from pandas import DataFrame, Series, Index
import numpy as np
import matplotlib.pyplot as plt
Hint from http://api.plos.org/search-examples/plos_search.py about how to get only articles in search:
query['fq'] = quote('doc_type:full AND !article_type_facet:"Issue Image"') #search only for articles
We need to be mindful of the API rate limit -- but PLoS has
http://api.plos.org/solr/faq/#solr_api_recommended_usage:
Please limit your API requests to 7200 requests a day, 300 per hour, 10 per minute and allow 5 seconds for your search to return results. If you exceed this threshold, we will lock out your IP address. If you’re a high-volume user of the PLOS Search API and need more API requests a day, please contact us at api@plos.org to discuss your options. We currently limit API users to no more than five concurrent connections from a single IP address.
[...]
PLOS Search API requests: Please do not send requests that return more than 100 rows. That’s a lot of data for our network to push all at once and it may take some time to return the result set. If you are getting back a result set that bigger than 100+ rows, then you likely need to change your query to return a smaller result set or set a limit on the records returned and page through the results.
import settings
import requests
import urllib
# http://api.plos.org/search-examples/plos_search.py
# query['fq'] = quote('doc_type:full AND !article_type_facet:"Issue Image"')
#search only for articles
def plos_search(q,start=0,rows=100,fl=None, extras=None):
BASE_URL = 'http://api.plos.org/search'
DEFAULT_FL = ('abstract','article_type','author_display',
'eissn','id','journal','publication_date',
'score','title_display')
# fl indicates fields to return
# http://wiki.apache.org/solr/CommonQueryParameters#fl
if fl is None:
fl_ = ",".join(DEFAULT_FL)
else:
fl_ = ",".join(fl)
query = {'q':q,
'start':start,
'rows':rows,
'api_key':settings.PLOS_KEY,
'wt':'json',
'fl':fl_,
'fq': 'doc_type:full AND !article_type_facet:"Issue Image"'}
if extras is not None:
query.update(extras)
query_url = BASE_URL + "?" +urllib.urlencode(query)
r = requests.get(query_url)
return r
r = plos_search(q='subject:"biotechnology"')
r.json()['response']['numFound']
3577
docs = r.json()['response']['docs']
df = DataFrame(docs)
df.head()
abstract | article_type | author_display | eissn | id | journal | publication_date | score | title_display | |
---|---|---|---|---|---|---|---|---|---|
0 | [\nThe objective of this paper is to assess th... | Research Article | [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... | 1932-6203 | 10.1371/journal.pone.0086174 | PLoS ONE | 2014-01-29T00:00:00Z | 1.212916 | Determinants of Public Attitudes to Geneticall... |
1 | [\n Atrazine (ATZ) and S-metolachlor (S... | Research Article | [Cristina A. Viegas, Catarina Costa, Sandra An... | 1932-6203 | 10.1371/journal.pone.0037140 | PLoS ONE | 2012-05-15T00:00:00Z | 1.120443 | Does <i>S</i>-Metolachlor Affect the Performan... |
2 | [\nDue to environmental persistence and biotox... | Research Article | [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... | 1932-6203 | 10.1371/journal.pone.0070686 | PLoS ONE | 2013-08-05T00:00:00Z | 1.120443 | Microbial Electricity Generation Enhances Deca... |
3 | [\n Intensive use of chlorpyrifos has r... | Research Article | [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... | 1932-6203 | 10.1371/journal.pone.0047205 | NaN | 2012-10-08T00:00:00Z | 1.120443 | Biodegradation of Chlorpyrifos and Its Hydroly... |
4 | [Background: The complex characteristics and u... | Research Article | [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... | 1932-6203 | 10.1371/journal.pone.0042270 | NaN | 2012-08-09T00:00:00Z | 0.990341 | Microbial Transformation of Biomacromolecules ... |
5 rows × 9 columns
What type of parameters in search?
parameters: http://api.plos.org/solr/search-fields/
Can try out: http://www.plosone.org/search/advanced?noSearchFlag=true&query=&filterJournals=PLoSONE
How to ask for more metadata -- like subject areas?
http://www.plosone.org/search/advanced?unformattedQuery=subject%3A%22Body+mass+index%22
# get subjects of a given article
# http://api.plos.org/search?q=id:%2210.1371/journal.pbio.0050082%22&fl=id,subject,subject_level_1&api_key=[YOUR_API_KEY]
r = plos_search(q="10.1371/journal.pone.0039504",
fl=('id','subject','subject_level'))
r.json()
{u'response': {u'docs': [{u'id': u'10.1371/journal.pone.0039504', u'subject': [u'/Medicine and health sciences/Physiology/Physiological parameters/Body weight/Body mass index', u'/Biology and life sciences/Anatomy/Musculoskeletal system/Pelvis/Hip', u'/Medicine and health sciences/Physiology/Physiological parameters', u'/Biology and life sciences/Physiology/Physiological parameters/Body weight/Body mass index', u'/Biology and life sciences/Physiology/Physiological parameters/Body weight/Obesity', u'/Medicine and health sciences/Metabolic disorders/Diabetes mellitus', u'/Medicine and health sciences/Anatomy/Musculoskeletal system/Pelvis/Hip', u'/People and places/Demography/Death rates', u'/Biology and life sciences/Physiology/Physiological parameters', u'/Medicine and health sciences/Epidemiology/Ethnic epidemiology', u'/Medicine and health sciences/Physiology/Physiological parameters/Body weight/Obesity', u'/Biology and life sciences/Population biology/Population metrics/Death rates', u'/Physical sciences/Mathematics/Statistics (mathematics)/Confidence intervals']}], u'numFound': 1, u'start': 0}}
# There is also a list of all the top-level subject areas and their article counts:
# http://api.plos.org/search/?q=*:*&fq=doc_type:full&rows=0&facet.field=subject_facet&facet.mincount=1
# this doesn't seem to work for me:
# https://groups.google.com/forum/#!searchin/plos-api-developers/subject/plos-api-developers/BqECTQkvRTI/r9v6oCAAOPoJ
r = plos_search(q="*.*",
extras={'fq':'doc_type:full',
'facet.field':'subject_facet',
'facet.mincount':1
})
r
<Response [200]>
r.json()
{u'response': {u'docs': [], u'maxScore': 0.0, u'numFound': 0, u'start': 0}}
for simplicity, try to limit to PLoS One to start with
# compute links back to journal, XML document
Example article:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039504
id: 10.1371/journal.pone.0039504 XML: http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0039504&representation=XML
# get all articles
r = plos_search(q="*:*")
r
<Response [200]>
r.json()['response']['numFound']
118545