Finding Open Access versions of articles in Australian Historical Studies

Open access isn't just what historians expect from GLAM organisations, it's what we do with the products of our research. Australian Historical Studies is one of the major journals for Australian historians. How much of it is accessible to researchers without the luxury of an institutional subscription?

AHS is published by Taylor & Francis, and under their terms and conditions there are two ways articles can be made openly accessible:

  • The author can pay an article publishing charge (APC) to make the article open immediately upon publication. The APC is currently set at $3775. This is known as Gold Open Access.
  • The author can share the Author Accepted Manuscript (AAM) version of their article. The AAM is the version produced after peer review, but before copy-editing and typesetting. This version can be shared immediately on the author's personal website or, after an 18-month embargo, uploaded to an institutional or subject repository. This is known as Green Open Access.

If Green Open Access versions are uploaded to a recognised repository, they become findable. Tools such as the Open Access Button and Unpaywall can redirect you from a paywalled version to an open access alternative. If you use Zotero to save articles from a journal, it'll automatically look for Open Access versions via Unpaywall if no free downloads are available. Green Open Access costs nothing, but it opens your work to new audiences and new uses.
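
As a rough illustration of what these tools do behind the scenes, the sketch below builds a lookup URL for the Unpaywall API and summarises the kind of JSON record it returns. It's a minimal sketch only: the helper names, the email address, and the sample record are made up for illustration, and it looks at just a few fields (is_oa, oa_status, and best_oa_location).

```python
from urllib.parse import quote

UNPAYWALL_URL = 'https://api.unpaywall.org/v2/'

def unpaywall_lookup_url(doi, email):
    # Unpaywall asks for a contact email with every request
    return f'{UNPAYWALL_URL}{quote(doi)}?email={quote(email)}'

def summarise_oa(record):
    # best_oa_location is null (None) when no open copy is known
    location = record.get('best_oa_location') or {}
    return {
        'is_oa': bool(record.get('is_oa')),
        'status': record.get('oa_status', 'closed'),
        'url': location.get('url'),
        'version': location.get('version'),
    }

# A made-up record (not real data), trimmed to the fields used above
sample = {
    'doi': '10.1234/example',
    'is_oa': True,
    'oa_status': 'green',
    'best_oa_location': {
        'url': 'https://repository.example.edu/123/paper.pdf',
        'host_type': 'repository',
        'version': 'acceptedVersion',
    },
}

print(unpaywall_lookup_url(sample['doi'], 'you@example.org'))
print(summarise_oa(sample))
```

A status of 'green' with a version of 'acceptedVersion' is what you'd hope to see for an AAM sitting in a repository.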

So how many authors are taking advantage of Green Open Access arrangements? Let's have a look and see.

The dataset

For this little experiment I'm going to look at just over a decade of articles, from 2008 to 2018. I'm finishing in 2018 because anything published that year is now outside the 18-month embargo period, so everything published in 2018 or before can be made open access. I'm focusing on research articles, excluding editorials, reviews, and commentaries.
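
To make the embargo arithmetic concrete, here's a small sketch that adds 18 months to a publication date. The function names are hypothetical, and the date format is assumed to match the way dates appear in the Zotero records later in this notebook (for example 'October 2, 2018'). The comparison date is the dateAdded value on those records.

```python
import calendar
from datetime import date, datetime

EMBARGO_MONTHS = 18

def add_months(start, months):
    # Move forward by whole months, clamping the day for shorter months
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def embargo_ends(published, months=EMBARGO_MONTHS):
    # Parse a date string such as 'October 2, 2018'
    pub_date = datetime.strptime(published, '%B %d, %Y').date()
    return add_months(pub_date, months)

# The Zotero records used in this notebook were added on this date
checked_on = date(2020, 10, 20)

# Even an article published on the last day of 2018 is clear of the embargo
for published in ['October 2, 2018', 'December 31, 2018']:
    ends = embargo_ends(published)
    print(f'{published}: embargo ends {ends}, lifted by {checked_on}? {ends <= checked_on}')
```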

My plan is to save details of the articles to Zotero, access the details from the Zotero API, then use the Open Access Button API to look for open access versions.

Import what we need

In [334]:
from pyzotero import zotero
import requests
import time
from IPython.display import JSON, display
import pandas as pd
import altair as alt

Get the list of articles from the Zotero API

To create a list of articles to check I just went to the page for every issue from 2008 to 2018 and used Zotero to save the details of research articles (I haven't included reviews, editorials, or commentaries). You can view the collection of 242 articles in the Zotero web interface.

To access the data for each of these articles, I had to create an API token for read-only access to the collection. I can then use PyZotero to request the list of articles from the Zotero API.

In [188]:
# Details of the public Zotero group into which I've captured article details
# https://www.zotero.org/groups/2589863/australian_history_journals/library
# This key is read-only
ZOTERO_API_KEY = 'FT3a7ByHQCRUpCnEeoKlhhKy'
ZOTERO_GROUP_ID = '2589863'
ZOTERO_LIBRARY_TYPE = 'group'

# Create the Zotero group client
zot = zotero.Zotero(ZOTERO_GROUP_ID, ZOTERO_LIBRARY_TYPE, ZOTERO_API_KEY)
zot.add_parameters(sort='title')

# This is the Australian Historical Studies collection
articles = zot.everything(zot.collection_items_top('922CMTJU'))

Free but not open?

As I was saving the articles into Zotero, I noticed that some had a green tick next to them, indicating that you could access the content without a subscription. These are described as 'free access' articles, rather than 'open access' articles, which have the orange, open padlock icon. The difference is that Open Access articles are both freely available and openly licensed. I don't know why the journal makes some articles 'free'. I did some checking and found that the 'free' articles don't seem to show up in the open access databases. In order to include them with the OA articles, I manually added the 'free' article link to Zotero. As you'll see below, I check for this link before searching for an OA version of each article. So the final results are a combination of the 'free' and OA articles.

Look for OA versions of the articles

Now we're going to see if we can find open access versions of the articles. The code below will get the DOI for each article in our dataset and then look it up using the Open Access Button API. If it finds an OA version, it'll display the title and link, and add the link to the article's metadata.

In [189]:
# Open Access button API endpoint
OA_API_URL = 'https://api.openaccessbutton.org/find'

oa_articles = []
for article in articles:
    # Some articles have been made 'free' by the journal, though they're not open access
    # These aren't included in the OA Button db, so I've added the PDF links to their Zotero records.
    # Here we'll check to see if the article has one of these links.
    for child in zot.children(article['key']):
        if child['data']['title'] == 'Free access PDF':
            article['data']['oadoi'] = child['data']['url']
            article['data']['oa_type'] = 'free access'
            break
    # If there's no free access link, we'll see if there's an OA version
    if not article['data'].get('oadoi'):
        # Search the OA db using the DOI
        response = requests.get(OA_API_URL, params={'id': article['data']['DOI']})
        data = response.json()
        # Try the title if we couldn't find it by DOI
        # if not data['found']:
        #    response = requests.get(OA_API_URL, params={'title': article['data']['title']})
        #    data = response.json()
        # Is there an OA version?
        if data['found']:
            article['data']['oadoi'] = data['found']['oadoi']
            article['data']['licence'] = data['metadata'].get('licence', '')
        # Pause after every API request (not just the successful ones) to be polite
        time.sleep(1)
    if article['data'].get('oadoi'):
        print(f'\n{article["data"]["title"]}')
        print(article['data']['oadoi'])
    oa_articles.append(article['data'])     
A Case of Identity: The Artefacts of the 1770 Kamay (Botany Bay) Encounter
https://www.repository.cam.ac.uk/handle/1810/293268

A Historical Myth? Matthew Flinders and the Quest for a Strait
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2016.1250791?needAccess=true

A Shield Loaded with History: Encounters, Objects and Exhibitions
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2017.1408663

Asian Servants for the Imperial Telegraph: Imagining North Australia as an Indian Ocean Colony before 1914
https://ro.uow.edu.au/cgi/viewcontent.cgi?article=4008&context=lhapapers

Colonial Judiciaries, Aboriginal Protection and South Australia's Policy of Punishing ‘with Exemplary Severity’
http://pdfs.semanticscholar.org/217a/cc68a93d95204f0230edb7256880d7dc92ad.pdf

Galahs
https://openresearch-repository.anu.edu.au/bitstream/1885/47799/4/galahs_long.pdf

‘Habeas Corpus Mongols’—Chinese Litigants and the Politics of Immigration in 1888
https://research-repository.griffith.edu.au/bitstream/10072/63973/1/97921_1.pdf

Mediatisation and Institutions of Public Memory: Digital Storytelling and the Apology
https://eprints.qut.edu.au/32980/1/c32980.pdf

Neither a Discipline nor a Colony: Renaissance and Re-imagination in Economic History
https://ro.uow.edu.au/cgi/viewcontent.cgi?article=4074&context=lhapapers

Re-Routing Empire? Steam-Age Circulations and the Making of an Anglo Pacific, c.1850–90
https://ro.uow.edu.au/cgi/viewcontent.cgi?article=3174&context=lhapapers

Remembering and Fighting for Their Own: Vietnam Veterans and the Long Tan Cross
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2017.1394887?needAccess=true

Rewriting Quarantine: Pacific History at Australia's Edge
https://www.repository.cam.ac.uk/bitstream/1810/250281/1/Bashford%20%26%20Hobbins%202015%20Australian%20Historical%20Studies.pdf

Settler Justice and Aboriginal Homicide in Late Colonial Australia
https://research-repository.griffith.edu.au/bitstream/10072/41848/1/73407_1.pdf

‘Such a Great Space of Water between Us’: Anzac Day in Britain, 1916–39
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2014.912667?needAccess=true

The Place of Anzac in Australian Historical Consciousness
https://opus.lib.uts.edu.au/bitstream/10453/88058/4/7C03BF5A-9BA9-435C-8299-DF822C4D8B10%20am.pdf

The Significance of the Northern Territory in the Formulation of ‘White Australia’ Policies, 1880–1901
https://research-repository.griffith.edu.au/bitstream/10072/390800/2/Fong79965-Accepted.pdf

Transnational Histories of Penal Transportation: Punishment, Labour and Governance in the British Imperial World, 1788–1939
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2016.1203962?needAccess=true

Vida Lahey's Progressive Activism for Children's Art Education
https://research-repository.griffith.edu.au/bitstream/10072/36552/1/66889_1.pdf

Visiting the Neighbours: The Political Meanings of Australian Travel to Cold War Asia
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2013.817450

White Men in Quarantine: Disease, Race, Commerce and Mobility in the Pacific, 1872
https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2017.1293704?needAccess=true
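
The loop above reaches straight into the API response with data['found']['oadoi'], which will raise an error if a key is missing. A slightly more defensive version might look like the sketch below. The function name is hypothetical, and the response shape (a 'found' object holding 'oadoi', plus a 'metadata' object holding 'licence') is assumed from the code above rather than taken from the API documentation.

```python
def extract_oa_link(data):
    # 'found' and 'metadata' may be missing or empty, so fall back to empty dicts
    found = data.get('found') or {}
    metadata = data.get('metadata') or {}
    link = found.get('oadoi')
    if not link:
        return None
    return {'oadoi': link, 'licence': metadata.get('licence', '')}

# Made-up responses for illustration
print(extract_oa_link({'found': {'oadoi': 'https://example.org/paper.pdf'}, 'metadata': {'licence': 'cc-by'}}))
print(extract_oa_link({'found': {}}))
```

The returned dictionary could then be merged into article['data'] with update(), leaving articles without an OA link untouched.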

Convert to a dataframe

Now we'll convert the list of articles into a dataframe for further exploration.

In [238]:
df = pd.DataFrame(oa_articles)
df.head()
Out[238]:
key version itemType title creators abstractNote publicationTitle volume issue pages ... rights extra tags collections relations dateAdded dateModified oadoi licence oa_type
0 4CTNPIKI 26 journalArticle A Case of Identity: The Artefacts of the 1770 ... [{'creatorType': 'author', 'firstName': 'Nicho... Collections of Indigenous artefacts made durin... Australian Historical Studies 49 1 4-27 ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z https://www.repository.cam.ac.uk/handle/1810/2... NaN
1 54PJ4CIG 6 journalArticle A Disenfranchised Grief: Post-war Death and Me... [{'creatorType': 'author', 'firstName': 'Marin... The 1918 Armistice signalled the end of the Fi... Australian Historical Studies 40 1 79-95 ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN
2 6XPGZ57R 11 journalArticle A Dog in the Manger: White Australia and its V... [{'creatorType': 'author', 'firstName': 'Russe... Between the world wars Australia was commonly ... Australian Historical Studies 43 2 157-173 ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN
3 C6XRZRJ3 11 journalArticle ‘A Halo of Protection’: Colonial Protectors an... [{'creatorType': 'author', 'firstName': 'Amand... Scholarship on Australia's colonial protectora... Australian Historical Studies 43 3 396-411 ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN
4 LTEDXQLF 24 journalArticle A Historical Myth? Matthew Flinders and the Qu... [{'creatorType': 'author', 'firstName': 'Kenne... This article takes issue with a recent argumen... Australian Historical Studies 48 1 52-67 ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z https://www.tandfonline.com/doi/pdf/10.1080/10... pd NaN

5 rows × 35 columns

In [239]:
# Check the total number of articles
df.shape
Out[239]:
(242, 35)

All articles with Open Access (or free access) versions

Let's look at all the articles that have an OA link.

In [226]:
# How many articles have OA versions?
df.loc[df['oadoi'].notnull()].shape
Out[226]:
(20, 35)
In [233]:
# Display them
df.loc[df['oadoi'].notnull()][['DOI', 'title', 'date', 'oadoi']]
Out[233]:
DOI title date oadoi
0 10.1080/1031461X.2017.1414862 A Case of Identity: The Artefacts of the 1770 ... January 2, 2018 https://www.repository.cam.ac.uk/handle/1810/2...
4 10.1080/1031461X.2016.1250791 A Historical Myth? Matthew Flinders and the Qu... January 2, 2017 https://www.tandfonline.com/doi/pdf/10.1080/10...
6 10.1080/1031461X.2017.1408663 A Shield Loaded with History: Encounters, Obje... January 2, 2018 https://www.tandfonline.com/doi/pdf/10.1080/10...
26 10.1080/1031461X.2017.1279196 Asian Servants for the Imperial Telegraph: Ima... April 3, 2017 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
57 10.1080/1031461X.2010.493947 Colonial Judiciaries, Aboriginal Protection an... September 1, 2010 http://pdfs.semanticscholar.org/217a/cc68a93d9...
83 10.1080/10314610903067094 Galahs September 1, 2009 https://openresearch-repository.anu.edu.au/bit...
86 10.1080/1031461X.2014.911759 ‘Habeas Corpus Mongols’—Chinese Litigants and ... May 4, 2014 https://research-repository.griffith.edu.au/bi...
122 10.1080/10314611003716861 Mediatisation and Institutions of Public Memor... June 1, 2010 https://eprints.qut.edu.au/32980/1/c32980.pdf
128 10.1080/1031461X.2017.1279197 Neither a Discipline nor a Colony: Renaissance... April 3, 2017 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
151 10.1080/1031461X.2015.1071416 Re-Routing Empire? Steam-Age Circulations and ... September 2, 2015 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
154 10.1080/1031461X.2017.1394887 Remembering and Fighting for Their Own: Vietna... January 2, 2018 https://www.tandfonline.com/doi/pdf/10.1080/10...
159 10.1080/1031461X.2015.1071860 Rewriting Quarantine: Pacific History at Austr... September 2, 2015 https://www.repository.cam.ac.uk/bitstream/181...
165 10.1080/1031461X.2011.560610 Settler Justice and Aboriginal Homicide in Lat... June 1, 2011 https://research-repository.griffith.edu.au/bi...
171 10.1080/1031461X.2014.912667 ‘Such a Great Space of Water between Us’: Anza... May 4, 2014 https://www.tandfonline.com/doi/pdf/10.1080/10...
202 10.1080/1031461X.2016.1250790 The Place of Anzac in Australian Historical Co... January 2, 2017 https://opus.lib.uts.edu.au/bitstream/10453/88...
207 10.1080/1031461X.2018.1515963 The Significance of the Northern Territory in ... October 2, 2018 https://research-repository.griffith.edu.au/bi...
217 10.1080/1031461X.2016.1203962 Transnational Histories of Penal Transportatio... September 1, 2016 https://www.tandfonline.com/doi/pdf/10.1080/10...
224 10.1080/1031461X.2010.493945 Vida Lahey's Progressive Activism for Children... September 1, 2010 https://research-repository.griffith.edu.au/bi...
225 10.1080/1031461X.2013.817450 Visiting the Neighbours: The Political Meaning... September 1, 2013 https://www.tandfonline.com/doi/pdf/10.1080/10...
233 10.1080/1031461X.2017.1293704 White Men in Quarantine: Disease, Race, Commer... April 3, 2017 https://www.tandfonline.com/doi/pdf/10.1080/10...

How many articles have OA versions available?

In [297]:
print(f'{df.loc[df["oadoi"].notnull()].shape[0] / df.shape[0]:.2%} of articles are freely available')
8.26% of articles are freely available

Gold open access articles

The Gold OA articles have links that go back to the Taylor & Francis site, but are not the 'free access' articles I identified manually.

In [243]:
# Set OA type to gold
df.loc[(df['oadoi'].fillna('').str.contains('tandfonline')) & (df['oa_type'] != 'free access'),'oa_type'] = 'gold'
In [292]:
# Number of articles
df.loc[df['oa_type'] == 'gold'].shape
Out[292]:
(5, 36)
In [244]:
df.loc[df['oa_type'] == 'gold'][['DOI', 'title', 'date', 'oadoi']]
Out[244]:
DOI title date oadoi
4 10.1080/1031461X.2016.1250791 A Historical Myth? Matthew Flinders and the Qu... January 2, 2017 https://www.tandfonline.com/doi/pdf/10.1080/10...
154 10.1080/1031461X.2017.1394887 Remembering and Fighting for Their Own: Vietna... January 2, 2018 https://www.tandfonline.com/doi/pdf/10.1080/10...
171 10.1080/1031461X.2014.912667 ‘Such a Great Space of Water between Us’: Anza... May 4, 2014 https://www.tandfonline.com/doi/pdf/10.1080/10...
217 10.1080/1031461X.2016.1203962 Transnational Histories of Penal Transportatio... September 1, 2016 https://www.tandfonline.com/doi/pdf/10.1080/10...
233 10.1080/1031461X.2017.1293704 White Men in Quarantine: Disease, Race, Commer... April 3, 2017 https://www.tandfonline.com/doi/pdf/10.1080/10...

Green Open Access articles

If the OA url doesn't include 'tandfonline' and it's not 'free access', then it looks like it's Green Open Access.

In [246]:
# Set oa_type to 'green'
df.loc[(df['oadoi'].notnull()) & (~df['oadoi'].fillna('').str.contains('tandfonline')) & (df['oa_type'] != 'free access'), 'oa_type'] = 'green'
In [291]:
# Number of articles
df.loc[df['oa_type'] == 'green'].shape
Out[291]:
(13, 36)
In [247]:
df.loc[df['oa_type'] == 'green'][['DOI', 'title', 'date', 'oadoi']]
Out[247]:
DOI title date oadoi
0 10.1080/1031461X.2017.1414862 A Case of Identity: The Artefacts of the 1770 ... January 2, 2018 https://www.repository.cam.ac.uk/handle/1810/2...
26 10.1080/1031461X.2017.1279196 Asian Servants for the Imperial Telegraph: Ima... April 3, 2017 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
57 10.1080/1031461X.2010.493947 Colonial Judiciaries, Aboriginal Protection an... September 1, 2010 http://pdfs.semanticscholar.org/217a/cc68a93d9...
83 10.1080/10314610903067094 Galahs September 1, 2009 https://openresearch-repository.anu.edu.au/bit...
86 10.1080/1031461X.2014.911759 ‘Habeas Corpus Mongols’—Chinese Litigants and ... May 4, 2014 https://research-repository.griffith.edu.au/bi...
122 10.1080/10314611003716861 Mediatisation and Institutions of Public Memor... June 1, 2010 https://eprints.qut.edu.au/32980/1/c32980.pdf
128 10.1080/1031461X.2017.1279197 Neither a Discipline nor a Colony: Renaissance... April 3, 2017 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
151 10.1080/1031461X.2015.1071416 Re-Routing Empire? Steam-Age Circulations and ... September 2, 2015 https://ro.uow.edu.au/cgi/viewcontent.cgi?arti...
159 10.1080/1031461X.2015.1071860 Rewriting Quarantine: Pacific History at Austr... September 2, 2015 https://www.repository.cam.ac.uk/bitstream/181...
165 10.1080/1031461X.2011.560610 Settler Justice and Aboriginal Homicide in Lat... June 1, 2011 https://research-repository.griffith.edu.au/bi...
202 10.1080/1031461X.2016.1250790 The Place of Anzac in Australian Historical Co... January 2, 2017 https://opus.lib.uts.edu.au/bitstream/10453/88...
207 10.1080/1031461X.2018.1515963 The Significance of the Northern Territory in ... October 2, 2018 https://research-repository.griffith.edu.au/bi...
224 10.1080/1031461X.2010.493945 Vida Lahey's Progressive Activism for Children... September 1, 2010 https://research-repository.griffith.edu.au/bi...
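
The gold and green rules above are spread across two dataframe masks. The same heuristic can be written as one plain function, which makes the order of the checks easier to see. This is only a sketch with a hypothetical function name, and it inherits the same assumption as the masks: a link back to tandfonline that isn't 'free access' is treated as gold, and anything else with a link is treated as green.

```python
def classify_oa(oadoi, oa_type=None):
    # Manually flagged 'free access' articles keep their label
    if oa_type == 'free access':
        return 'free access'
    # Missing links show up as NaN in the dataframe, so only accept strings
    if not isinstance(oadoi, str) or not oadoi:
        return None
    # Links back to the publisher's site are assumed to be gold
    if 'tandfonline' in oadoi:
        return 'gold'
    # Everything else is assumed to be a repository copy
    return 'green'

print(classify_oa('https://eprints.qut.edu.au/32980/1/c32980.pdf'))
print(classify_oa('https://www.tandfonline.com/doi/pdf/10.1080/1031461X.2016.1250791?needAccess=true'))
```

It could be applied row by row, for example with df.apply(lambda row: classify_oa(row['oadoi'], row['oa_type']), axis=1).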

'Free access' articles

In [293]:
df.loc[df['oa_type'] == 'free access'][['DOI', 'title', 'date', 'oadoi']]
Out[293]:
DOI title date oadoi
6 10.1080/1031461X.2017.1408663 A Shield Loaded with History: Encounters, Obje... January 2, 2018 https://www.tandfonline.com/doi/pdf/10.1080/10...
225 10.1080/1031461X.2013.817450 Visiting the Neighbours: The Political Meaning... September 1, 2013 https://www.tandfonline.com/doi/pdf/10.1080/10...
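
Putting the three groups side by side, using the counts reported above (13 green, 5 gold, and 2 free access out of 242 articles), gives a sense of how small each slice is. The variable names below are just for illustration.

```python
total_articles = 242
counts = {'green': 13, 'gold': 5, 'free access': 2}

# Share of all articles for each type
for oa_type, count in counts.items():
    print(f'{oa_type}: {count} articles ({count / total_articles:.2%})')

# The three groups together should match the overall figure
available = sum(counts.values())
print(f'total: {available} articles ({available / total_articles:.2%})')
```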

Articles over time

Let's see how the number of articles varies over time. First we'll extract the year from the date string.

In [254]:
# Add a year column by extracting the year from the date column
df['year'] = df['date'].str.extract(r'(\d{4})$').astype(int)

Now let's plot the results.

In [294]:
alt.Chart(df.fillna('$$$')).mark_bar().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('count():Q', title='Number of articles', axis=alt.Axis(tickMinStep=1)),
    color=alt.Color('oa_type:N', scale=alt.Scale(range=['lightgrey', 'blue', 'gold', 'green']), legend=alt.Legend(title='OA type')),
    tooltip=[alt.Tooltip('count():Q', title='Number of articles'), alt.Tooltip('oa_type', title='OA type')]
).properties(width=400)
Out[294]:
[Stacked bar chart of the number of articles per year, coloured by OA type; not reproduced here]

OA articles by repository

We can extract the domain from the OA url to see where the articles come from.

In [266]:
df.loc[df['oadoi'].notnull()]['oadoi'].str.extract(r'^https?://(.*?)/').value_counts()
Out[266]:
www.tandfonline.com                    7
research-repository.griffith.edu.au    4
ro.uow.edu.au                          3
www.repository.cam.ac.uk               2
pdfs.semanticscholar.org               1
opus.lib.uts.edu.au                    1
openresearch-repository.anu.edu.au     1
eprints.qut.edu.au                     1
dtype: int64
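
A regular expression works fine here, but the standard library's URL parser is an alternative that also copes with ports and with URLs that have nothing after the host. Here's a minimal sketch using a few of the links found above. The function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def count_hosts(urls):
    # hostname drops any port number and lower-cases the result
    return Counter(urlparse(url).hostname for url in urls if isinstance(url, str))

sample_urls = [
    'https://ro.uow.edu.au/cgi/viewcontent.cgi?article=4008&context=lhapapers',
    'https://ro.uow.edu.au/cgi/viewcontent.cgi?article=4074&context=lhapapers',
    'https://eprints.qut.edu.au/32980/1/c32980.pdf',
    None,  # articles without an OA link are skipped
]

for host, count in count_hosts(sample_urls).most_common():
    print(host, count)
```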

Why so few?

Only 8.26% of research articles published in Australian Historical Studies between 2008 and 2018 are available in an open access (or free access) version. That's pretty disappointing. Remember too that the embargo period for AHS is 18 months, so everything published up to the end of 2018 could now be open access. So why isn't it? There are a few possible reasons why articles aren't showing up.

  • Perhaps the repositories aren't being properly indexed by the Open Access Button / Unpaywall services. The articles might be available, but missing from our results.
  • There might be records for the articles in repositories, but either the AAM version hasn't been uploaded, or the embargo settings are wrong.
  • Records might not have been added to a repository at all.

One way we might explore this further is to look at another index of content from Australian university repositories – Trove. From Trove we can find how many of the articles are listed in repositories, and do a bit of cross-checking with the Open Access sources.

Search for the articles in Trove

Trove harvests records from all Australian university repositories. In some cases the records will include DOIs, but not always. We'll search for the title of each article first in the article or journal zone (this is now the Research category in the web interface). If that doesn't work we'll try searching for the DOI. There might be multiple records for each article, either because they're held by multiple repositories, or they've been indexed into something like Informit, which also supplies data to Trove. To try and be as thorough as possible, we'll look for repository links in all the matching records. We can get rid of any duplicates later.

In [390]:
TROVE_API_KEY = 'YOUR TROVE API KEY'

# Trove search parameters
params = {
    'key': TROVE_API_KEY,
    'encoding': 'json',
    'zone': 'article',
    'format': 'Article',
    'include': 'workVersions,links'
}

def check_link(link):
    '''
    Filter out links that go back to the journal site (rather than to a repository).
    '''
    return 'tandfonline' not in link and 'doi.org' not in link
    
def add_repo(link, article, work):
    '''
    Add a repository link to the data set.
    '''
    # Add the basic details into a dictionary.
    repo = {'DOI': article['data']['DOI'], 'title': article['data']['title'], 'trove_url': work.get('troveUrl')}
    
    # In most cases link will be a dictionary with a linktype attribute.
    # However, sometimes in version records it can just be a string.
    # Here we'll handle either case.
    if 'linktype' in link:
        # Standardise urls so we can remove duplicates later
        url = link['value'].replace('http:', 'https:')
        repo['link_type'] = link['linktype']
        repo['repo_url'] = url
    else:
        repo['link_type'] = 'unknown'
        repo['repo_url'] = link
    repositories.append(repo)
    
def query_api(params):
    # Query the API
    response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    
    # How many matches?
    total_results = int(data["response"]["zone"][0]["records"]["total"])
    
    return total_results, data
    
repositories = []    

for article in articles:
    
    # Set the q parameter to the title of the article (note use of 'title:' to search the title field.)
    params['q'] = f'title:"{article["data"]["title"]}"'
    total_results, data = query_api(params)

    if total_results == 0:
        
        # Try searching for the DOI
        params['q'] = f'"{article["data"]["DOI"]}"'
        total_results, data = query_api(params)
    
    # In some cases the title is not very specific and returns lots of results (eg 'Galahs').
    # Let's try limiting the results further by adding a 'creator:' parameter
    if total_results > 4:
        
        # Add the first author's surname to the query
        params['q'] = f'title:"{article["data"]["title"]}" creator:{article["data"]["creators"][0]["lastName"]}'
        total_results, data = query_api(params)
    
    # If there are still too many results, we'll just flag the title to look at later
    if total_results > 4:
        print(f'Too many choices! - {article["data"]["title"]}')
    else:
        # Get a list of the matching works, if any
        try:
            works = data['response']['zone'][0]['records']['work']
        except KeyError:
            pass
        else:
            
            # Loop through the works
            for work in works:
                
                # Repository links can be at the aggregated 'work' level, or in individual version records.
                # We'll try and get them all and remove any duplicates later.
                # First check links at the work level
                for link in work.get('identifier', []):
                    if check_link(link['value']) is True:
                        add_repo(link, article, work)
                        
                # Then loop through each version
                for version in work['version']:
                    for link in version.get('identifier', []):
                        if check_link(link['value']) is True:
                            add_repo(link, article, work)
                    
                    # Version metadata can be nested under 'metadata', so check there as well
                    if 'metadata' in version:
                        for link in version['metadata'].get('identifier', []):
                            if check_link(link['value']) is True:
                                add_repo(link, article, work)
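
One thing to watch in the code above: add_repo allows for links that are plain strings, but the loops call link['value'] before add_repo is reached, which would fail on a string. A small helper that accepts either form is one way around this. The sketch below uses hypothetical function names and applies the same filtering rule as check_link.

```python
def link_value(link):
    # Links are usually dictionaries with a 'value' key, but can be plain strings
    if isinstance(link, dict):
        return link.get('value', '')
    return link or ''

def is_repository_link(link):
    # Same rule as check_link: ignore links back to the journal or DOI resolver
    url = link_value(link)
    return bool(url) and 'tandfonline' not in url and 'doi.org' not in url

print(is_repository_link({'value': 'https://hdl.handle.net/2440/74425', 'linktype': 'notonline'}))
print(is_repository_link('https://doi.org/10.1080/1031461X.2012.706621'))
```

In the loops, `if is_repository_link(link):` could then replace `if check_link(link['value']) is True:`.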

Convert to a dataframe and remove any duplicates

In [391]:
df_repos = pd.DataFrame(repositories)
df_repos.drop_duplicates(inplace=True)
df_repos.head()
Out[391]:
DOI title trove_url link_type repo_url
0 10.1080/10314610802663035 A Disenfranchised Grief: Post-war Death and Me... https://trove.nla.gov.au/work/66604896 restricted https://hdl.handle.net/1959.9/471872
2 10.1080/1031461X.2011.640695 A Dog in the Manger: White Australia and its V... https://trove.nla.gov.au/work/169309446 restricted https://researchonline.jcu.edu.au/22338/
4 10.1080/1031461X.2012.706621 ‘A Halo of Protection’: Colonial Protectors an... https://trove.nla.gov.au/work/173739658 notonline https://hdl.handle.net/2440/74425
6 10.1080/1031461X.2012.760636 A House Committee on Un-Australian Activities?... https://trove.nla.gov.au/work/181447421 restricted https://vuir.vu.edu.au/24499/
8 10.1080/1031461X.2014.996574 ‘Accurate to the Point of Mania’: Eyewitness T... https://trove.nla.gov.au/work/201475554 notonline https://hdl.handle.net/1885/63982

How many links do we have?

In [392]:
df_repos.shape
Out[392]:
(161, 5)

However, it's possible that we might have multiple links for a single article. Let's look at how many unique DOIs there are in this dataset.

In [393]:
df_repos['DOI'].unique().shape
Out[393]:
(128,)
In [394]:
print(f'{df_repos["DOI"].unique().shape[0] / df.shape[0]:.2%} of articles have records in university repositories (according to Trove)')
52.89% of articles have records in university repositories (according to Trove)

So about half of the articles have records in repositories. Or to put it another way – more than 40% of articles are listed in repositories, but don't provide AAM versions for download.

When it harvests records from repositories, Trove tries to figure out how accessible things actually are. It assigns a linktype based on this assessment – fulltext, restricted, or notonline. As you might have guessed, fulltext indicates that an item is available for download or viewing online.

In [395]:
df_repos['link_type'].value_counts()
Out[395]:
notonline     102
restricted     32
fulltext       26
thumbnail       1
Name: link_type, dtype: int64

However, there may be duplicates. Let's see how many unique DOIs have 'fulltext' links.

In [396]:
df_repos.loc[df_repos['link_type'] == 'fulltext']['DOI'].unique().shape
Out[396]:
(16,)

So Trove seems to think there are 16 articles that are freely available online. How does this compare to the list of Open Access articles we've already found? Let's combine our two datasets to find out.

Merging datasets

Here we'll combine the datasets using the DOI and title fields to link them.

In [397]:
df_all = pd.merge(df_repos, df, how='outer', on=['DOI', 'title'])
df_all.head()
Out[397]:
DOI title trove_url link_type repo_url key version itemType creators abstractNote ... extra tags collections relations dateAdded dateModified oadoi licence oa_type year
0 10.1080/10314610802663035 A Disenfranchised Grief: Post-war Death and Me... https://trove.nla.gov.au/work/66604896 restricted https://hdl.handle.net/1959.9/471872 54PJ4CIG 6 journalArticle [{'creatorType': 'author', 'firstName': 'Marin... The 1918 Armistice signalled the end of the Fi... ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN 2009
1 10.1080/1031461X.2011.640695 A Dog in the Manger: White Australia and its V... https://trove.nla.gov.au/work/169309446 restricted https://researchonline.jcu.edu.au/22338/ 6XPGZ57R 11 journalArticle [{'creatorType': 'author', 'firstName': 'Russe... Between the world wars Australia was commonly ... ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN 2012
2 10.1080/1031461X.2012.706621 ‘A Halo of Protection’: Colonial Protectors an... https://trove.nla.gov.au/work/173739658 notonline https://hdl.handle.net/2440/74425 C6XRZRJ3 11 journalArticle [{'creatorType': 'author', 'firstName': 'Amand... Scholarship on Australia's colonial protectora... ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN 2012
3 10.1080/1031461X.2012.760636 A House Committee on Un-Australian Activities?... https://trove.nla.gov.au/work/181447421 restricted https://vuir.vu.edu.au/24499/ QWZDSVVC 14 journalArticle [{'creatorType': 'author', 'firstName': 'Lachl... Legislation introduced by Prime Minister Rober... ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN 2013
4 10.1080/1031461X.2014.996574 ‘Accurate to the Point of Mania’: Eyewitness T... https://trove.nla.gov.au/work/201475554 notonline https://hdl.handle.net/1885/63982 7WTYHC34 19 journalArticle [{'creatorType': 'author', 'firstName': 'Marga... The collection of official war art housed in t... ... Publisher: Routledge\n_eprint: https://doi.org... [] [922CMTJU] {} 2020-10-20T02:22:30Z 2020-10-20T02:22:30Z NaN NaN NaN 2015

5 rows × 39 columns

In [398]:
df_all.shape
Out[398]:
(275, 39)
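
The row count is worth a quick sanity check. An outer merge keeps every repository link and adds one row for each article that has no link at all, so the total can be worked out from the numbers reported above. This assumes every row in df_repos matches an article in df, which should hold because both were built from the same Zotero records.

```python
total_articles = 242        # rows in df
link_rows = 161             # rows in df_repos after removing duplicates
articles_with_links = 128   # unique DOIs in df_repos

# Articles with no repository record each contribute a single row
articles_without_links = total_articles - articles_with_links
expected_rows = link_rows + articles_without_links

print(f'{articles_without_links} articles without links')
print(f'{expected_rows} rows expected in the merged dataframe')
```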

Articles in repositories

Now we have a dataset of articles that are in university repositories, as reported by Trove. Let's analyse them as we did above.

First we'll look at the number of articles per year that have records in repositories.

In [399]:
alt.Chart(df_all.drop_duplicates(subset='DOI').fillna('-')).mark_bar().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('count():Q', title='Number of articles', axis=alt.Axis(tickMinStep=1)),
    color=alt.Color('link_type:N', scale=alt.Scale(range=['lightgrey', 'blue', 'red', 'orange']), legend=alt.Legend(title='Link type')),
    tooltip=[alt.Tooltip('count()', title='Number of articles'), alt.Tooltip('link_type', title='Link type')]
).properties(width=400)
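Out[399]:
[Stacked bar chart of the number of articles per year, coloured by Trove link type; not reproduced here]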

Oddly, there seems to have been a drop in the proportion of articles being added to repositories.
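
A caveat on the chart: drop_duplicates(subset='DOI') keeps the first row for each DOI, so an article with several links is coloured by whichever link happened to come first. One alternative is to keep the most accessible link type for each article. The sketch below shows the idea with plain Python and made-up DOIs. The function name and the priority order (fulltext, then restricted, then notonline) are my own assumptions.

```python
LINK_PRIORITY = ['fulltext', 'restricted', 'notonline']

def best_link_types(rows):
    # rows is an iterable of (doi, link_type) pairs
    best = {}
    for doi, link_type in rows:
        # Unknown link types rank below everything in the priority list
        rank = LINK_PRIORITY.index(link_type) if link_type in LINK_PRIORITY else len(LINK_PRIORITY)
        if doi not in best or rank < best[doi][0]:
            best[doi] = (rank, link_type)
    return {doi: link_type for doi, (rank, link_type) in best.items()}

# Made-up rows for illustration
rows = [
    ('10.1234/a', 'notonline'),
    ('10.1234/a', 'fulltext'),
    ('10.1234/b', 'restricted'),
    ('10.1234/b', 'notonline'),
    ('10.1234/c', 'thumbnail'),
]

print(best_link_types(rows))
```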

Let's look at the breakdown by repository.

In [400]:
df_all.drop_duplicates(subset='DOI')['repo_url'].str.extract(r'^https?://(.*?)/').value_counts()
Out[400]:
hdl.handle.net                       68
espace.library.uq.edu.au             12
research-repository.uwa.edu.au       10
researchers.mq.edu.au                 9
ro.uow.edu.au                         6
ecite.utas.edu.au                     5
researchonline.federation.edu.au      4
vuir.vu.edu.au                        2
researchoutputs.unisa.edu.au          2
researchonline.jcu.edu.au             2
researchbank.rmit.edu.au              2
eprints.usq.edu.au                    2
researchrepository.murdoch.edu.au     1
hdl.cqu.edu.au                        1
handle.uws.edu.au:8081                1
handle.unsw.edu.au                    1
dtype: int64
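
The regular expression above expects a slash after the host name, so an address with no path would be skipped. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can pull out the host directly. The helper names `host_of` and `count_hosts` are my own, and the sample addresses are the three repository links visible in the table rows above.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_of(url):
    """Return the host (and port, if any) of a url, or None for missing values."""
    if not isinstance(url, str):
        # NaN and None end up here
        return None
    return urlsplit(url).netloc or None

def count_hosts(urls):
    """Count how many urls point at each host, ignoring missing values."""
    return Counter(host for host in map(host_of, urls) if host)

urls = [
    "https://hdl.handle.net/2440/74425",
    "https://hdl.handle.net/1885/63982",
    "https://vuir.vu.edu.au/24499/",
    None,
]
counts = count_hosts(urls)
```

Calling `counts.most_common()` lists the hosts from most to least frequent, much like `value_counts()`.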

Let's see if we can disambiguate those handle.net links. The code below tries to resolve the handle links, grabbing the address at the end of the redirects.

In [401]:
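def get_redirected_url(url):
    """Follow a handle.net link through its redirects and return the final address."""
    # Only handle.net links need resolving; leave missing values and other urls alone
    if not pd.isna(url) and 'handle.net' in url:
        try:
            response = requests.get(url, timeout=60)
        except requests.exceptions.Timeout:
            # Report the handle that timed out, then fall through and return it unchanged
            print(url)
        else:
            return response.url
    return url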

df_all['redirected_url'] = df_all['repo_url'].apply(get_redirected_url)
https://hdl.handle.net/1959.8/115505
https://hdl.handle.net/1959.8/151729

As you can see above, two handle addresses timed out before they could be resolved. If you click on them you'll see they go to UniSA. I'm not sure what the problem is.
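
One possible workaround, sketched here on the assumption that the Handle.Net proxy's REST endpoint (`https://hdl.handle.net/api/handles/<handle>`) returns a JSON record with a `URL` value, is to ask the proxy where a handle points instead of following the redirects. That way a slow repository server can't cause a timeout. The function names are hypothetical, and I haven't run this against the two handles above.

```python
import json
from urllib.parse import urlsplit
from urllib.request import urlopen

def handle_api_url(handle_url):
    """Turn https://hdl.handle.net/<prefix>/<suffix> into the proxy's REST address."""
    handle = urlsplit(handle_url).path.lstrip("/")
    return f"https://hdl.handle.net/api/handles/{handle}"

def target_from_record(record):
    """Pull the registered url out of a handle record, or None if there isn't one."""
    for value in record.get("values", []):
        if value.get("type") == "URL":
            return value.get("data", {}).get("value")
    return None

def lookup_handle(handle_url, timeout=60):
    """Fetch the handle record and return its registered url (makes a network request)."""
    with urlopen(handle_api_url(handle_url), timeout=timeout) as response:
        return target_from_record(json.load(response))

api_url = handle_api_url("https://hdl.handle.net/1959.8/115505")
```

The registered url is only the first hop. A repository may redirect further, so the host may differ from what `requests` reports after following every redirect.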

Let's break down the repository details by the redirected URLs.

In [402]:
df_all.drop_duplicates(subset='DOI')['redirected_url'].str.extract(r'^https?://(.*?)/').value_counts()
Out[402]:
openresearch-repository.anu.edu.au     19
arrow.latrobe.edu.au:8080              15
espace.library.uq.edu.au               12
dro.deakin.edu.au                      10
research-repository.uwa.edu.au         10
researchers.mq.edu.au                   9
rune.une.edu.au                         7
ro.uow.edu.au                           6
ecite.utas.edu.au                       5
research-repository.griffith.edu.au     5
digital.library.adelaide.edu.au         4
researchonline.federation.edu.au        4
ogma.newcastle.edu.au:443               3
eprints.usq.edu.au                      2
vuir.vu.edu.au                          2
researchbank.rmit.edu.au                2
researchbank.swinburne.edu.au           2
researchonline.jcu.edu.au               2
researchoutputs.unisa.edu.au            2
minerva-access.unimelb.edu.au           1
hdl.cqu.edu.au                          1
handle.uws.edu.au:8081                  1
handle.unsw.edu.au                      1
researchdirect.westernsydney.edu.au     1
researchrepository.murdoch.edu.au       1
opus.lib.uts.edu.au                     1
dtype: int64

If you compare this to the breakdown of OA articles by repository you'll see, for example, that the ANU repository has records for 19 articles, but only one of those records offers a Green OA version for download.

Are 'fulltext' articles really open?

Now let's compare what we found in Trove with the links we found using the OA Button API.

First of all let's look for articles with 'fulltext' links for which we've already found an OA version.

In [403]:
df_all.loc[(df_all['link_type'] == 'fulltext') & (df_all['oadoi'].notnull())][['DOI', 'title', 'date', 'oadoi', 'redirected_url', 'link_type']]
Out[403]:
     DOI                            title                                              date               oadoi                                              redirected_url                                     link_type
56   10.1080/1031461X.2014.911759   ‘Habeas Corpus Mongols’—Chinese Litigants and ...  May 4, 2014        https://research-repository.griffith.edu.au/bi...  https://research-repository.griffith.edu.au/ha...  fulltext
78   10.1080/10314611003716861      Mediatisation and Institutions of Public Memor...  June 1, 2010       https://eprints.qut.edu.au/32980/1/c32980.pdf      https://eprints.usq.edu.au/6715/                   fulltext
109  10.1080/1031461X.2011.560610   Settler Justice and Aboriginal Homicide in Lat...  June 1, 2011       https://research-repository.griffith.edu.au/bi...  https://research-repository.griffith.edu.au/ha...  fulltext
141  10.1080/1031461X.2018.1515963  The Significance of the Northern Territory in ...  October 2, 2018    https://research-repository.griffith.edu.au/bi...  https://research-repository.griffith.edu.au/ha...  fulltext
150  10.1080/1031461X.2010.493945   Vida Lahey's Progressive Activism for Children...  September 1, 2010  https://research-repository.griffith.edu.au/bi...  https://research-repository.griffith.edu.au/ha...  fulltext

So the four Green OA articles in the Griffith repository also show up as 'fulltext' links in Trove. Yay! That's how things are meant to work.

The other article seems a bit odd because the repository link goes to USQ, while the OA link goes to QUT. But this is actually OK: there are Green OA versions of this article in two repositories, and the Open Access Button API only gives us one of them. You can check that they're both there by using the Unpaywall API directly. Look under oa_locations in the results below.

In [404]:
response = requests.get('https://api.unpaywall.org/v2/10.1080/[email protected]')
display(JSON(response.json()))
<IPython.core.display.JSON object>
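
If you'd rather not dig through the JSON viewer, here's a rough sketch of pulling out just the locations. Unpaywall's v2 endpoint takes the DOI in the path and an `email` query parameter. The address `you@example.com` is a placeholder, the helper names are my own, and `sample` is a hand-made stand-in for the real response that reuses the two links from the table above.

```python
from urllib.parse import quote

def unpaywall_url(doi, email):
    """Build an Unpaywall v2 lookup address for a DOI."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"

def list_oa_locations(result):
    """Return (host_type, url) pairs for each entry under oa_locations."""
    return [
        (location.get("host_type"), location.get("url"))
        for location in result.get("oa_locations") or []
    ]

# Hypothetical, trimmed-down response: the field names follow Unpaywall's schema,
# but the values are an assumption based on the links in the table above
sample = {
    "doi": "10.1080/10314611003716861",
    "oa_locations": [
        {"host_type": "repository", "url": "https://eprints.qut.edu.au/32980/1/c32980.pdf"},
        {"host_type": "repository", "url": "https://eprints.usq.edu.au/6715/"},
    ],
}
locations = list_oa_locations(sample)
```

With a real response you'd pass `response.json()` to `list_oa_locations` instead of `sample`.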

But why aren't both locations of this article showing up in our dataset? Let's