#!/usr/bin/env python
# coding: utf-8

# # Finding Open Access versions of articles in *Australian Historical Studies*
#
# Open access isn't just what historians expect from GLAM organisations, it's also what we do with the products of our research. [*Australian Historical Studies*](https://www.tandfonline.com/toc/rahs20/current) is one of the major journals for Australian historians. How much of it is accessible to researchers without the luxury of an institutional subscription?
#
# AHS is published by Taylor & Francis, and under their terms and conditions there are two ways articles can be made openly accessible:
#
# * The author can pay an article publishing charge (APC) to make the article open immediately upon publication. The APC is currently set at $3775. This is known as **Gold** Open Access.
# * The author can share the Author Accepted Manuscript (AAM) version of their article. The AAM is the version after peer review, but before copy-editing and typesetting. It can be shared immediately on the author's personal website or, after an 18 month embargo, it can be uploaded to an institutional or subject repository. This is known as **Green** Open Access.
#
# If Green Open Access versions are uploaded to a recognised repository, they become findable. Tools such as the [Open Access Button](https://openaccessbutton.org/) and [Unpaywall](https://unpaywall.org/) can redirect you from a paywalled version to an open access alternative. If you use Zotero to save articles from a journal, it'll automatically look for Open Access versions via Unpaywall if no free downloads are available. Green Open Access costs nothing, and it opens your work to new audiences and new uses.
#
# So how many authors are taking advantage of Green Open Access arrangements? Let's have a look and see.
#
# ## The dataset
#
# For this little experiment I'm going to look at eleven years of articles, from 2008 to 2018. I'm finishing in 2018 because it's outside of the 18 month embargo period. Everything published in 2018 or before **can be made open access**. I'm focusing on research articles, excluding editorials, reviews, and commentaries.
#
# My plan is to save details of the articles to Zotero, access the details from the Zotero API, then use the Open Access Button API to look for open access versions.

# ## Import what we need

# In[334]:

from pyzotero import zotero
import requests
import time
from IPython.display import JSON, display
import pandas as pd
import altair as alt

# ## Get the list of articles from the Zotero API
#
# To create a list of articles to check I just went to the page for every issue from 2008 to 2018 and used Zotero to save the details of research articles (I haven't included reviews, editorials, or commentaries). You can view the [collection of 242 articles](https://www.zotero.org/groups/2589863/australian_history_journals/collections/922CMTJU) in the Zotero web interface.
#
# To access the data for each of these articles, I had to create an API token with read-only access to the collection. I can then use [PyZotero](https://pyzotero.readthedocs.io/en/latest/) to request the list of articles from the [Zotero API](https://www.zotero.org/support/dev/web_api/v3/start).
# In[188]:

# Details of the public Zotero group into which I've captured article details
# https://www.zotero.org/groups/2589863/australian_history_journals/library
# This key is read-only
ZOTERO_API_KEY = 'FT3a7ByHQCRUpCnEeoKlhhKy'
ZOTERO_GROUP_ID = '2589863'
ZOTERO_LIBRARY_TYPE = 'group'

# Create the Zotero group client
zot = zotero.Zotero(ZOTERO_GROUP_ID, ZOTERO_LIBRARY_TYPE, ZOTERO_API_KEY)
zot.add_parameters(sort='title')

# This is the Australian Historical Studies collection
articles = zot.everything(zot.collection_items_top('922CMTJU'))

# ## Free but not open?
#
# As I was saving the articles into Zotero, I noticed that some had a green tick next to them, indicating that you could access the content without a subscription. These are described as 'free access' articles, rather than 'open access' articles, which have the orange, open padlock icon. The difference is that Open Access articles are both freely available **and** openly licensed. I don't know why the journal makes some articles 'free'. I did some checking and found that the 'free' articles don't seem to show up in the open access databases. In order to include them with the OA articles, I manually added the 'free' article link to Zotero. As you'll see below, I check for this link before searching for an OA version of each article. So the final results are a combination of the 'free' and OA articles.

# ## Look for OA versions of the articles
#
# Now we're going to see if we can find open access versions of the articles. The code below will get the DOI for each article in our dataset and then look it up using the Open Access Button API. If it finds an OA version, it'll display the title and link, and add the link to the article's metadata.

# In[189]:

# Open Access Button API endpoint
OA_API_URL = 'https://api.openaccessbutton.org/find'

oa_articles = []
for article in articles:
    # Some articles have been made 'free' by the journal, though they're not open access.
    # These aren't included in the OA Button db, so I've added the PDF links to their Zotero records.
    # Here we'll check to see if the article has one of these links.
    for child in zot.children(article['key']):
        if child['data']['title'] == 'Free access PDF':
            article['data']['oadoi'] = child['data']['url']
            article['data']['oa_type'] = 'free access'
            break
    # If there's no free access link, we'll see if there's an OA version
    if not article['data'].get('oadoi'):
        # Search the OA db using the DOI
        response = requests.get(OA_API_URL, params={'id': article['data']['DOI']})
        data = response.json()
        # Try the title if we couldn't find it by DOI
        # if not data['found']:
        #     response = requests.get(OA_API_URL, params={'title': article['data']['title']})
        #     data = response.json()
        # Is there an OA version?
        if data['found']:
            article['data']['oadoi'] = data['found']['oadoi']
            article['data']['licence'] = data['metadata'].get('licence', '')
        time.sleep(1)
    # Display the articles that have links
    if article['data'].get('oadoi'):
        print(f'\n{article["data"]["title"]}')
        print(article['data']['oadoi'])
    # Save every article (with or without an OA link) so we can calculate proportions later
    oa_articles.append(article['data'])

# ## Convert to a dataframe
#
# Now we'll convert the list of articles into a dataframe for further exploration.

# In[238]:

df = pd.DataFrame(oa_articles)
df.head()

# In[239]:

# Check the total number of articles
df.shape

# ## All articles with Open Access (or free access) versions
#
# Let's look at all the articles that have an OA link.

# In[226]:

# How many articles have OA versions?
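# ('oadoi' is set for both the manually added 'free access' links and the OA
# versions found via the OA Button, so counting non-null values gives us
# everything that's freely available in some form.)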
df.loc[df['oadoi'].notnull()].shape

# In[233]:

# Display them
df.loc[df['oadoi'].notnull()][['DOI', 'title', 'date', 'oadoi']]

# What proportion of articles have OA versions available?

# In[297]:

print(f'{df.loc[df["oadoi"].notnull()].shape[0] / df.shape[0]:.2%} of articles are freely available')

# ## Gold Open Access articles
#
# The Gold OA articles have links that go back to the Taylor & Francis site, but are not the 'free access' articles I identified manually.

# In[243]:

# Set oa_type to 'gold'
df.loc[(df['oadoi'].fillna('').str.contains('tandfonline')) & (df['oa_type'] != 'free access'), 'oa_type'] = 'gold'

# In[292]:

# Number of articles
df.loc[df['oa_type'] == 'gold'].shape

# In[244]:

df.loc[df['oa_type'] == 'gold'][['DOI', 'title', 'date', 'oadoi']]

# ## Green Open Access articles
#
# If the OA url doesn't include 'tandfonline' and it's not 'free access', then it looks like it's Green Open Access.

# In[246]:

# Set oa_type to 'green'
df.loc[(df['oadoi'].notnull()) & (~df['oadoi'].fillna('').str.contains('tandfonline')) & (df['oa_type'] != 'free access'), 'oa_type'] = 'green'

# In[291]:

# Number of articles
df.loc[df['oa_type'] == 'green'].shape

# In[247]:

df.loc[df['oa_type'] == 'green'][['DOI', 'title', 'date', 'oadoi']]

# ## 'Free access' articles

# In[293]:

df.loc[df['oa_type'] == 'free access'][['DOI', 'title', 'date', 'oadoi']]

# ## Articles over time
#
# Let's see how the number of articles varies over time. First we'll extract the `year` from the date string.

# In[254]:

# Add a year column by extracting the year from the date column
df['year'] = df['date'].str.extract(r'(\d{4})$').astype(int)

# Now let's plot the results.

# In[294]:

# Filling the NaN values in 'oa_type' with '$$$' makes the paywalled articles
# sort first, so they line up with 'lightgrey' in the colour range.
alt.Chart(df.fillna('$$$')).mark_bar().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('count():Q', title='Number of articles', axis=alt.Axis(tickMinStep=1)),
    color=alt.Color('oa_type:N', scale=alt.Scale(range=['lightgrey', 'blue', 'gold', 'green']), legend=alt.Legend(title='OA type')),
    tooltip=[alt.Tooltip('count():Q', title='Number of articles'), alt.Tooltip('oa_type', title='OA type')]
).properties(width=400)

# ## OA articles by repository
#
# We can extract the domain from the OA url to see where the articles come from.

# In[266]:

df.loc[df['oadoi'].notnull()]['oadoi'].str.extract(r'^https*://(.*?)/').value_counts()

# ## Why so few?
#
# Only 8.26% of research articles published in *Australian Historical Studies* between 2008 and 2018 are available in an open access version. That's pretty disappointing. Remember too that the embargo period for AHS is 18 months, so everything published up to the end of 2018 **could** now be open access. So why aren't they? There are a few possible reasons why they're not showing up:
#
# * Perhaps the repositories aren't being properly indexed by the Open Access Button / Unpaywall services. The articles might be available, but missing from our results.
# * There might be records for the articles in repositories, but either the AAM version hasn't been uploaded, or the embargo settings are wrong.
# * Records might not have been added to a repository at all.
#
# One way we might explore this further is to look at another index of content from Australian university repositories – Trove. From Trove we can find how many of the articles are listed in repositories, and do a bit of cross-checking with the Open Access sources.

# ## Search for the articles in Trove
#
# Trove harvests records from all Australian university repositories. In some cases the records will include DOIs, but not always.
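#
# To get a feel for the Trove API before running the full harvest, here's a minimal sketch of a single search. This is an illustrative query only: it assumes you have your own Trove API key, but the endpoint, `zone`, and `q` syntax are the same as in the harvesting code below ('Galahs', one of the less specific article titles in the dataset, makes another appearance there too).

# In[ ]:

# A minimal Trove API query – search the 'article' zone for a title
demo_params = {
    'key': 'YOUR TROVE API KEY',
    'encoding': 'json',
    'zone': 'article',
    'q': 'title:"Galahs"'
}
response = requests.get('https://api.trove.nla.gov.au/v2/result', params=demo_params)
data = response.json()

# The number of matches is nested under response -> zone -> records
print(data['response']['zone'][0]['records']['total'])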
# We'll search for the title of each article first in Trove's `article` (journal) zone (this is now the Research category in the web interface). If that doesn't work we'll try searching for the DOI. There might be multiple records for each article, either because they're held by multiple repositories, or because they've been indexed into something like Informit, which also supplies data to Trove. To be as thorough as possible, we'll look for repository links in all the matching records. We can get rid of any duplicates later.

# In[390]:

TROVE_API_KEY = 'YOUR TROVE API KEY'

# Trove search parameters
params = {
    'key': TROVE_API_KEY,
    'encoding': 'json',
    'zone': 'article',
    'format': 'Article',
    'include': 'workVersions,links'
}

def check_link(link):
    '''
    Filter out links that go back to the journal site (rather than to a repository).
    '''
    return 'tandfonline' not in link and 'doi.org' not in link

def add_repo(link, article, work):
    '''
    Add a repository link to the dataset.
    '''
    # Add the basic details into a dictionary
    repo = {'DOI': article['data']['DOI'], 'title': article['data']['title'], 'trove_url': work.get('troveUrl')}
    # In most cases link will be a dictionary with a linktype attribute.
    # However, sometimes in version records it can just be a string.
    # Here we'll handle either case.
    if 'linktype' in link:
        # Standardise urls so we can remove duplicates later
        url = link['value'].replace('http:', 'https:')
        repo['link_type'] = link['linktype']
        repo['repo_url'] = url
    else:
        repo['link_type'] = 'unknown'
        repo['repo_url'] = link
    repositories.append(repo)

def query_api(params):
    # Query the API
    response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    # How many matches?
    total_results = int(data["response"]["zone"][0]["records"]["total"])
    return total_results, data

repositories = []
for article in articles:
    # Set the q parameter to the title of the article (note use of 'title:' to search the title field)
    params['q'] = f'title:"{article["data"]["title"]}"'
    total_results, data = query_api(params)
    if total_results == 0:
        # Try searching for the DOI
        params['q'] = f'"{article["data"]["DOI"]}"'
        total_results, data = query_api(params)
    # In some cases the title is not very specific and returns lots of results (eg 'Galahs').
    # Let's try limiting the results further by adding a 'creator:' parameter.
    if total_results > 4:
        # Add the first author's surname to the query
        params['q'] = f'title:"{article["data"]["title"]}" creator:{article["data"]["creators"][0]["lastName"]}'
        total_results, data = query_api(params)
    # If there's still too many results, we'll just flag the title to look at later
    if total_results > 4:
        print(f'Too many choices! - {article["data"]["title"]}')
    else:
        # Get a list of the matching works, if any
        try:
            works = data['response']['zone'][0]['records']['work']
        except KeyError:
            pass
        else:
            # Loop through the works
            for work in works:
                # Repository links can be at the aggregated 'work' level, or in individual version records.
                # We'll try and get them all and remove any duplicates later.
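                # (A Trove 'work' is an aggregated record that can bundle several 'versions'
                # of the same article harvested from different sources; the 'include'
                # parameter above asks for both the version records and their links.)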
                # First check links at the work level
                for link in work.get('identifier', []):
                    if check_link(link['value']) is True:
                        add_repo(link, article, work)
                # Then loop through each version
                for version in work['version']:
                    for link in version.get('identifier', []):
                        if check_link(link['value']) is True:
                            add_repo(link, article, work)
                    # Version metadata can be nested under 'metadata', so check there as well
                    if 'metadata' in version:
                        for link in version['metadata'].get('identifier', []):
                            if check_link(link['value']) is True:
                                add_repo(link, article, work)

# ## Convert to a dataframe and remove any duplicates

# In[391]:

df_repos = pd.DataFrame(repositories)
df_repos.drop_duplicates(inplace=True)
df_repos.head()

# How many links do we have?

# In[392]:

df_repos.shape

# However, it's possible that we might have multiple links for a single article. Let's look at how many unique DOIs there are in this dataset.

# In[393]:

df_repos['DOI'].unique().shape

# In[394]:

print(f'{df_repos["DOI"].unique().shape[0] / df.shape[0]:.2%} of articles have records in university repositories (according to Trove)')

# So about half of the articles have records in repositories. Or to put it another way – about 40% of articles are listed in repositories, but don't provide AAM versions for download.

# ## Types of links
#
# When it harvests records from repositories, Trove tries to figure out how accessible things actually are. It assigns a `linktype` based on this assessment – `fulltext`, `restricted`, or `notonline`. As you might have guessed, `fulltext` indicates that an item is available for download or viewing online.

# In[395]:

df_repos['link_type'].value_counts()

# However, there may be duplicates. Let's see how many unique DOIs have 'fulltext' links.

# In[396]:

df_repos.loc[df_repos['link_type'] == 'fulltext']['DOI'].unique().shape

# So Trove seems to think there are 16 articles that are freely available online. How does this compare to the list of Open Access articles we've already found? Let's combine our two datasets to find out.

# ## Merging datasets
#
# Here we'll combine the datasets using the `DOI` field to link them.

# In[397]:

df_all = pd.merge(df_repos, df, how='outer', on=['DOI', 'title'])
df_all.head()

# In[398]:

df_all.shape

# ## Articles in repositories
#
# Now we have a dataset of articles that are in university repositories, as reported by Trove. Let's analyse them as we did above.
#
# First we'll look at the number of articles per year that have records in repositories.

# In[399]:

alt.Chart(df_all.drop_duplicates(subset='DOI').fillna('-')).mark_bar().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('count():Q', title='Number of articles', axis=alt.Axis(tickMinStep=1)),
    color=alt.Color('link_type:N', scale=alt.Scale(range=['lightgrey', 'blue', 'red', 'orange']), legend=alt.Legend(title='Link type')),
    tooltip=[alt.Tooltip('count()', title='Number of articles'), alt.Tooltip('link_type', title='Link type')]
).properties(width=400)

# Oddly, there seems to have been a drop in the proportion of articles being added to repositories.
#
# Let's look at the breakdown by repository.

# In[400]:

df_all.drop_duplicates(subset='DOI')['repo_url'].str.extract(r'^https*://(.*?)/').value_counts()

# Let's see if we can disambiguate those `handle.net` links. The code below tries to resolve the handle links, grabbing the address at the end of the redirects.
# In[401]:

def get_redirected_url(url):
    '''
    Try to resolve handle.net links, returning the url at the end of the redirects.
    '''
    if not pd.isna(url) and 'handle.net' in url:
        try:
            response = requests.get(url, timeout=60)
        except requests.exceptions.Timeout:
            # Print any urls that fail to resolve
            print(url)
        else:
            return response.url
    return url

df_all['redirected_url'] = df_all['repo_url'].apply(get_redirected_url)

# As you can see above, two handle addresses failed to resolve. If you click on them you'll see they go to UNISA. I'm not sure what the problem is.
#
# Let's break down the repository details by the redirected urls.

# In[402]:

df_all.drop_duplicates(subset='DOI')['redirected_url'].str.extract(r'^https*://(.*?)/').value_counts()

# If you compare this to the breakdown of OA articles by repository you'll see, for example, that the ANU repository has records for 19 articles, but only one of these makes a Green OA version available for download.

# ## Are 'fulltext' articles really open?
#
# Now let's compare what we found in Trove with the links we found using the OA Button API.
#
# First of all, let's look for articles with 'fulltext' links for which we've already found an OA version.

# In[403]:

df_all.loc[(df_all['link_type'] == 'fulltext') & (df_all['oadoi'].notnull())][['DOI', 'title', 'date', 'oadoi', 'redirected_url', 'link_type']]

# So the four Green OA articles in the Griffith repository also show up as 'fulltext' links in Trove. Yay! That's how things are meant to work.
#
# The other article seems a bit odd because the repository link goes to USQ, while the OA link goes to QUT. But this is actually ok. What's happening here is that there are Green OA versions of this article in two repositories. The Open Access Button API only gives us one of them. You can check that they're both there, however, by using the Unpaywall API directly. Look under `oa_locations` in the results below.

# In[404]:

response = requests.get('https://api.unpaywall.org/v2/10.1080/10314611003716861?email=tim@discontents.com.au')
display(JSON(response.json()))

# But why aren't both locations of this article showing up in our dataset? Let's look for the DOI.

# In[405]:

df_all.loc[df_all['DOI'] == '10.1080/10314611003716861']

# Ah, so they are both there, but look at the `link_type` values. One is 'fulltext', but the other is 'notonline' even though it points to a Green OA version. As I noted above, Trove makes an assessment of the online status of the article based on the available metadata – obviously something's going wrong here.
#
# How many more OA versions aren't labelled as 'fulltext' in Trove?

# In[406]:

df_all.loc[(df_all['link_type'].notnull()) & (df_all['link_type'] != 'fulltext') & (df_all['oadoi'].notnull())].drop_duplicates(subset='DOI')[['DOI', 'title', 'oadoi', 'redirected_url', 'link_type', 'oa_type']]

# The repository managers might like to look at these to see why the `link_type` values are not being set correctly in Trove. What's most disappointing here is that in two cases authors have gone to the trouble of making their articles Gold Open Access, but anyone searching for them in Trove will be told that they're not online, even though there's a link to them!
#
# Now let's look at 'fulltext' links for which we haven't found an OA version.

# In[407]:

df_all.loc[(df_all['link_type'] == 'fulltext') & (df_all['oadoi'].isnull())].drop_duplicates(subset='DOI')[['DOI', 'title', 'oadoi', 'redirected_url', 'link_type', 'oa_type']]

# Nine of these links are from UWA. It looks like they might not be getting indexed properly by Unpaywall. Let's display the links so we can click on them and see what's really going on.
# In[408]:

for article, links in df_all.loc[(df_all['link_type'] == 'fulltext') & (df_all['oadoi'].isnull())].groupby(by=['DOI', 'title']):
    print(f'\n{article[0]}')
    print(f'{article[1]}')
    for link in links['repo_url'].to_list():
        print(f' * {link}')

# If you click on the UWA repository links you'll see that there's no link to an OA version of the articles. So they're not 'fulltext' links at all.
#
# However, the [UNSW repository link](https://handle.unsw.edu.au/1959.4/unsworks_51159) does indeed lead to a Green OA version! After I shared this information on Twitter, Richard Orr from Unpaywall checked out the link and [explained the problem](https://twitter.com/unpaywall_dev/status/1318963084392804354). Fiona Bradley from UNSW also [noted that they're shifting to new repository software](https://twitter.com/Fiona_Bradley/status/1319036507244560385), so hopefully that'll fix things.
#
# The [UQ repository link](https://espace.library.uq.edu.au/view/UQ:368053) also leads to a Green OA version. I didn't pick this one up in my first pass, but I've now asked Richard about it as well.
#
# So what's clear is that we can't rely on the `link_type` value in Trove, and there may be some repositories whose OA data is not being captured by Unpaywall.
#
# The only way to really check this thoroughly would be to work through all the other Trove repository links to see if there are more Green OA versions hiding there. After a bit of semi-random clicking I did find one example in NOVA, the University of Newcastle repository: http://hdl.handle.net/1959.13/805160. Again, Richard Orr checked this out and has [implemented a fix in Unpaywall](https://twitter.com/unpaywall_dev/status/1318961609771986944) so the Green OA version should now be indexed. I checked the other two links to NOVA in the data from Trove, but unfortunately those records don't have Green OA versions.
#
# It's possible there are more Green OA versions that I haven't found. I suppose I could try loading all the repository links and looking for some sort of download link (there's a rough sketch of that idea at the end of this notebook). But I think I'll leave that for another day...

# ## Conclusions
#
# Adding in the 3 extra Green OA versions I found exploring the data from Trove, we have a total of 23 articles out of 242 that are available via open access. **That's just 9.5%!**
#
# Although some repositories aren't being indexed properly, the major reason why we can't find OA versions for 90% of the articles is simply that the authors haven't made them available through an open repository. This is both disappointing and hopeful. Disappointing in that so much significant scholarship remains locked up behind a paywall. But hopeful in that the solution is pretty straightforward. If you're the author of one of these articles:
#
# * If you're at a university, talk to your local librarians or repository managers about how you can upload an AAM version of your article.
# * If you're not at a university, you can use [Share Your Paper](https://shareyourpaper.org/) to upload it to Zenodo.
#
# If you want to get access to one of these articles:
#
# * Use the [Open Access Button](https://openaccessbutton.org/) to send a request to the author to make it available.
#
# If you're about to publish an article in *Australian Historical Studies* or *History Australia* (or many other journals – check their OA policies using [Sherpa Romeo](http://v2.sherpa.ac.uk/romeo/)):
#
# * Upload the AAM version to a repository immediately upon publication, and set the embargo period to 18 months.
# * Share your AAM version immediately upon publication through your own personal website.
#
# If you're thinking about publishing an article:
#
# * Check out the [Directory of Open Access Journals](https://doaj.org/) for a fully open access alternative!
# * Share the pre-print version of your article (that's the version *before* you've submitted it to a journal) in a repository like [Humanities Commons](https://hcommons.org/). Yes, it's a bit scary, but you might get some useful feedback or find new connections.

# In[ ]:
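# ## Postscript: looking for download links
#
# As mentioned above, one way to hunt for more hidden Green OA versions would be to load each of the repository pages and look for some sort of download link. Here's a rough, hypothetical sketch of the idea: it just fetches a page and uses a regular expression to pull out hrefs ending in '.pdf', so it will miss repositories that serve files through scripts or use other naming conventions.

# In[ ]:

import re

def find_pdf_links(url):
    '''
    Fetch a repository page and return any links that look like direct PDF downloads.
    This is a rough heuristic, not a reliable OA detector – links may also be
    relative to the page they appear on.
    '''
    try:
        response = requests.get(url, timeout=60)
    except requests.exceptions.RequestException:
        return []
    # Find href values that end in .pdf (allowing for query strings)
    return re.findall(r'href="([^"]+\.pdf[^"]*)"', response.text, flags=re.IGNORECASE)

# Example usage – check the repository links we haven't already matched to an OA
# version (commented out so the notebook doesn't hammer the repositories on every run):
# for url in df_all.loc[df_all['oadoi'].isnull()]['repo_url'].dropna().unique():
#     pdfs = find_pdf_links(url)
#     if pdfs:
#         print(url)
#         for pdf in pdfs:
#             print(f' * {pdf}')
#     time.sleep(1)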