The use of standard licences and rights statements in Trove image records

Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of the licences and rights statements listed on this page. We can also use this index to build a picture of which rights statements are currently being used, and by whom. Let's give it a try...

The method used here is to:

  • Retrieve details of Trove contributors from the API
  • Loop through the contributors, then loop through all the licences/rights statements, firing off a search in the picture zone for each combination.
  • Save the total results for each query with the contributor and licence details.

So for every organisation that contributes records to Trove, we'll find out the number of image records that include each rights statement.
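As a minimal sketch of what each search looks like, the helper below assembles the query string for one contributor and licence combination. The `build_query` name is hypothetical (it is not part of the notebook or the Trove API), and `ANL` is just an example NUC identifier; the `nuc:` and `rights:` index names are the ones used in the harvesting code further down.

```python
def build_query(nuc, licence=None, additional_query=None):
    '''
    Build a Trove search query limited to one contributor (by NUC id),
    optionally restricted to one licence and any extra query terms.
    Hypothetical helper for illustration only.
    '''
    query = f'nuc:"{nuc}"'
    if licence:
        query += f' rights:"{licence}"'
    if additional_query:
        query += f' {additional_query}'
    return query


# Example values only: 'ANL' stands in for any contributor id
example_query = build_query('ANL', 'Free/CC BY', 'NOT format:"Book"')
print(example_query)
```

Keeping the query construction in one place like this would avoid repeating the same string-building logic for the contributor check and the licence loop.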

Problems:

  • Searching by contributor saves us having to harvest all the images, but it has a major problem. Sometimes Trove will group multiple versions of a picture held by different organisations as a single work. Rights information is saved in the version metadata, but searches only return works. So if one organisation has assigned a rights statement to a version of the image, it will look like all the organisations whose images are grouped together with it as a work are using that rights statement. I don't think this will make a huge difference to the results, but it's something to look out for. The only way around this is to harvest everything and expand the versions out into separate records.

  • The rights index doesn't currently seem to include information on out of copyright images, unless they've actually been marked using the 'Public Domain' statement by the institution. Common statements such as 'Out of copyright', 'No known copyright restrictions', or 'Copyright expired' return no results. So there are many more open images than are currently reported by the rights index.
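To illustrate the first problem, here is a rough sketch of counting rights statements per version rather than per work. The record structure is a deliberately simplified, hypothetical one (it is NOT the shape of a real Trove API response), and `count_rights_by_version` is an invented name; the point is only that each statement gets credited to the organisation whose version carries it.

```python
from collections import Counter

# Hypothetical, simplified work records: NOT the real Trove API response shape.
# Each work groups versions contributed by different organisations.
works = [
    {
        'id': 'work-1',
        'versions': [
            {'nuc': 'AAA', 'rights': ['Free/CC BY']},
            {'nuc': 'BBB', 'rights': []},
        ],
    },
    {
        'id': 'work-2',
        'versions': [
            {'nuc': 'BBB', 'rights': ['Free/CC Public Domain']},
        ],
    },
]


def count_rights_by_version(works):
    '''
    Count rights statements per contributor using version-level metadata,
    so a statement is only credited to the organisation that supplied it.
    '''
    counts = Counter()
    for work in works:
        for version in work['versions']:
            for statement in version['rights']:
                counts[(version['nuc'], statement)] += 1
    return counts


version_counts = count_rights_by_version(works)
```

In this made-up data a work-level search for contributor `BBB` and `Free/CC BY` would match `work-1`, even though only `AAA` applied that licence; counting at the version level avoids the false match.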

In [195]:
import requests
import time
from tqdm.notebook import tqdm
import pandas as pd
from IPython.display import FileLink
In [161]:
# These are all the licence/rights statements recognised by Trove 
# Copied from https://help.nla.gov.au/trove/becoming-partner/for-content-partners/licensing-reuse
licences = [
    'Free/CC Public Domain',
    'Free/CC BY',
    'Free/CC0',
    'Free/RS NKC',
    'Free/RS Noc-US',
    'Free with conditions/CC BY-ND',
    'Free with conditions/CC BY-SA',
    'Free with conditions/CC BY-NC',
    'Free with conditions/CC BY-NC-ND',
    'Free with conditions/CC BY-NC-SA',
    'Free with conditions/RS NoC-NC',
    'Free with conditions/InC-NC',
    'Free with conditions/InC-EDU',
    'Restricted/RS InC',
    'Restricted/RS InC-OW-EU',
    'Restricted/RS InC-RUU',
    'Restricted/RS CNE',
    'Restricted/RS UND',
    'Restricted/NoC-CR',
    'Restricted/NoC-OKLR'
]
In [162]:
API_KEY = 'INSERT YOUR API KEY'
In [163]:
def save_summary(contributors, record, parent=None):
    '''
    Extract basic data from contributor record, and traverse any child records.
    Create a full_name value by combining parent and child names.
    '''
    summary = {
        'id': record['id'],
        'name': record['name']
    }        
    if parent:
        summary['parent_id'] = parent['id']
        summary['full_name'] = f'{parent["full_name"]} / {record["name"]}'
    elif 'parent' in  record:
        summary['parent_id'] = record['parent']['id']
        summary['full_name'] = f'{record["parent"]["value"]} / {record["name"]}'
    else:
        summary['full_name'] = record['name']
    if 'children' in record:
        for child in record['children']['contributor']:
            save_summary(contributors, child, summary)
    contributors.append(summary)
    
def get_contributors():
    '''
    Get a list of contributors from the Trove API.
    Flatten all the nested records.
    '''
    contributors = []
    contrib_params = {
        'key': API_KEY,
        'encoding': 'json',
        'reclevel': 'full'
    }
    response = requests.get('https://api.trove.nla.gov.au/v2/contributor/', params=contrib_params)
    data = response.json()
    for record in data['response']['contributor']:
        save_summary(contributors, record)
    return contributors
    
In [185]:
def contributor_has_results(contrib, params, additional_query):
    '''
    Check to see if the query returns any results for this contributor.
    '''
    query = f'nuc:"{contrib["id"]}"'
    # Add any extra queries
    if additional_query:
        query += f' {additional_query}'
    params['q'] = query
    response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    total = int(data['response']['zone'][0]['records']['total'])
    # Always return a boolean, rather than an implicit None when there are no results
    return total > 0

def licence_counts_by_institution(additional_query=None):
    '''
    Loop through contributors and licences to harvest data about the number of times each licence is used.
    '''
    contributors = get_contributors()
    licence_counts = []
    params = {
        'key': API_KEY,
        'encoding': 'json',
        'zone': 'picture',
        'n': 0
    }    
    for contrib in tqdm(contributors):
        # If there are no results for this contributor then there's no point checking for licences
        # This should save a bit of time
        if contributor_has_results(contrib, params, additional_query):
            contrib_row = contrib.copy()
            # Only search for nuc ids that start with a letter
            if contrib['id'][0].isalpha():
                for licence in licences:
                    # Construct query using nuc id and licence
                    query = f'nuc:"{contrib["id"]}" rights:"{licence}"'
                    # Add any extra queries
                    if additional_query:
                        query += f' {additional_query}'
                    params['q'] = query
                    response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
                    data = response.json()
                    total = data['response']['zone'][0]['records']['total']
                    contrib_row[licence] = int(total)
                    time.sleep(0.2)
            # print(contrib_row)
            licence_counts.append(contrib_row)
    return licence_counts
In [186]:
licence_counts_not_books = licence_counts_by_institution('NOT format:"Book"')

Process the data

In [187]:
df = pd.DataFrame(licence_counts_not_books)
In [188]:
# Fill empty totals with zeros & make them all integers
df[licences] = df[licences].fillna(0).astype(int)
In [189]:
# Check the overall distribution of rights statements
df.sum()
Out[189]:
Free with conditions/CC BY-NC                                                   21962
Free with conditions/CC BY-NC-ND                                                26067
Free with conditions/CC BY-NC-SA                                                82315
Free with conditions/CC BY-ND                                                       0
Free with conditions/CC BY-SA                                                   16801
Free with conditions/InC-EDU                                                     4381
Free with conditions/InC-NC                                                         0
Free with conditions/RS NoC-NC                                                      0
Free/CC BY                                                                     146001
Free/CC Public Domain                                                          267425
Free/CC0                                                                          581
Free/RS NKC                                                                      1391
Free/RS Noc-US                                                                      0
Restricted/NoC-CR                                                                   0
Restricted/NoC-OKLR                                                                 0
Restricted/RS CNE                                                               15093
Restricted/RS InC                                                               19144
Restricted/RS InC-OW-EU                                                             0
Restricted/RS InC-RUU                                                               1
Restricted/RS UND                                                                 422
full_name                           4th/19th Prince of Wales'  Light Horse Regimen...
id                                  VPWLHADFAACTSACSASACCSAHSTWLQAATVAPRCNALB:DCNA...
name                                4th/19th Prince of Wales'  Light Horse Regimen...
dtype: object
In [190]:
# Remove columns we don't need
df_final = df[['id', 'full_name'] + licences]
In [191]:
# Remove rows that add up to zero
# Sum only the licence columns, so the text columns (id, full_name) are ignored
df_final = df_final.loc[df_final[licences].sum(axis=1) != 0]
In [192]:
# Remove columns that are all zero
df_final = df_final.loc[:, df_final.any()]
In [193]:
# Sort by name and save as CSV
df_final.sort_values(by=['full_name']).to_csv('rights-on-images.csv', index=False)
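Each licence label carries its broad category ('Free', 'Free with conditions', or 'Restricted') before the slash, so the totals can be rolled up by category. The sketch below does this in plain Python using a handful of the totals reported by `df.sum()` above; `totals_by_category` is a hypothetical helper, and the dictionary is only a sample of the columns, not the full set.

```python
from collections import defaultdict

# A sample of the totals copied from the df.sum() output (not the full list)
totals = {
    'Free/CC BY': 146001,
    'Free/CC Public Domain': 267425,
    'Free with conditions/CC BY-NC-SA': 82315,
    'Restricted/RS InC': 19144,
}


def totals_by_category(totals):
    '''
    Add up licence totals by the category that precedes the first slash.
    '''
    categories = defaultdict(int)
    for licence, count in totals.items():
        category = licence.split('/', 1)[0]
        categories[category] += count
    return dict(categories)


category_totals = totals_by_category(totals)
print(category_totals)
```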

The results are saved in rights-on-images.csv.

Some GLAM institutions apply restrictive licences to digitised versions of out-of-copyright images. Under Australian copyright law, photographs created before 1955 are out of copyright, so we can adjust our query and look to see what sorts of rights statements are attached to them.

In [194]:
licence_counts_out_of_copyright = licence_counts_by_institution('format:Photograph date:[* TO 1954]')

In [196]:
df2 = pd.DataFrame(licence_counts_out_of_copyright)
In [197]:
# Fill empty totals with zeros & make them all integers
df2[licences] = df2[licences].fillna(0).astype(int)
In [199]:
# Check the overall distribution of rights statements
df2.sum()
Out[199]:
Free with conditions/CC BY-NC                                                      62
Free with conditions/CC BY-NC-ND                                                  840
Free with conditions/CC BY-NC-SA                                                 1172
Free with conditions/CC BY-ND                                                       0
Free with conditions/CC BY-SA                                                     715
Free with conditions/InC-EDU                                                        6
Free with conditions/InC-NC                                                         0
Free with conditions/RS NoC-NC                                                      0
Free/CC BY                                                                      36116
Free/CC Public Domain                                                            1772
Free/CC0                                                                          243
Free/RS NKC                                                                      1148
Free/RS Noc-US                                                                      0
Restricted/NoC-CR                                                                   0
Restricted/NoC-OKLR                                                                 0
Restricted/RS CNE                                                                 519
Restricted/RS InC                                                                 123
Restricted/RS InC-OW-EU                                                             0
Restricted/RS InC-RUU                                                               0
Restricted/RS UND                                                                   3
full_name                           4th/19th Prince of Wales'  Light Horse Regimen...
id                                  VPWLHADFAACTSACSASACCSAHNALB:DCNALBSALCXASPLNS...
name                                4th/19th Prince of Wales'  Light Horse Regimen...
dtype: object
In [200]:
# Remove columns we don't need
df2_final = df2[['id', 'full_name'] + licences]
In [201]:
# Remove rows that add up to zero
# Sum only the licence columns, so the text columns (id, full_name) are ignored
df2_final = df2_final.loc[df2_final[licences].sum(axis=1) != 0]
In [202]:
# Remove columns that are all zero
df2_final = df2_final.loc[:, df2_final.any()]
In [203]:
# Sort by name and save as CSV
df2_final.sort_values(by=['full_name']).to_csv('rights-on-out-of-copyright-photos.csv', index=False)