Corrections of OCRd text in Trove's newspapers

The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.

There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include has:corrections in your query to limit the results to articles that have at least one OCR correction.

To get information about the number of corrections made to the articles in your results, you can add the reclevel=full parameter to your query. This includes the number of corrections and details of the most recent correction in each article record. For example, note the correctionCount and lastCorrection values in the record below:

{
    "article": {
        "id": "41697877",
        "url": "/newspaper/41697877",
        "heading": "WRAGGE AND WEATHER CYCLES.",
        "category": "Article",
        "title": {
            "id": "101",
            "value": "Western Mail (Perth, WA : 1885 - 1954)"
        },
        "date": "1922-11-23",
        "page": 4,
        "pageSequence": 4,
        "troveUrl": "https://trove.nla.gov.au/ndp/del/article/41697877",
        "illustrated": "N",
        "wordCount": 1054,
        "correctionCount": 1,
        "listCount": 0,
        "tagCount": 0,
        "commentCount": 0,
        "lastCorrection": {
            "by": "*anon*",
            "lastupdated": "2016-09-12T07:08:57Z"
        },
        "identifier": "https://nla.gov.au/nla.news-article41697877",
        "trovePageUrl": "https://trove.nla.gov.au/ndp/del/page/3522839",
        "pdf": "https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print"
    }
}
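You can also retrieve a record like this directly from the API. Here's a minimal sketch (not part of the original notebook), assuming the v2 newspaper endpoint and that you've set your Trove API key as api_key in the setup below:

import requests

# Hypothetical example: fetch a single article record, with reclevel=full
# adding the correction details. Assumes 'api_key' holds your Trove API key.
article_params = {'key': api_key, 'encoding': 'json', 'reclevel': 'full'}
response = requests.get('https://api.trove.nla.gov.au/v2/newspaper/41697877', params=article_params)
article = response.json()['article']
print(article['correctionCount'], article['lastCorrection']['lastupdated'])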

Setting things up

In [6]:
import requests
import os
import ipywidgets as widgets
from operator import itemgetter # used for sorting
import pandas as pd # makes manipulating the data easier
import altair as alt
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, HTML, FileLink, clear_output
import math
from collections import OrderedDict
import time

# Make sure data directory exists
os.makedirs('data', exist_ok=True)

# Create a session that will automatically retry on server errors
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))
In [ ]:
api_key = 'YOUR API KEY'
print('Your API key is: {}'.format(api_key))
In [8]:
# Basic parameters for Trove API
params = {
    'facet': 'year', # Get the data aggregated by year.
    'zone': 'newspaper',
    'key': api_key,
    'encoding': 'json',
    'n': 0 # We don't need any records, just the facets!
}
In [9]:
def get_results(params):
    '''
    Get JSON response data from the Trove API.
    Parameters:
        params
    Returns:
        JSON formatted response data from Trove API 
    '''
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=30)
    response.raise_for_status()
    # print(response.url) # This shows us the url that's sent to the API
    data = response.json()
    return data

How many newspaper articles have corrections?

Let's find out what proportion of newspaper articles have at least one OCR correction.

First we'll get the total number of newspaper articles in Trove.

In [10]:
# Set the q parameter to a single space to get everything
params['q'] = ' '

# Get the data from the API
data = get_results(params)

# Extract the total number of results
total = int(data['response']['zone'][0]['records']['total'])
print('{:,}'.format(total))
228,011,887

Now we'll set the q parameter to has:corrections to limit the results to newspaper articles that have at least one correction.

In [11]:
# Set the q parameter to 'has:corrections' to limit results to articles with corrections
params['q'] = 'has:corrections'

# Get the data from the API
data = get_results(params)

# Extract the total number of results
corrected = int(data['response']['zone'][0]['records']['total'])
print('{:,}'.format(corrected))
11,998,749

Calculate the proportion of articles with corrections.

In [12]:
print('{:.2%} of articles have at least one correction'.format(corrected/total))
5.26% of articles have at least one correction

You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the number of articles that include corrections, while the individual scores show the number of lines corrected by each volunteer.
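If you want to dig into the number of corrections themselves, you can add reclevel=full to a search, as described above. Here's a minimal sketch (not part of the original analysis) that prints the correctionCount of a few corrected articles:

# Hypothetical example: inspect per-article correction counts for a small sample of results
sample_params = {
    'q': 'has:corrections',
    'zone': 'newspaper',
    'key': api_key,
    'encoding': 'json',
    'reclevel': 'full', # include correctionCount in each article record
    'n': 10 # just a small sample
}

sample_data = get_results(sample_params)
for article in sample_data['response']['zone'][0]['records']['article']:
    print(article['id'], article.get('correctionCount', 0))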

Number of corrections by year

In [13]:
def get_facets(data):
    '''
    Loop through facets in Trove API response, saving terms and counts.
    Parameters:
        data  - JSON formatted response data from Trove API  
    Returns:
        A list of dictionaries containing: 'term', 'total_results'
    '''
    facets = []
    try:
        # The facets are buried a fair way down in the results
        # Note that if you ask for more than one facet, you'll have to use the facet['name'] param to find the one you want
        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)
        for term in data['response']['zone'][0]['facets']['facet']['term']:
            
            # Get the year and the number of results, and convert them to integers, before adding to our results
            facets.append({'term': term['search'], 'total_results': int(term['count'])})
            
        # Sort facets by year
        facets.sort(key=itemgetter('term'))
    except TypeError:
        pass
    return facets

def get_facet_data(params, start_decade=180, end_decade=201):
    '''
    Loop through the decades from 'start_decade' to 'end_decade',
    getting the number of search results for each year from the year facet.
    Combine all the results into a single list.
    Note that Trove decade values are the first three digits of a year,
    so 180 is the 1800s and 201 is the 2010s.
    Parameters:
        params - parameters to send to the API
        start_decade
        end_decade
    Returns:
        A list of dictionaries containing 'year', 'total_results' for the complete 
        period between the start and end decades.
    '''
    # Create a list to hold the facets data
    facet_data = []
    
    # Loop through the decades
    for decade in tqdm(range(start_decade, end_decade + 1)):
        
        #print(params)
        # Avoid confusion by copying the params before we change anything.
        search_params = params.copy()
        
        # Add decade value to params
        search_params['l-decade'] = decade
        
        # Get the data from the API
        data = get_results(search_params)
        
        # Get the facets from the data and add to facets_data
        facet_data += get_facets(data)
        
    # Remove the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)
    clear_output()
    return facet_data
In [14]:
facet_data = get_facet_data(params)
In [15]:
# Convert our data to a dataframe called df
df = pd.DataFrame(facet_data)
In [16]:
df.head()
Out[16]:
term total_results
0 1803 526
1 1804 619
2 1805 430
3 1806 367
4 1807 134

So which year has the most corrections?

In [17]:
df.loc[df['total_results'].idxmax()]
Out[17]:
term               1915
total_results    244730
Name: 112, dtype: int64

The fact that there are more corrections in newspaper articles from 1915 might make you think that people have been particularly motivated to correct articles relating to WWI. But if you look at the total number of articles per year, you'll see that more articles have been digitised from 1915 than from other years! The raw number of corrections is probably not very useful, so let's look instead at the proportion of articles from each year that have at least one correction.

To do that we'll re-harvest the facet data, but this time with an empty search, to get the total number of articles available from each year.

In [18]:
# Reset the 'q' parameter
# Use an empty search (a single space) to get ALL THE ARTICLES
params['q'] = ' '

# Get facet data for all articles
all_facet_data = get_facet_data(params)
In [19]:
# Convert the results to a dataframe
df_total = pd.DataFrame(all_facet_data)

Now we'll merge the number of articles with corrections per year with the total number of articles per year. Then we'll calculate the proportion with corrections.

In [20]:
def merge_df_with_total(df, df_total, how='left'):
    '''
    Merge dataframes containing search results with the total number of articles by year.
    This is a join (left by default) on the 'term' (year) column. The total number of articles
    will be added as a column to the existing results.
    Once merged, do some reorganisation and calculate the proportion of search results.
    Parameters:
        df - the search results in a dataframe
        df_total - total number of articles per year in a dataframe
    Returns:
        A dataframe with the following columns - 'term', 'total_results', 'total_articles', 'proportion' 
        (plus any other columns that are in the search results dataframe).
    '''
    # Merge the two dataframes, joining on the 'term' (year) column
    df_merged = pd.merge(df, df_total, how=how, on='term')

    # Rename the columns for convenience
    df_merged.rename({'total_results_y': 'total_articles'}, inplace=True, axis='columns')
    df_merged.rename({'total_results_x': 'total_results'}, inplace=True, axis='columns')

    # Set blank values to zero to avoid problems
    df_merged['total_results'] = df_merged['total_results'].fillna(0).astype(int)

    # Calculate proportion by dividing the search results by the total articles
    df_merged['proportion'] = df_merged['total_results'] / df_merged['total_articles']
    return df_merged
In [21]:
# Merge the search results with the total articles
df_merged = merge_df_with_total(df, df_total)
df_merged.head()
Out[21]:
term total_results total_articles proportion
0 1803 526 526 1.0
1 1804 619 619 1.0
2 1805 430 430 1.0
3 1806 367 367 1.0
4 1807 134 134 1.0

Let's visualise the results, showing both the number of articles with corrections each year, and the proportion of articles each year with corrections.

In [22]:
# Number of articles with corrections
chart1 = alt.Chart(df_merged).mark_line(point=True).encode(
        x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),
        y=alt.Y('total_results:Q', axis=alt.Axis(format=',d', title='Number of articles with corrections')),
        tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('total_results:Q', title='Articles', format=',')]
    ).properties(width=700, height=250)

# Proportion of articles with corrections
chart2 = alt.Chart(df_merged).mark_line(point=True).encode(
        x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),

        # This time we're showing the proportion (formatted as a percentage) on the Y axis
        y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),
        tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('proportion:Q', title='Proportion', format='%')],

        # Make this chart a different colour from the first
        color=alt.value('orange')
    ).properties(width=700, height=250)

# This is a shorthand way of stacking the charts on top of each other
chart1 & chart2
Out[22]:

This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected.

Number of corrections by category

Let's see how the number of corrections varies across categories. This time we'll use the category facet instead of year.

In [23]:
params['q'] = 'has:corrections'
params['facet'] = 'category'
In [24]:
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the category and the number of results, and convert the count to an integer, before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_categories = pd.DataFrame(facets)
In [25]:
df_categories.head()
Out[25]:
term total_results
0 Article 9110253
1 Family Notices 1294362
2 Advertising 1155058
3 Detailed Lists, Results, Guides 443143
4 Literature 7570

Once again, the raw numbers are probably not all that useful, so let's get the total number of articles in each category and calculate the proportion that have at least one correction.

In [26]:
# Blank query
params['q'] = ' '
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the category and the number of results, and convert the count to an integer, before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_total_categories = pd.DataFrame(facets)

Now we'll merge the corrections-by-category data with the total number of articles per category and calculate the proportion.

In [27]:
df_categories_merged = merge_df_with_total(df_categories, df_total_categories)
df_categories_merged
Out[27]:
term total_results total_articles proportion
0 Article 9110253 158686219 0.057410
1 Family Notices 1294362 1866929 0.693311
2 Advertising 1155058 41616783 0.027755
3 Detailed Lists, Results, Guides 443143 25840513 0.017149
4 Literature 7570 30677 0.246765
5 Obituaries 5637 5933 0.950110
6 Humour 5379 20446 0.263083
7 News 4378 5463 0.801391
8 Law, Courts, And Crime 4273 5345 0.799439
9 Sport And Games 3446 6860 0.502332
10 Letters 1794 7876 0.227781
11 Arts And Culture 1284 1691 0.759314
12 Puzzles 1228 20045 0.061262
13 Editorial 670 5713 0.117276
14 Classified Advertisements And Notices 663 706 0.939093
15 Weather 544 2729 0.199340
16 Shipping Notices 520 611 0.851064
17 Official Appointments And Notices 460 477 0.964361
18 Reviews 354 484 0.731405
19 Commerce And Business 300 352 0.852273
20 Display Advertisement 219 228 0.960526

Many of the categories have been added recently and don't contain many articles. Some of these have a very high proportion of articles with corrections – 'Obituaries' for example. This suggests users are systematically categorising and correcting certain types of article.

Let's focus on the main categories by filtering out those with fewer than 30,000 articles.

In [28]:
df_categories_filtered = df_categories_merged.loc[df_categories_merged['total_articles'] > 30000]
df_categories_filtered
Out[28]:
term total_results total_articles proportion
0 Article 9110253 158686219 0.057410
1 Family Notices 1294362 1866929 0.693311
2 Advertising 1155058 41616783 0.027755
3 Detailed Lists, Results, Guides 443143 25840513 0.017149
4 Literature 7570 30677 0.246765

And now we can visualise the results.

In [62]:
cat_chart1 = alt.Chart(df_categories_filtered).mark_bar().encode(
    x=alt.X('term:N', title='Category'),
    y=alt.Y('total_results:Q', title='Articles with corrections')
)

cat_chart2 = alt.Chart(df_categories_filtered).mark_bar().encode(
    x=alt.X('term:N', title='Category'),
    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),
    color=alt.value('orange')
)

cat_chart1 | cat_chart2
Out[62]:

As we can see, the rate of corrections is much higher in the 'Family Notices' category than any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names.

Number of corrections by newspaper

How do rates of correction vary across newspapers? We can use the title facet to find out.

In [30]:
params['q'] = 'has:corrections'
params['facet'] = 'title'
In [31]:
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results, and convert the count to an integer, before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_newspapers = pd.DataFrame(facets)
In [32]:
df_newspapers.head()
Out[32]:
term total_results
0 35 775663
1 13 743286
2 11 324802
3 16 324039
4 30 296491

Once again we'll calculate the proportion of articles corrected for each newspaper by getting the total number of articles for each newspaper on Trove.

In [33]:
params['q'] = ' '
In [34]:
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results, and convert the count to an integer, before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_newspapers_total = pd.DataFrame(facets)
In [35]:
df_newspapers_merged = merge_df_with_total(df_newspapers, df_newspapers_total, how='right')
In [36]:
df_newspapers_merged.sort_values(by='proportion', ascending=False, inplace=True)
df_newspapers_merged.rename(columns={'term': 'id'}, inplace=True)
In [37]:
df_newspapers_merged.head()
Out[37]:
id total_results total_articles proportion
1012 1028 286 286 1.0
919 1142 437 437 1.0
1410 1042 21 21 1.0
1413 154 21 21 1.0
1354 1047 38 38 1.0

The title facet only gives us the id number for each newspaper, not its title. Let's get all the titles and then merge them with the facet data.

In [38]:
# Get all the newspaper titles
title_params = {
    'key': api_key,
    'encoding': 'json',
}

title_data = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params=params).json()
In [39]:
titles = []
for newspaper in title_data['response']['records']['newspaper']:
    titles.append({'title': newspaper['title'], 'id': int(newspaper['id'])})
df_titles = pd.DataFrame(titles)
In [40]:
df_titles.head()
Out[40]:
title id
0 Canberra Community News (ACT : 1925 - 1927) 166
1 Canberra Illustrated: A Quarterly Magazine (AC... 165
2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69
3 Good Neighbour (ACT : 1950 - 1969) 871
4 Student Notes/Canberra University College Stud... 665
In [41]:
df_titles.shape
Out[41]:
(1567, 2)

One problem with this list is that it also includes the titles of the Government Gazettes (this seems to be a bug in the API). Let's get the gazette titles and then subtract them from the complete list.

In [42]:
# Get gazette titles (reusing the parameters from the titles request)
gazette_data = s.get('https://api.trove.nla.gov.au/v2/gazette/titles', params=title_params).json()
gazettes = []
for gaz in gazette_data['response']['records']['newspaper']:
    gazettes.append({'title': gaz['title'], 'id': int(gaz['id'])})
df_gazettes = pd.DataFrame(gazettes)
In [43]:
df_gazettes.shape
Out[43]:
(37, 2)

Subtract the gazettes from the list of titles.

In [44]:
df_titles_not_gazettes = df_titles[~df_titles['id'].isin(df_gazettes['id'])]

Now we can merge the newspaper titles with the facet data using the id to link the two datasets.

In [45]:
df_newspapers_with_titles = pd.merge(df_titles_not_gazettes, df_newspapers_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)
In [46]:
# Convert the totals back to integers
df_newspapers_with_titles[['total_results', 'total_articles']] = df_newspapers_with_titles[['total_results', 'total_articles']].astype(int)

Now we can display the newspapers with the highest rates of correction. Remember that a proportion of 1.00 means that every available article has at least one correction.

In [47]:
df_newspapers_with_titles[:25]
Out[47]:
title id total_results total_articles proportion
17 The Australian Abo Call (National : 1938) 51 78 78 1.000000
434 The Temora Telegraph and Mining Advocate (NSW ... 729 3 3 1.000000
390 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 286 1.000000
1378 Swan River Guardian (WA : 1836 - 1838) 1142 437 437 1.000000
501 Moonta Herald and Northern Territory Gazette (... 118 56 56 1.000000
920 Elsternwick Leader and East Brighton, ... (Vic... 201 17 17 1.000000
258 The Branxton Advocate: Greta and Rothbury Reco... 686 53 53 1.000000
810 The Derwent Star and Van Diemen's Land Intelli... 1046 12 12 1.000000
846 The Van Diemen's Land Gazette and General Adve... 1047 38 38 1.000000
698 Suedaustralische Zeitung (Adelaide, SA : 1850 ... 314 47 47 1.000000
862 Alexandra and Yea Standard, Thornton, Gobur an... 154 21 21 1.000000
438 The True Sun and New South Wales Independent P... 1038 20 20 1.000000
196 Society (Sydney, NSW : 1887) 1042 21 21 1.000000
176 Party (Sydney, NSW : 1942) 1000 6 6 1.000000
816 The Hobart Town Gazette and Southern Reporter ... 4 1919 1923 0.997920
2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69 541 545 0.992661
1163 The Melbourne Advertiser (Vic. : 1838) 935 120 121 0.991736
775 Hobart Town Gazette and Van Diemen's Land Adve... 5 1534 1556 0.985861
684 South Australian Gazette and Colonial Register... 40 1048 1065 0.984038
132 Intelligence (Bowral, NSW : 1884) 624 117 119 0.983193
1527 York Advocate (WA : 1915) 1131 236 241 0.979253
135 Justice (Narrabri, NSW : 1891) 885 44 45 0.977778
359 The Newcastle Argus and District Advertiser (N... 513 29 30 0.966667
10 Berita Repoeblik (Djakarta, Indonesia : 1945 -... 1283 471 498 0.945783
788 Tasmanian and Port Dalrymple Advertiser (Launc... 273 181 193 0.937824
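Quite a few titles have a proportion of exactly 1.00. You could count how many newspapers are completely corrected – a quick sketch, not part of the original analysis:

# How many newspapers have at least one correction in every available article?
fully_corrected = df_newspapers_with_titles.loc[df_newspapers_with_titles['proportion'] == 1]
print(f'{fully_corrected.shape[0]} newspapers have corrections in every available article')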

At the other end, we can see the newspapers with the lowest rates of correction. Note that some newspapers have no corrections at all.

In [48]:
df_newspapers_with_titles.sort_values(by='proportion')[:25]
Out[48]:
title id total_results total_articles proportion
1453 The North Coolgardie Herald (Menzies, WA : 190... 1650 0 3199 0.000000
628 Deutsche Zeitung für Sud-Australien = German ... 1577 0 14 0.000000
1489 The Temperance Advocate (Perth, WA : 1882) 1561 0 155 0.000000
1301 Greenough Sun (WA : 1947 - 1954) 1628 0 13648 0.000000
1063 Progress (North Fitzroy, Vic. : 1889 - 1890) 1574 0 254 0.000000
1474 The Southern Cross (Perth, WA : 1893) 1660 0 59 0.000000
1394 The Boyup Brook Bulletin (WA : 1930 - 1950) 1607 0 14420 0.000000
1458 The Pemberton Post (WA : 1937 - 1950) 1657 0 10717 0.000000
1084 Seamen's Strike Bulletin (Melbourne, Vic. : 1919) 1043 0 14 0.000000
1477 The Southern Cross News (WA : 1935 - 1957) 1661 0 16822 0.000000
1426 The Harvey-Waroona Mail (Collie, WA : 1931 - 1... 1629 1 26129 0.000038
1452 The Norseman-Esperance News (WA : 1936 - 1954) 1647 5 40614 0.000123
1318 Kojonup Courier (WA : 1951 - 1958) 1631 2 11945 0.000167
1419 The Geraldton Express and Murchison and Yalgo ... 1622 10 55278 0.000181
1498 The Weekly Herald (Fremantle, WA : 1922 - 1926) 1556 4 14931 0.000268
1295 Goomalling Gazette (WA : 1946 - 1954) 1627 3 9164 0.000327
1514 Weekly Judge (Perth, WA : 1919 - 1931) 1557 19 38456 0.000494
1455 The Northern Grazier and Miner (Leonora, WA : ... 1544 41 79991 0.000513
1505 The Wiluna Miner (WA : 1931 - 1947) 1559 37 59866 0.000618
1346 Murchison Times and Cue-Big Bell-Reedy Advocat... 1543 10 15707 0.000637
1442 The Miners' Daily News (Menzies, WA : 1896 - 1... 1540 11 15877 0.000693
1485 The Swan and Canning Times and Hills Gazette (... 1551 7 9391 0.000745
1401 The Corrigin Sun (WA : 1929 - 1930) 1614 1 1270 0.000787
1487 The Swan Leader (Victoria Park, WA : 1931 - 1937) 1552 19 22753 0.000835
1320 Kondinin-Kulin Kourier and Karlgarin Advocate ... 1534 28 33117 0.000845

We'll save the full list of newspapers as a CSV file.

In [63]:
df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()
df_newspapers_with_titles_csv.rename({'total_results': 'articles_with_corrections'}, axis=1, inplace=True)
df_newspapers_with_titles_csv['percentage_with_corrections'] = df_newspapers_with_titles_csv['proportion'] * 100
df_newspapers_with_titles_csv['title_url'] = df_newspapers_with_titles_csv['id'].apply(lambda x: f'http://nla.gov.au/nla.news-title{x}')
df_newspapers_with_titles_csv.sort_values(by=['percentage_with_corrections'], inplace=True)

# Write the results to a CSV file (the title_url column makes it easy to find each title in Trove)
df_newspapers_with_titles_csv.to_csv('titles_corrected.csv', index=False)
In [64]:
display(FileLink('titles_corrected.csv'))

Neediest newspapers

Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.

To make a guesstimate of error rates, we'll use the occurrence of 'tbe' – a common OCR misreading of 'the'. I don't know how valid this is, but it's a place to start!

In [51]:
# Search for 'tbe' to get an indication of errors by newspaper
params['q'] = 'text:"tbe"~0'
params['facet'] = 'title'
In [52]:
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results, and convert the count to an integer, before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_errors = pd.DataFrame(facets)

Merge the error data with the total articles per newspaper to calculate the proportion.

In [53]:
df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how='right')
df_errors_merged.sort_values(by='proportion', ascending=False, inplace=True)
df_errors_merged.rename(columns={'term': 'id'}, inplace=True)
In [54]:
df_errors_merged.head()
Out[54]:
id total_results total_articles proportion
560 1316 2010 2954 0.680433
351 758 5255 8078 0.650532
251 927 9465 17227 0.549428
301 382 6971 12744 0.547003
322 262 6293 11527 0.545936

Add the title names.

In [55]:
df_errors_with_titles = pd.merge(df_titles_not_gazettes, df_errors_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)

So this is a list of the newspapers with the highest rate of OCR error (by our rather dodgy measure).

In [56]:
df_errors_with_titles[:25]
Out[56]:
title id total_results total_articles proportion
453 The Weekly Advance (Granville, NSW : 1892 - 1893) 1316 2010 2954 0.680433
918 Dunolly and Betbetshire Express and County of ... 758 5255 8078 0.650532
960 Hamilton Spectator and Grange District Adverti... 927 9465 17227 0.549428
482 Wagga Wagga Express and Murrumbidgee District ... 382 6971 12744 0.547003
579 The North Australian, Ipswich and General Adve... 262 6293 11527 0.545936
578 The North Australian (Brisbane, Qld. : 1863 - ... 264 2887 5314 0.543282
796 Telegraph (Hobart Town, Tas. : 1867) 1250 74 140 0.528571
315 The Hay Standard and Advertiser for Balranald,... 725 21711 42068 0.516093
786 Morning Star and Commercial Advertiser (Hobart... 1242 875 1703 0.513799
190 Robertson Advocate (NSW : 1894 - 1923) 530 37065 72376 0.512117
216 Temora Herald and Mining Journal (NSW : 1882 -... 728 640 1253 0.510774
790 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... 865 4865 9559 0.508944
210 Sydney Mail (NSW : 1860 - 1871) 697 24634 48535 0.507551
1055 Port Phillip Gazette and Settler's Journal (Vi... 1138 6127 12127 0.505236
153 Molong Argus (NSW : 1896 - 1921) 424 52146 104984 0.496704
1054 Port Phillip Gazette (Vic. : 1851) 1139 241 491 0.490835
283 The Cumberland Free Press (Parramatta, NSW : 1... 724 6441 13247 0.486223
528 Logan Witness (Beenleigh, Qld. : 1878 - 1893) 850 6972 14654 0.475775
849 Trumpeter General (Hobart, Tas. : 1833 - 1834) 869 704 1482 0.475034
609 Adelaide Chronicle and South Australian Litera... 986 909 1937 0.469282
571 The Darling Downs Gazette and General Advertis... 257 29873 65268 0.457697
364 The News, Shoalhaven and Southern Coast Distri... 1588 2481 5495 0.451501
807 The Cornwall Chronicle (Launceston, Tas. : 183... 170 72856 163791 0.444811
899 Chronicle, South Yarra Gazette, Toorak Times a... 847 1642 3720 0.441398
824 The Mount Lyell Standard and Strahan Gazette (... 1251 36511 83363 0.437976

And those with the lowest rate of errors. Note the number of non-English newspapers in this list – of course our measure of accuracy fails completely in newspapers that don't use the word 'the'!

In [57]:
df_errors_with_titles[-25:]
Out[57]:
title id total_results total_articles proportion
438 The True Sun and New South Wales Independent P... 1038 0 20 0.0
1279 Eco Italiano (Perth, WA : 1958 - 1959) 1387 0 1579 0.0
1278 Echo : Polski Tygodnik Niezalezny (Perth, WA :... 1384 0 2601 0.0
1245 Bullfinch Miner and Yilgarn Advocate (WA : 1910) 1460 0 27 0.0
501 Moonta Herald and Northern Territory Gazette (... 118 0 56 0.0
1274 Der Australische Spiegel = The Australian Mirr... 1385 0 1455 0.0
663 Port Augusta and Stirling Illustrated News (SA... 1478 0 125 0.0
63 Clarence and Richmond Examiner (Grafton, NSW :... 104 0 111 0.0
1492 The Voice of Freedom = Elefthera Phoni (Perth,... 1381 0 511 0.0
132 Intelligence (Bowral, NSW : 1884) 624 0 119 0.0
1268 Dampier Despatch (Broome, WA : 1904 - 1905) 1407 0 871 0.0
846 The Van Diemen's Land Gazette and General Adve... 1047 0 38 0.0
1163 The Melbourne Advertiser (Vic. : 1838) 935 0 121 0.0
810 The Derwent Star and Van Diemen's Land Intelli... 1046 0 12 0.0
1 Canberra Illustrated: A Quarterly Magazine (AC... 165 0 57 0.0
43 Blayney West Macquarie (NSW : 1949) 802 0 110 0.0
739 The Yellow Flag and Torrens Island Terror (Ade... 1506 0 44 0.0
1474 The Southern Cross (Perth, WA : 1893) 1660 0 59 0.0
728 The Progressive Times (Largs North, SA : 1949 ... 1307 0 1446 0.0
723 The Port Adelaide Post Shipping Gazette, Farme... 719 0 18 0.0
1189 The Sun News Pictorial (Melbourne, Vic. : 1956... 1191 0 14 0.0
704 The Citizen (Port Adelaide, SA : 1938-1940) 1305 0 1284 0.0
698 Suedaustralische Zeitung (Adelaide, SA : 1850 ... 314 0 47 0.0
62 Citizen Soldier (Sydney, NSW : 1942) 996 0 60 0.0
1207 Vigilante (Melbourne, Vic. : 1918) 799 0 302 0.0
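Before trusting these rates, you could check how often 'the' itself appears in each newspaper – a rough sketch (not part of the original analysis) using the same facet-harvesting pattern as above:

# Hypothetical check: count articles containing 'the' for each newspaper, so that
# titles where 'the' is rare (probably non-English) can be flagged or excluded
params['q'] = 'text:"the"~0'
params['facet'] = 'title'
data = get_results(params)
the_facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    the_facets.append({'id': int(term['search']), 'articles_with_the': int(term['count'])})
df_the = pd.DataFrame(the_facets)
# Titles missing from df_the, or with very low counts, probably aren't in English,
# so their 'tbe' rate tells us nothing about OCR quality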

Now let's merge the error data with the correction data.

In [58]:
corrections_errors_merged_df = pd.merge(df_newspapers_with_titles, df_errors_with_titles, how='left', on='id')
In [59]:
corrections_errors_merged_df.head()
Out[59]:
title_x id total_results_x total_articles_x proportion_x title_y total_results_y total_articles_y proportion_y
0 The Australian Abo Call (National : 1938) 51 78 78 1.0 The Australian Abo Call (National : 1938) 0 78 0.000000
1 The Temora Telegraph and Mining Advocate (NSW ... 729 3 3 1.0 The Temora Telegraph and Mining Advocate (NSW ... 0 3 0.000000
2 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 286 1.0 The Satirist and Sporting Chronicle (Sydney, N... 0 286 0.000000
3 Swan River Guardian (WA : 1836 - 1838) 1142 437 437 1.0 Swan River Guardian (WA : 1836 - 1838) 32 437 0.073227
4 Moonta Herald and Northern Territory Gazette (... 118 56 56 1.0 Moonta Herald and Northern Territory Gazette (... 0 56 0.000000
In [60]:
corrections_errors_merged_df['proportion_uncorrected'] = corrections_errors_merged_df['proportion_x'].apply(lambda x: 1 - x)
corrections_errors_merged_df.rename(columns={'title_x': 'title', 'proportion_x': 'proportion_corrected', 'proportion_y': 'proportion_with_errors'}, inplace=True)
corrections_errors_merged_df.sort_values(by=['proportion_with_errors', 'proportion_uncorrected'], ascending=False, inplace=True)

So, for what it's worth, here's a list of the neediest newspapers – those with high error rates and low correction rates! As I've said, this is a pretty dodgy method, but interesting nonetheless.

In [61]:
corrections_errors_merged_df[['title', 'proportion_with_errors', 'proportion_uncorrected']][:25]
Out[61]:
title proportion_with_errors proportion_uncorrected
1305 The Weekly Advance (Granville, NSW : 1892 - 1893) 0.680433 0.984766
571 Dunolly and Betbetshire Express and County of ... 0.650532 0.935380
392 Hamilton Spectator and Grange District Adverti... 0.549428 0.902246
431 Wagga Wagga Express and Murrumbidgee District ... 0.547003 0.912272
174 The North Australian, Ipswich and General Adve... 0.545936 0.780602
247 The North Australian (Brisbane, Qld. : 1863 - ... 0.543282 0.844750
403 Telegraph (Hobart Town, Tas. : 1867) 0.528571 0.907143
973 The Hay Standard and Advertiser for Balranald,... 0.516093 0.965033
159 Morning Star and Commercial Advertiser (Hobart... 0.513799 0.757487
808 Robertson Advocate (NSW : 1894 - 1923) 0.512117 0.955469
560 Temora Herald and Mining Journal (NSW : 1882 -... 0.510774 0.933759
459 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... 0.508944 0.916518
337 Sydney Mail (NSW : 1860 - 1871) 0.507551 0.887937
224 Port Phillip Gazette and Settler's Journal (Vi... 0.505236 0.834337
612 Molong Argus (NSW : 1896 - 1921) 0.496704 0.939648
245 Port Phillip Gazette (Vic. : 1851) 0.490835 0.843177
423 The Cumberland Free Press (Parramatta, NSW : 1... 0.486223 0.910772
300 Logan Witness (Beenleigh, Qld. : 1878 - 1893) 0.475775 0.871434
126 Trumpeter General (Hobart, Tas. : 1833 - 1834) 0.475034 0.706478
123 Adelaide Chronicle and South Australian Litera... 0.469282 0.699019
269 The Darling Downs Gazette and General Advertis... 0.457697 0.858384
1473 The News, Shoalhaven and Southern Coast Distri... 0.451501 0.996542
212 The Cornwall Chronicle (Launceston, Tas. : 183... 0.444811 0.829063
979 Chronicle, South Yarra Gazette, Toorak Times a... 0.441398 0.965323
1363 The Mount Lyell Standard and Strahan Gazette (... 0.437976 0.988496

Created by Tim Sherratt for the GLAM Workbench.