Corrections of OCRd text in Trove's newspapers¶

The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.

There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include has:corrections in your query to limit the results to articles that have at least one OCR correction.

To get information about the number of corrections made to the articles in your results, you can add the reclevel=full parameter to include the number of corrections and details of the most recent correction to the article record. For example, note the correctionCount and lastCorrection values in the record below:

{
    "article": {
        "id": "41697877",
        "url": "/newspaper/41697877",
        "heading": "WRAGGE AND WEATHER CYCLES.",
        "category": "Article",
        "title": {
            "id": "101",
            "value": "Western Mail (Perth, WA : 1885 - 1954)"
        },
        "date": "1922-11-23",
        "page": 4,
        "pageSequence": 4,
        "troveUrl": "https://trove.nla.gov.au/ndp/del/article/41697877",
        "illustrated": "N",
        "wordCount": 1054,
        "correctionCount": 1,
        "listCount": 0,
        "tagCount": 0,
        "commentCount": 0,
        "lastCorrection": {
            "by": "*anon*",
            "lastupdated": "2016-09-12T07:08:57Z"
        },
        "identifier": "https://nla.gov.au/nla.news-article41697877",
        "trovePageUrl": "https://trove.nla.gov.au/ndp/del/page/3522839",
        "pdf": "https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print"
    }
}

Setting things up¶

In [1]:

import os
from operator import itemgetter  # used for sorting

import altair as alt
import pandas as pd  # makes manipulating the data easier
import requests
from IPython.display import FileLink, clear_output, display
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm

# Make sure data directory exists
os.makedirs("data", exist_ok=True)

# Create a session that will automatically retry on server errors
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

In [2]:

%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [3]:

# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

In [4]:

# Basic parameters for Trove API
params = {
    "facet": "year",  # Get the data aggregated by year.
    "zone": "newspaper",
    "key": API_KEY,
    "encoding": "json",
    "n": 0,  # We don't need any records, just the facets!
}

In [5]:

def get_results(params):
    """
    Get JSON response data from the Trove API.
    Parameters:
        params
    Returns:
        JSON formatted response data from Trove API
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v2/result", params=params, timeout=30
    )
    response.raise_for_status()
    # print(response.url) # This shows us the url that's sent to the API
    data = response.json()
    return data

How many newspaper articles have corrections?¶

Let's find out what proportion of newspaper articles have at least one OCR correction.

First we'll get to the total number of newspaper articles in Trove.

In [6]:

# Set the q parameter to a single space to get everything
params["q"] = " "

# Get the data from the API
data = get_results(params)

# Extract the total number of results
total = int(data["response"]["zone"][0]["records"]["total"])
print("{:,}".format(total))

233,635,348

Now we'll set the q parameter to has:corrections to limit the results to newspaper articles that have at least one correction.

In [7]:

# Set the q parameter to 'has:corrections' to limit results to articles with corrections
params["q"] = "has:corrections"

# Get the data from the API
data = get_results(params)

# Extract the total number of results
corrected = int(data["response"]["zone"][0]["records"]["total"])
print("{:,}".format(corrected))

13,585,979

Calculate the proportion of articles with corrections.

In [8]:

print("{:.2%} of articles have at least one correction".format(corrected / total))

5.82% of articles have at least one correction

You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the number of articles that include corrections, while the individual scores show the number of lines corrected by each volunteer.

Number of corrections by year¶

In [9]:

def get_facets(data):
    """
    Loop through facets in Trove API response, saving terms and counts.
    Parameters:
        data  - JSON formatted response data from Trove API
    Returns:
        A list of dictionaries containing: 'term', 'total_results'
    """
    facets = []
    try:
        # The facets are buried a fair way down in the results
        # Note that if you ask for more than one facet, you'll have use the facet['name'] param to find the one you want
        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)
        for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:

            # Get the year and the number of results, and convert them to integers, before adding to our results
            facets.append({"term": term["search"], "total_results": int(term["count"])})

        # Sort facets by year
        facets.sort(key=itemgetter("term"))
    except TypeError:
        pass
    return facets


def get_facet_data(params, start_decade=180, end_decade=201):
    """
    Loop throught the decades from 'start_decade' to 'end_decade',
    getting the number of search results for each year from the year facet.
    Combine all the results into a single list.
    Parameters:
        params - parameters to send to the API
        start_decade
        end_decade
    Returns:
        A list of dictionaries containing 'year', 'total_results' for the complete
        period between the start and end decades.
    """
    # Create a list to hold the facets data
    facet_data = []

    # Loop through the decades
    for decade in tqdm(range(start_decade, end_decade + 1)):

        # print(params)
        # Avoid confusion by copying the params before we change anything.
        search_params = params.copy()

        # Add decade value to params
        search_params["l-decade"] = decade

        # Get the data from the API
        data = get_results(search_params)

        # Get the facets from the data and add to facets_data
        facet_data += get_facets(data)

    # Reomve the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)
    clear_output()
    return facet_data

In [10]:

facet_data = get_facet_data(params)

In [11]:

# Convert our data to a dataframe called df
df = pd.DataFrame(facet_data)

In [12]:

df.head()

Out[12]:

	term	total_results
0	1803	526
1	1804	619
2	1805	430
3	1806	367
4	1807	134

So which year has the most corrections?

In [13]:

df.loc[df["total_results"].idxmax()]

Out[13]:

term               1915
total_results    270277
Name: 112, dtype: int64

The fact that there's more corrections in newspaper articles from 1915, might make you think that people have been more motivated to correct articles relating to WWI. But if you look at the total number of articles per year, you'll see that there's been more articles digitised from 1915! The raw number of corrections is probably not very useful, so let's look instead at the proportion of articles each year that have at least one correction.

To do that we'll re-harvest the facet data, but this time with a blank, or empty search, to get the total number of articles available from each year.

In [14]:

# Reset the 'q' parameter
# Use a an empty search (a single space) to get ALL THE ARTICLES
params["q"] = " "

# Get facet data for all articles
all_facet_data = get_facet_data(params)

In [15]:

# Convert the results to a dataframe
df_total = pd.DataFrame(all_facet_data)

No we'll merge the number of articles by year with corrections with the total number of articles. Then we'll calculate the proportion with corrections.

In [16]:

def merge_df_with_total(df, df_total, how="left"):
    """
    Merge dataframes containing search results with the total number of articles by year.
    This is a left join on the year column. The total number of articles will be added as a column to
    the existing results.
    Once merged, do some reorganisation and calculate the proportion of search results.
    Parameters:
        df - the search results in a dataframe
        df_total - total number of articles per year in a dataframe
    Returns:
        A dataframe with the following columns - 'year', 'total_results', 'total_articles', 'proportion'
        (plus any other columns that are in the search results dataframe).
    """
    # Merge the two dataframes on year
    # Note that we're joining the two dataframes on the year column
    df_merged = pd.merge(df, df_total, how=how, on="term")

    # Rename the columns for convenience
    df_merged.rename(
        {"total_results_y": "total_articles"}, inplace=True, axis="columns"
    )
    df_merged.rename({"total_results_x": "total_results"}, inplace=True, axis="columns")

    # Set blank values to zero to avoid problems
    df_merged["total_results"] = df_merged["total_results"].fillna(0).astype(int)

    # Calculate proportion by dividing the search results by the total articles
    df_merged["proportion"] = df_merged["total_results"] / df_merged["total_articles"]
    return df_merged

In [17]:

# Merge the search results with the total articles
df_merged = merge_df_with_total(df, df_total)
df_merged.head()

Out[17]:

	term	total_results	total_articles	proportion
0	1803	526	526	1.0
1	1804	619	619	1.0
2	1805	430	430	1.0
3	1806	367	367	1.0
4	1807	134	134	1.0

Let's visualise the results, showing both the number of articles with corrections each year, and the proportion of articles each year with corrections.

In [18]:

# Number of articles with corrections
chart1 = (
    alt.Chart(df_merged)
    .mark_line(point=True)
    .encode(
        x=alt.X("term:Q", axis=alt.Axis(format="c", title="Year")),
        y=alt.Y(
            "total_results:Q",
            axis=alt.Axis(format=",d", title="Number of articles with corrections"),
        ),
        tooltip=[
            alt.Tooltip("term:Q", title="Year"),
            alt.Tooltip("total_results:Q", title="Articles", format=","),
        ],
    )
    .properties(width=700, height=250)
)

# Proportion of articles with corrections
chart2 = (
    alt.Chart(df_merged)
    .mark_line(point=True, color="red")
    .encode(
        x=alt.X("term:Q", axis=alt.Axis(format="c", title="Year")),
        # This time we're showing the proportion (formatted as a percentage) on the Y axis
        y=alt.Y(
            "proportion:Q",
            axis=alt.Axis(format="%", title="Proportion of articles with corrections"),
        ),
        tooltip=[
            alt.Tooltip("term:Q", title="Year"),
            alt.Tooltip("proportion:Q", title="Proportion", format="%"),
        ],
        # Make the charts different colors
        color=alt.value("orange"),
    )
    .properties(width=700, height=250)
)

# This is a shorthand way of stacking the charts on top of each other
chart1 & chart2

Out[18]:

This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected.

Number of corrections by category¶

Let's see how the number of corrections varies across categories. This time we'll use the category facet instead of year.

In [19]:

params["q"] = "has:corrections"
params["facet"] = "category"

In [20]:

data = get_results(params)
facets = []
for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
    # Get the state and the number of results, and convert it to integers, before adding to our results
    facets.append({"term": term["search"], "total_results": int(term["count"])})
df_categories = pd.DataFrame(facets)

In [21]:

df_categories.head()

Out[21]:

	term	total_results
0	Article	10385409
1	Family Notices	1362744
2	Advertising	1319419
3	Detailed Lists, Results, Guides	526437
4	Literature	11335

Once again, the raw numbers are probably not all that useful, so let's get the total number of articles in each category and calculate the proportion that have at least one correction.

In [22]:

# Blank query
params["q"] = " "
data = get_results(params)
facets = []
for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
    # Get the state and the number of results, and convert it to integers, before adding to our results
    facets.append({"term": term["search"], "total_results": int(term["count"])})
df_total_categories = pd.DataFrame(facets)

We'll merge the two corrections by category data with the total articles per category and calculate the proportion.

In [23]:

df_categories_merged = merge_df_with_total(df_categories, df_total_categories)
df_categories_merged

Out[23]:

	term	total_results	total_articles	proportion
0	Article	10385409	162296409	0.063990
1	Family Notices	1362744	1926852	0.707239
2	Advertising	1319419	43238969	0.030515
3	Detailed Lists, Results, Guides	526437	26177949	0.020110
4	Literature	11335	34669	0.326949
5	Obituaries	9493	10459	0.907639
6	Humour	8698	26571	0.327349
7	News	7983	9928	0.804089
8	Law, Courts, And Crime	6615	9047	0.731182
9	Sport And Games	6029	13682	0.440652
10	Letters	3372	11674	0.288847
11	Editorial	2228	12908	0.172606
12	Arts And Culture	2212	3054	0.724296
13	Puzzles	1616	38716	0.041740
14	Reviews	1366	1877	0.727757
15	Shipping Notices	1332	1796	0.741648
16	Classified Advertisements And Notices	1222	1410	0.866667
17	Official Appointments And Notices	1019	1046	0.974187
18	Weather	1011	8625	0.117217
19	Commerce And Business	1005	2566	0.391660
20	Display Advertisement	415	456	0.910088

A lot of the categories have been added recently and don't contain a lot of articles. Some of these have a very high proportion of articles with corrections – 'Obituaries' for example. This suggests users are systematically categorising and correcting certain types of article.

Let's focus on the main categories by filtering out those with less than 30,000 articles.

In [24]:

df_categories_filtered = df_categories_merged.loc[
    df_categories_merged["total_articles"] > 30000
]
df_categories_filtered

Out[24]:

	term	total_results	total_articles	proportion
0	Article	10385409	162296409	0.063990
1	Family Notices	1362744	1926852	0.707239
2	Advertising	1319419	43238969	0.030515
3	Detailed Lists, Results, Guides	526437	26177949	0.020110
4	Literature	11335	34669	0.326949
13	Puzzles	1616	38716	0.041740

And now we can visualise the results.

In [25]:

cat_chart1 = (
    alt.Chart(df_categories_filtered)
    .mark_bar()
    .encode(
        x=alt.X("term:N", title="Category"),
        y=alt.Y("total_results:Q", title="Articles with corrections"),
    )
)

cat_chart2 = (
    alt.Chart(df_categories_filtered)
    .mark_bar()
    .encode(
        x=alt.X("term:N", title="Category"),
        y=alt.Y(
            "proportion:Q",
            axis=alt.Axis(format="%", title="Proportion of articles with corrections"),
        ),
        color=alt.value("orange"),
    )
)

cat_chart1 | cat_chart2

Out[25]:

As we can see, the rate of corrections is much higher in the 'Family Notices' category than any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names.

Number of corrections by newspaper¶

How do rates of correction vary across newspapers? We can use the title facet to find out.

In [26]:

params["q"] = "has:corrections"
params["facet"] = "title"

In [27]:

data = get_results(params)
facets = []
for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
    # Get the state and the number of results, and convert it to integers, before adding to our results
    facets.append({"term": term["search"], "total_results": int(term["count"])})
df_newspapers = pd.DataFrame(facets)

In [28]:

df_newspapers.head()

Out[28]:

	term	total_results
0	35	827605
1	13	780708
2	11	373062
3	16	347336
4	30	317381

Once again we'll calculate the proportion of articles corrected for each newspaper by getting the total number of articles for each newspaper on Trove.

In [29]:

params["q"] = " "

In [30]:

data = get_results(params)
facets = []
for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
    # Get the state and the number of results, and convert it to integers, before adding to our results
    facets.append({"term": term["search"], "total_results": int(term["count"])})
df_newspapers_total = pd.DataFrame(facets)

In [31]:

df_newspapers_merged = merge_df_with_total(
    df_newspapers, df_newspapers_total, how="right"
)

In [32]:

df_newspapers_merged.sort_values(by="proportion", ascending=False, inplace=True)
df_newspapers_merged.rename(columns={"term": "id"}, inplace=True)

In [33]:

df_newspapers_merged.head()

Out[33]:

	id	total_results	total_articles	proportion
1662	729	3	3	1.0
1548	1028	286	286	1.0
1629	1047	38	38	1.0
1617	686	53	53	1.0
1614	118	56	56	1.0

The title facet only gives us the id number for each newspaper, not its title. Let's get all the titles and then merge them with the facet data.

In [34]:

# Get all the newspaper titles
title_params = {
    "key": API_KEY,
    "encoding": "json",
}

title_data = s.get(
    "https://api.trove.nla.gov.au/v2/newspaper/titles", params=params
).json()

In [35]:

titles = []
for newspaper in title_data["response"]["records"]["newspaper"]:
    titles.append({"title": newspaper["title"], "id": int(newspaper["id"])})
df_titles = pd.DataFrame(titles)

In [36]:

df_titles.head()

Out[36]:

	title	id
0	Canberra Community News (ACT : 1925 - 1927)	166
1	Canberra Illustrated: A Quarterly Magazine (AC...	165
2	Federal Capital Pioneer (Canberra, ACT : 1924 ...	69
3	Good Neighbour (ACT : 1950 - 1969)	871
4	Student Notes/Canberra University College Stud...	665

In [37]:

df_titles.shape

Out[37]:

(1698, 2)

One problem with this list is that it also includes the titles of the Government Gazettes (this seems to be a bug in the API). Let's get the gazette titles and then subtract them from the complete list.

In [38]:

# Get gazette titles
gazette_data = s.get(
    "https://api.trove.nla.gov.au/v2/gazette/titles", params=params
).json()
gazettes = []
for gaz in gazette_data["response"]["records"]["newspaper"]:
    gazettes.append({"title": gaz["title"], "id": int(gaz["id"])})
df_gazettes = pd.DataFrame(gazettes)

In [39]:

df_gazettes.shape

Out[39]:

(38, 2)

Subtract the gazettes from the list of titles.

In [40]:

df_titles_not_gazettes = df_titles[~df_titles["id"].isin(df_gazettes["id"])]

Now we can merge the newspaper titles with the facet data using the id to link the two datasets.

In [41]:

df_newspapers_with_titles = (
    pd.merge(df_titles_not_gazettes, df_newspapers_merged, how="left", on="id")
    .fillna(0)
    .sort_values(by="proportion", ascending=False)
)

In [42]:

# Convert the totals back to integers
df_newspapers_with_titles[
    ["total_results", "total_articles"]
] = df_newspapers_with_titles[["total_results", "total_articles"]].astype(int)

Now we can display the newspapers with the highest rates of correction. Remember, that a proportion of 1.00 means that every available article has at least one correction.

In [43]:

df_newspapers_with_titles[:25]

Out[43]:

	title	id	total_results	total_articles	proportion
468	The Temora Telegraph and Mining Advocate (NSW ...	729	3	3	1.000000
824	Hobart Town Gazette and Van Diemen's Land Adve...	5	1556	1556	1.000000
421	The Satirist and Sporting Chronicle (Sydney, N...	1028	286	286	1.000000
149	Justice (Narrabri, NSW : 1891)	885	45	45	1.000000
921	Alexandra and Yea Standard, Thornton, Gobur an...	154	21	21	1.000000
739	Suedaustralische Zeitung (Adelaide, SA : 1850 ...	314	47	47	1.000000
472	The True Sun and New South Wales Independent P...	1038	20	20	1.000000
284	The Branxton Advocate: Greta and Rothbury Reco...	686	53	53	1.000000
24	The Australian Abo Call (National : 1938)	51	78	78	1.000000
838	Tasmanian and Port Dalrymple Advertiser (Launc...	273	193	193	1.000000
538	Moonta Herald and Northern Territory Gazette (...	118	56	56	1.000000
215	Society (Sydney, NSW : 1887)	1042	21	21	1.000000
195	Party (Sydney, NSW : 1942)	1000	6	6	1.000000
979	Elsternwick Leader and East Brighton, ... (Vic...	201	17	17	1.000000
1472	Swan River Guardian (WA : 1836 - 1838)	1142	437	437	1.000000
862	The Derwent Star and Van Diemen's Land Intelli...	1046	12	12	1.000000
563	Logan and Albert Advocate (Qld. : 1893 - 1900)	842	84	84	1.000000
905	The Van Diemen's Land Gazette and General Adve...	1047	38	38	1.000000
871	The Hobart Town Gazette and Southern Reporter ...	4	1922	1923	0.999480
2	Federal Capital Pioneer (Canberra, ACT : 1924 ...	69	542	545	0.994495
1231	The Melbourne Advertiser (Vic. : 1838)	935	120	121	0.991736
728	South Australian Gazette and Colonial Register...	40	1051	1065	0.986854
143	Intelligence (Bowral, NSW : 1884)	624	117	119	0.983193
892	The People's Horn Boy (Hobart Town, Tas. : 1834)	1240	99	101	0.980198
1657	York Advocate (WA : 1915)	1131	236	241	0.979253

At the other end, we can see the newspapers with the smallest rates of correction. Note that some newspapers have no corrections at all.

In [44]:

df_newspapers_with_titles.sort_values(by="proportion")[:25]

Out[44]:

	title	id	total_results	total_articles	proportion
1247	The Morwell Advocate and Narracan, Boolara and...	1734	0	208	0.000000
1559	The Mount Margaret Mercury (WA : 1897)	1641	0	24	0.000000
1553	The Miner's Right (Perth, WA : 1894)	1729	0	426	0.000000
1508	The Derby News (WA : 1887)	1617	0	9	0.000000
864	The Herald of Tasmania (Hobart, Tas. : 1845)	1741	0	50	0.000000
1246	The Morwell Advocate and Boolara and Mirboo Ch...	1733	0	33	0.000000
874	The Hobart Town Herald and abstinence advocate...	1743	0	434	0.000000
1144	Seamen's Strike Bulletin (Melbourne, Vic. : 1919)	1043	0	14	0.000000
872	The Hobart Town Herald (Tas. : 1845)	1740	0	57	0.000000
875	The Hobart Town Herald, or, Southern reporter ...	1742	0	103	0.000000
837	Saturday Evening Express (Launceston, Tas. : 1...	1747	7	44312	0.000158
514	Vil'na Dumka = Free Thought (Sydney, NSW : 194...	1593	2	11607	0.000172
926	Australier Leben = Australian Life (Melbourne,...	1686	1	3816	0.000262
500	To Ethnico Vema = Greek National Tribune (Arnc...	1592	27	62861	0.000430
685	Kimba Dispatch (SA. : 1927 - 1941)	1731	17	35136	0.000484
746	The Challenger (Port Lincoln, SA. : 1932 - 1934)	1732	3	5599	0.000536
1273	The Western Port Times and Phillip Island and ...	1365	7	12684	0.000552
155	L'Italo-Australiano = The Italo-Australian (Sy...	1597	5	6106	0.000819
178	Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195...	1594	9	9060	0.000993
764	The Northern Districts Courier (North Adelaide...	1711	1	885	0.001130
1250	The Narracan Shire Advocate and Yallourn Brown...	1735	49	42777	0.001145
147	Italo-Australian (Sydney, NSW : 1927 - 1940)	1595	56	38986	0.001436
1535	The Inland Watch (Leonora, WA : 1937 - 1943)	1630	50	28725	0.001741
873	The Hobart Town Herald (Tas. : 1880)	1744	1	560	0.001786
192	Oceania (Sydney, NSW : 1913 - 1915)	1598	4	2167	0.001846

We'll save the full list of newspapers as a CSV file, but first we'll fix up the column headings and add urls for each title.

In [45]:

df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()
df_newspapers_with_titles_csv.rename(
    {"total_results": "articles_with_corrections"}, axis=1, inplace=True
)
df_newspapers_with_titles_csv["percentage_with_corrections"] = (
    df_newspapers_with_titles_csv["proportion"] * 100
)
df_newspapers_with_titles_csv.sort_values(
    by=["percentage_with_corrections"], inplace=True
)
df_newspapers_with_titles_csv[
    [
        "id",
        "title",
        "articles_with_corrections",
        "total_articles",
        "percentage_with_corrections",
    ]
].to_csv("titles_corrected.csv", index=False)
df_newspapers_with_titles_csv["title_url"] = df_newspapers_with_titles_csv["id"].apply(
    lambda x: f"http://nla.gov.au/nla.news-title{x}"
)

Now we'll save the data as a CSV file and display a link.

In [46]:

df_newspapers_with_titles_csv.to_csv("titles_corrected.csv", index=False)
display(FileLink("titles_corrected.csv"))

titles_corrected.csv

Neediest newspapers¶

Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.

To make a guesstimate of error rates, we'll use the occurance of 'tbe' – ie a common OCR error for 'the'. I don't know how valid this is, but it's a place to start!

In [47]:

# Search for 'tbe' to get an indication of errors by newspaper
params["q"] = 'text:"tbe"~0'
params["facet"] = "title"

In [48]:

data = get_results(params)
facets = []
for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
    # Get the state and the number of results, and convert it to integers, before adding to our results
    facets.append({"term": term["search"], "total_results": int(term["count"])})
df_errors = pd.DataFrame(facets)

Merge the error data with the total articles per newspaper to calculate the proportion.

In [49]:

df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how="right")
df_errors_merged.sort_values(by="proportion", ascending=False, inplace=True)
df_errors_merged.rename(columns={"term": "id"}, inplace=True)

In [50]:

df_errors_merged.head()

Out[50]:

	id	total_results	total_articles	proportion
1253	1316	2004	2954	0.678402
1046	758	5245	8078	0.649294
821	927	9438	17227	0.547861
912	382	6956	12744	0.545825
938	262	6256	11527	0.542726

Add the title names.

In [51]:

df_errors_with_titles = (
    pd.merge(df_titles_not_gazettes, df_errors_merged, how="left", on="id")
    .fillna(0)
    .sort_values(by="proportion", ascending=False)
)

So this is a list of the newspapers with the highest rate of OCR error (by our rather dodgy measure).

In [52]:

df_errors_with_titles[:25]

Out[52]:

	title	id	total_results	total_articles	proportion
487	The Weekly Advance (Granville, NSW : 1892 - 1893)	1316	2004	2954	0.678402
977	Dunolly and Betbetshire Express and County of ...	758	5245	8078	0.649294
1021	Hamilton Spectator and Grange District Adverti...	927	9438	17227	0.547861
519	Wagga Wagga Express and Murrumbidgee District ...	382	6956	12744	0.545825
619	The North Australian, Ipswich and General Adve...	262	6256	11527	0.542726
618	The North Australian (Brisbane, Qld. : 1863 - ...	264	2868	5314	0.539706
864	The Herald of Tasmania (Hobart, Tas. : 1845)	1741	26	50	0.520000
342	The Hay Standard and Advertiser for Balranald,...	725	21671	42068	0.515142
209	Robertson Advocate (NSW : 1894 - 1923)	530	36946	72383	0.510424
235	Temora Herald and Mining Journal (NSW : 1882 -...	728	639	1253	0.509976
229	Sydney Mail (NSW : 1860 - 1871)	697	24539	48535	0.505594
840	Tasmanian Morning Herald (Hobart, Tas. : 1865 ...	865	5178	10290	0.503207
835	Morning Star and Commercial Advertiser (Hobart...	1242	850	1703	0.499119
169	Molong Argus (NSW : 1896 - 1921)	424	52061	104984	0.495895
1114	Port Phillip Gazette (Vic. : 1851)	1139	241	491	0.490835
1115	Port Phillip Gazette and Settler's Journal (Vi...	1138	5947	12127	0.490393
846	Telegraph (Hobart Town, Tas. : 1867)	1250	68	140	0.485714
872	The Hobart Town Herald (Tas. : 1845)	1740	27	57	0.473684
310	The Cumberland Free Press (Parramatta, NSW : 1...	724	6231	13247	0.470371
908	Trumpeter General (Hobart, Tas. : 1833 - 1834)	869	693	1482	0.467611
649	Adelaide Chronicle and South Australian Litera...	986	892	1937	0.460506
565	Logan Witness (Beenleigh, Qld. : 1878 - 1893)	850	6681	14654	0.455916
392	The News, Shoalhaven and Southern Coast Distri...	1588	2466	5495	0.448772
611	The Darling Downs Gazette and General Advertis...	257	29213	65268	0.447585
859	The Cornwall Chronicle (Launceston, Tas. : 183...	170	72532	163791	0.442833

And those with the lowest rate of errors. Note the number of non-English newspapers in this list – of course our measure of accuracy fails completely in newspapers that don't use the word 'the'!

In [53]:

df_errors_with_titles[-25:]

Out[53]:

	title	id	total_articles
669	Deutsche Zeitung für Sud-Australien = German ...	1577	14
774	The Progressive Times (Largs North, SA : 1949 ...	1307	1446
1506	The Dawn Newsletter (Perth, WA : 1952 - 1954)	1773	40
1359	Eco Italiano (Perth, WA : 1958 - 1959)	1387	1579
1208	The Elsternwick Leader and Caulfield and Balac...	200	47
1358	Echo : Polski Tygodnik Niezalezny (Perth, WA :...	1384	2601
34	Auburn and District News (NSW : 1929)	1320	25
1354	Der Australische Spiegel = The Australian Mirr...	1385	1455
468	The Temora Telegraph and Mining Advocate (NSW ...	729	3
921	Alexandra and Yea Standard, Thornton, Gobur an...	154	21
1348	Dampier Despatch (Broome, WA : 1904 - 1905)	1407	871
472	The True Sun and New South Wales Independent P...	1038	20
49	Blayney West Macquarie (NSW : 1949)	802	110
195	Party (Sydney, NSW : 1942)	1000	6
1597	The Southern Cross (Perth, WA : 1893)	1660	59
905	The Van Diemen's Land Gazette and General Adve...	1047	38
193	Out Of Work (Sydney, NSW : 1922)	1008	32
862	The Derwent Star and Van Diemen's Land Intelli...	1046	12
1195	The Chinese Advertiser (Ballarat, Vic. : 1856)	706	15
1	Canberra Illustrated: A Quarterly Magazine (AC...	165	57
1505	The Dawn News-Sheet (Perth, WA : 1950 - 1952)	1772	29
1589	The Possum (Fremantle, WA : 1890)	1201	105
69	Citizen Soldier (Sydney, NSW : 1942)	996	60
70	Clarence and Richmond Examiner (Grafton, NSW :...	104	111
196	Party Builder (Sydney, NSW : 1942)	1001	160

Now let's merge the error data with the correction data.

In [54]:

corrections_errors_merged_df = pd.merge(
    df_newspapers_with_titles, df_errors_with_titles, how="left", on="id"
)

In [55]:

corrections_errors_merged_df.head()

Out[55]:

	title_x	id	total_results_x	total_articles_x	proportion_x	title_y	total_results_y	total_articles_y	proportion_y
0	The Temora Telegraph and Mining Advocate (NSW ...	729	3	3	1.0	The Temora Telegraph and Mining Advocate (NSW ...	0	3	0.000000
1	Hobart Town Gazette and Van Diemen's Land Adve...	5	1556	1556	1.0	Hobart Town Gazette and Van Diemen's Land Adve...	40	1556	0.025707
2	The Satirist and Sporting Chronicle (Sydney, N...	1028	286	286	1.0	The Satirist and Sporting Chronicle (Sydney, N...	0	286	0.000000
3	Justice (Narrabri, NSW : 1891)	885	45	45	1.0	Justice (Narrabri, NSW : 1891)	1	45	0.022222
4	Alexandra and Yea Standard, Thornton, Gobur an...	154	21	21	1.0	Alexandra and Yea Standard, Thornton, Gobur an...	0	21	0.000000

In [56]:

corrections_errors_merged_df["proportion_uncorrected"] = corrections_errors_merged_df[
    "proportion_x"
].apply(lambda x: 1 - x)
corrections_errors_merged_df.rename(
    columns={
        "title_x": "title",
        "proportion_x": "proportion_corrected",
        "proportion_y": "proportion_with_errors",
    },
    inplace=True,
)
corrections_errors_merged_df.sort_values(
    by=["proportion_with_errors", "proportion_uncorrected"],
    ascending=False,
    inplace=True,
)

So, for what it's worth, here's a list of the neediest newspapers – those with high error rates and low correction rates! As I've said, this is a pretty dodgy method, but interesting nonetheless.

In [57]:

corrections_errors_merged_df[
    ["title", "proportion_with_errors", "proportion_uncorrected"]
][:25]

Out[57]:

	title	proportion_with_errors	proportion_uncorrected
1193	The Weekly Advance (Granville, NSW : 1892 - 1893)	0.678402	0.970887
630	Dunolly and Betbetshire Express and County of ...	0.649294	0.929809
393	Hamilton Spectator and Grange District Adverti...	0.547861	0.886631
419	Wagga Wagga Express and Murrumbidgee District ...	0.545825	0.893597
189	The North Australian, Ipswich and General Adve...	0.542726	0.758393
259	The North Australian (Brisbane, Qld. : 1863 - ...	0.539706	0.827249
1653	The Herald of Tasmania (Hobart, Tas. : 1845)	0.520000	1.000000
1030	The Hay Standard and Advertiser for Balranald,...	0.515142	0.960397
857	Robertson Advocate (NSW : 1894 - 1923)	0.510424	0.949864
615	Temora Herald and Mining Journal (NSW : 1882 -...	0.509976	0.927374
349	Sydney Mail (NSW : 1860 - 1871)	0.505594	0.871639
509	Tasmanian Morning Herald (Hobart, Tas. : 1865 ...	0.503207	0.910690
144	Morning Star and Commercial Advertiser (Hobart...	0.499119	0.707575
691	Molong Argus (NSW : 1896 - 1921)	0.495895	0.935047
220	Port Phillip Gazette (Vic. : 1851)	0.490835	0.798371
204	Port Phillip Gazette and Settler's Journal (Vi...	0.490393	0.786757
197	Telegraph (Hobart Town, Tas. : 1867)	0.485714	0.771429
1654	The Hobart Town Herald (Tas. : 1845)	0.473684	1.000000
361	The Cumberland Free Press (Parramatta, NSW : 1...	0.470371	0.876349
134	Trumpeter General (Hobart, Tas. : 1833 - 1834)	0.467611	0.671390
127	Adelaide Chronicle and South Australian Litera...	0.460506	0.665978
261	Logan Witness (Beenleigh, Qld. : 1878 - 1893)	0.455916	0.828102
1139	The News, Shoalhaven and Southern Coast Distri...	0.448772	0.967789
278	The Darling Downs Gazette and General Advertis...	0.447585	0.836168
243	The Cornwall Chronicle (Launceston, Tas. : 183...	0.442833	0.816089

Created by Tim Sherratt for the GLAM Workbench.
Support this project by becoming a GitHub sponsor.