Finding non-English newspapers in Trove

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

How not to do it...

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using format:Periodical/Newspaper in the books and libraries category (or the article API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the sort of results you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
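    issn = newspaper.get('issn')
    # Not every title has an ISSN, so skip those rather than searching for 'issn:None'
    if not issn:
        print('No ISSN')
        continue
    params['q'] = f'issn:{issn}'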
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...

How I actually did it

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found pycld3 which installed with pip, and just worked.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}

Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, aggregated the counts, and then divided each count by the number of articles in that newspaper's sample for which the detector returned a reliable result. This gave me the proportion of articles in each language, a number I could compare across newspapers to find the non-English titles.
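
To make the arithmetic concrete, here's a minimal sketch of the proportion calculation on its own. The function name `language_proportions` and the sample list are made up for illustration; the notebook does the same thing inline in the harvesting loop further down.

```python
from collections import Counter


def language_proportions(langs):
    """Return {language code: share of articles} for one newspaper's sample.

    `langs` holds one language code per article with a reliable detection.
    """
    total = len(langs)
    if total == 0:
        # No reliable detections, so there is nothing to report
        return {}
    return {lang: count / total for lang, count in Counter(langs).items()}


# Illustrative sample: 20 articles, 18 detected as German and 2 as English
sample = ["de"] * 18 + ["en"] * 2
print(language_proportions(sample))
```

Because the result is a share rather than a raw count, a title with only 20 usable articles can be compared with one that returned the full 100.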

In general this worked pretty well, and the result was a list of 52 newspapers (also as a Gist) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

Problems / limitations

It's no surprise that the results of the language detection are affected by the quality of the OCR.
In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content.
I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed.
I'm just taking the first 100 results from a blank search in each newspaper. Larger, or more randomised samples might produce different results.
Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.
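
If you wanted to experiment with the sampling limitation, one option would be to harvest more results per title and then draw a random subset. This is only a sketch of the sampling step, not something the notebook does: `sample_articles` is a hypothetical helper, and it assumes you've already collected a larger list of articles by some other means.

```python
import random


def sample_articles(articles, k=100, seed=None):
    """Pick up to k articles at random from a larger pool of results.

    Passing a seed makes the selection repeatable between runs.
    """
    rng = random.Random(seed)
    if len(articles) <= k:
        # Short-run titles may not have k articles, so keep them all
        return list(articles)
    return rng.sample(articles, k)


# Illustrative pool of 500 article ids
pool = list(range(500))
picked = sample_articles(pool, k=100, seed=1)
print(len(picked))
```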

Import what we need

In [1]:

import os
import re
import time
from collections import Counter
from pathlib import Path

import altair as alt
import cld3
import pandas as pd
import requests_cache
from IPython.display import display
from language_tags import tags
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

In [2]:

%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [3]:

# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

Harvest the data and run language detection on articles

In [4]:

def get_newspapers():
    """
    Get a list of newspapers in Trove.
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v2/newspaper/titles",
        params={"encoding": "json", "key": API_KEY},
    )
    data = response.json()
    return data["response"]["records"]["newspaper"]

In [5]:

params = {
    "zone": "newspaper",
    "encoding": "json",
    # 'l-category': 'Article',
    "l-word": "100 - 1000 Words",
    "include": "articletext",
    "key": API_KEY,
    "q": " ",
    "n": 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params["l-title"] = newspaper["id"]
    response = s.get("https://api.trove.nla.gov.au/v2/result", params=params)
    data = response.json()
    n = data["response"]["zone"][0]["records"]["n"]
    try:
        articles = data["response"]["zone"][0]["records"]["article"]
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if "articleText" in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article["articleText"]
                text = re.sub(r"<[^<]+?>", "", text)
                text = re.sub(r"\s\s+", " ", text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
        # Find the count of each language detected in the sample of articles
        for lang, count in Counter(langs).items():
            # Calculate the language count as a proportion of the articles with a reliable detection
            prop = count / len(langs)
            newspaper_langs.append(
                {
                    "id": newspaper["id"],
                    "title": newspaper["title"],
                    "language": lang,
                    "proportion": prop,
                    "number": n,
                }
            )
    if not response.from_cache:
        time.sleep(0.2)

  0%|          | 0/1741 [00:00<?, ?it/s]
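
The clean-up step inside the loop is easy to miss, so here it is pulled out as a standalone sketch. The function name `clean_ocr_text` is made up for illustration; the two regular expressions are the ones used in the cell above.

```python
import re


def clean_ocr_text(text):
    """Remove HTML tags and collapse runs of whitespace in article text."""
    # Drop anything that looks like a tag, e.g. <p> or </span>
    text = re.sub(r"<[^<]+?>", "", text)
    # Replace two or more whitespace characters with a single space
    text = re.sub(r"\s\s+", " ", text)
    return text


print(clean_ocr_text("<p>Ein  kurzer\n\nSatz</p>"))
```

Stripping the markup first means the language detector only sees the OCRd words, not the tags that wrap them.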

Convert the results into a dataframe.

In [6]:

df = pd.DataFrame(newspaper_langs)
df.head()

Out[6]:

	id	title	language	proportion	number
0	166	Canberra Community News (ACT : 1925 - 1927)	en	1.0	100
1	165	Canberra Illustrated: A Quarterly Magazine (AC...	en	1.0	29
2	69	Federal Capital Pioneer (Canberra, ACT : 1924 ...	en	1.0	100
3	871	Good Neighbour (ACT : 1950 - 1969)	en	1.0	100
4	665	Student Notes/Canberra University College Stud...	en	1.0	100

Add full language names

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the language-tags package.

In [7]:

def get_full_language(lc):
    """
    Get full language names from codes
    """
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc


df["language_full"] = df["language"].apply(get_full_language)

Filtering the results

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were four newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.

In [8]:

df["language_full"].value_counts()

Out[8]:

English                  1680
Maltese                   177
Japanese                   28
Italian                    22
Somali                     18
German                     16
Welsh                      15
Catalan                    12
Portuguese                  9
Norwegian                   9
Chinese                     8
Estonian                    7
Danish                      7
Hindi                       6
French                      6
Western Frisian             6
Corsican                    6
Hawaiian                    4
Bulgarian                   4
Vietnamese                  4
Polish                      4
Igbo                        4
Indonesian                  4
Modern Greek (1453-)        4
Luxembourgish               3
Javanese                    3
Yiddish                     3
Dutch                       3
Scottish Gaelic             3
Swedish                     3
Czech                       2
Samoan                      2
Latin                       2
Kurdish                     2
Malagasy                    2
Filipino                    2
Russian                     2
Malay (macrolanguage)       2
Bosnian                     2
Spanish                     2
Cebuano                     2
Uzbek                       1
Slovenian                   1
Irish                       1
Croatian                    1
Haitian                     1
Turkish                     1
Hebrew                      1
Maori                       1
Zulu                        1
Galician                    1
Latvian                     1
Shona                       1
Ukrainian                   1
Lithuanian                  1
Afrikaans                   1
Hausa                       1
Macedonian                  1
Name: language_full, dtype: int64

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [9]:

df.loc[df["proportion"] == 1]["language_full"].value_counts()

Out[9]:

English                 1422
German                     3
Italian                    3
Modern Greek (1453-)       2
Estonian                   1
Yiddish                    1
Name: language_full, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [10]:

alt.Chart(df).mark_bar().encode(x=alt.X("proportion:Q", bin=True), y="count():Q")

Out[10]:

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [11]:

alt.Chart(df.loc[df["proportion"] < 0.1]).mark_bar().encode(
    x=alt.X("proportion:Q", bin=True), y="count():Q"
)

Out[11]:

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 33 newspapers published articles in Maltese?

In [12]:

df.loc[df["proportion"] >= 0.05]["language_full"].value_counts()

Out[12]:

English                  1670
Maltese                    33
Italian                    15
German                      9
Chinese                     8
Somali                      5
Modern Greek (1453-)        4
Japanese                    3
Portuguese                  3
Yiddish                     3
French                      3
Polish                      3
Western Frisian             2
Dutch                       2
Malay (macrolanguage)       1
Lithuanian                  1
Ukrainian                   1
Estonian                    1
Indonesian                  1
Vietnamese                  1
Danish                      1
Swedish                     1
Bosnian                     1
Russian                     1
Scottish Gaelic             1
Welsh                       1
Spanish                     1
Corsican                    1
Macedonian                  1
Bulgarian                   1
Name: language_full, dtype: int64
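
It's worth remembering that a 0.05 cut-off means different things for different sample sizes. The sketch below works out how many articles a language needs to reach a given proportion. `articles_needed` is a hypothetical helper, and `sample_size` is assumed to be the number of articles with a reliable detection for that title.

```python
def articles_needed(threshold, sample_size):
    """Smallest number of articles whose share of the sample reaches the threshold.

    Returns None if the sample is empty or the threshold can't be reached.
    """
    if sample_size <= 0:
        return None
    for count in range(sample_size + 1):
        if count / sample_size >= threshold:
            return count
    return None


for size in (100, 14, 4):
    print(size, articles_needed(0.05, size))
```

With a full sample of 100 a language needs five articles to pass, but in a title with only a handful of usable articles a single misdetected article is enough. That's one reason short-run titles turn up among the surprises.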

If we focus in on the newspapers that supposedly have a significant proportion of articles in Maltese, we see some very strange results. I seriously doubt that 76% of the Mildura Irrigationist from 1892-3 is in Maltese. So what's going on?

In [13]:

df.loc[(df["proportion"] > 0.1) & (df["language_full"] == "Maltese")]

Out[13]:

	id	title	language	proportion	number	language_full
203	1596	L'Italo-Australiano = The Italo-Australian (Su...	mt	0.206349	100	Maltese
270	389	Reporter and Illawarra Journal (Kiama, NSW : 1...	mt	0.105882	100	Maltese
286	418	Southern Morning Herald (Goulburn, NSW : 1920 ...	mt	0.146667	100	Maltese
289	623	Sunday News (Sydney, NSW : 1919)	mt	0.181818	100	Maltese
530	500	The Richmond River Express and Casino Kyogle A...	mt	0.126437	100	Maltese
654	810	Upper Hunter Courier (Murrurundi, NSW : 1871)	mt	0.142857	14	Maltese
812	892	Warwick Daily News (Qld. : 1919 -1954)	mt	0.111111	100	Maltese
928	34	The Advertiser (Adelaide, SA : 1889 - 1931)	mt	0.486111	100	Maltese
1205	543	Cobden Times (Vic. : 1918)	mt	0.109890	100	Maltese
1375	384	North Melbourne Gazette (Vic. : 1894 - 1901)	mt	0.189873	100	Maltese
1431	318	Sandringham Southern Cross (Vic. : 1914 - 1918)	mt	0.243902	100	Maltese
1565	1583	The Mildura Irrigationist (Vic. : 1892 - 1893)	mt	0.762500	100	Maltese
1568	1581	The Mildura Irrigationist and Murray River Agr...	mt	0.626667	100	Maltese
1577	1733	The Morwell Advocate and Boolara and Mirboo Ch...	mt	0.625000	21	Maltese
1580	1734	The Morwell Advocate and Narracan, Boolara and...	mt	0.170732	100	Maltese
1927	1617	The Derby News (WA : 1887)	mt	0.750000	5	Maltese

If you look at results for the Mildura Irrigationist in Trove you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lHa ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiwa afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f*' ""j •fria—lhati tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aMtoclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af t«l. i pwwiaf Mtan (tot jw. twy MwUI «a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa

What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is 96% sure that it's Maltese! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [14]:

ocr = """ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa"""
cld3.get_language(ocr)

Out[14]:

LanguagePrediction(language='mt', probability=0.960280179977417, is_reliable=True, proportion=1.0)

Of course there might actually be newspapers with articles in Maltese, so we don't want to filter them all out. So let's do some manual inspection of the newspapers that seem to have non-English content. First we'll filter our results to include only languages with proportions of at least 0.05, and then drop out newspapers that seem to be only in English. We end up with 89 different titles.

In [15]:

# The filter on the groupby drops out newspapers that only have articles in English.
filtered = (
    df.loc[df["proportion"] >= 0.05]
    .groupby(by=["title", "id"])
    .filter(lambda x: (x["language"] != "en").any())
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

Out[15]:

89
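
The pandas filter in cell 15 packs two steps into a few lines, so here's a plain Python sketch of the same rule: keep only the language rows at or above the threshold, then keep only the titles that still have at least one non-English language. The function name and the rows are illustrative. The first four rows borrow values from the results listed in this notebook, while title id "13" is invented to show a low-proportion detection being dropped.

```python
from collections import defaultdict


def titles_with_non_english(rows, threshold=0.05):
    """Return {title id: [language codes]} for titles with non-English content."""
    by_title = defaultdict(list)
    for row in rows:
        # Step 1: ignore languages below the proportion threshold
        if row["proportion"] >= threshold:
            by_title[row["id"]].append(row["language"])
    # Step 2: drop titles where English is the only language left
    return {
        title_id: langs
        for title_id, langs in by_title.items()
        if any(lang != "en" for lang in langs)
    }


rows = [
    {"id": "277", "language": "de", "proportion": 1.0},
    {"id": "166", "language": "en", "proportion": 1.0},
    {"id": "829", "language": "fr", "proportion": 0.76},
    {"id": "829", "language": "en", "proportion": 0.24},
    {"id": "13", "language": "en", "proportion": 0.99},  # invented row
    {"id": "13", "language": "mt", "proportion": 0.01},  # invented row
]
print(titles_with_non_english(rows))
```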

Let's list those 89 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [16]:

for n, l in papers:
    if not l.loc[(l["language"] != "en") & (l["proportion"] >= 0.05)].empty:
        print(f"\n{n[0]} ({n[1]})")
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )

A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)

	language_full	language	proportion
8	Portuguese	pt	0.988889

Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)

	language_full	language	proportion
828	German	de	1.0

Auburn and District News (NSW : 1929) (1320)

	language_full	language	proportion
43	English	en	0.947368
44	Vietnamese	vi	0.052632

Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)

	language_full	language	proportion
1158	Yiddish	yi	1.0

Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)

	language_full	language	proportion
832	German	de	1.0

Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)

	language_full	language	proportion
14	Malay (macrolanguage)	ms	0.891304
15	Indonesian	id	0.108696

Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)

	language_full	language	proportion
83	Chinese	zh	0.928571

Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)

	language_full	language	proportion
1194	Chinese	zh	0.918367

Chronicle and North Coast Advertiser (Qld. : 1903 - 1922) (286)

	language_full	language	proportion
695	English	en	0.94898
696	Maltese	mt	0.05102

Chung Wah News (Perth, WA : 1981 - 1987) (1383)

	language_full	language	proportion
1694	English	en	0.566667
1693	Chinese	zh	0.388889

Cobden Times (Vic. : 1918) (543)

	language_full	language	proportion
1204	English	en	0.857143
1205	Maltese	mt	0.109890

Colac Reformer (Vic. : 1914 - 1918) (763)

	language_full	language	proportion
1214	English	en	0.947368
1215	Maltese	mt	0.052632

Daily Post (Hobart, Tas. : 1908 - 1918) (860)

	language_full	language	proportion
1011	English	en	0.719101
1012	Japanese	ja	0.112360

Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)

	language_full	language	proportion
1716	German	de	0.82
1717	English	en	0.18

Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)

	language_full	language	proportion
125	German	de	1.0

Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)

	language_full	language	proportion
844	German	de	0.9
843	English	en	0.1

Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)

	language_full	language	proportion
126	German	de	0.704082
127	English	en	0.295918

Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)

	language_full	language	proportion
845	German	de	0.989583

Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)

	language_full	language	proportion
131	Dutch	nl	0.969697

Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)

	language_full	language	proportion
134	Dutch	nl	0.919192
135	English	en	0.060606

Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)

	language_full	language	proportion
1721	Polish	pl	0.91
1722	English	en	0.09

Eco Italiano (Perth, WA : 1958 - 1959) (1387)

	language_full	language	proportion
1723	Italian	it	1.0

Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899) (116)

	language_full	language	proportion
1027	English	en	0.933333
1028	Maltese	mt	0.066667

Evelyn Observer, and South and East Bourke Record (Vic. : 1882 - 1902) (145)

	language_full	language	proportion
1241	English	en	0.913978
1240	Maltese	mt	0.075269

Geraldton Advocate and Johnstone River Guardian (Qld. : 1895 - 1896) (1103)

	language_full	language	proportion
704	English	en	0.947917
705	Maltese	mt	0.052083

Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)

	language_full	language	proportion
1734	English	en	0.643836
1735	Maltese	mt	0.095890
1739	Japanese	ja	0.068493

Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)

	language_full	language	proportion
162	Chinese	zh	0.854167
165	Western Frisian	fy	0.062500

Hamilton Spectator and Grange District Advertiser (Vic. : 1860 - 1870) (927)

	language_full	language	proportion
1282	English	en	0.915789
1283	Maltese	mt	0.073684

Hellenic Echo (Perth, WA : 1967 - 1968) (1389)

	language_full	language	proportion
1771	Modern Greek (1453-)	el	1.0

Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)

	language_full	language	proportion
1773	Italian	it	0.97

Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)

	language_full	language	proportion
175	Italian	it	0.91
176	English	en	0.09

Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)

	language_full	language	proportion
177	Italian	it	0.75
178	English	en	0.25

Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)

	language_full	language	proportion
188	English	en	0.833333
189	Italian	it	0.166667

Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)

	language_full	language	proportion
190	English	en	0.893617
191	Italian	it	0.106383

Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)

	language_full	language	proportion
192	Italian	it	0.97

Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)

	language_full	language	proportion
1777	Japanese	ja	0.9375

Kyabram Union (Vic. : 1886 - 1894) (196)

	language_full	language	proportion
1326	English	en	0.931818
1327	Maltese	mt	0.068182

L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)

	language_full	language	proportion
202	Italian	it	0.698413
203	Maltese	mt	0.206349

L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)

	language_full	language	proportion
208	Italian	it	0.97

La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)

	language_full	language	proportion
1796	Italian	it	0.98

Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)

	language_full	language	proportion
212	French	fr	0.76
213	English	en	0.24

Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)

	language_full	language	proportion
1815	Modern Greek (1453-)	el	0.357143
1814	English	en	0.224490
1816	Portuguese	pt	0.153061
1809	French	fr	0.081633
1808	Spanish	es	0.061224

Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956) (280)

	language_full	language	proportion
221	Estonian	et	1.0

Murchison Times and Cue-Big Bell-Reedy Advocate (WA : 1937 - 1942) (1543)

	language_full	language	proportion
1838	English	en	0.892857
1839	Maltese	mt	0.071429

Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)

	language_full	language	proportion
233	Lithuanian	lt	0.95

Nasza droga (Adelaide, SA : 1952 - 1954) (1323)

	language_full	language	proportion
869	Polish	pl	0.89
870	English	en	0.11

Norden (Melbourne, Vic. : 1914 - 1918) (797)

	language_full	language	proportion
1366	Danish	da	0.752809
1369	Swedish	sv	0.112360
1367	English	en	0.067416

North Melbourne Gazette (Vic. : 1894 - 1901) (384)

	language_full	language	proportion
1374	English	en	0.784810
1375	Maltese	mt	0.189873

Oceania (Sydney, NSW : 1913 - 1915) (1598)

	language_full	language	proportion
254	Italian	it	0.54
255	English	en	0.46

Reporter and Illawarra Journal (Kiama, NSW : 1887 - 1894) (389)

	language_full	language	proportion
269	English	en	0.894118
270	Maltese	mt	0.105882

Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)

	language_full	language	proportion
271	French	fr	0.98

Ringwood and Croydon Chronicle (Vic. : 1914 - 1918) (329)

	language_full	language	proportion
1422	English	en	0.938144
1423	Maltese	mt	0.061856

Sandringham Southern Cross (Vic. : 1914 - 1918) (318)

	language_full	language	proportion
1430	English	en	0.731707
1431	Maltese	mt	0.243902

Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)

	language_full	language	proportion
1436	Polish	pl	0.4
1435	Bosnian	bs	0.2
1437	Russian	ru-Latn	0.2
1438	Western Frisian	fy	0.2

Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)

	language_full	language	proportion
285	English	en	0.800000
286	Maltese	mt	0.146667
287	Somali	so	0.053333

Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)

	language_full	language	proportion
1881	Italian	it	0.97

Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)

	language_full	language	proportion
924	German	de	0.888889
925	English	en	0.111111

Sunday News (Sydney, NSW : 1919) (623)

	language_full	language	proportion
290	English	en	0.779221
289	Maltese	mt	0.181818

Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)

	language_full	language	proportion
1888	Italian	it	1.0

Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)

	language_full	language	proportion
922	German	de	0.989691

The Advertiser (Adelaide, SA : 1889 - 1931) (34)

	language_full	language	proportion
927	English	en	0.513889
928	Maltese	mt	0.486111

The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)

	language_full	language	proportion
1473	English	en	0.810526
1475	Yiddish	yi	0.157895

The Castlereagh (Gilgandra, NSW : 1905 - 1907) (224)

	language_full	language	proportion
384	English	en	0.609195
385	Somali	so	0.310345
386	Maltese	mt	0.080460

The Chinese Advertiser (Ballarat, Vic. : 1856) (706)

	language_full	language	proportion
1504	Chinese	zh	0.500000
1506	English	en	0.333333
1505	Scottish Gaelic	gd	0.166667

The Derby News (WA : 1887) (1617)

	language_full	language	proportion
1927	Maltese	mt	0.75
1928	Corsican	co	0.25

The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)

	language_full	language	proportion
1522	English	en	0.894737
1523	Chinese	zh	0.052632
1524	Maltese	mt	0.052632

The Hay Standard and Advertiser for Balranald, Wentworth, Maude...(Hay, NSW : 1871 - 1873; 1880 - 1881; 1890 - 1900) (725)

	language_full	language	proportion
441	English	en	0.947368
442	Maltese	mt	0.052632

The Herald of Tasmania (Hobart, Tas. : 1845) (1741)

	language_full	language	proportion
1083	English	en	0.857143
1085	Italian	it	0.095238

The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)

	language_full	language	proportion
1535	English	en	0.81
1536	Yiddish	yi	0.19

The Melbourne Advertiser (Vic. : 1838) (935)

	language_full	language	proportion
1550	English	en	0.666667
1551	Welsh	cy	0.333333

The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)

	language_full	language	proportion
1565	Maltese	mt	0.7625
1564	English	en	0.1250
1566	Somali	so	0.1125

The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)

	language_full	language	proportion
1568	Maltese	mt	0.626667
1569	English	en	0.240000
1567	Somali	so	0.133333

The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)

	language_full	language	proportion
1570	English	en	0.746667
1571	Somali	so	0.146667
1572	Maltese	mt	0.093333

The Miner's Right (Boulder, WA : 1897) (1638)

	language_full	language	proportion
1984	English	en	0.908163
1986	Maltese	mt	0.061224

The Morwell Advocate and Boolara and Mirboo Chronicle (Vic. : 1886) (1733)

	language_full	language	proportion
1577	Maltese	mt	0.625
1578	English	en	0.375

The Morwell Advocate and Narracan, Boolara and Mirboo Chronicle (Vic. : 1886) (1734)

	language_full	language	proportion
1579	English	en	0.829268
1580	Maltese	mt	0.170732

The Reporter (Box Hill, Vic. : 1889 - 1925) (244)

	language_full	language	proportion
1594	English	en	0.904255
1593	Maltese	mt	0.085106

The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)

	language_full	language	proportion
532	English	en	0.827586
530	Maltese	mt	0.126437

The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)

	language_full	language	proportion
2064	Modern Greek (1453-)	el	0.98

To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)

	language_full	language	proportion
626	Modern Greek (1453-)	el	1.0

Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)

	language_full	language	proportion
632	Chinese	zh	0.94

Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)

	language_full	language	proportion
638	Chinese	zh	0.926316

Twofold Bay and Maneroo Observer (NSW : 1860) (394)

	language_full	language	proportion
645	English	en	0.886364
647	Maltese	mt	0.090909

Uniamoci (Sydney, NSW : 1903 - 1904) (1599)

	language_full	language	proportion
652	Italian	it	1.0

Upper Hunter Courier (Murrurundi, NSW : 1871) (810)

	language_full	language	proportion
653	English	en	0.857143
654	Maltese	mt	0.142857

Vesnik (Perth, WA : 1975 - 1994) (1382)

	language_full	language	proportion
2093	Macedonian	mk	0.408163
2092	English	en	0.357143
2094	Bulgarian	bg-Latn	0.224490

Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)

	language_full	language	proportion
655	Ukrainian	uk	0.82
656	English	en	0.18

Warwick Daily News (Qld. : 1919 -1954) (892)

	language_full	language	proportion
811	English	en	0.864198
812	Maltese	mt	0.111111

Williamstown Trade Circular (Vic. : 1855 - 1856) (213)

	language_full	language	proportion
1658	English	en	0.882353
1659	Portuguese	pt	0.117647

I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.

In [17]:

# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = [
    "1036",
    "1043",
    "1103",
    "116",
    "1207",
    "1265",
    "13",
    "1320",
    "1336",
    "140",
    "1400",
    "145",
    "1488",
    "1543",
    "1546",
    "1581",
    "1582",
    "1583",
    "1617",
    "1623",
    "1626",
    "1638",
    "1675",
    "1678",
    "171",
    "1733",
    "1734",
    "1741",
    "196",
    "213",
    "224",
    "244",
    "286",
    "292",
    "318",
    "329",
    "34",
    "384",
    "389",
    "394",
    "418",
    "430",
    "431",
    "452",
    "479",
    "499",
    "500",
    "543",
    "570",
    "623",
    "725",
    "763",
    "810",
    "860",
    "886",
    "892",
    "906",
    "92",
    "926",
    "927",
    "935",
    "937",
    "94",
    "946",
    "970",
    "986",
]

Here we'll add the dodgy title ids into our filter. It seems that we have 52 newspapers with significant amounts of non-English content.

In [18]:

# Drop the dodgy titles and any language below 5% of the sample,
# then remove titles whose only remaining language is English
filtered = (
    df.loc[(~df["id"].isin(dodgy)) & (df["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (x["language"].iloc[0] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

Out[18]:

52

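If the groupby filter seems a bit opaque, it can help to run it on a tiny, made-up dataset. The sketch below assumes the same column names as `df`, but the titles, ids, and proportions are invented for illustration, and `toy_dodgy` and `has_non_english` are hypothetical names that don't appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up data in the same shape as the language detection results
toy = pd.DataFrame(
    [
        {"title": "Paper A", "id": "1", "language": "en", "proportion": 1.0},
        {"title": "Paper B", "id": "2", "language": "it", "proportion": 0.9},
        {"title": "Paper B", "id": "2", "language": "en", "proportion": 0.1},
        {"title": "Paper C", "id": "3", "language": "el", "proportion": 1.0},
        {"title": "Paper D", "id": "4", "language": "en", "proportion": 0.9},
        {"title": "Paper D", "id": "4", "language": "mt", "proportion": 0.1},
    ]
)
toy_dodgy = ["4"]  # pretend this title has bad OCR


def has_non_english(group):
    # Keep a title if it has more than one language,
    # or if its single language is something other than English
    return len(group) > 1 or group["language"].iloc[0] != "en"


toy_filtered = (
    toy.loc[(~toy["id"].isin(toy_dodgy)) & (toy["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(has_non_english)
)
kept_titles = sorted(toy_filtered["title"].unique())
print(kept_titles)
```

Paper A drops out because it's English only, and Paper D drops out because its id is in the dodgy list, so only Paper B and Paper C should remain.
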
Let's list them.

In [19]:

for (title, title_id), group in papers:
    print(title)

A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923)
Hellenic Echo (Perth, WA : 1967 - 1968)
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957)
Il Giornale Italiano (Sydney, NSW : 1932 - 1940)
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954)
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940)
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935)
Italo-Australian (Sydney, NSW : 1927 - 1940)
Japanese Perth Times (Subiaco, WA : 1989 - 1996)
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885)
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909)
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)
Le Courrier Australien (Sydney, NSW : 1892 - 2011)
Mediterranean Voice (Perth, WA : 1971 - 1972)
Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956)
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954)
Nasza droga (Adelaide, SA : 1952 - 1954)
Norden (Melbourne, Vic. : 1914 - 1918)
Oceania (Sydney, NSW : 1913 - 1915)
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874)
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932)
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851)
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959)
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874)
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999)
The Chinese Advertiser (Ballarat, Vic. : 1856)
The English and Chinese Advertiser (Vic. : 1856 - 1858)
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935)
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957)
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954)
Tung Wah News (Sydney, NSW : 1898 - 1902)
Tung Wah Times (Sydney, NSW : 1901 - 1936)
Uniamoci (Sydney, NSW : 1903 - 1904)
Vesnik (Perth, WA : 1975 - 1994)
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the list of all 52 newspapers (also as a Gist).

In [20]:

with open(Path("non-english-newspapers.md"), "w") as md_file:
    i = 1
    for n, l in papers:
        md_file.write(
            f"\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n"
        )
        md_file.write("| Language | Language code | Proportion of sample |\n")
        md_file.write("|---|---|---|\n")
        for row in (
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] >= 0.05)]
            .sort_values(by="proportion", ascending=False)
            .itertuples()
        ):
            md_file.write(
                f"| {row.language_full} | {row.language} | {row.proportion} |\n"
            )
        i += 1
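
To get a feel for what ends up in the file, here's a minimal sketch that builds a single entry as a string rather than writing it to disk. The `entry_markdown` function is a hypothetical helper that mirrors the cell above; the Uniamoci values come from the results listed earlier.

```python
def entry_markdown(number, title, title_id, rows):
    """Build the Markdown for one newspaper (hypothetical helper)."""
    lines = [
        # Heading links to the newspaper's title page in Trove
        f"\n### {number}. [{title}](http://nla.gov.au/nla.news-title{title_id})\n",
        "| Language | Language code | Proportion of sample |",
        "|---|---|---|",
    ]
    # One table row per detected language
    for language_full, language, proportion in rows:
        lines.append(f"| {language_full} | {language} | {proportion} |")
    return "\n".join(lines) + "\n"


entry = entry_markdown(
    50, "Uniamoci (Sydney, NSW : 1903 - 1904)", "1599", [("Italian", "it", 1.0)]
)
print(entry)
```
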

If you look at the Markdown file you'll see that there are still some dodgy results. For example, 16% of the Chinese Advertiser is detected as 'Scottish Gaelic'. But the point of this exercise was to find non-English newspapers, rather than to accurately measure the proportion of non-English content, so I think we can live with it for now.
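
If you did want to chase up the remaining oddities, one possible starting point is to list every secondary language that isn't English, since those are the detections most likely to be OCR noise. This is only a rough sketch using made-up data: the `toy` dataframe, the `rank` column, and `suspects` are all assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Made-up results for two imaginary titles
toy = pd.DataFrame(
    [
        {"title": "Paper X", "language_full": "Chinese", "language": "zh", "proportion": 0.80},
        {"title": "Paper X", "language_full": "Scottish Gaelic", "language": "gd", "proportion": 0.16},
        {"title": "Paper Y", "language_full": "Italian", "language": "it", "proportion": 0.70},
        {"title": "Paper Y", "language_full": "English", "language": "en", "proportion": 0.30},
    ]
)

# Rank the languages within each title, largest proportion first
toy["rank"] = toy.groupby("title")["proportion"].rank(ascending=False, method="first")

# Secondary languages other than English are worth checking by hand
suspects = toy.loc[(toy["rank"] > 1) & (toy["language"] != "en")]
print(suspects[["title", "language_full", "proportion"]])
```

A secondary language isn't necessarily wrong, of course, as plenty of these newspapers really are multilingual. The list just narrows down where to look.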

Created by Tim Sherratt for the GLAM Workbench.