Finding non-English newspapers in Trove

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

How not to do it...

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using format:Periodical/Newspaper in the books and libraries category (or the article API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the sort of results you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.
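A rough sketch of what that first query could look like is below. The zone and `l-format` values come from the snippets further down. Sending "trove.nla.gov.au" as a quoted `q` value is an assumption about how the web search translates to the API, and `build_digitised_newspaper_query` is a hypothetical helper name. Nothing is sent here, the code only assembles the URL.

```python
from urllib.parse import urlencode

API_URL = 'https://api.trove.nla.gov.au/v2/result'


def build_digitised_newspaper_query(api_key):
    '''
    Build (but don't send) a search for newspaper catalogue records
    that mention trove.nla.gov.au. Hypothetical helper for illustration.
    '''
    params = {
        # Assumption: a quoted phrase search for the Trove domain
        'q': '"trove.nla.gov.au"',
        'zone': 'article',
        'l-format': 'Periodical/Newspaper',
        'encoding': 'json',
        'key': api_key,
    }
    return f'{API_URL}?{urlencode(params)}'


print(build_digitised_newspaper_query('YOUR_API_KEY'))
```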

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
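    issn = newspaper.get('issn')
    # Not every title has an ISSN, so skip those rather than searching for 'issn:None'
    if not issn:
        print('No ISSN')
        continue
    params['q'] = f'issn:{issn}'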
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...

How I actually did it

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found pycld3 which installed with pip, and just worked.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}

Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, counted the articles per language, and divided each count by the number of articles in that newspaper's sample that had a reliable prediction. The resulting proportion doesn't depend on sample size, so I could compare it across newspapers to find the non-English titles.
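As a minimal sketch of that aggregation step, here is the counting and dividing on its own, using only the standard library. The function name `language_proportions` and the sample codes are made up for illustration; the full harvesting loop appears later in the notebook.

```python
from collections import Counter


def language_proportions(langs):
    '''
    Given a list of detected language codes (one per article),
    return the share of articles in each language.
    '''
    if not langs:
        return {}
    total = len(langs)
    return {lang: count / total for lang, count in Counter(langs).items()}


# A short-run newspaper that only returned four usable articles
print(language_proportions(['de', 'de', 'de', 'en']))
# → {'de': 0.75, 'en': 0.25}
```

Because the result is a proportion rather than a raw count, a title with 4 articles and a title with 100 can be compared directly, although small samples will obviously be noisier.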

In general this worked pretty well, and the result was a list of 48 newspapers (also as a Gist) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

Problems / limitations

  • It's no surprise that the results of the language detection are affected by the quality of the OCR.
  • In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content.
  • I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed.
  • I'm just taking the first 100 results from a blank search in each newspaper. Larger, or more randomised samples might produce different results.
  • Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.

Import what we need

In [1]:
import requests
import time
import requests_cache
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from collections import Counter
import re
from langdetect import detect
from tqdm.auto import tqdm
import pandas as pd
import cld3
import pycountry
from language_tags import tags
import altair as alt
from pathlib import Path

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [2]:
TROVE_API_KEY = '[YOUR API KEY]'

Harvest the data and run language detection on articles

In [3]:
def get_newspapers():
    '''
    Get a list of newspapers in Trove.
    '''
    response = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params={'encoding': 'json', 'key': TROVE_API_KEY})
    data = response.json()
    return data['response']['records']['newspaper']
In [4]:
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    #'l-category': 'Article',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params['l-title'] = newspaper['id']
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    n = data['response']['zone'][0]['records']['n']
    try:
        articles = data['response']['zone'][0]['records']['article']
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if 'articleText' in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article['articleText']
                text = re.sub(r'<[^<]+?>', '', text)
                text = re.sub(r'\s\s+', ' ', text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
        # Find the count of each language detected in the sample of articles
        for lang, count in Counter(langs).items():
            # Calculate the language count as a proportion of the articles with a reliable prediction
            prop = count / len(langs)
            newspaper_langs.append({'id': newspaper['id'], 'title': newspaper['title'], 'language': lang, 'proportion': prop, 'number': n})
    if not response.from_cache:
        time.sleep(0.2)
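
The clean-up step inside the loop above can be pulled out into a small helper, which makes it easier to check on its own. This is a sketch rather than part of the original harvest: the name `clean_article_text` is hypothetical, the sample markup is invented, and the final `strip()` is an addition that the loop doesn't perform.

```python
import re


def clean_article_text(text):
    '''
    Remove HTML tags and collapse runs of whitespace in OCRd article text.
    '''
    # Drop anything that looks like a tag, eg <p> or </span>
    text = re.sub(r'<[^<]+?>', '', text)
    # Replace two or more whitespace characters with a single space
    text = re.sub(r'\s\s+', ' ', text)
    # Addition for this sketch only: trim leading and trailing whitespace
    return text.strip()


sample = '<p><span>  Adelaider   Deutsche </span></p> <p><span>Zeitung</span></p>'
print(clean_article_text(sample))
# → Adelaider Deutsche Zeitung
```

Note that tags are replaced with nothing, so words separated only by a tag (with no whitespace either side) would be joined together.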
            

Convert the results into a dataframe.

In [5]:
df = pd.DataFrame(newspaper_langs)
df.head()
Out[5]:
   id   title                                              language  proportion  number
0  166  Canberra Community News (ACT : 1925 - 1927)        en        1.0         100
1  165  Canberra Illustrated: A Quarterly Magazine (AC...  en        1.0         29
2  69   Federal Capital Pioneer (Canberra, ACT : 1924 ...  en        1.0         100
3  871  Good Neighbour (ACT : 1950 - 1969)                 en        1.0         100
4  665  Student Notes/Canberra University College Stud...  en        1.0         100

Add full language names

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the language-tags package.

In [6]:
def get_full_language(lc):
    '''
    Get full language names from codes
    '''
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc

df['language_full'] = df['language'].apply(get_full_language)

Filtering the results

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were ten newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.

In [7]:
df['language_full'].value_counts()
Out[7]:
English                  1608
Maltese                   285
Catalan                    52
Welsh                      36
Japanese                   32
Italian                    31
Norwegian                  24
Somali                     24
Danish                     18
German                     17
Portuguese                 10
Igbo                       10
French                     10
Samoan                     10
Chinese                     8
Estonian                    8
Luxembourgish               8
Hawaiian                    8
Scottish Gaelic             8
Western Frisian             7
Vietnamese                  7
Corsican                    6
Russian                     6
Modern Greek (1453-)        5
Filipino                    5
Swedish                     5
Bulgarian                   4
Afrikaans                   4
Polish                      4
Indonesian                  4
Javanese                    4
Hindi                       4
Malagasy                    4
Haitian                     3
Latin                       3
Malay (macrolanguage)       3
Dutch                       3
Cebuano                     2
Kurdish                     2
Shona                       2
Hebrew                      2
Bosnian                     2
Ukrainian                   2
Spanish                     2
Yiddish                     2
Irish                       2
Albanian                    2
Maori                       1
Turkish                     1
Slovak                      1
Zulu                        1
Marathi                     1
Galician                    1
Czech                       1
Croatian                    1
Macedonian                  1
Lithuanian                  1
Slovenian                   1
Name: language_full, dtype: int64

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [8]:
df.loc[df['proportion'] == 1]['language_full'].value_counts()
Out[8]:
English                 1144
German                     3
Italian                    3
Modern Greek (1453-)       1
Estonian                   1
Portuguese                 1
Name: language_full, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [9]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)
Out[9]:

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [10]:
alt.Chart(df.loc[df['proportion'] < 0.1]).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)
Out[10]:

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 50 newspapers published articles in Maltese?

In [11]:
df.loc[df['proportion'] >= 0.05]['language_full'].value_counts()
Out[11]:
English                  1601
Maltese                    50
Italian                    14
German                      9
Chinese                     8
Catalan                     6
Somali                      5
Modern Greek (1453-)        4
French                      3
Polish                      3
Japanese                    3
Portuguese                  3
Western Frisian             2
Yiddish                     2
Dutch                       2
Malay (macrolanguage)       1
Indonesian                  1
Bosnian                     1
Russian                     1
Estonian                    1
Ukrainian                   1
Lithuanian                  1
Danish                      1
Spanish                     1
Macedonian                  1
Corsican                    1
Welsh                       1
Bulgarian                   1
Vietnamese                  1
Scottish Gaelic             1
Samoan                      1
Name: language_full, dtype: int64

If we focus in on the newspapers that supposedly have a significant proportion of articles in Maltese, we see some very strange results. I seriously doubt that 80% of the Mildura Irrigationist from 1892-3 is in Maltese. So what's going on?

In [12]:
df.loc[(df['proportion'] > 0.1) & (df['language_full'] == 'Maltese')]
Out[12]:
      id    title                                              language  proportion  number  language_full
222   1596  L'Italo-Australiano = The Italo-Australian (Su...  mt        0.218750    100     Maltese
314   623   Sunday News (Sydney, NSW : 1919)                   mt        0.219178    100     Maltese
414   224   The Castlereagh (Gilgandra, NSW : 1905 - 1907)     mt        0.105882    100     Maltese
582   500   The Richmond River Express and Casino Kyogle A...  mt        0.168675    100     Maltese
652   452   The Sydney Wool and Stock Journal (NSW : 1899 ...  mt        0.233766    100     Maltese
725   394   Twofold Bay and Maneroo Observer (NSW : 1860)      mt        0.139535    100     Maltese
734   810   Upper Hunter Courier (Murrurundi, NSW : 1871)      mt        0.142857    14      Maltese
857   1207  The Coolangatta Chronicle (Qld. : 1926)            mt        0.130435    26      Maltese
907   892   Warwick Daily News (Qld. : 1919 -1954)             mt        0.139241    100     Maltese
1052  34    The Advertiser (Adelaide, SA : 1889 - 1931)        mt        0.486111    100     Maltese
1539  384   North Melbourne Gazette (Vic. : 1894 - 1901)       mt        0.146341    100     Maltese
1603  318   Sandringham Southern Cross (Vic. : 1914 - 1918)    mt        0.312500    100     Maltese
1645  13    The Argus (Melbourne, Vic. : 1848 - 1957)          mt        0.629630    100     Maltese
1754  1583  The Mildura Irrigationist (Vic. : 1892 - 1893)     mt        0.795455    100     Maltese
1758  1581  The Mildura Irrigationist and Murray River Agr...  mt        0.739130    100     Maltese
1760  1582  The Mildura Irrigationist and Murray River Cul...  mt        0.333333    100     Maltese
2027  1543  Murchison Times and Cue-Big Bell-Reedy Advocat...  mt        0.137500    100     Maltese
2119  1617  The Derby News (WA : 1887)                         mt        0.750000    5       Maltese

If you look at results for the Mildura Irrigationist in Trove you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lHa ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiwa afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^— attor aakwt mm rvfimMiMh ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa tm enr a Mtcfc tto watrr tto wiaaal m a a day pfaMat. aa4 (h ilj amintir ilm tTtsjtvL.f**' ""j •fria—lhati tow ««4M k." tlml t | r 4m» wtn .aa rUa I h ha«« t ctoantaf InMM aMtoclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af t«l. i pwwiaf Mtan (tot jw. twy MwUI «a1 a«ry ftajr «ndl tar tlw aad annaH a«r aarf a««r aaria. tiaa

What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is 96% sure that it's Maltese! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [13]:
ocr = '''ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa'''
cld3.get_language(ocr)
Out[13]:
LanguagePrediction(language='mt', probability=0.960280179977417, is_reliable=True, proportion=1.0)

Of course there might actually be newspapers with articles in Maltese, so we don't want to filter them all out. So let's do some manual inspection of the newspapers that seem to have non-English content. First we'll filter our results to include only languages with proportions of at least 0.05, and then drop out newspapers that seem to be only in English. We end up with 111 different titles.

In [14]:
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = df.loc[df['proportion'] >= 0.05].groupby(by=['title', 'id']).filter(lambda x: (len(x) > 1) or (len(x)== 1 and x['language'] != 'en'))
papers = filtered.groupby(by=['title', 'id'])
len(papers)
Out[14]:
111
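
The groupby filter above is fairly dense, so here is a plain-Python restatement of the same logic as a sketch. It assumes each newspaper has at most one row per language (which is how the results were built), so "more than one row, or a single non-English row" reduces to "any non-English row". The function name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def titles_with_non_english(rows, threshold=0.05):
    '''
    Return the titles that have at least one non-English language
    at or above the proportion threshold.
    '''
    langs_by_title = defaultdict(list)
    for row in rows:
        # Ignore languages below the threshold (likely false positives)
        if row['proportion'] >= threshold:
            langs_by_title[row['title']].append(row['language'])
    return sorted(
        title for title, langs in langs_by_title.items()
        if any(lang != 'en' for lang in langs)
    )


# Invented sample rows, not real harvest results
rows = [
    {'title': 'Paper A', 'language': 'en', 'proportion': 1.0},
    {'title': 'Paper B', 'language': 'de', 'proportion': 0.83},
    {'title': 'Paper B', 'language': 'en', 'proportion': 0.17},
    {'title': 'Paper C', 'language': 'en', 'proportion': 0.97},
    {'title': 'Paper C', 'language': 'mt', 'proportion': 0.03},
]
print(titles_with_non_english(rows))
# → ['Paper B']
```

Paper C drops out because its only non-English language falls below the threshold, leaving it looking like an English-only title.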

Let's list those 111 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [15]:
for name, group in papers:
    # Only show newspapers with at least one non-English language at or above the threshold
    if not group.loc[(group['language'] != 'en') & (group['proportion'] >= 0.05)].empty:
        print(f'\n{name[0]} ({name[1]})')
        display(group[['language_full', 'language', 'proportion']].loc[group['proportion'] > 0.05].sort_values(by='proportion', ascending=False))
A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)
language_full language proportion
9 Portuguese pt 1.0
Adelaide Chronicle and South Australian Literary Record (SA : 1840 - 1842) (986)
language_full language proportion
917 English en 0.929293
916 Catalan ca 0.070707
Adelaide Independent and Cabinet of Amusement (SA : 1841) (1336)
language_full language proportion
918 English en 0.928571
920 Catalan ca 0.061224
Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)
language_full language proportion
927 German de 1.0
Auburn and District News (NSW : 1929) (1320)
language_full language proportion
40 English en 0.947368
41 Vietnamese vi 0.052632
Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)
language_full language proportion
931 German de 1.0
Bangkok Recorder (Thailand : 1865 - 1867) (1488)
language_full language proportion
10 English en 0.925532
11 Maltese mt 0.053191
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)
language_full language proportion
14 Malay (macrolanguage) ms 0.891304
15 Indonesian id 0.108696
Bulong Bulletin and Mining Register (WA : 1897 - 1898) (1400)
language_full language proportion
1852 English en 0.913043
1853 Maltese mt 0.086957
Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)
language_full language proportion
83 Chinese zh 0.945652
Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)
language_full language proportion
1330 Chinese zh 0.843373
Chronicle and North Coast Advertiser (Qld. : 1903 - 1922) (286)
language_full language proportion
780 English en 0.93617
781 Maltese mt 0.06383
Chung Wah News (Perth, WA : 1981 - 1987) (1383)
language_full language proportion
1870 English en 0.637363
1869 Chinese zh 0.263736
Colac Reformer (Vic. : 1914 - 1918) (763)
language_full language proportion
1350 English en 0.947917
1351 Maltese mt 0.052083
Daily Post (Hobart, Tas. : 1908 - 1918) (860)
language_full language proportion
1141 English en 0.704545
1140 Japanese ja 0.125000
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)
language_full language proportion
1895 German de 0.83
1896 English en 0.17
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)
language_full language proportion
130 German de 1.0
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)
language_full language proportion
945 German de 0.9
944 English en 0.1
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)
language_full language proportion
131 German de 0.729167
132 English en 0.270833
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)
language_full language proportion
946 German de 0.989691
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)
language_full language proportion
136 Dutch nl 0.882979
137 English en 0.106383
Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)
language_full language proportion
139 Dutch nl 0.924731
140 English en 0.053763
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)
language_full language proportion
1901 Polish pl 0.91
1902 English en 0.09
Eco Italiano (Perth, WA : 1958 - 1959) (1387)
language_full language proportion
1903 Italian it 1.0
Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899) (116)
language_full language proportion
1157 English en 0.929412
1158 Maltese mt 0.070588
Evelyn Observer, and South and East Bourke Record (Vic. : 1882 - 1902) (145)
language_full language proportion
1384 English en 0.913978
1383 Maltese mt 0.075269
Geelong Advertiser (Vic. : 1840 - 1845) (292)
language_full language proportion
1405 English en 0.904255
1404 Samoan sm 0.074468
Geraldton Advocate and Johnstone River Guardian (Qld. : 1895 - 1896) (1103)
language_full language proportion
789 English en 0.910112
790 Maltese mt 0.089888
Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)
language_full language proportion
1914 English en 0.661538
1918 Maltese mt 0.076923
1915 Japanese ja 0.061538
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)
language_full language proportion
174 Chinese zh 0.803030
177 Western Frisian fy 0.075758
Hamilton Spectator and Grange District Advertiser (Vic. : 1860 - 1870) (927)
language_full language proportion
1436 English en 0.921348
1435 Maltese mt 0.078652
Healesville Guardian (Vic. : 1893 - 1898) (140)
language_full language proportion
1441 English en 0.938144
1442 Maltese mt 0.051546
Hellenic Echo (Perth, WA : 1967 - 1968) (1389)
language_full language proportion
1956 Modern Greek (1453-) el 1.0
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)
language_full language proportion
1958 Italian it 0.97
Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)
language_full language proportion
190 Italian it 0.92
191 English en 0.08
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)
language_full language proportion
192 Italian it 0.777778
193 English en 0.222222
Inglewood Advertiser (Vic. : 1914 - 1918) (570)
language_full language proportion
1461 English en 0.936842
1462 Maltese mt 0.063158
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)
language_full language proportion
203 English en 0.840426
204 Italian it 0.159574
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)
language_full language proportion
205 English en 0.903226
206 Italian it 0.096774
Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)
language_full language proportion
207 Italian it 0.909091
208 English en 0.090909
Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)
language_full language proportion
1963 Japanese ja 0.93617
1964 English en 0.06383
Katoomba Times (NSW : 1889 - 1894) (906)
language_full language proportion
211 English en 0.934066
213 Maltese mt 0.054945
Kyabram Union (Vic. : 1886 - 1894) (196)
language_full language proportion
1482 English en 0.921348
1483 Maltese mt 0.056180
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)
language_full language proportion
221 Italian it 0.68750
222 Maltese mt 0.21875
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)
language_full language proportion
227 Italian it 0.95
La Rondine (Perth, WA : 1969 - 1994) (1388)
language_full language proportion
1981 Italian it 0.928571
1982 English en 0.071429
Laura Standard and Crystal Brook Courier (SA : 1917 - 1948) (926)
language_full language proportion
963 English en 0.931034
964 Maltese mt 0.068966
Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)
language_full language proportion
232 French fr 0.816327
233 English en 0.173469
Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)
language_full language proportion
2000 Modern Greek (1453-) el 0.375000
1994 English en 0.281250
2001 Portuguese pt 0.104167
1995 French fr 0.062500
1993 Spanish es 0.052083
Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956) (280)
language_full language proportion
242 Estonian et 1.0
Murchison Times and Cue-Big Bell-Reedy Advocate (WA : 1937 - 1942) (1543)
language_full language proportion
2026 English en 0.8250
2027 Maltese mt 0.1375
Mu̇sų Pastogė = Our Haven (Sydney, NSW : 1950 - 1954) (1594)
language_full language proportion
254 Lithuanian lt 0.95
Narandera Argus and Riverina Advertiser (NSW : 1893 - 1953) (431)
language_full language proportion
258 English en 0.940476
259 Maltese mt 0.059524
Narromine News and Trangie Advocate (NSW : 1898 - 1955) (430)
language_full language proportion
260 English en 0.946809
261 Maltese mt 0.053191
Nasza droga (Adelaide, SA : 1952 - 1954) (1323)
language_full language proportion
970 Polish pl 0.9
971 English en 0.1
Norden (Melbourne, Vic. : 1914 - 1918) (797)
language_full language proportion
1531 English en 0.467391
1530 Danish da 0.413043
1532 Maltese mt 0.065217
North Melbourne Gazette (Vic. : 1894 - 1901) (384)
language_full language proportion
1538 English en 0.829268
1539 Maltese mt 0.146341
Oceania (Sydney, NSW : 1913 - 1915) (1598)
language_full language proportion
274 English en 0.574468
273 Italian it 0.425532
Referee (Sydney, NSW : 1886 - 1939) (499)
language_full language proportion
288 English en 0.924242
289 Maltese mt 0.075758
Reporter and Illawarra Journal (Kiama, NSW : 1887 - 1894) (389)
language_full language proportion
290 English en 0.891566
292 Maltese mt 0.084337
Revue Australienne : Journal des Interets Francais en Australie, Nouvelle Caledonie, Nouvelle Zelande, Fiji, Tahiti, Polynesie = (1604)
language_full language proportion
294 French fr 0.99
Ringwood and Croydon Chronicle (Vic. : 1914 - 1918) (329)
language_full language proportion
1591 English en 0.93617
1592 Maltese mt 0.06383
Rockhampton Bulletin and Central Queensland Advertiser (Qld. : 1861 - 1871) (92)
language_full language proportion
837 English en 0.946237
838 Maltese mt 0.053763
Sandringham Southern Cross (Vic. : 1914 - 1918) (318)
language_full language proportion
1602 English en 0.6500
1603 Maltese mt 0.3125
Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)
language_full language proportion
1610 Polish pl 0.4
1608 Western Frisian fy 0.2
1609 Bosnian bs 0.2
1611 Russian ru-Latn 0.2
Southern Australian (Adelaide, SA : 1838 - 1844) (171)
language_full language proportion
1036 English en 0.904255
1035 Catalan ca 0.074468
Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)
language_full language proportion
310 English en 0.909091
312 Maltese mt 0.077922
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)
language_full language proportion
2069 Italian it 0.97
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)
language_full language proportion
1046 German de 0.888889
1047 English en 0.111111
Sunday News (Sydney, NSW : 1919) (623)
language_full language proportion
315 English en 0.739726
314 Maltese mt 0.219178
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)
language_full language proportion
2075 Italian it 1.0
Sydney Chronicle (NSW : 1846 - 1848) (94)
language_full language proportion
319 English en 0.923077
320 Maltese mt 0.076923
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)
language_full language proportion
1044 German de 0.989691
Tasmanian Evening Herald (Launceston, Tas. : 1878) (1265)
language_full language proportion
1181 English en 0.898876
1180 Maltese mt 0.067416
The Advertiser (Adelaide, SA : 1889 - 1931) (34)
language_full language proportion
1051 English en 0.513889
1052 Maltese mt 0.486111
The Argus (Melbourne, Vic. : 1848 - 1957) (13)
language_full language proportion
1645 Maltese mt 0.629630
1646 English en 0.358025
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)
language_full language proportion
1657 English en 0.894737
1659 Yiddish yi 0.084211
The Castlereagh (Gilgandra, NSW : 1905 - 1907) (224)
language_full language proportion
413 English en 0.741176
415 Somali so 0.152941
414 Maltese mt 0.105882
The Chinese Advertiser (Ballarat, Vic. : 1856) (706)
language_full language proportion
1680 Chinese zh 0.500000
1682 English en 0.333333
1681 Scottish Gaelic gd 0.166667
The Coolangatta Chronicle (Qld. : 1926) (1207)
language_full language proportion
856 English en 0.869565
857 Maltese mt 0.130435
The Derby News (WA : 1887) (1617)
language_full language proportion
2119 Maltese mt 0.75
2120 Corsican co 0.25
The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)
language_full language proportion
1699 English en 0.894737
1700 Chinese zh 0.052632
1701 Maltese mt 0.052632
The Goldfields Observer (Kalgoorlie, WA : 1930 - 1939) (1626)
language_full language proportion
2143 English en 0.909091
2145 Maltese mt 0.051948
The Gwydir Examiner and Moree General Advertiser (NSW : 1898 - 1899) (886)
language_full language proportion
480 English en 0.910112
481 Maltese mt 0.078652
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)
language_full language proportion
1717 English en 0.783505
1718 Yiddish yi 0.195876
The Melbourne Advertiser (Vic. : 1838) (935)
language_full language proportion
1735 English en 0.666667
1736 Welsh cy 0.333333
The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)
language_full language proportion
1754 Maltese mt 0.795455
1753 English en 0.113636
1755 Somali so 0.090909
The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)
language_full language proportion
1758 Maltese mt 0.739130
1756 English en 0.130435
1757 Somali so 0.130435
The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)
language_full language proportion
1761 English en 0.523810
1760 Maltese mt 0.333333
1759 Somali so 0.126984
The Millicent Times (SA : 1891 - 1905) (970)
language_full language proportion
1074 English en 0.94898
1075 Catalan ca 0.05102
The Miner's Right (Boulder, WA : 1897) (1638)
language_full language proportion
2174 English en 0.909091
2176 Maltese mt 0.070707
The News, Shoalhaven, Broughton Creek and Ulladulla Advertiser (NSW : 1875 - 1877) (1678)
language_full language proportion
551 English en 0.913978
552 Catalan ca 0.086022
The Phillips River Times (Ravensthorpe, WA : 1908 - 1909) (1546)
language_full language proportion
2220 English en 0.9
2221 Maltese mt 0.1
The Port Phillip Patriot and Morning Advertiser (Vic. : 1845 - 1848) (937)
language_full language proportion
1768 English en 0.894737
1767 Maltese mt 0.084211
The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)
language_full language proportion
584 English en 0.734940
582 Maltese mt 0.168675
583 Somali so 0.072289
The Sydney Wool and Stock Journal (NSW : 1899 - 1917) (452)
language_full language proportion
654 English en 0.727273
652 Maltese mt 0.233766
The Tasmanian (Launceston, Tas. : 1871 - 1879) (946)
language_full language proportion
1243 English en 0.917808
1244 Maltese mt 0.082192
The Teetotaller and General Newspaper (Sydney, NSW : 1842 - 1843) (1036)
language_full language proportion
657 English en 0.95
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)
language_full language proportion
2262 Modern Greek (1453-) el 0.97
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)
language_full language proportion
705 Modern Greek (1453-) el 0.989362
Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)
language_full language proportion
712 Chinese zh 0.926316
Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)
language_full language proportion
719 Chinese zh 0.968085
Twofold Bay Telegraph (NSW : 1860) (479)
language_full language proportion
730 English en 0.945652
731 Maltese mt 0.054348
Twofold Bay and Maneroo Observer (NSW : 1860) (394)
language_full language proportion
724 English en 0.825581
725 Maltese mt 0.139535
Uniamoci (Sydney, NSW : 1903 - 1904) (1599)
language_full language proportion
732 Italian it 1.0
Upper Hunter Courier (Murrurundi, NSW : 1871) (810)
language_full language proportion
733 English en 0.857143
734 Maltese mt 0.142857
Vesnik (Perth, WA : 1975 - 1994) (1382)
language_full language proportion
2297 Macedonian mk 0.410526
2296 English en 0.357895
2298 Bulgarian bg-Latn 0.221053
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)
language_full language proportion
735 Ukrainian uk 0.82
736 English en 0.18
Warwick Daily News (Qld. : 1919 -1954) (892)
language_full language proportion
906 English en 0.835443
907 Maltese mt 0.139241
Williamstown Trade Circular (Vic. : 1855 - 1856) (213)
language_full language proportion
1831 English en 0.875
1832 Portuguese pt 0.125

I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.

In [21]:
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = ['1036', '1043', '1103', '116', '1207', '1265', '13', '1320', '1336', '140', '1400', '145', '1488', '1543', '1546', '1581', '1582', '1583', '1617', '1623', '1626', '1638', '1675', '1678', '171', '196', '213', '224', '286', '292', '318', '329', '34', '384', '389', '394', '418', '430', '431', '452', '479', '499', '500', '570', '623', '763', '810', '860', '886', '892', '906', '92', '926', '927', '935', '937', '94', '946', '970', '986']
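
The list above was compiled by hand. As a rough aid for that sort of review, you could shortlist titles where English dominates but the sample also registers a language that seems unlikely for the newspaper. Here's a minimal sketch, assuming a DataFrame with the same `id`, `language`, and `proportion` columns as `df`. The `shortlist_suspects` function, the set of suspect language codes, and the 0.7 threshold are all illustrative guesses, not values used in this notebook.

```python
import pandas as pd

def shortlist_suspects(df, suspect_langs=frozenset({"mt", "so", "ca", "co", "cy"}), min_english=0.7):
    """Return ids of titles that are mostly English but also register a 'suspect' language.

    Hypothetical helper: the language codes and threshold are assumptions for illustration.
    """
    suspects = []
    for title_id, group in df.groupby("id"):
        # Map each detected language code to its share of the sample
        props = group.set_index("language")["proportion"]
        mostly_english = props.get("en", 0) >= min_english
        has_suspect = any(lang in suspect_langs for lang in props.index)
        if mostly_english and has_suspect:
            suspects.append(title_id)
    return suspects

# A few rows copied from the results above
sample = pd.DataFrame([
    {"id": "1207", "language": "en", "proportion": 0.869565},
    {"id": "1207", "language": "mt", "proportion": 0.130435},
    {"id": "1599", "language": "it", "proportion": 1.0},
    {"id": "1593", "language": "uk", "proportion": 0.82},
    {"id": "1593", "language": "en", "proportion": 0.18},
])
suspect_ids = shortlist_suspects(sample)
print(suspect_ids)
```

A heuristic like this would miss titles such as The Derby News, where the OCR is so poor that English isn't detected at all, so it could only ever produce candidates for manual checking.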

Here we'll add the dodgy title ids into our filter. It seems that we have 51 newspapers with significant amounts of non-English content.

In [22]:
# Drop the dodgy titles and any language below 5% of a title's sample,
# then remove titles whose only remaining language is English
filtered = (
    df.loc[(~df['id'].isin(dodgy)) & (df['proportion'] >= 0.05)]
    .groupby(by=['title', 'id'])
    .filter(lambda x: len(x) > 1 or x['language'].iloc[0] != 'en')
)
papers = filtered.groupby(by=['title', 'id'])
len(papers)
Out[22]:
51
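
To see what the filter keeps and what it drops, here's a small worked example on made-up rows. The ids and titles are invented for illustration; only the column names match the DataFrame used above. It pulls out the single language value with `.iloc[0]` so that the filter function returns a plain boolean.

```python
import pandas as pd

toy = pd.DataFrame([
    # Second language falls below the 5% threshold, leaving English only: dropped
    {"id": "a", "title": "Mostly English Gazette", "language": "en", "proportion": 0.97},
    {"id": "a", "title": "Mostly English Gazette", "language": "mt", "proportion": 0.03},
    # A single non-English language: kept
    {"id": "b", "title": "Giornale di Prova", "language": "it", "proportion": 1.0},
    # Two languages above the threshold: kept
    {"id": "c", "title": "Bilingual Herald", "language": "en", "proportion": 0.6},
    {"id": "c", "title": "Bilingual Herald", "language": "zh", "proportion": 0.4},
    # Listed as dodgy: dropped before grouping
    {"id": "d", "title": "Noisy OCR Times", "language": "en", "proportion": 0.85},
    {"id": "d", "title": "Noisy OCR Times", "language": "mt", "proportion": 0.15},
])
dodgy_ids = ["d"]

# Same two steps as the notebook cell: threshold first, then filter by group
above_threshold = toy.loc[(~toy["id"].isin(dodgy_ids)) & (toy["proportion"] >= 0.05)]
kept = above_threshold.groupby(["title", "id"]).filter(
    lambda g: len(g) > 1 or g["language"].iloc[0] != "en"
)
kept_ids = sorted(kept["id"].unique())
print(kept_ids)
```

Note that the order matters: because the 5% threshold is applied before grouping, a title whose second language is just noise ends up as an English-only group and is removed.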

Let's list them.

In [23]:
for n, l in papers:
    print(n[0])
A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923)
Hellenic Echo (Perth, WA : 1967 - 1968)
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957)
Il Giornale Italiano (Sydney, NSW : 1932 - 1940)
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954)
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940)
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935)
Italo-Australian (Sydney, NSW : 1927 - 1940)
Japanese Perth Times (Subiaco, WA : 1989 - 1996)
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885)
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909)
La Rondine (Perth, WA : 1969 - 1994)
Le Courrier Australien (Sydney, NSW : 1892 - 2011)
Mediterranean Voice (Perth, WA : 1971 - 1972)
Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956)
Mu̇sų Pastogė = Our Haven (Sydney, NSW : 1950 - 1954)
Nasza droga (Adelaide, SA : 1952 - 1954)
Norden (Melbourne, Vic. : 1914 - 1918)
Oceania (Sydney, NSW : 1913 - 1915)
Revue Australienne : Journal des Interets Francais en Australie, Nouvelle Caledonie, Nouvelle Zelande, Fiji, Tahiti, Polynesie =
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932)
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851)
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959)
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874)
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999)
The Chinese Advertiser (Ballarat, Vic. : 1856)
The English and Chinese Advertiser (Vic. : 1856 - 1858)
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935)
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957)
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954)
Tung Wah News (Sydney, NSW : 1898 - 1902)
Tung Wah Times (Sydney, NSW : 1901 - 1936)
Uniamoci (Sydney, NSW : 1903 - 1904)
Vesnik (Perth, WA : 1975 - 1994)
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the list of all 51 newspapers (also as a Gist).

In [24]:
# Titles include non-ASCII characters, so set the file encoding explicitly
with open(Path('non-english-newspapers.md'), 'w', encoding='utf-8') as md_file:
    for i, (n, l) in enumerate(papers, start=1):
        # n is the (title, id) pair for the group, l is the group's rows
        md_file.write(f'\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n')
        md_file.write('| Language | Language code | Proportion of sample |\n')
        md_file.write('|---|---|---|\n')
        # List each language from most to least common
        rows = l[['language_full', 'language', 'proportion']].loc[(l['proportion'] > 0.05)].sort_values(by='proportion', ascending=False)
        for row in rows.itertuples():
            md_file.write(f'| {row.language_full} | {row.language} | {row.proportion} |\n')
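
The proportions are written out with full float precision (for example, 0.869565). If you'd prefer tidier tables, one option is to build each section as a string and format the proportions as percentages. Here's a minimal sketch; `title_section` is a hypothetical helper, not part of this notebook, and the link pattern is simply copied from the cell above.

```python
import pandas as pd

def title_section(number, title, title_id, group, threshold=0.05):
    """Build the Markdown for one newspaper: a linked heading plus a language table.

    Hypothetical helper for illustration; expects the columns
    language_full, language, and proportion.
    """
    lines = [
        f"### {number}. [{title}](http://nla.gov.au/nla.news-title{title_id})",
        "",
        "| Language | Language code | Proportion of sample |",
        "|---|---|---|",
    ]
    # Keep languages at or above the threshold, most common first
    rows = group.loc[group["proportion"] >= threshold].sort_values(by="proportion", ascending=False)
    for row in rows.itertuples():
        # .1% renders 0.82 as 82.0%
        lines.append(f"| {row.language_full} | {row.language} | {row.proportion:.1%} |")
    return "\n".join(lines) + "\n"

# Values copied from the Vil'na Dumka results above
sample = pd.DataFrame({
    "language_full": ["English", "Ukrainian"],
    "language": ["en", "uk"],
    "proportion": [0.18, 0.82],
})
section = title_section(1, "Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)", "1593", sample)
print(section)
```

Returning a string rather than writing straight to the file also makes it easy to check a single section before generating the whole list.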

If you look at the Markdown file you'll see that there are still some dodgy results. For example, about 17% of the sample from the Chinese Advertiser is detected as 'Scottish Gaelic'. But the point of this exercise was to find non-English newspapers, rather than accurately measure the proportion of non-English content, so I think we can live with it for now.


Created by Tim Sherratt for the GLAM Workbench.