Exploring Glottolog with Python

Sebastian Bank ([email protected]) http://www.uni-leipzig.de/~sbank/

The latest version of this IPython Notebook is available at http://gist.github.com/xflr6/9050337.

Glottolog provides its comprehensive catalog of the world's languages, language families and dialects for download in linked data format.

In this notebook, I will process this data set using the following tools: Python's standard library (urllib, gzip, re, collections, and sqlite3), rdflib, pandas, and matplotlib.

If you are new to scientific Python, the Anaconda Python Distribution is probably the fastest way to get Python installed with all the commonly used scientific packages. It supports all platforms (Linux, Mac, and Windows).

If you are on Windows, there are Unofficial Windows Binaries for a lot of Python extension packages used in scientific computing.
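
A quick way to confirm that the third-party packages used below are importable (a sketch; the version numbers will of course depend on your installation):

import rdflib, pandas, matplotlib

print rdflib.__version__, pandas.__version__, matplotlib.__version__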

Getting the file

Download the RDF export file with Python's built-in urllib module (docs).

In [1]:
import urllib

URL = 'http://glottolog.org/static/download/2.7/glottolog-language.n3.gz'

filename, headers = urllib.urlretrieve(URL, URL.rpartition('/')[2])

The file contains RDF in Notation3 compressed with gzip.

In [2]:
print filename
glottolog-language.n3.gz

Display the size in megabytes.

In [3]:
size = int(headers['Content-Length'])

print size / 1024.0 ** 2
2.63014793396

A first look

Read the first few bytes from the file with gzip (docs) so we can get an impression of the format.

In [4]:
import gzip

with gzip.open(filename) as fd:
    sample = fd.read(4000)

Split the sample into the namespace definitions and the actual RDF triples. They are separated by a blank line.

In [5]:
head, _, body = sample.partition('\n\n')

len(head), len(body)
Out[5]:
(932, 3066)

Inspect the start of the namespaces.

In [6]:
print head[:600] + '...'
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctype: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix gold: <http://purl.org/linguistics/gold/> .
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/> .
@prefix lexvo: <http://lexvo.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-r...

Glottolog uses well-known ontologies as well as some dedicated to linguistics, such as Lexvo and GOLD.

Display the first RDF triples.

In [7]:
print body[:600] + '...'
<http://glottolog.org/resource/languoid/id/muni1258> a dcterms:LinguisticSystem,
        gold:Language ;
    rdfs:label "Muniche"@en ;
    lexvo:iso639P3PCode "myr"^^xsd:string ;
    dcterms:description <http://glottolog.org/resource/reference/id/10167>,
        <http://glottolog.org/resource/reference/id/132589>,
        <http://glottolog.org/resource/reference/id/135495>,
        <http://glottolog.org/resource/reference/id/300702>,
        <http://glottolog.org/resource/reference/id/303200>,
        <http://glottolog.org/resource/reference/id/34227>,
        <http://glottolog.org/resource/re...

The entry starts with the full URI of the languoid, followed by its types, label, ISO 639-3 code and description.

Let's try to extract some meaningful information from this string using just Python's regular expressions.

Using text processing

Load the whole file uncompressed into memory.

In [8]:
with gzip.open(filename) as fd:
    data = fd.read()

Display the size in megabytes.

In [9]:
print len(data) / 1024.0 ** 2
32.698595047

Extract the glottocode from the start of all dcterms:LinguisticSystem entries with the re module (docs) and count them.

In [10]:
import re

GLOTTOCODE = '<http://glottolog.org/resource/languoid/id/(\w+)> a dcterms:LinguisticSystem'

gcodes = re.findall(GLOTTOCODE, data)

len(gcodes)
Out[10]:
24393

Display the glottocodes of the first five entries.

In [11]:
gcodes[:5]
Out[11]:
['muni1258', 'west1503', 'port1278', 'west2205', 'nilo1247']

They look unordered; sort them alphabetically and display the first and last five entries.

In [12]:
gcodes.sort()

print gcodes[:5] 
print gcodes[-5:]
['aala1237', 'aant1238', 'aari1238', 'aari1239', 'aari1240']
['zuti1239', 'zuwa1238', 'zwal1238', 'zyph1238', 'zyud1238']

Extract everything that looks like an ISO code. Count the results.

In [13]:
ISO_CODE = 'iso639P3PCode "(\w+)"'

icodes = re.findall(ISO_CODE, data)

len(icodes)
Out[13]:
7822

Display the first ten ISO codes.

In [14]:
icodes[:10]
Out[14]:
['myr', 'pko', 'oki', 'mwy', 'kqh', 'mwx', 'aam', 'spy', 'tec', 'kpz']

Sort them as well and display the start and end.

In [15]:
icodes.sort()

print icodes[:10]
print icodes[-10:]
['aaa', 'aab', 'aac', 'aad', 'aae', 'aaf', 'aag', 'aah', 'aai', 'aak']
['zun', 'zuy', 'zwa', 'zyb', 'zyg', 'zyj', 'zyn', 'zyp', 'zza', 'zzj']

Glottocodes

Glottocodes consist of a four-letter prefix followed by four digits, and some of those digit combinations apparently recur.
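
As a quick sanity check (a sketch using the gcodes list from above), we can test whether every extracted code indeed consists of four characters followed by four digits:

GCODE_FORMAT = re.compile(r'[a-z0-9]{4}\d{4}$')  # allow digits in the prefix, just in case

print all(GCODE_FORMAT.match(g) for g in gcodes)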

Display the five most common of those digit combinations and their frequencies with collections.Counter (docs).

In [16]:
import collections

collections.Counter(g[4:] for g in gcodes).most_common(5)
Out[16]:
[('1238', 3022), ('1239', 1192), ('1242', 1039), ('1241', 997), ('1237', 903)]

Show the most common initial parts.

In [17]:
collections.Counter(g[:4] for g in gcodes).most_common(5)
Out[17]:
[('nort', 563), ('sout', 560), ('nucl', 508), ('west', 461), ('east', 425)]

Loading into RDFlib

Use rdflib (docs) to load the whole graph into memory.

This will take a while and fill a couple hundred megabytes of RAM.

In [18]:
import rdflib

graph = rdflib.Graph()

with gzip.open(filename) as fd:
    graph.parse(fd, format='n3')

graph
Out[18]:
<Graph identifier=N5f0224c79a154d14bd437619ecf4e397 (<class 'rdflib.graph.Graph'>)>

Count the number of triples.

In [19]:
len(graph)
Out[19]:
670194

Using the RDF graph

Display some of the triples (subject, predicate, object).

In [20]:
import itertools

for s, p, o in itertools.islice(graph, 15):
    print s[42:], graph.qname(p), o
pwon1235 rdf:type http://purl.org/dc/terms/LinguisticSystem
guin1260 dcterms:spatial http://www.geonames.org/countries/GW/
mogu1251 dcterms:description http://glottolog.org/resource/reference/id/156942
tibe1272 dcterms:description http://glottolog.org/resource/reference/id/26288
barr1251 dcterms:isReferencedBy http://glottolog.org/valuesets/vitality-barr1251
nang1261 skos:altLabel nang1261
choc1278 dcterms:spatial North America
song1308 rdf:type http://purl.org/linguistics/gold/Dialect
pato1242 void:inDataset http://glottolog.org/
nort2855 rdf:type http://purl.org/dc/terms/LinguisticSystem
chil1280 skos:broader http://glottolog.org/resource/languoid/id/nort2940
sate1242 dcterms:title Saterfriesisch
stan1290 dcterms:description http://glottolog.org/resource/reference/id/37004
bord1246 skos:broader http://glottolog.org/resource/languoid/id/komb1273
marg1251 dcterms:description http://glottolog.org/resource/reference/id/54615

Show all available predicates.

In [21]:
for p in sorted(set(graph.predicates())):
    print graph.qname(p)
lexvo:iso639P3PCode
dcterms:description
dcterms:isReferencedBy
dcterms:isReplacedBy
dcterms:spatial
dcterms:title
void:inDataset
rdf:type
rdfs:label
owl:sameAs
geo:lat
geo:long
skos:altLabel
skos:broader
skos:broaderTransitive
skos:changeNote
skos:editorialNote
skos:narrower
skos:prefLabel
skos:scopeNote

Create shortcuts for querying glottocodes and ISO codes. Translate glottocodes into ISO codes.

In [22]:
glottocode = rdflib.Namespace('http://glottolog.org/resource/languoid/id/')
lexvo = rdflib.Namespace('http://lexvo.org/ontology#')
iso639 = lexvo.iso639P3PCode

for g in gcodes[:5]:
    i = graph.value(glottocode[g], iso639, default='___')
    print '%s -> %s,' % (g, i),
aala1237 -> ___, aant1238 -> ___, aari1238 -> ___, aari1239 -> aiw, aari1240 -> aay,

Translate ISO codes into glottocodes.

In [23]:
string = rdflib.namespace.XSD.string

for i in icodes[:5]:
    g = graph.value(None, iso639, rdflib.Literal(i, datatype=string))
    print '%s -> %s,' % (i, g[42:]),
aaa -> ghot1243, aab -> alum1246, aac -> arii1243, aad -> amal1242, aae -> arbe1236,

Retrieve the preferred label of languoids.

In [24]:
label = rdflib.namespace.RDFS.label

for g in gcodes[:5]:
    l = graph.value(glottocode[g], label)
    print '%s -> %s,' % (g, l),
aala1237 -> Aalawa, aant1238 -> Aantantara, aari1238 -> Aari-Gayil, aari1239 -> Aari, aari1240 -> Aariya,

Look up an arbitrary languoid with a given label.

In [25]:
print graph.value(None, label, rdflib.Literal('Aalawa', lang='en'))
http://glottolog.org/resource/languoid/id/aala1237

Show the predicates and objects of an individual languoid.

In [26]:
for p, o in graph[glottocode['aala1237']]:
    print graph.qname(p), o
skos:prefLabel Aalawa
rdfs:label Aalawa
skos:broaderTransitive http://glottolog.org/resource/languoid/id/aust1307
dcterms:isReferencedBy http://glottolog.org/valuesets/fc42061
dcterms:title Aalawa
dcterms:isReferencedBy http://glottolog.org/valuesets/sc42061
skos:altLabel aala1237
void:inDataset http://glottolog.org/
skos:scopeNote language
skos:broader http://glottolog.org/resource/languoid/id/ramo1244
rdf:type http://purl.org/dc/terms/LinguisticSystem
dcterms:spatial Papunesia
rdf:type http://purl.org/linguistics/gold/Dialect

Display the nodes along a languoid's path up the tree.

In [27]:
broader = rdflib.namespace.SKOS.broader

aalawa = graph.resource(glottocode['aala1237'])

print ' -> '.join(b.label() for b in aalawa.transitive_objects(broader))
Aalawa -> Ramoaaina -> Kandas-Duke of York -> Label-Bilur -> St George linkage -> New Ireland-Northwest Solomonic linkage -> Meso Melanesian linkage -> Western Oceanic linkage -> Oceanic -> Eastern Malayo-Polynesian -> Central-Eastern Malayo-Polynesian -> Malayo-Polynesian -> Nuclear Austronesian -> Austronesian

Display the nodes immediately below a languoid.

In [28]:
narrower = rdflib.namespace.SKOS.narrower

atlaco = graph.resource(glottocode['atla1278'])

print '%s <- %s' % (atlaco.label(), ', '.join(n.label() for n in atlaco.objects(narrower)))
Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel

Count all nodes below a languoid.

In [29]:
len(list(atlaco.transitive_objects(narrower)))
Out[29]:
4608

Querying with SPARQL

Retrieve rows of glottocode, ISO code, and label with RDF's query language SPARQL. Also display the annotated language of the label.

In [30]:
GIL = """
SELECT
  (substr(str(?s), 43) AS ?glottocode) ?iso ?label
WHERE
  { ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label 
    OPTIONAL { ?s lexvo:iso639P3PCode ?iso } }
ORDER BY ?s LIMIT 10"""

for g, i, l in graph.query(GIL):
    print '%s | %-4s | %-10s | %s' % (g, i, l, l.language)
aala1237 | None | Aalawa     | en
aant1238 | None | Aantantara | en
aari1238 | None | Aari-Gayil | en
aari1239 | aiw  | Aari       | en
aari1240 | aay  | Aariya     | en
aari1244 | aiz  | Aari       | en
aasa1238 | aas  | Aasax      | en
aata1238 | None | Aatasaara  | en
abaa1238 | None | Aba        | en
abab1239 | None | Ababda     | en

Display the result as CSV (JSON and XML formats are also supported).

In [31]:
print graph.query(GIL).serialize(format='csv')
glottocode,iso,label
aala1237,,Aalawa
aant1238,,Aantantara
aari1238,,Aari-Gayil
aari1239,aiw,Aari
aari1240,aay,Aariya
aari1244,aiz,Aari
aasa1238,aas,Aasax
aata1238,,Aatasaara
abaa1238,,Aba
abab1239,,Ababda
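
For example, to peek at the beginning of the JSON serialization of the same result (a sketch using the query from above):

print graph.query(GIL).serialize(format='json')[:200] + '...'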

Determine the language families with the most child languages.

In [32]:
FAMILIES = """
SELECT
  ?label (count(*) as ?n)
WHERE
  { ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower+/a gold:Language }
GROUP BY ?s
ORDER BY desc(?n) LIMIT 10"""

for f, n in graph.query(FAMILIES):
    print '%s\t%s' % (f, n)
Atlantic-Congo	1430
Austronesian	1274
Indo-European	583
Sino-Tibetan	475
Bookkeeping	391
Afro-Asiatic	372
Nuclear Trans New Guinea	315
Pama-Nyungan	241
Otomanguean	179
Sign Language	168

Display the immediate children for some families.

In [33]:
CHILDREN = """
SELECT
  ?label (group_concat(?o; separator=", ") as ?children)
WHERE
  { ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower/rdfs:label ?o }
GROUP BY ?s
ORDER BY ?label OFFSET 10 LIMIT 5"""

for f, c in graph.query(CHILDREN):
    print '%s <- %s' % (f, c)
Arawan <- Suruahá, Paumarí, Aruá (Amazonas State), Madi-Madiha
Artificial Language <- Neo, Efate group based (Artificial Language), Kotava, Esperanto, Lingua Franca Nova, Talossan, Interlingua (International Auxiliary Language Association), Rennellese Sign Language, Ladakhi Sign
Athapaskan-Eyak-Tlingit <- Athapaskan-Eyak, Tlingit
Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel
Austroasiatic <- Nicobaric, Monic, Khmuic, Vietic, Mangic, Pearic, Bahnaric, Khasi-Palaung, Katuic, Mundaic, Aslian, Khmeric

Do the same for a specific languoid.

In [34]:
for l, c in graph.query("""BASE <http://glottolog.org/resource/languoid/id/>
SELECT
  ?label (group_concat(?o; separator=", ") as ?children)
WHERE
  { <atla1278> rdfs:label ?label ; skos:narrower/rdfs:label ?o }"""):
    print '%s <- %s' % (l, c)
Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel

Here's a SPARQL query that retrieves most of the functional properties of the languoids.

In [35]:
LANGUOIDS = """
SELECT
  (substr(str(?s), 43) AS ?id) ?label
  (substr(str(?type), 34) AS ?level)
  (substr(str(?broader), 43) AS ?parent)
  (if(bound(?change_note), 1, 0) AS ?obsolete)
  ?status ?iso639 ?latitude ?longitude
WHERE
  { ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label .
    ?s a ?type FILTER (strstarts(str(?type), "http://purl.org/linguistics/gold/"))
    OPTIONAL { ?s skos:broader ?broader }
    OPTIONAL { ?s skos:changeNote ?change_note FILTER (?change_note = "obsolete") }
    OPTIONAL { ?s skos:editorialNote ?status }
    OPTIONAL { ?s lexvo:iso639P3PCode ?iso639 }
    OPTIONAL { ?s geo:lat ?latitude; geo:long ?longitude } }"""

Display some results.

In [36]:
for row in itertools.islice(graph.query(LANGUOIDS), 20):
    print '%s %-20s %-17s %-8s %s %-11s %-4s %-8s %s' % row
pwon1235 Pwo Northern Karen   Language          nort2704 0 established pww  18.016   98.2709
kwes1244 Kwese                Language          phee1234 0 established kws  -5.60445 18.5759
abaw1238 Abawa                Dialect           gupa1248 0 None        None None     None
roto1247 Rotorua-Taupo        Dialect           maor1246 0 None        None None     None
nort2855 North Coast Mengen   Dialect           meng1267 0 None        None None     None
maca1260 Maca                 Language          mata1290 0 established mca  -25.0119 -57.3694
nyon1241 Nyong                Language          pere1234 0 established muo  7.27419  11.0615
fars1254 Farsic-Caucasian Tat LanguageSubfamily sout3157 0 established None None     None
yeng1243 Yengi Hissar         Dialect           uigh1240 0 None        None None     None
thui1238 Thui Phum            Dialect           ngal1291 0 None        None None     None
west2339 Western Asturian     Dialect           astu1245 0 None        None None     None
kele1254 Kele (C.60)          LanguageFamily    None     1 established None None     None
zumu1241 Zumu                 Dialect           bata1314 0 None        None None     None
nort2742 Northern Isan        Dialect           nort2741 0 None        None None     None
tezo1238 Tezoatlán Mixtec     Language          mixt1427 0 established mxb  17.6155  -97.9002
sund1254 Sundi-Kamba          LanguageSubfamily laad1234 0 established None None     None
long1404 Long Bento'          Dialect           moda1244 0 None        None None     None
pouy1238 Pouye                Language          ramm1241 0 established bye  -3.72704 141.864
gola1255 Gola                 Language          mela1257 0 established gol  7.06193  -10.8138
supp1238 Suppire-Mamara       LanguageFamily    None     1 established None None     None

Write the results into a CSV file. Show the beginning of the file.

In [37]:
CSV = 'glottolog.csv'

graph.query(LANGUOIDS).serialize(CSV, format='csv')

with open(CSV) as fd:
    sample = fd.read(500)

print sample + '...'
id,label,level,parent,obsolete,status,iso639,latitude,longitude
pwon1235,Pwo Northern Karen,Language,nort2704,0,established,pww,18.016,98.2709
kwes1244,Kwese,Language,phee1234,0,established,kws,-5.60445,18.5759
abaw1238,Abawa,Dialect,gupa1248,0,,,,
roto1247,Rotorua-Taupo,Dialect,maor1246,0,,,,
nort2855,North Coast Mengen,Dialect,meng1267,0,,,,
maca1260,Maca,Language,mata1290,0,established,mca,-25.0119,-57.3694
nyon1241,Nyong,Language,pere1234,0,established,muo,7.27419,11.0615
fars1254,Farsic-Cau...

Let's put that into a relational database so we can reuse it later.

Export to SQLite

Create an SQLite database file by connecting with sqlite3 (docs). Activate foreign key checks so that we notice if something is inconsistent.

In [38]:
import sqlite3

DB = 'glottolog.sqlite3'

conn = sqlite3.connect(DB)
conn.execute('PRAGMA foreign_keys = ON')

conn.execute('PRAGMA synchronous = OFF')
conn.execute('PRAGMA journal_mode = MEMORY')

conn
Out[38]:
<sqlite3.Connection at 0x28c60858>

Create a table for the results of the languoids query with some additional sanity checks. Insert the query rows. Count them.

In [39]:
conn.execute("""
CREATE TABLE languoid (
  id TEXT NOT NULL PRIMARY KEY,
  label TEXT NOT NULL,
  level TEXT NOT NULL,
  parent TEXT,
  obsolete BOOLEAN NOT NULL,
  status TEXT,
  iso TEXT UNIQUE,
  latitude REAL,
  longitude REAL,
  FOREIGN KEY(parent) REFERENCES languoid(id) DEFERRABLE INITIALLY DEFERRED,
  CHECK (level IN ('LanguageFamily', 'LanguageSubfamily', 'Language', 'Dialect')),
  CHECK (obsolete IN (0, 1)),
  CHECK (status IN ('established', 'spurious', 'spurious retired', 'unattested',
                    'provisional', 'retired'))
)""")

conn.executemany('INSERT INTO languoid VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
    graph.query(LANGUOIDS))
conn.commit()

conn.execute('SELECT count(*) FROM languoid').fetchone()
Out[39]:
(24393,)

Languoids may have any number of alternative labels.

Create a table for the labels and their language. Retrieve them with SPARQL. Insert the query results into the table. Count rows.

In [40]:
conn.execute("""
CREATE TABLE label (
  id TEXT NOT NULL,
  lang TEXT NOT NULL,
  label TEXT NOT NULL,
  PRIMARY KEY (id, lang, label),
  FOREIGN KEY(id) REFERENCES languoid(id)
)""")

LABELS = """
SELECT
  (substr(str(?s), 43) AS ?id) (lang(?label) AS ?lang) ?label
WHERE
  { ?s a dcterms:LinguisticSystem ; skos:altLabel ?label }"""

conn.executemany('INSERT INTO label VALUES (?, ?, ?)',
    graph.query(LABELS))
conn.commit()

conn.execute('SELECT count(*) FROM label').fetchone()
Out[40]:
(86463,)

Languoids may have any number of references.

Create a table for the references. Retrieve them with SPARQL. Insert the query results into the table. Count.

In [41]:
conn.execute("""
CREATE TABLE reference (
  id TEXT NOT NULL,
  reference INTEGER NOT NULL,
  PRIMARY KEY (id, reference),
  FOREIGN KEY(id) REFERENCES languoid(id)
)""")

REFERENCES = """
SELECT
  (substr(str(?s), 43) AS ?id) (substr(str(?o), 44) AS ?reference)
WHERE
  { ?s a dcterms:LinguisticSystem ; dcterms:description ?o
    FILTER (strstarts(str(?o), "http://glottolog.org/resource/reference/id/")) }"""

conn.executemany('INSERT INTO reference VALUES (?, ?)',
    graph.query(REFERENCES))
conn.commit()

conn.execute('SELECT count(*) FROM reference').fetchone()
Out[41]:
(212614,)

Querying with SQLite

Display the number of languoids. Break it down by type and check the proportion of superseded entries. Most of the family entries are obsolete.

In [42]:
print conn.execute('SELECT count(*) FROM languoid').fetchone()

conn.execute('SELECT level, count(*), sum(obsolete) FROM languoid GROUP BY level').fetchall()
(24393,)
Out[42]:
[(u'Dialect', 10599, 185),
 (u'Language', 8418, 21),
 (u'LanguageFamily', 1505, 1263),
 (u'LanguageSubfamily', 3871, 0)]

Check the distribution of status values by type. Only language entries make this distinction.

In [43]:
conn.execute("""SELECT level, status, count(*) AS n
FROM languoid GROUP BY level, status ORDER BY level, n DESC""").fetchall()
Out[43]:
[(u'Dialect', None, 10599),
 (u'Language', u'established', 7945),
 (u'Language', u'spurious', 199),
 (u'Language', u'spurious retired', 192),
 (u'Language', u'unattested', 61),
 (u'Language', u'retired', 19),
 (u'Language', u'provisional', 2),
 (u'LanguageFamily', u'established', 1505),
 (u'LanguageSubfamily', u'established', 3871)]

Display the number of ISO codes. Break the proportions down by languoid type. ISO 639-3 also contains macrolanguages.

In [44]:
print conn.execute('SELECT count(*), count(iso) FROM languoid').fetchone()

conn.execute('SELECT level, count(*), count(iso) FROM languoid GROUP BY level').fetchall()
(24393, 7822)
Out[44]:
[(u'Dialect', 10599, 5),
 (u'Language', 8418, 7789),
 (u'LanguageFamily', 1505, 1),
 (u'LanguageSubfamily', 3871, 27)]
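
To see which non-language entries carry an ISO code (a quick check against the table built above; these should mostly be macrolanguage or collective codes):

for row in conn.execute("""SELECT level, label, iso FROM languoid
    WHERE iso NOT NULL AND level != 'Language' LIMIT 5"""):
    print row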

Check how many entries specify a location. Practically only language entries do so.

In [45]:
conn.execute('SELECT level, count(latitude) FROM languoid GROUP BY level').fetchall()
Out[45]:
[(u'Dialect', 0),
 (u'Language', 7634),
 (u'LanguageFamily', 0),
 (u'LanguageSubfamily', 1)]

Display the first and last glottocodes and ISO codes.

In [46]:
GLOTTOCODES = 'SELECT id FROM languoid ORDER BY id %s LIMIT 10'

print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'ASC'))
print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'DESC'))

ISO_CODES = 'SELECT iso FROM languoid WHERE iso NOT NULL ORDER BY iso %s LIMIT 20'

print ', '.join(i for i, in conn.execute(ISO_CODES % 'ASC'))
print ', '.join(i for i, in conn.execute(ISO_CODES % 'DESC'))
aala1237, aant1238, aari1238, aari1239, aari1240, aari1244, aasa1238, aata1238, abaa1238, abab1239
zyud1238, zyph1238, zwal1238, zuwa1238, zuti1239, zurr1238, zuri1238, zura1238, zuoj1238, zuni1245
aaa, aab, aac, aad, aae, aaf, aag, aah, aai, aak, aal, aam, aan, aao, aap, aaq, aar, aas, aat, aau
zzj, zza, zyp, zyn, zyj, zyg, zyb, zwa, zuy, zun, zum, zul, zuh, zua, zty, ztx, ztu, ztt, zts, ztq

Labels

Display the number of labels. Break them down by language and entry type.

In [47]:
print conn.execute('SELECT count(*) FROM label').fetchone()

print conn.execute("""SELECT lang, count(*) AS n
FROM label GROUP BY lang ORDER BY n DESC LIMIT 5""").fetchall()

conn.execute("""SELECT languoid.level, count(*) AS n
FROM label JOIN languoid ON languoid.id=label.id
GROUP BY languoid.level""").fetchall()
(86463,)
[(u'en', 45862), (u'x-clld', 24393), (u'fr', 899), (u'br', 671), (u'ru', 589)]
Out[47]:
[(u'Dialect', 14827),
 (u'Language', 65685),
 (u'LanguageFamily', 1723),
 (u'LanguageSubfamily', 4228)]

Show the minimal, mean, and maximal number of labels per entry. Check the languoids with the most labels.

In [48]:
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM label GROUP BY id)""").fetchone()

conn.execute("""SELECT languoid.label, count(*) AS n
FROM label JOIN languoid ON languoid.id=label.id
GROUP BY label.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 3.5445824621817734, 174)
Out[48]:
[(u'Standard French', 174), (u'Standard Spanish', 154), (u'Russian', 144)]

Show the minimal, mean, and maximal label length. Check the frequencies of the most common lengths.

In [49]:
print conn.execute("""SELECT min(s), avg(s), max(s) FROM
(SELECT length(label) AS s FROM label)""").fetchall()

conn.execute("""SELECT length(label) AS l, count(*) AS n
FROM label GROUP BY l HAVING n > 3200 ORDER BY l""").fetchall()
[(1, 9.443553890103281, 65)]
Out[49]:
[(4, 3882), (5, 6174), (6, 7481), (7, 6887), (8, 29717), (9, 4054), (10, 3336)]

References

Display the number of references. Break them down by entry type. There are far fewer references for non-languages.

In [50]:
print conn.execute('SELECT count(*) FROM reference').fetchone()

conn.execute("""SELECT l.level, count(*) AS n
FROM reference AS r JOIN languoid AS l ON l.id=r.id GROUP BY l.level""").fetchall()
(212614,)
Out[50]:
[(u'Dialect', 43),
 (u'Language', 210178),
 (u'LanguageFamily', 1663),
 (u'LanguageSubfamily', 730)]

Show the minimal, mean, and maximal number of references per entry. Check the most referenced languoids.

In [51]:
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM reference GROUP BY id)""").fetchone()

conn.execute("""SELECT l.label, count(*) AS n FROM reference AS r
JOIN languoid AS l ON l.id=r.id GROUP BY r.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 25.640858658948385, 2728)
Out[51]:
[(u'Luxembourgish', 2728), (u'Standard French', 2160), (u'Swahili', 1840)]

Building the tree

The languoid table only specifies the direct parent of each entry. However, we want to be able to traverse the tree and query the whole path.

As SQLite supports hierarchical queries only with version 3.8.3+, we will use a more general approach and generate a table with all tree paths.

In other words, we will compute the transitive closure of the parent relation, a.k.a. tree closure table.

Since we won't use recursion inside the database, we will simply put together a bunch of SQL queries and feed the results back into a new table of our database.
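
As a toy illustration (a pure-Python sketch over a hypothetical three-node chain, not the actual data): for a chain A -> B -> C, the closure contains the walks (A, 1, B), (A, 2, C), and (B, 1, C).

# Toy parent relation: A -> B -> C (C is the root).
parent = {'A': 'B', 'B': 'C', 'C': None}

closure = []
for child in parent:
    steps, node = 0, child
    while parent[node] is not None:
        steps += 1
        node = parent[node]
        closure.append((child, steps, node, parent[node] is None))

print sorted(closure)  # each row: (child, steps, parent, terminal)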

In [52]:
PATH = """SELECT
  i0 AS child, %(depth)d AS steps, i%(depth)d AS parent, i%(next)d IS NULL AS terminal
FROM (
  SELECT %(select)s
  FROM languoid AS l0
  %(joins)s
) WHERE parent IS NOT NULL"""

def path_query(depth):
    select = ', '.join('l%(step)d.id AS i%(step)d' % {'step': i} for i in range(depth + 2))
    joins = ' '.join('LEFT JOIN languoid AS l%(next)d ON l%(step)d.parent = l%(next)d.id'
        % {'step': i, 'next': i + 1} for i in range(depth + 1))
    return PATH % {'depth': depth, 'next': depth + 1, 'select': select, 'joins': joins}

The path_query function generates a query for tree walks of the length given by depth. Note that we omit zero-step (reflexive) walks.

In [53]:
print path_query(1)
SELECT
  i0 AS child, 1 AS steps, i1 AS parent, i2 IS NULL AS terminal
FROM (
  SELECT l0.id AS i0, l1.id AS i1, l2.id AS i2
  FROM languoid AS l0
  LEFT JOIN languoid AS l1 ON l0.parent = l1.id LEFT JOIN languoid AS l2 ON l1.parent = l2.id
) WHERE parent IS NOT NULL

Each query returns the start glottocode, the number of steps, the end glottocode, and a boolean indicating whether the walk is terminal, i.e. whether the end node has no parent of its own.

In [54]:
print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(1)).fetchall()
print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(2)).fetchall()
[(u'aala1237', 1, u'ramo1244', 0), (u'aant1238', 1, u'nort2920', 0), (u'aari1238', 1, u'ahkk1235', 0)]
[(u'aala1237', 2, u'kand1307', 0), (u'aant1238', 2, u'tair1260', 0), (u'aari1238', 2, u'sout2845', 1)]

When all paths in the query are terminal, we have arrived at the maximal depth.

In [55]:
conn.execute(path_query(18)).fetchall()
Out[55]:
[(u'patw1249', 18, u'indo1319', 1),
 (u'yeri1239', 18, u'atla1278', 1),
 (u'cher1272', 18, u'atla1278', 1),
 (u'wile1238', 18, u'atla1278', 1),
 (u'biri1258', 18, u'atla1278', 1),
 (u'doli1238', 18, u'atla1278', 1),
 (u'fufu1238', 18, u'atla1278', 1),
 (u'bule1242', 18, u'atla1278', 1),
 (u'pato1243', 18, u'indo1319', 1)]

Create a table for the results. Insert path walks of increasing depth until all walks have ended. Count the walks.

In [56]:
conn.execute("""
CREATE TABLE tree (
  child TEXT NOT NULL,
  steps INTEGER NOT NULL,
  parent TEXT NOT NULL,
  terminal BOOLEAN NOT NULL,
  PRIMARY KEY (child, steps),
  UNIQUE (child, parent),
  UNIQUE (parent, child),
  FOREIGN KEY (child) REFERENCES languoid (id),
  FOREIGN KEY (parent) REFERENCES languoid (id),
  CHECK (terminal IN (0, 1))
)""")

depth = 1
while True:
    rows = conn.execute(path_query(depth)).fetchall()
    if not rows:
        break
    conn.executemany('INSERT INTO tree VALUES (?, ?, ?, ?)', rows)
    depth += 1
conn.commit()

conn.execute('SELECT count(*) FROM tree').fetchone()
Out[56]:
(145822,)

If the SQLite library we use from Python is version 3.8.3 or later, we can also get the rows for the tree closure table with a single recursive query:

WITH RECURSIVE tree(child, steps, parent, terminal) AS (
  SELECT l.id, 1, l.parent, p.parent IS NULL
  FROM languoid AS l
  LEFT JOIN languoid AS p ON p.id = l.parent
  WHERE l.parent IS NOT NULL
UNION ALL
  SELECT t.child, t.steps + 1, p.parent, gp.parent IS NULL
  FROM languoid AS p
  JOIN tree AS t ON p.id = t.parent
  LEFT JOIN languoid AS gp ON gp.id = p.parent
  WHERE p.parent IS NOT NULL
)
SELECT * FROM tree
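
A minimal sketch of running that recursive query from Python (assuming the conn connection and the languoid table from above; recursive common table expressions require an SQLite library of at least version 3.8.3):

import sqlite3

TREE_SQL = """WITH RECURSIVE tree(child, steps, parent, terminal) AS (
    SELECT l.id, 1, l.parent, p.parent IS NULL
    FROM languoid AS l
    LEFT JOIN languoid AS p ON p.id = l.parent
    WHERE l.parent IS NOT NULL
  UNION ALL
    SELECT t.child, t.steps + 1, p.parent, gp.parent IS NULL
    FROM languoid AS p
    JOIN tree AS t ON p.id = t.parent
    LEFT JOIN languoid AS gp ON gp.id = p.parent
    WHERE p.parent IS NOT NULL
  )
  SELECT * FROM tree"""

if sqlite3.sqlite_version_info >= (3, 8, 3):  # check the run-time library version
    rows = conn.execute(TREE_SQL).fetchall()
    print len(rows)  # should match the row count of the tree table built above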

Querying the tree

Show the minimal, mean, and maximal number of languages per family. Display the language families with the most child languages.

In [57]:
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Language'
WHERE p.level='LanguageFamily' GROUP BY p.id)""").fetchone()

conn.execute("""SELECT p.label, count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Language'
WHERE p.level='LanguageFamily' GROUP BY p.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 33.781893004115226, 1430)
Out[57]:
[(u'Atlantic-Congo', 1430), (u'Austronesian', 1274), (u'Indo-European', 583)]

Determine the languages with the most dialects.

In [58]:
conn.execute("""SELECT p.label, count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Dialect'
WHERE p.level='Language' GROUP BY p.id  ORDER BY n DESC LIMIT 4""").fetchall()
Out[58]:
[(u'Gumuz', 19), (u'Basque', 11), (u'Kunama', 9), (u'Berta', 7)]

Display some of the longest paths.

In [59]:
for child, path in conn.execute("""SELECT c.label, (SELECT group_concat(parent, ' <- ')
  FROM (SELECT g.child AS child , p.label AS parent
    FROM tree AS g JOIN languoid AS p ON g.parent=p.id
    WHERE child=c.id ORDER BY g.steps DESC)
  GROUP BY child)
FROM languoid AS c JOIN tree AS w ON w.child=c.id AND w.terminal
ORDER BY w.steps DESC, c.id LIMIT 3"""):
    print '%s <= %s\n' % (path, child)
Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- North-West Dagaric <- Birifor <- Malba Birifor <= Birifor

Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= 'Bulengee

Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= Cherii

Note that with SPARQL the number of steps is not available, so it might be difficult to get the path in the right order like this.
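
For comparison, here is a SPARQL sketch that collects all ancestors of a languoid via a property path (assuming the graph loaded above); without a steps column, the ancestors come back in no particular order:

ANCESTORS = """BASE <http://glottolog.org/resource/languoid/id/>
SELECT ?label
WHERE { <aala1237> skos:broader+ ?ancestor . ?ancestor rdfs:label ?label }"""

for (label,) in graph.query(ANCESTORS):
    print label,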

Analysis with pandas

Activate inline plotting in this notebook.

In [60]:
%matplotlib inline

Load the language labels into a pandas (docs) DataFrame. Display the result.

In [61]:
import pandas as pd

pd.set_option('max_rows', 15)

labels = pd.read_sql_query("""SELECT label.*
FROM label JOIN languoid ON label.id=languoid.id
WHERE languoid.level='Language' ORDER BY label.id""", conn, index_col='id')

labels
Out[61]:
lang label
id
aari1239 an Luenga aari
aari1239 ar لغة آري
aari1239 en Aari language
aari1239 en Ara
aari1239 en Ari
aari1239 en Ari-Galila
aari1239 en Aro
... ... ...
zuoj1238 en Zuojiang
zuoj1238 x-clld zuoj1238
zyph1238 br Zac'hringeg
zyph1238 en Zophei
zyph1238 en Zoptei
zyph1238 en Zyphe language
zyph1238 x-clld zyph1238

65685 rows × 2 columns

Break the number of labels down by language.

In [62]:
labels_lang = labels.groupby('lang').size().sort_values(ascending=False)
labels_lang[labels_lang > 400].plot.bar();

Show summary statistics on the number of labels per languoid. Plot the more common label count frequencies.

In [63]:
nlabels = labels.groupby(level='id').size()
nlabels_hist = nlabels.value_counts().sort_index()

print nlabels.describe()
nlabels_hist[nlabels_hist > 30].plot.area();
count    8418.000000
mean        7.802922
std        10.729925
min         1.000000
25%         2.000000
50%         5.000000
75%         9.000000
max       174.000000
dtype: float64

Compute statistics on the string lengths of the labels.

In [64]:
slabel = labels['label'].str.len()
slabel_hist = slabel.value_counts().sort_index()

print slabel.describe()
slabel_hist[slabel_hist > 30].plot.area();
count    65685.000000
mean         9.834315
std          5.168938
min          1.000000
25%          6.000000
50%          8.000000
75%         13.000000
max         65.000000
Name: label, dtype: float64

Load the languages and the full paths into data frames. Join them into one data frame and show the result.

In [65]:
languages = pd.read_sql_query("""SELECT * FROM languoid
WHERE level='Language' AND NOT obsolete ORDER BY id""", conn, index_col='id')

tree = pd.read_sql_query('SELECT * FROM tree WHERE terminal', conn, index_col='child')

langs = languages.join(tree, how='left', rsuffix='_tree')

langs
Out[65]:
label level parent obsolete status iso latitude longitude steps parent_tree terminal
id
aari1239 Aari Language aari1238 0 established aiw 5.95034 36.5721 3 sout2845 1
aari1240 Aariya Language book1242 0 spurious aay NaN NaN 1 book1242 1
aari1244 Aari Language book1242 0 spurious retired aiz NaN NaN 1 book1242 1
aasa1238 Aasax Language uncl1457 0 established aas -4.00679 36.8648 4 afro1255 1
abad1241 Abadi Language west2850 0 established kbt -9.03389 146.9920 11 aust1307 1
abag1245 Abaga Language kama1374 0 established abg -6.12028 145.6650 6 nucl1709 1
abai1240 Abai Sungai Language pait1248 0 established abf 5.55394 118.3060 7 aust1307 1
... ... ... ... ... ... ... ... ... ... ... ...
zulg1242 Zulgo-Gemzek Language meri1245 0 established gnd 10.82700 14.0578 7 afro1255 1
zulu1248 Zulu Language zulu1251 0 established zul -25.33050 31.3512 12 atla1278 1
zuma1239 Zumaya Language masa1324 0 established zuy 10.55800 14.4445 5 afro1255 1
zumb1240 Zumbun Language west2712 0 established jmb 10.82700 9.9683 5 afro1255 1
zuni1245 Zuni Language None 0 established zun 35.00560 -108.7820 NaN NaN NaN
zuoj1238 Zuojiang Zhuang Language nort3180 0 established zzj 21.83750 107.3620 5 taik1256 1
zyph1238 Zyphe Language nucl1757 0 established zyp 22.52400 93.2640 5 sino1245 1

8397 rows × 11 columns

Analyze the number of languages per top-level family.

In [66]:
famsizes = langs.groupby('parent_tree').size().sort_values(ascending=False)

print famsizes.describe()
famsizes[famsizes > 100].plot.bar();
count     243.000000
mean       33.781893
std       137.796044
min         1.000000
25%         2.000000
50%         5.000000
75%        12.000000
max      1430.000000
dtype: float64

Analyze the number of steps from languages to their top-level family.

In [67]:
langs['steps'] = langs['steps'].fillna(0)
print langs['steps'].describe()
langs['steps'].value_counts().sort_index().plot.area();
count    8397.000000
mean        5.691080
std         3.476092
min         0.000000
25%         3.000000
50%         5.000000
75%         8.000000
max        17.000000
Name: steps, dtype: float64

Inspect the geographical distribution of languages.

In [68]:
langs['latitude'].hist(bins=100);
In [69]:
langs['longitude'].hist(bins=100);
In [70]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.axis([-180, 180, -90, 90])
plt.xticks(range(-180, 181, 60))
plt.yticks(range(-90, 91, 30))

plt.scatter(langs['longitude'], langs['latitude'], 1);