Looking at French word frequency with Lexique

Lexique 3 is a French word database from the Université de Savoie. It includes:

  • Inflected word forms.
  • Lexemes.
  • Frequency data for movie subtitles and books.

It's a great database and a ton of fun. In this notebook, we use a copy of Lexique that has been loaded into an SQLite 3 database. But since we don't want to get into the messy details of the Python code, we build the database using a Makefile, and we keep all of our Python utility functions in an external file named lexique.py. If you want to see all those details, or customize this analysis, check out this notebook on GitHub.

First, let's get everything loaded.

In [1]:
%matplotlib inline
%run lexique.py
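
The sql() helper used throughout comes from lexique.py and isn't shown in this notebook. It's assumed to be a thin wrapper around sqlite3 and pandas, roughly along these lines (a sketch, not the actual code; the database filename is a placeholder):

import sqlite3
import pandas as pd

# Hypothetical filename; the real database is built by the Makefile.
_conn = sqlite3.connect('lexique3.sqlite3')

def sql(query, index_col=None):
    """Run a query against the Lexique database and return a DataFrame."""
    return pd.read_sql_query(query, _conn, index_col=index_col)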

Now let's take a look at what data we have available. First, we have the raw data from Lexique, which includes inflected forms in the ortho column:

In [2]:
sql("SELECT * FROM lexique WHERE lemme = 'avoir' LIMIT 5")
Out[2]:
ortho phon lemme cgram genre nombre freqlemfilms2 freqlemlivres freqfilms2 freqlivres
0 a a avoir AUX 18559.22 12800.81 6350.91 2926.69
1 a a avoir VER 13572.40 6426.49 5498.34 1669.39
2 ai E avoir AUX 18559.22 12800.81 4902.10 2119.12
3 ai E avoir VER 13572.40 6426.49 2475.78 619.05
4 aie E avoir AUX 18559.22 12800.81 31.75 21.69

We have another table, lemme, which sums over all the orthographies associated with a given lemma. The lemme column, however, is still not unique: if a given word can be used as, say, both a noun and a verb, it will appear once for each part of speech.

In [3]:
sql("SELECT * FROM lemme ORDER BY freqlivres DESC LIMIT 5")
Out[3]:
lemme cgram genre nombre freqfilms2 freqlivres
0 de PRE 25220.86 38928.92
1 la ART:def f s 14946.48 23633.92
2 et CON 12909.08 20879.73
3 à PRE 12190.40 19209.05
4 le ART:def m s 13652.76 18310.95

And finally, we collapse all the cgram, genre and nombre values associated with a given value of lemme, to give us unique lemmas with frequency data:

In [4]:
sql("SELECT * FROM lemme_simple ORDER BY freqlivres DESC LIMIT 5")
Out[4]:
lemme freqfilms2 freqlivres
0 de 25220.96 38928.92
1 la 16028.08 24877.30
2 être 40411.41 21709.87
3 et 12909.08 20879.73
4 le 16953.50 20735.14
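
For reference, lemme_simple could plausibly be produced from lemme with an aggregate query along these lines (an assumption for illustration; the actual table is built by the Makefile):

sql("""
SELECT lemme,
       SUM(freqfilms2) AS freqfilms2,
       SUM(freqlivres) AS freqlivres
  FROM lemme
 GROUP BY lemme
 ORDER BY freqlivres DESC
 LIMIT 5
""")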

Note that we have two sets of frequency data: freqfilms2, which is based on a corpus of film subtitles, and freqlivres, which is based on a corpus of books. There are some important differences. For example, French films use the passé composé far more often than books, which raises the frequencies of the auxiliary verbs être and avoir:

In [5]:
sql("SELECT * FROM lemme_simple ORDER BY freqfilms2 DESC LIMIT 5")
Out[5]:
lemme freqfilms2 freqlivres
0 être 40411.41 21709.87
1 avoir 32134.77 19230.64
2 je 25988.48 10862.77
3 de 25220.96 38928.92
4 ne 22297.51 13852.97
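
To put a rough number on that difference, we can compare the two columns directly for the auxiliaries (an illustrative check, not part of the original analysis):

aux = sql("""
SELECT lemme, freqfilms2, freqlivres FROM lemme_simple
 WHERE lemme IN ('être', 'avoir')
""")
aux['films_vs_livres'] = aux['freqfilms2'] / aux['freqlivres']
aux

Given the values above, both auxiliaries come out roughly 1.7 to 1.9 times as frequent (per million words) in the subtitle corpus as in the book corpus.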

Parts of speech

Using the film dataset, let's take a look at the parts of speech:

In [6]:
cgram_freq = sql("""
SELECT cgram, SUM(freqfilms2) AS freqfilms2, SUM(freqlivres) AS freqlivres
  FROM lemme GROUP BY cgram
""", index_col='cgram')
cgram_freq
Out[6]:
freqfilms2 freqlivres
cgram
1.20 0.00
ADJ 42939.77 56548.13
ADJ:dem 6363.94 6802.23
ADJ:ind 2999.34 3737.10
ADJ:int 1273.62 582.91
ADJ:num 2525.93 4680.43
ADJ:pos 19106.03 20005.62
ADV 97693.38 69747.44
ART:def 54495.63 83470.15
ART:ind 26051.19 33763.58
AUX 26633.45 19302.64
CON 29730.47 38189.17
LIA 0.00 412.57
NOM 144894.66 186537.81
ONO 6291.15 1501.06
PRE 77439.28 118274.42
PRO:dem 15700.13 7549.99
PRO:ind 7538.14 5716.50
PRO:int 1612.64 736.37
PRO:per 133651.20 90995.84
PRO:pos 334.17 322.16
PRO:rel 11547.03 14483.19
VER 198390.74 150350.57
In [7]:
cgram_freq_summary = cgram_freq.groupby(lambda x: x[0:3]).sum()
plt.figure(figsize=(7,7))
plt.subplot(aspect=True)
plt.pie(cgram_freq_summary.freqfilms2, labels=cgram_freq_summary.index.values, colors=colors)
plt.title("Parts of speech")
Out[7]:
<matplotlib.text.Text at 0x41c2050>

Text coverage

How many words do we need to know to understand 98% of the individual words which appear on a given page?

In [8]:
coverage = sql("""
SELECT lemme, freqfilms2 FROM lemme_simple
  ORDER BY freqfilms2 DESC""")
coverage.index += 1
coverage['film_coverage'] = \
  100*coverage['freqfilms2'].cumsum() / coverage['freqfilms2'].sum()
del coverage['lemme']
del coverage['freqfilms2']
coverage[0:5]
Out[8]:
film_coverage
1 4.454456
2 7.996598
3 10.861248
4 13.641296
5 16.099099
In [9]:
book_coverage = sql("""
SELECT lemme, freqlivres FROM lemme_simple
  ORDER BY freqlivres DESC""")
book_coverage.index += 1
coverage['book_coverage'] = \
  100*book_coverage['freqlivres'].cumsum() / book_coverage['freqlivres'].sum()
coverage[0:5]
Out[9]:
film_coverage book_coverage
1 4.454456 4.260534
2 7.996598 6.983203
3 10.861248 9.359217
4 13.641296 11.644377
5 16.099099 13.913712
In [10]:
plt.plot(coverage.index.values, coverage.film_coverage, label="Film Coverage")
plt.plot(coverage.index.values, coverage.book_coverage, label="Book Coverage")
plt.legend(loc = 'lower right')
plt.title('Text Coverage')
plt.xlabel('Vocabulary size')
plt.ylabel('% coverage')
plt.xlim((0,10000))
Out[10]:
(0, 10000)

Or, in table form, here's how many words you need to know to get a given percentage of coverage:

In [11]:
coverage.loc[[250, 500, 1000, 2000, 4000, 8000, 16000], :]
Out[11]:
film_coverage book_coverage
250 76.164757 68.074024
500 82.792428 74.990087
1000 88.386602 81.450089
2000 93.004247 87.535363
4000 96.410076 92.756703
8000 98.554939 96.606475
16000 99.666146 99.017399
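
To answer the 98% question directly, we can look up the first vocabulary size at which the cumulative coverage crosses a given threshold (an illustrative cell, not part of the original notebook):

# Index of the first row at or above 98% coverage, for each corpus.
{col: (coverage[col] >= 98).idxmax() for col in ['film_coverage', 'book_coverage']}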

Text coverage by part of speech

We want to get a feel for how many nouns, verbs, etc., are required in a well-balanced vocabulary. This requires grouping words by part of speech, sorting them by frequency, and graphing the cumulative text coverage for a given number of words. This takes a fair bit of work to set up.

First, we need to do quite a bit of data munging:

In [12]:
# Merge related cgrams, sum frequency over (cgram, lemme) groups,
# and sort by (cgram,freqfilms2).
cgram_lemme_freq = sql("""
SELECT cgram, SUM(freqfilms2) AS freqfilms2
  FROM (SELECT CASE WHEN cgram='AUX' THEN 'VER'
                    ELSE SUBSTR(cgram, 1, 3)
                    END AS cgram,
               lemme, freqfilms2
          FROM lemme)
  GROUP BY cgram, lemme
  ORDER BY cgram, freqfilms2 DESC
""")

# Convert freqfilms2 to a cumulative percentage over each cgram group.
cgram_col = cgram_lemme_freq['cgram']
normalized_freq = cgram_lemme_freq.groupby(cgram_col).transform(lambda x: x/x.sum())
cumulative_freq = normalized_freq.groupby(cgram_col).cumsum()
cgram_lemme_freq['freqfilms2'] = 100.0*cumulative_freq

# Sequentially number the rows within each cgram group so we can see the
# vocabulary size corresponding to each cumulative percentage.
cgram_lemme_freq['rang'] = cgram_lemme_freq.groupby(cgram_col).cumcount()+1

# Index by cgram group, and vocabulary size within the group. Uncomment the
# last line to view the data.
cgram_lemme_freq.set_index(['cgram', 'rang'], inplace = True)
#cgram_lemme_freq

Now that we have the data, we can plot it using two different graphs: one for the "large" parts of speech (nouns, etc.), and one for the parts of speech that are either closed classes or at least very small.

In [13]:
def plot_cgrams(labels):
    for key in labels.keys():
        cgram_group = cgram_lemme_freq.loc[key]
        plt.plot(cgram_group.index.values, cgram_group.freqfilms2, label=labels[key])
    plt.legend(loc = 'lower right')
    plt.title('Text Coverage by Part of Speech (films)')
    plt.xlabel('Words known by part of speech')
    plt.ylabel('% coverage')
    plt.ylim((0,100))
    plt.axhline(y=90, color='k', ls='dashed')

plt.figure(figsize=(12,4))
    
plt.subplot(121)
small_cgram_labels = {
    'PRO': 'Pronouns',
    'ADV': 'Adverbs',
    'PRE': 'Prepositions',
    'CON': 'Conjunctions',
    'ART': 'Articles'
}
plot_cgrams(small_cgram_labels)
plt.xlim((0,150))

plt.subplot(122)
large_cgram_labels = {
    'NOM': 'Nouns',
    'VER': 'Verbs',
    'ADJ': 'Adjectives'
}
plot_cgrams(large_cgram_labels)
plt.xlim((0,10000))
Out[13]:
(0, 10000)

It would be nice to have this as a table, too, so we can figure out—for example—how many nouns we need to get 75% coverage. Once again, this will require a fair bit of data munging.

In [14]:
# Only include the parts of speech used in our graph.
cgram_labels = small_cgram_labels.copy()
cgram_labels.update(large_cgram_labels)
interesting = cgram_lemme_freq.loc[cgram_labels.keys()]

# We'll use this to build a list of columns in our final table.
columns = []

# Calculate minimum number words for a given percentage of coverage.
for threshold in [75,90,95,98,99,99.5]:
    # Discard all the rows below our threshold.
    over_threshold = interesting[interesting['freqfilms2'] >= threshold]
    
    # Take the first row that remains.
    over_threshold.reset_index(inplace=True)
    over_threshold.set_index('cgram', inplace=True)
    first_over = over_threshold.groupby(level=0).first()
    
    # Keep only a single column named after our threshold.
    del first_over['freqfilms2']
    first_over.rename(columns={'rang': '%r%%' % threshold}, inplace=True)
    columns.append(first_over)

# Join all the columns together.
table = columns[0].join(columns[1:])

# Clean up the table a bit and add a total
table.index.names = ['Part of speech']
table.index = table.index.map(lambda i: cgram_labels[i])
table.loc['TOTAL'] = table.sum()
table
Out[14]:
75% 90% 95% 98% 99% 99.5%
Adjectives 136 620 1367 2686 3742 4736
Adverbs 17 42 69 118 182 277
Articles 6 8 9 9 10 10
Conjunctions 5 9 11 14 15 17
Nouns 1137 3115 5215 8454 10956 13347
Prepositions 6 9 14 21 26 30
Pronouns 16 24 29 40 50 65
Verbs 63 290 583 1108 1601 2149
TOTAL 1386 4117 7297 12450 16582 20631
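
Individual numbers can be pulled straight out of this table; for example, the number of nouns needed for 75% coverage:

table.loc['Nouns', '75%']   # 1137, from the table above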

Verb groups

We divide verbs into the three standard groups, -er, -ir and -re. We split aller into its own group, because it's the only irregular -er verb. For now, we treat the auxiliary versions of être and avoir in the passé composé as being ordinary verbs.
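
The verbe table used below already carries a groupe column; a rule like the following could plausibly assign it from the infinitive's ending (a sketch for illustration only; the actual table is built by the Makefile):

def groupe_du_verbe(infinitif):
    """Guess a verb's group from its ending (illustrative, not the real code)."""
    if infinitif == 'aller':
        return 'aller'
    for terminaison in ('er', 'ir', 're'):
        if infinitif.endswith(terminaison):
            return terminaison
    return None

Note that -oir verbs such as devoir and pouvoir end in "ir", so this simple rule matches the grouping visible in the table below.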

In [15]:
verbs = sql("SELECT * FROM verbe ORDER BY freqfilms2 DESC")
verbs[0:15]
Out[15]:
lemme groupe prototype conjugaison aux freqfilms2 freqlivres
0 être re être être avoir 40310.72 21587.31
1 avoir ir avoir avoir avoir 32131.64 19227.33
2 aller aller aller aller être 9992.78 2854.92
3 faire re .*faire faire avoir 8813.52 5328.99
4 dire re dire|redire dire avoir 5946.18 4832.51
5 pouvoir ir pouvoir pouvoir avoir 5524.46 2659.75
6 vouloir ir .*vouloir vouloir avoir 5249.31 1640.16
7 savoir ir .*savoir savoir avoir 4516.72 2003.59
8 voir ir .*voir|.*oir voir avoir 4119.47 2401.76
9 devoir ir .*devoir devoir avoir 3232.59 1318.20
10 venir ir .*venir venir être 2763.82 1514.53
11 suivre re .*suivre suivre avoir 2090.55 949.13
12 parler er .*er -er avoir 1970.53 1086.02
13 prendre re .*prendre prendre avoir 1913.84 1466.44
14 croire re .*croire croire avoir 1712.02 947.25

As we can see, all three groups have roughly equal text coverage, but there are actually far more -er verbs than all the others combined. This suggests that a small number of -ir and -re verbs are disproportionately common.

In [16]:
plt.figure(figsize=(8,8))

plt.subplot(121, aspect=True)
group_freq = verbs.groupby(verbs['groupe']).sum()
plt.pie(group_freq.freqfilms2, labels=group_freq.index.values, colors=colors)
plt.title("Verb group frequency")

plt.subplot(122, aspect=True)
group_size = verbs.groupby(verbs['groupe']).count()
plt.pie(group_size.lemme, labels=group_size.index.values, colors=colors)
plt.title("Verb group size (words)")
Out[16]:
<matplotlib.text.Text at 0x4eb8b10>
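
The same comparison in table form (an illustrative cell; the pie charts above are the original presentation):

# Number of verbs and total film frequency per group.
verbs.groupby('groupe')['freqfilms2'].agg(['count', 'sum'])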
In [17]:
# Extract the columns we need, and get rid of 'aller'.
group_freq = verbs[['groupe', 'lemme', 'freqfilms2']].copy()
group_freq = group_freq[group_freq['groupe'] != 'aller']

# Calculate coverage percentages for frequency ranks in each group.
groupe_col = group_freq['groupe']
normalized_freq = group_freq.groupby(groupe_col).transform(lambda x: x/x.sum())
cumulative_freq = 100.0*normalized_freq.groupby(groupe_col).cumsum()
group_freq['freqfilms2'] = cumulative_freq
group_freq['rang'] = group_freq.groupby(groupe_col).cumcount()+1
group_freq.set_index(['groupe', 'rang'], inplace=True)
group_freq[0:10]
Out[17]:
lemme freqfilms2
groupe rang
re 1 être 53.707239
ir 1 avoir 44.826275
re 2 faire 65.449768
3 dire 73.372051
ir 2 pouvoir 52.533350
3 vouloir 59.856569
4 savoir 66.157764
5 voir 71.904763
6 devoir 76.414491
7 venir 80.270247
In [18]:
# Sigh. My database is in French, and my libraries are in English.
# There's no way to avoid coding in franglais, I fear.
for group in ['er', 'ir', 're']:
    g = group_freq.loc[group]
    plt.plot(g.index.values, g.freqfilms2, label=group)
plt.title('Verb Coverage by Group')
plt.legend(loc = 'lower right')
plt.xlabel('Verbs known in group')
plt.ylabel('% coverage')
plt.xlim((1,100))
plt.ylim((0,100))
Out[18]:
(0, 100)

If we take the first 40 -ir and -re verbs, we get better than 96% coverage. Even the first 20 in each group will give us better than 92% coverage. Here's a list for people who want to master all the high-frequency irregular verbs.
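
Those figures can be read straight out of group_freq; for example (illustrative):

# Cumulative film coverage after the first 20 and first 40 verbs in each group.
pd.DataFrame({g: group_freq.loc[g].loc[[20, 40], 'freqfilms2'] for g in ['ir', 're']})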

In [19]:
def html_for_group(groupe):
    lst = ', '.join(group_freq.loc[groupe].loc[1:40]['lemme'].tolist())
    return '<p><i>-%s</i> verbs: %s.</p>' % (groupe, lst)
HTML("<p><i>-er</i> verbs: aller.</p>" + html_for_group('ir') + html_for_group('re'))
Out[19]:

-er verbs: aller.

-ir verbs: avoir, pouvoir, vouloir, savoir, voir, devoir, venir, falloir, partir, mourir, sortir, revenir, finir, sentir, tenir, devenir, ouvrir, dormir, asseoir, souvenir, servir, valoir, agir, recevoir, mentir, offrir, choisir, revoir, courir, réussir, prévenir, découvrir, maintenir, réfléchir, souffrir, couvrir, obtenir, appartenir, ressentir, prévoir.

-re verbs: être, faire, dire, suivre, prendre, croire, attendre, mettre, connaître, comprendre, entendre, plaire, perdre, vivre, rendre, foutre, apprendre, boire, écrire, lire, répondre, descendre, suffire, vendre, battre, promettre, permettre, conduire, disparaître, taire, remettre, reconnaître, rire, reprendre, détruire, paraître, craindre, naître, rejoindre, défendre.

Verb groups, in horrifying detail

Of course, not all -er verbs are completely regular, and there are patterns among the other verb groups. Fortunately, there's a nice XML file of French verb conjugation rules that we can use to examine these hidden details. Combining that with quite a bit of custom code, we can assign a "conjugator" to each verb prototype, and verify that the generated forms match the XML data. This gives us a much shorter list of key forms.
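
The custom matching code isn't shown here, but the prototype column in the verbe table hints at the mechanism: each verb carries a regular expression (prototype) alongside the name of the pattern it conjugates like (conjugaison). A lookup step might look roughly like this (a sketch under that assumption, not the actual implementation):

import re

def conjugaison_pour(infinitif, verbes):
    """Return the conjugaison of the first row whose prototype regex matches.

    Illustrative only; verbes is assumed to be the DataFrame loaded from the
    verbe table earlier in this notebook.
    """
    for _, row in verbes.iterrows():
        if re.fullmatch(row['prototype'], infinitif):
            return row['conjugaison']
    return None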

In [20]:
verbs2 = sql("""
SELECT conjugaison.nom AS conjugaison, lemme, freqfilms2, resume
  FROM verbe
  LEFT OUTER JOIN conjugaison
    ON verbe.conjugaison = conjugaison.nom
  ORDER BY freqfilms2 DESC
""")
verbs2['freqfilms2'] = 100 * verbs2['freqfilms2'] / verbs2['freqfilms2'].sum()
def summarize_conjugator(grp):
    return pd.Series(dict(exemples=', '.join(grp.lemme[0:5]),
                          compte=grp.lemme.count(),
                          freqfilms2=grp.freqfilms2.sum(),
                          resume=grp.resume.iloc[0]))
conjugators = verbs2.groupby('conjugaison').apply(summarize_conjugator).sort('freqfilms2', ascending=False)
conjugators.reset_index(inplace=True)
conjugators.index.names = ['rang']
conjugators.reset_index(inplace=True)
conjugators['rang'] = conjugators['rang'] + 1
conjugators['freqfilms2'] = conjugators['freqfilms2'].cumsum()
conjugators.set_index('rang', inplace=True)
save_tsv('conjugators.tsv', conjugators)
conjugators
Out[20]:
conjugaison compte exemples freqfilms2 resume
rang
1 -er 5187 parler, aimer, passer, penser, trouver 27.057864 (p.p.) parlé, je parle, tu parles, il parle, n...
2 être 1 être 44.971814 (irregular)
3 avoir 1 avoir 59.251008 (irregular)
4 aller 1 aller 63.691766 (irregular)
5 faire 5 faire, refaire, satisfaire, défaire, contrefaire 67.651207 (irregular)
6 dire 2 dire, redire 70.297491 Like interdire, except: vous dites
7 pouvoir 1 pouvoir 72.752543 (irregular)
8 venir 26 venir, revenir, tenir, devenir, souvenir 75.130407 Like -ir, except: (p.p.) venu, tu viens, nous ...
9 vouloir 2 vouloir, revouloir 77.463383 (irregular)
10 savoir 3 savoir, non-savoir, assavoir 79.470603 (irregular)
11 -re 50 attendre, entendre, perdre, rendre, répondre 81.428246 (p.p.) attendu, tu attends, nous attendons, il...
12 voir 5 voir, revoir, entrevoir, ravoir, comparoir 83.332903 Like -ir, except: (p.p.) vu, nous voyons, ils ...
13 partir 18 partir, sortir, sentir, dormir, servir 84.950418 Like -ir, except: tu pars
14 prendre 12 prendre, comprendre, apprendre, reprendre, sur... 86.441360 Like -re, except: (p.p.) pris, nous prenons, i...
15 devoir 2 devoir, redevoir 87.877921 (irregular)
16 -ir (-iss-) 304 finir, agir, choisir, réussir, réfléchir 89.052373 (p.p.) fini, tu finis, nous finissons, ils fin...
17 suivre 2 suivre, poursuivre 90.010207 Like -re, except: (p.p.) suivi, tu suis
18 espérer 199 espérer, inquiéter, préférer, protéger, répéter 90.856970 Like -er, except: tu espères, ils espèrent
19 acheter 64 acheter, emmener, amener, ramener, enlever 91.628962 Like -er, except: tu achètes, ils achètent, il...
20 croire 1 croire 92.389778 Like -re, except: (p.p.) cru, nous croyons, (p...
21 appeler 112 appeler, rappeler, jeter, rejeter, projeter 93.148066 Like -er, except: tu appelles, ils appellent, ...
22 mettre 15 mettre, promettre, permettre, remettre, admettre 93.891337 Like battre, except: (p.p.) mis, (p.s.) il mit
23 falloir 1 falloir 94.626262 (irregular)
24 connaître 10 connaître, disparaître, reconnaître, paraître,... 95.267398 Like -re, except: (p.p.) connu, je connais, tu...
25 essayer 30 essayer, payer, effrayer, balayer, rayer 95.784982 Like ennuyer, except: tu essaies/tu essayes, i...
26 ouvrir 9 ouvrir, offrir, découvrir, souffrir, couvrir 96.201333 Like -ir, except: (p.p.) ouvert, j'ouvre, tu o...
27 mourir 1 mourir 96.608587 Like -ir, except: (p.p.) mort, tu meurs, ils m...
28 plaire 3 plaire, déplaire, complaire 96.884291 Like taire, except: il plaît/il plait
29 vivre 3 vivre, survivre, revivre 97.143889 Like suivre, except: (p.p.) vécu, (p.s.) il vécut
30 conduire 24 conduire, détruire, construire, produire, réduire 97.396542 Like interdire, except: (p.s.) il conduisit
... ... ... ... ... ...
34 écrire 12 écrire, décrire, inscrire, prescrire, réécrire 98.157882 Like -re, except: (p.p.) écrit, nous écrivons,...
35 boire 2 boire, reboire 98.308728 Like -re, except: (p.p.) bu, nous buvons, ils ...
36 asseoir 2 asseoir, rasseoir 98.453020 Like -ir, except: (p.p.) assis, tu assieds/tu ...
37 lire 4 lire, élire, relire, réélire 98.586161 Like interdire, except: (p.p.) lu, (p.s.) il lut
38 battre 9 battre, abattre, combattre, débattre, rabattre 98.718538 Like -re, except: tu bats
39 recevoir 9 recevoir, apercevoir, décevoir, concevoir, per... 98.845991 Like -ir, except: (p.p.) reçu, tu reçois, nous...
40 ennuyer 52 ennuyer, nettoyer, appuyer, noyer, employer 98.964911 Like -er, except: tu ennuies, ils ennuient, il...
41 valoir 2 valoir, équivaloir 99.070811 (irregular)
42 suffire 1 suffire 99.173942 Like interdire, except: (p.p.) suffi
43 rire 2 rire, sourire 99.260244 Like -re, except: (p.p.) ri, nous rions, ils r...
44 courir 9 courir, parcourir, secourir, accourir, recourir 99.339511 Like -ir, except: (p.p.) couru, tu cours, il c...
45 taire 1 taire 99.408153 Like -re, except: (p.p.) tu, nous taisons, ils...
46 fuir 2 fuir, enfuir 99.471239 Like -ir, except: tu fuis, nous fuyons, ils fu...
47 naître 1 naître 99.522834 Like connaître, except: (p.p.) né, (p.s.) il n...
48 ficher 1 ficher 99.566318 Like -er, except: (p.p.) fiché/fichu
49 convaincre 3 convaincre, vaincre, reconvaincre 99.603891 Like -re, except: je convaincs, tu convaincs, ...
50 interdire 6 interdire, prédire, contredire, médire, adire 99.639585 Like -re, except: (p.p.) interdit, nous interd...
51 prévoir 1 prévoir 99.674342 Like voir, except: il prévoira
52 pleuvoir 1 pleuvoir 99.704196 (defective)
53 parfaire 1 parfaire 99.730887 (defective)
54 haïr 1 haïr 99.755520 Like -ir (-iss-), except: tu hais, (p.s.) il h...
55 accueillir 3 accueillir, cueillir, recueillir 99.779695 Like -ir, except: j'accueille, tu accueilles, ...
56 bénir 1 bénir 99.801364 Like -ir (-iss-), except: (p.p.) béni/bénit
57 faillir 1 faillir 99.821006 (irregular)
58 résoudre 1 résoudre 99.839200 Like -re, except: (p.p.) résolu, tu résous, no...
59 conclure 2 conclure, exclure 99.856353 Like -re, except: (p.p.) conclu, (p.s.) il con...
60 conquérir 6 conquérir, acquérir, requérir, reconquérir, en... 99.870845 Like -ir, except: (p.p.) conquis, tu conquiers...
61 distraire 10 distraire, extraire, traire, soustraire, rentr... 99.885199 (defective)
62 pourvoir 1 pourvoir 99.894749 Like prévoir, except: (p.s.) il pourvut
63 douer 1 douer 99.903770 (defective)

63 rows × 5 columns

In [21]:
plt.plot(conjugators.index.values, conjugators.freqfilms2)
plt.title('Verb Coverage by Conjugator')
#plt.legend(loc = 'lower right')
plt.xlabel('Verb conjugations known')
plt.ylabel('% coverage')
plt.ylim((0,100))
plt.xlim((1,60))
Out[21]:
(1, 60)