'I' Before 'E' Except After 'C'

A recent post on Patrick Vennebush blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb "I before E except after C" was "total bullshit." This got me thinking: the "I before E except after C" rule was almost certainly developed without any research at all, just based on the subjective experience of educators. It's not a horrible rule, but certainly we can more intelligently construct better rules of this kind (for lack of a better term, I'll be refering to these as "trigram rules"). We have the technology, and I have nothing better to do with my night off :)

Vocabulary trigram rules

To start with, we're going to need a corpus of words (i.e. a vocabulary) to analyze. I'll be using the wordlist in the NLTK package.

In [1]:
from nltk.corpus import words, brown
import string
from collections import Counter, defaultdict
from itertools import combinations
from __future__ import division

corpus = set(w.lower() for w in words.words())
print len(corpus)

This analysis will proceed by identifying all of the trigrams in my corpus, and then decomposing each trigram into its starting letter and the trailing bigram. For each bigram in the corpus, we want to count the occurence of the bigram, and the frequency with which that bigram is preceded by each respective "starting letter" in all occurences of trigrams. My favorite datatype for this sort of operation is a defaultdict initialized by Counters. What this means is that if I key into my dict for an key that isn't already there, the dict will initialize that key pointing to an empty Counter as the value. This way, I can key into my object for each occurence of a bigram and update the counter corresponding to the associated trigram's starting letter.

You'll notice that I refer to this data structure as a "trie" although I'm not positive this is technically accurate. At the very least, it's similar to a trie.

In [2]:
trigram_trie = defaultdict(Counter)

Next, I'll define a simple function to tokenize an input word into ngrams. I could specify a function to just trigram tokenize, but it's easy enough to generalize this to any ngram. I haven't gotten into it yet, but maybe at a later date I'll want to repeat this experiment with 4-grams or something instead of just focusing on trigrams. In any event, having a simple ngram tokenizing function handy certainly can't hurt.

In [3]:
def ngram_tokenize(word, n=3):
    m = len(word)
    grams = []
    for i in range(m-n+1):
        j = i + n
    return grams

Now that we've got all of the necessary machinery in place, let's process our corpus and tally our trigram rules (unigram->bigram relationshps)

In [4]:
# Process trigrams as unigram->bigram
for word in corpus:
    for trigram in ngram_tokenize(word, n=3):
        start, bigram = trigram[0], trigram[1:]
# Count occurence of bigrams
stats = Counter()
for k,v in trigram_trie.iteritems():
    stats[k] = sum(v.values())

Just to make my life a little easier, I define a function that prints some simple information for a given rule.

In [16]:
# Investigate what makes something a rule
def print_rule_stats(rule, trigram_trie=trigram_trie, stats=stats):
    Print summary statistics for a trigram rule
    ab,c = rule
    a,b = ab
    ba = b+a

    print ab, stats[ab], '\t', c+ab, trigram_trie[ab][c]
    print ba, stats[ba], '\t', c+ba, trigram_trie[ba][c]
    #print c+ab, trigram_trie[ab][c]
    #print c+ba, trigram_trie[ba][c]     

Our analysis corroborates the research that the "'i' before 'e' except after 'c'" rule is not effective from the naive perspective of frequency in the vocabulary (it might be valid if we consider the occurence of the trigram in written language and not just the vocabulary -- since the usage of different words is absolutely not uniform -- but we haven't investigated this). We can contrast this with a better trigram rule to get a sense of what we're looking for.

In [17]:
print "Weak rule"
print "\nStrong rule"
Weak rule
ie 3950 	cie 256
ei 2607 	cei 156

Strong rule
ie 3950 	aie 16
ei 2607 	aei 44

Given a trigram rule of the form "'a' then 'b' except after 'c'," the strength of the rule is relative to the extent with which it exhibits the following properties:

  • the bigram "ab" occurs more frequently in the vocabulary than the bigram "ba"
  • the trigram "cba" occurs more frequently than the trigram "cab"

To rank rules, we'll need some mechanism of scoring them. A simple method I've found reasonably effective for the purposes of this experiment is to filter out rules where the ratio of $count(ab)/count(ba)$ is below some threshhold, and then use the inverse ratio of trigram occurences as the score: $count(cba)/count(cab)$. In our best rules, the trigram cab may not occur in our vocabulary at all, so to mitigate division-by-zero issues, we can Laplace smooth the trigram occurence ratio by adding 1 to both the numerator and the denominator. Our scoring function then becomes:

$$ score([ab,c], k) = \left\{ \begin{array}{lr} \frac{count(cba)}{count(cab)} :& \frac{count(ab)}{count(ba)} \ge k \\ 0 : otherwise \end{array} \right. $$

In [20]:
def score_rule(rule, k=1/2):
    ab,c = rule
    a,b = ab
    if stats[ab]/(stats[ab] + stats[ba]) > k:
        return (trigram_trie[ba][c]+1)/(trigram_trie[ab][c]+1)
        return 0

With the notion of a trigram rule defined and a scoring function in place, we can pipeline the rule identification, scoring and ranking process into a function to allow us to easily identify and rank all trigram rules in a given vocabulary/corpus.

In [9]:
def extract_trigram_rules(corpus, scoring_function=score_rule):
    Extracts "I before E except after C" style 'trigram rules' from an input corpus,
    ranked by the specified scoring function.
    # Process trigrams as unigram->bigram
    trigram_trie = defaultdict(Counter)    
    for word in corpus:
        for trigram in ngram_tokenize(word, n=3):
            start, bigram = trigram[0], trigram[1:]
    # Tally bigrams
    stats = Counter()
    for k,v in trigram_trie.iteritems():
        stats[k] = sum(v.values())
    special_cases = []
    for a, b in combinations(string.lowercase,2):
        #print a, b
        comb1 = a+b
        comb2 = b+a
        if stats[comb2] > stats[comb1]:
            comb1, comb2 = comb2, comb1
        for c in trigram_trie[comb2].keys():
            if trigram_trie[comb2][c] > trigram_trie[comb1][c]:
                rule = [comb1, str(c)]
                special_cases.append([scoring_function(rule), rule])
    return special_cases, trigram_trie, stats

Here are the top 20 trigram rules. The notation ['ab','c'] should be interpreted as "a before b except after c."

Although the highest ranked rule is "P before E, except after C," my favorite is the second one: "E before U, except after Q."

In [21]:
vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x:score_rule(x, k=1/2))

for score, rule in vocab_rules[:5]:
    print rule
    print_rule_stats(rule, trigram_trie, stats)
['pe', 'c']
pe 8052 	cpe 0
ep 5053 	cep 955

['eu', 'q']
eu 2620 	qeu 0
ue 1981 	que 949

['ic', 'i']
ic 26140 	iic 1
ci 6561 	ici 1830

['te', 'm']
te 27265 	mte 2
et 11743 	met 2684

['rd', 'n']
rd 3641 	nrd 0
dr 2738 	ndr 808

If we modify the scoring heuristic to filter on a more aggressive cutoff threshold for the bigram ratio, different rules get highlighted.

In [24]:
vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x:score_rule(x, k=3/4))

for score, rule in vocab_rules[:5]:
    print rule
    print_rule_stats(rule, trigram_trie, stats)
['ic', 'i']
ic 26140 	iic 1
ci 6561 	ici 1830

['ns', 's']
ns 6442 	sns 0
sn 1121 	ssn 298

['og', 'a']
og 7578 	aog 2
go 2459 	ago 619

['ex', 'e']
ex 1688 	eex 0
xe 398 	exe 201

['ng', 'n']
ng 13024 	nng 1
gn 1767 	ngn 373

Corpus trigram rules

The above analysis determined trigram rules that are useful across the english vocabulary. But really, the aim of these rules is to help us with our day-to-day spelling, so it might be more useful to conduct the analysis scoring trigrams based not on their appearance in unique words, but in their actual occurence in written English. For this purpose, a useful corpus of reasonably size is the Brown corpus.

In [12]:
brown_corpus = [w.lower() for w in brown.words()]
In [13]:
print "Total words:", len(brown_corpus) # plus a few grammatical tokens
print "Unique words:", len(set(brown_corpus))
print "\nMost common tokens:"
Total words: 1161192
Unique words: 49815

Most common tokens:
[(u'the', 69971),
 (u',', 58334),
 (u'.', 49346),
 (u'of', 36412),
 (u'and', 28853),
 (u'to', 26158),
 (u'a', 23195),
 (u'in', 21337),
 (u'that', 10594),
 (u'is', 10109)]
In [25]:
brown_rules, brown_trigrams, brown_stats = extract_trigram_rules(brown_corpus)

for score, rule in brown_rules[:5]:
    print rule
    print_rule_stats(rule, brown_trigrams, brown_stats)
['pe', 'c']
pe 12235 	cpe 0
ep 6038 	cep 891

['ic', 'i']
ic 24054 	iic 0
ci 7575 	ici 1592

['te', 'm']
te 39262 	mte 2
et 15913 	met 1785

['rd', 'n']
rd 7135 	nrd 0
dr 1305 	ndr 377

['iv', 'i']
iv 9448 	iiv 0
vi 6956 	ivi 1874

We've established that the "I before E" rule doesn't hold up when we just look at the vocabulary, but in the context of the distribution of words in written language is the rule salvageable?

In [15]:
print_rule_stats(['ie','c'], brown_trigrams, brown_stats)
ie 13275
ei 5677
cie 1310
cei 485

We've been lied to for so long... I don't know what's real anymore.