Reading programming pearls. How do you compute fast anagrams?

In [3]:

def signature(word):
    "Computes the signature of a word."
    return "".join(sorted(word))

In [4]:

signature("deposit")

Out[4]:

'deiopst'

In [5]:

signature("dopiest")

Out[5]:

'deiopst'

In [6]:

signature("posited")

Out[6]:

'deiopst'

In [7]:

signature("topside")

Out[7]:

'deiopst'

Let's tackle the problem that is stated in the programming pearls column: let's get a big file full of words and find all anagram groups inside it.

We'll download the big file of words from Peter Norvig's website.

In [9]:

import requests

In [10]:

r = requests.get('http://norvig.com/ngrams/count_1w.txt')

In [11]:

Out[11]:

<Response [200]>

In [12]:

len(r.text)

Out[12]:

In [13]:

r.text[:100]

Out[13]:

'the\t23135851162\nof\t13151942776\nand\t12997637966\nto\t12136980858\na\t9081174698\nin\t8469404971\nfor\t5933321'

In [14]:

words = [line.split('\t')[0] for line in r.text.split('\n')]

In [15]:

len(words)

Out[15]:

In [16]:

words[:10]

Out[16]:

['the', 'of', 'and', 'to', 'a', 'in', 'for', 'is', 'on', 'that']

In [17]:

signatures = [signature(word) for word in words]

Now, let's build a dictionary using the signatures:

In [24]:

anagrams = {}
for word in words:
    s = signature(word)
    if s in anagrams:
        anagrams[s].append(word)
    else:
        anagrams[s] = [word]

In [35]:

sorted(anagrams.keys(), key=lambda k: len(anagrams[k]), reverse=True)[:120]

Out[35]:

['aerst',
 'acrs',
 'abcs',
 'acis',
 'acms',
 'acps',
 'acst',
 'aips',
 'cips',
 'aces',
 'ahins',
 'aeprs',
 'apst',
 'adir',
 'adnor',
 'ainst',
 'acir',
 'aeps',
 'aims',
 'aist',
 'acip',
 'acds',
 'cops',
 'airst',
 'ails',
 'ains',
 'acmr',
 'arst',
 'aorst',
 'gglloo',
 'eorst',
 'ceis',
 'einst',
 'aest',
 'aeimnr',
 'amps',
 'acmp',
 'eist',
 'cers',
 'adis',
 'amst',
 'aens',
 'aelst',
 'acls',
 'eimst',
 'cims',
 'acpr',
 'aems',
 'cmps',
 'ceps',
 'imst',
 'acert',
 'acfs',
 'cest',
 'acpt',
 'aemns',
 'abis',
 'eims',
 'acdp',
 'acim',
 'adenr',
 'acep',
 'acehrs',
 'eops',
 'acns',
 'cios',
 'aers',
 'eips',
 'aels',
 'airs',
 'aenrt',
 'aenrs',
 'adem',
 'acrt',
 'cdis',
 'aelps',
 'aegnr',
 'alps',
 'prst',
 'cems',
 'eirst',
 'aeiln',
 'amsu',
 'eggloo',
 'acenr',
 'acin',
 'imps',
 'afis',
 'aeirs',
 'acdi',
 'adin',
 'ades',
 'airt',
 'dein',
 'cprs',
 'acil',
 'enort',
 'aert',
 'aemnor',
 'cpst',
 'aelm',
 'aipr',
 'acet',
 'cdps',
 'acel',
 'afir',
 'eirt',
 'inos',
 'aprt',
 'eirs',
 'astu',
 'egilo',
 'achs',
 'eerst',
 'cdip',
 'agnor',
 'acfi',
 'eegnr',
 'aarst',
 'deir']

In [36]:

anagrams['adnor']

Out[36]:

['radon',
 'doran',
 'adorn',
 'ronda',
 'norad',
 'andro',
 'daron',
 'andor',
 'rodan',
 'rando',
 'nardo',
 'dorna',
 'drano',
 'narod',
 'nador',
 'donar',
 'ondra',
 'adron',
 'ardon',
 'drona',
 'arond']

Finally, let's do a little scatter plot of lenght of word vs number of anagrams. Let's use pandas for that.

In [37]:

import pandas as pd

In [39]:

df =

In [41]:

df.shape

Out[41]:

(243439,)

In [45]:

df = pd.DataFrame(pd.Series(anagrams),
            columns=['words'])

In [46]:

df['signature_length'] = [len(item) for item in df.index]

In [47]:

df['word_count'] = [len(item) for item in df.words]

In [49]:

df

Out[49]:

	words	signature_length	word_count
	[]	0	1
a	[a]	1	1
aa	[aa]	2	1
aaa	[aaa]	3	1
aaaa	[aaaa]	4	1
aaaaaaaaaahhhhhhhhhh	[hahahahahahahahahaha]	20	1
aaaaaaaabbllmm	[alabamaalabama]	14	1
aaaaaaaaddeeeeggggillmnnoprtuuuu	[grenadaguadeloupeguatemalaguiana]	32	1
aaaaaaaahhhhhhhh	[hahahahahahahaha]	16	1
aaaaaaahhhhhhh	[hahahahahahaha]	14	1
aaaaaaahnnnnnrty	[ananthanarayanan]	16	1
aaaaaaaknnnnrrsy	[sankaranarayanan]	16	1
aaaaaabbbehiimmrswzz	[saharazambiazimbabwe]	20	1
aaaaaabddeemsttt	[databasemetadata]	16	1
aaaaaaccddnn	[canadacanada]	12	1
aaaaaacceeefgilllmnnorrssstttyy	[acetylgalactosaminyltransferase]	31	1
aaaaaacefghiillmrrssty	[facialsmargaritaashley]	22	1
aaaaaagmnrsy	[arasanayagam]	12	1
aaaaaahhhhhh	[hahahahahaha]	12	1
aaaaaakkllss	[alaskaalaska]	12	1
aaaaaannnrstyy	[satyanarayanan]	14	1
aaaaaannrstyy	[satyanarayana]	13	1
aaaaabbcdrr	[abracadabra]	11	1
aaaaabbceiklnrs	[caribbeanalaska]	15	1
aaaaabbilmmnrsu	[balasubramaniam]	15	1
aaaaabbilmnnrsu	[balasubramanian]	15	1
aaaaabblmmnrsuy	[balasubramanyam]	15	1
aaaaabbmt	[bambaataa]	9	1
aaaaabccceeegghhiiilnoorrrtuvzz	[herzegovinabulgariacroatiaczech]	31	1
aaaaabcdddeeeglmnnnrrstu	[bermudacanadagreenlandst]	24	1
...	...	...	...
wxy	[wyx, wxy]	3	2
wxyz	[wxyz]	4	1
wxz	[zxw]	3	1
wy	[wy, yw]	2	2
wz	[wz, zw]	2	2
wzz	[zzw]	3	1
x	[x]	1	1
xx	[xx]	2	1
xxx	[xxx]	3	1
xxxx	[xxxx]	4	1
xxy	[xxy, xyx]	3	2
xxz	[zxx, xxz]	3	2
xy	[xy, yx]	2	2
xyy	[xyy, yxy]	3	2
xyyzz	[xyzzy]	5	1
xyz	[xyz, zyx, zxy]	3	3
xz	[zx, xz]	2	2
xzz	[zzx, zxz]	3	2
y	[y]	1	1
yy	[yy]	2	1
yyy	[yyy]	3	1
yyyy	[yyyy]	4	1
yyz	[yyz]	3	1
yyzz	[zzyy]	4	1
yz	[yz, zy]	2	2
yzz	[zzy, zyz, yzz]	3	3
z	[z]	1	1
zz	[zz]	2	1
zzz	[zzz]	3	1
zzzz	[zzzz]	4	1

243439 rows × 3 columns

In [50]:

%matplotlib inline

In [68]:

df.signature_length.plot.hist()

Out[68]:

<matplotlib.axes._subplots.AxesSubplot at 0x11e060fd0>

In [69]:

df.word_count.plot.hist()

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0x11b603390>

In [53]:

df.plot.scatter(x='signature_length', y='word_count')

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x11b4944e0>

Let's conclude:

long words have few anagrams
too short words have few anagrams
words between 4 and 12 letters have a couple of anagrams