Reading programming pearls. How do you compute fast anagrams?
def signature(word):
"Computes the signature of a word."
return "".join(sorted(word))
signature("deposit")
'deiopst'
signature("dopiest")
'deiopst'
signature("posited")
'deiopst'
signature("topside")
'deiopst'
Let's tackle the problem that is stated in the programming pearls column: let's get a big file full of words and find all anagram groups inside it.
We'll download the big file of words from Peter Norvig's website.
import requests
r = requests.get('http://norvig.com/ngrams/count_1w.txt')
r
<Response [200]>
len(r.text)
4956241
r.text[:100]
'the\t23135851162\nof\t13151942776\nand\t12997637966\nto\t12136980858\na\t9081174698\nin\t8469404971\nfor\t5933321'
words = [line.split('\t')[0] for line in r.text.split('\n')]
len(words)
333334
words[:10]
['the', 'of', 'and', 'to', 'a', 'in', 'for', 'is', 'on', 'that']
signatures = [signature(word) for word in words]
Now, let's build a dictionary using the signatures:
anagrams = {}
for word in words:
s = signature(word)
if s in anagrams:
anagrams[s].append(word)
else:
anagrams[s] = [word]
sorted(anagrams.keys(), key=lambda k: len(anagrams[k]), reverse=True)[:120]
['aerst', 'acrs', 'abcs', 'acis', 'acms', 'acps', 'acst', 'aips', 'cips', 'aces', 'ahins', 'aeprs', 'apst', 'adir', 'adnor', 'ainst', 'acir', 'aeps', 'aims', 'aist', 'acip', 'acds', 'cops', 'airst', 'ails', 'ains', 'acmr', 'arst', 'aorst', 'gglloo', 'eorst', 'ceis', 'einst', 'aest', 'aeimnr', 'amps', 'acmp', 'eist', 'cers', 'adis', 'amst', 'aens', 'aelst', 'acls', 'eimst', 'cims', 'acpr', 'aems', 'cmps', 'ceps', 'imst', 'acert', 'acfs', 'cest', 'acpt', 'aemns', 'abis', 'eims', 'acdp', 'acim', 'adenr', 'acep', 'acehrs', 'eops', 'acns', 'cios', 'aers', 'eips', 'aels', 'airs', 'aenrt', 'aenrs', 'adem', 'acrt', 'cdis', 'aelps', 'aegnr', 'alps', 'prst', 'cems', 'eirst', 'aeiln', 'amsu', 'eggloo', 'acenr', 'acin', 'imps', 'afis', 'aeirs', 'acdi', 'adin', 'ades', 'airt', 'dein', 'cprs', 'acil', 'enort', 'aert', 'aemnor', 'cpst', 'aelm', 'aipr', 'acet', 'cdps', 'acel', 'afir', 'eirt', 'inos', 'aprt', 'eirs', 'astu', 'egilo', 'achs', 'eerst', 'cdip', 'agnor', 'acfi', 'eegnr', 'aarst', 'deir']
anagrams['adnor']
['radon', 'doran', 'adorn', 'ronda', 'norad', 'andro', 'daron', 'andor', 'rodan', 'rando', 'nardo', 'dorna', 'drano', 'narod', 'nador', 'donar', 'ondra', 'adron', 'ardon', 'drona', 'arond']
Finally, let's do a little scatter plot of lenght of word vs number of anagrams. Let's use pandas for that.
import pandas as pd
df =
df.shape
(243439,)
df = pd.DataFrame(pd.Series(anagrams),
columns=['words'])
df['signature_length'] = [len(item) for item in df.index]
df['word_count'] = [len(item) for item in df.words]
df
words | signature_length | word_count | |
---|---|---|---|
[] | 0 | 1 | |
a | [a] | 1 | 1 |
aa | [aa] | 2 | 1 |
aaa | [aaa] | 3 | 1 |
aaaa | [aaaa] | 4 | 1 |
aaaaaaaaaahhhhhhhhhh | [hahahahahahahahahaha] | 20 | 1 |
aaaaaaaabbllmm | [alabamaalabama] | 14 | 1 |
aaaaaaaaddeeeeggggillmnnoprtuuuu | [grenadaguadeloupeguatemalaguiana] | 32 | 1 |
aaaaaaaahhhhhhhh | [hahahahahahahaha] | 16 | 1 |
aaaaaaahhhhhhh | [hahahahahahaha] | 14 | 1 |
aaaaaaahnnnnnrty | [ananthanarayanan] | 16 | 1 |
aaaaaaaknnnnrrsy | [sankaranarayanan] | 16 | 1 |
aaaaaabbbehiimmrswzz | [saharazambiazimbabwe] | 20 | 1 |
aaaaaabddeemsttt | [databasemetadata] | 16 | 1 |
aaaaaaccddnn | [canadacanada] | 12 | 1 |
aaaaaacceeefgilllmnnorrssstttyy | [acetylgalactosaminyltransferase] | 31 | 1 |
aaaaaacefghiillmrrssty | [facialsmargaritaashley] | 22 | 1 |
aaaaaagmnrsy | [arasanayagam] | 12 | 1 |
aaaaaahhhhhh | [hahahahahaha] | 12 | 1 |
aaaaaakkllss | [alaskaalaska] | 12 | 1 |
aaaaaannnrstyy | [satyanarayanan] | 14 | 1 |
aaaaaannrstyy | [satyanarayana] | 13 | 1 |
aaaaabbcdrr | [abracadabra] | 11 | 1 |
aaaaabbceiklnrs | [caribbeanalaska] | 15 | 1 |
aaaaabbilmmnrsu | [balasubramaniam] | 15 | 1 |
aaaaabbilmnnrsu | [balasubramanian] | 15 | 1 |
aaaaabblmmnrsuy | [balasubramanyam] | 15 | 1 |
aaaaabbmt | [bambaataa] | 9 | 1 |
aaaaabccceeegghhiiilnoorrrtuvzz | [herzegovinabulgariacroatiaczech] | 31 | 1 |
aaaaabcdddeeeglmnnnrrstu | [bermudacanadagreenlandst] | 24 | 1 |
... | ... | ... | ... |
wxy | [wyx, wxy] | 3 | 2 |
wxyz | [wxyz] | 4 | 1 |
wxz | [zxw] | 3 | 1 |
wy | [wy, yw] | 2 | 2 |
wz | [wz, zw] | 2 | 2 |
wzz | [zzw] | 3 | 1 |
x | [x] | 1 | 1 |
xx | [xx] | 2 | 1 |
xxx | [xxx] | 3 | 1 |
xxxx | [xxxx] | 4 | 1 |
xxy | [xxy, xyx] | 3 | 2 |
xxz | [zxx, xxz] | 3 | 2 |
xy | [xy, yx] | 2 | 2 |
xyy | [xyy, yxy] | 3 | 2 |
xyyzz | [xyzzy] | 5 | 1 |
xyz | [xyz, zyx, zxy] | 3 | 3 |
xz | [zx, xz] | 2 | 2 |
xzz | [zzx, zxz] | 3 | 2 |
y | [y] | 1 | 1 |
yy | [yy] | 2 | 1 |
yyy | [yyy] | 3 | 1 |
yyyy | [yyyy] | 4 | 1 |
yyz | [yyz] | 3 | 1 |
yyzz | [zzyy] | 4 | 1 |
yz | [yz, zy] | 2 | 2 |
yzz | [zzy, zyz, yzz] | 3 | 3 |
z | [z] | 1 | 1 |
zz | [zz] | 2 | 1 |
zzz | [zzz] | 3 | 1 |
zzzz | [zzzz] | 4 | 1 |
243439 rows × 3 columns
%matplotlib inline
df.signature_length.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x11e060fd0>
df.word_count.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x11b603390>
df.plot.scatter(x='signature_length', y='word_count')
<matplotlib.axes._subplots.AxesSubplot at 0x11b4944e0>
Let's conclude: