Phonetic similarity lookup

(Incomplete!)

In a previous notebook, we discussed how to quickly find words with meanings similar to other words. In this notebook, I demonstrate how to find words that sound like other words.

I'm going to make use of some of my recent research in phonetic similarity. The algorithm I made uses phoneme transcriptions from the CMU pronouncing dictionary along with information about articulatory/acoustic features of those phonemes to produce vector representations of the sound of every word in the dictionary.

In this notebook, I show how to build a fast approximate nearest-neighbor lookup of words by their phonetic similarity. Then I show a few potential applications of that lookup in creative language generation, plus a bit of vector arithmetic.

Prerequisites

You'll need the numpy, scipy, spacy and simpleneighbors packages to run this code, along with spaCy's en_core_web_md model.

In [260]:
import random
from collections import defaultdict
import numpy as np
import spacy
from simpleneighbors import SimpleNeighbors

You can download the phonetic similarity vectors using the following command:

In [275]:
!curl -L -s https://github.com/aparrish/phonetic-similarity-vectors/blob/master/cmudict-0.7b-simvecs?raw=true >cmudict-0.7b-simvecs

Loading the data

Each line of the vector file contains a word in upper case, two spaces, and then 50 space-separated numbers. For example:

WARNER  -1.178800 1.883123 -1.101779 -0.698869 -0.109708 -0.482693 -0.291353 1.179281 0.191032 -1.192597 -0.684268 -1.132983 0.072473 -0.626924 0.569412 -1.639735 -3.000464 -1.414111 1.806220 -1.075352 1.274347 -0.111253 0.675737 -0.579840 -1.111530 -0.960682 -1.664172 0.872162 1.311749 -0.182414 3.062428 -1.333462 1.375817 0.947289 1.699605 1.799368 2.434342 0.382153 0.383062 2.583699 -0.756335 1.862328 -0.189235 -2.033432 -0.609034 -0.782589 0.394311 -1.056266 -1.288209 0.055472
In [41]:
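word_vecs = []
with open("./cmudict-0.7b-simvecs", encoding='latin1') as f:
    for line in f:
        line = line.strip()
        # the word and its vector are separated by two spaces
        word, vec = line.split("  ")
        # drop trailing alternate-pronunciation markers such as "(1)", then lowercase
        word = word.rstrip('(0123)').lower()
        vec = tuple(float(n) for n in vec.split())
        word_vecs.append((word, vec))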
In [42]:
len(word_vecs)
Out[42]:
133859
In [43]:
group_by_vec = defaultdict(list)
for word, vec in word_vecs:
    group_by_vec[vec].append(word)
In [44]:
len(group_by_vec)
Out[44]:
113694
In [52]:
lookup = {}
for word, vec in word_vecs:
    # keep only the first vector seen for each word
    if word in lookup:
        continue
    lookup[word] = np.array(vec)
In [53]:
len(lookup)
Out[53]:
125071
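
The three counts above differ because some words appear more than once in the file (alternate pronunciations) and some distinct words share a vector. Here's a toy sketch of the same bookkeeping. The words and 2-dimensional vectors are made up for illustration, and the `toy_` names are hypothetical, not part of the notebook.

```python
from collections import defaultdict

# Made-up 2-dimensional vectors; the real file has 50 numbers per word.
toy_word_vecs = [
    ("read", (1.0, 0.0)),   # pronounced like "reed"
    ("read", (0.0, 1.0)),   # alternate pronunciation, like "red"
    ("reed", (1.0, 0.0)),
    ("red", (0.0, 1.0)),
    ("cat", (0.5, 0.5)),
]

# Group words by vector: identical vectors collapse into one entry.
toy_group_by_vec = defaultdict(list)
for word, vec in toy_word_vecs:
    toy_group_by_vec[vec].append(word)

# Keep only the first vector seen for each word.
toy_lookup = {}
for word, vec in toy_word_vecs:
    toy_lookup.setdefault(word, vec)

# entries, distinct vectors, distinct words
print(len(toy_word_vecs), len(toy_group_by_vec), len(toy_lookup))
```

Five entries become three distinct vectors and four distinct words, which mirrors the relationship between the three counts from the real data.
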
In [18]:
nlp = spacy.load('en_core_web_md')
In [37]:
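# index the 50-dimensional vectors, one entry per distinct vector
nns = SimpleNeighbors(50)
for vec, words in group_by_vec.items():
    # label each vector with the word that sorts first by spaCy log-probability
    sort_by_prob = sorted(words, key=lambda x: nlp.vocab[x].prob)
    nns.add_one(sort_by_prob[0], vec)
# build the index with 50 trees
nns.build(50)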
In [59]:
nns.nearest(lookup['parrish'])
Out[59]:
['parrish',
 'perished',
 'parish',
 'buresh',
 'parrishes',
 "paris'",
 'barrish',
 'marish',
 'cherish',
 'perishing',
 'garish',
 'maresh']
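
For intuition, here's a sketch of the exact search that the approximate index stands in for: score every word against the query and keep the best. It assumes the index ranks by angular (cosine-based) distance, and it uses made-up 3-dimensional vectors in plain Python. `cosine_sim`, `brute_force_nearest` and `toy_vectors` are hypothetical names for this illustration only.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_nearest(query, vectors, n=3):
    """Return the n words whose vectors are most similar to query."""
    ranked = sorted(vectors,
                    key=lambda w: cosine_sim(query, vectors[w]),
                    reverse=True)
    return ranked[:n]

# Made-up vectors, not taken from the real data file.
toy_vectors = {
    "parrish": (1.0, 0.2, 0.0),
    "parish": (0.9, 0.3, 0.0),
    "cherish": (0.6, 0.6, 0.1),
    "zebra": (-0.5, 0.1, 0.9),
}

nearest_to_parrish = brute_force_nearest(toy_vectors["parrish"], toy_vectors)
print(nearest_to_parrish)
```

This compares the query against every entry, which gets slow with over a hundred thousand words. The approximate index trades a little accuracy for much faster lookups.
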

Random walk

In [69]:
current = 'allison'
for i in range(50):
    print(current, end=" ")
    current = random.choice(nns.nearest(lookup[current])[1:])
allison allinson allison's ilalis's alissa isolate oscillate ocelot assad facade futch sutch suss tussle solicitous solicits tussle tussles suss genesis janus dishon zisson sissom cynicism nissei taisei chace chaste tastes chests sets stet stetz test's pests pastes missteps misstates pastes misstates misstates allstate's tastes pastes mists mistrust trust's strutz constructs 
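
The walk above repeatedly jumps from the current word to a randomly chosen neighbor. The `[1:]` slice drops the first result, which is normally the current word itself (or another word sharing its vector). Here's a minimal sketch of that loop on a hand-written neighbor table. The table, the `random_walk` function and the fixed seed are all assumptions for illustration; in the notebook the lists come from `nns.nearest()`.

```python
import random

# Hypothetical neighbor table: each list starts with the word itself,
# the way a nearest-neighbor query usually returns the query first.
toy_neighbors = {
    "test": ["test", "tests", "pest"],
    "tests": ["tests", "test", "pests"],
    "pest": ["pest", "test", "pests"],
    "pests": ["pests", "pest", "tests"],
}

def random_walk(start, neighbors, steps, rng):
    """Return a list of `steps` words, each a neighbor of the one before."""
    path = [start]
    for _ in range(steps - 1):
        # skip index 0 so the walk never stays in place
        path.append(rng.choice(neighbors[path[-1]][1:]))
    return path

walk = random_walk("test", toy_neighbors, 10, random.Random(0))
print(" ".join(walk))
```
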

Replacement

In [70]:
# frost.txt is assumed to be a local plain-text copy of Robert Frost's "The Road Not Taken"
frost_doc = nlp(open("frost.txt").read())
In [71]:
output = []
for word in frost_doc:
    if word.text.lower() in lookup:
        new_word = random.choice(nns.nearest(lookup[word.text.lower()]))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
thuy lodes diverging inning eh colello woodward,
unbend sarra eh toogood knoche travels boeve
ende gyi awan traveller, lall aah stowed
edmond tookes downtime urwin ass farb ige ee couldn't
khuu hewell h. belt engh leitha undergrowth;

then cookout the futher, as juts eye's fier,
anand misbehaving hypercard uther geter claymore,
because h. wah's glassey edmond wounded beware;
xio ahs four's jass yother gassing geniere
ahead whorl jim relay abide uther simm,

edmond goethe that norling equality loye
in. reeves' mono steppes hedge janardhan brakke.
ayo, eh speck rather thirst form otherness deady!
whet renewing haugh woy needs amman tucci byway,
aux undoubted f. oooh shooed ivor cupp gapp.

uhh schill bedke tailing matthes whiz oooh thigh
somewheres gauges ende eases rench:
toupee inroads diverged inning aue woodis, unland I—
aigner put leitha one letsch raveled baye,
odland sajak has mabe lall the difference.

Tinting sound

In [73]:
frost_doc = nlp(open("frost.txt").read())
In [87]:
tint_word = 'soap'
tint_vec = lookup[tint_word]
tint_factor = 0.4
In [88]:
output = []
for word in frost_doc:
    if word.text.lower() in lookup:
        vec = lookup[word.text.lower()]
        target_vec = (vec * (1-tint_factor)) + (tint_vec * tint_factor)
        new_word = random.choice(nns.nearest(target_vec))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
tope rogues survivor's in. aue yellow woodwork,
anand sa aw kote topknot travel busch
england soapy urwin travenol, nall aw stowed
oakland choke downe youn ahs fart edge aue tooke
tope wickware chipote ghent ein posa choke;

them choke ertha suther, ige justo eye's serr,
umland heavy soper otha mater claymore,
soco's schoepf swatch grassi unland footnoted swimwear;
zhou aase fornoff that judge psychopath gehr
hieb sworn zemke nilly taub ertha simm,

earned putsch zag phoning emotionally sope
innate reeves' lomonaco stake heid radant black.
oooh, oie kepp judge scherf fuhr another's jade!
hait lowing hah wye ladd's nahm touche erway,
i. soot efface oooh photo's ever upham tabak.

aw chalet beebe sellick this' which ee saye
footware outages anand rages hentz:
souk rototilles sope ame i. woodwork, earned I—
ae cooke schoepf one selloff sope bip,
odland jap hass mib auld soak referenced.
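
The tinting step is a linear interpolation between each word's vector and the tint word's vector: a factor of 0 leaves the word untouched, and a factor of 1 replaces it with the tint word's sound entirely. Here's a small worked sketch using plain Python lists in place of numpy arrays. The 3-dimensional vectors are made up, and `tint` is a hypothetical helper name.

```python
def tint(vec, tint_vec, factor):
    """Linearly interpolate from vec toward tint_vec by factor (0 to 1)."""
    return [v * (1 - factor) + t * factor for v, t in zip(vec, tint_vec)]

word_vec = [2.0, 0.0, -1.0]   # made-up stand-in for lookup[word]
soap_vec = [0.0, 1.0, 1.0]    # made-up stand-in for lookup['soap']

unchanged = tint(word_vec, soap_vec, 0.0)     # same as word_vec
halfway = tint(word_vec, soap_vec, 0.5)       # midpoint of the two vectors
fully_tinted = tint(word_vec, soap_vec, 1.0)  # same as soap_vec
print(unchanged, halfway, fully_tinted)
```

With the factor of 0.4 used above, each target vector sits a bit closer to the original word than to "soap", so the output keeps some of the poem's sound while leaning toward the tint word.
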

Picking synonyms based on sound

In [98]:
from scipy.spatial.distance import cosine
In [100]:
def cosine_similarity(a, b):
    # scipy's cosine() returns a distance and expects 1-D arrays
    return 1 - cosine(a, b)
In [101]:
cosine_similarity(np.array([1,2,3]), np.array([4,5,6]))
Out[101]:
0.9746318461970761
In [89]:
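# index spaCy's 300-dimensional word vectors
semantic_nns = SimpleNeighbors(300)
for item in nlp.vocab:
    # keep lowercase words that have a vector and a log-probability above -15
    if item.has_vector and item.prob > -15 and item.is_lower:
        semantic_nns.add_one(item.text, item.vector)
semantic_nns.build(50)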
In [255]:
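def soundalike_synonym(word, target_vec, n=5):
    # take the 50 nearest semantic neighbors, keeping those with a phonetic vector
    candidates = [item for item in semantic_nns.nearest(nlp.vocab[word].vector, 50)
                  if item in lookup]
    # return the n candidates whose sound is closest to target_vec
    return sorted(candidates,
                  key=lambda x: cosine_similarity(target_vec, lookup[x]),
                  reverse=True)[:n]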
In [257]:
soundalike_synonym('mastodon', lookup['soap'])
Out[257]:
['chimp', 'hippo', 'toad', 'platypus', 'shark']
In [168]:
semantic_nns.nearest(nlp.vocab['mastodon'].vector, 5)
Out[168]:
['velociraptor', 'dinosaur', 'caveman', 'dino', 'skeleton']
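
Comparing the two results above shows what `soundalike_synonym` does: it starts from a pool of words close in meaning, then re-sorts that pool by closeness in sound to the target. Here's a self-contained sketch of that two-stage idea on made-up data. The candidate list, the 2-dimensional "phonetic" vectors and the `rerank_by_sound` name are all hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stage 1 (assumed already done): words near the query in meaning.
semantic_candidates = ["dinosaur", "velociraptor", "toad", "hippo", "skeleton"]

# Made-up phonetic vectors; "velociraptor" is left out on purpose to stand
# in for a word that has no entry in the pronouncing dictionary.
phonetic = {
    "dinosaur": (0.1, 0.9),
    "toad": (0.9, 0.2),
    "hippo": (0.7, 0.6),
    "skeleton": (-0.4, 0.8),
}
target = (1.0, 0.1)  # stand-in for lookup['soap']

def rerank_by_sound(candidates, phonetic, target, n=2):
    """Stage 2: keep candidates with a phonetic vector, best-sounding first."""
    known = [w for w in candidates if w in phonetic]
    return sorted(known,
                  key=lambda w: cosine_sim(target, phonetic[w]),
                  reverse=True)[:n]

best = rerank_by_sound(semantic_candidates, phonetic, target)
print(best)
```

The size of the candidate pool (50 in the notebook) controls the trade-off: a larger pool finds better sound matches but drifts further from the original meaning.
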
In [259]:
target_vec = lookup['green']
words = random.sample(semantic_nns.corpus, 16)
for item in words:
    print(item, "→", soundalike_synonym(item, target_vec, 1)[0])
fog → grille
willingly → gladly
tolerates → gravitate
casino → grand
farmland → graze
micromanage → discretionary
grok → query
crappy → crap
arguably → greatest
naughty → brunette
prior → preceding
encountered → initially
dandruff → tanning
gendered → transcends
airborne → aircraft
natures → glamour

Soundalike synonym replacement

In [131]:
frost_doc = nlp(open("frost.txt").read())
In [151]:
target_word = 'soap'
target_vec = lookup[target_word]
In [152]:
output = []
for word in frost_doc:
    if word.is_alpha \
            and word.pos_ in ('NOUN', 'VERB', 'ADJ') \
            and word.text.lower() in lookup:
        new_word = random.choice(soundalike_synonym(word.text.lower(), target_vec))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
Two motorists emerged in a silver spruce,
And sorry I could not tours both
And not one oasis, long I looked
And thought down one as far as I not
To where it crook in the foliage;

Then stopped the particular, as just as but,
And thought perhaps the make purported,
Because it did tree and chose duds;
Though as for that the turning there
took pajama them really about the not,

And both that night equally stood
In fig no take took strut violet.
Oh, I stopped the same for another summer!
Yet knows how it resulted on to take,
I admit if I not ever say back.

I abide also surprised this with a sigh
Somewhere child and mos hence:
Two crossing transformed in a laminate, and I—
I saw the same less embark by,
And that could it most the because.
