(Note: Rough draft! Notes incomplete. FIXME: Needs example at the end for predicting from user-supplied text.)
In this notebook, I'm going to take you through a couple of simple, well-known techniques for exploring small sequences of text—like Tarot interpretations: turning text into vectors with pre-trained word vectors, finding nearest neighbors, visualizing the results with t-SNE, and clustering with K-Means.
The goal is to better understand the semantic structure of oracle decks.
I'm going to use spaCy extensively, both as a way to parse text into sentences and also as a source for pre-trained word vectors. Make sure you have it installed, along with the en_core_web_md or en_core_web_lg models. If you're using Anaconda, you can install spaCy and the English language model with the following commands:
conda install -c conda-forge spacy
python -m spacy download en_core_web_md
This notebook also assumes that you have scikit-learn and numpy installed (these come with standard Anaconda installations).
We'll also be using a library called simpleneighbors. To install this, try:
pip install simpleneighbors[annoy]
If that doesn't work (e.g., you get an error about a missing compiler), try:
pip install simpleneighbors[sklearn]
Once you're done installing, you should be able to import the modules with the cells below:
import numpy as np
import spacy
from simpleneighbors import SimpleNeighbors
import gzip, json, random
The following cell loads spaCy's language model (this might take a sec):
nlp = spacy.load('en_core_web_md')
Download this file and put it into the same directory as this notebook. (If you cloned the repository, it's already there!) The file contains structured data about the 78 cards in the Tarot deck and their interpretations. It's in JSON format, so we have to parse it with Python's json library to make the data available:
import json
tarot_data = json.load(open("tarot_interpretations.json"))
The structure of the data is a little bit tricky, and there's more in there than we need for our experiments in this notebook. The cell below creates a list of tuples associating each tarot card with its interpretation. (The interpretations are created by joining together the values for the fortune_telling key in the JSON file for each card.)
tarot_cards = []
for item in tarot_data['tarot_interpretations']:
    tarot_cards.append((item['name'], "; ".join(item['fortune_telling'])))
The following cell does a simple Tarot spread, sampling three cards at random and showing them with their interpretations:
random.sample(tarot_cards, 3)
[('nine of coins', "Until you appreciate what you have, you won't have any luck getting more"), ('The Emperor', 'A father figure arrives; A new employer or authority figure will give you orders; Expect discipline or correction in the near future'), ('four of wands', 'Someone is watching and evaluating your work; You may get a wedding invitation soon')]
For convenience, the following cell creates a dictionary that makes it easy to look up the interpretation for a given card:
tarot_lookup = dict(tarot_cards)
Look up the meaning of the two of coins:
tarot_lookup['two of coins']
"It's time to balance the budget; Avoid the temptation to spend critical funds on frivolous goods"
(Note that I don't especially agree with many of the interpretations in this JSON file! This is an especially poor interpretation in my opinion. But these interpretations are what we have to work with, so let's move forward.)
Finally, a list of just the card names:
tarot_labels = [item[0] for item in tarot_cards]
An immediate goal of this notebook is to determine which Tarot cards have similar interpretations. To do this, we need a measure of semantic similarity: how close are two stretches of text in meaning? For computational purposes, we need to quantify this similarity, and ideally we'd like to be able to visualize it. But how do we go about doing this?
Contemporary research in natural language processing offers a solution: the word vector. You can read and follow along with my notebook explaining how word vectors work before continuing, but the short version is that a word vector associates a word with a coordinate in space. Words with similar meanings will have coordinates that are close to each other. (You can think of these coordinates as being on a Cartesian plane (with an X/Y axis), though in reality each coordinate has many more than two dimensions.)
The word vectors that we're using are the Stanford GloVe vectors included with spaCy. These vectors were produced through an algorithmic process that looked at a large corpus of text and recorded all of the contexts that each word occurred in. The vectors are a compressed representation of this list of contexts for each word, so that words that occur in similar contexts have similar vectors. According to the distributional hypothesis, words that occur in similar contexts have similar meanings, so vectors representing contexts can serve as a way to represent meanings of words.
Word vectors are not perfect; they carry with them the biases of the corpora they're trained on. But they have several useful properties that afford interesting techniques in computational language arts.
The following cell defines a function vec() which makes it easy to get the word vector for a particular word:

def vec(s):
    return nlp.vocab[s].vector
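Under the hood, "closeness" between vectors is usually measured with cosine similarity: the cosine of the angle between the two vectors. Here's a minimal sketch with made-up 3-dimensional numpy vectors (the vectors vec() returns have 300 dimensions; the names and values below are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 means "same direction"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up 3-dimensional "word vectors" (real spaCy vectors have 300 dimensions)
card = np.array([1.0, 2.0, 0.5])
tarot = np.array([0.9, 2.1, 0.4])   # points in nearly the same direction as card
sushi = np.array([-2.0, 0.1, 3.0])  # points somewhere else entirely

print(cosine_similarity(card, tarot))  # close to 1.0
print(cosine_similarity(card, sushi))  # negative: dissimilar
```

spaCy's own .similarity() method computes essentially this same quantity on the real vectors.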
Here's what a vector looks like:
vec('cards')
array([-0.54786 , -0.060409 , -0.34341 , -0.78909 , 0.76913 , 0.062576 , -0.36196 , -0.067923 , -0.24321 , 1.4941 , -0.0050086, 0.13367 , 0.044146 , -0.15426 , -0.083489 , -0.085395 , -0.9413 , 1.6464 , -0.02296 , -0.08399 , -0.19905 , 0.33439 , -0.12289 , -0.030612 , 0.29691 , -0.35093 , -0.021799 , -0.036572 , -0.66472 , -0.0036923, -0.15867 , -0.22094 , -0.096175 , 0.021781 , -0.3915 , 0.54463 , -0.43797 , -0.21985 , -0.23609 , 0.13824 , -0.15661 , 0.14144 , 0.019459 , -0.66783 , -0.05219 , 0.012071 , -0.46731 , -0.13634 , -0.055697 , 0.39305 , -0.11922 , -0.061792 , 0.56273 , -0.53199 , -0.14907 , -0.11057 , 0.31247 , 0.30131 , -0.43884 , 0.11746 , 0.07021 , 0.33124 , 0.39057 , -0.069946 , 0.15441 , 0.46396 , 0.0040996, 1.0562 , 0.36647 , -0.28721 , 0.082222 , 0.00213 , 0.34582 , -0.0446 , 0.46781 , 0.66558 , 0.43319 , -0.76897 , -0.050817 , 0.90723 , -0.26064 , 0.41021 , 0.32019 , -0.018988 , 0.18413 , 0.027466 , 0.2929 , 0.66606 , 0.24767 , -0.42119 , 0.18377 , 0.18996 , -0.28091 , -0.38211 , 0.37067 , -0.46318 , 0.26522 , -0.072179 , -0.19687 , -0.074965 , -0.63569 , -0.66968 , -0.88611 , 0.14429 , 0.12875 , -0.64857 , -0.045062 , 0.35845 , 0.46038 , -0.0095807, 0.50231 , -0.066645 , 0.09579 , -0.27013 , 0.39316 , -0.13675 , 0.047418 , 0.36256 , 0.072575 , 0.096322 , 0.75857 , -0.3371 , 0.31418 , 0.09372 , 0.079163 , 0.2131 , 0.11307 , 0.46478 , 0.21549 , -0.04808 , -0.30878 , 0.20758 , 0.34154 , 0.20912 , -0.51854 , -0.39562 , 0.2113 , 0.57655 , -0.27473 , 0.4911 , -1.4924 , 0.49913 , 0.038791 , -0.28311 , -0.70614 , 0.36361 , 0.2137 , 0.14033 , -0.35302 , -0.21158 , 0.13034 , 0.32395 , -0.23138 , 0.39176 , 0.045725 , -0.39346 , 0.35702 , 0.48889 , 0.060095 , -0.069811 , -0.10339 , 0.40331 , 0.516 , 0.21622 , -0.92662 , -0.59798 , 0.30912 , 0.030597 , 0.19242 , -0.094897 , -0.14628 , 0.20332 , 0.48479 , 0.78987 , 0.1215 , -0.14864 , -0.21997 , -0.15274 , 0.44295 , 0.59229 , -0.20717 , 0.14929 , 0.10725 , -0.22542 , -0.1083 , 0.26113 , -0.55752 , -0.030635 , 
-0.33673 , -0.35509 , 0.20269 , 0.033527 , -0.17902 , -0.38712 , 0.080425 , -0.42007 , -0.93471 , 0.45989 , 0.13203 , 0.50639 , 0.95391 , -1.011 , 0.41906 , -0.18694 , 0.27447 , -0.17695 , 0.014367 , 0.21504 , -0.38839 , 0.60929 , 0.75328 , 0.26789 , -0.30174 , -0.12735 , 0.30466 , -0.11706 , 0.024288 , -0.10797 , -0.56404 , 0.13712 , -0.28683 , -0.24417 , 0.070113 , -0.19123 , -1.0564 , -0.1462 , 0.048439 , 0.58187 , 0.087668 , 0.31711 , 0.13831 , 0.24231 , 0.24348 , 0.54646 , 0.0016724, -0.0081518, 0.068331 , -0.21606 , -0.27379 , 0.55945 , 0.2848 , 1.1669 , -0.064809 , 0.29127 , -0.11288 , -0.53852 , 0.12104 , -0.49231 , -0.45929 , -0.0044095, 0.15616 , 0.18031 , -0.70871 , -0.34936 , 0.62035 , 0.4343 , -0.63038 , -0.14185 , -0.42332 , 0.24734 , 0.50841 , 0.066684 , -0.87131 , 0.22859 , 0.072485 , 0.40083 , 0.15505 , -0.047754 , 0.35373 , -0.057295 , -0.85158 , 0.37551 , 0.052956 , 0.53433 , 0.46134 , -0.072809 , -0.25514 , 0.3398 , -0.081816 , -0.58906 , 0.19547 , -0.38552 , -0.1366 , -0.10245 , 0.45415 , 0.57782 , -0.097346 , -0.17491 , -0.36691 , 0.13359 , -1.2082 , 0.92608 , 0.22259 , -0.091107 , 0.34454 , 0.058 , 0.78919 , -0.28031 , 0.37474 , 0.24557 ], dtype=float32)
The following code builds an index of common words and their vectors. You can use this index to look up words considered to be close in meaning to a given word. (We won't use this for anything in the rest of the notebook, but it's fun to play with.)
thesaurus = SimpleNeighbors(300)
for item in nlp.vocab:
    if item.has_vector and item.prob > -15 and item.is_lower:
        thesaurus.add_one(item.text, item.vector)
thesaurus.build(50)
thesaurus.nearest(vec('magician'), 5)
['genie', 'magic', 'magician', 'magical', 'enchantment']
Word vectors work great when we're interested in individual words. More often, though, we're interested in longer stretches of text, like sentences, lines, or paragraphs. If we had a way to represent these longer stretches of text as vectors, we could perform all of the same operations on them that word vectors allow us to perform on words. But how do we represent stretches of text as vectors?
There are lots of different ways! One classic technique in machine learning is to represent each sequence by the frequency of the terms it contains (as in TF-IDF); another is to learn document vectors directly (as in doc2vec). Another is to train a recurrent neural network (like an LSTM) and use its hidden state. Yet another is to use a pre-trained model, like Google's Universal Sentence Encoder.
But a surprisingly effective technique is to simply average together the word vectors for each word in the sentence. A big advantage of this technique is that no further training is needed, beyond the training needed to calculate the word vectors; if you're using pre-trained vectors, even that step can be skipped. You won't get state-of-the-art results on NLP benchmarks with this technique, but it's a good baseline and still useful for many tasks.
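As a sketch of the idea, with made-up 3-dimensional vectors standing in for real 300-dimensional ones (all names and values here are illustrative): the sentence vector is just the element-wise mean of the word vectors.

```python
import numpy as np

# made-up word vectors for the words of a tiny "sentence"
word_vectors = {
    "good":    np.array([0.8, 0.1, 0.3]),
    "luck":    np.array([0.6, 0.4, 0.2]),
    "arrives": np.array([0.1, 0.9, 0.7]),
}

sentence = "good luck arrives"
# element-wise mean across the word vectors: approximately [0.5, 0.467, 0.4]
sentence_vector = np.mean([word_vectors[w] for w in sentence.split()], axis=0)
```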
The function below produces a vector that represents the meaning of a sentence with an average of the vectors for each word in the sentence. The function weights each word's contribution to the average based on the inverse of its frequency in English (according to spaCy's built-in word frequency information). The intention of weighting by inverse frequency is to reduce the contribution of common function words like "the" and "of" to the sentence vector. (This approach was inspired by Arora et al.'s Smooth Inverse Frequency technique, though my approach doesn't remove the principal component.)
def sentence_summary(sent, a=0.001):
    s = nlp(sent, disable=['parser', 'tagger', 'ner'])
    weights = [a / (a + np.exp(tok.prob)) for tok in s]
    emb = np.average([tok.vector for tok in s], axis=0, weights=weights)
    return emb
The code in the cell below computes and returns the embeddings for every string in a list:

def embeddings(text_list):
    return [sentence_summary(item) for item in text_list]
And here we calculate embeddings for every Tarot interpretation:
tarot_embeddings = embeddings([interpretation for card, interpretation in tarot_cards])
And here's what they look like:
rand_idx = random.randrange(len(tarot_embeddings))
tarot_cards[rand_idx], tarot_embeddings[rand_idx]
(('ace of coins', "Your health will improve; The check you're looking for really is in the mail"), array([-7.48437062e-02, 2.38817390e-01, -2.60202422e-01, 6.45209687e-03, -5.76540027e-02, 9.11737805e-04, -7.95287452e-02, -3.14694786e-01, -8.11702077e-02, 2.32743427e+00, -5.00491939e-01, 3.74548861e-02, 1.31327028e-02, -6.07544370e-02, -7.64834004e-02, 9.26227071e-04, -9.57607638e-02, 1.55665597e+00, -3.24453771e-01, 7.33627906e-02, 1.14888679e-02, 1.70832060e-01, -2.99182090e-02, -1.18720762e-01, 5.66484608e-02, 1.36779241e-02, -7.04350782e-02, -1.41377904e-01, 1.98921806e-01, -2.15919173e-01, -7.38209155e-02, -1.32633572e-02, 4.55063377e-02, -1.58357770e-02, 3.85797984e-02, -1.66530833e-03, -3.66343062e-02, 1.22332848e-01, 9.88227520e-03, -5.77911806e-02, -6.06664125e-02, 3.09903999e-02, -1.21920058e-01, -2.43932289e-01, -4.46682993e-02, 9.22677711e-02, -9.96118580e-02, 1.57244214e-01, 3.19178682e-02, 6.43189688e-02, -1.01735060e-01, -1.50214267e-01, 1.51780992e-01, -2.35615949e-01, 5.52883619e-02, 1.63287913e-01, -1.78592247e-02, -1.07295609e-01, -1.05380497e-02, -8.59099143e-02, 4.46896350e-02, -1.61179409e-01, -7.68931289e-02, 2.14964732e-01, 3.30428203e-01, 4.80253013e-02, 2.59878188e-02, 1.17989992e-01, 8.53953369e-02, 1.54478739e-01, -2.44244780e-02, -2.41791573e-02, 3.69427739e-01, -1.33175913e-01, 2.41763305e-01, 1.87899780e-01, 1.32443139e-01, -1.95086423e-01, -2.22632380e-02, 1.33292755e-01, -4.98899809e-02, 1.20693739e-01, -9.51892856e-02, 6.75185254e-02, 4.63139610e-02, -9.31672267e-02, -7.28366242e-02, 1.48728923e-01, 2.99043321e-01, -4.99964930e-03, -4.03147672e-02, 1.77066748e-01, 3.54631857e-02, -1.62081859e-02, 5.19220642e-02, -6.72302465e-02, -1.13579376e-01, -1.43683545e-01, 2.78564691e-02, 7.62348928e-02, -6.65593502e-02, -8.50503551e-02, 6.91155524e-02, -1.59170916e-01, 1.09488480e-01, -8.22551231e-01, 3.65457500e-01, 4.60299493e-02, -8.61159351e-03, -4.36428519e-02, 1.64247674e-01, -1.65095284e-01, 2.05229024e-01, -2.61950875e-02, 
1.82527833e-01, -1.11650097e-01, 1.08528657e-01, -4.38461660e-02, -4.77332941e-02, -2.26382037e-02, 1.70719513e-01, 7.43769102e-03, -1.15183466e-01, 4.10232764e-03, 1.45234198e-01, -8.14227408e-02, 7.00700930e-02, 9.20888089e-03, -2.97235317e-02, -6.03460036e-02, 3.81184236e-02, -1.52229369e-01, -1.96107704e-01, -8.70210721e-03, 1.05284864e-01, 7.56332051e-02, -9.34177743e-02, -1.34808105e-01, -1.27328255e-01, -2.15593590e-02, -9.58447919e-01, 1.55509800e-01, 1.12895698e-01, -1.08795166e-01, -1.58813633e-03, -1.30173494e-01, 3.65858307e-02, 1.10479234e-01, -7.31942739e-02, 1.59150441e-01, -1.51762203e-02, 5.68230947e-02, 9.08937080e-02, 1.14627780e-02, -2.57060010e-02, 1.92023367e-01, 6.19790247e-02, -7.57231737e-02, -2.86259124e-02, -2.07441214e-01, 2.67854245e-02, 2.06206519e-01, -3.82097765e-02, -2.06149368e-01, 8.24402842e-02, -1.32642717e-01, 2.18515685e-01, -1.07890557e-01, 1.54641973e-01, -7.23135646e-02, -9.24776183e-02, -2.48391590e-02, 2.02703915e-01, -4.28118497e-02, -6.24633172e-02, 1.61729463e-01, -2.50933312e-02, 2.23573054e-01, 1.24342019e-01, 3.64926962e-01, -1.01829669e-01, -4.44452489e-02, -1.61869151e-01, 2.25040499e-02, 5.42481704e-02, 1.18308602e-01, -1.23432542e-01, -1.36318297e-01, -2.51408446e-01, -4.44131144e-02, 4.91594177e-02, -1.27359955e-02, 4.97160255e-02, -1.49556022e-01, 7.79283473e-02, 3.10716404e-01, -8.51481380e-02, -2.69857885e-02, 2.08288904e-01, 2.51421121e-01, 1.15344299e-01, -1.91247327e-01, -1.06143072e-01, 4.58277885e-02, 7.00997709e-02, 1.18081542e-01, -4.51600569e-02, 1.35772967e-01, -5.17282925e-03, 4.78049697e-02, -5.67102847e-02, -3.12424448e-01, -8.52369813e-02, -1.95575626e-01, -1.56246210e-01, -6.09950984e-02, 8.69768248e-03, -1.44713961e-02, -4.83891870e-01, -9.00313824e-02, -1.71125145e-01, -9.71028977e-02, -1.88515921e-01, 1.27193583e-01, 3.81317358e-02, -7.79134478e-02, 2.34144180e-02, 1.62707638e-01, 1.08522363e-02, -7.17969703e-02, -9.68405901e-02, 5.41182664e-02, 5.03943266e-02, 7.04710596e-02, 
-2.82413905e-02, 2.05238891e-02, -1.04265870e-04, -2.20987733e-01, -6.54041566e-02, -1.43803114e-02, 5.82879536e-02, -1.90525812e-03, 1.46807064e-01, 1.42797293e-02, 1.97468939e-01, -1.73688021e-01, -1.08334611e-01, -9.03258903e-02, -7.88496215e-02, 5.92143497e-02, 1.21125243e-01, -1.70939167e-01, -1.73801636e-01, -5.62711353e-02, 1.07790899e-01, 3.23222063e-01, -4.65322352e-02, -5.61135153e-02, -5.98254008e-03, -1.79172921e-02, -7.61352234e-02, 1.56493064e-01, 2.49916214e-01, -1.00796355e-02, 9.27975720e-02, -2.08178651e-01, -1.50246486e-01, 2.47436565e-01, 4.48720485e-01, -1.85201411e-02, -4.01863171e-02, -8.27071548e-02, -1.19647022e-01, -5.90888109e-02, 5.19629545e-02, 4.14311134e-02, 2.95730843e-03, 1.17222447e-01, -2.53124451e-02, 1.11358427e-01, -1.23865467e-01, 1.47237513e-02, 1.15925128e-01, 1.49301028e-01, 1.30670394e-02, -1.53877910e-01, 1.30660114e-01, -1.46171688e-01, 1.46446332e-01, -1.28922267e-01, -2.20237411e-01, 2.76028688e-02, 4.93983164e-02, -3.10195190e-02, 4.27544581e-02, -1.77525941e-01, 5.50127771e-02, -1.25325867e-01, -5.63722888e-05, 2.26680504e-01]))
Sentence vectors aren't especially interesting on their own! One thing we can use them for is to build a tiny "search engine" based on semantic similarity between cards. To do this, we need to be able to calculate the distance between a target sentence's vector (not necessarily a sentence from our corpus) and the vectors of the sentences in the corpus, returning them ranked by that distance. We'll use another SimpleNeighbors object to perform fast lookups based on similarity.
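SimpleNeighbors builds an approximate index for speed, but with only 78 cards the underlying idea is easy to sketch by brute force: compute the cosine distance from the query vector to every vector in the corpus and sort. (The vectors and labels below are toy stand-ins; in the notebook, the query vector would come from sentence_summary().)

```python
import numpy as np

def cosine_distance(a, b):
    # 0.0 = identical direction; larger = less similar
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def brute_force_nearest(query, corpus, n=3):
    # corpus is a list of (label, vector) pairs; return the n closest labels
    ranked = sorted(corpus, key=lambda pair: cosine_distance(query, pair[1]))
    return [label for label, v in ranked[:n]]

toy_corpus = [
    ("sun",  np.array([1.0, 0.1])),
    ("moon", np.array([0.2, 1.0])),
    ("star", np.array([0.6, 0.7])),
]
print(brute_force_nearest(np.array([0.9, 0.2]), toy_corpus, n=2))  # ['sun', 'star']
```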
First, I build an approximate nearest neighbors index using the vectors I got from the inverse frequency weighting technique:
tarot_nn = SimpleNeighbors(300)
for v, (card, interpretation) in zip(tarot_embeddings, tarot_cards):
    tarot_nn.add_one(card, v)
tarot_nn.build(50)
The .nearest() method returns the sentences from the corpus whose vectors are closest to the vector you pass in. The code in the cell below uses the sentence_summary() function to return the interpretations most similar to the sentence you type in. The number controls how many results should be returned.
tarot_nn.nearest(sentence_summary("happiness and contentment"), 5)
['ten of cups', 'six of cups', 'The Lovers', 'The Hermit', 'queen of cups']
The same idea, but also prints out the interpretation for the card:
for item in tarot_nn.nearest(sentence_summary("happiness and contentment"), 5):
    print(item)
    print(tarot_lookup[item])
    print()
ten of cups
Marriage and family are in the cards; Expect a friendship to blossom into a romance

six of cups
A stingy spirit is strangling your enjoyment of life; Loosen up and think of others for once, why don't you?

The Lovers
A new personal or professional relationship blossoms; Sexual opportunities abound; Unexpectedly, a friend becomes a lover

The Hermit
A period of loneliness begins; One partner in a relationship departs; A search for love or money proves fruitless

queen of cups
This card represents a woman with an emotional, deeply spiritual nature, likely born between June 11th and July 11th, who uses emotional and spiritual appeals to sway others to her point of view
To get neighbors for a particular card:
tarot_nn.neighbors('The Sun', 5)
['The Sun', 'The Moon', 'five of wands', 'ten of wands', 'nine of wands']
To get neighbors for a random item in the corpus:
tarot_nn.neighbors(random.choice(tarot_nn.corpus), 5)
['Death', 'The Hermit', 'The Hanged Man', 'The Empress', 'six of swords']
A benefit of representing meaning as points in space is that we can perform vector arithmetic on those vectors. One operation of interest is finding the midpoint between two points (i.e., averaging two points). If we find the vectors for two cards, then average those vectors together and find the card nearest to that point, we'll get a card that is "between" the two other cards in terms of their meaning.
The following function averages two vectors:
def average(v1, v2):
    return (np.array(v1) + np.array(v2)) / 2
And then this code prints out the five cards nearest to the midpoint of the two named cards. (Often the first two nearest cards are the source and destination themselves. Weirdly, The Moon tends to be the midpoint in most of my experiments for any two cards!)
src = 'ace of coins'
dest = 'Death'
for item in tarot_nn.nearest(average(tarot_nn.vec(src), tarot_nn.vec(dest)), 5):
    print(item)
    print(tarot_lookup[item])
    print()
ace of coins
Your health will improve; The check you're looking for really is in the mail

Death
A relationship or illness ends suddenly; Limit travel and risk-taking; General gloom and doom

The Moon
Watch for problems at the end of the month; Someone you know needs to howl at the moon more often; Someone is about to change his or her mind about an important decision

Judgement
An old issue you thought was over will come up again today; Get ready for huge changes: break-ups, sudden calls from old friends, and unexpected setbacks; God's trying to get your attention

The Magician
A powerful man may play a role in your day; Your current situation must be seen as one element of a much larger plan
Another thing you can do with sentence vectors is visualize them. But the vectors are large (in our case, 300 dimensions), and there's no obvious mapping from 300 dimensions down to 2-dimensional space. Thankfully, there are a number of algorithms for reducing the dimensionality of vectors. We're going to use t-SNE ("t-distributed stochastic neighbor embedding"), but there are others to experiment with that might be just as good or better for your application (like PCA or UMAP).
from sklearn.manifold import TSNE

mapped_embeddings = TSNE(n_components=2,
                         metric='cosine',
                         init='pca',
                         learning_rate=75,
                         perplexity=15,
                         n_iter=5000,
                         verbose=1).fit_transform(tarot_embeddings)
[t-SNE] Computing 46 nearest neighbors...
[t-SNE] Indexed 78 samples in 0.000s...
[t-SNE] Computed neighbors for 78 samples in 0.005s...
[t-SNE] Computed conditional probabilities for sample 78 / 78
[t-SNE] Mean sigma: 0.125879
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.816658
[t-SNE] KL divergence after 3000 iterations: 0.518916
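For comparison, PCA (mentioned above as an alternative) is deterministic and fast, though it only captures linear structure. A minimal sketch of the equivalent reduction, using random data as a stand-in for tarot_embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# random stand-in for tarot_embeddings: 78 points in 300 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(78, 300))

# project onto the two directions of greatest variance
pca_mapped = PCA(n_components=2).fit_transform(fake_embeddings)
print(pca_mapped.shape)  # (78, 2)
```

In the notebook, you could pass a PCA-reduced array anywhere mapped_embeddings is used.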
The following function draws an image with the results of the t-SNE. (You can control-click or right-click to save it.)
%matplotlib inline
import matplotlib.pyplot as plt

def disp_tsne(embeddings, labels, figsize=12):
    plt.figure(figsize=(figsize, figsize))
    x = embeddings[:, 0]
    y = embeddings[:, 1]
    plt.scatter(x, y)
    for i, item in enumerate(labels):
        plt.annotate(item, (x[i], y[i]))
And the following calls the function with the result of the t-SNE:
disp_tsne(mapped_embeddings, tarot_labels)
The cell below writes the embeddings and their labels out as TSV files (e.g., for loading into a tool like the TensorFlow Embedding Projector):

with open("tarot-emb-proj-vecs.tsv", "w") as fh:
    for item in tarot_embeddings:
        fh.write("\t".join(["%0.5f" % val for val in item]))
        fh.write("\n")
with open("tarot-emb-proj-labels.tsv", "w") as fh:
    fh.write("\n".join(tarot_labels))
In the visualization above, you may have seen some evidence of "clustering"—groups of items that seem to be related. There are algorithms that facilitate finding such clusters automatically. This can be an interesting and valuable way to explore your data—you might find clusters of meaning that you didn't expect.
We're going to use the K-Means clustering algorithm (in particular, scikit-learn's MiniBatchKMeans).
K-Means is an unsupervised algorithm, meaning that you don't need to label the data for it to work. But you do need to specify how many clusters you expect to find:
cluster_count = 5 # adjust this until it starts giving you good results!
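One common heuristic for choosing this number is the "elbow" method: fit K-Means for several values of k and watch the inertia (the within-cluster sum of squared distances), which always decreases as k grows but flattens out past a good value. A sketch on synthetic blobs (not the notebook's embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-d data with three well-separated blobs
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = km.inertia_

# inertia drops steeply up to k=3, then levels off: the "elbow" is at 3
for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```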
The code in the following cell computes clusters for the given sets of embeddings, labels and cluster count:
from sklearn.cluster import MiniBatchKMeans
from collections import defaultdict

def cluster_labels(embeddings, labels, cluster_n):
    clusterer = MiniBatchKMeans(n_clusters=cluster_n)
    clusters = clusterer.fit_predict(embeddings)
    group_by_cluster = defaultdict(list)
    for i, item in enumerate(clusters):
        group_by_cluster[item].append(labels[i])
    centers = [clusterer.cluster_centers_[i] for i in range(cluster_n)]
    groups = [group_by_cluster[i] for i in range(cluster_n)]
    return (centers, groups)
Let's calculate this for our Tarot embeddings and cards. The function also returns the center of each cluster:
centers, groups = cluster_labels(tarot_embeddings, tarot_labels, cluster_count)
The code in the following cell takes this information and prints it out. Each cluster is shown along with at most five cards that belong to it and the card closest to the center of the cluster.
for i in range(cluster_count):
    print(f"Cluster {i} ({len(groups[i])} items)")
    print("Closest to center: ",
          list(tarot_nn.nearest_matching(centers[i], 1, lambda x: x in groups[i]))[0])
    print("All cards: ", ", ".join(groups[i]))
    print()
    for card_label in random.sample(groups[i], min(5, len(groups[i]))):
        print(card_label)
        print(tarot_lookup[card_label])
        print()
    print("\n---")
Cluster 0 (4 items)
Closest to center:  ace of cups
All cards:  The Lovers, ace of cups, three of cups, ten of cups

ace of cups
Romance is in the cards; A new relationship or marriage is just around the corner; Prayers are answered

three of cups
Unconventional romance is coming your way: a love affair with someone you've always dismissed

ten of cups
Marriage and family are in the cards; Expect a friendship to blossom into a romance

The Lovers
A new personal or professional relationship blossoms; Sexual opportunities abound; Unexpectedly, a friend becomes a lover

---
Cluster 1 (30 items)
Closest to center:  The Moon
All cards:  The Fool, The Wheel, Justice, The Tower, The Moon, The Sun, Judgement, three of wands, four of wands, five of wands, seven of wands, nine of wands, ten of wands, eight of cups, nine of cups, ace of swords, two of swords, four of swords, six of swords, seven of swords, eight of swords, nine of swords, ten of swords, knight of swords, ace of coins, four of coins, six of coins, seven of coins, nine of coins, ten of coins

six of swords
You'll soon go on a long journey over water; Actions have unexpected consequences, so be prepared

nine of coins
Until you appreciate what you have, you won't have any luck getting more

seven of swords
Don't assume people around you are worthy of your trust; Ask for an accounting of where people have been, and what they've been doing

Justice
A legal verdict will be rendered soon; Someone is making a decision; You need to get the facts

knight of swords
A blunder leads someone to say something he or she regrets; If this was you, be prepared to apologize and move on

---
Cluster 2 (15 items)
Closest to center:  four of cups
All cards:  The Devil, The Star, The World, ace of wands, two of wands, six of wands, eight of wands, two of cups, four of cups, five of cups, six of cups, seven of cups, three of swords, five of swords, eight of coins

two of cups
Someone has a secret crush on you; Relationships should be mutual; get rid of a leech

five of cups
A breakup looms; Don't cry over spilt milk; Take your lumps and get back in the saddle

five of swords
Someone is stealing from you, financially or romantically; Be wary of friends who talk behind your back

eight of coins
Stop over-analyzing, researching, and outlining; Buckle down and get the work done

six of cups
A stingy spirit is strangling your enjoyment of life; Loosen up and think of others for once, why don't you?

---
Cluster 3 (14 items)
Closest to center:  knight of cups
All cards:  page of wands, knight of wands, queen of wands, king of wands, page of cups, knight of cups, queen of cups, king of cups, page of swords, queen of swords, king of swords, page of coins, queen of coins, king of coins

knight of wands
This card represents a man with a bold, passionate personality, likely born between July 12th and August 11th, who wants to sweep you off your feet

page of cups
This card represents a young man or woman with a watery, dreamy demeanor, likely born a Libra, Scorpio, or Sagittarius, who wants to start a new relationship with you

knight of cups
This card represents a man with an emotional, sensitive personality, likely born between October 13th and November 11th, who wants you to rally around his latest passionate cause

queen of swords
This card represents a woman with an artistic, intellectual nature, likely born between September 12th and October 12th, who uses clever, positive communication to sway others to her point of view

king of swords
This card represents an older man with an insightful, deliberate spirit, likely born between May 11th and June 10th, who is known for his integrity and sharp decision-making ability

---
Cluster 4 (15 items)
Closest to center:  The Hanged Man
All cards:  The Magician, The Papess/High Priestess, The Empress, The Emperor, The Pope/Hierophant, The Chariot, Strength, The Hermit, The Hanged Man, Death, Temperance, two of coins, three of coins, five of coins, knight of coins

The Hermit
A period of loneliness begins; One partner in a relationship departs; A search for love or money proves fruitless

The Magician
A powerful man may play a role in your day; Your current situation must be seen as one element of a much larger plan

three of coins
A high-dollar contract is in your future; If you work hard, you'll succeed

Death
A relationship or illness ends suddenly; Limit travel and risk-taking; General gloom and doom

The Papess/High Priestess
A mysterious woman arrives; A sexual secret may surface; Someone knows more than he or she will reveal

---
K-Means clustering has a random component, so you won't necessarily end up with the same clusters every time. With higher numbers of clusters, you tend to get clusters that only have one or two items (not ideal). A repeating pattern for me is that the court cards tend to end up in the same cluster and The Moon tends to be the center of large clusters. Not sure why!
We can consider Tarot to be one exemplar of the broader category of "oracle decks." In this repository, there's a file called oracle-corpus.tsv: a tab-separated file with several thousand potential oracle cards, drawn from the Tarot interpretations file and a book of dream interpretations.
This section of the notebook performs the same analysis we performed on the Tarot deck, but with this TSV instead. It'll work with any TSV though!
Because this TSV has some duplicate cards, I added some code that adds random numbers to the end of the card's name (in order to disambiguate between duplicates). The numbers don't actually mean anything.
already_seen = set()
deck = []
for line in open("./oracle-corpus.tsv"):
    line = line.strip()
    card, interp = line.split("\t")
    # dealing with duplicates
    if card in already_seen:
        card += "/%04d" % random.randrange(10000)
    already_seen.add(card)
    deck.append((card, interp))
random.sample(deck, 10)
[('petticoat', 'your reluctance in revealing something about yourself'), ('blog', 'your popularity'), ('photo_booth', 'things that are thought to be private may not be so private'), ('tractor', 'your resourcefulness and ingenuity'), ('mink', 'value, warmth, riches, or luxury'), ('sushi', 'you need to acknowledge your spiritual side'), ('monster/8338', 'aspects of yourself that you find repulsive and ugly'), ('post-it_note', 'there is something that you need to make a mental note of'), ('devil/6041', 'you will succeed in defeating your enemies'), ('erosion', 'a situation or relationship that is wearing away')]
A dictionary to look up interpretations for cards:
deck_lookup = dict(deck)
Look up the meaning of the flesh card:
deck_lookup['flesh']
'a heightened sense of feeling and vitality'
Finally, a list of just the card names:
deck_labels = [item[0] for item in deck]
Building a nearest neighbors index:
deck_embeddings = embeddings([interpretation for card, interpretation in deck])
deck_nn = SimpleNeighbors(300)
for v, (card, interpretation) in zip(deck_embeddings, deck):
    deck_nn.add_one(card, v)
deck_nn.build(50)
The .nearest() method returns the sentences from the corpus whose vectors are closest to the vector you pass in. The code in the cell below uses the sentence_summary() function to return the interpretations most similar to the sentence you type in. The number controls how many results should be returned.
deck_nn.nearest(sentence_summary("everything will be fine!"), 5)
['tart', 'face/5958', 'funeral', 'comfort', 'luck']
The same idea, but also prints out the interpretation for the card:
for item in deck_nn.nearest(sentence_summary("everything will be fine!"), 5):
    print(item)
    print(deck_lookup[item])
    print()
tart
things are going well for you

face/5958
you need to come clean about some matter

funeral
something in your life needs to put to rest or put aside so that you can make room for something new

comfort
Appreciating fine food, fine wine, beautiful art, beautiful bodies, or any of the better things in life

luck
things will look up for you
To get neighbors for a particular card:
deck_nn.neighbors('luck', 5)
['luck', 'lens', 'tart', 'joint', 'staring']
To get neighbors for a random item in the corpus:
deck_nn.neighbors(random.choice(deck_nn.corpus), 5)
['pole_vaulting', 'trough', 'pineapple', 'roommate', 'boxing_glove']
Because we used the same embedding procedure for both our oracle deck cards and the Tarot deck, we can find oracle cards that are close in meaning to Tarot cards:
for item in deck_nn.nearest(tarot_nn.vec('The Fool'), 5):
print(item)
print(deck_lookup[item])
print()
sit_up
you need to pay better attention to something in your life, like a relationship, school, work, family, or project

gear/8969
you are ready to move forward with a new project in your life

astral_projection
you are looking at things from a whole new perspective

new_year
prosperity, hope, new beginnings and an opportunity to make a fresh start

originality/9359
Putting old things together in new and exciting ways
Looking up oracle cards halfway between the Tarot cards ace of coins
and Death
:
src = 'ace of coins'
dest = 'Death'
for item in deck_nn.nearest(average(tarot_nn.vec(src), tarot_nn.vec(dest)), 5):
print(item)
print(deck_lookup[item])
print()
path
you need to give serious attention to the direction you are heading in your personal and/or business life

comic
you refuse to see the problems that exist in your life and only want to focus on the good times

bet
you are taking a risk in a relationship or work situation which may not be such a wise choice

outbound
Taking what you want without concern for the needs of others

camper
you need to move on with some situation or some aspect of your life
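The average() helper used above was defined earlier in the notebook; if you need it, a minimal version is just the elementwise mean of two vectors:

```python
import numpy as np

def average(vec_a, vec_b):
    # the midpoint between two embedding vectors: their elementwise mean
    return (np.asarray(vec_a) + np.asarray(vec_b)) / 2
```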
This code picks two cards at random and then prints them along with the card that falls halfway between the two in meaning.
src = deck_nn.vec(random.choice(deck_nn.corpus))
dest = deck_nn.vec(random.choice(deck_nn.corpus))
avg = average(src, dest)
src_card = deck_nn.nearest(src, 1)[0]
avg_card = deck_nn.nearest(avg, 1)[0]
dest_card = deck_nn.nearest(dest, 1)[0]
print(src_card)
print(deck_lookup[src_card])
print()
print(avg_card)
print(deck_lookup[avg_card])
print()
print(dest_card)
print(deck_lookup[dest_card])
herd
you are a follower

sunglasses
you are having a hard time getting to know this person

pacifier/1560
you are trying to "suck up to" someone in your waking life
This is the same t-SNE code as above, but applied to the oracle card embeddings.
Note that in the code below, I'm using an even smaller subset of the data. (That's what the [:2000]
is doing: it keeps just the first 2000 samples.) This is because t-SNE is slow, and so is drawing its results.
from sklearn.manifold import TSNE
deck_mapped_embeddings = TSNE(n_components=2,
metric='cosine',
init='pca',
learning_rate=75,
perplexity=15,
n_iter=1000,
verbose=1).fit_transform(deck_embeddings[:2000])
[t-SNE] Computing 46 nearest neighbors... [t-SNE] Indexed 2000 samples in 0.012s... [t-SNE] Computed neighbors for 2000 samples in 0.164s... [t-SNE] Computed conditional probabilities for sample 1000 / 2000 [t-SNE] Computed conditional probabilities for sample 2000 / 2000 [t-SNE] Mean sigma: 0.137747 [t-SNE] KL divergence after 250 iterations with early exaggeration: 84.509010 [t-SNE] KL divergence after 1000 iterations: 1.874479
Because there are more labels, we need to make the image larger so we can see them all clearly. You may need to save the image and open it in a viewing tool (like macOS Preview) to see everything comfortably.
disp_tsne(deck_mapped_embeddings, deck_labels[:2000], 32)
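The disp_tsne() helper comes from earlier in the notebook. If you don't have it handy, a rough equivalent (an assumption; the original may differ) scatters the 2-D t-SNE coordinates with matplotlib and annotates each point with its card label:

```python
import matplotlib.pyplot as plt

def disp_tsne(mapped, labels, size=16):
    # scatter the 2-D t-SNE coordinates and label each point with its card
    plt.figure(figsize=(size, size))
    plt.scatter(mapped[:, 0], mapped[:, 1], s=4, alpha=0.5)
    for label, (x, y) in zip(labels, mapped):
        plt.annotate(label, (x, y), fontsize=6, alpha=0.8)
    plt.show()
```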
Exporting for the Google Embedding Projector: upload these files at projector.tensorflow.org.
with open("deck-emb-proj-vecs.tsv", "w") as fh:
for item in deck_embeddings:
fh.write("\t".join(["%0.5f" % val for val in item]))
fh.write("\n")
with open("deck-emb-proj-labels.tsv", "w") as fh:
fh.write("\n".join(deck_labels))
I think this might actually be a valid and interesting way to generate ideas for an oracle deck! We cluster all 8000+ cards into groups. One way of thinking about this is that we've found a way to "factor out" similar cards, getting at the underlying core ideas of the deck. Set the cluster count to the number of cards you want, and voila: a computer-generated oracle deck that maximally covers all possible divinatory semantics.
cluster_count = 13 # adjust this until it starts giving you good results!
centers, groups = cluster_labels(deck_embeddings, deck_labels, cluster_count)
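The cluster_labels() helper was defined earlier in the notebook. As a sketch, it can be implemented with scikit-learn's KMeans; the return values (the cluster centers, plus a mapping from cluster index to the labels assigned to that cluster) are inferred from how they're used below:

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def cluster_labels(vectors, labels, n):
    # run k-means over the embedding vectors, then group the card labels
    # by the cluster each vector was assigned to
    km = KMeans(n_clusters=n, n_init=10).fit(np.asarray(vectors))
    groups = defaultdict(list)
    for label, cluster_idx in zip(labels, km.labels_):
        groups[cluster_idx].append(label)
    return km.cluster_centers_, groups
```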
The code in the following cell takes this information and prints it out. Each cluster is shown along with at most five cards that belong to it:
for i in range(cluster_count):
print(f"Cluster {i} ({len(groups[i])} items)")
print("Closest to center: ", list(deck_nn.nearest_matching(centers[i], 1, lambda x: x in groups[i]))[0])
# print("All cards: ", ", ".join(groups[i])) # uncomment this if you want to see all of the cards
print()
for card_label in random.sample(groups[i], min(5, len(groups[i]))):
print(card_label)
print(deck_lookup[card_label])
print()
print("\n---")
Cluster 0 (1245 items)
Closest to center: street_sweeper
quay/6462: you are moving forward into a new phase of your life
productivity: Celebrating your body
adulation/8823: you are willing to part with something near and dear to you in the hopes of material advancement
false_tooth/6042: someone in your life is not who they say they are
sensuality/4547: Reveling in the good things life has to offer
---
Cluster 1 (961 items)
Closest to center: food/1995
food_poisoning: there is something harming or interfering with your emotional well being
doctor/3096: there is some problem that you need to patch up or some emotional wound you need that to bandage up
adieu: an end to your worries
job/2464: Being ineffectual or lazy
swiss: a metaphor for holes or flaws in your way of thinking
---
Cluster 2 (500 items)
Closest to center: pretzel
sprinkler: enlightenment, rejuvenation and cleansing
tea/4911: satisfaction and contentment in your life
tortilla: wholeness
river/0989: purification and cleansing
head/3361: wisdom, intellect, understanding and rationality
---
Cluster 3 (857 items)
Closest to center: garret
seminar: you are expanding your knowledge and understanding
Gemini: Launching a diet, a weight-lifting program, or a health-related effort
tower/2975: high hopes and aspirations
chalk: school and learning
revolt: the influence of peer pressure working against you
---
Cluster 4 (301 items)
Closest to center: post-apocalypse
sex/8487: uncertainty about what is ahead
detour: you have encountered an obstacle in some aspect of your life
dying/9716: you are experience a relapse of sorts
jealousy/7905: Breaking through barriers
frame/3184: limitations and boundaries
---
Cluster 5 (111 items)
Closest to center: abbey/8798
milk/6835: Wallowing in unhealthy grief or self-pity
vacuum: feelings of emptiness
wooden_shoe: solitude and unfaithfulness
behind: feelings of rejection
grief: you are repressing your grief
---
Cluster 6 (1683 items)
Closest to center: comic
trash_compactor: you are in denial about some issue or problem
groan: others will take advantage of the situation
silhouette: some aspect of your life that is not clearly defined
furby: you are not being clear in how your express yourself
hiding: you are keeping some secret or withholding some information
---
Cluster 7 (346 items)
Closest to center: panic
tort: Recognizing you cannot always be in control
evening: the end of a cycle, aging or death
dracula/4498: you are draining the energy of others
breast_implant: your body image issues
jet/7413: speed, pride or power
---
Cluster 8 (624 items)
Closest to center: ghost/5604
hometown: you may be experiencing some unexpressed feelings
water_cooler: you are literally bottling up your emotions
gum: you need to confront some fear or depression
father/5835: you are feeling disconnected with one of your parents
epilepsy: you are suppressing your feelings
---
Cluster 9 (20 items)
Closest to center: fleur_de_lis
prime_minister: authority, power and control
secret/6533: hidden power
furnace: power and energy
horserace: the power
electricity_tower: a distribution of power
---
Cluster 10 (167 items)
Closest to center: timber
tractor: your resourcefulness and ingenuity
tuna: stamina and agility
rottweiler: confidence, protection, and courage
poplar: life and vitality
passion_fruit: life or vitality
---
Cluster 11 (981 items)
Closest to center: people/6655
race_car/8018: your hard driving and headstrong attitudes
blending: Bringing opposites together
sorcerer: your talents, inner strengths, and creative ability
stencil: a lack of freedom in some aspect of your waking life
crater: aspects of your subconscious are being slowly revealed to you
---
Cluster 12 (238 items)
Closest to center: symbol
serial_killer/6291: fear and insecurity
pig: dirtiness, greediness, stubbornness or selfishness
vehicle: your hastiness and quick temper
fox/9061: insight, cleverness, cunningness and resourcefulness
objection: Calmly expressing a dissenting opinion
---
With the larger oracle card file, we can use a language model to generate new cards. A Markov chain is probably the easiest way to go; you can read my whole big tutorial on Markov chains and neural networks for text generation for the details. But the quick version is: there's a good Markov chain library for Python that will do the work for you. At the command line, type:
pip install markovify
Markovify does word-level Markov chains by default, but I prefer character-level. Below is a little class that implements character-level Markov chains and also allows you to train on a list of strings (instead of a single big text file).
import markovify
import re
class MarkovLinesByChar(markovify.Text):
def word_split(self, sentence):
return list(sentence)
def word_join(self, words):
return "".join(words)
def sentence_split(self, text):
return re.split(r"\n", text)
We'll create separate models for the card names and the card interpretations:
name_model = MarkovLinesByChar("\n".join([item[0] for item in deck]).lower(), state_size=6)
interpretations_model = MarkovLinesByChar("\n".join([item[1] for item in deck]).lower(), state_size=7)
And then try them out:
name_model.make_sentence(tries=1000)
'surprising/2581'
interpretations_model.make_sentence(tries=1000)
'balance, self-denial'
The following cell prints out ten randomly generated cards:
for i in range(10):
print(name_model.make_sentence(tries=1000))
print(interpretations_model.make_sentence(tries=1000))
print()
fainting/8320
breaking on the goal

shaving_credit_card
you will achieving protection

armored_carpet/4262
you are expand your life

president/2302
you are not earned

destitutional_anthemums
your need for greatness

hot_spring/6141
ruggedness, persona you share and joy

combination/7431
you have over others

sand_castle/6592
refusing the issue or in your life

skydiving_boy
you have to protection

reconcilementations
a looming ruthlessness
Note that the card names and the card interpretations here are totally independent of each other! Markov chains are good at generating language that follows the statistical characteristics of a source text, but they aren't good at generating text about a particular topic. For that, you'll need either (a) a model that can take long-distance semantic dependencies into account, like GPT-2, or (b) a model that can generate shorter snippets of text conditionally, like a captioning model or a variational autoencoder. But that's a subject for a different tutorial!