Semantic space of cartomancy

By Allison Parrish

(Note: Rough draft! Notes incomplete. FIXME: Needs example at the end for predicting from user-supplied text.)

In this notebook, I'm going to take you through a couple of simple, well-known techniques for exploring small sequences of text—like Tarot interpretations. These techniques include:

  • Making vectors for text sequences
  • Nearest-neighbor lookups for semantic similarity
  • Visualizing corpora with t-SNE
  • Clustering sentence vectors to discover similar items

The goal is to better understand the semantic structure of oracle decks.

Installation and preliminaries

I'm going to use spaCy extensively, both as a way to parse text into sentences and also as a source for pre-trained word vectors. Make sure you have it installed, along with the en_core_web_md or en_core_web_lg models. If you're using Anaconda, you can install spaCy and the English language model with the following commands:

conda install -c conda-forge spacy
python -m spacy download en_core_web_md


This notebook also assumes that you have scikit-learn and numpy installed (these come with standard Anaconda installations).

We'll also be using a library called simpleneighbors. To install this, try:

pip install simpleneighbors[annoy]

If that doesn't work (e.g., you get an error about a missing compiler), try:

pip install simpleneighbors[sklearn]

Once you're done installing, you should be able to import the modules with the cells below:

In [13]:
import numpy as np
import spacy
from simpleneighbors import SimpleNeighbors
In [2]:
import gzip, json, random

The following cell loads spaCy's language model (this might take a sec):

In [3]:
nlp = spacy.load('en_core_web_md')

Tarot cards and interpretations

Download the tarot_interpretations.json file and put it into the same directory as this notebook. (If you cloned the repository, it's already there!) The file contains structured data about the 78 cards in the Tarot deck and their interpretations. It's in JSON format, so we have to parse it with Python's json library to make the data available:

In [4]:
import json
tarot_data = json.load(open("tarot_interpretations.json"))

The structure of the data is a little bit tricky, and there's more in there than we need for our experiments in this notebook. The cell below creates a list of tuples, pairing each tarot card's name with its interpretation. (The interpretations are created by joining together the values for the fortune_telling key in the JSON file for each card.)

In [5]:
tarot_cards = []
for item in tarot_data['tarot_interpretations']:
    tarot_cards.append((item['name'], "; ".join(item['fortune_telling'])))

The following cell does a simple Tarot spread, sampling three cards at random and showing them with their interpretations:

In [6]:
random.sample(tarot_cards, 3)
Out[6]:
[('nine of coins',
  "Until you appreciate what you have, you won't have any luck getting more"),
 ('The Emperor',
  'A father figure arrives; A new employer or authority figure will give you orders; Expect discipline or correction in the near future'),
 ('four of wands',
  'Someone is watching and evaluating your work; You may get a wedding invitation soon')]

For convenience, the following cell creates a dictionary that makes it easy to look up the interpretation for a given card:

In [7]:
tarot_lookup = dict(tarot_cards)

Look up the meaning of the two of coins:

In [8]:
tarot_lookup['two of coins']
Out[8]:
"It's time to balance the budget; Avoid the temptation to spend critical funds on frivolous goods"

(Note that I don't especially agree with many of the interpretations in this JSON file! This is an especially poor interpretation in my opinion. But these interpretations are what we have to work with, so let's move forward.)

Finally, a list of just the card names:

In [9]:
tarot_labels = [item[0] for item in tarot_cards]

Semantic similarity with word vectors

An immediate goal of this notebook is to determine which Tarot cards have similar interpretations. To do this, we need a measure of semantic similarity: how close are two stretches of text in meaning? For computational purposes, we need to quantify this similarity, and ideally we'd like to be able to visualize it. But how do we go about doing this?

Contemporary research in natural language processing offers a solution: the word vector. You can read and follow along with my notebook explaining how word vectors work before continuing, but the short version is that a word vector associates a word with a coordinate in space. Words with similar meanings will have coordinates that are close to each other. (You can think of these coordinates as points on a Cartesian plane with an X and Y axis, though in reality each vector has many more than two dimensions.)
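
To make "close to each other" concrete: the usual measure is cosine similarity, the cosine of the angle between two vectors. Here's a minimal sketch using made-up three-dimensional vectors (real word vectors, like the ones we'll use below, have 300 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the vectors, divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up 3-D "word vectors" for illustration only
kitten = np.array([0.9, 0.8, 0.1])
cat    = np.array([0.8, 0.9, 0.2])
train  = np.array([0.1, 0.2, 0.9])

cosine_similarity(kitten, cat)    # close to 1.0: the vectors point the same way
cosine_similarity(kitten, train)  # much smaller: the vectors diverge
```

A similarity near 1.0 means the vectors point in nearly the same direction; values near 0 (or below) mean they don't.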

The word vectors that we're using are the Stanford GloVe vectors included with spaCy. These vectors were produced through an algorithmic process that looked at a large corpus of text and recorded all of the contexts that each word occurred in. The vectors are a compressed representation of this list of contexts for each word, so that words that occur in similar contexts have similar vectors. According to the distributional hypothesis, words that occur in similar contexts have similar meanings, so vectors representing contexts can serve as a way to represent meanings of words.

Word vectors are not perfect; they carry with them the biases of the corpora they're trained on. But they have several properties that afford interesting techniques in computational language arts.

The following cell defines a function vec() which makes it easy to get the word vector for a particular word:

In [10]:
def vec(s):
    return nlp.vocab[s].vector

Here's what a vector looks like:

In [11]:
vec('cards')
Out[11]:
array([-0.54786  , -0.060409 , -0.34341  , -0.78909  ,  0.76913  ,
        0.062576 , -0.36196  , -0.067923 , -0.24321  ,  1.4941   ,
       -0.0050086,  0.13367  ,  0.044146 , -0.15426  , -0.083489 ,
       -0.085395 , -0.9413   ,  1.6464   , -0.02296  , -0.08399  ,
       -0.19905  ,  0.33439  , -0.12289  , -0.030612 ,  0.29691  ,
       -0.35093  , -0.021799 , -0.036572 , -0.66472  , -0.0036923,
       -0.15867  , -0.22094  , -0.096175 ,  0.021781 , -0.3915   ,
        0.54463  , -0.43797  , -0.21985  , -0.23609  ,  0.13824  ,
       -0.15661  ,  0.14144  ,  0.019459 , -0.66783  , -0.05219  ,
        0.012071 , -0.46731  , -0.13634  , -0.055697 ,  0.39305  ,
       -0.11922  , -0.061792 ,  0.56273  , -0.53199  , -0.14907  ,
       -0.11057  ,  0.31247  ,  0.30131  , -0.43884  ,  0.11746  ,
        0.07021  ,  0.33124  ,  0.39057  , -0.069946 ,  0.15441  ,
        0.46396  ,  0.0040996,  1.0562   ,  0.36647  , -0.28721  ,
        0.082222 ,  0.00213  ,  0.34582  , -0.0446   ,  0.46781  ,
        0.66558  ,  0.43319  , -0.76897  , -0.050817 ,  0.90723  ,
       -0.26064  ,  0.41021  ,  0.32019  , -0.018988 ,  0.18413  ,
        0.027466 ,  0.2929   ,  0.66606  ,  0.24767  , -0.42119  ,
        0.18377  ,  0.18996  , -0.28091  , -0.38211  ,  0.37067  ,
       -0.46318  ,  0.26522  , -0.072179 , -0.19687  , -0.074965 ,
       -0.63569  , -0.66968  , -0.88611  ,  0.14429  ,  0.12875  ,
       -0.64857  , -0.045062 ,  0.35845  ,  0.46038  , -0.0095807,
        0.50231  , -0.066645 ,  0.09579  , -0.27013  ,  0.39316  ,
       -0.13675  ,  0.047418 ,  0.36256  ,  0.072575 ,  0.096322 ,
        0.75857  , -0.3371   ,  0.31418  ,  0.09372  ,  0.079163 ,
        0.2131   ,  0.11307  ,  0.46478  ,  0.21549  , -0.04808  ,
       -0.30878  ,  0.20758  ,  0.34154  ,  0.20912  , -0.51854  ,
       -0.39562  ,  0.2113   ,  0.57655  , -0.27473  ,  0.4911   ,
       -1.4924   ,  0.49913  ,  0.038791 , -0.28311  , -0.70614  ,
        0.36361  ,  0.2137   ,  0.14033  , -0.35302  , -0.21158  ,
        0.13034  ,  0.32395  , -0.23138  ,  0.39176  ,  0.045725 ,
       -0.39346  ,  0.35702  ,  0.48889  ,  0.060095 , -0.069811 ,
       -0.10339  ,  0.40331  ,  0.516    ,  0.21622  , -0.92662  ,
       -0.59798  ,  0.30912  ,  0.030597 ,  0.19242  , -0.094897 ,
       -0.14628  ,  0.20332  ,  0.48479  ,  0.78987  ,  0.1215   ,
       -0.14864  , -0.21997  , -0.15274  ,  0.44295  ,  0.59229  ,
       -0.20717  ,  0.14929  ,  0.10725  , -0.22542  , -0.1083   ,
        0.26113  , -0.55752  , -0.030635 , -0.33673  , -0.35509  ,
        0.20269  ,  0.033527 , -0.17902  , -0.38712  ,  0.080425 ,
       -0.42007  , -0.93471  ,  0.45989  ,  0.13203  ,  0.50639  ,
        0.95391  , -1.011    ,  0.41906  , -0.18694  ,  0.27447  ,
       -0.17695  ,  0.014367 ,  0.21504  , -0.38839  ,  0.60929  ,
        0.75328  ,  0.26789  , -0.30174  , -0.12735  ,  0.30466  ,
       -0.11706  ,  0.024288 , -0.10797  , -0.56404  ,  0.13712  ,
       -0.28683  , -0.24417  ,  0.070113 , -0.19123  , -1.0564   ,
       -0.1462   ,  0.048439 ,  0.58187  ,  0.087668 ,  0.31711  ,
        0.13831  ,  0.24231  ,  0.24348  ,  0.54646  ,  0.0016724,
       -0.0081518,  0.068331 , -0.21606  , -0.27379  ,  0.55945  ,
        0.2848   ,  1.1669   , -0.064809 ,  0.29127  , -0.11288  ,
       -0.53852  ,  0.12104  , -0.49231  , -0.45929  , -0.0044095,
        0.15616  ,  0.18031  , -0.70871  , -0.34936  ,  0.62035  ,
        0.4343   , -0.63038  , -0.14185  , -0.42332  ,  0.24734  ,
        0.50841  ,  0.066684 , -0.87131  ,  0.22859  ,  0.072485 ,
        0.40083  ,  0.15505  , -0.047754 ,  0.35373  , -0.057295 ,
       -0.85158  ,  0.37551  ,  0.052956 ,  0.53433  ,  0.46134  ,
       -0.072809 , -0.25514  ,  0.3398   , -0.081816 , -0.58906  ,
        0.19547  , -0.38552  , -0.1366   , -0.10245  ,  0.45415  ,
        0.57782  , -0.097346 , -0.17491  , -0.36691  ,  0.13359  ,
       -1.2082   ,  0.92608  ,  0.22259  , -0.091107 ,  0.34454  ,
        0.058    ,  0.78919  , -0.28031  ,  0.37474  ,  0.24557  ],
      dtype=float32)

A distributional thesaurus

The following code builds an index of common words and their vectors. You can use this index to look up words considered to be close in meaning to a given word. (We won't use this for anything in the rest of the notebook, but it's fun to play with.)

In [14]:
thesaurus = SimpleNeighbors(300)
for item in nlp.vocab:
    if item.has_vector and item.prob > -15 and item.is_lower:
        thesaurus.add_one(item.text, item.vector)
thesaurus.build(50)
In [15]:
thesaurus.nearest(vec('magician'), 5)
Out[15]:
['genie', 'magic', 'magician', 'magical', 'enchantment']

Sentence vectors

Word vectors work great when we're interested in individual words. More often, though, we're interested in longer stretches of text, like sentences, lines, or paragraphs. If we had a way to represent these longer stretches of text as vectors, we could perform all of the same operations on them that word vectors allow us to perform on words. But how to represent stretches of text as vectors?

There are lots of different ways! The classic technique in machine learning is to use the frequency of terms found in each sequence (methods like TF-IDF), or related document-embedding techniques like doc2vec. Another way is to train a recurrent neural network (like an LSTM) and use its hidden state. Yet another is to use a pre-trained model, like Google's Universal Sentence Encoder.

But a surprisingly effective technique is to simply average together the word vectors for each word in the sentence. A big advantage of this technique is that no further training is needed, beyond the training needed to calculate the word vectors; if you're using pre-trained vectors, even that step can be skipped. You won't get state-of-the-art results on NLP benchmarks with this technique, but it's a good baseline and still useful for many tasks.
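
The simplest, unweighted version of this averaging might look like the sketch below. (The toy 3-D vocabulary here is made up for illustration; real word vectors have 300 dimensions, and the version we'll actually use adds frequency weighting.)

```python
import numpy as np

# made-up toy vocabulary of 3-D word vectors, for illustration only
toy_vectors = {
    "the":   np.array([0.1, 0.1, 0.1]),
    "moon":  np.array([0.2, 0.9, 0.4]),
    "rises": np.array([0.7, 0.3, 0.8]),
}

def naive_sentence_vector(words):
    # element-wise mean of the word vectors: the "centroid" of the sentence
    return np.mean([toy_vectors[w] for w in words], axis=0)

naive_sentence_vector(["the", "moon", "rises"])
# a single 3-D vector: the average of the three word vectors
```

The result has the same dimensionality as the word vectors, so everything that works on word vectors (similarity, nearest neighbors, clustering) works on it too.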

The function below produces a vector that represents the meaning of a sentence with an average of the vectors for each word in the sentence. The function weights each word's contribution to the average based on the inverse of its frequency in English (according to spaCy's built-in word frequency information). The intention of weighting by inverse frequency is to reduce the contribution of common function words like "the" and "of" to the sentence vector. (This approach was inspired by Arora et al.'s Smooth Inverse Frequency technique, though my approach doesn't remove the principal component.)

In [16]:
def sentence_summary(sent, a=0.001):
    s = nlp(sent, disable=['parser', 'tagger', 'ner'])
    weights = [a / (a + np.exp(tok.prob)) for tok in s]
    emb = np.average([tok.vector for tok in s], axis=0, weights=weights)
    return emb

The code in the cell below computes and returns the embeddings for every string in a list:

In [17]:
def embeddings(text_list):
    return [sentence_summary(item) for item in text_list]

And here we calculate embeddings for every Tarot interpretation:

In [18]:
tarot_embeddings = embeddings([interpretation for card, interpretation in tarot_cards])

And here's what they look like:

In [19]:
rand_idx = random.randrange(len(tarot_embeddings))
tarot_cards[rand_idx], tarot_embeddings[rand_idx]
Out[19]:
(('ace of coins',
  "Your health will improve; The check you're looking for really is in the mail"),
 array([-7.48437062e-02,  2.38817390e-01, -2.60202422e-01,  6.45209687e-03,
        -5.76540027e-02,  9.11737805e-04, -7.95287452e-02, -3.14694786e-01,
        -8.11702077e-02,  2.32743427e+00, -5.00491939e-01,  3.74548861e-02,
         1.31327028e-02, -6.07544370e-02, -7.64834004e-02,  9.26227071e-04,
        -9.57607638e-02,  1.55665597e+00, -3.24453771e-01,  7.33627906e-02,
         1.14888679e-02,  1.70832060e-01, -2.99182090e-02, -1.18720762e-01,
         5.66484608e-02,  1.36779241e-02, -7.04350782e-02, -1.41377904e-01,
         1.98921806e-01, -2.15919173e-01, -7.38209155e-02, -1.32633572e-02,
         4.55063377e-02, -1.58357770e-02,  3.85797984e-02, -1.66530833e-03,
        -3.66343062e-02,  1.22332848e-01,  9.88227520e-03, -5.77911806e-02,
        -6.06664125e-02,  3.09903999e-02, -1.21920058e-01, -2.43932289e-01,
        -4.46682993e-02,  9.22677711e-02, -9.96118580e-02,  1.57244214e-01,
         3.19178682e-02,  6.43189688e-02, -1.01735060e-01, -1.50214267e-01,
         1.51780992e-01, -2.35615949e-01,  5.52883619e-02,  1.63287913e-01,
        -1.78592247e-02, -1.07295609e-01, -1.05380497e-02, -8.59099143e-02,
         4.46896350e-02, -1.61179409e-01, -7.68931289e-02,  2.14964732e-01,
         3.30428203e-01,  4.80253013e-02,  2.59878188e-02,  1.17989992e-01,
         8.53953369e-02,  1.54478739e-01, -2.44244780e-02, -2.41791573e-02,
         3.69427739e-01, -1.33175913e-01,  2.41763305e-01,  1.87899780e-01,
         1.32443139e-01, -1.95086423e-01, -2.22632380e-02,  1.33292755e-01,
        -4.98899809e-02,  1.20693739e-01, -9.51892856e-02,  6.75185254e-02,
         4.63139610e-02, -9.31672267e-02, -7.28366242e-02,  1.48728923e-01,
         2.99043321e-01, -4.99964930e-03, -4.03147672e-02,  1.77066748e-01,
         3.54631857e-02, -1.62081859e-02,  5.19220642e-02, -6.72302465e-02,
        -1.13579376e-01, -1.43683545e-01,  2.78564691e-02,  7.62348928e-02,
        -6.65593502e-02, -8.50503551e-02,  6.91155524e-02, -1.59170916e-01,
         1.09488480e-01, -8.22551231e-01,  3.65457500e-01,  4.60299493e-02,
        -8.61159351e-03, -4.36428519e-02,  1.64247674e-01, -1.65095284e-01,
         2.05229024e-01, -2.61950875e-02,  1.82527833e-01, -1.11650097e-01,
         1.08528657e-01, -4.38461660e-02, -4.77332941e-02, -2.26382037e-02,
         1.70719513e-01,  7.43769102e-03, -1.15183466e-01,  4.10232764e-03,
         1.45234198e-01, -8.14227408e-02,  7.00700930e-02,  9.20888089e-03,
        -2.97235317e-02, -6.03460036e-02,  3.81184236e-02, -1.52229369e-01,
        -1.96107704e-01, -8.70210721e-03,  1.05284864e-01,  7.56332051e-02,
        -9.34177743e-02, -1.34808105e-01, -1.27328255e-01, -2.15593590e-02,
        -9.58447919e-01,  1.55509800e-01,  1.12895698e-01, -1.08795166e-01,
        -1.58813633e-03, -1.30173494e-01,  3.65858307e-02,  1.10479234e-01,
        -7.31942739e-02,  1.59150441e-01, -1.51762203e-02,  5.68230947e-02,
         9.08937080e-02,  1.14627780e-02, -2.57060010e-02,  1.92023367e-01,
         6.19790247e-02, -7.57231737e-02, -2.86259124e-02, -2.07441214e-01,
         2.67854245e-02,  2.06206519e-01, -3.82097765e-02, -2.06149368e-01,
         8.24402842e-02, -1.32642717e-01,  2.18515685e-01, -1.07890557e-01,
         1.54641973e-01, -7.23135646e-02, -9.24776183e-02, -2.48391590e-02,
         2.02703915e-01, -4.28118497e-02, -6.24633172e-02,  1.61729463e-01,
        -2.50933312e-02,  2.23573054e-01,  1.24342019e-01,  3.64926962e-01,
        -1.01829669e-01, -4.44452489e-02, -1.61869151e-01,  2.25040499e-02,
         5.42481704e-02,  1.18308602e-01, -1.23432542e-01, -1.36318297e-01,
        -2.51408446e-01, -4.44131144e-02,  4.91594177e-02, -1.27359955e-02,
         4.97160255e-02, -1.49556022e-01,  7.79283473e-02,  3.10716404e-01,
        -8.51481380e-02, -2.69857885e-02,  2.08288904e-01,  2.51421121e-01,
         1.15344299e-01, -1.91247327e-01, -1.06143072e-01,  4.58277885e-02,
         7.00997709e-02,  1.18081542e-01, -4.51600569e-02,  1.35772967e-01,
        -5.17282925e-03,  4.78049697e-02, -5.67102847e-02, -3.12424448e-01,
        -8.52369813e-02, -1.95575626e-01, -1.56246210e-01, -6.09950984e-02,
         8.69768248e-03, -1.44713961e-02, -4.83891870e-01, -9.00313824e-02,
        -1.71125145e-01, -9.71028977e-02, -1.88515921e-01,  1.27193583e-01,
         3.81317358e-02, -7.79134478e-02,  2.34144180e-02,  1.62707638e-01,
         1.08522363e-02, -7.17969703e-02, -9.68405901e-02,  5.41182664e-02,
         5.03943266e-02,  7.04710596e-02, -2.82413905e-02,  2.05238891e-02,
        -1.04265870e-04, -2.20987733e-01, -6.54041566e-02, -1.43803114e-02,
         5.82879536e-02, -1.90525812e-03,  1.46807064e-01,  1.42797293e-02,
         1.97468939e-01, -1.73688021e-01, -1.08334611e-01, -9.03258903e-02,
        -7.88496215e-02,  5.92143497e-02,  1.21125243e-01, -1.70939167e-01,
        -1.73801636e-01, -5.62711353e-02,  1.07790899e-01,  3.23222063e-01,
        -4.65322352e-02, -5.61135153e-02, -5.98254008e-03, -1.79172921e-02,
        -7.61352234e-02,  1.56493064e-01,  2.49916214e-01, -1.00796355e-02,
         9.27975720e-02, -2.08178651e-01, -1.50246486e-01,  2.47436565e-01,
         4.48720485e-01, -1.85201411e-02, -4.01863171e-02, -8.27071548e-02,
        -1.19647022e-01, -5.90888109e-02,  5.19629545e-02,  4.14311134e-02,
         2.95730843e-03,  1.17222447e-01, -2.53124451e-02,  1.11358427e-01,
        -1.23865467e-01,  1.47237513e-02,  1.15925128e-01,  1.49301028e-01,
         1.30670394e-02, -1.53877910e-01,  1.30660114e-01, -1.46171688e-01,
         1.46446332e-01, -1.28922267e-01, -2.20237411e-01,  2.76028688e-02,
         4.93983164e-02, -3.10195190e-02,  4.27544581e-02, -1.77525941e-01,
         5.50127771e-02, -1.25325867e-01, -5.63722888e-05,  2.26680504e-01]))

A Tarot search engine

Sentence vectors aren't especially interesting on their own! One thing we can use them for is to build a tiny "search engine" based on semantic similarity between cards. To do this, we need to be able to calculate the distance between a target sentence's vector (not necessarily a sentence from our corpus) and the vectors of the sentences in the corpus, returning the corpus items ranked by that distance. We'll use another SimpleNeighbors object to perform fast lookups based on similarity.
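
Conceptually, a nearest-neighbor lookup is just "sort everything by distance to the target and take the top n." Here's a brute-force sketch of that idea with made-up 3-D vectors and cosine distance; the approximate index we're about to build exists to avoid this full scan on large corpora:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity: 0 means identical direction
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy corpus: labels paired with made-up 3-D vectors
corpus = [
    ("sunny",  np.array([0.9, 0.1, 0.1])),
    ("bright", np.array([0.8, 0.2, 0.1])),
    ("gloomy", np.array([0.1, 0.1, 0.9])),
]

def brute_force_nearest(target, corpus, n):
    # sort the entire corpus by distance to the target vector
    ranked = sorted(corpus, key=lambda pair: cosine_distance(target, pair[1]))
    return [label for label, v in ranked[:n]]

brute_force_nearest(np.array([1.0, 0.0, 0.0]), corpus, 2)
# → ['sunny', 'bright']
```

This scales linearly with corpus size per query, which is fine for 78 cards but slow for big corpora; approximate nearest-neighbor indexes like Annoy trade a little accuracy for much faster lookups.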

First, I build an approximate nearest neighbors index using the vectors I got from the inverse frequency weighting technique:

In [20]:
tarot_nn = SimpleNeighbors(300)
for emb, (card, interpretation) in zip(tarot_embeddings, tarot_cards):
    tarot_nn.add_one(card, emb)
tarot_nn.build(50)

The .nearest() method returns the items from the corpus whose vectors are closest to the vector you pass in. The code in the cell below uses the sentence_summary() function to find the cards whose interpretations are most similar to the sentence you type in. The second argument controls how many results should be returned.

In [25]:
tarot_nn.nearest(sentence_summary("happiness and contentment"), 5)
Out[25]:
['ten of cups', 'six of cups', 'The Lovers', 'The Hermit', 'queen of cups']

The same idea, but also prints out the interpretation for the card:

In [26]:
for item in tarot_nn.nearest(sentence_summary("happiness and contentment"), 5):
    print(item)
    print(tarot_lookup[item])
    print()
ten of cups
Marriage and family are in the cards; Expect a friendship to blossom into a romance

six of cups
A stingy spirit is strangling your enjoyment of life; Loosen up and think of others for once, why don't you?

The Lovers
A new personal or professional relationship blossoms; Sexual opportunities abound; Unexpectedly, a friend becomes a lover

The Hermit
A period of loneliness begins; One partner in a relationship departs; A search for love or money proves fruitless

queen of cups
This card represents a woman with an emotional, deeply spiritual nature, likely born between June 11th and July 11th, who uses emotional and spiritual appeals to sway others to her point of view

To get neighbors for a particular card:

In [27]:
tarot_nn.neighbors('The Sun', 5)
Out[27]:
['The Sun', 'The Moon', 'five of wands', 'ten of wands', 'nine of wands']

To get neighbors for a random item in the corpus:

In [28]:
tarot_nn.neighbors(random.choice(tarot_nn.corpus), 5)
Out[28]:
['Death', 'The Hermit', 'The Hanged Man', 'The Empress', 'six of swords']

Finding cards "in between"

A benefit of representing meaning as points in space is that we can perform vector arithmetic on those vectors. One operation of interest is finding the midpoint between two points (i.e., averaging two points). If we find the vectors for two cards, then average those vectors together and find the card nearest to that point, we'll get a card that is "between" the two other cards in terms of their meaning.

The following function averages two vectors:

In [29]:
def average(v1, v2):
    return (np.array(v1) + np.array(v2)) / 2

And then this code prints out the five cards nearest to the midpoint of the two named cards. (Often the first two nearest cards are the source and destination themselves. Weirdly, in my experiments The Moon turns up near the midpoint of almost any two cards!)

In [30]:
src = 'ace of coins'
dest = 'Death'
for item in tarot_nn.nearest(average(tarot_nn.vec(src), tarot_nn.vec(dest)), 5):
    print(item)
    print(tarot_lookup[item])
    print()
ace of coins
Your health will improve; The check you're looking for really is in the mail

Death
A relationship or illness ends suddenly; Limit travel and risk-taking; General gloom and doom

The Moon
Watch for problems at the end of the month; Someone you know needs to howl at the moon more often; Someone is about to change his or her mind about an important decision

Judgement
An old issue you thought was over will come up again today; Get ready for huge changes: break-ups, sudden calls from old friends, and unexpected setbacks; God's trying to get your attention

The Magician
A powerful man may play a role in your day; Your current situation must be seen as one element of a much larger plan

Visualize tarot space in two dimensions

Another thing you can do with sentence vectors is visualize them. But the vectors are high-dimensional (in our case, 300 dimensions), with no obvious mapping to two-dimensional space. Thankfully, there are a number of algorithms for reducing the dimensionality of vectors. We're going to use t-SNE ("t-distributed stochastic neighbor embedding"), but there are others to experiment with that might be just as good or better for your application (like PCA or UMAP).

In [31]:
from sklearn.manifold import TSNE
mapped_embeddings = TSNE(n_components=2,
                         metric='cosine',
                         init='pca',
                         learning_rate=75,
                         perplexity=15,
                         n_iter=5000,
                         verbose=1).fit_transform(tarot_embeddings)
[t-SNE] Computing 46 nearest neighbors...
[t-SNE] Indexed 78 samples in 0.000s...
[t-SNE] Computed neighbors for 78 samples in 0.005s...
[t-SNE] Computed conditional probabilities for sample 78 / 78
[t-SNE] Mean sigma: 0.125879
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.816658
[t-SNE] KL divergence after 3000 iterations: 0.518916

The following function draws an image with the results of the t-SNE. (You can control-click or right-click to save it.)

In [32]:
%matplotlib inline
import matplotlib.pyplot as plt
def disp_tsne(embeddings, labels, figsize=12):
    plt.figure(figsize=(figsize, figsize))
    x = embeddings[:,0]
    y = embeddings[:,1]
    plt.scatter(x, y)
    for i, item in enumerate(labels):
        plt.annotate(labels[i], (x[i], y[i]))

And the following calls the function with the result of the TSNE:

In [33]:
disp_tsne(mapped_embeddings, tarot_labels)

Exporting for Google embedding projector

The cells below write the embeddings and the card labels to TSV files, which you can upload to Google's Embedding Projector at projector.tensorflow.org.

In [34]:
with open("tarot-emb-proj-vecs.tsv", "w") as fh:
    for item in tarot_embeddings:
        fh.write("\t".join(["%0.5f" % val for val in item]))
        fh.write("\n")
In [35]:
with open("tarot-emb-proj-labels.tsv", "w") as fh:
    fh.write("\n".join(tarot_labels))

Finding clusters

In the visualization above, you may have seen some evidence of "clustering"—groups of items that seem to be related. There are algorithms that facilitate finding such clusters automatically. This can be an interesting and valuable way to explore your data—you might find clusters of meaning that you didn't expect.

We're going to use the K-Means clustering algorithm (in particular, scikit-learn's MiniBatchKMeans).

K-Means is an unsupervised algorithm, meaning that you don't need to label the data for it to work. But you do need to specify how many clusters you expect to find:

In [36]:
cluster_count = 5 # adjust this until it starts giving you good results!
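
If you'd rather not adjust the cluster count by eye, one common heuristic is the silhouette score, which rewards tight, well-separated clusters. Here's a sketch on synthetic data (make_blobs stands in for our embeddings; this isn't part of the original workflow, just one way to pick the number):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for our embeddings: three well-separated blobs
points, _ = make_blobs(n_samples=90, centers=3, cluster_std=0.5, random_state=0)

def best_cluster_count(data, candidates):
    # higher silhouette score = tighter, better-separated clusters
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get)

best_cluster_count(points, range(2, 8))  # should recover the three blobs
```

Real embeddings rarely separate this cleanly, so treat the score as a suggestion rather than an answer; the "adjust until it looks good" approach above is a perfectly reasonable alternative.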

The code in the following cell computes clusters for the given sets of embeddings, labels and cluster count:

In [37]:
from sklearn.cluster import MiniBatchKMeans
from collections import defaultdict

def cluster_labels(embeddings, labels, cluster_n):
    clusterer = MiniBatchKMeans(n_clusters=cluster_n)
    clusters = clusterer.fit_predict(embeddings)
    group_by_cluster = defaultdict(list)
    for i, item in enumerate(clusters):
        group_by_cluster[item].append(labels[i])
    centers = [clusterer.cluster_centers_[i] for i in range(cluster_n)]
    groups = [group_by_cluster[i] for i in range(cluster_n)]
    return (centers, groups)

Let's calculate this for our Tarot embeddings and cards. The function also returns the center of each cluster:

In [38]:
centers, groups = cluster_labels(tarot_embeddings, tarot_labels, cluster_count)

The code in the following cell takes this information and prints it out. Each cluster is shown with the card closest to its center, the full list of cards that belong to it, and the interpretations of up to five randomly sampled cards.

In [39]:
for i in range(cluster_count):
    print(f"Cluster {i} ({len(groups[i])} items)")
    print("Closest to center: ", list(tarot_nn.nearest_matching(centers[i], 1, lambda x: x in groups[i]))[0])
    print("All cards: ", ", ".join(groups[i]))
    print()
    for card_label in random.sample(groups[i], min(5, len(groups[i]))):
        print(card_label)
        print(tarot_lookup[card_label])
        print()
    print("\n---")
Cluster 0 (4 items)
Closest to center:  ace of cups
All cards:  The Lovers, ace of cups, three of cups, ten of cups

ace of cups
Romance is in the cards; A new relationship or marriage is just around the corner; Prayers are answered

three of cups
Unconventional romance is coming your way: a love affair with someone you've always dismissed

ten of cups
Marriage and family are in the cards; Expect a friendship to blossom into a romance

The Lovers
A new personal or professional relationship blossoms; Sexual opportunities abound; Unexpectedly, a friend becomes a lover


---
Cluster 1 (30 items)
Closest to center:  The Moon
All cards:  The Fool, The Wheel, Justice, The Tower, The Moon, The Sun, Judgement, three of wands, four of wands, five of wands, seven of wands, nine of wands, ten of wands, eight of cups, nine of cups, ace of swords, two of swords, four of swords, six of swords, seven of swords, eight of swords, nine of swords, ten of swords, knight of swords, ace of coins, four of coins, six of coins, seven of coins, nine of coins, ten of coins

six of swords
You'll soon go on a long journey over water; Actions have unexpected consequences, so be prepared

nine of coins
Until you appreciate what you have, you won't have any luck getting more

seven of swords
Don't assume people around you are worthy of your trust; Ask for an accounting of where people have been, and what they've been doing

Justice
A legal verdict will be rendered soon; Someone is making a decision; You need to get the facts

knight of swords
A blunder leads someone to say something he or she regrets; If this was you, be prepared to apologize and move on


---
Cluster 2 (15 items)
Closest to center:  four of cups
All cards:  The Devil, The Star, The World, ace of wands, two of wands, six of wands, eight of wands, two of cups, four of cups, five of cups, six of cups, seven of cups, three of swords, five of swords, eight of coins

two of cups
Someone has a secret crush on you; Relationships should be mutual; get rid of a leech

five of cups
A breakup looms; Don't cry over spilt milk; Take your lumps and get back in the saddle

five of swords
Someone is stealing from you, financially or romantically; Be wary of friends who talk behind your back

eight of coins
Stop over-analyzing, researching, and outlining; Buckle down and get the work done

six of cups
A stingy spirit is strangling your enjoyment of life; Loosen up and think of others for once, why don't you?


---
Cluster 3 (14 items)
Closest to center:  knight of cups
All cards:  page of wands, knight of wands, queen of wands, king of wands, page of cups, knight of cups, queen of cups, king of cups, page of swords, queen of swords, king of swords, page of coins, queen of coins, king of coins

knight of wands
This card represents a man with a bold, passionate personality, likely born between July 12th and August 11th, who wants to sweep you off your feet

page of cups
This card represents a young man or woman with a watery, dreamy demeanor, likely born a Libra, Scorpio, or Sagittarius, who wants to start a new relationship with you

knight of cups
This card represents a man with an emotional, sensitive personality, likely born between October 13th and November 11th, who wants you to rally around his latest passionate cause

queen of swords
This card represents a woman with an artistic, intellectual nature, likely born between September 12th and October 12th, who uses clever, positive communication to sway others to her point of view

king of swords
This card represents an older man with an insightful, deliberate spirit, likely born between May 11th and June 10th, who is known for his integrity and sharp decision-making ability


---
Cluster 4 (15 items)
Closest to center:  The Hanged Man
All cards:  The Magician, The Papess/High Priestess, The Empress, The Emperor, The Pope/Hierophant, The Chariot, Strength, The Hermit, The Hanged Man, Death, Temperance, two of coins, three of coins, five of coins, knight of coins

The Hermit
A period of loneliness begins; One partner in a relationship departs; A search for love or money proves fruitless

The Magician
A powerful man may play a role in your day; Your current situation must be seen as one element of a much larger plan

three of coins
A high-dollar contract is in your future; If you work hard, you'll succeed

Death
A relationship or illness ends suddenly; Limit travel and risk-taking; General gloom and doom

The Papess/High Priestess
A mysterious woman arrives; A sexual secret may surface; Someone knows more than he or she will reveal


---

K-Means clustering has a random component, so you won't necessarily end up with the same clusters every time. With higher numbers of clusters, you tend to get clusters that only have one or two items (not ideal). A repeating pattern for me is that the court cards tend to end up in the same cluster and The Moon tends to be the center of large clusters. Not sure why!

Oracle decks

We can consider Tarot to be one exemplar of the category of "oracle decks."

In this repository, there's a file called oracle-corpus.tsv. It's a tab-separated file containing several thousand potential oracle cards, drawn from the Tarot interpretations file and a book of dream interpretations.

This section of the notebook performs the same analysis we performed on the Tarot deck, but on this TSV instead. It'll work with any two-column TSV, though!

Because this TSV contains some duplicate card names, the code below appends a random four-digit number to each duplicate's name in order to disambiguate it. The numbers don't actually mean anything.
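If a TSV of your own has quoted fields, a plain tab split can misparse it; Python's built-in csv module is more robust. A minimal sketch, using an in-memory string as a stand-in for the actual file:

```python
import csv
import io

# In-memory stand-in for a two-column TSV like oracle-corpus.tsv
sample = "blog\tyour popularity\ntractor\tyour resourcefulness and ingenuity\n"
rows = [tuple(row) for row in csv.reader(io.StringIO(sample), delimiter="\t")]
print(rows[0])  # → ('blog', 'your popularity')
```

The plain split works fine for this corpus; the csv version just degrades more gracefully on other files.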

In [40]:
already_seen = set()
deck = []
for line in open("./oracle-corpus.tsv"):
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # split on the first tab only, in case an interpretation contains tabs
    card, interp = line.split("\t", 1)
    # disambiguate duplicate card names with a random four-digit suffix
    if card in already_seen:
        card += "/%04d" % random.randrange(10000)
    already_seen.add(card)
    deck.append((card, interp))
In [41]:
random.sample(deck, 10)
Out[41]:
[('petticoat', 'your reluctance in revealing something about yourself'),
 ('blog', 'your popularity'),
 ('photo_booth',
  'things that are thought to be private may not be so private'),
 ('tractor', 'your resourcefulness and ingenuity'),
 ('mink', 'value, warmth, riches, or luxury'),
 ('sushi', 'you need to acknowledge your spiritual side'),
 ('monster/8338', 'aspects of yourself that you find repulsive and ugly'),
 ('post-it_note', 'there is something that you need to make a mental note of'),
 ('devil/6041', 'you will succeed in defeating your enemies'),
 ('erosion', 'a situation or relationship that is wearing away')]

A dictionary to look up interpretations for cards:

In [42]:
deck_lookup = dict(deck)

Look up the meaning of the flesh card:

In [43]:
deck_lookup['flesh']
Out[43]:
'a heightened sense of feeling and vitality'

Finally, a list of just the card names:

In [44]:
deck_labels = [item[0] for item in deck]

Oracle nearest neighbors

Building a nearest neighbors index:

In [45]:
deck_embeddings = embeddings([interpretation for card, interpretation in deck])
In [46]:
deck_nn = SimpleNeighbors(300)
for vec, (card, interpretation) in zip(deck_embeddings, deck):
    deck_nn.add_one(card, vec)
deck_nn.build(50)

The .nearest() method returns the items from the index whose vectors are closest to the vector you pass in. The code in the cell below uses the sentence_summary() function (defined earlier in the notebook) to make a vector for the sentence you type in, then returns the cards whose interpretations are most similar to it. The number controls how many cards are returned.

In [47]:
deck_nn.nearest(sentence_summary("everything will be fine!"), 5)
Out[47]:
['tart', 'face/5958', 'funeral', 'comfort', 'luck']

The same idea, but also prints out the interpretation for the card:

In [48]:
for item in deck_nn.nearest(sentence_summary("everything will be fine!"), 5):
    print(item)
    print(deck_lookup[item])
    print()
tart
things are going well for you

face/5958
you need to come clean about some matter

funeral
something in your life needs to put to rest or put aside so that you can make room for something new

comfort
Appreciating fine food, fine wine, beautiful art, beautiful bodies, or any of the better things in life

luck
things will look up for you

To get neighbors for a particular card:

In [49]:
deck_nn.neighbors('luck', 5)
Out[49]:
['luck', 'lens', 'tart', 'joint', 'staring']

To get neighbors for a random item in the corpus:

In [50]:
deck_nn.neighbors(random.choice(deck_nn.corpus), 5)
Out[50]:
['pole_vaulting', 'trough', 'pineapple', 'roommate', 'boxing_glove']

Oracle deck interpretations

Because we used the same embedding procedure for both our oracle deck cards and the Tarot deck, we can find oracle cards that are close in meaning to Tarot cards:

In [51]:
for item in deck_nn.nearest(tarot_nn.vec('The Fool'), 5):
    print(item)
    print(deck_lookup[item])
    print()
sit_up
you need to pay better attention to something in your life, like a relationship, school, work, family, or project

gear/8969
you are ready to move forward with a new project in your life

astral_projection
you are looking at things from a whole new perspective

new_year
prosperity, hope, new beginnings and an opportunity to make a fresh start

originality/9359
Putting old things together in new and exciting ways

Looking up oracle cards halfway between the Tarot cards ace of coins and Death:

In [52]:
src = 'ace of coins'
dest = 'Death'
for item in deck_nn.nearest(average(tarot_nn.vec(src), tarot_nn.vec(dest)), 5):
    print(item)
    print(deck_lookup[item])
    print()
path
you need to give serious attention to the direction you are heading in your personal and/or business life

comic
you refuse to see the problems that exist in your life and only want to focus on the good times

bet
you are taking a risk in a relationship or work situation which may not be such a wise choice

outbound
Taking what you want without concern for the needs of others

camper
you need to move on with some situation or some aspect of your life

This code picks two cards at random and prints both of them, along with the card halfway between them in meaning.

In [67]:
src = deck_nn.vec(random.choice(deck_nn.corpus))
dest = deck_nn.vec(random.choice(deck_nn.corpus))
avg = average(src, dest)
src_card = deck_nn.nearest(src, 1)[0]
avg_card = deck_nn.nearest(avg, 1)[0]
dest_card = deck_nn.nearest(dest, 1)[0]
print(src_card)
print(deck_lookup[src_card])
print()
print(avg_card)
print(deck_lookup[avg_card])
print()
print(dest_card)
print(deck_lookup[dest_card])
herd
you are a follower

sunglasses
you are having a hard time getting to know this person

pacifier/1560
you are trying to "suck up to" someone in your waking life

Visualizing oracle space

This is the same t-SNE code as above, but applied to the oracle card embeddings.

Note that in the code below, I'm using an even smaller subset of the data. (That's what the [:2000] is doing—just using the first 2000 samples.) This is because t-SNE is slow, and so is drawing its results.

In [68]:
from sklearn.manifold import TSNE
deck_mapped_embeddings = TSNE(n_components=2,
                              metric='cosine',
                              init='pca',
                              learning_rate=75,
                              perplexity=15,
                              n_iter=1000,
                              verbose=1).fit_transform(deck_embeddings[:2000])
[t-SNE] Computing 46 nearest neighbors...
[t-SNE] Indexed 2000 samples in 0.012s...
[t-SNE] Computed neighbors for 2000 samples in 0.164s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2000
[t-SNE] Computed conditional probabilities for sample 2000 / 2000
[t-SNE] Mean sigma: 0.137747
[t-SNE] KL divergence after 250 iterations with early exaggeration: 84.509010
[t-SNE] KL divergence after 1000 iterations: 1.874479

Because there are more labels, we need to make the image larger so we can see them all clearly. You may need to save the image and open it in an image viewing tool (like macOS Preview) to see everything comfortably.

In [69]:
disp_tsne(deck_mapped_embeddings, deck_labels[:2000], 32)
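Predicting from user-supplied text

As the cells above show, predicting cards for arbitrary text boils down to embedding the text with sentence_summary() and asking the index for the nearest cards: deck_nn.nearest(sentence_summary(text), n). The sketch below shows the underlying nearest-by-cosine idea in isolation, with invented three-dimensional vectors standing in for real sentence embeddings (the card names and numbers here are made up for illustration):

```python
import numpy as np

# Invented toy "embeddings" standing in for real 300-dimensional sentence vectors
toy_deck = {
    "luck":    np.array([0.9, 0.1, 0.0]),
    "funeral": np.array([-0.8, 0.2, 0.1]),
    "comfort": np.array([0.7, 0.6, 0.1]),
}

def draw_card(query_vec, deck_vectors):
    """Return the card whose vector is most cosine-similar to query_vec."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(deck_vectors, key=lambda card: cosine(query_vec, deck_vectors[card]))

print(draw_card(np.array([1.0, 0.0, 0.0]), toy_deck))  # → luck
```

With the real index, the equivalent call is deck_nn.nearest(sentence_summary("everything will be fine!"), 5), as demonstrated earlier in the notebook.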