PyGotham Talk, 10/6/17

Run to build: jupyter nbconvert PyGothamTalk.ipynb --to slides --post serve

In [33]:
## 1
from lxml import html
from utils import *
import requests
import calendar
import json
import string
import re

## 2
from __future__ import print_function
from PIL import Image
#brew install tesseract
import pytesseract
import sys

## 3

## 4
import spacy

import urllib
import functools
import itertools
import numpy as np
import nltk
from nltk.stem.porter import *
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
stemmer = PorterStemmer()

from PyDictionary import PyDictionary
dictionary=PyDictionary()

## 5
import pandas as pd

In [34]:
def sortVocab(maxlen):
sortedvocab = {}
keys = []
for i in [w for w in nlp.vocab if w.has_vector and w.orth_ == re.sub('[^A-Za-z]','',w.orth_) and w.orth_.islower() and len(w.orth_) <= maxlen]:
k = len(i.orth_)
if k not in keys:
sortedvocab[k] = []
keys.append(k)
sortedvocab[k].append(i)
return sortedvocab
cosine = lambda v1, v2: np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
vocab = sortVocab(15)

In [92]:
def getHitCountDemo(clues):
success = []
tally = []
for i in clues:
score = [0 for x in cand_methods]
for index,j in enumerate(cand_methods):
if i.get(j) and re.sub(" ","",i['answer'].lower()) in [x.lower() for x in i[j]]:
score[index] = 1
success.append(i)
tally.append(score)
print('hits:',len(success),'/',len(clues),'=',len(success)/len(clues))
return success,tally

def getSpacyFormulationsDemo(clue):
formulations = []
formulations.append([nlp.vocab[x] for x in clue.split() if x not in stopwords.words('english')])
formulations.append([nlp.vocab[x.lower_] for x in nlp(clue) if x.pos_ == "NOUN" or x.pos_ == "PROPN"])
formulations.append([nlp.vocab[x.lower_] for x in nlp(clue) if x.pos_ is not "PART"])
formulations.append([nlp.vocab[x.lemma_] for x in nlp(clue.lower()) if x.tag_ in ["NNS","NNPS","NN","NNP"]])
#     if clue.find(',') != -1:
#         formulations += getSpacyFormulationsDemo(clue[:clue.find(',')])
#     if clue.find(' (') != -1:
#         formulations += getSpacyFormulationsDemo(clue[:clue.find(' (')])
return formulations

In [80]:
import warnings; warnings.simplefilter('ignore')
import time


# Solving a Crossword Puzzle the Hard Way¶

## Alec Barrett¶

### October 7, 2017¶

Thank you for being here. Thanks to the organizers of PyGotham for putting on this conference and having me speak. My name is Alec Barrett and by day I'm a JavaScript developer at Two-N, a data visualization firm here in NYC.

The project I’d like to talk to you about originated about eight months ago, when I was studying at the Recurse Center. In accordance with RC's mission, I wanted to become a better programmer, and I wanted to learn Python.

Meanwhile, I’ve been regularly doing the New York Times crossword puzzle for a couple of years now. Raise your hand if you do a crossword puzzle at least once a week. I also wanted to get better at solving crossword puzzles, and spend some time thinking about how the clues and answers work. So I set out to build a Python program that could solve the New York Times crossword puzzle.

### Outline¶

1. NYT Crossword
3. NLP
4. The Solver
5. Conclusion

## I. New York Times Crossword¶

### Puzzle features¶

• "American-style"
• Get more difficult as the week goes on
• 15 x 15 every day except Sunday (21 x 21)

Some of you may already be familiar with the New York Times and other “American” style crossword puzzles (i.e., every letter is part of a word that goes Across and one that goes Down). There’s a puzzle each day of the week, and they get more difficult as the week goes on, with Monday being the easiest and Saturday the hardest. Sunday is in between but is also larger, at 21x21 squares rather than the 15x15 grid for the rest. The number of clue-answer pairs in a typical 15x15 grid is in the high 70s or low 80s. This past Monday’s puzzle had 76.

### Relevant skills¶

• Large vocabulary
• Encyclopedic knowledge
• Lateral thinking/word association
• Filter by length and known letters

What makes a crossword puzzle difficult? Length of words, obscurity of answers, and use of wordplay or words with multiple meanings.

Conversely, what makes a solver good at crosswords is a large vocabulary, an encyclopedic knowledge of names and terms from a wide range of topics, the ability to think laterally and across multiple associations, and the ability to constrain one’s thoughts based on the length of words and letters, if any, already in the puzzle.

A few more idiosyncrasies about crossword puzzles are worth mentioning: clues contain contextual information about the corresponding answer: singular or plural, verb tense, abbreviations. Clues can refer to the answer to other clues, e.g., “28-Across: Slippery 1-Across”. That's impossible to answer confidently without knowing first what 1-Across's answer is, or at least what its clue is.

### Computer strengths...¶

• Searching a corpus for a word
• Filtering a list based on a set of conditions

### ...and weaknesses¶

• Interpreting puns
• Considering pairs of answers together

Some of these constituent tasks are better suited to a computer than a human: searching a thesaurus, for example, or filtering a long list of words based on length and matching letters in specific places. Others, like interpreting puns or considering pairs of answers together, are not impossible, but certainly more of a stretch for the machine.

### Strategy¶

• Generate a list of candidates
• Identify the correct answer based on intersecting letters

### Sources of candidates¶

• Google Knowledge Graph search API
• Wikipedia search API
• Dictionary (PyDictionary)
• WordNet (NLTK)
• Word Vectors (spaCy)

My strategy was to write a program that would generate, for each clue in the puzzle, a list of candidate answers that fit the size of the corresponding space. I believe that given a list of possible answers for each word that includes the correct answer, the program should be able to identify the correct answer based on the unique combination of other candidates that contain the same letters. For example, we would be more confident about the word “FISH” in 1-Across if we also found candidates starting with “F”,”I”,”S”, and “H” in the intersecting Down spaces.

I used five different methods to build up this list of candidates:

• Google Knowledge Graph search API
• Wikipedia search API
• Dictionary (PyDictionary)
• Wordnet (NLTK)
• Vectors (spaCy)

### Sources of candidates¶

#### Encyclopedias¶

• Google Knowledge Graph search API
• Wikipedia search API

#### Dictionaries¶

• Dictionary (PyDictionary)
• WordNet (NLTK)
• Word Vectors (spaCy)

I think about these methods in two groups, encyclopedias and dictionaries. The first two sources will try to find answers that are informationally related to their clues: that the Taj Mahal is made of marble, or that there is a physicist Enrico whose last name is Fermi. These will often involve proper nouns, either in the clue or answer or both.

The other three will try to find answers that are semantically related to their clues: a term for "bundles of hay" is "bales", or an "artery" is another word for a "major highway". These will tend to be words that could be found in a dictionary, though it is with these words that tenses and pluralization matter more.

### When to use each¶

Condition Example Tools
References another cell Slippery 1-Across Skip
Single word Stops Dictionary, WordNet, Vectors
Space in the clue ___ of Sandwich KG, Wiki, Vectors
Else... Gas company famous for its toy trucks KG, Wiki, Vectors
... and if we can extract one word Distort, as data + Dictionary, WordNet

I also made some determinations about which methods to use for which types of clues. First of all, there are practical limitations: it doesn’t make sense to search for synonyms if there are multiple meaningful words in the clue. Secondly, some clues provide a blank space, suggesting that there is either a common phrase or a proper name that the answer will complete.

This table shows the different conditions my program checks for when deciding which methods to apply to each clue.

I think the most interesting, not to mention trendy, of these approaches are the natural language processing libraries, NLTK and spaCy, and I’ll talk about those methods in detail. And I don’t want to give anyone false hope, so I’ll spoil the surprise right now: my program, in its current form, does not solve a full puzzle. I’ll explain later what some of its limitations are. But some of the answers it is able to generate are pretty remarkable, and it gives me hope that these methods, with continued iteration and refinement, could get there eventually.

## III. Natural Language Processing (NLP)¶

### Two NLP Libraries¶

1. Wordnet (NLTK)
2. Word Vectors (spaCy)

The first of these two libraries, Natural Language Toolkit (NLTK), provides access to Wordnet, self-described as a “lexical database of English.” I first learned about NLTK from a fellow Recurser building a project with Markov chains, a different sort of word association task. NLTK has a repository of large English language corpuses for tasks like these.

### NLTK: Relationships between words¶

• Hypernyms (less specific)
• Root hypernyms (much less specific)
• Hyponyms (more specific)
• Member holonyms (more/equally specific)

For my purposes, Wordnet is useful in that it exposes relationships between words more complex than just synonyms and antonyms. Wordnet lets my program take a one-word clue and explore a literal network of words

• Hypernyms: things of which this is an example (less specific)
• Root hypernyms: the most general thing of which this is an example (less specific)
• Hyponyms: examples of this (more specific)
• Member holonyms: things of which this is a part (more/equal specificity)
In [83]:
def noSpace(word):
return re.sub('_','',word)

def searchWordnetSpaceDemo(synset, length, candidates=set(), iteration=2):
print('--> ',synset,"(iteration",iteration,")")
if synset.lemma_names():
candidates.update({
noSpace(lemmaname)
for lemmaname
in synset.lemma_names() if noSpace(lemmaname) == length
})
for attr in ['root_hypernyms','member_holonyms','hyponyms','hypernyms']:
if getattr(synset,attr)():
for nym in getattr(synset,attr)():
results = { noSpace(lemma.name())
for lemma in nym.lemmas()
if len(noSpace(lemma.name())) == length }
if len(results):
print(attr,results)
if iteration < 3:
searchWordnetSpaceDemo(nym,length,candidates,iteration+1)


In [36]:
synonyms = wn.synsets('somersault')
synonyms

Out[36]:
[Synset('somersault.n.01'), Synset('somersault.v.01')]

In [37]:
synonyms[1].hypernyms()

Out[37]:
[Synset('roll_over.v.01')]
In [38]:
synonyms[1].root_hypernyms()

Out[38]:
[Synset('move.v.03')]
In [58]:
for syn in synonyms:
searchWordnetSpaceDemo(syn,4)

-->  Synset('somersault.n.01') (iteration 2 )
-->  Synset('entity.n.01') (iteration 3 )
-->  Synset('flip-flop.n.04') (iteration 3 )
hypernyms {'flip'}
-->  Synset('tumble.n.01') (iteration 3 )
hyponyms {'flip'}
-->  Synset('somersault.v.01') (iteration 2 )
root_hypernyms {'move'}
-->  Synset('move.v.03') (iteration 3 )
root_hypernyms {'move'}
hyponyms {'stir'}
hyponyms {'take'}
hyponyms {'beat'}
hyponyms {'flap', 'beat'}
hyponyms {'bolt'}
hyponyms {'buck', 'jerk'}
hyponyms {'tilt', 'cant'}
hyponyms {'tilt'}
hyponyms {'chop'}
hyponyms {'roil', 'moil', 'boil'}
hyponyms {'duck'}
hyponyms {'exit'}
hyponyms {'flex', 'bend'}
hyponyms {'funk'}
hyponyms {'flip'}
hyponyms {'flux', 'flow'}
hyponyms {'grab'}
hyponyms {'jerk'}
hyponyms {'jolt'}
hyponyms {'jump', 'leap'}
hyponyms {'jump', 'leap'}
hyponyms {'lean', 'list'}
hyponyms {'hurl'}
hyponyms {'mill'}
hyponyms {'mope'}
hyponyms {'give'}
hyponyms {'beat'}
hyponyms {'flap', 'roll', 'wave'}
hyponyms {'feed', 'flow'}
hyponyms {'part'}
hyponyms {'slip'}
hyponyms {'snap'}
hyponyms {'snap'}
hyponyms {'snap'}
hyponyms {'jump'}
hyponyms {'slip'}
hyponyms {'stir'}
hyponyms {'trip'}
hyponyms {'sail'}
hyponyms {'turn'}
hyponyms {'turn'}
hyponyms {'worm'}
-->  Synset('roll_over.v.01') (iteration 3 )
root_hypernyms {'move'}
hypernyms {'turn'}


### NLTK: Stopwords¶

• Remove common words
In [40]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']


Another application of NLP for this project was parsing the clues to separate high-signal words from filler material. NLP has this concept of stop words, common words that could be removed while preserving a lot of the original meaning. This is NLTK's built-in list of English stopwords.

### NLTK: Parts of Speech¶

• Extract most relevant information

In [41]:
print(nltk.pos_tag(nltk.word_tokenize("Butcher's offerings")))

[('Butcher', 'NNP'), ("'s", 'POS'), ('offerings', 'NNS')]

In [42]:
print(nltk.pos_tag(nltk.word_tokenize("butcher's offerings")))

[('butcher', 'NN'), ("'s", 'POS'), ('offerings', 'NNS')]


In [43]:
clue = "Jump in an ice rink"
[nlp.vocab[x.lemma_].lower_ for x in nlp(clue.lower()) if x.tag_ in ["NNS","NNPS","NN","NNP"]]

Out[43]:
['jump', 'ice', 'rink']

Both NLTK and spaCy also provide part-of-speech tagging, which is useful because it lets me weed out words that were unlikely to be directly related to the answer, but also because crossword clues are carefully written with hints about the answer. As I mentioned earlier, a plural clue indicates a plural answer, past tense indicates past tense, so it is important for the program to know these details. Take for example, this Monday’s 41-Down, “Butcher’s offerings”. Given strict instructions to filter out anything that isn’t 5 letters, the program wouldn’t give “MEAT” a second thought unless it realized that it should be counting the “S” on the end.

This was an important level of processing not just for the “dictionary” methods but also for the “encyclopedia” methods. The search terms I pass to Knowledge Graph and Wikipedia include different formulations of the clues -- one without stop words, one with only nouns, one with only nouns, adjectives, and adverbs -- in order to create the greatest likelihood that those searches return text that contains the correct answer.

### Two NLP Libraries¶

1. Wordnet (NLTK)
2. Word Vectors (spaCy)

### spaCy: Word Vectors¶

• Million-word dictionary
• GloVe
• 300-element numpy array

The second NLP library I used, spaCy, describes itself as “industrial-strength natural language processing.” Among its many features is an API for word vectors, trained using the word2vec algorithm that you may have heard of. It has a million-word dictionary built on a project out of Stanford called Glove (Global Vectors for Word Representation). In order to perform mathematical operations on words, they are represented as vectors in 300-dimensional space, which for practical purposes means an array of 300 numbers. None of these values has any inherent meaning, but they can be added and subtracted.

### spacy: Word Vectors¶

• King - Man + Woman = Queen
In [44]:
nlp('king').vector

Out[44]:
array([  3.15420002e-01,  -3.50679994e-01,   4.29230005e-01,
-5.38250029e-01,  -1.84799999e-01,  -3.10820013e-01,
2.91960001e-01,  -7.10300028e-01,  -2.38670006e-01,
1.84710002e+00,  -3.64459991e-01,  -5.12820005e-01,
1.22100003e-01,   3.89090002e-01,  -7.32040033e-02,
3.54619995e-02,   3.32890004e-01,   6.64659977e-01,
2.71749999e-02,   4.20210004e-01,  -1.45199999e-01,
3.79909992e-01,  -6.05199993e-01,   1.06950000e-01,
-6.47159994e-01,  -1.07389996e-02,  -3.97540003e-01,
3.88570011e-01,  -2.01340005e-01,   6.98130012e-01,
-3.24110001e-01,   7.30849981e-01,  -1.09300002e-01,
-2.35110000e-01,   1.84819996e-01,  -1.15950003e-01,
-7.10030019e-01,  -2.29739994e-01,  -4.19790000e-01,
8.10039975e-03,  -1.05039999e-01,  -4.48020011e-01,
-7.39279985e-02,  -4.23799992e-01,   2.84819990e-01,
-7.45169967e-02,   9.81609970e-02,   6.46019995e-01,
-2.58320004e-01,  -2.04520002e-02,  -6.68630004e-02,
5.15009999e-01,   1.67579994e-01,   1.23290002e-01,
1.96360007e-01,   1.19580001e-01,  -1.82960004e-01,
-1.43250003e-01,  -2.77579993e-01,   5.05970009e-02,
-6.61220029e-02,  -1.89199999e-01,   3.33000004e-01,
2.53190011e-01,   6.63550019e-01,   6.67349994e-01,
4.99689996e-01,   1.54809996e-01,  -8.42470005e-02,
-2.29470000e-01,  -6.83669984e-01,  -2.97829986e-01,
-1.86509997e-01,  -4.71210003e-01,   1.82720006e-01,
-3.26040000e-01,  -6.80299997e-02,   7.00730026e-01,
3.31589997e-01,   7.03930035e-02,  -7.69869983e-01,
5.90690017e-01,   2.05919996e-01,   1.79759994e-01,
6.95249997e-03,   5.78549989e-02,   7.20470011e-01,
-7.72490025e-01,  -5.41880012e-01,  -1.21890001e-01,
-3.17339995e-03,  -1.59600005e-01,   1.69699997e-01,
-1.25459999e-01,   8.70689988e-01,  -4.64780003e-01,
-1.93020001e-01,  -4.56180006e-01,  -1.54190004e-01,
8.11900020e-01,  -2.05440000e-01,   3.94540012e-01,
-3.11780006e-01,  -6.43180013e-02,  -4.44430001e-02,
-5.83379984e-01,  -1.47919998e-01,   1.70830004e-02,
8.32390010e-01,  -1.12800002e-01,   5.78260012e-02,
1.70240000e-01,  -1.36350006e-01,  -2.88940012e-01,
-4.05900002e-01,  -5.06849997e-02,   4.98560011e-01,
6.08850010e-02,   1.94370002e-01,  -1.98109999e-01,
-2.23350003e-01,  -2.59089991e-02,   3.98460001e-01,
4.40869987e-01,   2.31950004e-02,   9.86659974e-02,
-1.30040005e-01,  -2.03390002e-01,  -4.29580003e-01,
-7.97600020e-03,  -3.20160002e-01,  -4.10939991e-01,
-1.03040002e-01,  -7.55649984e-01,   1.77480001e-02,
-2.00369999e-01,   1.71849996e-01,   2.17869997e-01,
-3.16850007e-01,   2.20679995e-02,  -2.55590010e+00,
-9.91149992e-02,   1.84340000e-01,   1.24480002e-01,
-5.94130009e-02,  -4.56489995e-02,   7.90180027e-01,
2.45560005e-01,  -1.50589999e-02,  -7.89960027e-01,
2.90870011e-01,  -3.94190013e-01,   3.76170009e-01,
1.57179996e-01,   5.13559997e-01,  -3.42189997e-01,
5.06279990e-02,  -3.32540005e-01,  -1.41570002e-01,
3.33550006e-01,   4.43980008e-01,  -2.54509985e-01,
-3.32010016e-02,  -2.09580004e-01,   3.88700008e-01,
-2.45649993e-01,   5.23909986e-01,   4.32469994e-01,
-4.17010009e-01,   2.90309995e-01,  -7.80009985e-01,
3.00999992e-02,  -6.14459999e-02,  -1.40290007e-01,
-5.53539991e-01,  -1.91750005e-01,   6.72789991e-01,
-1.11040004e-01,  -3.54860008e-01,  -2.86009997e-01,
1.17200002e-01,  -4.50210005e-01,   1.40039995e-01,
-5.74840009e-01,  -2.25309998e-01,   4.15719986e-01,
-1.59500003e-01,  -2.78770000e-01,   7.97849968e-02,
1.91200003e-02,  -9.83569980e-01,  -5.69980025e-01,
-3.40230018e-02,   1.73819996e-02,  -1.71569996e-02,
-2.82110006e-01,   1.55729994e-01,  -1.35560006e-01,
-2.62959987e-01,  -7.45710015e-01,   1.20150000e-01,
5.42339981e-01,   5.67829981e-02,  -7.56750032e-02,
2.18199998e-01,  -2.56790012e-01,   2.35520005e-01,
-2.71109995e-02,  -1.93419993e-01,  -3.10880005e-01,
-1.05999999e-01,   4.95119989e-01,   5.79320006e-02,
3.87730002e-01,   9.31600034e-02,  -1.37820005e-01,
2.42440000e-01,   3.80980015e-01,   9.11089999e-04,
8.83379996e-01,   4.38230008e-01,  -7.70410001e-02,
1.15410000e-01,   3.47020000e-01,   5.97850025e-01,
6.70120001e-01,  -6.09529987e-02,  -4.38719988e-02,
-4.07999992e-01,   7.57210016e-01,   2.47730002e-01,
8.89260024e-02,  -1.84929997e-01,  -5.23389995e-01,
8.58089998e-02,  -6.08799994e-01,  -7.74630010e-02,
-2.68290013e-01,  -3.90210003e-01,  -1.50020003e-01,
5.42970002e-01,  -4.10759985e-01,  -9.52150002e-02,
-2.97870010e-01,   1.00409999e-01,  -3.77739996e-01,
7.55110025e-01,  -4.39099997e-01,  -6.17219985e-01,
-1.03600001e+00,   6.96510017e-01,   1.41570002e-01,
-4.45329994e-01,   3.27019989e-01,   3.83060016e-02,
2.67650008e-01,   5.42420000e-02,  -3.02419998e-02,
-4.51330006e-01,   6.25050021e-03,   2.75040001e-01,
-5.24130017e-02,  -1.98699996e-01,  -1.78690001e-01,
-2.46580005e-01,  -3.73690009e-01,   2.61739999e-01,
4.14819986e-01,  -5.92769980e-01,   6.14459999e-02,
6.62610009e-02,   1.09700002e-01,  -1.43879995e-01,
-3.24420005e-01,  -3.90160014e-04,  -2.13919997e-01,
3.29629987e-01,   5.04019976e-01,   1.34540007e-01,
-5.61330020e-01,   1.04219997e+00,   5.89850008e-01,
1.44730002e-01,   1.77450001e-01,   1.61599994e-01,
3.32300007e-01,   2.29090005e-01,   1.57739997e-01,
-3.54629993e-01,  -4.76419985e-01,  -2.58219987e-01,
2.36770004e-01,  -4.02550012e-01,  -3.53639990e-01,
-1.66970000e-01,   7.06770003e-01,   8.42719972e-02,
1.14270002e-01,   5.82210004e-01,  -1.05590001e-01], dtype=float32)

The most famous illustration of word vectors, maybe at this point a little clichéd, is the arithmetic expression King - Man + Woman = Queen. Has anyone come across this before?

In [77]:
queen1 = nlp('king').vector - nlp('man').vector + nlp('woman').vector
queen2 = nlp('queen').vector


I think it’s worth unpacking this a little. Of course, if you take the vector for King, subtract the vector for Man, and add the vector for Woman, the result isn’t a word, it’s a vector. Is it the exact vector for Queen? We can show that it is not.

In [78]:
queen1 - queen2

Out[78]:
array([ 0.10458702, -0.05152999, -0.01085299,  0.40603995,  0.111525  ,
0.03181005, -0.18277001,  0.10793996,  0.22586   ,  0.42549992,
-0.62051803,  0.09305897, -0.0758817 , -0.29067168, -0.29784101,
-0.43369001, -0.44859397,  0.21168   , -0.17273501,  0.24211   ,
0.20211001, -0.15502006, -0.04844499, -0.202636  , -0.21129996,
0.45776799,  0.03138995,  0.13294101, -0.53480601, -0.07134694,
-0.157518  , -0.05403006, -0.14246997, -0.77390599,  0.15866998,
-0.12601201, -0.19204   , -0.40347007,  0.05978   ,  0.52036041,
0.37191999, -0.252379  , -0.097138  , -0.40504098,  0.25123   ,
-0.03785798, -0.11933102, -0.00672996,  0.40257999,  0.02721703,
-0.29956898,  0.34834102, -0.15371901, -0.14056298,  0.17291501,
0.73967993, -0.0257776 , -0.28438202, -0.33745399,  0.12431702,
0.063307  , -0.39151499, -0.24294749,  0.3378177 ,  0.37893206,
0.14127994,  0.70388097,  0.021424  ,  0.142003  ,  0.20465   ,
-0.36599994, -0.14310999, -0.17243698, -0.00424001,  0.67148   ,
-0.17920549,  0.45753998,  0.17486003, -0.23000398,  0.06431001,
0.13716793, -0.17282701, -0.32512403,  0.22375101, -0.3474555 ,
0.44771501,  0.28867   , -0.14638105, -0.04995   , -0.437648  ,
-0.2236634 , -0.14245   ,  0.03281999, -0.16247103,  0.51248991,
-0.40227997, -0.150479  , -0.38445002,  0.359772  ,  0.30387995,
0.577236  ,  0.53445101,  0.281598  ,  0.126359  , -0.019406  ,
-0.26014996, -0.15996996, -0.15767002,  0.00154799,  0.195612  ,
-0.13352397,  0.01087999, -0.080301  , -0.20445602, -0.11846301,
-0.371925  ,  0.39347702,  0.26368502,  0.39265701,  0.48374   ,
0.06531   ,  0.068128  ,  0.11742002,  0.04229499,  0.10026699,
0.30375999,  0.06063001,  0.39369851, -0.10366529,  0.065814  ,
0.14065003,  0.17174399, -0.20236002, -0.55088001, -0.72287202,
-0.48885   , -0.37717   ,  0.07013199, -0.52825999,  0.096489  ,
0.59859991, -0.13812901, -0.11418399, -0.190035  ,  0.06799701,
0.02872499,  0.38754201,  0.00787   , -0.62338901, -0.09111011,
-0.22363999, -0.1886197 , -0.20118999,  0.22608899, -0.24934301,
0.08535001, -0.27039596,  0.30038005, -0.090203  , -0.14802799,
0.14603001,  0.21248001,  0.118833  , -0.07153228, -0.12797996,
-0.274443  ,  0.30433598,  0.29837996, -0.01640302,  0.11600998,
-0.33268997, -0.056754  ,  0.13773698, -0.18801799, -0.51105094,
-0.25610259, -0.07734999, -0.457643  ,  0.12696004, -0.25476858,
0.01485402, -0.27168003, -0.09315271, -0.18197   ,  0.46563497,
0.34944999,  0.27662   , -0.138596  ,  0.200928  , -0.34992003,
-0.48564997, -0.60399902, -0.18144301, -0.11616989,  0.129803  ,
0.02417099,  0.05545059,  0.117446  , -0.03544599, -0.57339001,
0.44310898,  0.33150995,  0.01238599, -0.21157703, -0.03491596,
0.26410997, -0.22768001, -0.25299799, -0.23517999,  0.48754001,
0.19483501, -0.27316999, -0.44070199,  0.36702901,  0.09925799,
-0.06908001, -0.14320281,  0.22666103,  0.2794511 ,  0.29843   ,
0.21248499, -0.63584298,  0.20785001,  0.48329499, -0.47914696,
-0.03455502,  0.34644902, -0.37480602, -0.15627   ,  0.12277907,
-0.04933499,  0.005468  ,  0.00519997, -0.37172398, -0.175451  ,
-0.18385059, -0.21175501, -0.31394401,  0.07360198, -0.01590204,
-0.17416   , -0.00090003,  0.11262399, -0.48282   , -0.10517   ,
0.05565304,  0.32160503, -0.24056101, -0.30389994, -0.50732309,
0.33911803, -0.23648998,  0.06108901,  0.23029798, -0.02688998,
0.08346   ,  0.17561206,  0.331848  , -0.09330803,  0.2918205 ,
0.277062  , -0.32242298, -0.002744  ,  0.36982   ,  0.51170999,
-0.39322001, -0.16557002, -0.18774   , -0.01507998, -0.28465101,
-0.07072806, -0.05853601, -0.06321001, -0.09849399, -0.09514015,
-0.23703995, -0.17930999,  0.38357297,  0.01018202,  0.10888296,
0.29964393,  0.12595999,  0.60580498,  0.04320699,  0.18855999,
0.63618499, -0.18775499,  0.42126399, -0.15406296, -0.36692598,
0.094318  ,  0.02511001,  0.06609299, -0.17440999,  0.00357999,
0.08757752,  0.04765201,  0.27466798,  0.74391007, -0.01412702], dtype=float32)

### spaCy: Word Vectors¶

• Find n closest vectors in our vocabulary
• Vocabulary is already nested by word length
• cosine = lambda v1, v2: np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

What we’re actually left with is a brand new vector that does not represent any word in our vocabulary -- it just represents a position in 300-dimensional space. The task then becomes how to find the vocabulary words whose vectors we know that are closest to that position. To do this we use something called cosine similarity, the formula for which I found online, so I’m not going to try to explain it to you here. The point is that when someone gives the example of King - Man + Woman = Queen, what they should say "the difference between '(King - Man + Woman)'" and 'Queen' is less than the difference between that expression and any other word’s vector." But it gets wordy.

In [71]:
def getSpacyCandidatesDemo(clue,length,vocab,ret_count):
# vector for each tagged Lexeme
vecs = [x.vector for x in clue]
# sum the vectors
# exclude words already in clue
# exclude contractions
vocab = [w
for w in vocab
if w not in clue
and re.search("\'", w.lower_)==None]
# sort vocab by cosine similarity
vocab.sort(key=lambda w: cosine(w.vector, vecsum))
print({w.orth_.lower() for w in vocab[-1*ret_count:]})


In the crossword solving application, I don’t have anything to subtract, but I can add together word vectors and find other words in the vocabulary that are close by. Remember, we’re just searching vocabulary words of a given length, but that still entails taking the entire vocabulary list and sorting it in order of closeness, as determined by cosine similarity. This doesn’t feel efficient though it is plenty fast for my needs.

In [93]:
clue = "Plan that's hatched"
for formulation in getSpacyFormulationsDemo(clue):
print('--> ',[x.lower_ for x in formulation])
getSpacyCandidatesDemo(formulation,6,vocab[6],10)

-->  ['plan', "that's", 'hatched']
{'decide', 'effort', 'agenda', 'agreed', 'intend', 'scheme', 'policy', 'future', 'budget', 'should'}
-->  ['plan']
{'decide', 'effort', 'agenda', 'agreed', 'intend', 'scheme', 'policy', 'future', 'budget', 'should'}
-->  ['plan', 'that']
{'decide', 'likely', 'simply', 'reason', 'really', 'though', 'should', 'rather', 'enough', 'future'}
-->  ['plan', 'that', "'s", 'hatched']
{'wanted', 'simply', 'reason', 'really', 'saying', 'should', 'though', 'rather', 'enough', 'future'}
-->  ['plan']
{'decide', 'effort', 'agenda', 'agreed', 'intend', 'scheme', 'policy', 'future', 'budget', 'should'}


Unfortunately in this Monday’s puzzle, the only clue that was solved with word vectors was 33-Across, "Plan that's 'hatched'", though none of the other methods found the correct answer, "SCHEME", which means it’s adding unique value.

In [94]:
clue = "Counterpart to 'if', in computer science"

In [95]:
for formulation in getSpacyFormulationsDemo(clue):
print('--> ',[x.lower_ for x in formulation])
getSpacyCandidatesDemo(formulation,4,vocab[4],8)

-->  ['counterpart', "'if',", 'computer', 'science']
{'arts', 'idea', 'data', 'mind', 'kind', 'work', 'math', 'tech'}
-->  ['counterpart', 'computer', 'science']
{'arts', 'idea', 'data', 'mind', 'kind', 'work', 'math', 'tech'}
-->  ['counterpart', 'computer', 'science']
{'arts', 'idea', 'data', 'mind', 'kind', 'work', 'math', 'tech'}
-->  ['counterpart', 'to', "'", 'if', "'", ',', 'in', 'computer', 'science']
{'even', 'when', 'want', 'else', 'they', 'that', 'kind', 'what'}
-->  ['counterpart', 'computer', 'science']
{'arts', 'idea', 'data', 'mind', 'kind', 'work', 'math', 'tech'}

In [96]:
getSpacyCandidatesDemo(nlp('if'),4,vocab[4],8)

{'even', 'when', 'else', 'then', 'that', 'sure', 'what', 'does'}

In [97]:
getSpacyCandidatesDemo(nlp('counterpart'),4,vocab[4],8)

{'ikke', 'mobo', 'derp', 'mech', 'gifs', 'prob', 'mais', 'zerg'}

In [65]:
print(nlp('counterpart').vector)

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


And some weeks, word vectors do a lot better. A couple of weeks ago, in the September 18 puzzle, there was an example that I think is especially well suited to this audience. 41-Across, “Counterpart to 'if,' in computer science”. Four letters, anyone want to take a guess?

The word vector method also solved this one, and again it was alone among the five methods I used on it. It turns out that a couple of phrases get you close to the vector for "ELSE", including “if computer science” and, as it turns out, “if”. So it seems like at least in some cases, all that matters is that one of the words has a similar vector to the answer and the rest of the words don’t pull it too far in another direction away from the answer.

Some of you may have noticed this very curious set of words for "counterpart". Refusing to believe that these were actually semantically related to each other, I inspected the vector, and found...so not every word has a vector, and obviously this method won't work for them.

In [51]:
getSpacyCandidatesDemo(nlp('dairy animal'),3,vocab[3],10)

{'eat', 'soy', 'dog', 'pet', 'pig', 'fed', 'cat', 'cow', 'egg', 'zoo'}

In [52]:
getSpacyCandidatesDemo(nlp('dairy'),3,vocab[3],10)

{'eat', 'soy', 'oil', 'pig', 'hay', 'fed', 'raw', 'cow', 'egg', 'fat'}

In [53]:
getSpacyCandidatesDemo(nlp('animal'),3,vocab[3],10)

{'dog', 'pig', 'pet', 'cat', 'toy', 'rat', 'cow', 'fox', 'fur', 'zoo'}


In another case from the same puzzle, 12-Down, “Dairy animal”, 3 letters, it turns out the vectors for “dairy” and “animal” are both pretty close to the vectors for the correct answer, which is “COW”.

## IV. The Solver¶

### Utility Functions¶

• Web scraping via BeautifulSoup
• Optical character recognition (OCR) via pytesseract

I want to briefly share some of the utility functions I built in order to create the data pipeline that gave me access to clues, answers, and puzzle grids for the New York Times crossword. I should acknowledge that the Times does not, as far as I know, provide public access to its archive, so I am grateful for the blogger behind nytcrossword.com, which I scraped using BeautifulSoup to download text and images.

There’s an obvious fragility to this much reliance on a third party resource, and in fact, the blogger behind nytcrossword.com upgraded the site and changed the format of the URLs for new posts, which caused me a brief moment of panic before I figured out what had changed.

### OCR via pytesseract¶

The images of the grids are hosted on the blog in PNG format, and in order to locate each clue and determine which answers intersected with each other, I needed to parse the image to locate the black squares. I even had some fun with Tesseract, an optical character recognition library, which allowed me to extract answers directly from the image of a completed puzzle.

While I didn’t necessarily need to use OCR to extract the answers, because I got them from the blog the same way I got the clues, I did need at least to scan the image somehow, and slice it into 15 rows and columns, in order to determine where the black squares fall. Once you know their locations, it’s fairly easy to determine the X,Y coordinates of each answer; it’s like the game Battleship: if you know the start and length, you know where the rest of it is.

In [73]:
def parseAndPrintImageDemo(n_squares, img_path):
im = Image.open(img_path)
im = Image.composite(im, Image.new('RGB', im.size, 'white'), im)
pps = im.size[0]/n_squares
for i in range(n_squares):
row = list()
for j in range(n_squares):
boxOuter = (j*pps,i*pps,j*pps+pps,i*pps+pps)
tile = im.crop(boxOuter)
if not tile.getbbox():
tileText = " "
else:
# Crop out numbers
tileInner = tile.crop((7,9,27,27))
tileText = pytesseract.image_to_string(tileInner,config="-psm 10 -l eng -c tessedit_char_whitelist="+string.ascii_uppercase)
# Hack: if unrecognized, assume 'I'
tileText = "I" if tileText == "" else tileText
row.append(tileText)
print(" ".join(row))

In [82]:
start = time.time()
parseAndPrintImageDemo(15, './images/1002-17.png')
print('elapsed: ',time.time() - start)

F I S H   F L I P   H A R S H
O H H I   E O N S   A L I K E
C O O P   M U S I C N O T E S
A P T   F U N     U S N E W S
L E T T E R G R A D E S
O E R   E E L   L O F T S
S C H E M E   C A N     R I A
C H E M I C A L S Y M B O L S
A I L     O U I   C E A S E S
B A L E S   D N A   A L T
M O V I E R A T I N G S
M A R B L E     T I S   I R A
B L O O D T Y P E S   A X E L
A B O D E   E A R L   H O E S
S A T Y R   S T Y E   A N K A
elapsed:  36.54445219039917


### How far it gets with one puzzle¶

• Less than half of clues have the correct answer among their candidates
• Monday, October 2, 2017: 42%
• Difficult to find the correct answer by cross-referencing
In [74]:
cand_methods = ['cand_kg','cand_wk','cand_vec','cand_wn','cand_dct']

clues = puzzle['clues']
success,tally = getHitCountDemo(clues)

hits: 32 / 76 = 0.42105263157894735


As I mentioned earlier, the program does not currently solve a full puzzle. The feat I have accomplished is to generate a set of candidates that contains the correct answer for about 42% of clues, in the case of this past Monday. Some weeks it's a few points higher, some a few points lower.

### Which methods are generating hits?¶

In [56]:
df = pd.DataFrame(tally,index=map(lambda x: x['answer'],success),columns=cand_methods)
df.sum(0)

Out[56]:
cand_kg      7
cand_wk     24
cand_vec     1
cand_wn      5
cand_dct     3
dtype: int64

I did some analytics on the five methods that I’m using. The first takeaway is that Wikipedia is by far the strongest. This makes some sense, because it's going to find factual clues through its encyclopedia-like function, but it also knows some things about language because of expressions or idioms that might be present in its pages, or might even have their own page.

The second takeaway is that every method is making some unique contribution.

In [57]:
df

Out[57]:
cand_kg cand_wk cand_vec cand_wn cand_dct
FISH 0 1 0 0 0
FLIP 0 0 0 1 0
APT 0 0 0 0 1
FUN 0 0 0 1 1
SCHEME 0 0 1 0 0
CAN 0 1 0 0 0
CEASES 0 0 0 1 0
BALES 0 1 0 0 0
DNA 0 1 0 0 0
MARBLE 1 1 0 0 0
IRA 1 1 0 0 0
EARL 1 1 0 0 0
ANKA 1 1 0 0 0
FEMUR 0 1 0 0 0
INS 0 1 0 0 0
PSI 0 1 0 0 0
HANSEL 0 1 0 0 0
ALONSO 1 1 0 0 0
RITE 0 1 0 0 0
HESS 0 1 0 0 0
CUD 0 1 0 0 0
FROSTNIXON 0 1 0 0 0
SASS 0 0 0 0 1
AUDI 0 1 0 0 0
MEATS 0 1 0 0 0
BALI 0 1 0 0 0
EMBODY 0 0 0 1 0
GREEK 0 1 0 0 0
MBAS 0 1 0 0 0
ALBA 1 1 0 0 0
ROOT 1 1 0 0 0
YES 0 0 0 1 0

I know how well my program is doing because I can automatically search for the correct answer in the candidates list, but my program doesn’t have access to that information -- that would defeat the purpose of creating a solver. My solver has to rely on a different method for inferring whether the answer is correct, the same one a human solver would use. It has to look at all the intersecting words and see if there is a plausible answer for each of them with the same letters in the corresponding places.

This means it's hard to determine which of the candidates is in fact the correct answer if we cannot necessarily find its letters in intersecting correct answers.

It's a little like the push-up square: everyone needs to be lifting at the same time or the whole thing never gets off the ground. There is a circular dependency where finding the correct answer depends on being able to find other correct answers, and we can only accomplish this if a substantial number of the correct answers are present.

The academic term for this is "constraint problem", and for those who are interested, there is some research out there applying this to crossword puzzles.

## V. Conclusion¶

### Cross-referencing¶

• Limited if less than 50% of answers are in the candidates
• Is there a threshold above which I can start winnowing down?

### Improving the hit rate¶

• More formulations (previously added ~10%)
• Pattern to the non-hits
• Other APIs? ML?

### Am I approaching this the wrong way?¶

My top priority for this project is to get the hit rate above 50%. Until I can do that, I don't have much faith in the program's ability to actually start solving. With that said, I don't think I need to have 100% of the answers in the candidates, because some answers can be found entirely based on the intersecting letters, the same way a human solver would do it. Given a 15-letter multi-word answer with 10 or 12 letters already solved, the program could just narrow down solely based on the real English words that fit, ignoring the clue altogether.

Some ways I might be able to improve the hit rate: first, just try more formulations. These things can be finicky, and sometimes including or excluding a punctuation mark makes the difference between whether these methods can find the answer or not.

I haven't yet analyzed if there's some systematic pattern to the clues that aren't getting found. It's possible that a different kind of API or technique for parsing them would make them more manageable for machines.

Another avenue that I hope to pursue is machine learning. With access to years’ worth of clue-answer combinations as training data, I believe the program could make some of the more indirect associations between clues and answers. The downside of this approach is that the black box of the model obscures how the program made the connections. It also intuitively seems really really hard, if you compare a task like image classification, where you might have one or a hundred image categories, to this task where the possible outputs are every English dictionary word of a certain length, plus names, phrases, abbreviations. I don't yet know enough about ML to know what the implications for the model would be.

This might also help with handling multi-word answers, clues that reference other clues, and puzzle themes. These are the kinds of things that are very hard to write explicit instructions for.

Finally, I can't help but wonder if I'm approaching this the wrong way. I'm trying to write a program to solve a crossword puzzle the way a human solves a crossword puzzle: incrementally, by looking up clues in whatever "brain" the solver has access to and finding corresponding answers that fit. But maybe I should be considering approaches that are virtually impossible for a human but much easier for a computer, which has access to the entire English language, like examining the large but finite number of crossword puzzle states and narrowing them down through a process of elimination for ungrammatical, nonsense, or nonexistent words.