Using Python to solve Regexp CrossWord Puzzles

Have a look at the amazing https://regexcrossword.com/ website.

I played during about two hours, and could manually solve almost all problems, quite easily for most of them. But then I got stucked on this one.

Soooooo. I want to use Python3 regular expressions and try to solve any such cross-word puzzles.

Warning: This notebook will not explain the concept and syntax of regular expressions, go read on about it on Wikipedia or in a good book. The Python documentation gives a nice introduction here.

Representation of a problem

Here is a screenshot from the game webpage.

As you can see, an instance of this game is determined by its rectangular size, let's denote it $(m, n)$, so here there are $m=5$ lines and $n=5$ columns.

I'll also use this easy problem:

Let's define both, in a small dictionnary containing two to four lists of regexps.

Easy problem of size $(2,2)$ with four constraints

In [1]:
problem1 = {
    "left_lines": [
        r"HE|LL|O+",   # HE|LL|O+   line 1
        r"[PLEASE]+",  # [PLEASE]+  line 2
    ],
    "right_lines": None,
    "top_columns": [
        r"[^SPEAK]+",  # [^SPEAK]+  column 1
        r"EP|IP|EF",   # EP|IP|EF   column 2
    ],
    "bottom_columns": None,
}

The keys "right_lines" and "bottom_columns" can be empty, as for easier problems there are no constraints on the right and bottom.

Each line and column (but not each square) contains a regular expression, on a common alphabet of letters and symbols. Let's write $\Sigma$ this alphabet, which in the most general case is $\Sigma=\{$ A, B, ..., Z, 0, ..., 9, :, ?, ., $, -$\}$.

For the first beginner problem, the alphabet can be shorten:

In [2]:
alphabet1 = {
    'H', 'E', 'L', 'O',
    'P', 'L', 'E', 'A', 'S', 'E',
    'S', 'P', 'E', 'A', 'K',
    'E', 'P', 'I', 'P', 'I', 'F',
}

print(f"alphabet1 = \n{sorted(alphabet1)}")
alphabet1 = 
['A', 'E', 'F', 'H', 'I', 'K', 'L', 'O', 'P', 'S']

Difficult problem of size $(5,5)$ with 20 constraints

Defining the second problem is just a question of more copy-pasting:

In [3]:
problem2 = {
    "left_lines": [
        r"(N3|TRA|N7)+",  # left line 1
        r"[1LOVE2?4]+.",  # left line 2
        r"(A|D)M[5-8$L]+",  # left line 3
        r"[^\s0ILAD]+",  # left line 4
        r"[B-E]+(.)\1.",  # left line 5
    ],
    "right_lines": [
        r"[^OLD\s]+",  # right line 1
        r"(\d+)[LA\s$?]+",  # right line 2
        r"(\-P|5\$|AM|Z|L)+",  # right line 3
        r"(\-D|\-WE)+[^L4-9N$?]+",  # right line 4
        r"[FED$?]+",  # right line 5
    ],
    "top_columns": [
        r"[2TAIL\-D]+",  # top column 1
        r"(WE|R4|RY|M)+",  # top column 2
        r"[FEAL3-5S]+",  # top column 3
        r"[^FA\sT1-2]+F",  # top column 4
        r"[LO\s\?5-8]+",  # top column 5
    ],
    "bottom_columns": [
        r"[^ILYO]+",  # top column 1
        r".+[MURDEW]+",  # top column 2
        r"[1ALF5$E\s]+",  # top column 3
        r"[\dFAN$?]+",  # top column 4
        r".+\s.+\?",  # top column 5
    ],
}

And its alphabet:

In [4]:
import string
In [5]:
alphabet2 = set(string.digits) \
    | set(string.ascii_uppercase) \
    | { ':', '?', '.', '$', '-' }

print(f"alphabet2 = \n{sorted(alphabet2)}")
alphabet2 = 
['$', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

An intermediate problem of size $(3,3)$ with 12 constraints

Defining the third problem is just a question of more copy-pasting:

In [6]:
problem3 = {
    "left_lines": [
        r"[ONE]*[SKA]",  # left line 1
        r".*(RE|ER)",  # left line 2
        r"A+[TUB]*",  # left line 3
    ],
    "right_lines": [
        r".*(O|S)*",  # right line 1
        r"[^GOA]*",  # right line 2
        r"[STUPA]+",  # right line 3
    ],
    "top_columns": [
        r".*[GAF]*",  # top column 1
        r"(P|ET|O|TEA)*",  # top column 2
        r"[RUSH]+",  # top column 3
    ],
    "bottom_columns": [
        r"(NF|FA|A|FN)+",  # top column 1
        r".*(A|E|I).*",  # top column 2
        r"[SUPER]*",  # top column 3
    ],
}

And its alphabet:

In [7]:
alphabet3 = {
    'O', 'N', 'E', 'S', 'K', 'A',
    'R', 'E', 'E', 'R',
    'A', 'T', 'U', 'B',
    
    'O', 'S',
    'G', 'O', 'A',
    'S', 'T', 'U', 'P', 'A',
    
    'G', 'A', 'F',
    'P', 'E', 'T', 'O', 'T', 'E', 'A',
    'R', 'U', 'S', 'H',

    'N', 'F', 'F', 'A', 'A', 'F', 'N',
    'A', 'E', 'I',
    'S', 'U', 'P', 'E', 'R',
}

print(f"alphabet3 = \n{sorted(alphabet3)}")
alphabet3 = 
['A', 'B', 'E', 'F', 'G', 'H', 'I', 'K', 'N', 'O', 'P', 'R', 'S', 'T', 'U']

A few useful functions

Let's first extract the dimension of a problem:

In [8]:
def dimension_problem(problem):
    m = len(problem['left_lines'])
    if problem['right_lines'] is not None:
        assert m == len(problem['right_lines'])
    n = len(problem['top_columns'])
    if problem['bottom_columns'] is not None:
        assert n == len(problem['bottom_columns'])
    return (m, n)
In [9]:
problem1
Out[9]:
{'left_lines': ['HE|LL|O+', '[PLEASE]+'],
 'right_lines': None,
 'top_columns': ['[^SPEAK]+', 'EP|IP|EF'],
 'bottom_columns': None}
In [10]:
dimension_problem(problem1)
Out[10]:
(2, 2)

Now let's write a representation of a grid, a solution (or partial solution) of a problem:

In [11]:
___ = "_"  # represents an empty answer, as _ is not in the alphabet
grid1_partial = [
    [ 'H', ___ ],
    [ ___, 'P' ],
]
In [12]:
grid1_solution = [
    [ 'H', 'E' ],
    [ 'L', 'P' ],
]

As well as a few complete grids which are NOT solutions

In [13]:
grid1_wrong1 = [
    [ 'H', 'E' ],
    [ 'L', 'F' ],
]
In [14]:
grid1_wrong2 = [
    [ 'H', 'E' ],
    [ 'E', 'P' ],
]
In [15]:
grid1_wrong3 = [
    [ 'H', 'E' ],
    [ 'O', 'F' ],
]
In [16]:
grid1_wrong4 = [
    [ 'O', 'E' ],
    [ 'O', 'F' ],
]

We also write these short functions to extract the $i$-th line or $j$-th column:

In [17]:
def nth_line(grid, line):
    return "".join(grid[line])

def nth_column(grid, column):
    return "".join(grid[line][column] for line in range(len(grid)))
In [18]:
[ nth_line(grid1_solution, line) for line in range(len(grid1_solution)) ]
Out[18]:
['HE', 'LP']
In [19]:
[ nth_column(grid1_solution, column) for column in range(len(grid1_solution[0])) ]
Out[19]:
['HL', 'EP']

A partial solution for the intermediate problem:

In [20]:
___ = "_"  # represents an empty answer, as _ is not in the alphabet
grid3_solution = [
    [ 'N', 'O', 'S' ],
    [ 'F', 'E', 'R' ],
    [ 'A', 'T', 'U' ],
]

And a partial solution for the harder problem:

In [21]:
___ = "_"  # represents an empty answer, as _ is not in the alphabet
grid2_partial = [
    [ 'T', 'R', 'A', 'N', '7' ],
    [ '2', '4', ___, ___, ' ' ],
    [ 'A', ___, ___, ___, ___ ],
    [ '-', ___, ___, ___, ___ ],
    [ 'D', ___, ___, ___, '?' ],
]

Let's extract the dimension of a grid, just to check it:

In [22]:
def dimension_grid(grid):
    m = len(grid)
    n = len(grid[0])
    assert all(n == len(grid[i]) for i in range(1, m))
    return (m, n)
In [23]:
print(f"Grid grid1_partial has dimension: {dimension_grid(grid1_partial)}")
print(f"Grid grid1_solution has dimension: {dimension_grid(grid1_solution)}")
Grid grid1_partial has dimension: (2, 2)
Grid grid1_solution has dimension: (2, 2)
In [24]:
print(f"Grid grid2_partial has dimension: {dimension_grid(grid2_partial)}")
Grid grid2_partial has dimension: (5, 5)
In [25]:
def check_dimensions(problem, grid):
    return dimension_problem(problem) == dimension_grid(grid)
In [26]:
assert check_dimensions(problem1, grid1_partial)
assert check_dimensions(problem1, grid1_solution)
In [27]:
assert not check_dimensions(problem2, grid1_partial)
In [28]:
assert check_dimensions(problem2, grid2_partial)
In [29]:
assert not check_dimensions(problem1, grid2_partial)

Two more checks

We also have to check if a word is in an alphabet:

In [30]:
def check_alphabet(alphabet, word, debug=True):
    result = True
    for i, letter in enumerate(word):
        new_result = letter in alphabet
        if debug and result and not new_result:
            print(f"The word {repr(word)} is not in alphabet {repr(alphabet)}, as its #{i}th letter {letter} is not present.")
        result = result and new_result
    return result
In [31]:
assert check_alphabet(alphabet1, 'H' 'E')  # concatenate the strings
In [32]:
assert check_alphabet(alphabet1, 'H' 'E')
assert check_alphabet(alphabet1, 'L' 'P')
assert check_alphabet(alphabet1, 'H' 'L')
assert check_alphabet(alphabet1, 'E' 'P')
In [33]:
assert check_alphabet(alphabet2, "TRAN7")

And also check that a word matches a regexp:

In [34]:
import re

As the documentation explains it:

but using prog = re.compile(regepx) and saving the resulting regular expression object prog for reuse is more efficient when the expression will be used several times in a single program.

I don't want to have to think about compiling a regexp before using it, so... I'm gonna memoize them!

In [42]:
memory_of_compiled_regexps = dict()

Now we are ready to write our "smart" match function:

In [43]:
def match(regexp, word, debug=True):
    global memory_of_compiled_regexps
    if regexp not in memory_of_compiled_regexps:
        prog = re.compile(regexp)
        memory_of_compiled_regexps[regexp] = prog
        print(f"For the first time seeing this regexp {repr(regexp)}, compiling it and storing in memory_of_compiled_regexps, now of size {len(memory_of_compiled_regexps)}.")
    else:
        prog = memory_of_compiled_regexps[regexp]
    
    result = re.fullmatch(regexp, word)
    result = prog.fullmatch(word)

    entire_match = result is not None
    # entire_match = result.group(0) == word
    if debug:
        if entire_match:
            print(f"The word {repr(word)} is matched by {repr(regexp)}")
        else:
            print(f"The word {repr(word)} is NOT matched by {repr(regexp)}")
    return entire_match

Let's compare the time of the first match and next ones:

In [44]:
%%time
match(r"(N3|TRA|N7)+", "TRAN7")
For the first time seeing this regexp '(N3|TRA|N7)+', compiling it and storing in memory_of_compiled_regexps, now of size 1.
The word 'TRAN7' is matched by '(N3|TRA|N7)+'
CPU times: user 238 µs, sys: 0 ns, total: 238 µs
Wall time: 173 µs
Out[44]:
True
In [45]:
%%time
match(r"(N3|TRA|N7)+", "TRAN8")
The word 'TRAN8' is NOT matched by '(N3|TRA|N7)+'
CPU times: user 176 µs, sys: 71 µs, total: 247 µs
Wall time: 169 µs
Out[45]:
False

Well of course it's not different for tiny test like this.

In [ ]:
match(r"(N3|TRA|N7)+", "")
In [ ]:
match(r"(N3|TRA|N7)+", "TRA")

That should be enough to start the first "easy" task.

In [47]:
%timeit match(r"(N3|TRA|N7)+", "TRA", debug=False)

%timeit re.fullmatch(r"(N3|TRA|N7)+", "TRA")
1.14 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
603 ns ± 7.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

We can see that our "memoization trick" indeed helped to speed-up the time required to check a regexp, by about a factor 2, even for very small tests like this.

First easy task: check that a line/column word validate its contraints

Given a problem $P$ of dimension $(m, n)$, its alphabet $\Sigma$, a position $i \in [| 0, m-1 |]$ of a line or $j \times [|0, n-1 |]$ of a column, and a word $w \in \Sigma^k$ (with $k=m$ for line or $k=n$ for column), I want to write a function that checks the validity of each (left/right) line, or (top/bottom) constraints.

To ease debugging, and in the goal of using this Python program to improve my skills in solving such puzzles, I don't want this function to just reply True or False, but to also print for each constraints if it is satisfied or not.

Bonus: for each regexp contraint, highlight the parts which corresponded to each letter of the word?

For lines

We are ready to check the one or two constraints of a line. The same function will be written for columns, just below.

In [48]:
def check_line(problem, alphabet, word, position, debug=True, early=False):
    if not check_alphabet(alphabet, word, debug=debug):
        return False
    m, n = dimension_problem(problem)
    if len(word) != n:
        if debug:
            print(f"Word {repr(word)} does not have correct size n = {n} for lines")
        return False
    assert 0 <= position < m
    constraints = []
    if "left_lines" in problem and problem["left_lines"] is not None:
        constraints += [ problem["left_lines"][position] ]
    if "right_lines" in problem and problem["right_lines"] is not None:
        constraints += [ problem["right_lines"][position] ]
    # okay we have one or two constraint for this line,
    assert len(constraints) in {1, 2}
    # let's check them!
    result = True
    for cnb, constraint in enumerate(constraints):
        if debug:
            print(f"For line constraint #{cnb} {repr(constraint)}:")
        new_result = match(constraint, word, debug=debug)
        if early and not new_result: return False
        result = result and new_result
    return result

Let's try it!

In [49]:
problem1, alphabet1, grid1_solution
Out[49]:
({'left_lines': ['HE|LL|O+', '[PLEASE]+'],
  'right_lines': None,
  'top_columns': ['[^SPEAK]+', 'EP|IP|EF'],
  'bottom_columns': None},
 {'A', 'E', 'F', 'H', 'I', 'K', 'L', 'O', 'P', 'S'},
 [['H', 'E'], ['L', 'P']])
In [50]:
n, m = dimension_problem(problem1)

for line in range(n):
    word = nth_line(grid1_solution, line)
    print(f"- For line number {line}, checking word {repr(word)}:")
    result = check_line(problem1, alphabet1, word, line)
- For line number 0, checking word 'HE':
For line constraint #0 'HE|LL|O+':
For the first time seeing this regexp 'HE|LL|O+', compiling it and storing in memory_of_compiled_regexps, now of size 2.
The word 'HE' is matched by 'HE|LL|O+'
- For line number 1, checking word 'LP':
For line constraint #0 '[PLEASE]+':
For the first time seeing this regexp '[PLEASE]+', compiling it and storing in memory_of_compiled_regexps, now of size 3.
The word 'LP' is matched by '[PLEASE]+'
In [51]:
n, m = dimension_problem(problem1)
fake_words = ["OK", "HEY", "NOT", "HELL", "N", "", "HU", "OO", "EA"]

for word in fake_words:
    print(f"# For word {repr(word)}:")
    for line in range(n):
        result = check_line(problem1, alphabet1, word, line)
        print(f"  => {result}")
# For word 'OK':
For line constraint #0 'HE|LL|O+':
The word 'OK' is NOT matched by 'HE|LL|O+'
  => False
For line constraint #0 '[PLEASE]+':
The word 'OK' is NOT matched by '[PLEASE]+'
  => False
# For word 'HEY':
The word 'HEY' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #2th letter Y is not present.
  => False
The word 'HEY' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #2th letter Y is not present.
  => False
# For word 'NOT':
The word 'NOT' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
The word 'NOT' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
# For word 'HELL':
Word 'HELL' does not have correct size n = 2 for lines
  => False
Word 'HELL' does not have correct size n = 2 for lines
  => False
# For word 'N':
The word 'N' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
The word 'N' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
# For word '':
Word '' does not have correct size n = 2 for lines
  => False
Word '' does not have correct size n = 2 for lines
  => False
# For word 'HU':
The word 'HU' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter U is not present.
  => False
The word 'HU' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter U is not present.
  => False
# For word 'OO':
For line constraint #0 'HE|LL|O+':
The word 'OO' is matched by 'HE|LL|O+'
  => True
For line constraint #0 '[PLEASE]+':
The word 'OO' is NOT matched by '[PLEASE]+'
  => False
# For word 'EA':
For line constraint #0 'HE|LL|O+':
The word 'EA' is NOT matched by 'HE|LL|O+'
  => False
For line constraint #0 '[PLEASE]+':
The word 'EA' is matched by '[PLEASE]+'
  => True

That was long, but it works fine!

In [52]:
n, m = dimension_problem(problem2)

for line in [0]:
    word = nth_line(grid2_partial, line)
    print(f"- For line number {line}, checking word {repr(word)}:")
    result = check_line(problem2, alphabet2, word, line)
    print(f"  => {result}")
- For line number 0, checking word 'TRAN7':
For line constraint #0 '(N3|TRA|N7)+':
The word 'TRAN7' is matched by '(N3|TRA|N7)+'
For line constraint #1 '[^OLD\\s]+':
For the first time seeing this regexp '[^OLD\\s]+', compiling it and storing in memory_of_compiled_regexps, now of size 4.
The word 'TRAN7' is matched by '[^OLD\\s]+'
  => True
In [53]:
n, m = dimension_problem(problem2)
fake_words = [
    "TRAN8", "N2TRA",  # violate first constraint
    "N3N3N7", "N3N3", "TRA9",  # smaller or bigger dimension
    "O L D", "TRA  ",  # violate second contraint
]

for word in fake_words:
    for line in [0]:
        print(f"- For line number {line}, checking word {repr(word)}:")
        result = check_line(problem2, alphabet2, word, line)
        print(f"  => {result}")
- For line number 0, checking word 'TRAN8':
For line constraint #0 '(N3|TRA|N7)+':
The word 'TRAN8' is NOT matched by '(N3|TRA|N7)+'
For line constraint #1 '[^OLD\\s]+':
The word 'TRAN8' is matched by '[^OLD\\s]+'
  => False
- For line number 0, checking word 'N2TRA':
For line constraint #0 '(N3|TRA|N7)+':
The word 'N2TRA' is NOT matched by '(N3|TRA|N7)+'
For line constraint #1 '[^OLD\\s]+':
The word 'N2TRA' is matched by '[^OLD\\s]+'
  => False
- For line number 0, checking word 'N3N3N7':
Word 'N3N3N7' does not have correct size n = 5 for lines
  => False
- For line number 0, checking word 'N3N3':
Word 'N3N3' does not have correct size n = 5 for lines
  => False
- For line number 0, checking word 'TRA9':
Word 'TRA9' does not have correct size n = 5 for lines
  => False
- For line number 0, checking word 'O L D':
The word 'O L D' is not in alphabet {'Z', 'E', 'I', 'V', 'R', 'A', 'F', 'D', 'Q', '3', 'Y', 'L', '4', 'W', '$', '5', 'M', 'B', 'N', '2', '8', '1', 'T', 'X', 'H', '-', 'S', '?', 'G', '9', 'U', 'P', '7', 'O', '.', 'C', 'J', 'K', '0', '6', ':'}, as its #1th letter   is not present.
  => False
- For line number 0, checking word 'TRA  ':
The word 'TRA  ' is not in alphabet {'Z', 'E', 'I', 'V', 'R', 'A', 'F', 'D', 'Q', '3', 'Y', 'L', '4', 'W', '$', '5', 'M', 'B', 'N', '2', '8', '1', 'T', 'X', 'H', '-', 'S', '?', 'G', '9', 'U', 'P', '7', 'O', '.', 'C', 'J', 'K', '0', '6', ':'}, as its #3th letter   is not present.
  => False

For columns

We are ready to check the one or two constraints of a line. The same function will be written for columns, just below.

In [54]:
def check_column(problem, alphabet, word, position, debug=True, early=False):
    if not check_alphabet(alphabet, word, debug=debug):
        return False
    m, n = dimension_problem(problem)
    if len(word) != m:
        if debug:
            print(f"Word {repr(word)} does not have correct size n = {n} for columns")
        return False
    assert 0 <= position < n
    constraints = []
    if "top_columns" in problem and problem["top_columns"] is not None:
        constraints += [ problem["top_columns"][position] ]
    if "bottom_columns" in problem and problem["bottom_columns"] is not None:
        constraints += [ problem["bottom_columns"][position] ]
    # okay we have one or two constraint for this column,
    assert len(constraints) in {1, 2}
    # let's check them!
    result = True
    for cnb, constraint in enumerate(constraints):
        if debug:
            print(f"For column constraint #{cnb} {repr(constraint)}:")
        new_result = match(constraint, word, debug=debug)
        if early and not new_result: return False
        result = result and new_result
    return result

Let's try it!

In [55]:
problem1, alphabet1, grid1_solution
Out[55]:
({'left_lines': ['HE|LL|O+', '[PLEASE]+'],
  'right_lines': None,
  'top_columns': ['[^SPEAK]+', 'EP|IP|EF'],
  'bottom_columns': None},
 {'A', 'E', 'F', 'H', 'I', 'K', 'L', 'O', 'P', 'S'},
 [['H', 'E'], ['L', 'P']])
In [56]:
n, m = dimension_problem(problem1)

for column in range(m):
    word = nth_column(grid1_solution, column)
    print(f"- For column number {column}, checking word {repr(word)}:")
    result = check_column(problem1, alphabet1, word, column)
- For column number 0, checking word 'HL':
For column constraint #0 '[^SPEAK]+':
For the first time seeing this regexp '[^SPEAK]+', compiling it and storing in memory_of_compiled_regexps, now of size 5.
The word 'HL' is matched by '[^SPEAK]+'
- For column number 1, checking word 'EP':
For column constraint #0 'EP|IP|EF':
For the first time seeing this regexp 'EP|IP|EF', compiling it and storing in memory_of_compiled_regexps, now of size 6.
The word 'EP' is matched by 'EP|IP|EF'
In [57]:
n, m = dimension_problem(problem1)
fake_words = ["OK", "HEY", "NOT", "HELL", "N", "", "HU", "OO", "EA"]

for word in fake_words:
    print(f"# For word {repr(word)}:")
    for column in range(m):
        result = check_column(problem1, alphabet1, word, column)
        print(f"  => {result}")
# For word 'OK':
For column constraint #0 '[^SPEAK]+':
The word 'OK' is NOT matched by '[^SPEAK]+'
  => False
For column constraint #0 'EP|IP|EF':
The word 'OK' is NOT matched by 'EP|IP|EF'
  => False
# For word 'HEY':
The word 'HEY' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #2th letter Y is not present.
  => False
The word 'HEY' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #2th letter Y is not present.
  => False
# For word 'NOT':
The word 'NOT' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
The word 'NOT' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
# For word 'HELL':
Word 'HELL' does not have correct size n = 2 for columns
  => False
Word 'HELL' does not have correct size n = 2 for columns
  => False
# For word 'N':
The word 'N' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
The word 'N' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter N is not present.
  => False
# For word '':
Word '' does not have correct size n = 2 for columns
  => False
Word '' does not have correct size n = 2 for columns
  => False
# For word 'HU':
The word 'HU' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter U is not present.
  => False
The word 'HU' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter U is not present.
  => False
# For word 'OO':
For column constraint #0 '[^SPEAK]+':
The word 'OO' is matched by '[^SPEAK]+'
  => True
For column constraint #0 'EP|IP|EF':
The word 'OO' is NOT matched by 'EP|IP|EF'
  => False
# For word 'EA':
For column constraint #0 '[^SPEAK]+':
The word 'EA' is NOT matched by '[^SPEAK]+'
  => False
For column constraint #0 'EP|IP|EF':
The word 'EA' is NOT matched by 'EP|IP|EF'
  => False

That was long, but it works fine!

In [58]:
n, m = dimension_problem(problem2)

for column in [0]:
    word = nth_column(grid2_partial, column)
    print(f"- For column number {column}, checking word {repr(word)}:")
    result = check_column(problem2, alphabet2, word, column)
    print(f"  => {result}")
- For column number 0, checking word 'T2A-D':
For column constraint #0 '[2TAIL\\-D]+':
For the first time seeing this regexp '[2TAIL\\-D]+', compiling it and storing in memory_of_compiled_regexps, now of size 7.
The word 'T2A-D' is matched by '[2TAIL\\-D]+'
For column constraint #1 '[^ILYO]+':
For the first time seeing this regexp '[^ILYO]+', compiling it and storing in memory_of_compiled_regexps, now of size 8.
The word 'T2A-D' is matched by '[^ILYO]+'
  => True
In [59]:
n, m = dimension_problem(problem2)
fake_words = [
    "TRAN8", "N2TRA",  # violate first constraint
    "N3N3N7", "N3N3", "TRA9",  # smaller or bigger dimension
    "O L D", "TRA  ",  # violate second contraint
]

for word in fake_words:
    for line in [0]:
        print(f"- For line number {line}, checking word {repr(word)}:")
        result = check_column(problem2, alphabet2, word, line)
        print(f"  => {result}")
- For line number 0, checking word 'TRAN8':
For column constraint #0 '[2TAIL\\-D]+':
The word 'TRAN8' is NOT matched by '[2TAIL\\-D]+'
For column constraint #1 '[^ILYO]+':
The word 'TRAN8' is matched by '[^ILYO]+'
  => False
- For line number 0, checking word 'N2TRA':
For column constraint #0 '[2TAIL\\-D]+':
The word 'N2TRA' is NOT matched by '[2TAIL\\-D]+'
For column constraint #1 '[^ILYO]+':
The word 'N2TRA' is matched by '[^ILYO]+'
  => False
- For line number 0, checking word 'N3N3N7':
Word 'N3N3N7' does not have correct size n = 5 for columns
  => False
- For line number 0, checking word 'N3N3':
Word 'N3N3' does not have correct size n = 5 for columns
  => False
- For line number 0, checking word 'TRA9':
Word 'TRA9' does not have correct size n = 5 for columns
  => False
- For line number 0, checking word 'O L D':
The word 'O L D' is not in alphabet {'Z', 'E', 'I', 'V', 'R', 'A', 'F', 'D', 'Q', '3', 'Y', 'L', '4', 'W', '$', '5', 'M', 'B', 'N', '2', '8', '1', 'T', 'X', 'H', '-', 'S', '?', 'G', '9', 'U', 'P', '7', 'O', '.', 'C', 'J', 'K', '0', '6', ':'}, as its #1th letter   is not present.
  => False
- For line number 0, checking word 'TRA  ':
The word 'TRA  ' is not in alphabet {'Z', 'E', 'I', 'V', 'R', 'A', 'F', 'D', 'Q', '3', 'Y', 'L', '4', 'W', '$', '5', 'M', 'B', 'N', '2', '8', '1', 'T', 'X', 'H', '-', 'S', '?', 'G', '9', 'U', 'P', '7', 'O', '.', 'C', 'J', 'K', '0', '6', ':'}, as its #3th letter   is not present.
  => False

Second easy task: check that a proposed grid is a valid solution

I think it's easy, as we just have to use $m$ times the check_line and $n$ times the check_column functions.

In [60]:
def check_grid(problem, alphabet, grid, debug=True, early=False):
    m, n = dimension_problem(problem)
    
    ok_lines = [False] * m
    for line in range(m):
        word = nth_line(grid, line)
        ok_lines[line] = check_line(problem, alphabet, word, line, debug=debug, early=early)
    
    ok_columns = [False] * n
    for column in range(n):
        word = nth_column(grid, column)
        ok_columns[column] = check_column(problem, alphabet, word, column, debug=debug, early=early)
    
    return all(ok_lines) and all(ok_columns)

Let's try it!

For the easy problem

For a partial grid, of course it's going to be invalid just because '_' is not in the alphabet $\Sigma$.

In [61]:
check_grid(problem1, alphabet1, grid1_partial)
The word 'H_' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter _ is not present.
The word '_P' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter _ is not present.
The word 'H_' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #1th letter _ is not present.
The word '_P' is not in alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}, as its #0th letter _ is not present.
Out[61]:
False

For a complete grid, let's check that our solution is valid:

In [62]:
check_grid(problem1, alphabet1, grid1_solution)
For line constraint #0 'HE|LL|O+':
The word 'HE' is matched by 'HE|LL|O+'
For line constraint #0 '[PLEASE]+':
The word 'LP' is matched by '[PLEASE]+'
For column constraint #0 '[^SPEAK]+':
The word 'HL' is matched by '[^SPEAK]+'
For column constraint #0 'EP|IP|EF':
The word 'EP' is matched by 'EP|IP|EF'
Out[62]:
True

And let's also check that the few wrong solutions are indeed not valid:

In [63]:
check_grid(problem1, alphabet1, grid1_wrong1)
For line constraint #0 'HE|LL|O+':
The word 'HE' is matched by 'HE|LL|O+'
For line constraint #0 '[PLEASE]+':
The word 'LF' is NOT matched by '[PLEASE]+'
For column constraint #0 '[^SPEAK]+':
The word 'HL' is matched by '[^SPEAK]+'
For column constraint #0 'EP|IP|EF':
The word 'EF' is matched by 'EP|IP|EF'
Out[63]:
False
In [64]:
check_grid(problem1, alphabet1, grid1_wrong2)
For line constraint #0 'HE|LL|O+':
The word 'HE' is matched by 'HE|LL|O+'
For line constraint #0 '[PLEASE]+':
The word 'EP' is matched by '[PLEASE]+'
For column constraint #0 '[^SPEAK]+':
The word 'HE' is NOT matched by '[^SPEAK]+'
For column constraint #0 'EP|IP|EF':
The word 'EP' is matched by 'EP|IP|EF'
Out[64]:
False
In [65]:
check_grid(problem1, alphabet1, grid1_wrong3)
For line constraint #0 'HE|LL|O+':
The word 'HE' is matched by 'HE|LL|O+'
For line constraint #0 '[PLEASE]+':
The word 'OF' is NOT matched by '[PLEASE]+'
For column constraint #0 '[^SPEAK]+':
The word 'HO' is matched by '[^SPEAK]+'
For column constraint #0 'EP|IP|EF':
The word 'EF' is matched by 'EP|IP|EF'
Out[65]:
False
In [66]:
check_grid(problem1, alphabet1, grid1_wrong4)
For line constraint #0 'HE|LL|O+':
The word 'OE' is NOT matched by 'HE|LL|O+'
For line constraint #0 '[PLEASE]+':
The word 'OF' is NOT matched by '[PLEASE]+'
For column constraint #0 '[^SPEAK]+':
The word 'OO' is matched by '[^SPEAK]+'
For column constraint #0 'EP|IP|EF':
The word 'EF' is matched by 'EP|IP|EF'
Out[66]:
False

We can see that for each wrong grid, at least one of the contraint is violated!

That's pretty good!

For the intermediate problem

My solution for the intermediate problem problem3 is indeed valid:

In [67]:
check_grid(problem3, alphabet3, grid3_solution)
For line constraint #0 '[ONE]*[SKA]':
For the first time seeing this regexp '[ONE]*[SKA]', compiling it and storing in memory_of_compiled_regexps, now of size 9.
The word 'NOS' is matched by '[ONE]*[SKA]'
For line constraint #1 '.*(O|S)*':
For the first time seeing this regexp '.*(O|S)*', compiling it and storing in memory_of_compiled_regexps, now of size 10.
The word 'NOS' is matched by '.*(O|S)*'
For line constraint #0 '.*(RE|ER)':
For the first time seeing this regexp '.*(RE|ER)', compiling it and storing in memory_of_compiled_regexps, now of size 11.
The word 'FER' is matched by '.*(RE|ER)'
For line constraint #1 '[^GOA]*':
For the first time seeing this regexp '[^GOA]*', compiling it and storing in memory_of_compiled_regexps, now of size 12.
The word 'FER' is matched by '[^GOA]*'
For line constraint #0 'A+[TUB]*':
For the first time seeing this regexp 'A+[TUB]*', compiling it and storing in memory_of_compiled_regexps, now of size 13.
The word 'ATU' is matched by 'A+[TUB]*'
For line constraint #1 '[STUPA]+':
For the first time seeing this regexp '[STUPA]+', compiling it and storing in memory_of_compiled_regexps, now of size 14.
The word 'ATU' is matched by '[STUPA]+'
For column constraint #0 '.*[GAF]*':
For the first time seeing this regexp '.*[GAF]*', compiling it and storing in memory_of_compiled_regexps, now of size 15.
The word 'NFA' is matched by '.*[GAF]*'
For column constraint #1 '(NF|FA|A|FN)+':
For the first time seeing this regexp '(NF|FA|A|FN)+', compiling it and storing in memory_of_compiled_regexps, now of size 16.
The word 'NFA' is matched by '(NF|FA|A|FN)+'
For column constraint #0 '(P|ET|O|TEA)*':
For the first time seeing this regexp '(P|ET|O|TEA)*', compiling it and storing in memory_of_compiled_regexps, now of size 17.
The word 'OET' is matched by '(P|ET|O|TEA)*'
For column constraint #1 '.*(A|E|I).*':
For the first time seeing this regexp '.*(A|E|I).*', compiling it and storing in memory_of_compiled_regexps, now of size 18.
The word 'OET' is matched by '.*(A|E|I).*'
For column constraint #0 '[RUSH]+':
For the first time seeing this regexp '[RUSH]+', compiling it and storing in memory_of_compiled_regexps, now of size 19.
The word 'SRU' is matched by '[RUSH]+'
For column constraint #1 '[SUPER]*':
For the first time seeing this regexp '[SUPER]*', compiling it and storing in memory_of_compiled_regexps, now of size 20.
The word 'SRU' is matched by '[SUPER]*'
Out[67]:
True

For the hard problem

Well I don't have a solution yet, so I cannot check it!

Third easy task: generate all words of a given size in the alphabet

Using itertools.product and the alphabet defined above, it's going to be easy.

Note that I'll first try with a smaller alphabet, to check the result (for problem 1).

In [68]:
import itertools
In [69]:
def all_words_of_alphabet(alphabet, size):
    yield from itertools.product(alphabet, repeat=size)

Just a quick check:

In [70]:
list(all_words_of_alphabet(['0', '1'], 3))
Out[70]:
[('0', '0', '0'),
 ('0', '0', '1'),
 ('0', '1', '0'),
 ('0', '1', '1'),
 ('1', '0', '0'),
 ('1', '0', '1'),
 ('1', '1', '0'),
 ('1', '1', '1')]

The time and memory complexity of this function should be $\mathcal{O}(|\Sigma|^k)$ for words of size $k\in\mathbb{N}^*$.

In [71]:
alphabet0 = ['0', '1']
len_alphabet = len(alphabet0)
for k in [2, 3, 4, 5]:
    print(f"Generating {len_alphabet**k} words of size = {k} takes about")
    %timeit list(all_words_of_alphabet(alphabet0, k))
Generating 4 words of size = 2 takes about
754 ns ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generating 8 words of size = 3 takes about
923 ns ± 9.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generating 16 words of size = 4 takes about
1.38 µs ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Generating 32 words of size = 5 takes about
2.28 µs ± 44 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [72]:
%timeit list(all_words_of_alphabet(['0', '1', '2', '3'], 10))
117 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can quickly check that even for the larger alphabet of size ~40, it's quite quick for small words of length $\leq 5$:

In [73]:
len_alphabet = len(alphabet1)
for k in [2, 3, 4, 5]:
    print(f"Generating {len_alphabet**k} words of size = {k} takes about")
    %timeit list(all_words_of_alphabet(alphabet1, k))
Generating 100 words of size = 2 takes about
6.27 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Generating 1000 words of size = 3 takes about
45.9 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Generating 10000 words of size = 4 takes about
519 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Generating 100000 words of size = 5 takes about
8.36 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [74]:
len_alphabet = len(alphabet2)
for k in [2, 3, 4, 5]:
    print(f"Generating {len_alphabet**k} words of size = {k} takes about")
    %timeit list(all_words_of_alphabet(alphabet2, k))
Generating 1681 words of size = 2 takes about
79.2 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Generating 68921 words of size = 3 takes about
4.57 ms ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Generating 2825761 words of size = 4 takes about
230 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Generating 115856201 words of size = 5 takes about
11.1 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Who, it takes 12 seconds to just generate all the possible words for the largest problem (which is just of size $(5,5)$)...

I'm afraid that my naive approach to solve the puzzle will be VERY slow...

Fourth easy task: generate all grids of a given size

In [75]:
def all_grids_of_alphabet(alphabet, lines, columns):
    all_words = list(itertools.product(alphabet, repeat=columns))
    all_words = [ "".join(words) for words in all_words ]
    all_grids = itertools.product(all_words, repeat=lines)
    for pre_tr_grid in all_grids:
        tr_grid = [
            [
                pre_tr_grid[line][column]
                for line in range(lines)
            ]
            for column in range(columns)
        ]
        yield tr_grid
In [76]:
for alphabet in ( ['0', '1'], ['T', 'A', 'C', 'G'] ):
    for (n, m) in [ (1, 1), (2, 2), (1, 2), (2, 1), (3, 3), (3, 2), (2, 3) ]:
        assert len(list(all_grids_of_alphabet(alphabet, n, m))) == len(alphabet)**(n*m)
        print(list(all_grids_of_alphabet(alphabet0, n, m))[0])
        print(list(all_grids_of_alphabet(alphabet0, n, m))[-1])
[['0']]
[['1']]
[['0', '0'], ['0', '0']]
[['1', '1'], ['1', '1']]
[['0'], ['0']]
[['1'], ['1']]
[['0', '0']]
[['1', '1']]
[['0', '0', '0'], ['0', '0', '0'], ['0', '0', '0']]
[['1', '1', '1'], ['1', '1', '1'], ['1', '1', '1']]
[['0', '0', '0'], ['0', '0', '0']]
[['1', '1', '1'], ['1', '1', '1']]
[['0', '0'], ['0', '0'], ['0', '0']]
[['1', '1'], ['1', '1'], ['1', '1']]
[['0']]
[['1']]
[['0', '0'], ['0', '0']]
[['1', '1'], ['1', '1']]
[['0'], ['0']]
[['1'], ['1']]
[['0', '0']]
[['1', '1']]
[['0', '0', '0'], ['0', '0', '0'], ['0', '0', '0']]
[['1', '1', '1'], ['1', '1', '1'], ['1', '1', '1']]
[['0', '0', '0'], ['0', '0', '0']]
[['1', '1', '1'], ['1', '1', '1']]
[['0', '0'], ['0', '0'], ['0', '0']]
[['1', '1'], ['1', '1'], ['1', '1']]
In [77]:
print(f"For the alphabet {alphabet0} of size = {len(alphabet0)} :")
for (n, m) in [ (1, 1), (2, 1), (1, 2), (2, 2) ]:
    %time all_these_grids = list(all_grids_of_alphabet(alphabet0, n, m))
    print(f"For (n, m) = {(n, m)} the number of grids is {len(all_these_grids)}")
For the alphabet ['0', '1'] of size = 2 :
CPU times: user 28 µs, sys: 6 µs, total: 34 µs
Wall time: 37.9 µs
For (n, m) = (1, 1) the number of grids is 2
CPU times: user 21 µs, sys: 4 µs, total: 25 µs
Wall time: 29.8 µs
For (n, m) = (2, 1) the number of grids is 4
CPU times: user 20 µs, sys: 4 µs, total: 24 µs
Wall time: 26.5 µs
For (n, m) = (1, 2) the number of grids is 4
CPU times: user 38 µs, sys: 8 µs, total: 46 µs
Wall time: 48.9 µs
For (n, m) = (2, 2) the number of grids is 16

How long does it take and how many grids for the easy problem?

In [78]:
print(f"For the alphabet {alphabet1} of size = {len(alphabet1)} :")
for (n, m) in [ (1, 1), (2, 1), (1, 2), (2, 2) ]:
    %time all_these_grids = list(all_grids_of_alphabet(alphabet1, n, m))
    print(f"For (n, m) = {(n, m)} the number of grids is {len(all_these_grids)}")
For the alphabet {'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'} of size = 10 :
CPU times: user 73 µs, sys: 0 ns, total: 73 µs
Wall time: 80.1 µs
For (n, m) = (1, 1) the number of grids is 10
CPU times: user 303 µs, sys: 31 µs, total: 334 µs
Wall time: 341 µs
For (n, m) = (2, 1) the number of grids is 100
CPU times: user 584 µs, sys: 124 µs, total: 708 µs
Wall time: 719 µs
For (n, m) = (1, 2) the number of grids is 100
CPU times: user 27.6 ms, sys: 152 µs, total: 27.7 ms
Wall time: 27.3 ms
For (n, m) = (2, 2) the number of grids is 10000

That's still pretty small and fast!

How long does it take and how many grids for the hard problem?

In [79]:
print(f"For the alphabet {alphabet2} of size = {len(alphabet2)} :")
for (n, m) in [ (1, 1), (2, 1), (1, 2), (2, 2) ]:
    %time all_these_grids = list(all_grids_of_alphabet(alphabet2, n, m))
    print(f"For (n, m) = {(n, m)} the number of grids is {len(all_these_grids)}")
For the alphabet {'Z', 'E', 'I', 'V', 'R', 'A', 'F', 'D', 'Q', '3', 'Y', 'L', '4', 'W', '$', '5', 'M', 'B', 'N', '2', '8', '1', 'T', 'X', 'H', '-', 'S', '?', 'G', '9', 'U', 'P', '7', 'O', '.', 'C', 'J', 'K', '0', '6', ':'} of size = 41 :
CPU times: user 1.05 ms, sys: 224 µs, total: 1.28 ms
Wall time: 1.28 ms
For (n, m) = (1, 1) the number of grids is 41
CPU times: user 2.67 ms, sys: 0 ns, total: 2.67 ms
Wall time: 2.68 ms
For (n, m) = (2, 1) the number of grids is 1681
CPU times: user 4.54 ms, sys: 30 µs, total: 4.57 ms
Wall time: 4.58 ms
For (n, m) = (1, 2) the number of grids is 1681
CPU times: user 6.39 s, sys: 191 ms, total: 6.59 s
Wall time: 6.58 s
For (n, m) = (2, 2) the number of grids is 2825761
In [80]:
41**(2*3)
Out[80]:
4750104241

Just for $(n, m) = (2, 2)$ it takes about 7 seconds... So to scale for $(n, m) = (5, 5)$ would just take... WAY TOO MUCH TIME!

In [81]:
n, m = 5, 5
41**(5*5)
Out[81]:
20873554875923477449109855954682643681001
In [82]:
import math
In [83]:
math.log10(41**(5*5))
Out[83]:
40.31959641799339

For a grid of size $(5,5)$, the number of different possible grids is about $10^{40}$, that is CRAZY large, we have no hope of solving this problem with a brute force approach.

How much time would that require, just to generate the grids?

In [84]:
s = 7
estimate_of_running_time = 7*s * len(alphabet1)**(5*5) / len(alphabet1)**(2*2)
estimate_of_running_time  # in seconds
Out[84]:
4.9e+22

This rough estimate gives about $5 * 10^{22}$ seconds, about $10^{15}$ years, so about a million of billion years !

In [85]:
math.log10( estimate_of_running_time / (60*60*24*365) )
Out[85]:
15.191389473093146

First difficult task: for each possible grid, check if its valid

In [91]:
def naive_solve(problem, alphabet, debug=False, early=True):
    n, m = dimension_problem(problem)
    good_grids = []
    for possible_grid in all_grids_of_alphabet(alphabet, n, m):
        is_good_grid = check_grid(problem, alphabet, possible_grid, debug=debug, early=early)
        if is_good_grid:
            if early:
                return [ possible_grid ]
            good_grids.append(possible_grid)
    return good_grids

Let's try it!

Solving the easy problem

Let's check that we can quickly find one solution:

In [92]:
%%time
good_grids1 = naive_solve(problem1, alphabet1, debug=False, early=True)

print(f"For problem 1\n{problem1}\nOn alphabet\n{alphabet1}\n==> We found one solution:\n{good_grids1}")
For problem 1
{'left_lines': ['HE|LL|O+', '[PLEASE]+'], 'right_lines': None, 'top_columns': ['[^SPEAK]+', 'EP|IP|EF'], 'bottom_columns': None}
On alphabet
{'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}
==> We found one solution:
[[['H', 'E'], ['L', 'P']]]
CPU times: user 111 ms, sys: 3.25 ms, total: 114 ms
Wall time: 115 ms

Then can we find more solutions?

In [93]:
%%time
good_grids1 = naive_solve(problem1, alphabet1, debug=False, early=False)

print(f"For problem 1\n{problem1}\nOn alphabet\n{alphabet1}\n==> We found these solutions:\n{good_grids1}")
For problem 1
{'left_lines': ['HE|LL|O+', '[PLEASE]+'], 'right_lines': None, 'top_columns': ['[^SPEAK]+', 'EP|IP|EF'], 'bottom_columns': None}
On alphabet
{'S', 'P', 'E', 'O', 'L', 'H', 'I', 'K', 'A', 'F'}
==> We found these solutions:
[[['H', 'E'], ['L', 'P']]]
CPU times: user 168 ms, sys: 441 µs, total: 169 ms
Wall time: 167 ms

No there is indeed a unique solution here for the first "easy" problem!

Solving the intermediate problem

In [261]:
%%time
good_grids3 = naive_solve(problem3, alphabet3, debug=False, early=True)

print(f"For problem 3\n{problem3}\nOn alphabet\n{alphabet3}\n==> We found one solution:\n{good_grids3}")
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-260-3cd60b633370> in naive_solve(problem, alphabet, debug, early)
      2     n, m = dimension_problem(problem)
      3     good_grids = []
----> 4     for possible_grid in all_grids_of_alphabet(alphabet, n, m):
      5         is_good_grid = check_grid(problem, alphabet, possible_grid, debug=debug, early=early)
      6         if is_good_grid:

<ipython-input-161-8aea88f3c7d0> in all_grids_of_alphabet(alphabet, lines, columns)
      9                 for line in range(lines)
     10             ]
---> 11             for column in range(columns)
     12         ]
     13         yield tr_grid

<ipython-input-161-8aea88f3c7d0> in <listcomp>(.0)
      9                 for line in range(lines)
     10             ]
---> 11             for column in range(columns)
     12         ]
     13         yield tr_grid

KeyboardInterrupt: 

That was so long...

I could try to use Pypy3 IPython kernel, to speed things up?

Yes it's possible to use a Pypy kernel from your regular Python notebook! See https://stackoverflow.com/questions/33850577/is-it-possible-to-run-a-pypy-kernel-in-the-jupyter-notebook

Solving the hard problem

Most probably, it will run forever if I use the naive approach of:

  • generate all grids of $m$ words of size $n$ in given alphabet $\Sigma$ ;
  • for all grid:
    • test it using naive algorithm
    • if it's valid: adds it to the list of good grids

There are $|\Sigma|^{n \times m}$ possible grids, so this approach is doubly exponential in $n$ for square grids.

I must think of a better approach... Being just exponential in $\max(m, n)$ would imply that it's practical for the harder problem of size $(5,5)$.

In [192]:
%%time
good_grids2 = naive_solve(problem2, alphabet2, debug=False, early=True)

print(f"For problem 2\n{problem2}\nOn alphabet\n{alphabet2}\n==> We found one solution:\n{good_grids2}")
-------------------------------------------------------------------
KeyboardInterrupt                 Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-170-3cd60b633370> in naive_solve(problem, alphabet, debug, early)
      2     n, m = dimension_problem(problem)
      3     good_grids = []
----> 4     for possible_grid in all_grids_of_alphabet(alphabet, n, m):
      5         is_good_grid = check_grid(problem, alphabet, possible_grid, debug=debug, early=early)
      6         if is_good_grid:

<ipython-input-161-8aea88f3c7d0> in all_grids_of_alphabet(alphabet, lines, columns)
      1 def all_grids_of_alphabet(alphabet, lines, columns):
----> 2     all_words = list(itertools.product(alphabet, repeat=columns))
      3     all_words = [ "".join(words) for words in all_words ]
      4     all_grids = itertools.product(all_words, repeat=lines)
      5     for pre_tr_grid in all_grids:

KeyboardInterrupt: 

My first idea was to try to tackle each constraint independently, and generate the set of words that satisfy this contraint. (by naively checking check(constraint, word) for each word in $\Sigma^n$ or $\Sigma^m$).

  • if there are two line constraints (left/right), get the intersection of the two sets of words;
  • then, for each line we have a set of possible words:
    • we can build each column, and then check that the top/bottom constraint is valid or not
    • if valid, continue to next column until the last
    • if all columns are valid, then these lines/columns form a possible grid!
    • (if we want only one solution, stop now, otherwise continue)

Second difficult task: a more efficient approach to solve any problem

In [94]:
n, m = dimension_problem(problem1)
In [95]:
problem1
Out[95]:
{'left_lines': ['HE|LL|O+', '[PLEASE]+'],
 'right_lines': None,
 'top_columns': ['[^SPEAK]+', 'EP|IP|EF'],
 'bottom_columns': None}
In [96]:
alphabet1
Out[96]:
{'A', 'E', 'F', 'H', 'I', 'K', 'L', 'O', 'P', 'S'}
In [97]:
len(list(all_words_of_alphabet(alphabet1, n)))
Out[97]:
100
In [98]:
["".join(word) for word in list(all_words_of_alphabet(alphabet1, n))][:10]
Out[98]:
['SS', 'SP', 'SE', 'SO', 'SL', 'SH', 'SI', 'SK', 'SA', 'SF']
In [99]:
[
    [ "".join(word)
         for word in all_words_of_alphabet(alphabet1, n)
         if check_line(problem1, alphabet1, "".join(word), line, debug=False, early=True)
    ]
    for line in range(m)
]
Out[99]:
[['OO', 'LL', 'HE'],
 ['SS',
  'SP',
  'SE',
  'SL',
  'SA',
  'PS',
  'PP',
  'PE',
  'PL',
  'PA',
  'ES',
  'EP',
  'EE',
  'EL',
  'EA',
  'LS',
  'LP',
  'LE',
  'LL',
  'LA',
  'AS',
  'AP',
  'AE',
  'AL',
  'AA']]
In [100]:
[
    [ "".join(word)
         for word in all_words_of_alphabet(alphabet1, m)
         if check_column(problem1, alphabet1, "".join(word), column, debug=False, early=True)
    ]
    for column in range(n)
]
Out[100]:
[['OO',
  'OL',
  'OH',
  'OI',
  'OF',
  'LO',
  'LL',
  'LH',
  'LI',
  'LF',
  'HO',
  'HL',
  'HH',
  'HI',
  'HF',
  'IO',
  'IL',
  'IH',
  'II',
  'IF',
  'FO',
  'FL',
  'FH',
  'FI',
  'FF'],
 ['EP', 'EF', 'IP']]

So let's write this algorithm.

I'm using a tqdm.tqdm() wrapper on the foor loops, to keep an eye on the progress.

In [101]:
from tqdm.notebook import trange, tqdm
In [102]:
def smart_solve(problem, alphabet, debug=False, early=True):
    n, m = dimension_problem(problem)
    good_grids = []
    
    possible_words_for_lines = [
        [ "".join(word)
             for word in all_words_of_alphabet(alphabet, n)
             if check_line(problem, alphabet, "".join(word), line, debug=False, early=True)
             # TODO don't compute this "".join(word) twice?
        ]
        for line in range(m)
    ]
    number_of_combinations = 1
    for line in range(m):
        number_of_combinations *= len(possible_words_for_lines[line])
        print(f"- There are {len(possible_words_for_lines[line])} different words for line #{line}")
    print(f"=> There are {number_of_combinations} combinations of words for lines #{0}..#{m-1}")

    for possible_words in tqdm(
        list(itertools.product(*possible_words_for_lines)),
        desc="lines"
    ):
        if debug: print(f"    Trying possible_words from line constraints = {possible_words}")
        column = 0
        no_wrong_column = True
        while no_wrong_column and column < n:
            word_column = "".join(possible_words[line][column] for line in range(m))
            if debug: print(f"       For column #{column}, word = {word_column}, checking constraint...")
            if not check_column(problem, alphabet, word_column, column, debug=False, early=True):
                # this word is NOT valid for this column, so let's go to the next word
                if debug: print(f"    This word {word_column} is NOT valid for this column {column}, so let's go to the next word")
                no_wrong_column = False
                # break: this was failing... broke the outer for-loop and not the inner one
            column += 1
        if no_wrong_column:
            print(f"    These words seemed to satisfy the column constraints!\n{possible_words}")
            
            # so all columns are valid! this choice of words is good!
            possible_grid = [
                list(word) for word in possible_words
            ]
            print(f"Giving this grid:\n{possible_grid}")
            # let's check it, just in case (this takes a short time, compared to the rest)
            is_good_grid = check_grid(problem, alphabet, possible_grid, debug=debug, early=early)
            if is_good_grid:
                if early:
                    return [ possible_grid ]
                good_grids.append(possible_grid)
    
    # after the outer for loop on possible_words
    return good_grids

And let's try it:

For the easy problem

In [103]:
grid1_solution
Out[103]:
[['H', 'E'], ['L', 'P']]
In [104]:
%%time
good_grids1 = smart_solve(problem1, alphabet1)

good_grids1
- There are 3 different words for line #0
- There are 25 different words for line #1
=> There are 75 combinations of words for lines #0..#1
    These words seemed to satisfy the column constraints!
('HE', 'LP')
Giving this grid:
[['H', 'E'], ['L', 'P']]
CPU times: user 37.9 ms, sys: 3.77 ms, total: 41.6 ms
Wall time: 90.5 ms
Out[104]:
[[['H', 'E'], ['L', 'P']]]

So it worked!

🚀 It was also BLAZING fast compared to the naive approach: 160ms against about 900µs, almost a 160x speed-up factor!

🤔 I don't understand why it's so slow now I did get a time of 900 µs at first try, now it's about 90 ms... just a 2x spee-up factor.

Let's try for the harder problem!

For the intermediate problem

In [106]:
%%time
#assert False  # uncomment when ready

good_grids3 = smart_solve(problem3, alphabet3)

good_grids3
- There are 27 different words for line #0
- There are 24 different words for line #1
- There are 7 different words for line #2
=> There are 4536 combinations of words for lines #0..#2
    These words seemed to satisfy the column constraints!
('NOS', 'FER', 'ATU')
Giving this grid:
[['N', 'O', 'S'], ['F', 'E', 'R'], ['A', 'T', 'U']]
CPU times: user 92 ms, sys: 0 ns, total: 92 ms
Wall time: 90 ms
Out[106]:
[[['N', 'O', 'S'], ['F', 'E', 'R'], ['A', 'T', 'U']]]

🚀 It was also BLAZING fast compared to the naive approach: 90ms, when the naive approach was just too long that I killed it...

For the harder problem

In [ ]:
%%time
#assert False  # uncomment when ready

good_grids2 = smart_solve(problem2, alphabet2)

good_grids2
For the first time seeing this regexp '[1LOVE2?4]+.', compiling it and storing in memory_of_compiled_regexps, now of size 21.
For the first time seeing this regexp '(\\d+)[LA\\s$?]+', compiling it and storing in memory_of_compiled_regexps, now of size 22.
For the first time seeing this regexp '(A|D)M[5-8$L]+', compiling it and storing in memory_of_compiled_regexps, now of size 23.
For the first time seeing this regexp '(\\-P|5\\$|AM|Z|L)+', compiling it and storing in memory_of_compiled_regexps, now of size 24.
For the first time seeing this regexp '[^\\s0ILAD]+', compiling it and storing in memory_of_compiled_regexps, now of size 25.
For the first time seeing this regexp '(\\-D|\\-WE)+[^L4-9N$?]+', compiling it and storing in memory_of_compiled_regexps, now of size 26.
For the first time seeing this regexp '[B-E]+(.)\\1.', compiling it and storing in memory_of_compiled_regexps, now of size 27.
For the first time seeing this regexp '[FED$?]+', compiling it and storing in memory_of_compiled_regexps, now of size 28.

It made my kernel restart...

Improve the solution - TODO

If you're extra curious about this puzzle problem, and my experiments, you can continue from here and finish these ideas:

  • It could be great if it were be possible to give a partially filled grid, and start from there.

  • It could also be great to just be able to fill one cell in the grid, in case you're blocked and want some hint.

My feeling about these problems and my solutions

I could have tried to be more efficient, but I didn't have much time to spend on this.

Conclusion

That was nice! Writing this notebook took about 4.5 hours entirely, from first idea to final edit, on Sunday 28th of February, 2021. (note that I was also cooking my pancakes during the first half, so I wasn't intensely coding)

Have a look at my other notebooks.