In [1]:

```
%pylab inline
```

In [10]:

```
import math
import os
import random
import numpy as np
from pandas import *
import matplotlib.pyplot as plt
import scipy.optimize as opt
import statsmodels.api as sm
from statsmodels.formula.api import ols
```

In [11]:

```
heights_weights = read_csv('data/01_heights_weights_genders.csv')
```

In [24]:

```
ols_fit = ols('Weight ~ Height', data = heights_weights).fit()
ols_sse = ols_fit.mse_resid * ols_fit.df_resid
print np.round(ols_fit.params, 3)
print 'SSE %i' % round(ols_sse)
```

We'll explicitly set up the ridge SSE function to minimize, along with explicit functions for its gradient and Hessian, since both are straightforward to derive.

We'll try a couple of different optimization algorithms: some use the gradient and Hessian, while others (e.g., Nelder-Mead) use neither.
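Concretely, with $\beta$ the parameter vector and $\lambda$ the ridge penalty, the quantities computed below are:

$$\mathrm{SSE}_{\mathrm{ridge}}(\beta) = (y - X\beta)^\top(y - X\beta) + \lambda\,\beta^\top\beta$$

$$\nabla\,\mathrm{SSE}_{\mathrm{ridge}}(\beta) = 2\left(X^\top X\beta - X^\top y + \lambda\beta\right), \qquad H = 2\left(X^\top X + \lambda I\right)$$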

In [25]:

```
y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend = True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute the SSE of the ridge regression.
    This is the normal regression SSE, plus the
    L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    sse += lam * (params ** 2).sum()
    return sse

def ridge_grad(params, y, Xmat, lam):
    '''
    The gradient of the ridge regression SSE.
    '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    '''
    The Hessian of the ridge regression SSE
    (note the factor of 2, matching the gradient above).
    '''
    return 2 * (np.dot(Xmat.T, Xmat) + np.eye(2) * lam)
```

Set the starting parameters (constant, coefficient).

In [26]:

```
params0 = np.array([0.0, 0.0])
```

**Nelder-Mead simplex:** no gradient required. This is the default method of `optim` in R.

In [28]:

```
ridge_fit = opt.fmin(ridge_error, params0, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
```
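As a sanity check (not part of the original notebook), the ridge objective has the closed-form minimizer $(X^\top X + \lambda I)^{-1} X^\top y$. A standalone sketch on synthetic data, since it doesn't assume the heights/weights CSV is at hand:

```python
import numpy as np
import scipy.optimize as opt

# Synthetic height/weight-like data standing in for the real CSV.
rng = np.random.RandomState(0)
x = rng.uniform(55, 75, size=200)
y = -350 + 7.7 * x + rng.normal(scale=10.0, size=200)
X = np.column_stack([np.ones_like(x), x])
lam = 1.0

def ridge_error(params, y, X, lam):
    resid = y - X.dot(params)
    return resid.dot(resid) + lam * params.dot(params)

def ridge_grad(params, y, X, lam):
    return 2 * (X.T.dot(X).dot(params) - X.T.dot(y) + lam * params)

# Closed-form ridge solution: (X'X + lam*I)^{-1} X'y
beta_exact = np.linalg.solve(X.T.dot(X) + lam * np.eye(2), X.T.dot(y))

# Numerical minimum of the same objective.
beta_opt = opt.minimize(ridge_error, np.zeros(2), args=(y, X, lam),
                        jac=ridge_grad, method='BFGS').x

print(beta_exact)
print(beta_opt)
```

The two estimates should agree to several decimal places, which is a useful check on the hand-written gradient.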

**Newton conjugate gradient:** requires the gradient; the Hessian is optional.

First without the hessian.

In [34]:

```
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
```

Now, with the hessian. This shaves a little bit of time off.

In [35]:

```
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad,
fhess = ridge_hess, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
```

**BFGS:** uses the gradient, but no Hessian.

In [36]:

```
ridge_fit = opt.fmin_bfgs(ridge_error, params0, fprime = ridge_grad,
args = (y, Xmat, 1))
print 'Solution: ', ridge_fit
```

The lexical database is a collection of words and their frequency on Wikipedia.

In [37]:

```
lexical_database = read_csv('data/lexical_database.csv', index_col = 0,
header = None, skiprows = 1, squeeze = True)
lexical_database.index.name = 'word'
```

In [38]:

```
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
'y', 'z']
```

The Caesar cipher encrypts text by mapping each letter to the next letter in the alphabet (with Z wrapping back to A). We'll represent the cipher, and its key, as mappings in a dictionary.

In [39]:

```
ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {ceasar_cipher[k]: k for k in ceasar_cipher}
```
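A quick sanity check, rebuilding the same dictionary as a standalone snippet: each letter maps to its successor, and 'z' wraps around to 'a'.

```python
import string

letters = list(string.ascii_lowercase)
ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}

print(ceasar_cipher['a'])  # 'b'
print(ceasar_cipher['z'])  # wraps back around to 'a'
```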

Functions that cipher (encrypt) and decipher a text string. `decipher_text` doesn't need an explicit deciphering key; it inverts the cipher dictionary used by `cipher_text`.

In [40]:

```
def cipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    # Invert the cipher dictionary (k, v) -> (v, k)
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist])
    return deciphered
```
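Here is a standalone round trip (re-defining both functions so the snippet runs on its own); characters not in the cipher, like spaces, pass through unchanged:

```python
import string

letters = list(string.ascii_lowercase)
ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}

def cipher_text(text, cipher_dict=ceasar_cipher):
    # Map each character, leaving anything not in the cipher alone.
    return ''.join(cipher_dict.get(x, x) for x in text)

def decipher_text(text, cipher_dict=ceasar_cipher):
    # Invert the cipher dictionary (k, v) -> (v, k) and apply it.
    decipher_dict = {v: k for k, v in cipher_dict.items()}
    return ''.join(decipher_dict.get(x, x) for x in text)

ciphered = cipher_text('hello world')
print(ciphered)                 # 'ifmmp xpsme'
print(decipher_text(ciphered))  # 'hello world'
```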

The following functions generate proposal ciphers for the Metropolis algorithm. The idea is to randomly generate ciphers and see what text they produce. If the text resulting from a proposed cipher is more likely (according to the lexical database) than the text from the current cipher, we accept the proposal. If it's not, we accept it with a probability that decreases the less likely the resulting text is.

The method of generating new proposals is important. The authors use a method that chooses a key (letter) at random from the current cipher and swaps it with some other letter. For example, if we start with the Caesar cipher, our proposal might randomly choose to re-map A to N (instead of B). The proposal would then be the same as the Caesar cipher, but with A $\rightarrow$ N and M $\rightarrow$ B (since A originally mapped to B and M originally mapped to N). This proposal-generating mechanism is encapsulated in `propose_modified_cipher_from_cipher`.

This is inefficient in a few ways. First, the letter chosen to modify may not even appear in the text, so the proposed cipher won't change the text at all and we waste cycles generating useless proposals. Second, we may end up picking a letter that occurs in a highly likely word, which increases the probability of generating an inferior proposal.

We'll suggest another mechanism that, instead of selecting a letter from the current cipher to re-map, chooses a letter from among the non-words in the current deciphered text. For example, if our current deciphered text is "hello wqrld", we will only select from {w, q, r, l, d} to modify at random. This minimizes the chances that a modified cipher will turn real words into gibberish and produce less likely text. The function `propose_modified_cipher_from_text` implements this proposal mechanism.

One way to think of this is that it's analogous to tuning the variance of the proposal distribution in the typical Metropolis algorithm. If the variance is too low, our algorithm won't efficiently explore the target distribution. If it's too high, we'll end up generating lots of lousy proposals. Our cipher proposal rules can suffer from similar problems.
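The key property of the swap move is that the proposal is still a one-to-one mapping. A standalone sketch of the swap (mirroring the `modify_cipher` function defined below) checks this invariant:

```python
import random
import string

letters = list(string.ascii_lowercase)

def modify_cipher(cipher_dict, key, new_output):
    # Re-map `key` to `new_output`, and give the key that previously
    # produced `new_output` the old output, so the map stays a bijection.
    decipher_dict = {v: k for k, v in cipher_dict.items()}
    old_output = cipher_dict[key]
    new_cipher = cipher_dict.copy()
    new_cipher[key] = new_output
    new_cipher[decipher_dict[new_output]] = old_output
    return new_cipher

random.seed(1)
shuffled = letters[:]
random.shuffle(shuffled)
cipher = dict(zip(letters, shuffled))

proposal = modify_cipher(cipher, 'a', 'n')
print(sorted(proposal.values()) == letters)  # still a permutation of a-z
print(proposal['a'])                         # 'n'
```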

In [60]:

```
def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one letter -> letter map).
    Used to generate the starting cipher of the algorithm.
    '''
    input = letters
    output = letters[:]
    random.shuffle(output)
    cipher_dict = {k: v for (k, v) in zip(input, output)}
    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
    Swap a single key in a cipher dictionary.
    Old: a -> b, ..., m -> n, ...
    New: a -> n, ..., m -> b, ...
    '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]
    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    new_cipher[decipher_dict[new_output]] = old_output
    return new_cipher

def propose_modified_cipher_from_cipher(text, cipher_dict,
                                        lexical_db = lexical_database):
    '''
    Generate a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _ = text  # Unused
    input = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict,
                                      lexical_db = lexical_database):
    '''
    Generate a new cipher by choosing and swapping a key in the current
    cipher, but only choose keys that are letters appearing in the
    gibberish words of the current deciphered text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    letters_to_sample = ''.join([t for t in deciphered
                                 if lexical_db.get(t) is None])
    # Fall back to all the text's letters if every word is in the database.
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))
    input = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)
```

The following functions compute the likelihood of text generated by proposed ciphers. To find a word's probability, we just look it up in the lexical database. To compute the log probability of some text, we just sum up the log probabilities of the words in it.
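As a toy illustration, with a hypothetical three-word dictionary standing in for the Wikipedia database; unknown words get a tiny floor probability so the log stays finite:

```python
import math
import numpy as np

# Hypothetical stand-in for the lexical database: word -> relative frequency.
toy_db = {'the': 0.05, 'cat': 0.001, 'sat': 0.0005}

def one_gram_prob(word, db=toy_db):
    # Unknown words get machine epsilon so log() stays finite.
    return db.get(word) or np.finfo(float).eps

def phrase_logp(phrase, db=toy_db):
    # Sum of per-word log probabilities.
    return sum(math.log(one_gram_prob(w, db)) for w in phrase.split())

print(phrase_logp('the cat'))  # log(0.05) + log(0.001)
print(phrase_logp('the qzx'))  # much lower: 'qzx' falls back to epsilon
```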

In [43]:

```
def one_gram_prob(one_gram, lexical_db = lexical_database):
    return lexical_db.get(one_gram) or np.finfo(float).eps

def text_logp(text, cipher_dict, lexical_db = lexical_database):
    deciphered = decipher_text(text, cipher_dict).split()
    logp = np.array([math.log(one_gram_prob(w)) for w in deciphered]).sum()
    return logp
```

Now the Metropolis algorithm step. We generate a proposal cipher and compute the log probability of its decoded text. If this is greater than the log probability of the current cipher's decoded text, we accept the proposal outright. Otherwise, we accept it with a probability given by its likelihood relative to that of the current cipher, so the less likely its decoded text is, the less likely we are to accept it.
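The acceptance rule itself is just $\min(1, e^{lp_2 - lp_1})$; a minimal standalone sketch (the clamp via `min(0.0, ...)` also guards against overflow for large positive differences):

```python
import math

def acceptance_prob(lp_proposed, lp_current):
    # Metropolis rule: always accept an improvement; otherwise accept
    # with probability equal to the likelihood ratio.
    return min(1.0, math.exp(min(0.0, lp_proposed - lp_current)))

print(acceptance_prob(-10.0, -12.0))  # improvement: accepted with probability 1
print(acceptance_prob(-12.0, -10.0))  # worse by 2 nats: exp(-2), about 13.5%
```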

In [44]:

```
def metropolis_step(text, cipher_dict, proposal_rule, lexical_db = lexical_database):
    proposed_cipher = proposal_rule(text, cipher_dict)
    lp1 = text_logp(text, cipher_dict)
    lp2 = text_logp(text, proposed_cipher)
    if lp2 > lp1:
        return proposed_cipher
    else:
        a = math.exp(lp2 - lp1)
        x = random.random()
        if x < a:
            return proposed_cipher
        else:
            return cipher_dict
```

Create a message and encrypt it with the Caesar cipher. Then set up the Metropolis algorithm, which generates proposal ciphers (we can plug in different proposal mechanisms) and calls the Metropolis step defined above. It runs for a fixed number of iterations.

In [45]:

```
message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed = 4):
    random.seed(seed)
    cipher = generate_random_cipher()
    deciphered_text_list = []
    logp_list = []
    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)
        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)
        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)
    results = DataFrame({'deciphered_text': deciphered_text_list, 'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results
```

First, let's try the authors' proposal rule.

In [46]:

```
results0 = metropolis_decipher(ciphered_text,
propose_modified_cipher_from_cipher, niter)
```

Looking at the deciphered text of every 10,000th entry, we find we're not even close after 250,000 iterations, and we seem to be locked in a pocket of the target distribution that the proposal rule has trouble escaping.

In [47]:

```
results0.ix[10000::10000]
```

Out[47]:

Now, let's try the alternative proposal rule, which only chooses letters from gibberish words when it modifies the current cipher to propose a new one.

In [48]:

```
results1 = metropolis_decipher(ciphered_text,
propose_modified_cipher_from_text, niter)
```

The algorithm doesn't find the actual message, but it does find a more likely message (according to the lexical database) within 20,000 iterations.

In [49]:

```
results1.ix[10000::10000]
```

Out[49]:

Let's take a look at the likelihood paths of the algorithm under the two proposal rules. We can also compute the probability of the original message, which turns out to be somewhat less likely than the message the "propose from text" rule converges to.

In [50]:

```
plt.figure(figsize = (10, 7))
results0.logp.plot(style = 'r-', label = 'propose from cipher')
results1.logp.plot(style = 'k-', label = 'propose from text')
plt.xlabel('Iteration')
plt.ylabel('Log Probability')
plt.axhline(text_logp(ciphered_text, ceasar_cipher), label = 'log prob of true text')
plt.legend(loc = 'best')
```

Out[50]:

The Metropolis algorithm is kind of pointless for this application. It's really just jumping around looking for the most likely phrase. But since the likelihood of a message is just the sum of the log probabilities of its component words, we only need to find the most likely words with the same lengths as the words of the ciphered message.

If the message at some point is "fgk tp hpdt", then, if run long enough, the algorithm should just find the most likely three-letter word, the most likely two-letter word, and the most likely four-letter word. But we can look these up directly.

For example, the message we encrypted is 'here is some sample text', which has word lengths 4, 2, 4, 6, 4. What's the most likely message with these word lengths?

In [64]:

```
def maxprob_message(word_lens = (4, 2, 4, 6, 4), lexical_db = lexical_database):
    db_word_series = Series(lexical_db.index)
    db_word_len = db_word_series.str.len()
    max_prob_wordlist = []
    logp = 0.0
    for i in word_lens:
        db_words_i = list(db_word_series[db_word_len == i])
        db_max_prob_word = lexical_db[db_words_i].idxmax()
        logp += math.log(lexical_db[db_words_i].max())
        max_prob_wordlist.append(db_max_prob_word)
    return max_prob_wordlist, logp
```
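With the real database not at hand, the same lookup can be sketched against a toy pandas Series (word → frequency); `toy_maxprob_message` and the word frequencies below are hypothetical stand-ins:

```python
import math
from pandas import Series

# Hypothetical stand-in for the lexical database.
toy_db = Series({'here': 0.01, 'hare': 0.002, 'is': 0.03, 'as': 0.01,
                 'some': 0.008, 'sole': 0.001, 'text': 0.004})

def toy_maxprob_message(word_lens, lexical_db):
    words = Series(lexical_db.index)
    word_len = words.str.len()
    wordlist, logp = [], 0.0
    for n in word_lens:
        # All database words of length n, and the most frequent among them.
        candidates = lexical_db.loc[list(words[word_len == n])]
        wordlist.append(candidates.idxmax())
        logp += math.log(candidates.max())
    return wordlist, logp

print(toy_maxprob_message((4, 2), toy_db))  # (['here', 'is'], log(0.01) + log(0.03))
```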

In [65]:

```
maxprob_message()
```

Out[65]: