In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
In [10]:
import math
import numpy as np
from pandas import *
import matplotlib.pyplot as plt
import scipy.optimize as opt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import os
import random

Ridge regression by explicit SSE minimization.

In [11]:
heights_weights = read_csv('data/01_heights_weights_genders.csv')

The OLS estimate

In [24]:
ols_fit = ols('Weight ~ Height', df = heights_weights).fit()
ols_sse = ols_fit.mse_resid * (ols_fit.df_resid) 

print np.round(ols_fit.params, 3)
print 'SSE   %i' % round(ols_sse)
Intercept   -350.737
Height         7.717
SSE   1492935
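(In statsmodels, mse_resid is the residual sum of squares divided by the residual degrees of freedom, so multiplying it back by df_resid recovers the SSE, $\sum_i (y_i - \hat{y}_i)^2$.)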

Set up the ridge regression.

We'll explicitly set up the ridge SSE function to minimize. We'll also provide explicit functions for the gradient and Hessian of the SSE, since they're straightforward to derive.
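In matrix form, with the intercept folded into $X$ and parameter vector $\beta$, the objective and its derivatives (matching the functions below) are

$$\mathrm{SSE}_\lambda(\beta) = (y - X\beta)^\top (y - X\beta) + \lambda\,\beta^\top\beta$$

$$\nabla \mathrm{SSE}_\lambda(\beta) = 2\left(X^\top X\beta - X^\top y + \lambda\beta\right), \qquad \nabla^2 \mathrm{SSE}_\lambda(\beta) = 2\left(X^\top X + \lambda I\right)$$

Note that this version penalizes the intercept along with the slope.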

We'll try a couple of different optimization algorithms; some use the gradient and Hessian, while others (e.g., Nelder-Mead) use neither.

In [25]:
y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend = True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute SSE of the ridge regression.
    This is the normal regression SSE, plus the
    L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    sse += lam * (params ** 2).sum()
    
    return sse

def ridge_grad(params, y, Xmat, lam):
    '''
    The gradient of the ridge regression SSE.
    '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    '''
    The Hessian of the ridge regression SSE.
    '''
    # The factor of 2 matches the gradient above.
    return 2 * (np.dot(Xmat.T, Xmat) + np.eye(Xmat.shape[1]) * lam)
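As a quick sanity check (not one of the original cells), this penalized objective has a closed-form minimizer, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, which the optimizers below should reproduce. A minimal sketch, with illustrative variable names:

lam = 1.0
XtX = np.dot(Xmat.T, Xmat)
Xty = np.dot(Xmat.T, y)
# Closed-form ridge solution; like ridge_error, it penalizes the intercept too.
beta_closed_form = np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), Xty)
print beta_closed_form  # should be close to the optimizer solutions below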

Minimize the ridge SSE using different algorithms

Set the starting parameters (constant, coefficient).

In [26]:
params0 = np.array([0.0, 0.0])

Nelder-Mead simplex: no gradient required. This is the default method of optim in R.

In [28]:
ridge_fit = opt.fmin(ridge_error, params0, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 117
         Function evaluations: 221
Solution: a = -340.565, b =    7.565 

Newton conjugate gradient: requires gradient, hessian optional.

First without the hessian.

In [34]:
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 4
         Gradient evaluations: 11
         Hessian evaluations: 0
Solution: a = -340.565, b =    7.565 

Now, with the hessian. This shaves a little bit of time off.

In [35]:
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, 
                         fhess = ridge_hess, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 7
         Gradient evaluations: 3
         Hessian evaluations: 3
Solution: a = -340.565, b =    7.565 

BFGS: uses the gradient, but no Hessian.

In [36]:
ridge_fit = opt.fmin_bfgs(ridge_error, params0, fprime = ridge_grad, 
                          args = (y, Xmat, 1))
print 'Solution: ', ridge_fit
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 15
         Function evaluations: 290
         Gradient evaluations: 243
Solution:  [-340.565481     7.5645375]

Deciphering text with the Metropolis-Hastings Algorithm

The lexical database is a collection of words and their frequency on Wikipedia.

In [37]:
lexical_database = read_csv('data/lexical_database.csv', index_col = 0, 
                            header = None, skiprows = 1, squeeze = True)
lexical_database.index.name = 'word'
In [38]:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 
           'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 
           'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 
           'y', 'z']

The "Ceasar Cipher" is a cipher that encrypts text by translating each letter in the text to the next letter in the alphabet (with Z going back to A). We'll represent the cipher, and its key, as mappings in a dictionary.

In [39]:
ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {ceasar_cipher[k]: k for k in ceasar_cipher}

Functions that cipher (encrypt) and decipher a text string. decipher_text doesn't need an explicit deciphering key; it inverts the cipher dictionary used by cipher_text.

In [40]:
def cipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    
    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    
    # Invert the cipher dictionary (k, v) -> (v, k)
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}

    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist]) 
    return deciphered
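As a quick illustration (not one of the original cells), a Caesar round trip on a sample message:

print cipher_text('here is some sample text')    # ifsf jt tpnf tbnqmf ufyu
print decipher_text('ifsf jt tpnf tbnqmf ufyu')  # here is some sample text

Characters with no entry in the cipher dictionary (like spaces) pass through unchanged, thanks to the cipher_dict.get(x) or x fallback.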

The following functions are used to generate proposal ciphers for the Metropolis algorithm. The idea is to randomly generate ciphers and see what text they produce. If the text resulting from a proposed cipher is more likely (according to the lexical database) than the text from the current cipher, we accept the proposal. If it's not, we accept it with a probability that decreases the less likely the resulting text is.

The method of generating new proposals is important. The authors use a method that chooses a key (letter) at random from the current cipher and swaps it with some other letter. For example, if we start with the Caesar cipher, our proposal might randomly choose to re-map A to N (instead of B). The proposal would then be the same as the Caesar cipher, but with A $\rightarrow$ N and M $\rightarrow$ B (since A originally mapped to B and M originally mapped to N). This proposal-generating mechanism is encapsulated in propose_modified_cipher_from_cipher.

This is inefficient in a few ways. First, the letter chosen to modify may not even appear in the text, so the proposed cipher won't change the text at all and we waste cycles generating useless proposals. Second, we may end up picking a letter that occurs in a highly likely word, which increases the chance of generating an inferior proposal.

We'll suggest another mechanism that, instead of selecting a letter from the current cipher to re-map, chooses a letter from among the gibberish (non-dictionary) words in the current deciphered text. For example, if our current deciphered text is "hello wqrld", we will only select from {w, q, r, l, d} to modify at random. This minimizes the chance that a modified cipher will turn real words into gibberish and produce less likely text. The function propose_modified_cipher_from_text implements this proposal mechanism.

One way to think of this is that it's analogous to tuning the variance of the proposal distribution in the typical Metropolis algorithm. If the variance is too low, our algorithm won't efficiently explore the target distribution. If it's too high, we'll end up generating lots of lousy proposals. Our cipher proposal rules can suffer from similar problems.

In [60]:
def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one letter -> letter map).
    Used to generate the starting cipher of the algorithm.
    '''

    input  = letters
    output = letters[:]
    random.shuffle(output)
    
    cipher_dict = {k: v for (k, v) in zip(input, output)}
    
    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
     Swap a single key in a cipher dictionary.

     Old: a -> b, ..., m -> n, ...
     New: a -> n, ..., m -> b, ...
     '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]
    
    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    new_cipher[decipher_dict[new_output]] = old_output
    
    return new_cipher
    
def propose_modified_cipher_from_cipher(text, cipher_dict, 
                                        lexical_db = lexical_database):
    '''
    Generates a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _          = text # Unused
    input      = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict, 
                                      lexical_db = lexical_database):
    
    '''
    Generates a new cipher by choosing a swapping a key in the current
    cipher, but only chooses keys that are letters that appear in the
    gibberish words in the current text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    letters_to_sample = ''.join([t for t in deciphered 
                                 if lexical_db.get(t) is None])
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))
    
    input      = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)
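To make the swap concrete, here's an illustrative (non-original) call that re-maps 'a' to 'n' in the Caesar cipher, reproducing the A $\rightarrow$ N, M $\rightarrow$ B example from above:

new_cipher = modify_cipher(ceasar_cipher, 'a', 'n')
print new_cipher['a'], new_cipher['m']  # n b  ('m' picks up 'a's old output, 'b')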

The following functions compute the likelihood of text generated by proposed ciphers. To find a word's probability, we just look it up in the lexical database. To compute the log probability of some text, we just sum up the log probabilities of the words in it.
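In symbols, for deciphered words $w_1, \ldots, w_n$, with words missing from the database assigned machine epsilon,

$$\log P(\text{text}) = \sum_{i=1}^{n} \log p(w_i)$$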

In [43]:
def one_gram_prob(one_gram, lexical_db = lexical_database):
    return lexical_db.get(one_gram) or np.finfo(float).eps 

def text_logp(text, cipher_dict, lexical_db = lexical_database):
    deciphered = decipher_text(text, cipher_dict).split()
    logp = np.array([math.log(one_gram_prob(w)) for w in deciphered]).sum()
    return logp

Now the Metropolis step. We generate a proposal cipher and compute the log probability of its decoded text. If this is greater than the log probability of the current cipher's decoded text, we accept the proposal outright. Otherwise, we accept it with a probability given by its likelihood relative to that of the current cipher, so the less likely its decoded text, the less likely we are to accept it.
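In the code below, the acceptance probability is the likelihood ratio of the two decodings, capped at one:

$$a = \min\!\left(1, \frac{P(\text{proposed decoding})}{P(\text{current decoding})}\right) = \min\!\left(1, e^{\,\mathrm{lp}_2 - \mathrm{lp}_1}\right)$$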

In [44]:
def metropolis_step(text, cipher_dict, proposal_rule, lexical_db = lexical_database):
    proposed_cipher = proposal_rule(text, cipher_dict)
    lp1 = text_logp(text, cipher_dict)
    lp2 = text_logp(text, proposed_cipher)
    
    if lp2 > lp1:
        return proposed_cipher
    else:
        a = math.exp(lp2 - lp1)
        x = random.random()
        if x < a:
            return proposed_cipher
        else:
            return cipher_dict

Create a message and encrypt it with the Caesar cipher. Then set up the Metropolis algorithm, which generates proposal ciphers (we can plug in different proposal mechanisms) and calls the Metropolis step defined above. This runs for a fixed number of iterations.

In [45]:
message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed = 4):
    random.seed(seed)
    cipher = generate_random_cipher()

    deciphered_text_list = []
    logp_list = []

    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)

        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)
    
        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)
    
    results = DataFrame({'deciphered_text': deciphered_text_list, 'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results

First, let's try the authors' proposal rule.

In [46]:
results0 = metropolis_decipher(ciphered_text, 
                               propose_modified_cipher_from_cipher, niter)

Looking at the deciphered text at every 10,000th iteration, we find we're not even close after 250,000 iterations, and we seem to be stuck in a pocket of the target distribution that the proposal rule has trouble escaping from.

In [47]:
results0.ix[10000::10000]
Out[47]:
deciphered_text logp
10000 kudu of feru fyrvbu hush -86.585205
20000 wudu of feru fbrkxu hush -87.124919
30000 kudu of feru fnrbau hush -86.585205
40000 wudu of feru fmrjiu hush -87.124919
50000 kudu of feru fyrnbu hush -86.585205
60000 kudu of feru fxrnvu hush -86.585205
70000 pudu of feru fvrnlu hush -87.561022
80000 kudu of feru fvrxgu hush -86.585205
90000 kudu of feru fbrvtu hush -86.585205
100000 kudu of feru fjrnlu hush -86.585205
110000 kudu of feru fprbju hush -86.585205
120000 kudu of feru fnrjcu hush -86.585205
130000 kudu of feru flrvpu hush -86.585205
140000 puku of feru flrvxu hush -88.028362
150000 kudu of feru fxrviu hush -86.585205
160000 pulu of feru ftrdzu hush -88.323162
170000 wuzu of feru flrxdu hush -89.575925
180000 kudu of feru firamu hush -86.585205
190000 wudu of feru fyrzqu hush -87.124919
200000 wudu of feru fnraxu hush -87.124919
210000 puku of feru fjrnyu hush -88.028362
220000 puku of feru firyau hush -88.028362
230000 pudu of feru fkrcvu hush -87.561022
240000 kudu of feru ftrwzu hush -86.585205
250000 kudu of feru fprxzu hush -86.585205

Now, let's try the alternative proposal rule, which only chooses letters from gibberish words when it modifies the current cipher to propose a new one.

In [48]:
results1 = metropolis_decipher(ciphered_text, 
                               propose_modified_cipher_from_text, niter)

The algorithm doesn't find the actual message, but it does find a more likely message (according to the lexical database) within 20,000 iterations.

In [49]:
results1.ix[10000::10000]
Out[49]:
deciphered_text logp
10000 were mi isle izlkde text -68.946850
20000 were as some simple text -35.784429
30000 were as some simple text -35.784429
40000 were as some simple text -35.784429
50000 were as some simple text -35.784429
60000 were as some simple text -35.784429
70000 were us some simple text -38.176725
80000 were as some simple text -35.784429
90000 were as some simple text -35.784429
100000 were as some simple text -35.784429
110000 were as some simple text -35.784429
120000 were as some simple text -35.784429
130000 were as some simple text -35.784429
140000 were as some simple text -35.784429
150000 were us some simple text -38.176725
160000 were as some simple text -35.784429
170000 were is some sample text -37.012894
180000 were as some simple text -35.784429
190000 were as some simple text -35.784429
200000 were as some simple text -35.784429
210000 were as some simple text -35.784429
220000 were as some simple text -35.784429
230000 were as some simple text -35.784429
240000 were as some simple text -35.784429
250000 were is some sample text -37.012894

Let's take a look at the likelihood paths of the algorithm under the two proposal rules. We can also compute the log probability of the original message, which turns out to be somewhat lower than that of the message the "propose from text" rule converges to.

In [50]:
plt.figure(figsize = (10, 7))
results0.logp.plot(style = 'r-', label = 'propose from cipher')
results1.logp.plot(style = 'k-', label = 'propose from text')
plt.xlabel('Iteration')
plt.ylabel('Log Probability')

plt.axhline(text_logp(ciphered_text, ceasar_cipher), label = 'log prob of true text')
plt.legend(loc = 'best')
Out[50]:
<matplotlib.legend.Legend at 0x10df05f90>

The Metropolis algorithm is kind of pointless for this application. It's really just jumping around looking for the most likely phrase. But since the log probability of a message is just the sum of the log probabilities of its component words, we can simply look for the most likely words with the same lengths as the words in the ciphered message.

If the message at some point is "fgk tp hpdt", then, if run long enough, the algorithm should just find the most likely three-letter word, the most likely two-letter word, and the most likely four-letter word. But we can look these up directly.

For example, the message we encrypted is 'here is some sample text', which has word lengths 4, 2, 4, 6, 4. What's the most likely message with these word lengths?

In [64]:
def maxprob_message(word_lens = (4, 2, 4, 6, 4), lexical_db = lexical_database):
    db_word_series = Series(lexical_db.index)
    db_word_len    = db_word_series.str.len() 
    max_prob_wordlist = []
    logp = 0.0
    for i in word_lens:
        db_words_i = list(db_word_series[db_word_len == i])
        db_max_prob_word = lexical_db[db_words_i].idxmax()
        logp += math.log(lexical_db[db_words_i].max())
        max_prob_wordlist.append(db_max_prob_word)
    return max_prob_wordlist, logp
In [65]:
maxprob_message()
Out[65]:
(['with', 'of', 'with', 'united', 'with'], -25.642396806584493)