In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
In [10]:
import math
import numpy as np
from pandas import *
import matplotlib.pyplot as plt
import scipy.optimize as opt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import os
import random

Ridge regression by explicit SSE minimization.

In [11]:
heights_weights = read_csv('data/01_heights_weights_genders.csv')

The OLS estimate

In [24]:
ols_fit = ols('Weight ~ Height', df = heights_weights).fit()
ols_sse = ols_fit.mse_resid * (ols_fit.df_resid) 

print np.round(ols_fit.params, 3)
print 'SSE   %i' % round(ols_sse)
Intercept   -350.737
Height         7.717
SSE   1492935
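(In statsmodels, mse_resid is the residual sum of squares divided by the residual degrees of freedom, so multiplying it back by df_resid recovers the SSE, $\sum_i (y_i - \hat{y}_i)^2$.)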

Set up the ridge regression.

We'll explicitly set up the ridge SSE function to minimize. We'll also provide explicit functions for the gradient and Hessian of the SSE, since they're straightforward to derive.
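In matrix form, with the intercept folded into $X$ and parameter vector $\beta$, the objective and its derivatives (matching the functions below) are

$$\mathrm{SSE}_\lambda(\beta) = (y - X\beta)^\top (y - X\beta) + \lambda\,\beta^\top\beta$$

$$\nabla \mathrm{SSE}_\lambda(\beta) = 2\left(X^\top X\beta - X^\top y + \lambda\beta\right), \qquad \nabla^2 \mathrm{SSE}_\lambda(\beta) = 2\left(X^\top X + \lambda I\right)$$

Note that this version penalizes the intercept along with the slope.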

We'll try a couple of different optimization algorithms; some use the gradient and Hessian, while others (e.g., Nelder-Mead) use neither.

In [25]:
y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend = True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute SSE of the ridge regression.
    This is the normal regression SSE, plus the
    L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    sse += lam * (params ** 2).sum()
    
    return sse

def ridge_grad(params, y, Xmat, lam):
    '''
    The gradient of the ridge regression SSE.
    '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    '''
    The Hessian of the ridge regression SSE.
    '''
    # The factor of 2 matches the gradient above.
    return 2 * (np.dot(Xmat.T, Xmat) + np.eye(Xmat.shape[1]) * lam)
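As a quick sanity check (not one of the original cells), this penalized objective has a closed-form minimizer, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, which the optimizers below should reproduce. A minimal sketch, with illustrative variable names:

lam = 1.0
XtX = np.dot(Xmat.T, Xmat)
Xty = np.dot(Xmat.T, y)
# Closed-form ridge solution; like ridge_error, it penalizes the intercept too.
beta_closed_form = np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), Xty)
print beta_closed_form  # should be close to the optimizer solutions below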

Minimize the ridge SSE using different algorithms

Set the starting parameters (constant, coefficient).

In [26]:
params0 = np.array([0.0, 0.0])

Nelder-Mead simplex: no gradient required. This is the default method of optim in R.

In [28]:
ridge_fit = opt.fmin(ridge_error, params0, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 117
         Function evaluations: 221
Solution: a = -340.565, b =    7.565 

Newton conjugate gradient: requires gradient, hessian optional.

First without the hessian.

In [34]:
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 4
         Gradient evaluations: 11
         Hessian evaluations: 0
Solution: a = -340.565, b =    7.565 

Now, with the hessian. This shaves a little bit of time off.

In [35]:
ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, 
                         fhess = ridge_hess, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 7
         Gradient evaluations: 3
         Hessian evaluations: 3
Solution: a = -340.565, b =    7.565 

BFGS: uses the gradient, but no Hessian.

In [36]:
ridge_fit = opt.fmin_bfgs(ridge_error, params0, fprime = ridge_grad, 
                          args = (y, Xmat, 1))
print 'Solution: ', ridge_fit
Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 15
         Function evaluations: 290
         Gradient evaluations: 243
Solution:  [-340.565481     7.5645375]

Deciphering text with the Metropolis-Hastings Algorithm

The lexical database is a collection of words and their frequency on Wikipedia.

In [37]:
lexical_database = read_csv('data/lexical_database.csv', index_col = 0, 
                            header = None, skiprows = 1, squeeze = True)
lexical_database.index.name = 'word'
In [38]:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 
           'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 
           'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 
           'y', 'z']

The "Ceasar Cipher" is a cipher that encrypts text by translating each letter in the text to the next letter in the alphabet (with Z going back to A). We'll represent the cipher, and its key, as mappings in a dictionary.

In [39]:
ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {ceasar_cipher[k]: k for k in ceasar_cipher}

Functions that cipher (encrypt) and decipher a text string. decipher_text doesn't need an explicit deciphering key; it inverts the cipher dictionary used by cipher_text.

In [40]:
def cipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    
    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    
    # Invert the cipher dictionary (k, v) -> (v, k)
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}

    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist]) 
    return deciphered
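As a quick illustration (not one of the original cells), a Caesar round trip on a sample message:

print cipher_text('here is some sample text')    # ifsf jt tpnf tbnqmf ufyu
print decipher_text('ifsf jt tpnf tbnqmf ufyu')  # here is some sample text

Characters with no entry in the cipher dictionary (like spaces) pass through unchanged, thanks to the cipher_dict.get(x) or x fallback.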

The following functions are used to generate proposal ciphers for the Metropolis algorithm. The idea is to randomly generate ciphers and see what text they produce. If the text resulting from a proposed cipher is more likely (according to the lexical database) than the text from the current cipher, we accept the proposal. If it's not, we accept it with a probability that decreases the less likely the resulting text is.

The method of generating new proposals is important. The authors use a method that chooses a key (letter) at random from the current cipher and swaps it with some other letter. For example, if we start with the Caesar cipher, our proposal might randomly choose to re-map A to N (instead of B). The proposal would then be the same as the Caesar cipher, but with A $\rightarrow$ N and M $\rightarrow$ B (since A originally mapped to B and M originally mapped to N). This proposal-generating mechanism is encapsulated in propose_modified_cipher_from_cipher.

This is inefficient in a few ways. First, the letter chosen to modify may not even appear in the text, so the proposed cipher won't change the text at all and we waste cycles generating useless proposals. Second, we may end up picking a letter that occurs in a highly likely word, which increases the chance of generating an inferior proposal.

We'll suggest another mechanism that, instead of selecting a letter from the current cipher to re-map, chooses a letter from among the gibberish (non-dictionary) words in the current deciphered text. For example, if our current deciphered text is "hello wqrld", we will only select from {w, q, r, l, d} to modify at random. This minimizes the chance that a modified cipher will turn real words into gibberish and produce less likely text. The function propose_modified_cipher_from_text implements this proposal mechanism.

One way to think of this is that it's analogous to tuning the variance of the proposal distribution in the typical Metropolis algorithm. If the variance is too low, our algorithm won't efficiently explore the target distribution. If it's too high, we'll end up generating lots of lousy proposals. Our cipher proposal rules can suffer from similar problems.

In [60]:
def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one letter -> letter map).
    Used to generate the starting cipher of the algorithm.
    '''

    input  = letters
    output = letters[:]
    random.shuffle(output)
    
    cipher_dict = {k: v for (k, v) in zip(input, output)}
    
    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
     Swap a single key in a cipher dictionary.

     Old: a -> b, ..., m -> n, ...
     New: a -> n, ..., m -> b, ...
     '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]
    
    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    new_cipher[decipher_dict[new_output]] = old_output
    
    return new_cipher
    
def propose_modified_cipher_from_cipher(text, cipher_dict, 
                                        lexical_db = lexical_database):
    '''
    Generates a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _          = text # Unused
    input      = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict, 
                                      lexical_db = lexical_database):
    
    '''
    Generates a new cipher by choosing a swapping a key in the current
    cipher, but only chooses keys that are letters that appear in the
    gibberish words in the current text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    letters_to_sample = ''.join([t for t in deciphered 
                                 if lexical_db.get(t) is None])
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))
    
    input      = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)
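To make the swap concrete, here's an illustrative (non-original) call that re-maps 'a' to 'n' in the Caesar cipher, reproducing the A $\rightarrow$ N, M $\rightarrow$ B example from above:

new_cipher = modify_cipher(ceasar_cipher, 'a', 'n')
print new_cipher['a'], new_cipher['m']  # n b  ('m' picks up 'a's old output, 'b')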

The following functions compute the likelihood of text generated by proposed ciphers. To find a word's probability, we just look it up in the lexical database. To compute the log probability of some text, we just sum up the log probabilities of the words in it.
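In symbols, for deciphered words $w_1, \ldots, w_n$, with words missing from the database assigned machine epsilon,

$$\log P(\text{text}) = \sum_{i=1}^{n} \log p(w_i)$$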

In [43]:
def one_gram_prob(one_gram, lexical_db = lexical_database):
    return lexical_db.get(one_gram) or np.finfo(float).eps 

def text_logp(text, cipher_dict, lexical_db = lexical_database):
    deciphered = decipher_text(text, cipher_dict).split()
    logp = np.array([math.log(one_gram_prob(w)) for w in deciphered]).sum()
    return logp

Now the Metropolis step. We generate a proposal cipher and compute the log probability of its decoded text. If this is greater than the log probability of the current cipher's decoded text, we accept the proposal outright. Otherwise, we accept it with a probability given by its likelihood relative to that of the current cipher, so the less likely its decoded text, the less likely we are to accept it.
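In the code below, the acceptance probability is the likelihood ratio of the two decodings, capped at one:

$$a = \min\!\left(1, \frac{P(\text{proposed decoding})}{P(\text{current decoding})}\right) = \min\!\left(1, e^{\,\mathrm{lp}_2 - \mathrm{lp}_1}\right)$$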

In [44]:
def metropolis_step(text, cipher_dict, proposal_rule, lexical_db = lexical_database):
    proposed_cipher = proposal_rule(text, cipher_dict)
    lp1 = text_logp(text, cipher_dict)
    lp2 = text_logp(text, proposed_cipher)
    
    if lp2 > lp1:
        return proposed_cipher
    else:
        a = math.exp(lp2 - lp1)
        x = random.random()
        if x < a:
            return proposed_cipher
        else:
            return cipher_dict

Create a message and encrypt it with the Caesar cipher. Then set up the Metropolis algorithm, which generates proposal ciphers (we can plug in different proposal mechanisms) and calls the Metropolis step defined above. This runs for a fixed number of iterations.

In [45]:
message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed = 4):
    random.seed(seed)
    cipher = generate_random_cipher()

    deciphered_text_list = []
    logp_list = []

    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)

        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)
    
        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)
    
    results = DataFrame({'deciphered_text': deciphered_text_list, 'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results

First, let's try the authors' proposal rule.

In [46]:
results0 = metropolis_decipher(ciphered_text, 
                               propose_modified_cipher_from_cipher, niter)

Looking at the deciphered text at every 10,000th iteration, we find we're not even close after 250,000 iterations, and we seem to be stuck in a pocket of the target distribution that the proposal rule has trouble escaping from.

In [47]:
results0.ix[10000::10000]
Out[47]:
deciphered_text logp
10000 kudu of feru fyrvbu hush -86.585205
20000 wudu of feru fbrkxu hush -87.124919
30000 kudu of feru fnrbau hush -86.585205
40000 wudu of feru fmrjiu hush -87.124919
50000 kudu of feru fyrnbu hush -86.585205
60000 kudu of feru fxrnvu hush -86.585205
70000 pudu of feru fvrnlu hush -87.561022
80000 kudu of feru fvrxgu hush -86.585205
90000 kudu of feru fbrvtu hush -86.585205
100000 kudu of feru fjrnlu hush -86.585205
110000 kudu of feru fprbju hush -86.585205
120000 kudu of feru fnrjcu hush -86.585205
130000 kudu of feru flrvpu hush -86.585205
140000 puku of feru flrvxu hush -88.028362
150000 kudu of feru fxrviu hush -86.585205
160000 pulu of feru ftrdzu hush -88.323162
170000 wuzu of feru flrxdu hush -89.575925
180000 kudu of feru firamu hush -86.585205
190000 wudu of feru fyrzqu hush -87.124919
200000 wudu of feru fnraxu hush -87.124919
210000 puku of feru fjrnyu hush -88.028362
220000 puku of feru firyau hush -88.028362
230000 pudu of feru fkrcvu hush -87.561022
240000 kudu of feru ftrwzu hush -86.585205
250000 kudu of feru fprxzu hush -86.585205

Now, let's try the alternative proposal rule, which only chooses letters from gibberish words when it modifies the current cipher to propose a new one.

In [48]:
results1 = metropolis_decipher(ciphered_text, 
                               propose_modified_cipher_from_text, niter)

The algorithm doesn't find the actual message, but it does find a more likely message (according to the lexical database) within 20,000 iterations.

In [49]:
results1.ix[10000::10000]
Out[49]:
deciphered_text logp
10000 were mi isle izlkde text -68.946850
20000 were as some simple text -35.784429
30000 were as some simple text -35.784429
40000 were as some simple text -35.784429
50000 were as some simple text -35.784429
60000 were as some simple text -35.784429
70000 were us some simple text -38.176725
80000 were as some simple text -35.784429
90000 were as some simple text -35.784429
100000 were as some simple text -35.784429
110000 were as some simple text -35.784429
120000 were as some simple text -35.784429
130000 were as some simple text -35.784429
140000 were as some simple text -35.784429
150000 were us some simple text -38.176725
160000 were as some simple text -35.784429
170000 were is some sample text -37.012894
180000 were as some simple text -35.784429
190000 were as some simple text -35.784429
200000 were as some simple text -35.784429
210000 were as some simple text -35.784429
220000 were as some simple text -35.784429
230000 were as some simple text -35.784429
240000 were as some simple text -35.784429
250000 were is some sample text -37.012894

Let's take a look at the likelihood paths of the algorithm under the two proposal rules. We can also compute the log probability of the original message, which turns out to be somewhat lower than that of the message the "propose from text" rule converges to.

In [50]:
plt.figure(figsize = (10, 7))
results0.logp.plot(style = 'r-', label = 'propose from cipher')
results1.logp.plot(style = 'k-', label = 'propose from text')
plt.xlabel('Iteration')
plt.ylabel('Log Probability')

plt.axhline(text_logp(ciphered_text, ceasar_cipher), label = 'log prob of true text')
plt.legend(loc = 'best')
Out[50]:
<matplotlib.legend.Legend at 0x10df05f90>

The Metropolis algorithm is kind of pointless for this application. It's really just jumping around looking for the most likely phrase. But since the log probability of a message is just the sum of the log probabilities of its component words, we can simply look for the most likely words with the same lengths as the words in the ciphered message.

If the message at some point is "fgk tp hpdt", then, if run long enough, the algorithm should just find the most likely three-letter word, the most likely two-letter word, and the most likely four-letter word. But we can look these up directly.

For example, the message we encrypted is 'here is some sample text', which has word lengths 4, 2, 4, 6, 4. What's the most likely message with these word lengths?

In [64]:
def maxprob_message(word_lens = (4, 2, 4, 6, 4), lexical_db = lexical_database):
    db_word_series = Series(lexical_db.index)
    db_word_len    = db_word_series.str.len() 
    max_prob_wordlist = []
    logp = 0.0
    for i in word_lens:
        db_words_i = list(db_word_series[db_word_len == i])
        db_max_prob_word = lexical_db[db_words_i].idxmax()
        logp += math.log(lexical_db[db_words_i].max())
        max_prob_wordlist.append(db_max_prob_word)
    return max_prob_wordlist, logp
In [65]:
maxprob_message()
Out[65]:
(['with', 'of', 'with', 'united', 'with'], -25.642396806584493)