Motif Discovery with EM

Implement the EM algorithm for motif discovery as discussed in Section 2 of the paper The EM Algorithm and the Rise of Computational Biology. Code for setting up the problem, simulating sequences (data), and comparing results is given. The notation is somewhat cumbersome, as is usually the case with EM. Also, not all steps are explicit in the paper, but for the most part the derivation is given.

Note: The variable $Z$ in the code below is related to $\Gamma$ in the paper.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as st
from collections import Counter
print("Modules Imported!")
Modules Imported!
In [2]:
# Setting up the constants
w = 6 # length of motif
N = 20 # number of sequences
len_lu = [w,w+10] # lower and upper bound on the lengths of sequences
m = 4 # alphabet size
In [3]:
# generating theta
def gen_theta(w,m):
    theta = [[0]*m]*(w+1)
    for i in range(w+1):
        th = np.concatenate(([0],st.uniform.rvs(0,1,m-1),[1]))
        th = np.sort(th)
        theta[i] = np.diff(th)
    return theta
In [4]:
# generating RVs and sequences

# generating one sequence
def genSeq(n,p): 
    # Generate an iid random sequences of length n over 
    # the alphabet 0,1,...,|p|-1 where the probability of i is p[i]
    y = [0] * n
    for i in range(n):
        r = st.uniform.rvs()
        y[i] = np.sum(r>np.cumsum(p))
    return y

# generate N sequences containing motifs at random positions
def genSeqMotif(theta,N,len_lu):
    w = len(theta) - 1
    Y = [[]]*N # sequences
    Z = [0]*N # start positions of motifs
    L = [0]*N # lengths of sequences
    for i in range(N):
        L[i] = st.randint.rvs(len_lu[0],len_lu[1]+1)
        Z[i] = st.randint.rvs(0,L[i]-w+1)
        y_nonmotif_pre = genSeq(Z[i],theta[0])
        y_motif = [genSeq(1,theta[i+1])[0] for i in range(w)]     
        y_nonmotif_post = genSeq(L[i]-Z[i]-w, theta[0])
        Y[i] = np.concatenate((y_nonmotif_pre,y_motif,y_nonmotif_post)).astype(int)
    return (Y,Z)
In [5]:
# estimators

# Below is the estimator with complete data. You can use some of this code for the incomplete data. However while here we 
# know the value of z as z=Z[i], for EM we need to do the computation for all possible z and take expectation. Understanding 
# completeDataML is a good first step for writing the EM code
def completeDataML(Y,Z,w,m):
    thetaHatML = [[0]*m] * (w+1)
    hn = np.array([[0]*m] * (w+1)) # composition for one sequence
    h = np.array([[0]*m] * (w+1)) # compostion for all sequences
    N = len(Y)
    for n in range(N):
        z = Z[n]
        y = Y[n]
        y_nonmotif = np.concatenate((y[0:z],y[z+w:])).astype(int)
        y_motif = y[z:z+w]
        hn[0] = [Counter(y_nonmotif)[j] for j in range(m)]
        for i in range(1,w+1):
            hn[i] = [Counter([y_motif[i-1]])[j] for j in range(m)]
        h += hn
    thetaML = [h[i]/sum(h[i]) for i in range(w+1)]
    return thetaML
    
    


# EM motif discovery, stating with theta_start and doing itr_max iterations
def motifDiscoveryEM(Y,theta_start,itr_max):
    theta_current = theta_start
    w = len(theta_current) - 1
    m = len(theta_current[0])
    for itr in range(10):
        # solution
        # This code segment could get long. It may be helpful to package some of it in a function
        ...
    return(theta_current)
In [6]:
theta_start = [[1/m]*m]*(w+1)
np.set_printoptions(precision=2,suppress=True) # printing prettier matrices

# with random theta, do not change the seed in your final submission
np.random.seed(seed=31)
theta = gen_theta(w,m)
(Y,Z) = genSeqMotif(theta,N,len_lu)
thetaML = completeDataML(Y,Z,w,m)
thetaEM = motifDiscoveryEM(Y,theta_start,10)
print('\n--\nTrue values:\n',np.matrix(theta))
print('\nML estimates:\n',np.matrix(thetaML))
# print('\nEM estimates:\n',np.matrix(thetaEM))

# with theta given as below
theta = [[0.25,0.25,0.25,0.25],
         [1,0,0,0],
         [0.2,0,0,0.8],
         [0,.5,0,.5],
         [.3,0,.7,0],
         [0.2,0.3,0,0.5],
         [0.1,0,0,0.9]]
(Y,Z) = genSeqMotif(theta,N,len_lu)
thetaML = completeDataML(Y,Z,w,m)
thetaEM = motifDiscoveryEM(Y,theta_start,10)
print('\n--\nTrue values:\n',np.matrix(theta))
print('\nML estimates:\n',np.matrix(thetaML))
# print('\nEM estimates:\n',np.matrix(thetaEM))
--
True values:
 [[ 0.29  0.48  0.19  0.04]
 [ 0.14  0.07  0.78  0.01]
 [ 0.07  0.01  0.83  0.09]
 [ 0.09  0.29  0.16  0.46]
 [ 0.04  0.39  0.24  0.33]
 [ 0.06  0.13  0.25  0.55]
 [ 0.28  0.01  0.65  0.06]]

ML estimates:
 [[ 0.26  0.45  0.26  0.03]
 [ 0.15  0.05  0.75  0.05]
 [ 0.1   0.    0.85  0.05]
 [ 0.3   0.25  0.1   0.35]
 [ 0.05  0.15  0.35  0.45]
 [ 0.05  0.1   0.4   0.45]
 [ 0.25  0.    0.75  0.  ]]

--
True values:
 [[ 0.25  0.25  0.25  0.25]
 [ 1.    0.    0.    0.  ]
 [ 0.2   0.    0.    0.8 ]
 [ 0.    0.5   0.    0.5 ]
 [ 0.3   0.    0.7   0.  ]
 [ 0.2   0.3   0.    0.5 ]
 [ 0.1   0.    0.    0.9 ]]

ML estimates:
 [[ 0.26  0.25  0.26  0.22]
 [ 1.    0.    0.    0.  ]
 [ 0.05  0.    0.    0.95]
 [ 0.    0.55  0.    0.45]
 [ 0.2   0.    0.8   0.  ]
 [ 0.2   0.4   0.    0.4 ]
 [ 0.1   0.    0.    0.9 ]]
In [ ]: