v1.0 (2014 Fall) Rishi Sharma ***, Sahaana Suri ***, Paul Rigge ***, Kangwook Lee ***, Kannan Ramchandran ***
v1.1 (2015 Fall) Kabir Chandrasekher **, Max Kanwal **, Kangwook Lee ***, Kannan Ramchandran ***
v1.2 (2016 Spring) Kabir Chandrasekher, Tony Duan, David Marn, Ashvin Nair, Kangwook Lee, Kannan Ramchandran
The problem of RNA sequencing is to figure out how much and what type of RNA is present in a genome at a given moment in time. It has many applications including genome annotation, comprehensive identification of fusions in cancer, discovery of novel isoforms of genes, and genome sequence assembly [1][2].
For our purposes, we'll formulate the problem as follows: given a set of short reads that are sampled from a set of larger genes, how can we find the relative abundance of each gene? That is, given just the short reads, how do we know how frequently each original gene occurs? This process is depicted in Figure 1. (Aside: in the actual paper, these "genes" are actually "transcripts," but that's not relevant for us.)
It turns out we can use some methods we learned in class (MLE & EM) to solve this problem -- let's try it out.
We assume that all genes are of the same length, $\ell_t$, and all reads are of the same length, $\ell_x$. Then, we assume the following Bayesian generative model:
Given a set of reads $X = \{X_1 \ldots X_n\}$, we want to figure out what distribution $\rho = \{ \rho_1 \ldots \rho_m \}$ over the genes was most likely to give us the reads $X$. To do this, we'll need to maximize the following likelihood function:
$$ L(\rho) = \prod_{i=1}^{N}{ P(X_i|\rho) } $$To make life easier, we're first going to assume no read is ambiguous. That is -- given a read, we can immediately tell which gene it came from. (In practice, this means we would have to know the chart mapping each color to a gene, as in Figure 1.)
Your solution should take both the gene lengths and the read lengths into consideration. Denote the probability of seeing a specific gene as $\rho_{t_i}$.
Hint: how many possible starting positions exist for $X_i$?
Your answer here
Your answer here
Hint: you might just be able to make an "educated guess" from your answer above.
Your answer here.
Life gets harder when you allow for ambiguous reads. Going back to the figure above, we can imagine a case where we don't know which color a read belongs to, but we have a rough idea of the possible colors it could have been. See the modified figure below.
####First, we define a compatibility matrix $A \in \{0,1\}^{n \times m} = \{a_{i,j}\}$, where $a_{i,j}$ is $1$ if read $i$ is aligned with gene $j$, $0$ otherwise.
As a motivating example, let's assume we have $3$ genes and $5$ reads aligned as follows:
(ref: pp16-17 Pachter 2011)
####Your answer here.
You'll have to tweak your solution to the last portion by carefully considering how you can represent $P(X_i|\rho)$ given the compatibility matrix $A$.
Your answer here.
In general, it is not easy to find the exact maximum likelihood estimator of $\rho$ if ambiguous reads exist. Instead, we can rely on iterative methods that we hope will converge to the true value. One way to go about this is via expectation maximization (EM), as you have seen in class.
To recap, here is a short tutorial that does a wonderful job of explaining all that you need to know about EM [3].
For our purposes, here is an EM algorithm that can be used. You'll be implementing it shortly, so read it carefully and understand why it works.
The following figure is a visual representation of the algorithm (Ref: p17 Pachter 2011) [1].
This is a lot to digest at once, so let's only look at the first step to begin with.
There are three possible genes to analyze (red, green, blue). There are five reads (a,b,c,d,e) that align with different genes. Initially, you assume a uniform prior.
For example, the abundance of red after the first M step is estimated by
$$\rho_1 = 0.47 = \frac{0.33 + 0.5 + 1 + 0.5}{5}$$Your answer here.
Let's begin by with a toy example. We'll start by making the relevant imports...
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Now we'll define our genes.
transcripts = ['ATCTCGACGCACTGC', 'GAGTTCGAACTCTTC', 'AGAGTTCCAGTGTCA', 'AAAGCTCACTGCGGA', 'AGCGATATCAGAGTD']
M = len(transcripts)
We'll randomly generate the true distribution $\rho$.
rho = np.random.rand(M)
rho /= sum(rho)
plt.bar(np.arange(M) + 0.6, rho, width=0.8)
We'll then pick 1000 random reads of length 5 each.
def random_read( s, rho, L ):
chosen_seq = np.random.choice(s, p=rho)
start_idx = np.random.randint(0, len(chosen_seq) - L)
end_idx = start_idx + L
return chosen_seq[ start_idx:end_idx ]
N = 1000 # Number of reads
L = 5
reads = []
for i in range(N):
reads.append( random_read(transcripts, rho, L) )
print('First 10 reads...', reads[0:10])
In the code below, run the EM algorithm for 100 iterations and report your estimated distribution of $\rho$.
N_iter = 100 #number of E/M iterations
def find_all_alignments(s, read):
tmp = []
for j in range(len(s)):
if read in s[j]:
tmp.append(j)
return tmp
# A: Compatibility Matrix.
# Note: You can choose to represent this matrix in whatever form is easiest for your calculations
# The form we are pushing you towards here is just how we chose to implement the algorithm
A = []
for each_read in reads:
A.append( find_all_alignments(transcripts, each_read) )
hidden_state_prior = np.zeros([N, M])
rho_est = np.array([1/5, 1/5, 1/5, 1/5, 1/5])
for i in range(N_iter):
# Your code here
# 1. E step
# 2. M step
pass
Your estimated distribution should look similar to the real distribution.
print('(real) rho', rho)
print('(est.) rho', rho_est)
plt.bar(np.arange(M) + 0.6, rho, color='blue', width=0.4, label='Real')
plt.bar(np.arange(M) + 1, rho_est, color='green', width=0.4, label='Estimated')
plt.legend()
plt.show()
Let's try running the same algorithm on much more data. Here we have a dataset of 150 transcripts of length varying from 4 to 19. We've gathered 10000 reads from exactly 5 of them -- can you recover which 5 transcripts we gathered the reads from?
M, N = 150, 10000
transcripts = np.load('transcripts.npy')
reads = np.load('reads.npy')
# Your answer here.
Congratulations -- you've just finished your last lab for EE 126!
Take a minute to appreciate what you've accomplished throughout the semester. This class wasn't easy, but hopefully you've learned a lot! Come see us if you'd like some help figuring out where to go from here.
[1] L. Pachter. Models for transcript quantification from RNA-Seq. available at arXiv:1104.3889 [q-bio.GN], 2011.
[2] A. Roberts and L. Pachter, RNA-Seq and find: entering the RNA deep field, Genome Medicine, 3 (2011), 74.
[3] C. Do and S. Batzoglou, What is the expectation maximization algorithm? Nature Biotechnology, 26 (2008), 8.