v1.0 (2014 Fall) Rishi Sharma ***, Sahaana Suri ***, Paul Rigge ***, Kangwook Lee ***, Kannan Ramchandran ***

v1.1 (2015 Fall) Kabir Chandrasekher **, Max Kanwal **, Kangwook Lee ***, Kannan Ramchandran ***

v1.2 (2016 Spring) Kabir Chandrasekher, Tony Duan, David Marn, Ashvin Nair, Kangwook Lee, Kannan Ramchandran

The problem of RNA sequencing is to figure out how much and what type of RNA is present in a genome at a given moment in time. It has many applications including genome annotation, comprehensive identification of fusions in cancer, discovery of novel isoforms of genes, and genome sequence assembly [1][2].

For our purposes, we'll formulate the problem as follows: given a set of short reads that are sampled from a set of larger genes, how can we find the relative abundance of each gene? That is, given just the short reads, how do we know how frequently each original gene occurs? This process is depicted in Figure 1. (Aside: in the actual paper, these "genes" are actually "transcripts," but that's not relevant for us.)

It turns out we can use some methods we learned in class (MLE & EM) to solve this problem -- let's try it out.

We assume that all genes are of the same length, $\ell_t$, and all reads are of the same length, $\ell_x$. Then, we assume the following Bayesian generative model:

- A read comes from a randomly chosen gene $t_i$
- A read's starting point is randomly chosen among all possible $\ell_t$ starting positions in that gene

Given a set of reads $X = \{X_1 \ldots X_n\}$, we want to figure out what distribution $\rho = \{ \rho_1 \ldots \rho_m \}$ over the genes was most likely to give us the reads $X$. To do this, we'll need to maximize the following likelihood function:

$$ L(\rho) = \prod_{i=1}^{N}{ P(X_i|\rho) } $$To make life easier, we're first going to assume no read is ambiguous. That is -- given a read, we can immediately tell which gene it came from. (In practice, this means we would have to know the chart mapping each color to a gene, as in Figure 1.)

Your solution should take both the gene lengths and the read lengths into consideration. Denote the probability of seeing a specific gene as $\rho_{t_i}$.

Hint: how many possible starting positions exist for $X_i$?

*Your answer here*

*Your answer here*

Hint: you might just be able to make an "educated guess" from your answer above.

*Your answer here.*

Life gets harder when you allow for ambiguous reads. Going back to the figure above, we can imagine a case where we don't know which color a read belongs to, but we have a rough idea of the possible colors it could have been. See the modified figure below.

In this portion, we'll consider a general problem where a read is aligned with possibly more than one gene.

First, we define a compatibility matrix $A \in \{0,1\}^{n \times m} = \{a_{i,j}\}$, where $a_{i,j}$ is $1$ if read $i$ is aligned with gene $j$, $0$ otherwise.

*Your answer here.*

You'll have to tweak your solution to the last portion by carefully considering how you can represent $P(X_i|\rho)$ given the compatibility matrix $A$.

*Your answer here.*

In general, it is not easy to find the exact maximum likelihood estimator of $\rho$ if ambiguous reads exist. Instead, we can rely on iterative methods that we hope will converge to the true value. One way to go about this is via expectation maximization (EM), as you have seen in class.

To recap, here is a short tutorial that does a wonderful job of explaining all that you need to know about EM [3].

For our purposes, here is an EM algorithm that can be used. You'll be implementing it shortly, so read it carefully and understand why it works.

- Initialize $\rho = (\frac{1}{M}, \frac{1}{M}, \ldots, \frac{1}{M})$, all genes are equally probable.
- Find the compatibility matrix $A$.
- Repeat the following until $\rho$ converges:

a. Each reading $i$ corresponds to the $i$th row of $A$. Call this $\mathcal{I}_i$.

b. Put the values of $\rho$ in $\mathcal{I}_i$ where $\mathcal{I}_i$ is not zero. Then normalize the vector and replace the $i$th row of $A$ with the normalized vector.

c. Replace $\rho$ such that for each gene $j$, $\rho_j = \frac{1}{N}\sum_{i=1}^{N}{A_{i,j}}$

The following figure is a visual representation of the algorithm (Ref: p17 Pachter 2011) [1].

This is a lot to digest at once, so let's only look at the first step to begin with.

There are three possible genes to analyze (red, green, blue). There are five reads (a,b,c,d,e) that align with different genes. Initially, you assume a uniform prior.

- E step. Reads are proportionately assigned to each of the genes they could have come from, according to the current distribution $\rho$.
- M step. The distribution $\rho$ is recalculated proportionally from the assigned read counts from the E step.

For example, the abundance of red after the first M step is estimated by

$$\rho_1 = 0.47 = \frac{0.33 + 0.5 + 1 + 0.5}{5}$$*Your answer here.*

Let's begin by with a toy example. We'll start by making the relevant imports...

In [ ]:

```
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

Now we'll define our genes.

In [ ]:

```
transcripts = ['ATCTCGACGCACTGC', 'GAGTTCGAACTCTTC', 'AGAGTTCCAGTGTCA', 'AAAGCTCACTGCGGA', 'AGCGATATCAGAGTD']
M = len(transcripts)
```

We'll randomly generate the true distribution $\rho$.

In [ ]:

```
rho = np.random.rand(M)
rho /= sum(rho)
plt.bar(np.arange(M) + 0.6, rho, width=0.8)
```

We'll then pick 1000 random reads of length 5 each.

In [ ]:

```
def random_read( s, rho, L ):
chosen_seq = np.random.choice(s, p=rho)
start_idx = np.random.randint(0, len(chosen_seq) - L)
end_idx = start_idx + L
return chosen_seq[ start_idx:end_idx ]
N = 1000 # Number of reads
L = 5
reads = []
for i in range(N):
reads.append( random_read(transcripts, rho, L) )
print('First 10 reads...', reads[0:10])
```

In the code below, run the EM algorithm for 100 iterations and report your estimated distribution of $\rho$.

In [ ]:

```
N_iter = 100 #number of E/M iterations
def find_all_alignments(s, read):
tmp = []
for j in range(len(s)):
if read in s[j]:
tmp.append(j)
return tmp
# A: Compatibility Matrix.
# Note: You can choose to represent this matrix in whatever form is easiest for your calculations
# The form we are pushing you towards here is just how we chose to implement the algorithm
A = []
for each_read in reads:
A.append( find_all_alignments(transcripts, each_read) )
hidden_state_prior = np.zeros([N, M])
rho_est = np.array([1/5, 1/5, 1/5, 1/5, 1/5])
```

In [ ]:

```
for i in range(N_iter):
# Your code here
# 1. E step
# 2. M step
pass
```

Your estimated distribution should look similar to the real distribution.

In [ ]:

```
print('(real) rho', rho)
print('(est.) rho', rho_est)
```

In [ ]:

```
plt.bar(np.arange(M) + 0.6, rho, color='blue', width=0.4, label='Real')
plt.bar(np.arange(M) + 1, rho_est, color='green', width=0.4, label='Estimated')
plt.legend()
plt.show()
```

Let's try running the same algorithm on much more data. Here we have a dataset of 150 transcripts of length varying from 4 to 19. We've gathered 10000 reads from exactly 5 of them -- can you recover which 5 transcripts we gathered the reads from?

In [ ]:

```
M, N = 150, 10000
transcripts = np.load('transcripts.npy')
reads = np.load('reads.npy')
```

In [ ]:

```
# Your answer here.
```

Congratulations -- you've just finished your last lab for EE 126!

Take a minute to appreciate what you've accomplished throughout the semester. This class wasn't easy, but hopefully you've learned a lot! Come see us if you'd like some help figuring out where to go from here.

[1] L. Pachter. Models for transcript quantification from RNA-Seq. available at arXiv:1104.3889 [q-bio.GN], 2011.

[2] A. Roberts and L. Pachter, RNA-Seq and find: entering the RNA deep field, Genome Medicine, 3 (2011), 74.

[3] C. Do and S. Batzoglou, What is the expectation maximization algorithm? Nature Biotechnology, 26 (2008), 8.