Mickey Atwal, CSHL

$\mathbf{1}$. Estimate how many mutations in a 5ml culture of Escherichia coli that originally grew from a single bacterium.

$\mathbf{2}$. High-throughput screening assays, in the field of drug discovery, can typically test a library of millions of compounds to identify a few that are active. The challenge is to figure out how many assays do we need to perform before we can reliably identify a successful compound. Let’s assume that the success rate in these screens is one in ten thousand, $10^{−4}$ .

- (a) What is the probability of observing at least one active compound out of two assays?

- (b) What is the probability of observing at least one active compound out of N assays?

- (c) How large a library do we need to be 99% sure that we will find at least one active molecule?

- (d) Can you see a connection to your answer in part (b) to the statistical significance problem during multiple hypothesis testing?

$\mathbf{3}$ . A neuron generates spikes at an average rate of $r$ spikes per second (Hertz). We can assume a homogeneous Poisson process to model the firing of spikes.

- (a) What is the average time between spikes?

- (b) What is the probability distribution for the time, $T$ , between spikes?

- (c) The clock strikes midnight between two spikes. What is the mean time from the clock striking to the next spike?

- (d) How do you reconcile the results of (a) and (c) ?

$\mathbf{4}$ . Let’s simulate a Poisson process with a constant rate $m$ in Python.

- (a) Consider a window of time $T$, which you split into very small bins of duration $dt$. In each bin use np.random.rand to generate a random number that you can compare to some threshold $k$; if the random number is above, put $1$ in the time bin, else put a $0$. How is the threshold $k$ related to the rate of events? Use $T = 10^3$s and rate $m = 10s^{-1}$. Use a small enough time window $dt$ that the probability of having $2$ events per bin is negligible.

- (b) Check that the generated process obeys Poisson statistics. Take successive windows of duration $τ$ from your simulated process (of total duration $T$) and count the number of events, $n$, in each window. What is the average number, $\langle n \rangle$ , and the variance, $\sigma_n^2$? What do you expect and do the expectations match the data? Plot a distribution of $P(n)$, obtained from your simulation, and compare it to the Poisson distribution that you expect, on the same plot. If you make $T$ very long and $dt$ small enough, the agreement should be almost perfect.

- (c) Measure the inter-event interval distribution: In your simulated data, compare the distances between events and plot them as a normalized probability distribution. Compare to the theoretical expectation on the same plot. Make the plots also in the log-scale to see the behavior of distributions in the tail.

$\mathbf{5}$. Hemophilia is a disease associated with a recessive gene on the X chromosome. Since human males are XY, a male inheriting the mutant gene will always be affected. Human females, XX, with only one bad copy of the gene are simply carriers and are not affected, whereas females with two bad copies will be affected. Consider a woman with an affected brother. Her parents, her husband, and herself are all unaffected.

- (a) What is the probability that this woman is a carrier?

- (b) She later has two sons, neither of whom is affected. With this new information, what is the posterior probability that she is a carrier? (Assume no infidelity in the family and sons are not identical twins).

$\mathbf{6}$. A published study reported the micrarray expressions of a select number of genes in two kinds of tumors: those with BRCA1 mutations and those with BRCA2 mutations. The goal was to detect genes that showed differential expression across the two conditions. The data consists of the expression ratios of $3226$ genes on $n_1 = 7$ BRCA1 arrays and $n_2 = 8$ BRCA2 arrays.

- (a) Download the entire data from http://research.nhgri.nih.gov/microarray/NEJM_Supplement/Images/nejm_brca_release.txt

- (b) Convert the expression ratios for each gene i into $\log_2$ values. In this representation, going down by a factor of $1/2$ has the same magnitude as going up by a factor of $2$.

- (c) Calculate the mean $\langle x \rangle$ and sample variance $s^2$ for each gene in each tumor type.

- (d) The null hypothesis is that there is no differential expression and so we calculate the two-sample t-statistic. For example the t-statistic for gene i is $$ t_i = \displaystyle \frac{ \langle x_{i,{\rm BRCA1}} \rangle − \langle x_{i,{\rm BRCA2}} \rangle}{\sqrt{ \displaystyle \frac{ s_{i,{\rm BRCA1}}^2}{n_1} + \frac{ s_{i,{\rm BRCA2}}^2}{n_2}}} $$ Calculate this for each gene.

- (e) Normally, if we had a large number of samples or if the data looked Gaussian for each gene, we would employ a t-test and look up a table containing values of the so-called Student’s t-distribution to figure out the p-value for each gene. However, the sample sizes are way too small to justify using the Student’s t-distribution. Instead we will have to resort calculating the p-values using a Monte Carlo permutation procedure. For each gene calculate a randomized t-statistic 1000 times by randomly shuffling (permuting) the labels on the array, i.e. randomly assign the $n = 15$ arrays to $n_1 = 7$ BRCA1 arrays and $n_2 = 8$ BRCA2 arrays. The null hypothesis is that there is no differential expression and these 1000 randomized t-statistic values will form the null hypothesis distribution. Calculate the p-value for each gene by using the permuted distribution of t-statistics and comparing these values with your results from part (d).

- (f) Plot a histogram of all the p-values.

- (g) Estimate approximately how many genes are differentially expressed.