*Key ideas:* Gene expression data, multiple hypothesis testing, false discovery rate

This notebook demonstrates how to do a simple differential expression analysis on a gene expression dataset. We use false discovery rates to provide interpretable results when conducting an analysis that involves large scale multiple hypothesis testing.

The data set we analyze here contains measurements of the expression levels of 22,283 genes in peripheral blood mononuclear cells (PBMCs). Data for 127 subjects are included, 26 of whom have ulcerative colitis, and 59 of whom have Crohn's disease (the remaining subjects are healthy and will not be considered here). The goal is to identify genes that have different mean expression levels in the two disease groups.

The raw data are available here as accession number GDS1615 from the NCBI's GEO (Gene Expression Omnibus) site.

Here are the import statements for the modules that we will use.

In [1]:

```
import gzip
import numpy as np
import urllib2
import gzip
from StringIO import StringIO
from statsmodels.stats.multitest import multipletests
from scipy.stats.distributions import norm
import matplotlib.pyplot as plt
```

We cannot decompress data on the fly while reading from a url, so instead we first download the compressed data into a string, wrap the string in a stringio object, and wrap that in a gzipfile object that can do the decompression.

In [2]:

```
url = urllib2.urlopen("ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS1nnn/GDS1615/soft/GDS1615_full.soft.gz")
zdata = url.read()
sio = StringIO(zdata)
fid = gzip.GzipFile(fileobj=sio)
```

Alternatively, the data can be downloaded and read this way:

fid = gzip.open("GDS1615_full.soft.gz")

Next we read the SOFT format data file into the following data structures:

- GID : A list of gene identifiers of length d
- SID : A list of sample identifiers of length n
- STP : A list of sample desriptions of length d
- X : A dxn array of gene expression values

We begin by reading the header section of the data file, and placing the information we need into a dictionary:

In [3]:

```
SIF = {}
for line in fid:
if line.startswith("!dataset_table_begin"):
break
elif line.startswith("!subset_description"):
subset_description = line.split("=")[1].strip()
elif line.startswith("!subset_sample_id"):
subset_ids = line.split("=")[1].split(",")
subset_ids = [x.strip() for x in subset_ids]
for k in subset_ids:
SIF[k] = subset_description
```

Now we create arrays containing some information we will need while processing the data.

In [4]:

```
# Next line is the column headers (sample id's)
SID = fid.next().split("\t")
# The column indices that contain gene expression data
I = [i for i,x in enumerate(SID) if x.startswith("GSM")]
# Restrict the column headers to those that we keep
SID = [SID[i] for i in I]
# Get a list of sample labels
STP = [SIF[k] for k in SID]
```

Next we read the gene expression data row by row, and place the gene id's into a separate array.

In [5]:

```
# Read the gene expression data as a list of lists, also get the gene identifiers
GID,X = [],[]
for line in fid:
# This is what signals the end of the gene expression data
# section in the file
if line.startswith("!dataset_table_end"):
break
V = line.split("\t")
# Extract the values that correspond to gene expression measures
# and convert the strings to numbers
x = [float(V[i]) for i in I]
X.append(np.asarray(x))
GID.append(V[0] + ";" + V[1])
# Convert the Python list of lists to a Numpy array.
X = np.asarray(X)
```

We are comparing two groups, and will ignore the remaining (healthy) samples. Here we get the column indices of the samples in each of the two groups being compared.

In [6]:

```
# The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
# The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
```

Gene expression data is usually skewed, so we analyze it on the log scale.

In [7]:

```
XL = np.log(X) / np.log(2)
```

Now that we have the data, we can calculate the mean and variance for each gene within each of the two groups.

In [8]:

```
MUC = XL[:,UC].mean(1) ## Mean of ulcerative colitis samples
MCD = XL[:,CD].mean(1) ## Mean of Crohn's disease samples
VUC = XL[:,UC].var(1) ## Variance of ulcerative colitis samples
VCD = XL[:,CD].var(1) ## Variance of Crohn's disease samples
nUC = len(UC) ## Number of ulcerative colitis samples
nCD = len(CD) ## Number of Crohn's disease samples
```

The Z-scores summarize the evidence for differential expression.

In [9]:

```
zscores = (MUC - MCD) / np.sqrt(VUC/nUC + VCD/nCD)
```

If a gene is not differentially expressed, it has the same expected value in the two groups of samples. In this case, the Z-score will be standardized (i.e. will have mean zero and unit standard deviation). This data set contains data for 22,283 genes. So if none of the genes are differentially expressed, our Z-scores should approximately have zero mean and unit variance. We can check this as follows.

In [10]:

```
print zscores.mean()
print zscores.std()
```

Since the standard deviation is much greater than 1, there appear to be multiple genes for which the mean expression levels in the ulcerative colitis and Crohn's disease samples differ. Further, since the mean Z-score is positive, it appears that the dominant pattern is for genes to be expressed at a higher level in the ulcerative colitis compared to the Crohn's disease samples.

The conventional threshold for statistical significance is a p-value smaller than 0.05, which corresponds to the Z-score being greater than 2 in magnitude. Many genes meet this condition, however as we will see below this does not carry much evidence that these genes are differentially expressed.

In [11]:

```
print np.mean(np.abs(zscores) > 2)
print np.mean(zscores > 2)
print np.mean(zscores < -2)
```

To use the multiple testing methods we need to convert the Z-scores to p-values.

In [12]:

```
pvalues = 2*norm.cdf(-np.abs(zscores))
```

Here we compute the Benjamini-Hochberg false discovery rates. The adjusted p-values are in the first position of the returned value.

In [13]:

```
pm = multipletests(pvalues, method="fdr_bh")
apv = pm[1]
```

This plot shows how the Z-scores and false discovery rates are related.

In [14]:

```
ii = np.argsort(zscores)
plt.plot(zscores[ii], apv[ii], '-')
plt.xlabel("Z-score")
plt.ylabel("False Discovery Rate")
```

Out[14]:

This plot shows how the p-values and false discovery rates are related.

In [15]:

```
ii = np.argsort(pvalues)
plt.plot(pvalues[ii], apv[ii], '-')
plt.xlabel("p-value")
plt.ylabel("False Discovery Rate")
```

Out[15]: