*Key ideas:* Gene expression data, multiple hypothesis testing, false discovery rate

This notebook demonstrates how to do a simple differential expression analysis on a gene expression dataset. We use false discovery rates to provide interpretable results when conducting an analysis that involves large scale multiple hypothesis testing.

The data set we analyze here contains measurements of the expression levels of 22,283 genes in peripheral blood mononuclear cells (PBMCs). Data for 127 subjects are included, 26 of whom have ulcerative colitis, and 59 of whom have Crohn's disease (the remaining subjects are healthy and will not be considered here). The goal is to identify genes that have different mean expression levels in the two disease groups.

The raw data are available here as accession number GDS1615 from the NCBI's GEO (Gene Expression Omnibus) site.

Here are the import statements for the modules that we will use.

In [1]:

```
import gzip
import numpy as np
import urllib2
import gzip
from StringIO import StringIO
from statsmodels.stats.multitest import multipletests
from scipy.stats.distributions import norm
import matplotlib.pyplot as plt
```

In [2]:

```
url = urllib2.urlopen("ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS1nnn/GDS1615/soft/GDS1615_full.soft.gz")
zdata = url.read()
sio = StringIO(zdata)
fid = gzip.GzipFile(fileobj=sio)
```

Alternatively, the data can be downloaded and read this way:

fid = gzip.open("GDS1615_full.soft.gz")

Next we read the SOFT format data file into the following data structures:

- GID : A list of gene identifiers of length d
- SID : A list of sample identifiers of length n
- STP : A list of sample desriptions of length d
- X : A dxn array of gene expression values

We begin by reading the header section of the data file, and placing the information we need into a dictionary:

In [3]:

```
SIF = {}
for line in fid:
if line.startswith("!dataset_table_begin"):
break
elif line.startswith("!subset_description"):
subset_description = line.split("=")[1].strip()
elif line.startswith("!subset_sample_id"):
subset_ids = line.split("=")[1].split(",")
subset_ids = [x.strip() for x in subset_ids]
for k in subset_ids:
SIF[k] = subset_description
```

Now we create arrays containing some information we will need while processing the data.

In [4]:

```
# Next line is the column headers (sample id's)
SID = fid.next().split("\t")
# The column indices that contain gene expression data
I = [i for i,x in enumerate(SID) if x.startswith("GSM")]
# Restrict the column headers to those that we keep
SID = [SID[i] for i in I]
# Get a list of sample labels
STP = [SIF[k] for k in SID]
```

Next we read the gene expression data row by row, and place the gene id's into a separate array.

In [5]:

```
# Read the gene expression data as a list of lists, also get the gene identifiers
GID,X = [],[]
for line in fid:
# This is what signals the end of the gene expression data
# section in the file
if line.startswith("!dataset_table_end"):
break
V = line.split("\t")
# Extract the values that correspond to gene expression measures
# and convert the strings to numbers
x = [float(V[i]) for i in I]
X.append(np.asarray(x))
GID.append(V[0] + ";" + V[1])
# Convert the Python list of lists to a Numpy array.
X = np.asarray(X)
```

In [6]:

```
# The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
# The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
```

Gene expression data is usually skewed, so we analyze it on the log scale.

In [7]:

```
XL = np.log(X) / np.log(2)
```

In [8]:

```
MUC = XL[:,UC].mean(1) ## Mean of ulcerative colitis samples
MCD = XL[:,CD].mean(1) ## Mean of Crohn's disease samples
VUC = XL[:,UC].var(1) ## Variance of ulcerative colitis samples
VCD = XL[:,CD].var(1) ## Variance of Crohn's disease samples
nUC = len(UC) ## Number of ulcerative colitis samples
nCD = len(CD) ## Number of Crohn's disease samples
```

The Z-scores summarize the evidence for differential expression.

In [9]:

```
zscores = (MUC - MCD) / np.sqrt(VUC/nUC + VCD/nCD)
```

In [10]:

```
print zscores.mean()
print zscores.std()
```

Since the standard deviation is much greater than 1, there appear to be multiple genes for which the mean expression levels in the ulcerative colitis and Crohn's disease samples differ. Further, since the mean Z-score is positive, it appears that the dominant pattern is for genes to be expressed at a higher level in the ulcerative colitis compared to the Crohn's disease samples.

The conventional threshold for statistical significance is a p-value smaller than 0.05, which corresponds to the Z-score being greater than 2 in magnitude. Many genes meet this condition, however as we will see below this does not carry much evidence that these genes are differentially expressed.

In [11]:

```
print np.mean(np.abs(zscores) > 2)
print np.mean(zscores > 2)
print np.mean(zscores < -2)
```

To use the multiple testing methods we need to convert the Z-scores to p-values.

In [12]:

```
pvalues = 2*norm.cdf(-np.abs(zscores))
```

In [13]:

```
pm = multipletests(pvalues, method="fdr_bh")
apv = pm[1]
```

This plot shows how the Z-scores and false discovery rates are related.

In [14]:

```
ii = np.argsort(zscores)
plt.plot(zscores[ii], apv[ii], '-')
plt.xlabel("Z-score")
plt.ylabel("False Discovery Rate")
```

Out[14]:

This plot shows how the p-values and false discovery rates are related.

In [15]:

```
ii = np.argsort(pvalues)
plt.plot(pvalues[ii], apv[ii], '-')
plt.xlabel("p-value")
plt.ylabel("False Discovery Rate")
```

Out[15]: