Tutorial: The UseR Conference 2016

 

Genome-Wide Association Analysis and Post-Analytic Interrogation with R

 

Part 2

 

Instructor: Professor Andrea S. Foulkes

Department of Mathematics and Statistics, Mount Holyoke College

Acknowledgements:

Special thanks to Sara Nunez , Reserach Scientist, Mount Holyoke College, for work on the development of code snippets.

Support for this tutorial was provided by:

NIH/NHLBI R01 107196 (PI: Foulkes).

Outline

 

Part 1: Core analytic components of a standard GWAS

 

Part 2: Interrogation of regional and higher-order associations.

  1. Motivation
  2. Sequence Kernel Association Test (SKAT)
  3. Class level analytic strategies - GenCAT, VEGAS and QT

Download and unzip data needed for this tutorial:

In [ ]:
library(downloader)
download("https://www.mtholyoke.edu/courses/afoulkes/UseR2016_Tutorial_Files.zip",
             "UseR2016_Tutorial_Files.zip")
unzip("UseR2016_Tutorial_Files.zip", exdir = "/Users/afoulkes/UseR2016/")

Part 2: Interrogation of regional and higher-order associations.

  1. Motivation

1. Motivation

  • Large-scale genome-wide association (GWA) meta-analyses have become routine practice for discovery of the genetic underpinnings of complex traits, such as cardiometabolic disease (CMD).
  • Several resulting meta-analysis resources, including summary level information on association between each of several million typed and imputed SNPs and a well-defined trait, are now publicly available.

1. Motivation continued

  • At the same time, we see a growing number of classifications or taxonomies of the genome—for example, protein coding genes, epigenetic marks, enhancer elements and non-coding RNAs—herein referred to as classes.
  • Additionally, we have refined knowledge regarding the linkage-disequilibrium (LD) structure across the genome via the 1000 genomes project.

1. Motivation continued

  • In turn, this presents new opportunity to further potentiate the extensive, existing data resources through application of theoretically sound methodological advancements that integrate these multiple knowledge components.

In this tutorial, we leverage these multiple existing knowledge resources to demonstrate the added value of applying a class-level testing strategy to complement more routine analysis practice.

1. Motivation continued

  • The minP approach to evaluating association between protein coding genes and a trait is to ascribe SNPs with p-values less than a Bonferroni corrected threshold ($p \le 5 \times 10^{−8}$) to genes and to declare these genes statistically meaningful.
  • While the minP approach has led to a large number of novel gene discoveries, it is limited in that a gene must contain at least one SNP with a very large signal to be identified.

1. Motivation continued

  • MinP may not capture genes with multiple SNPs that have moderate signals, which in combination are genetically, biologically, clinically, and statistically meaningful.

1. Motivation continued

Two representative and established analytic frameworks considered:

  • SKAT
  • GenCAT (strong theoretical relationships to VEGAS and QT)

with relative advantages for alternative settings.

Part 2: Interrogation of regional and higher-order associations.

  1. Motivation $\checkmark$
  2. Sequence Kernel Association Test (SKAT)

2. Sequence Kernel Association Test (SKAT)$^*$

Framework for testing for regional association of (rare or common) genetic variants with a quantitative or binary trait.

$^*$Wu et al. (2011) The American Journal of Human Genetics 89, 82–93.

2. SKAT method

Linear model (for a quantitative trait):

\begin{eqnarray*} y_i = \alpha_0 + \pmb{\alpha}^T{\bf X}_i +\pmb{\beta}^T{\bf G}_i + \epsilon_i \end{eqnarray*}

$\alpha_0$ is the intercept, $\pmb{\alpha}^T = (\alpha_1,...,\alpha_m)^T$ is the vector of regression coefficients for m covariates, $\pmb{\beta}^T = (\beta_1,...,\beta_p)^T$ is the vector of regression coefficients for p genetic variants, and $\epsilon_i \sim N(0,\sigma^2)$ is the error term.

2. SKAT method

Interest is in testing:

\begin{eqnarray*} H_0: \pmb{\beta}^T={\bf 0} \end{eqnarray*}

Additionally assume $\beta_j$ follows an arbitrary mean $0$ distribution with variance $w_j\tau$ for prespecified weights $w_j$. So the above null hypothesis can be rewritten:

\begin{eqnarray*} H_0: \tau = 0 \end{eqnarray*}

$\rightarrow$ Variance component score test for correspondig mixed model

2. SKAT data prep

In order to implement SKAT, we first call the necessary R packages and set our working directory:

In [1]:
library(snpStats)
library(SKAT)
library(GenCAT)
setwd("/Users/afoulkes/UseR2016/")
Loading required package: survival
Loading required package: Matrix
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: doParallel
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: ggplot2

2. SKAT data prep

  • The GWAStutorial_clinical.csv file contains clinical data.
  • We maintain observations with complete data on CAD, sex, age, TG, HDL-C and LDL-C:
In [2]:
clinical <- read.csv("UseR2016_Tutorial_Files/GWAStutorial_clinical.csv",
                     stringsAsFactors = F)
clinical.c <- clinical[complete.cases(clinical),]
head(clinical.c)
FamIDCADsexagetghdlldl
210004 1 2 50 55 23 75
310005 1 1 55 105 37 69
410007 1 1 52 314 54 108
510008 1 1 58 161 40 94
610009 1 1 59 171 46 92
710010 1 1 54 147 69 109

2. SKAT data prep

The .bim, .bed and .fam files contain the genotype information and are read into R using the read.plink() function:

In [3]:
geno <- read.plink("UseR2016_Tutorial_Files/GWAStutorial",
                  na.strings=c("-9"))
genotype <- geno$genotypes
print(genotype)
genotypeNumeric <- as(genotype,"numeric")
genotypeNumeric[1:3,1:7]
A SnpMatrix with  1401 rows and  861473 columns
Row names:  10002 ... 11596 
Col names:  rs10458597 ... rs5970564 
rs10458597rs12565286rs12082473rs3094315rs2286139rs11240776rs2980319
10002 2 2 2 2 2NA 2
100042021222
100052222222

2. SKAT data prep

  • The geno object also contains information on each SNP:
In [4]:
snps <- geno$map
head(snps)
chromosomesnp.namecMpositionallele.1allele.2
rs104585971 rs104585970 564621 NA C
rs125652861 rs125652860 721290 G C
rs120824731 rs120824730 740857 T C
rs30943151 rs30943150 752566 C T
rs22861391 rs22861390 761732 C T
rs112407761 rs112407760 765269 G A

2. SKAT data prep

For illustration, we consider 5 known CAD associated loci.

  • First we read in a .csv file containing corresponding coordinates:
In [5]:
coords <- read.csv("UseR2016_Tutorial_Files/CAD_known_loci_coords.csv",
                   stringsAsFactors = F)
colnames(coords)[1] <- "class"
coords
classchrstartstopsnpsposition
1LPL 8 1931318020313180rs264 19813180
2PLG 6 160643608161643608rs4252120161143608
3ABO 9 135654168136654168rs579459 136154168
4SORT1 1 109321511110321511rs602633 109821511
5FLT1 13 28473621 29473621 rs931942828973621

2. SKAT data prep

  • Using the map2class() function within the GenCAT package, we assign SNPs to each locus:
In [6]:
colnames(snps) <- c("chr","SNP","cM","position","A1","A2")
mapped <- map2class(coords,snps)
head(mapped)
[1] "Mapping SNPs to classes on chromosome 8."
[1] "Mapping SNPs to classes on chromosome 6."
[1] "Mapping SNPs to classes on chromosome 9."
[1] "Mapping SNPs to classes on chromosome 1."
[1] "Mapping SNPs to classes on chromosome 13."
chrSNPcMpositionA1A2class
18 rs132618240 19313912 C G LPL
28 rs44271900 19321385 T C LPL
38 rs40721310 19321634 C T LPL
48 rs49220300 19322771 C T LPL
58 rs49220310 19323062 T A LPL
68 rs69831750 19323403 C T LPL

2. SKAT data prep

Next we:

  • subset columns of the genotype object for SNPs in the mapped object and subjects with complete clinical data; and
  • filter the genotype data on call rate ($=1$) and minor allele frequency (MAF >0.01)
In [7]:
geno2 <- genotype[,unique(mapped$SNP)]
geno3 <- geno2[as.character(clinical.c$FamID),]
summary <- col.summary(geno3)
keep <- row.names(summary[summary$Call.rate == 1 & summary$MAF > 0.01,])
geno4 <- geno3[,keep]
print(geno4)
A SnpMatrix with  1281 rows and  267 columns
Row names:  10004 ... 11596 
Col names:  rs4072131 ... rs1161456 

2. SKAT in R

  • Using LDL as the outcome, and sex and age as covariates, we first fit the null model using the SKAT_Null_Model() function to obtain parameter estimates needed to run SKAT:
In [8]:
out <- SKAT_Null_Model(clinical.c[,"ldl"] ~ 
                       as.matrix(clinical.c[,c("sex","age")]), 
                       out_type="C")
attributes(out)
$names
  1. 'res'
  2. 'X1'
  3. 'res.out'
  4. 'out_type'
  5. 'n.Resampling'
  6. 'type.Resampling'
  7. 'id_include'
  8. 's2'
  9. 'n.all'
$class
'SKAT_NULL_Model'

2. SKAT in R

  • Prior to applying SKAT, we reformort the snpMatrix object by using the flip.matrix() function to ensure that the allele counts refer to the number of minor alleles present:
In [9]:
genoNum <- as(geno4,"numeric")

flip.matrix<-function(x) {
  zero2 <- which(x==0)
  two0 <- which(x==2)
  x[zero2] <- 2
  x[two0] <- 0
  return(x)
}
In [10]:
Z <- flip.matrix(genoNum)
head(Z)
dim(Z)
rs4072131rs6998060rs7002507rs4413788rs11204047rs4483172rs4921654rs7017776rs9644631rs17480050rs1617234rs9578057rs7339280rs4272874rs952648rs9314921rs17561728rs1161473rs1161468rs1161456
1000401100000001111220000
1000501000010110002110001
1000710000000000001011111
1000800000000012220111000
1000910011111111110010111
1001000000000111112010111
  1. 1281
  2. 267

2. SKAT in R

  • Finally we subset the SNPS within a given locus and apply the SKAT() function:
In [11]:
gene_snps <- unique(mapped$SNP[mapped$class == "SORT1"])
gene_snps <- gene_snps[gene_snps%in%colnames(geno4)]
Z_sub <- as.matrix(Z[,gene_snps])
attributes(SKAT(Z_sub,out))
SKAT(Z_sub,out)$p.value
$names
  1. 'p.value'
  2. 'p.value.resampling'
  3. 'Test.Type'
  4. 'Q'
  5. 'param'
  6. 'pval.zero.msg'
$class
'SKAT_OUT'
0.0399313759930093

2. SKAT notable features

  • Computationally efficient - only requires fitting reduced model (under the null)
  • Accomodates both common and rare variants
  • Uses raw data as input

Part 2: Interrogation of regional and higher-order associations.

  • Motivation $\checkmark$
  • Sequence Kernel Association Test (SKAT) $\checkmark$
  • Genetic Class Associatin Test (GenCAT)

3. Genetic Class Association Test (GenCAT)$^*$

  • Efficient framework for testing for class-level (protein coding gene, regulatory element, etc) association.
  • Leverages summary level data from GWA meta-analysis resources.
  • Extensions of QT and VEGAS.

$^*$Qian et al. (2016) PLoS ONE DOI:10.1371/journal.pone.0148218.

3. GenCAT method

  • Let ${\bf z}=(z_1, z_2,..., z_n)^T$ be a vector of $n$ test statistics (z-scores) for association of each element in this class with the trait under study
  • ${\bf z} \sim N_n(0, \Sigma)$, under the null of no association between any of the $n$ elements in this class.
  • Because $\Sigma$ is square and positive definite, we can decompose $\Sigma$ as:
\begin{eqnarray*} \Sigma=Q\Lambda Q^{T} \end{eqnarray*}

where $Q$ is an orthogonal matrix (i.e., $Q^{T}Q=QQ^{T}=I$) whose columns are the eigenvectors (normalized to unit length) of $\Sigma$, and $\Lambda$ is a diagonal matrix with diagonal elements equal to the eigenvalues of $\Sigma$.

3. GenCAT method

The class-level test statistic is defined as: \begin{eqnarray} \mathcal{C}_{s} = U^{T}U \end{eqnarray}

where $U=\Lambda^{-1/2}Q^{T}{\bf z}$.

Note that $\mathcal{C}_s$ can be expressed as $\mathcal{C}_{s} = {\bf z}^{T}(\Lambda^{-1/2}Q^{T})^{T}(\Lambda^{-1/2}Q^{T}){\bf z} = {\bf z}^{T}\Sigma^{-1}{\bf z}$ and:

\begin{eqnarray*} {\bf z}^{T}\Sigma^{-1}{\bf z} \sim \chi^2(n) \end{eqnarray*}

3. GenCAT method

  • It can be shown that $\Sigma$ can be estimated by ${\bf P}$, Pearson's pairwise correlations between SNPs.
  • Further projecting the test statistics onto the minimum set of eigenvectors that caputure $(1-\phi)\%$ of the variability in $\Sigma$ offers stability in high correlation setting.

3. GenCAT example

3. GenCAT example

First we load the GLGC TG summary statistics:

In [12]:
glgc <- read.delim("UseR2016_Tutorial_Files/GLGC_TG.tbl",
                               stringsAsFactors = F,header=T,sep="")
head(glgc)
dim(glgc)
MarkerNameAllele1Allele2WeightGC.ZscoreGC.PvalueOverallDirection
1rs964184 c g 96576 -33.075 6.71e-240 - -------------------------
2rs1260326 t c 96590 24.539 5.68e-133 + +++++++++++++++++++++++++
3rs780094 t c 96546 23.768 7.08e-125 + +++++++++++++++++++++++++
4rs2160669 t c 96598 -23.735 1.59e-124 - ---------------------+---
5rs9326246 c g 96598 23.688 4.79e-124 + +++++++++++++++++++++-+++
6rs780093 t c 96590 23.681 5.68e-124 + +++++++++++++++++++++++++
  1. 2692560
  2. 8

3. GenCAT example

  • We then load the 1000 Genomes genotype data available in the GenCAT package.
  • These data include genotypes for SNPs on chromosomes 13, 14 and 15, for 99 individuals of northern and western european ancestry (CEU).
In [13]:
library(GenCAT)
data("geno")
geno_dat <- geno$genotypes
snps <- geno$map
print(geno_dat)
dim(geno_dat)
A SnpMatrix with  99 rows and  84195 columns
Row names:  CEU_1 ... CEU_99 
Col names:  rs624673 ... rs1399026 
  1. 99
  2. 84195

3. GenCAT example

  • Next we subset the GLGC data to include only those SNPs with 1000 Genomes data:
In [14]:
glgc2 <- glgc[,c("MarkerName","Allele1","Allele2","GC.Zscore")]
colnames(glgc2) <- c("SNP","A1","A2","testStat")
colnames(snps) <- c("chr","SNP","cM","position","A1","A2")
merg <- merge(snps,glgc2,by.x="SNP",by.y="SNP")
dim(merg)
head(merg)
  1. 84195
  2. 9
SNPchrcMpositionA1.xA2.xA1.yA2.ytestStat
1rs100002213 NA 100461219C T t c -0.221
2rs100016013 NA 100950452T C t c -0.885
3rs100028115 NA 72553422 C T t c -4.091
4rs100029015 NA 100949609T C t c 0.079
5rs100052114 NA 70522484 A G a g -1.059
6rs100058715 NA 89050628 A G a g 3.007

3. GenCAT example

  • Some additional housekeeping is required to make sure the allele assignments (and in turn the direction of the correlation) is the same:
In [15]:
# make A1.x and A1.y upper case
merg$A1.y <- toupper(merg$A1.y)
merg$A2.y <- toupper(merg$A2.y)
# check 'A1.x' matches one 'y' allele
keep <- merg[merg$A1.x == merg$A1.y | merg$A1.x == merg$A2.y,] 
# check 'A2.x' matches one 'y' allele
keep2 <- keep[keep$A2.x == keep$A1.y | keep$A2.x == keep$A2.y,]
In [16]:
# change test stat sign if A1.x != A1.y
keep2$testStat[keep2$A1.x != keep2$A1.y] <- keep2$testStat[keep2$A1.x != keep2$A1.y]*(-1)  
dim(keep2)
head(keep2)
  1. 84146
  2. 9
SNPchrcMpositionA1.xA2.xA1.yA2.ytestStat
1rs100002213 NA 100461219C T T C 0.221
2rs100016013 NA 100950452T C T C -0.885
3rs100028115 NA 72553422 C T T C 4.091
4rs100029015 NA 100949609T C T C 0.079
5rs100052114 NA 70522484 A G A G -1.059
6rs100058715 NA 89050628 A G A G 3.007

3. GenCAT example

  • Next we read in a file that includes all PCG coordinates:
In [17]:
coords <- read.csv("useR2016_Tutorial_Files/ProCodgene_coords.csv",stringsAsFactors=F)
colnames(coords)[4] <- "class"
head(coords)
chrstartstopclass
11 6909070008OR4F5
21 367658 368597 OR4F29-1
31 621095 622034 OR4F29-2
41 861120879961SAMD11
51 879582894679NOC2L
61 895966901099KLHL17

3. GenCAT example

  • Each SNP is then mapped to a class using the map2class() function in the GenCAT package:
In [18]:
mapped <- map2class(coords,keep2)
colnames(mapped) <- c("SNP","chr","cM","position","effect_allele","other_allele",
                      "A1","A2","testStat","class")
length(unique(mapped$SNP))
[1] "Mapping SNPs to classes on chromosome 13."
[1] "Mapping SNPs to classes on chromosome 14."
[1] "Mapping SNPs to classes on chromosome 15."
77409
In [19]:
head(mapped)
SNPchrcMpositioneffect_alleleother_alleleA1A2testStatclass
1rs1707688613 NA 19755612 C T T C -0.507 TUBA3C
2rs1707689113 NA 19755726 G C C G -0.562 TUBA3C
3rs1719471913 NA 19752733 C T T C 0.908 TUBA3C
4rs1719472613 NA 19753231 A G A G 0.868 TUBA3C
5rs1719473313 NA 19753310 G A A G 0.866 TUBA3C
6rs277449413 NA 19748709 T G T G 1.505 TUBA3C

3. GenCAT example

  • We are now ready to apply GenCAT using the GenCAT() function:
In [20]:
geno_sub <- geno_dat[,unique(mapped$SNP)]
out <- GenCAT(mapped, geno_sub, snps)
ntests <- dim(out$GenCAT)[1]
[1] "Running GenCAT on 298 classes on chromosome 13."
[1] "Running GenCAT on 529 classes on chromosome 14."
[1] "Running GenCAT on 500 classes on chromosome 15."
In [21]:
out$GenCAT[out$GenCAT$CsumP<=0.05/ntests,]
classchrn_SNPsn_ObsCsumTCsumP
774SLC25A29 14 5 3 25.68382 1.110751e-05
941TGM5 15 20 3 42.60935 2.97883e-09
949PPIP5K1-1 15 9 2 21.49987 2.144681e-05
950CKMT1B 15 1 1 21.11402 4.327493e-06
951STRC-1 15 2 1 23.38788 1.324103e-06
952CATSPER2 15 9 3 27.50306 4.617692e-06
953CKMT1A 15 2 2 21.27628 2.398359e-05
954PDIA3 15 12 2 22.83105 1.102304e-05
957SERINC4 15 3 2 20.80912 3.029403e-05
959WDR76 15 14 3 23.3486 3.415989e-05
960FRMD5 15 62 11 154.7388 1.600565e-27
1043LIPC 15 177 24 89.61096 1.671801e-09
1102TIPIN 15 4 3 23.8544 2.679076e-05
1138PKM 15 8 3 25.66676 1.119921e-05
1141HEXA 15 1 1 19.4481 1.033706e-05

3. GenCAT summary

  • Post-analytic approach to identify class-level association
  • VEGAS$^*$ uses numerical techniques to evaluate statistical signficance of same statistic
  • Extends the QT$^{**}$ by considering test statistics from generalized linear model and introducing a data reduction step for stability.

$^*$Liu et al. (Am J Hum Genet. 2010) PMID: 20598278

$^{**}$Luo L et al (Eur J Hum Genet. 2010) PMID: 20442747

Part 2: Interrogation of regional and higher-order associations.

  • Motivation $\checkmark$
  • Sequence Kernel Association Test (SKAT) $\checkmark$
  • Genetic Class Associatin Test (GenCAT) $\checkmark$

Additional References:

  • Li et al (Am J Hum Genet. 2011)
  • Wu et al. (Am J Hum Genet. 2010) PMID: 20560208
  • Huang et al. (PLoS Genet. 2011) PMID: 21829371
  • Li et al. (Am J Hum Genet. 2008) PMID: 18691683
  • Hu et al. (Am J Hum Genet. 2013)
  • Liu et al. (Nat Genet. 2014) PMID: 24336170
  • Mootha et al. (Nat Genet. 2003) PMID: 12808457
  • Subramanian et. al (Proc Natl Acad Sci USA. 2005) PMID: 16199517
  • Peng et al. (Eur J Hum Genet. 2010) PMID: 19584899
  • Segre et al. (PLoS Genet. 2010)
  • Weng et al. (BMC Bioinformatics. 2011) PMID: 21496265
  • Foulkes et al. (PLoS ONE. 2013) PMID: 23405096
  • Lee et al. (Am J Hum Genet. 2013) PMID: 23768515