Tutorial: The UseR Conference 2016

 

Genome-Wide Association Analysis and Post-Analytic Interrogation with R

 

Part 1

 

Instructor: Professor Andrea S. Foulkes

Department of Mathematics and Statistics, Mount Holyoke College

Acknowledgements:

Special thanks to Sara Nunez , Reserach Scientist, Mount Holyoke College, for work on the development of code snippets.

Support for this tutorial was provided by:

NIH/NHLBI R01 107196 (PI: Foulkes).

Outline

 

Part 1: Core analytic components of a standard GWAS

  1. Data preprocessing
  2. Data generation
  3. Association analysis
  4. Post-analytic visualization

 

Part 2: Interrogation of regional and higher-order associations.

Part 1: Core analytic components of a standard GWAS.

I. Data preprocessing

 

Step 1: Creation of R objects

Download and unzip data needed for this tutorial:

In [ ]:
library(downloader)
download("https://www.mtholyoke.edu/courses/afoulkes/UseR2016_Tutorial_Files.zip",
             "UseR2016_Tutorial_Files.zip")
unzip("UseR2016_Tutorial_Files.zip", exdir = "/Users/afoulkes/UseR2016/")

Reading in SNP data:

  • Use the read.plink() function in the Bioconductor snpStats package to read the \textit{.fam}, \textit{.bim} and \textit{.bed} files into R.
In [1]:
library(snpStats)
setwd("/Users/afoulkes/UseR2016/")
geno <- read.plink("UseR2016_Tutorial_Files/GWAStutorial", 
                   na.strings = ("-9"))  
attributes(geno)
Loading required package: survival
Loading required package: Matrix
$names =
  1. 'genotypes'
  2. 'fam'
  3. 'map'

Manipulating data:

  • Obtain the genotype of the resulting list. This is a SnpMatrix object, a matrix of genotype data with a column for each SNP and a row for each study participant.
In [2]:
genotype <- geno$genotypes
print(genotype)         
genotypeNumeric <- as(genotype,"numeric")
genotypeNumeric[1:3,1:7]
A SnpMatrix with  1401 rows and  861473 columns
Row names:  10002 ... 11596 
Col names:  rs10458597 ... rs5970564 
rs10458597rs12565286rs12082473rs3094315rs2286139rs11240776rs2980319
10002 2 2 2 2 2NA 2
100042021222
100052222222
In [3]:
dim(genotype)
  1. 1401
  2. 861473

Manipulating data:

  • Obtain the SNP information table:
In [4]:
genoBim <- geno$map
colnames(genoBim) <- c("chr", "SNP", "gen.dist", "position", "A1", "A2")
print(head(genoBim))
           chr        SNP gen.dist position   A1 A2
rs10458597   1 rs10458597        0   564621 <NA>  C
rs12565286   1 rs12565286        0   721290    G  C
rs12082473   1 rs12082473        0   740857    T  C
rs3094315    1  rs3094315        0   752566    C  T
rs2286139    1  rs2286139        0   761732    C  T
rs11240776   1 rs11240776        0   765269    G  A

Clinical data components:

  • ID
  • coronary artery disease (CAD): 0 = control and 1 = case;
  • sex: 1 = male, 2 = female;
  • age (years);
  • triglyceride level (TG - mg/dL)
  • high-density lipoprotein cholesterol (HDL-C - mg/dL)
  • low-density lipoprotein cholesterol (LDL-C - mg/dL)

Reading in clinical data:

In [5]:
clinical <- read.csv("UseR2016_Tutorial_Files/GWAStutorial_clinical.csv",
                     colClasses=c("character", "factor", "factor", rep("numeric", 4)))
rownames(clinical) <- clinical$FamID
print(head(clinical))
      FamID CAD sex age  tg hdl ldl
10002 10002   1   1  60  NA  NA  NA
10004 10004   1   2  50  55  23  75
10005 10005   1   1  55 105  37  69
10007 10007   1   1  52 314  54 108
10008 10008   1   1  58 161  40  94
10009 10009   1   1  59 171  46  92

I. Data preprocessing

 

Step 1: Creation of R objects $\checkmark$

Step 2: SNP level filtering (part 1)

Step 2: SNP level filtering (part 1)

Once the data is loaded, we are ready to remove SNPs that fail to meet minimum criteria due to missing data, low variability or genotyping errors.

We address these by first filtering on call rate and minor allele frequency (MAF).

SNP level characteristics:

snpStats provides functions, col.summary() and row.summary(), that return statistics on SNPs and samples, respectively.

In [6]:
snpsum.col <- col.summary(genotype)
head(snpsum.col)
CallsCall.rateCertain.callsRAFMAFP.AAP.ABP.BBz.HWE
rs104585971398.0000000 0.9978587 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000 1.0000000 NA
rs125652861384.00000000 0.98786581 1.00000000 0.94833815 0.05166185 0.00433526 0.09465318 0.90101156 -1.26529432
rs120824731.369000e+039.771592e-011.000000e+009.985391e-011.460920e-030.000000e+002.921841e-039.970782e-015.413314e-02
rs30943151386.00000000 0.98929336 1.00000000 0.82178932 0.17821068 0.04761905 0.26118326 0.69119769 -4.03172248
rs22861391364.00000000 0.97359029 1.00000000 0.86217009 0.13782991 0.02199413 0.23167155 0.74633431 -0.93146122
rs112407761.269000e+039.057816e-011.000000e+009.988180e-011.182033e-030.000000e+002.364066e-039.976359e-014.215743e-02

Call rate and minor allele frequency (MAF) filtering:

Using these summary statistics, we keep the subset of SNPs that meet our criteria for minimum call rate and MAF.

In [7]:
# Setting thresholds
call <- 0.95
minor <- 0.01

# Filter on MAF and call rate
use <- with(snpsum.col, (!is.na(MAF) & MAF > minor) & Call.rate >= call)
use[is.na(use)] <- FALSE                # Remove NA's as well
 
# Subset genotype and SNP summary data for SNPs that pass call rate and MAF criteria
genotype <- genotype[,use]
snpsum.col <- snpsum.col[use,]

print(genotype) # started with 861,473 SNPs
A SnpMatrix with  1401 rows and  658186 columns
Row names:  10002 ... 11596 
Col names:  rs12565286 ... rs5970564 

I. Data preprocessing

 

Step 1: Creation of R objects $\checkmark$

Step 2: SNP level filtering (part 1) $\checkmark$

Step 3: Sample level filtering

Step 3: Sample level filtering

Next we remove individuals due to missing data, sample contamination, correlation (for population-based investigations), and racial/ethnic or gender ambiguity or discordance.

We address these issues by filtering on call rate, heterozygosity, cryptic relatedness and duplicates, and ancestry.

Calculate SNP level call rate and heterozygosity:

In [8]:
snpsum.row <- row.summary(genotype)
head(snpsum.row)
Call.rateCertain.callsHeterozygosity
100020.98265541.00000000.3289825
100040.98915811.00000000.3242931
100050.99184271.00000000.3231825
100070.98610271.00000000.3241469
100080.98233331.00000000.3228218
100090.99130341.00000000.3213658

Calculating the F statistic (inbreeding coefficient):

\begin{equation*} |F|=(1−O/E) \end{equation*}

where O and E are the observed and expected proportions of heterozygous genotypes for a given sample based on the MAFs.

In [9]:
MAF <- snpsum.col$MAF
callmatrix <- !is.na(genotype)
hetExp <- callmatrix %*% (2*MAF*(1-MAF))
hetObs <- with(snpsum.row, Heterozygosity*(ncol(genotype))*Call.rate)
snpsum.row$hetF <- 1-(hetObs/hetExp)

Call rate and heterozygosity filtering:

Using these summary statistics, we keep the subset of individuals (samples) that meet our criteria for call rate and heterozygosity.

In [10]:
# Setting thresholds
sampcall <- 0.95    # Sample call rate cut-off
hetcutoff <- 0.1    # Inbreeding coefficient cut-off

# Filter on MAF and heterozygosity
sampleuse <- with(snpsum.row, !is.na(Call.rate) & Call.rate > sampcall & abs(hetF) <= hetcutoff)
sampleuse[is.na(sampleuse)] <- FALSE    # remove NA's as well
cat(nrow(genotype)-sum(sampleuse), "subjects will be removed due to low sample call rate or inbreeding coefficient.\n") 

# Subset genotype and clinical data for subjects who pass call rate and heterozygosity crtieria
genotype <- genotype[sampleuse,]
clinical<- clinical[ rownames(genotype), ]
0 subjects will be removed due to low sample call rate or inbreeding coefficient.

IBD analysis for relatedness and duplicates:

  • We use the snpgdsIBDMoM(), snpgdsIBDSelection() and ibdcoeff() functions in the SNPRelate package to perform identity-by-descent (IBD) analysis. This package requires that the data be transformed into a GDS format file.
  • Methods of moments IBD analysis is performed on only a subset of SNPs that are in linkage equilibrium by iteratively removing adjacent SNPs that exceed an LD threshold in a sliding window using the snpgdsLDpruning() function also in the SNPRelate package.

IBD analysis for relatedness and duplicates:

  • More information on implementing this approach is provided on the stat-gen.org site:

    http://www.stat-gen.org/tut/tut_preproc.html

  • Relatedness filtering was performed previously and therefore none of the $1401$ individuals in the available data are expected to be related or duplicated.

PCA analysis for ancestry filtering:

  • Recall, population substructure refers to a population consisting of subgroups within which there is random mating but between which there is little mixing or gene flow (population stratification) or to a population in which mating occurs between subgroups with different allelic distributions (admixture).

  • We use the snpgdsPCA() function in the SNPRelate package to perform principal component analysis (PCA) and then plot the first two principal components to identify cluster and outliers.

PCA analysis for ancestry filtering:

  • More information on implementing this approach is provided on the stat-gen.org site:

    http://www.stat-gen.org/tut/tut_preproc.html

  • Our samples are reasonably homogeneous from European ancestry, and therefore none of the $1401$ individuals are filtered for ancestry.

I. Data preprocessing

 

Step 1: Creation of R objects $\checkmark$

Step 2: SNP level filtering (part 1) $\checkmark$

Step 3: Sample level filtering $\checkmark$

Step 4: SNP level filtering (part 2)

Step 4: SNP level filtering (part 2)

Final SNP level filtering is based on Hardy-Weinberg equilibrium

Hardy-Weinberg equilibrium (HWE):

  • HWE denotes independence of alleles at a single site between two homologous chromosomes.
  • Recall Linkage Disequilibrium (LD) is allelic association across sites on a single homolog.

  • Example:

    Consider a bi-allelic SNP with genotypes AA, Aa and aa:

    HWE implies $p_{AA}=p_A^2$, $p_{aa}=p_a^2$ and $p_{Aa}=p_Ap_a$.

Violations of Hardy-Weinberg equilibrium (HWE)

Can indicate:

  • Presence of population substructure (non-random mating)
  • Genotyping errors

We test for HWE at the SNP level within CAD controls, and remove SNPs with corresponding $p<1 \times 10^{−6}$.

In [11]:
# Setting threshold
hardy <- 10^-6  

# Filter on HWE
CADcontrols <- clinical[ clinical$CAD==0, 'FamID' ]
snpsum.colCont <- col.summary( genotype[CADcontrols,] )
HWEuse <- with(snpsum.colCont, !is.na(z.HWE) & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) )
rm(snpsum.colCont)

HWEuse[is.na(HWEuse)] <- FALSE          # Remove NA's as well
cat(ncol(genotype)-sum(HWEuse),"SNPs will be removed due to high HWE.\n") 

# Subset genotype and SNP summary data for SNPs that pass HWE criteria
genotype <- genotype[,HWEuse]
print(genotype)
1296 SNPs will be removed due to high HWE.
A SnpMatrix with  1401 rows and  656890 columns
Row names:  10002 ... 11596 
Col names:  rs12565286 ... rs28729663 

I. Data preprocessing

 

Step 1: Creation of R objects $\checkmark$

Step 2: SNP level filtering (part 1) $\checkmark$

Step 3: Sample level filtering $\checkmark$

Step 4: SNP level filtering (part 2) $\checkmark$

II. Data generation

 

Step 5: Principal component analysis

Principal component analysis:

Principal components are re-calculted to be included as covariates in the GWA models.

These serve to adjust for any remaining substructure that may act as confounders in the SNP-trait association analysis.

As with Ancestry filtering we calculate PCs using the snpgdsPCA() function in the SNPRelate package, after performing LD pruning once again on the filtered genotype data set. Further details are found on the stat-gen.org site:

http://www.stat-gen.org/tut/tut_generation.html

The first 10 principal components are retained for the GWA models.

Principal components read in from data file:

In [12]:
pca10 <- read.csv("UseR2016_Tutorial_Files/PCA_out.csv")
head(pca10)
FamIDpc1pc2pc3pc4pc5pc6pc7pc8pc9pc10
1 1.000200e+04 7.764870e-03 1.448038e-02-6.315881e-04 2.866464e-03-1.883914e-02 9.680646e-03 2.764681e-02-6.645818e-03-2.342975e-02 1.049231e-02
2 1.000400e+04-1.204511e-02-7.231015e-03-3.001290e-03-1.079727e-02-7.770540e-03-4.645751e-03 1.806108e-03-3.087891e-03-1.833242e-03-4.538746e-03
3 1.000500e+04-1.670293e-02-5.347697e-03 1.444984e-02-6.151058e-04 3.451702e-02 3.870855e-02 2.057908e-02-1.226551e-02 3.592690e-03-2.287043e-03
4 1.000700e+04-9.537235e-03 4.556977e-03 2.683566e-03 1.662557e-02-2.363142e-04 5.514627e-03 1.595889e-02 2.797546e-02 2.977718e-02-7.461255e-03
5 1.000800e+04-1.539211e-02-2.446933e-03 2.050879e-02-5.724177e-03-3.969623e-03 5.354244e-03-7.269312e-04 2.701471e-02 1.067216e-02-3.352997e-03
6 1.000900e+04-1.512386e-02-2.353917e-03 2.136045e-02 6.915653e-03 4.006776e-02 2.322248e-02 1.524852e-02 1.329685e-02 2.274635e-02 1.314389e-02

II. Data generation

 

Step 5: Principal component analysis $\checkmark$

Step 6: SNP Imputation

Step 6: SNP Imputation

SNP imputation is used to extend analysis to non-typed SNPs, rare variants and SNPs that were removed during filtering.

Genotype imputation requires reference data from homogeneous sample. Sources for these data include HapMap and 1000 Genomes.

Imputation of SNPs:

The snp.imputation() function in the snpStats package is used to derive 'rules' representing a predictive model for genotypes of untyped SNPs that is based on near-by typed SNPs. The impute.snps() function, also in the snpStats package, is then applied to estimate expected posterior values of the non-typed SNPs.

An implementation example is provided on the stat-gen.org site:

http://www.stat-gen.org/tut/tut_generation.html

II. Data generation

 

Step 5: Principal component analysis $\checkmark$

Step 6: SNP imputation $\checkmark$

III. GWAS analysis

 

Step 7: Association analysis for typed SNPs

Step 7: Association analysis for typed SNPs

The Genome-Wide Association Analysis (GWAA) function in R:

  • Parallel implementation of generalized linear model fitting on each SNP using the doParallel package.
  • Returns "SNP", "Estimate", "Std.Error", "t-value" and "p-value".
  • An additive model of association (trend test) is evaluated; data are manipulated to always count the number of minor alleles.
  • Gaussian (linear) or binomial (logistic) families (among others) can be selected.

The GWAA() function is available in the downloaded tutorial folder and can be read into R as follows:

In [13]:
source("UseR2016_Tutorial_Files/GWAA.R")

Additional details on specifying the parameters of the GWAA() function are provided at: http://www.stat-gen.org/tut/tut_analysis.html

The GWAA() function writes output to a .txt file. An example of this output is in the GWAStutorialout.txt file:

In [14]:
GWASout <- read.table("UseR2016_Tutorial_Files/GWAStutorialout.txt", 
                      header=TRUE, colClasses=c("character", rep("numeric",4)))
head(GWASout)
SNPEstimateStd.Errort.valuep.value
1rs12565286 -0.11006048440612 0.0811307149802178-1.35658220728063 0.1751533944922
2rs3094315 -0.08774214467582560.0458130124264677 -1.91522320905346 0.0556863412136138
3rs2286139 -0.1253968023206270.0533236836341205-2.35161552568339 0.0188456227223414
4rs2980319 -0.0973866846369160.0531366340089406-1.83275976081831 0.0670677893265018
5rs2980300 -0.08068613306415050.0521019732939761 -1.54861952365783 0.121720599080485
6rs11240777 0.003296683641471230.0438024109222104 0.0752626070589099 0.940017405693539

III. GWAS analysis

 

Step 7: Association analysis for typed SNPs $\checkmark$

IV. Post-analytic interrogation

 

Step 10: Visualization

The results of a GWA analysis are typically visualized using a Manhattan plot:

In [15]:
# Find the -log_10 of the p-values
GWASout$Neg_logP <- -log10(GWASout$p.value)

# Merge output with genoBim by SNP name to add position and chromosome number
GWASout <- merge(GWASout, genoBim[,c("SNP", "chr", "position")])
head(GWASout)
SNPEstimateStd.Errort.valuep.valueNeg_logPchrposition
1rs10000012 -0.02085290094662520.0544108133922203 -0.383249204460648 0.701597949161597 0.153911689124775 4 1357325
2rs1000002 0.03665472523381010.03523749467771281.04021939042658 0.298433967867149 0.52515174682373 3 183635768
3rs1000003 -0.02929796356718280.0509472366476217 -0.57506482186312 0.565347973331718 0.247684160357594 3 98342907
4rs10000033 0.02337480415042060.03552719384432460.657941188736883 0.510693039327769 0.291840061475068 4 139599898
5rs10000036 0.02126911707137570.03757186210596470.56609164090377 0.571434580103119 0.243033482025455 4 139219262
6rs10000037 -0.02557369319108390.0442322654635925 -0.578168287856148 0.563251030147267 0.249298005308766 4 38924330
In [ ]:
library(Cairo)
GWASout$type <- "typed"
source("UseR2016_Tutorial_Files/GWAS_Manhattan.R")
GWAS_Manhattan(GWASout)