### Part 1¶

#### Instructor: Professor Andrea S. Foulkes ¶

Department of Mathematics and Statistics, Mount Holyoke College

### Acknowledgements:¶

Special thanks to Sara Nunez , Reserach Scientist, Mount Holyoke College, for work on the development of code snippets.

Support for this tutorial was provided by:

NIH/NHLBI R01 107196 (PI: Foulkes).

### A Jupyter notebook¶

for this tutorial is available at:

http://nbviewer.jupyter.org/urls/www.mtholyoke.edu/courses/afoulkes/UseR2016/UseR2016-Part1.ipynb

## Outline

#### Part 1: Core analytic components of a standard GWAS

1. Data preprocessing
2. Data generation
3. Association analysis
4. Post-analytic visualization

### I. Data preprocessing

#### Data files:

In [ ]:
library(downloader)
"UseR2016_Tutorial_Files.zip")
unzip("UseR2016_Tutorial_Files.zip", exdir = "/Users/afoulkes/UseR2016/")


• Use the read.plink() function in the Bioconductor snpStats package to read the \textit{.fam}, \textit{.bim} and \textit{.bed} files into R.
In [1]:
library(snpStats)
setwd("/Users/afoulkes/UseR2016/")
na.strings = ("-9"))
attributes(geno)

Loading required package: survival

$names = 1. 'genotypes' 2. 'fam' 3. 'map' #### Manipulating data: • Obtain the genotype of the resulting list. This is a SnpMatrix object, a matrix of genotype data with a column for each SNP and a row for each study participant. In [2]: genotype <- geno$genotypes
print(genotype)
genotypeNumeric <- as(genotype,"numeric")
genotypeNumeric[1:3,1:7]

A SnpMatrix with  1401 rows and  861473 columns
Row names:  10002 ... 11596
Col names:  rs10458597 ... rs5970564

rs10458597rs12565286rs12082473rs3094315rs2286139rs11240776rs2980319
10002 2 2 2 2 2NA 2
100042021222
100052222222
In [3]:
dim(genotype)

1. 1401
2. 861473

#### Manipulating data:

• Obtain the SNP information table:
In [4]:
genoBim <- geno$map colnames(genoBim) <- c("chr", "SNP", "gen.dist", "position", "A1", "A2") print(head(genoBim))   chr SNP gen.dist position A1 A2 rs10458597 1 rs10458597 0 564621 <NA> C rs12565286 1 rs12565286 0 721290 G C rs12082473 1 rs12082473 0 740857 T C rs3094315 1 rs3094315 0 752566 C T rs2286139 1 rs2286139 0 761732 C T rs11240776 1 rs11240776 0 765269 G A  #### Clinical data components: • ID • coronary artery disease (CAD): 0 = control and 1 = case; • sex: 1 = male, 2 = female; • age (years); • triglyceride level (TG - mg/dL) • high-density lipoprotein cholesterol (HDL-C - mg/dL) • low-density lipoprotein cholesterol (LDL-C - mg/dL) #### Reading in clinical data: In [5]: clinical <- read.csv("UseR2016_Tutorial_Files/GWAStutorial_clinical.csv", colClasses=c("character", "factor", "factor", rep("numeric", 4))) rownames(clinical) <- clinical$FamID

      FamID CAD sex age  tg hdl ldl
10002 10002   1   1  60  NA  NA  NA
10004 10004   1   2  50  55  23  75
10005 10005   1   1  55 105  37  69
10007 10007   1   1  52 314  54 108
10008 10008   1   1  58 161  40  94
10009 10009   1   1  59 171  46  92


### I. Data preprocessing

#### Step 2: SNP level filtering (part 1)

Once the data is loaded, we are ready to remove SNPs that fail to meet minimum criteria due to missing data, low variability or genotyping errors.

We address these by first filtering on call rate and minor allele frequency (MAF).

#### SNP level characteristics:

snpStats provides functions, col.summary() and row.summary(), that return statistics on SNPs and samples, respectively.

In [6]:
snpsum.col <- col.summary(genotype)

CallsCall.rateCertain.callsRAFMAFP.AAP.ABP.BBz.HWE
rs104585971398.0000000 0.9978587 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000 1.0000000 NA
rs125652861384.00000000 0.98786581 1.00000000 0.94833815 0.05166185 0.00433526 0.09465318 0.90101156 -1.26529432
rs120824731.369000e+039.771592e-011.000000e+009.985391e-011.460920e-030.000000e+002.921841e-039.970782e-015.413314e-02
rs30943151386.00000000 0.98929336 1.00000000 0.82178932 0.17821068 0.04761905 0.26118326 0.69119769 -4.03172248
rs22861391364.00000000 0.97359029 1.00000000 0.86217009 0.13782991 0.02199413 0.23167155 0.74633431 -0.93146122
rs112407761.269000e+039.057816e-011.000000e+009.988180e-011.182033e-030.000000e+002.364066e-039.976359e-014.215743e-02

#### Call rate and minor allele frequency (MAF) filtering:

Using these summary statistics, we keep the subset of SNPs that meet our criteria for minimum call rate and MAF.

In [7]:
# Setting thresholds
call <- 0.95
minor <- 0.01

# Filter on MAF and call rate
use <- with(snpsum.col, (!is.na(MAF) & MAF > minor) & Call.rate >= call)
use[is.na(use)] <- FALSE                # Remove NA's as well

# Subset genotype and SNP summary data for SNPs that pass call rate and MAF criteria
genotype <- genotype[,use]
snpsum.col <- snpsum.col[use,]

print(genotype) # started with 861,473 SNPs

A SnpMatrix with  1401 rows and  658186 columns
Row names:  10002 ... 11596
Col names:  rs12565286 ... rs5970564


### I. Data preprocessing

#### Step 3: Sample level filtering

Next we remove individuals due to missing data, sample contamination, correlation (for population-based investigations), and racial/ethnic or gender ambiguity or discordance.

We address these issues by filtering on call rate, heterozygosity, cryptic relatedness and duplicates, and ancestry.

#### Calculate SNP level call rate and heterozygosity:

In [8]:
snpsum.row <- row.summary(genotype)

Call.rateCertain.callsHeterozygosity
100020.98265541.00000000.3289825
100040.98915811.00000000.3242931
100050.99184271.00000000.3231825
100070.98610271.00000000.3241469
100080.98233331.00000000.3228218
100090.99130341.00000000.3213658

#### Calculating the F statistic (inbreeding coefficient):

\begin{equation*} |F|=(1−O/E) \end{equation*}

where O and E are the observed and expected proportions of heterozygous genotypes for a given sample based on the MAFs.

In [9]:
MAF <- snpsum.col$MAF callmatrix <- !is.na(genotype) hetExp <- callmatrix %*% (2*MAF*(1-MAF)) hetObs <- with(snpsum.row, Heterozygosity*(ncol(genotype))*Call.rate) snpsum.row$hetF <- 1-(hetObs/hetExp)


#### Call rate and heterozygosity filtering:

Using these summary statistics, we keep the subset of individuals (samples) that meet our criteria for call rate and heterozygosity.

In [10]:
# Setting thresholds
sampcall <- 0.95    # Sample call rate cut-off
hetcutoff <- 0.1    # Inbreeding coefficient cut-off

# Filter on MAF and heterozygosity
sampleuse <- with(snpsum.row, !is.na(Call.rate) & Call.rate > sampcall & abs(hetF) <= hetcutoff)
sampleuse[is.na(sampleuse)] <- FALSE    # remove NA's as well
cat(nrow(genotype)-sum(sampleuse), "subjects will be removed due to low sample call rate or inbreeding coefficient.\n")

# Subset genotype and clinical data for subjects who pass call rate and heterozygosity crtieria
genotype <- genotype[sampleuse,]
clinical<- clinical[ rownames(genotype), ]

0 subjects will be removed due to low sample call rate or inbreeding coefficient.


#### IBD analysis for relatedness and duplicates:

• We use the snpgdsIBDMoM(), snpgdsIBDSelection() and ibdcoeff() functions in the SNPRelate package to perform identity-by-descent (IBD) analysis. This package requires that the data be transformed into a GDS format file.
• Methods of moments IBD analysis is performed on only a subset of SNPs that are in linkage equilibrium by iteratively removing adjacent SNPs that exceed an LD threshold in a sliding window using the snpgdsLDpruning() function also in the SNPRelate package.

#### IBD analysis for relatedness and duplicates:

• More information on implementing this approach is provided on the stat-gen.org site:

http://www.stat-gen.org/tut/tut_preproc.html

• Relatedness filtering was performed previously and therefore none of the $1401$ individuals in the available data are expected to be related or duplicated.

#### PCA analysis for ancestry filtering:

• Recall, population substructure refers to a population consisting of subgroups within which there is random mating but between which there is little mixing or gene flow (population stratification) or to a population in which mating occurs between subgroups with different allelic distributions (admixture).

• We use the snpgdsPCA() function in the SNPRelate package to perform principal component analysis (PCA) and then plot the first two principal components to identify cluster and outliers.

#### PCA analysis for ancestry filtering:

• More information on implementing this approach is provided on the stat-gen.org site:

http://www.stat-gen.org/tut/tut_preproc.html

• Our samples are reasonably homogeneous from European ancestry, and therefore none of the $1401$ individuals are filtered for ancestry.

### I. Data preprocessing

#### Step 4: SNP level filtering (part 2)

Final SNP level filtering is based on Hardy-Weinberg equilibrium

#### Hardy-Weinberg equilibrium (HWE):

• HWE denotes independence of alleles at a single site between two homologous chromosomes.
• Recall Linkage Disequilibrium (LD) is allelic association across sites on a single homolog.

• Example:

Consider a bi-allelic SNP with genotypes AA, Aa and aa:

HWE implies $p_{AA}=p_A^2$, $p_{aa}=p_a^2$ and $p_{Aa}=p_Ap_a$.

#### Violations of Hardy-Weinberg equilibrium (HWE)

Can indicate:

• Presence of population substructure (non-random mating)
• Genotyping errors

We test for HWE at the SNP level within CAD controls, and remove SNPs with corresponding $p<1 \times 10^{−6}$.

In [11]:
# Setting threshold
hardy <- 10^-6

# Filter on HWE
CADcontrols <- clinical[ clinical$CAD==0, 'FamID' ] snpsum.colCont <- col.summary( genotype[CADcontrols,] ) HWEuse <- with(snpsum.colCont, !is.na(z.HWE) & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) ) rm(snpsum.colCont) HWEuse[is.na(HWEuse)] <- FALSE # Remove NA's as well cat(ncol(genotype)-sum(HWEuse),"SNPs will be removed due to high HWE.\n") # Subset genotype and SNP summary data for SNPs that pass HWE criteria genotype <- genotype[,HWEuse] print(genotype)  1296 SNPs will be removed due to high HWE. A SnpMatrix with 1401 rows and 656890 columns Row names: 10002 ... 11596 Col names: rs12565286 ... rs28729663  ### I. Data preprocessing #### Step 1: Creation of R objects$\checkmark$#### Step 2: SNP level filtering (part 1)$\checkmark$#### Step 3: Sample level filtering$\checkmark$#### Step 4: SNP level filtering (part 2)$\checkmark$### II. Data generation #### Step 5: Principal component analysis #### Principal component analysis: Principal components are re-calculted to be included as covariates in the GWA models. These serve to adjust for any remaining substructure that may act as confounders in the SNP-trait association analysis. As with Ancestry filtering we calculate PCs using the snpgdsPCA() function in the SNPRelate package, after performing LD pruning once again on the filtered genotype data set. Further details are found on the stat-gen.org site: http://www.stat-gen.org/tut/tut_generation.html The first 10 principal components are retained for the GWA models. #### Principal components read in from data file: In [12]: pca10 <- read.csv("UseR2016_Tutorial_Files/PCA_out.csv") head(pca10)  FamIDpc1pc2pc3pc4pc5pc6pc7pc8pc9pc10 1 1.000200e+04 7.764870e-03 1.448038e-02-6.315881e-04 2.866464e-03-1.883914e-02 9.680646e-03 2.764681e-02-6.645818e-03-2.342975e-02 1.049231e-02 2 1.000400e+04-1.204511e-02-7.231015e-03-3.001290e-03-1.079727e-02-7.770540e-03-4.645751e-03 1.806108e-03-3.087891e-03-1.833242e-03-4.538746e-03 3 1.000500e+04-1.670293e-02-5.347697e-03 1.444984e-02-6.151058e-04 3.451702e-02 3.870855e-02 2.057908e-02-1.226551e-02 3.592690e-03-2.287043e-03 4 1.000700e+04-9.537235e-03 4.556977e-03 2.683566e-03 1.662557e-02-2.363142e-04 5.514627e-03 1.595889e-02 2.797546e-02 2.977718e-02-7.461255e-03 5 1.000800e+04-1.539211e-02-2.446933e-03 2.050879e-02-5.724177e-03-3.969623e-03 5.354244e-03-7.269312e-04 2.701471e-02 1.067216e-02-3.352997e-03 6 1.000900e+04-1.512386e-02-2.353917e-03 2.136045e-02 6.915653e-03 4.006776e-02 2.322248e-02 1.524852e-02 1.329685e-02 2.274635e-02 1.314389e-02 ### II. Data generation #### Step 5: Principal component analysis$\checkmark$#### Step 6: SNP Imputation #### Step 6: SNP Imputation SNP imputation is used to extend analysis to non-typed SNPs, rare variants and SNPs that were removed during filtering. Genotype imputation requires reference data from homogeneous sample. Sources for these data include HapMap and 1000 Genomes. #### Imputation of SNPs: The snp.imputation() function in the snpStats package is used to derive 'rules' representing a predictive model for genotypes of untyped SNPs that is based on near-by typed SNPs. The impute.snps() function, also in the snpStats package, is then applied to estimate expected posterior values of the non-typed SNPs. An implementation example is provided on the stat-gen.org site: http://www.stat-gen.org/tut/tut_generation.html ### II. Data generation #### Step 5: Principal component analysis$\checkmark$#### Step 6: SNP imputation$\checkmark$### III. GWAS analysis #### Step 7: Association analysis for typed SNPs #### Step 7: Association analysis for typed SNPs The Genome-Wide Association Analysis (GWAA) function in R: • Parallel implementation of generalized linear model fitting on each SNP using the doParallel package. • Returns "SNP", "Estimate", "Std.Error", "t-value" and "p-value". • An additive model of association (trend test) is evaluated; data are manipulated to always count the number of minor alleles. • Gaussian (linear) or binomial (logistic) families (among others) can be selected. The GWAA() function is available in the downloaded tutorial folder and can be read into R as follows: In [13]: source("UseR2016_Tutorial_Files/GWAA.R")  Additional details on specifying the parameters of the GWAA() function are provided at: http://www.stat-gen.org/tut/tut_analysis.html The GWAA() function writes output to a .txt file. An example of this output is in the GWAStutorialout.txt file: In [14]: GWASout <- read.table("UseR2016_Tutorial_Files/GWAStutorialout.txt", header=TRUE, colClasses=c("character", rep("numeric",4))) head(GWASout)  SNPEstimateStd.Errort.valuep.value 1rs12565286 -0.11006048440612 0.0811307149802178-1.35658220728063 0.1751533944922 2rs3094315 -0.08774214467582560.0458130124264677 -1.91522320905346 0.0556863412136138 3rs2286139 -0.1253968023206270.0533236836341205-2.35161552568339 0.0188456227223414 4rs2980319 -0.0973866846369160.0531366340089406-1.83275976081831 0.0670677893265018 5rs2980300 -0.08068613306415050.0521019732939761 -1.54861952365783 0.121720599080485 6rs11240777 0.003296683641471230.0438024109222104 0.0752626070589099 0.940017405693539 ### III. GWAS analysis #### Step 7: Association analysis for typed SNPs$\checkmark$### IV. Post-analytic interrogation #### Step 10: Visualization The results of a GWA analysis are typically visualized using a Manhattan plot: In [15]: # Find the -log_10 of the p-values GWASout$Neg_logP <- -log10(GWASout$p.value) # Merge output with genoBim by SNP name to add position and chromosome number GWASout <- merge(GWASout, genoBim[,c("SNP", "chr", "position")]) head(GWASout)  SNPEstimateStd.Errort.valuep.valueNeg_logPchrposition 1rs10000012 -0.02085290094662520.0544108133922203 -0.383249204460648 0.701597949161597 0.153911689124775 4 1357325 2rs1000002 0.03665472523381010.03523749467771281.04021939042658 0.298433967867149 0.52515174682373 3 183635768 3rs1000003 -0.02929796356718280.0509472366476217 -0.57506482186312 0.565347973331718 0.247684160357594 3 98342907 4rs10000033 0.02337480415042060.03552719384432460.657941188736883 0.510693039327769 0.291840061475068 4 139599898 5rs10000036 0.02126911707137570.03757186210596470.56609164090377 0.571434580103119 0.243033482025455 4 139219262 6rs10000037 -0.02557369319108390.0442322654635925 -0.578168287856148 0.563251030147267 0.249298005308766 4 38924330 In [ ]: library(Cairo) GWASout$type <- "typed"
source("UseR2016_Tutorial_Files/GWAS_Manhattan.R")
GWAS_Manhattan(GWASout)