We will acquire a modest collection of SNP calls from chr17, and project the genotype configurations via principal components to expose population substructure.
Bioconductor’s ldblock package includes utilities for working with collections of VCF, typically decomposed by chromosome.
library(BiocManager)
inplist = rownames(installed.packages())
if (!("snpStats" %in% inplist)) install("snpStats")
if (!("ldblock" %in% inplist)) install("vjcitn/ldblock")
if (!("terravar" %in% inplist)) install("vjcitn/terravar")
library(ldblock)
library(GenomicFiles)
library(VariantAnnotation)
EBI has updated 1000 genomes calls. We will interrogate a tabix-indexed VCF for chr17 via HTTP. An object that organizes multiple chromosomes is created using stack1kg from the ldblock package.
st = stack1kg("17")
We use a ScanVcfParam to define a slice of genome.
sp = ScanVcfParam(geno="GT",
which=GRanges("17", IRanges(32e6,33.75e6)))
myread = readVcfStack(st, param=sp)
readVcfStack
produces a CollapsedVCF instance in memory. See the documentation for the VariantAnnotation package for details on this data structure.
myread
class: CollapsedVCF dim: 49194 2548 rowRanges(vcf): GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER info(vcf): DataFrame with 12 columns: AF, AC, NS, AN, EAS_AF, EUR_AF, AFR_AF, AMR_AF,... info(header(vcf)): Number Type Description AF A Float Estimated allele frequency in the range (0,1) AC A Integer Total number of alternate alleles in called geno... NS 1 Integer Number of samples with data AN 1 Integer Total number of alleles in called genotypes EAS_AF A Float Allele frequency in the EAS populations calculat... EUR_AF A Float Allele frequency in the EUR populations calculat... AFR_AF A Float Allele frequency in the AFR populations calculat... AMR_AF A Float Allele frequency in the AMR populations calculat... SAS_AF A Float Allele frequency in the SAS populations calculat... VT . String indicates what type of variant the line represents EX_TARGET 0 Flag indicates whether a variant is within the exon p... DP 1 Integer Approximate read depth; some reads may have been... geno(vcf): SimpleList of length 1: GT geno(header(vcf)): Number Type Description GT 1 String Phased Genotype
We use tools in David Clayton’s snpStats package to achieve a compact representation of (possibly uncertain) genotype calls. This enables us to filter to SNP with MAF exceeding a given threshold, and, later, to compute a PCA.
library(snpStats)
mymat = genotypeToSnpMatrix(myread)
## non-single nucleotide variations are set to NA
cs = col.summary(mymat[[1]])
sum(cs[,"MAF"]>.1, na.rm=TRUE)
non-single nucleotide variations are set to NA
We'll use SNP with MAF exceeding 10%, and build the sample x SNP matrix.
kpsnp = which(cs[,"MAF"]>.1)
cmat = matrix(0, nr=nrow(mymat[[1]]), nc=length(kpsnp))
for (i in seq_len(length(kpsnp))) {
cmat[,i] = as(mymat[[1]][,kpsnp[i]], "numeric")
}
rownames(cmat) = rownames(mymat[[1]])
pp = prcomp(cmat)
dim(pp$x)
library(ggplot2)
library(terravar)
data("geog_1kg")
rownames(geog_1kg) = geog_1kg[,1]
newdf = data.frame(pp$x[,1:4], pop=geog_1kg[rownames(pp$x), "Population"],
superpop = geog_1kg[rownames(pp$x), "superpop"])
ggplot(newdf, aes(x=PC1, y=PC2, colour=superpop)) + geom_point()