# Factored Spectrally Transformed Linear Mixed Models¶

##### Version 0.2.32¶

Microsoft Research, February 19, 2019

### Introduction¶

FaST-LMM, which stands for Factored Spectrally Transformed Linear Mixed Models, is a program for performing both single-SNP and SNP-set genome-wide association studies (GWAS) on extremely large data sets. This release contains the improvements described in Widmer et al., Scientific Reports 2014, and tests for epistasis. See the FaST-LMM annotated bibliography for additional papers that contributed to the FaST-LMM project and associated code.

### Citing FaST-LMM¶

If you use FaST-LMM in any published work, please cite the relevant manuscript(s) describing it.

### Data preparation¶

This version of FaST-LMM is designed for use with randomly ascertained data with Gaussian residuals. If you have case-control data with substantial ascertainment bias, you should first transform your phenotype(s) using LEAP (Weissbrod et al., arXiv 2014), which is available at https://github.com/omerwe/LEAP. If you are analyzing continuous phenotypes with non-Gaussian residuals, you should first transform your phenotype(s) using Warped-LMM (Fusi et al., Nature Commun 2014), available at https://github.com/MicrosoftGenomics/WarpedLMM.

FaST-LMM uses four input files containing (1) the SNP data to be tested, (2) the SNP data used to determine the genetic similarity matrix (GSM) between individuals (which can be different from the data in (1)), (3) the phenotype data, and (4, optionally) a set of covariates.

SNP files can be in PLINK format (ped/map, tped/tfam, bed/bim/fam, or fam/dat/map). For best speed, use the binary format in SNP-major order. See the PLINK manual http://pngu.mgh.harvard.edu/~purcell/plink/ (Purcell et al., Am J Hum Genet 2007) for further details. FaST-LMM also supports the HDF5 file format http://www.hdfgroup.org/HDF5/whatishdf5.html. See https://github.com/MicrosoftGenomics/PySnpTools for more details. Note that each SNP is standardized to have mean zero and standard deviation one across all individuals before processing, and missing values are mean imputed.
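The per-SNP standardization just described can be sketched in a few lines of NumPy. This is a hand-rolled illustration of the idea, not FaST-LMM's internal code:

```python
import numpy as np

# one SNP across five individuals, coded 0/1/2, with one missing call (NaN)
g = np.array([0.0, 1.0, 2.0, np.nan, 1.0])

# mean-impute missing values, then standardize to mean 0, sd 1
g = np.where(np.isnan(g), np.nanmean(g), g)
g = (g - g.mean()) / g.std()

print(g.mean(), g.std())  # ~0.0 and 1.0
```

PySnpTools performs the equivalent operation on whole SNP matrices when data are read for analysis.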

The required file containing the phenotype uses the PLINK alternate phenotype format with no header. The covariate file also uses this format (with additional columns for multiple covariates).
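Concretely, the alternate phenotype format is whitespace-delimited with a family ID, an individual ID, and one or more value columns per row. A minimal sketch of parsing such a file with pandas (the IDs, values, and column names here are made up for illustration):

```python
import io
import pandas as pd

# two individuals, one phenotype column, no header row
pheno_txt = "fam0 iid0 0.52\nfam1 iid1 -1.30\n"

pheno = pd.read_csv(io.StringIO(pheno_txt), sep=r"\s+", header=None,
                    names=["FID", "IID", "pheno"])
print(pheno.shape)  # (2, 3)
```

In practice you pass the file name directly to FaST-LMM functions, which read this format for you.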

### Notebook preparation and general use¶

To prepare this notebook to run analyses, please run the following script.

In [1]:
# set some ipython notebook properties
%matplotlib inline

# set degree of verbosity (adapt to INFO for more verbose output)
import logging
logging.basicConfig(level=logging.WARNING)

# set figure sizes
import pylab
pylab.rcParams['figure.figsize'] = (10.0, 8.0)

# set display width for pandas data frames
import pandas as pd
pd.set_option('display.width', 1000)


If you would like to run any of the code below from the command line, first copy it into a file (e.g., test.py), and then run it by typing python test.py at the command line.

If you would like to see all the options for a function, just type ?&lt;function_name&gt; at an IPython prompt.

### Single-SNP association testing¶

First, let's run a standard LMM analysis in which the GSM uses (almost) all available SNPs. The model for this analysis is called LMM(all) in Widmer et al., Scientific Reports 2014. We'll apply this model to the synthetic data in tests/datasets/synth. The data set has 500 samples with 5000 SNPs and was generated from a Balding-Nichols model with FST=0.05.

When using a linear mixed model for association analysis, it is important to avoid proximal contamination (Lippert et al., Nat Meth 2011). To understand proximal contamination, first note that an LMM with no fixed effects, using a realized relationship matrix for the GSM (as FaST-LMM does), is mathematically equivalent to linear regression of the phenotype on the SNPs, with weights integrated over independent Normal distributions having the same variance (e.g., Hayes et al., Genet Res 2009). That is, an LMM using a given set of SNPs for the GSM is equivalent to a form of linear regression using those SNPs as covariates to correct for confounding. This equivalence implies that, when testing a given SNP, that SNP (and SNPs physically close to it) should be excluded from the computation of the GSM. If not, when testing a particular SNP, we would also be using that same SNP as a covariate, making the log likelihood of the null model higher than it should be, thus leading to deflation of the test statistic and loss of power.
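The equivalence can be checked numerically: with SNP weights drawn independently from $N(0, \sigma_g^2/m)$ and integrated out, the marginal covariance of the phenotype is $\sigma_g^2 K + \sigma_e^2 I$, where $K = GG^T/m$ is the realized relationship matrix built from the $n \times m$ standardized SNP matrix $G$. A small NumPy sketch with made-up numbers:

```python
import numpy as np

rng = np.random.RandomState(0)
n, m = 8, 200                      # individuals, SNPs (hypothetical sizes)
G = rng.randn(n, m)                # stand-in for a standardized SNP matrix
sg2, se2 = 0.7, 0.3                # genetic and residual variance components

# LMM view: covariance built from the realized relationship matrix
K = G.dot(G.T) / m
cov_lmm = sg2 * K + se2 * np.eye(n)

# regression view: y = G w + e with w ~ N(0, (sg2/m) I) integrated out
cov_reg = (sg2 / m) * G.dot(G.T) + se2 * np.eye(n)

print(np.allclose(cov_lmm, cov_reg))  # True
```

The two covariances are identical term by term, which is exactly why SNPs placed in the GSM act like regression covariates.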

Excluding the SNP you are testing and those SNPs in close proximity to it from the GSM in a naïve way is extremely computationally expensive. A computationally efficient approach for performing the exclusion is to use a GSM computed from all but chromosome $i$ when testing SNPs on chromosome $i$ (Lippert et al., Nat Meth 2011). We call this approach leave out one chromosome (LOOC). The analysis here does this.
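FaST-LMM's single_snp performs LOOC automatically; the bookkeeping itself is simple, as this plain-NumPy sketch shows (chromosome labels and dimensions are made up):

```python
import numpy as np

rng = np.random.RandomState(1)
n = 20
chrom = np.repeat([1, 2, 3], 30)      # chromosome label for each of 90 SNPs
G = rng.randn(n, chrom.size)          # stand-in for standardized genotypes

for c in np.unique(chrom):
    keep = chrom != c                 # leave out one chromosome
    K_looc = G[:, keep].dot(G[:, keep].T) / keep.sum()
    # ... test the SNPs on chromosome c against the GSM K_looc ...
    print(c, K_looc.shape, keep.sum())
```

Only one GSM per chromosome is needed, rather than one per tested SNP, which is what makes the exclusion computationally cheap.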

In [2]:
# import the algorithm
from fastlmm.association import single_snp

# set up data
##############################
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"

# run gwas
###################################################################
results_df = single_snp(bed_fn,  pheno_fn, covar=cov_fn, count_A1=True)

# manhattan plot
import pylab
import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()

# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["PValue"].values, xlim=[0,5], ylim=[0,5])

# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)

lambda=1.0078

Out[2]:
sid_index SNP Chr GenDist ChrPos PValue SnpWeight SnpWeightSE SnpFractVarExpl Mixing Nullh2
0 52 snp495_m0_.01m1_.04 5 4052 4052 2.990684e-23 -0.418653 0.040052 0.424521 0.0 0.451117
1 392 snp1422_m0_.49m1_.5 3 2392 2392 8.251922e-23 0.416495 0.040300 0.420587 0.0 0.279710
2 650 snp1200_m0_.37m1_.36 3 2650 2650 3.048007e-14 -0.328870 0.042021 0.331240 0.0 0.279710
3 3 snp433_m0_.14m1_.11 3 2003 2003 9.202499e-10 0.268289 0.042973 0.269670 0.0 0.279710
4 274 snp2832_m0_.46m1_.1 4 3274 3274 7.069762e-04 -0.170421 0.050003 0.151124 0.0 0.542046
5 13 snp1413_m0_.04m1_.03 3 2013 2013 8.161238e-04 0.148719 0.044157 0.149377 0.0 0.279710
6 214 snp2804_m0_.16m1_.3 3 2214 2214 1.239806e-03 -0.150705 0.046396 0.144180 0.0 0.279710
7 117 snp751_m0_.04m1_.25 1 117 117 1.527432e-03 0.152430 0.047827 0.141523 0.0 0.614963
8 265 snp1440_m0_.35m1_.32 4 3265 3265 1.771049e-03 -0.136281 0.043358 0.139610 0.0 0.542046
9 307 snp2162_m0_.61m1_.42 2 1307 1307 1.816576e-03 0.143296 0.045700 0.139280 0.0 0.534262

#### Improving power: LMM(all+select)¶

In Widmer et al., Scientific Reports 2014, we showed that power can be increased while still maintaining control of type I error by using two GSMs: one based on all SNPs (G0) and one based on selected SNPs that are highly correlated with the phenotype (G1). The model is called LMM(all+select). This approach has greater computational demands, so we recommend using a cluster computer when analyzing large data sets. Here is an example of how to apply this model to the synthetic data.

In [3]:
# example for two kernel feature selection
# this takes a couple of minutes to run on a 20-proc machine.
from fastlmm.association import single_snp_all_plus_select

from fastlmm.util.runner import LocalMultiProc

# define file names
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"

# find the chr5 SNPs (column 0 of .pos is the chromosome)
from pysnptools.snpreader import Bed
snp_reader = Bed(bed_fn, count_A1=True)
test_snps = snp_reader[:, snp_reader.pos[:, 0] == 5]

# select the 2nd kernel and run GWAS
runner = LocalMultiProc(20, mkl_num_threads=5)
results_df = single_snp_all_plus_select(test_snps=test_snps, G=bed_fn, pheno=pheno_fn,
                                        covar=cov_fn, GB_goal=2, runner=runner)

import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()


Out[3]:
sid_index SNP Chr GenDist ChrPos PValue SnpWeight SnpWeightSE SnpFractVarExpl Mixing Nullh2
0 52 snp495_m0_.01m1_.04 5 4052 4052 6.246232e-33 -0.395548 0.030731 0.499629 0.688079 0.545958
1 307 snp3970_m0_.53m1_.17 5 4307 4307 2.415292e-04 0.149395 0.040400 0.163479 0.688079 0.545958
2 177 snp3853_m0_.1m1_.04 5 4177 4177 2.162433e-03 -0.109143 0.035401 0.136856 0.688079 0.545958
3 46 snp2743_m0_.03m1_.1 5 4046 4046 2.178166e-03 0.109598 0.035574 0.136760 0.688079 0.545958
4 236 snp726_m0_.28m1_.31 5 4236 4236 2.831891e-03 0.104920 0.034969 0.133250 0.688079 0.545958

Aside: In some applications, you may want to use two kernels constructed from two pre-specified sets of SNPs (i.e., with no feature selection). Here we show how to do that and how to simultaneously find h2 and the mixing weight between the kernels.
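Under the hood, the two kernels enter the model as a convex combination weighted by the mixing parameter (reported in the Mixing column of the output), and the mixing weight and h2 are found by optimizing the likelihood. A hand-rolled sketch of just the combination step, with made-up kernels and a fixed weight:

```python
import numpy as np

rng = np.random.RandomState(2)
n = 10
G0 = rng.randn(n, 100)                   # stand-in SNPs for the background kernel
G1 = rng.randn(n, 40)                    # stand-in SNPs for the second kernel
K0 = G0.dot(G0.T) / G0.shape[1]
K1 = G1.dot(G1.T) / G1.shape[1]

mixing = 0.3                             # in practice chosen by maximizing the likelihood
K = (1.0 - mixing) * K0 + mixing * K1

# a convex combination of PSD kernels is still PSD
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```

Because the combination is convex, the combined kernel remains a valid (positive semi-definite) GSM for any mixing weight in [0, 1].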

In [4]:
# example script for two kernel without feature selection
import numpy as np
import pysnptools.util
from fastlmm.association import single_snp

# define file names
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"

# select data
###################################################################
from pysnptools.snpreader import Bed
snp_reader = Bed(bed_fn, count_A1=True)

# partition snps on chr6 vs rest
G1_chr = 6
G1 = snp_reader[:, snp_reader.pos[:, 0] == G1_chr]
G0 = snp_reader[:, snp_reader.pos[:, 0] != G1_chr]

# test on chr5
test_snps = snp_reader[:, snp_reader.pos[:, 0] == 5]
results_df = single_snp(test_snps, pheno_fn, K0=G0, K1=G1, covar=cov_fn, GB_goal=2, count_A1=True)

import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()

Out[4]:
sid_index SNP Chr GenDist ChrPos PValue SnpWeight SnpWeightSE SnpFractVarExpl Mixing Nullh2
0 52 snp495_m0_.01m1_.04 5 4052 4052 2.990684e-23 -0.418653 0.040052 0.424521 0.0 0.451117
1 236 snp726_m0_.28m1_.31 5 4236 4236 2.229561e-03 0.133405 0.043400 0.136587 0.0 0.451117
2 785 snp4608_m0_.35m1_.39 5 4785 4785 5.489019e-03 0.122558 0.043942 0.124139 0.0 0.451117
3 239 snp2615_m0_.49m1_.25 5 4239 4239 8.029456e-03 -0.123956 0.046572 0.118546 0.0 0.451117
4 937 snp1115_m0_.1m1_.1 5 4937 4937 8.030124e-03 0.116014 0.043589 0.118545 0.0 0.451117

#### Improving speed and memory use when there is little family structure: LMM(select)+PCs¶

In the same publication, we showed that a simpler, more computationally efficient model can be used when the data are confounded only by population structure and not by family structure or cryptic relatedness. Under these circumstances, we have found that a model with a single GSM based on selected SNPs, in combination with principal components as fixed-effect covariates, yields good control of type I error and power.

This model, called LMM(select)+PCs, should be used with caution. Even if you explicitly remove closely related individuals from your data, cryptic relatedness may remain.

To use this model, first identify the principal components to be used with the PCgeno algorithm, as described in Widmer et al., Scientific Reports 2014.

Then you can call single_snp_select with these PCs as covariates.

In [5]:
from pysnptools.snpreader import Bed
from fastlmm.util import compute_auto_pcs
from fastlmm.association import single_snp_select

# define file names
bed_fn = "../../tests/datasets/synth/all"
phen_fn = "../../tests/datasets/synth/pheno_10_causals.txt"

# find number of PCs
pcs = compute_auto_pcs(bed_fn,count_A1=True)
print "selected number of PCs:", pcs["vals"].shape[1]

# test on chr5 (column 0 of .pos is the chromosome)
snp_reader = Bed(bed_fn, count_A1=True)
test_snps = snp_reader[:, snp_reader.pos[:, 0] == 5]
results_df = single_snp_select(test_snps=test_snps, G=snp_reader, pheno=phen_fn, covar=pcs, GB_goal=2)

import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()


selected number of PCs: 1


Out[5]:
sid_index SNP Chr GenDist ChrPos PValue SnpWeight SnpWeightSE SnpFractVarExpl Mixing Nullh2
0 52 snp495_m0_.01m1_.04 5 4052 4052 1.679173e-33 -0.401931 0.030898 0.503978 0.0 0.374751
1 307 snp3970_m0_.53m1_.17 5 4307 4307 2.492670e-04 0.153034 0.041475 0.163287 0.0 0.374751
2 236 snp726_m0_.28m1_.31 5 4236 4236 1.132823e-03 0.115687 0.035331 0.145316 0.0 0.374751
3 177 snp3853_m0_.1m1_.04 5 4177 4177 2.461932e-03 -0.108263 0.035571 0.135267 0.0 0.374751
4 46 snp2743_m0_.03m1_.1 5 4046 4046 2.504740e-03 0.109775 0.036131 0.135036 0.0 0.374751

### Epistasis¶

You can test for epistatic interactions between pairs of SNPs as well. Here is an example analysis applied to the same synthetic data. Note that this version of the code uses a likelihood ratio test based on maximum-likelihood estimates. A REML-based version is in the works.
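The PValue column in the output below follows from treating twice the log-likelihood difference as chi-square distributed with one degree of freedom (one interaction term). You can reproduce it from NullLogLike and AltLogLike with the standard library alone; the numbers here are taken from the first row of the results table below:

```python
import math

null_ll, alt_ll = -700.878718, -694.724583   # first row of the epistasis output
lrt = 2.0 * (alt_ll - null_ll)               # likelihood ratio test statistic

# survival function of a 1-dof chi-square: P(X > x) = erfc(sqrt(x/2))
pvalue = math.erfc(math.sqrt(lrt / 2.0))
print(pvalue)  # ~0.000451, matching the PValue column
```

For heavier use, scipy.stats.chi2.sf(lrt, 1) computes the same quantity.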

In [6]:
# import the algorithm and reader
from fastlmm.association import epistasis

# define file names
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"

# partition data into the first 50 SNPs on chr1 and all but chr1
from pysnptools.snpreader import Bed
snp_reader = Bed(bed_fn, count_A1=True)
test_snps = snp_reader[:, snp_reader.pos[:, 0] == 1][:, :50]
G0 = snp_reader[:, snp_reader.pos[:, 0] != 1]

# run epistasis analysis
results_df = epistasis(test_snps, pheno_fn, G0=G0, covar=cov_fn)

# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["PValue"].values, xlim=[0,5], ylim=[0,5])

# print head of results data frame
pd.set_option('display.width', 1000)
results_df.head(n=10)

lambda=0.9234


Out[6]:
SNP0 Chr0 GenDist0 ChrPos0 SNP1 Chr1 GenDist1 ChrPos1 PValue NullLogLike AltLogLike
0 snp1754_m0_.19m1_.42 1.0 38.0 38.0 snp2376_m0_.38m1_.49 1.0 20.0 20.0 0.000451 -700.878718 -694.724583
1 snp625_m0_.03m1_.07 1.0 0.0 0.0 snp3000_m0_.28m1_.34 1.0 5.0 5.0 0.001306 -700.387891 -695.220652
2 snp250_m0_.23m1_.19 1.0 15.0 15.0 snp2377_m0_.06m1_.02 1.0 30.0 30.0 0.002279 -698.768488 -694.113511
3 snp376_m0_.17m1_.17 1.0 9.0 9.0 snp625_m0_.03m1_.07 1.0 0.0 0.0 0.002287 -700.206861 -695.555229
4 snp125_m0_.64m1_.41 1.0 21.0 21.0 snp1754_m0_.19m1_.42 1.0 38.0 38.0 0.002404 -700.873861 -696.267594
5 snp2878_m0_.82m1_.58 1.0 42.0 42.0 snp1878_m0_.58m1_.36 1.0 48.0 48.0 0.002568 -698.579519 -694.033792
6 snp1128_m0_.14m1_.13 1.0 35.0 35.0 snp1625_m0_.4m1_.47 1.0 23.0 23.0 0.003067 -700.445456 -696.061998
7 snp3_m0_.55m1_.46 1.0 32.0 32.0 snp1753_m0_.17m1_.19 1.0 28.0 28.0 0.004327 -700.735073 -696.664365
8 snp250_m0_.23m1_.19 1.0 15.0 15.0 snp3_m0_.55m1_.46 1.0 32.0 32.0 0.004486 -700.927439 -696.889564
9 snp1753_m0_.17m1_.19 1.0 28.0 28.0 snp251_m0_.1m1_.49 1.0 33.0 33.0 0.006503 -700.728933 -697.026244

### SNP-set association testing¶

SNP-set association testing is performed similarly to single-SNP testing, except we test sets of SNPs by putting them together into one GSM, separate from any (optional) background GSM that corrects for confounding. Both LRT ("lrt") and score ("sc_davies") tests are supported. The LRT is computed by default, but you can switch to the score test by setting test="sc_davies". The score test can be more conservative in some settings (and hence can have less power), but it is closed form and often can be computed much faster. The algorithms in their current form are described in Lippert et al., Bioinformatics 2014.

Here is an example that uses the LRT and no background GSM; note the inflation in the results.

In [7]:
# this will take a few minutes to run

# import the algorithm and reader
from fastlmm.association import snp_set
import datetime

# define file names
test_snps_fn = "../../tests/datasets/synth/chr1"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
set_list_fn = "../../tests/datasets/synth/chr1.sets.txt"
G0_fn = None

# run SNP-set analysis
results_df = snp_set(test_snps=test_snps_fn, G0=G0_fn, set_list=set_list_fn, pheno=pheno_fn, covar=cov_fn, test="lrt")

# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["P-value"].values, xlim=[0,5], ylim=[0,5])

# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)

Out[7]:
SetId LogLikeAlt LogLikeNull P-value #SNPs_in_Set #ExcludedSNPs chrm pos. range Alt_h2 Alt_a2
0 set_65 -703.888501 -707.609745 0.002564 10 0 1 650-659 0.029725 0.0
1 set_86 -704.240753 -707.609745 0.003768 10 0 1 860-869 0.030321 0.0
2 set_70 -704.525274 -707.609745 0.005158 10 0 1 700-709 0.029790 0.0
3 set_18 -704.920236 -707.609745 0.008020 10 0 1 180-189 0.024803 0.0
4 set_45 -705.410762 -707.609745 0.014031 10 0 1 450-459 0.023440 0.0
5 set_84 -705.507312 -707.609745 0.015693 10 0 1 840-849 0.024002 0.0
6 set_98 -705.623163 -707.609745 0.017965 10 0 1 980-989 0.022271 0.0
7 set_80 -705.728130 -707.609745 0.020327 10 0 1 800-809 0.023943 0.0
8 set_73 -706.000699 -707.609745 0.028153 10 0 1 730-739 0.019835 0.0
9 set_92 -706.166636 -707.609745 0.034473 10 0 1 920-929 0.019799 0.0

As just mentioned, we see inflation of the test statistic because no background GSM was used. When you include a background GSM (using G0), type I error is better controlled. As in the single-SNP case, proximal contamination is avoided using LOOC.

In [8]:
#Here we supply G0, a GSM created from a Bed file containing every chromosome except chromosome 1 (which holds the test SNPs)

#WARNING: Even on a fast machine, this takes 24 minutes (but, running with test="sc_davies" takes seconds)

G0_fn = "../../tests/datasets/synth/chrNot1.bed"

# run SNP-set analysis
results_df = snp_set(test_snps=test_snps_fn, G0=G0_fn, set_list=set_list_fn, pheno=pheno_fn, covar=cov_fn, test="lrt")

# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["P-value"].values, xlim=[0,5], ylim=[0,5])

# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)

constructing LMM - this should only happen once.

Out[8]:
SetId LogLikeAlt LogLikeNull P-value #SNPs_in_Set #ExcludedSNPs chrm pos. range Alt_h2 Alt_h2_1
0 set_60 -697.449026 -699.619196 0.016062 10 0 1 600-609 0.630039 0.024977
1 set_11 -697.979257 -699.619196 0.028124 10 0 1 110-119 0.618464 0.022458
2 set_14 -698.359356 -699.619196 0.042932 10 0 1 140-149 0.623297 0.019099
3 set_80 -698.401148 -699.619196 0.045045 10 0 1 800-809 0.619344 0.017966
4 set_70 -698.430620 -699.619196 0.046607 10 0 1 700-709 0.577376 0.017370
5 set_45 -698.442867 -699.619196 0.047275 10 0 1 450-459 0.592120 0.015451
6 set_86 -698.518404 -699.619196 0.051648 10 0 1 860-869 0.592057 0.017528
7 set_92 -698.547797 -699.619196 0.053477 10 0 1 920-929 0.621041 0.017506
8 set_65 -698.714102 -699.619196 0.065409 10 0 1 650-659 0.584255 0.016500
9 set_93 -698.820185 -699.619196 0.074734 10 0 1 930-939 0.618306 0.012515

The G1 option available for single-marker testing is not yet supported for set testing. With the same caveats as previously described, however, you can use PC covariates and a G0 based on selected SNPs when your data is confounded, for example, by population structure.

### Prediction¶

We can train on one set of examples and make predictions on another set.

In [9]:
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np

# define file names
bed_fn = "../../tests/datasets/synth/all"
cov_fn = "../../tests/datasets/synth/cov.txt"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"

# Divide the data into train (all but the last 10 individuals) and test (the last 10 individuals)
# (the cov and pheno will automatically be divided to match)
snp_reader = Bed(bed_fn, count_A1=True)
train = snp_reader[:-10, :]
test = snp_reader[-10:, :]

# In the style of scikit-learn, create a predictor and then train it.
fastlmm = FastLMM(GB_goal=2)
fastlmm.fit(K0_train=train, X=cov_fn, y=pheno_fn)

# Now predict with it
mean, covariance = fastlmm.predict(K0_whole_test=test,X=cov_fn)

print "Predicted means and stdevs"
print mean.val[:,0]
print np.sqrt(np.diag(covariance.val))

#Plot actual phenotype and predicted phenotype
whole_pheno = Pheno(pheno_fn)
actual_pheno = whole_pheno[whole_pheno.iid_to_index(mean.iid), :].read()
pylab.plot(actual_pheno.val, "r.")
pylab.plot(mean.val,"b.")
pylab.errorbar(np.arange(mean.iid_count),mean.val,yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
pylab.xlabel('testing examples')
pylab.ylabel('phenotype, actual (red) and predicted (blue with stdev)')
pylab.show()

Predicted means and stdevs
[-0.01441869 -0.28643401 -0.25428545  0.08895201 -0.29511007 -0.39238035
0.06844375 -0.43282078 -0.21734716 -0.35522388]
[ 0.95795615  0.96181176  0.95260426  0.94991531  0.9604778   0.95411448
0.96020394  0.9817148   0.95198989  0.96271045]


### Appendix¶

Columns in the output are as follows:

SNP or Set The SNP or set identifier tested.

PValue The P value computed for the SNP or set tested.

NullLogLike The log likelihood of the null model.

AltLogLike The log likelihood of the alternative model.

Global statistics, such as the sample size and the number of SNPs tested, are printed to stdout at the end of the run.

###### Single-SNP testing only:¶

Chr The chromosome identifier for the SNP tested or 0 if unplaced. Taken from the PLINK file.

GenDist The genetic distance of the SNP on the chromosome. Taken from the PLINK file. Any units are allowed, but typically centimorgans or morgans are used.

ChrPos The base-pair position of the SNP on the chromosome (bp units). Taken from the PLINK file.

SnpWeight The fixed-effect weight of the SNP.

SnpWeightSE The standard error of the SnpWeight.

Nullh2 The narrow-sense heritability, given by $h2 = \sigma_g^2/(\sigma_g^2+\sigma_e^2)$, under the null model.

###### SNP-set testing only:¶

#SNPs_in_Set The number of markers in the set.

Alt_h2 The value found in the alternative model, given by $\sigma^2(h2\,(a2\,K_S+(1-a2)\,K_C)+(1-h2)\,I)$, where $K_C$ is the $N \times N$ matrix to correct for confounding, and $K_S$ is the matrix containing the SNPs to be tested.

Alt_a2 See Alt_h2.

Null_h2 The value found in the null model, given by $\sigma^2(h2\,K_C+(1-h2)\,I)$, where $K_C$ is the $N \times N$ matrix to correct for confounding.
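A sketch of how these pieces fit together numerically: build the null-model covariance $\sigma^2(h2\,K_C + (1-h2)\,I)$ and recover the heritability from the implied variance components. All values here are made up for illustration:

```python
import numpy as np

rng = np.random.RandomState(3)
n = 6
Gc = rng.randn(n, 50)
K_C = Gc.dot(Gc.T) / 50.0             # stand-in confounding kernel

sigma2, h2 = 1.2, 0.4                 # overall scale and heritability (hypothetical)
cov_null = sigma2 * (h2 * K_C + (1.0 - h2) * np.eye(n))

# equivalently, with sigma_g^2 = sigma2*h2 and sigma_e^2 = sigma2*(1-h2),
# h2 = sigma_g^2 / (sigma_g^2 + sigma_e^2), as in the Nullh2 definition above
sg2, se2 = sigma2 * h2, sigma2 * (1.0 - h2)
print(abs(h2 - sg2 / (sg2 + se2)) < 1e-12)  # True
```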