Please updated to the latest version on GitHub by the following command
pip install -U git+https://github.com/single-cell-genetics/vireo
In this notebook, we show an example on how to align donors by their genotypes:
The idea is the same for the two cases: align the donors with giving least genotype difference, either using categorical genotype value or genotype probability.
import numpy as np
import matplotlib.pyplot as plt
import vireoSNP
print("vireoSNP version: %s" %vireoSNP.__version__)
vireoSNP version: 0.5.8
The simple option is to use the wrap function vireoSNP.vcf.match_VCF_samples
res = vireoSNP.vcf.match_VCF_samples('../data/donors.cellSNP.vcf.gz',
'../data/outs/cellSNP_noGT/GT_donors.vireo.vcf.gz',
GT_tag1 = 'PL', GT_tag2='PL')
Shape for Geno Prob in VCF1: (3784, 4, 3) Shape for Geno Prob in VCF2: (3784, 4, 3) n_variants in VCF1, VCF2 and matched: 3784, 3784, 3783 aligned donors: ['MantonCB1' 'MantonCB2' 'MantonCB3' 'MantonCB4'] ['donor2' 'donor1' 'donor3' 'donor0']
fig = plt.figure()
vireoSNP.plot.heat_matrix(res['matched_GPb_diff'],
res['matched_donors1'] ,
res['matched_donors2'] )
plt.title("Geno Prob Delta: %d SNPs" %(res['matched_n_var']))
plt.tight_layout()
plt.show()
Alternatively, you can use the build-in functions to perform each step, including vireoSNP.load_VCF to load VCF file and vireoSNP.optimal_match to align donors that give minimal genotype differences by Hungarian algorithm.
Make sure your run vireo on the example data already
cd ../
mkdir data/outs
vireo -c data/cellSNP_mat -N 4 -o data/outs/cellSNP_noGT --randSeed 2
so you will have the estimated donor genotype ../data/outs/cellSNP_noGT/GT_donors.vireo.vcf.gz
Now, we will align the estimated donor genotype to another VCF estimated from bulk RNA-seq: ../data/donors.cellSNP.vcf.gz
Note: Often the known donor genotype VCF file can be very big, make sure you filter out unwanted variants first, as vireoSNP.load_VCF is not as efficient as bcftools
bcftools view donor.vcf.gz -R cellSNP.cells.vcf.gz -Oz -o sub.vcf.gz
Also, add -s or -S for subsetting samples.
GT_tag0 = 'PL' # common ones: GT, GP, PL
vcf_dat0 = vireoSNP.vcf.load_VCF("../data/donors.cellSNP.vcf.gz",
biallelic_only=True, sparse=False,
format_list=[GT_tag0])
GPb0_var_ids = np.array(vcf_dat0['variants'])
GPb0_donor_ids = np.array(vcf_dat0['samples'])
GPb0_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat0['GenoINFO'][GT_tag0], GT_tag0)
GPb0_tensor.shape
(3784, 4, 3)
print(GPb0_var_ids[:4])
print(GPb0_donor_ids)
['chr1_1065797_G_C' 'chr1X_1217251_C_A' 'chr1_1230695_G_A' 'chr1_1722625_A_T'] ['MantonCB1' 'MantonCB2' 'MantonCB3' 'MantonCB4']
GT_tag1 = 'PL' # common ones: GT, GP, PL
vcf_dat1 = vireoSNP.vcf.load_VCF("../data/outs/cellSNP_noGT/GT_donors.vireo.vcf.gz",
biallelic_only=True, sparse=False,
format_list=[GT_tag1])
GPb1_var_ids = np.array(vcf_dat1['variants'])
GPb1_donor_ids = np.array(vcf_dat1['samples'])
GPb1_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat1['GenoINFO'][GT_tag1], GT_tag1)
GPb1_tensor.shape
(3784, 4, 3)
print(GPb1_var_ids[:4])
print(GPb1_donor_ids)
['1_1065797_G_C' '1_1217251_C_A' '1_1230695_G_A' '1_1722625_A_T'] ['donor0' 'donor1' 'donor2' 'donor3']
# match variants
mm_idx = vireoSNP.vcf.match_SNPs(GPb1_var_ids, GPb0_var_ids)
idx1 = np.where(mm_idx != None)[0] #remove None for unmatched
idx0 = mm_idx[idx1].astype(int)
GPb1_var_ids_use = GPb1_var_ids[idx1]
GPb0_var_ids_use = GPb0_var_ids[idx0]
GPb1_tensor_use = GPb1_tensor[idx1]
GPb0_tensor_use = GPb0_tensor[idx0]
# match donors
idx0, idx1, GPb_diff = vireoSNP.base.optimal_match(GPb0_tensor_use, GPb1_tensor_use,
axis=1, return_delta=True)
print("aligned donors:")
print(GPb0_donor_ids[idx0])
print(GPb1_donor_ids[idx1])
aligned donors: ['MantonCB1' 'MantonCB2' 'MantonCB3' 'MantonCB4'] ['donor2' 'donor1' 'donor3' 'donor0']
GPb_diff[idx0, idx1], GPb_diff[idx0, idx1].sum()
(array([0.14587468, 0.18310327, 0.21593441, 0.22427656]), 0.7691889176247009)
GPb_diff
array([[0.43964819, 0.44643109, 0.14587468, 0.43192166], [0.41247473, 0.18310327, 0.43770159, 0.42725593], [0.42989822, 0.41561379, 0.42926244, 0.21593441], [0.22427656, 0.40927258, 0.43158556, 0.42760225]])
Note, in this example data set the genotype estimation is not perfect as it is only based on ~250 cells in 10x data, namely not a decent coverage. Nevertheless, it is clear enough to find the donor identity.
fig = plt.figure()
vireoSNP.plot.heat_matrix(GPb_diff[idx0, :][:, idx1],
GPb0_donor_ids[idx0],
GPb1_donor_ids[idx1])
plt.title("Geno Prob Delta: %d SNPs" %(len(GPb0_var_ids_use)))
plt.tight_layout()
plt.show()