Now we've got output vcf's from Snpeff and VEP. What we need to do is import them into a dataframe. Unfortunately this is not completely straightfoward since they encode their output into the Info column.
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )' ">
An actual line might look like:
EFF=DOWNSTREAM(MODIFIER||28|||CFTR|processed_transcript|CODING|ENST00000610149||1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000429014|2|1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000608965|3|1),INTRON(MODIFIER|||c.3207+2163A>G|1150|CTTNBP2|protein_coding|CODING|ENST00000446636|20|1|WARNING_TRANSCRIPT_NO_START_CODON),INTRON(MODIFIER|||c.4746+2163A>G|1663|CTTNBP2|protein_coding|CODING|ENST00000160373|22|1),INTRON(MODIFIER|||n.*2660+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000441556|22|1),INTRON(MODIFIER|||n.*463+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000445366|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cTc/cCc|p.Leu155Pro/c.464T>C|155|CFTR|protein_coding|CODING|ENST00000600166|4|1|WARNING_TRANSCRIPT_NO_START_CODON)
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">
An actual line might look like:
CSQ=C|83992|XM_005250635.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250635.1:c.2649+2163A>G|||||,C|ENSESTG00000008055|ENSESTT00000020365|Transcript|downstream_gene_variant|||||||||||||2163|-1|||||||protein_coding||||||,C|83992|NM_033427.2|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|NM_033427.2:c.4746+2163A>G|||||,C|83992|XM_005250634.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250634.1:c.4692+2163A>G|||||,C|CCDS5774.1|CCDS5774.1|Transcript|intron_variant||||||||||||||-1|||||||protein_coding|CCDS5774.1:c.4746+2163A>G|||||,C||ENSR00000683158|RegulatoryFeature|regulatory_region_variant|||||||||||||||||||||||||||
Chr Start End Ref Alt Func.ensgene Gene.ensgene ExonicFunc.ensgene AAChange.ensgene phastConsElements46way esp6500si_all 1000g2012apr_all snp138 SIFT_score SIFT_score_converted SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_score_converted LRT_pred MutationTaster_score MutationTaster_score_converted MutationTaster_pred MutationAssessor_score MutationAssessor_score_converted MutationAssessor_pred FATHMM_score FATHMM_score_converted FATHMM_pred RadialSVM_score RadialSVM_score_converted RadialSVM_pred LR_score LR_pred GERP++_RS phyloP 29way_logOdds Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo Otherinfo
An actual line might look like:
7 117171151 117171151 - TA exonic ENSG00000001626 frameshift insertion ENSG00000001626:ENST00000003084:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000426809:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000454343:exon4:c.472_473insTA:p.S158fs Score=645;Name=lod=560 het 7 117171151 . A ATA . . . GT 0/1
It helps to keep in mind what we are trying to accomplish with this bit of data processing. We want to be able to compare the classification of every Transcript-Variant pair between tools. There are two things we'll need to do to our data to accomplish this goal.
from IPython.display import HTML
import pandas as pd
import numpy as np
import os
# snp_eff_ensembl_df = pd.read_table("./cftr_snpeff_ensembl.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df = pd.read_table("./snpeff_test.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df.head()
#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SAMPLE | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 117105737 | . | C | A | . | . | EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... | GT | 0/1 |
1 | 7 | 117105737 | . | C | G | . | . | EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... | GT | 0/1 |
2 | 7 | 117105737 | . | C | T | . | . | EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... | GT | 0/1 |
3 | 7 | 117105737 | . | C | CA | . | . | EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... | GT | 0/1 |
4 | 7 | 117105737 | . | C | CG | . | . | EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... | GT | 0/1 |
5 rows × 10 columns
Since multiple transcripts are represented on every line of our imported VCF, we'll need to break these apart.
First, we are just going to split on ,
and |
and then trim off the remaining garbage. Next, we'll just set the column headers to what we find in the INFO header of the VCF. Finally, we'll make the index of the individual classifications correspond to our imported VCF table.
info_df = snp_eff_ensembl_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
var_class_df = info_df.apply(lambda x: pd.Series(x.split('(')))
detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change',
'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding',
'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'WARNINGS']
del var_class_df[1]
var_class_df.columns = ['Effect']
var_class_df = var_class_df.join(detail_df)
var_class_df.index = var_class_df.index.droplevel(-1)
var_class_df.head()
Effect | Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number | WARNINGS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INTERGENIC | MODIFIER | 1 | NaN | |||||||||
0 | UPSTREAM | MODIFIER | 101 | CFTR | processed_transcript | CODING | ENST00000546407 | 1 | NaN | ||||
1 | INTERGENIC | MODIFIER | 1 | NaN | |||||||||
1 | UPSTREAM | MODIFIER | 101 | CFTR | processed_transcript | CODING | ENST00000546407 | 1 | NaN | ||||
2 | INTERGENIC | MODIFIER | 1 | NaN |
This is just like an inner join on our index. The effect will be to have duplicate variants and unique classifications per row.
combo_df = snp_eff_ensembl_df.join(var_class_df)
combo_df.replace(to_replace='', value=np.nan, inplace=True)
HTML(combo_df[combo_df["Effect"] == "XON"].head().to_html())
Int64Index([], dtype=int64) | Empty DataFrame |
If what we did above seemed odd, here we will remove some of the confusion.
First, we'll get rid of rows not associated with a transcript. Why? It would be very hard to programically compare variants not associated with a transcript, so let's just drop the rows where Transcript_ID is null
Next, we'll reindex the table on two fields. First the genomic position and then the Transript_ID. These fields will correspond to the fields upon which we'll join the output from the other tools.
# combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
# combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID'])
# HTML(combo_df_drop_nulls.head().to_html())
Now we'll import the data from VEP and annovar in a similar manner and store all this in an HDF5 data store and then we can do our analysis and export in anoter notebook. Note the use of context managers. If you don't close a hdf5 store you corrupt it.
store_file="classified_variant_store.h5"
try:
os.remove(store_file)
except OSError:
pass
store = pd.HDFStore(store_file)
store.close()
def build_snpeff_hdf():
with pd.get_store("./classified_variant_store.h5") as store:
for snpeff_file in ["cftr_snpeff_ensembl.vcf", "cftr_snpeff_refseq.vcf"]:
snpeff_df = pd.read_table(snpeff_file, skiprows=7, header=0, usecols=range(10))
var_class_df = snpeff_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
var_class_df = var_class_df.apply(lambda x: pd.Series(x.split('(')))
detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
num_col = len(detail_df.columns)
detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change',
'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding',
'Transcript_ID', 'Exon_Rank', 'Genotype_Number', "Warnings", "Errors"][:num_col]
del var_class_df[1]
var_class_df.columns = ['Effect']
var_class_df = var_class_df.join(detail_df)
var_class_df.index = var_class_df.index.droplevel(-1)
var_class_df.head()
snpeff_df = snpeff_df.join(var_class_df)
# combo_df.replace(to_replace='', value=np.nan, inplace=True)
# combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
# combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID'])
snpeff_df.dropna(axis=1, how='all')
store[snpeff_file.rstrip(".vcf")] = snpeff_df
build_snpeff_hdf()
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block1_values] [items->['ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE', 'Effect', 'Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'Warnings']] warnings.warn(ws, PerformanceWarning)
def build_vep_hdf():
with pd.get_store("./classified_variant_store.h5") as store:
for vep_file in [ "cftr_vep_refseq.vcf", "cftr_vep_ensembl.vcf"]:
vep_df = pd.read_table(vep_file, skiprows=5, header=0, usecols=range(10))
info_df = vep_df["INFO"].str.lstrip("CSQ=").str.split(',').apply(pd.Series, 1).stack()
detail_df = info_df.apply(lambda x: pd.Series(x.split('|')))
headers = "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF"
headers_list = headers.split("|")
detail_df.columns = headers_list
consq_series = detail_df["Consequence"].str.split('&').apply(pd.Series, 1).stack()
consq_series.index = consq_series.index.droplevel(-1)
consq_series.name = "Consequence"
del detail_df["Consequence"]
detail_df = detail_df.join(consq_series)
# Note that it is key to drop index level in the reverse order that they are built up
detail_df.index = detail_df.index.droplevel(-1)
vep_df = vep_df.join(detail_df)
# vep_ensembl_df.replace(to_replace='', value=np.nan, inplace=True)
vep_df = vep_df.dropna(axis=1, how='all')
vep_df = vep_df.drop_duplicates()
store[vep_file.rstrip(".vcf")] = vep_df
build_vep_hdf()
def combine_effects(row, suffix):
if row["Func" + suffix] == "exonic":
return row["ExonicFunc" + suffix]
else:
return row["Func" + suffix]
def combine_hgvs(row, suffix):
if row["Func" + suffix] == "splicing":
return row["hgvs"]
else:
return row["AAChange" + suffix]
def build_annovar_hdf():
with pd.get_store("./classified_variant_store.h5") as store:
for annovar_file in ["cftr_annovar_ensembl.tsv", "cftr_annovar_refseq.tsv"]:
annovar_df = pd.read_table(annovar_file)
if annovar_file == "cftr_annovar_ensembl.tsv":
suffix = ".ensgene"
else:
suffix = ".refgene"
func = annovar_df["Func" + suffix].str.split(";").apply(pd.Series, 1).stack()
func.index = func.index.droplevel(-1)
func.name = "Func" + suffix
del annovar_df["Func" + suffix]
annovar_df = annovar_df.join(func)
annovar_df.loc[annovar_df["Func" + suffix] == 'intergenic', ["Gene" + suffix]] = np.nan
gene_series = annovar_df["Gene" + suffix].str.extract(r'(\w+)\(*')
print gene_series.head()
gene_series.name = "Gene"
annovar_df = annovar_df.join(gene_series)
annovar_df["combined_effect"] = annovar_df.apply(combine_effects, axis=1, args=(suffix,))
annovar_df.drop_duplicates(inplace=True)
splice_hgvs_series = annovar_df["Gene" + suffix].str.findall(r'\w+\((.+)\)').apply(pd.Series).stack()
splice_hgvs_series.index = splice_hgvs_series.index.droplevel(-1)
splice_hgvs_series.name = "hgvs"
annovar_df = annovar_df.join(splice_hgvs_series)
annovar_df["hgvs"] = annovar_df.apply(combine_hgvs, axis=1, args=(suffix,))
del annovar_df["Gene" + suffix]
#We need to keep our original labels to join on in the next step
annovar_df = annovar_df.rename(columns={'Otherinfo.2':'POS', "Otherinfo.4": "REF", "Otherinfo.5": "ALT"})
annovar_df = annovar_df.dropna(axis=1, how='all')
store[annovar_file.rstrip(".tsv")] = annovar_df
build_annovar_hdf()
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: Gene.ensgene, dtype: object 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: Gene.refgene, dtype: object
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.ensgene', 'AAChange.ensgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.ensgene', 'Gene', 'combined_effect', 'hgvs']] warnings.warn(ws, PerformanceWarning) /Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.refgene', 'AAChange.refgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.refgene', 'Gene', 'combined_effect', 'hgvs']] warnings.warn(ws, PerformanceWarning)