Processing returned VCF's

Now we've got output vcf's from Snpeff and VEP. What we need to do is import them into a dataframe. Unfortunately this is not completely straightfoward since they encode their output into the Info column.

Snpeff's formatting looks like:

##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )' ">

An actual line might look like:

EFF=DOWNSTREAM(MODIFIER||28|||CFTR|processed_transcript|CODING|ENST00000610149||1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000429014|2|1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000608965|3|1),INTRON(MODIFIER|||c.3207+2163A>G|1150|CTTNBP2|protein_coding|CODING|ENST00000446636|20|1|WARNING_TRANSCRIPT_NO_START_CODON),INTRON(MODIFIER|||c.4746+2163A>G|1663|CTTNBP2|protein_coding|CODING|ENST00000160373|22|1),INTRON(MODIFIER|||n.*2660+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000441556|22|1),INTRON(MODIFIER|||n.*463+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000445366|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cTc/cCc|p.Leu155Pro/c.464T>C|155|CFTR|protein_coding|CODING|ENST00000600166|4|1|WARNING_TRANSCRIPT_NO_START_CODON)

VEP's formatting looks like:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">

An actual line might look like:

CSQ=C|83992|XM_005250635.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250635.1:c.2649+2163A>G|||||,C|ENSESTG00000008055|ENSESTT00000020365|Transcript|downstream_gene_variant|||||||||||||2163|-1|||||||protein_coding||||||,C|83992|NM_033427.2|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|NM_033427.2:c.4746+2163A>G|||||,C|83992|XM_005250634.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250634.1:c.4692+2163A>G|||||,C|CCDS5774.1|CCDS5774.1|Transcript|intron_variant||||||||||||||-1|||||||protein_coding|CCDS5774.1:c.4746+2163A>G|||||,C||ENSR00000683158|RegulatoryFeature|regulatory_region_variant|||||||||||||||||||||||||||

Annovar's formatting looks like:

Chr Start   End Ref Alt Func.ensgene    Gene.ensgene    ExonicFunc.ensgene  AAChange.ensgene    phastConsElements46way  esp6500si_all   1000g2012apr_all    snp138  SIFT_score  SIFT_score_converted    SIFT_pred   Polyphen2_HDIV_score    Polyphen2_HDIV_pred Polyphen2_HVAR_score    Polyphen2_HVAR_pred LRT_score   LRT_score_converted LRT_pred    MutationTaster_score    MutationTaster_score_converted  MutationTaster_pred MutationAssessor_score  MutationAssessor_score_converted    MutationAssessor_pred   FATHMM_score    FATHMM_score_converted  FATHMM_pred RadialSVM_score RadialSVM_score_converted   RadialSVM_pred  LR_score    LR_pred GERP++_RS   phyloP  29way_logOdds   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo   Otherinfo

An actual line might look like:

7   117171151   117171151   -   TA  exonic  ENSG00000001626 frameshift insertion    ENSG00000001626:ENST00000003084:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000426809:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000454343:exon4:c.472_473insTA:p.S158fs   Score=645;Name=lod=560                                                                                                                          het 7   117171151   .   A   ATA .   .   .   GT  0/1


Utimate objective

It helps to keep in mind what we are trying to accomplish with this bit of data processing. We want to be able to compare the classification of every Transcript-Variant pair between tools. There are two things we'll need to do to our data to accomplish this goal.

In [1]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import os
In [5]:
# snp_eff_ensembl_df = pd.read_table("./cftr_snpeff_ensembl.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df = pd.read_table("./snpeff_test.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df.head()
Out[5]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
0 7 117105737 . C A . . EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... GT 0/1
1 7 117105737 . C G . . EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... GT 0/1
2 7 117105737 . C T . . EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... GT 0/1
3 7 117105737 . C CA . . EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... GT 0/1
4 7 117105737 . C CG . . EFF=INTERGENIC(MODIFIER||||||||||1),UPSTREAM(M... GT 0/1

5 rows × 10 columns

Break classifications onto separate lines

Since multiple transcripts are represented on every line of our imported VCF, we'll need to break these apart.

First, we are just going to split on , and | and then trim off the remaining garbage. Next, we'll just set the column headers to what we find in the INFO header of the VCF. Finally, we'll make the index of the individual classifications correspond to our imported VCF table.

In [49]:
info_df = snp_eff_ensembl_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
var_class_df = info_df.apply(lambda x: pd.Series(x.split('(')))
detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 
                     'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 
                     'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'WARNINGS']
del var_class_df[1]
var_class_df.columns = ['Effect']
var_class_df = var_class_df.join(detail_df)
var_class_df.index = var_class_df.index.droplevel(-1)
var_class_df.head()
Out[49]:
Effect Effect_Impact Functional_Class Codon_Change Amino_Acid_Change Amino_Acid_length Gene_Name Transcript_BioType Gene_Coding Transcript_ID Exon_Rank Genotype_Number WARNINGS
0 INTERGENIC MODIFIER 1 NaN
0 UPSTREAM MODIFIER 101 CFTR processed_transcript CODING ENST00000546407 1 NaN
1 INTERGENIC MODIFIER 1 NaN
1 UPSTREAM MODIFIER 101 CFTR processed_transcript CODING ENST00000546407 1 NaN
2 INTERGENIC MODIFIER 1 NaN

Join the classifications back to the orginal table

This is just like an inner join on our index. The effect will be to have duplicate variants and unique classifications per row.

In [50]:
combo_df = snp_eff_ensembl_df.join(var_class_df)
combo_df.replace(to_replace='', value=np.nan, inplace=True) 
HTML(combo_df[combo_df["Effect"] == "XON"].head().to_html())
Out[50]:
Int64Index([], dtype=int64) Empty DataFrame

Tweak the index

If what we did above seemed odd, here we will remove some of the confusion.

First, we'll get rid of rows not associated with a transcript. Why? It would be very hard to programically compare variants not associated with a transcript, so let's just drop the rows where Transcript_ID is null

Next, we'll reindex the table on two fields. First the genomic position and then the Transript_ID. These fields will correspond to the fields upon which we'll join the output from the other tools.

In [51]:
# combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
# combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID']) 
# HTML(combo_df_drop_nulls.head().to_html())

Get all the data

Now we'll import the data from VEP and annovar in a similar manner and store all this in an HDF5 data store and then we can do our analysis and export in anoter notebook. Note the use of context managers. If you don't close a hdf5 store you corrupt it.

In [3]:
store_file="classified_variant_store.h5"
try:
    os.remove(store_file)
except OSError:
    pass
store = pd.HDFStore(store_file)
store.close()
In [4]:
def build_snpeff_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for snpeff_file in ["cftr_snpeff_ensembl.vcf", "cftr_snpeff_refseq.vcf"]:
            snpeff_df = pd.read_table(snpeff_file, skiprows=7, header=0, usecols=range(10))
            var_class_df = snpeff_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
            var_class_df = var_class_df.apply(lambda x: pd.Series(x.split('(')))
            detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
            num_col = len(detail_df.columns)
            detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 
                                 'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 
                                 'Transcript_ID', 'Exon_Rank', 'Genotype_Number', "Warnings", "Errors"][:num_col]
            del var_class_df[1]
            var_class_df.columns = ['Effect']
            var_class_df = var_class_df.join(detail_df)
            var_class_df.index = var_class_df.index.droplevel(-1)
            var_class_df.head()
            snpeff_df = snpeff_df.join(var_class_df)
            # combo_df.replace(to_replace='', value=np.nan, inplace=True)
            # combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
            # combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID'])
            snpeff_df.dropna(axis=1, how='all')
            store[snpeff_file.rstrip(".vcf")] = snpeff_df

build_snpeff_hdf()
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE', 'Effect', 'Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'Warnings']]

  warnings.warn(ws, PerformanceWarning)
In [ ]:
def build_vep_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for vep_file in [ "cftr_vep_refseq.vcf", "cftr_vep_ensembl.vcf"]:
            vep_df = pd.read_table(vep_file, skiprows=5, header=0, usecols=range(10))
            info_df = vep_df["INFO"].str.lstrip("CSQ=").str.split(',').apply(pd.Series, 1).stack()
            detail_df = info_df.apply(lambda x: pd.Series(x.split('|')))
            headers = "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF"
            headers_list = headers.split("|")
            detail_df.columns = headers_list
            consq_series = detail_df["Consequence"].str.split('&').apply(pd.Series, 1).stack()
            consq_series.index = consq_series.index.droplevel(-1)
            consq_series.name = "Consequence"
            del detail_df["Consequence"]
            detail_df = detail_df.join(consq_series)
            # Note that it is key to drop index level in the reverse order that they are built up
            detail_df.index = detail_df.index.droplevel(-1)
            vep_df = vep_df.join(detail_df)
            # vep_ensembl_df.replace(to_replace='', value=np.nan, inplace=True)
            vep_df = vep_df.dropna(axis=1, how='all')
            vep_df = vep_df.drop_duplicates()
            store[vep_file.rstrip(".vcf")] = vep_df

build_vep_hdf()
In [5]:
def combine_effects(row, suffix):
    if row["Func" + suffix] == "exonic":
        return row["ExonicFunc" + suffix]
    else:
        return row["Func" + suffix]
    
def combine_hgvs(row, suffix):
    if row["Func" + suffix] == "splicing":
        return row["hgvs"]
    else:
        return row["AAChange" + suffix]

def build_annovar_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for annovar_file in ["cftr_annovar_ensembl.tsv", "cftr_annovar_refseq.tsv"]:
            annovar_df = pd.read_table(annovar_file)
            if annovar_file == "cftr_annovar_ensembl.tsv":
                suffix = ".ensgene"
            else:
                suffix = ".refgene"
            func = annovar_df["Func" + suffix].str.split(";").apply(pd.Series, 1).stack()
            func.index = func.index.droplevel(-1)
            func.name = "Func" + suffix
            del annovar_df["Func" + suffix]
            annovar_df = annovar_df.join(func)
            annovar_df.loc[annovar_df["Func" + suffix] == 'intergenic', ["Gene" + suffix]] = np.nan
            gene_series = annovar_df["Gene" + suffix].str.extract(r'(\w+)\(*')
            print gene_series.head()
            gene_series.name = "Gene"
            annovar_df = annovar_df.join(gene_series)
            annovar_df["combined_effect"] = annovar_df.apply(combine_effects, axis=1, args=(suffix,))
            annovar_df.drop_duplicates(inplace=True)
            splice_hgvs_series = annovar_df["Gene" + suffix].str.findall(r'\w+\((.+)\)').apply(pd.Series).stack()
            splice_hgvs_series.index = splice_hgvs_series.index.droplevel(-1)
            splice_hgvs_series.name = "hgvs"
            annovar_df = annovar_df.join(splice_hgvs_series)
            annovar_df["hgvs"] = annovar_df.apply(combine_hgvs, axis=1, args=(suffix,))
            del annovar_df["Gene" + suffix]
            #We need to keep our original labels to join on in the next step
            annovar_df = annovar_df.rename(columns={'Otherinfo.2':'POS', "Otherinfo.4": "REF", "Otherinfo.5": "ALT"})
            annovar_df = annovar_df.dropna(axis=1, how='all')
            store[annovar_file.rstrip(".tsv")] = annovar_df        
build_annovar_hdf()
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Gene.ensgene, dtype: object
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Gene.refgene, dtype: object
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.ensgene', 'AAChange.ensgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.ensgene', 'Gene', 'combined_effect', 'hgvs']]

  warnings.warn(ws, PerformanceWarning)
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.refgene', 'AAChange.refgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.refgene', 'Gene', 'combined_effect', 'hgvs']]

  warnings.warn(ws, PerformanceWarning)