Defining Sequence Ontology Mappings

The goal here is to determine if what Snpeff's label means (generally) the same thing as VEP's label means the same thing as what ANNOVAR's label. Unfortunately each classifier uses slightly differt terms so we need to normalize the terms over all three tools. There is some hope that all tools will adopt the terms as defined by The Sequence Ontology Project, but as of May 2014 we still need create this mapping.

First, let's take a look at the terms each tool uses.

SNPeff

SNPeff classifies variants by region and effect.

Region

  • NONE
  • INTERGENIC
  • UPSTREAM
  • UTR_5_PRIME
  • SPLICE_SITE_ACCEPTOR
  • SPLICE_SITE_DONOR
  • SPLICE_SITE_REGION
  • EXON or NONE
  • EXON
  • INTRON
  • UTR_3_PRIME
  • DOWNSTREAM
  • REGULATION

Effect

  • NONE
  • CHROMOSOME
  • CUSTOM
  • CDS
  • INTERGENIC
  • INTERGENIC_CONSERVED
  • UPSTREAM
  • UTR_5_PRIME
  • UTR_5_DELETED
  • START_GAINED
  • SPLICE_SITE_ACCEPTOR
  • SPLICE_SITE_DONOR
  • SPLICE_SITE_REGION
  • INTRAGENIC
  • START_LOST
  • SYNONYMOUS_START
  • NON_SYNONYMOUS_START
  • GENE
  • TRANSCRIPT
  • EXON
  • EXON_DELETED
  • NON_SYNONYMOUS_CODING
  • SYNONYMOUS_CODING
  • FRAME_SHIFT
  • CODON_CHANGE
  • CODON_INSERTION
  • CODON_CHANGE_PLUS_CODON_INSERTION
  • CODON_DELETION
  • CODON_CHANGE_PLUS_CODON_DELETION
  • STOP_GAINED
  • SYNONYMOUS_STOP
  • STOP_LOST
  • RARE_AMINO_ACID
  • INTRON
  • INTRON_CONSERVED
  • UTR_3_PRIME
  • UTR_3_DELETED
  • DOWNSTREAM
  • REGULATION

VEP

VEP uses the sequence ontology terms as the SO project defines them.

  • transcript_ablation
  • splice_donor_variant
  • splice_acceptor_variant
  • stop_gained
  • frameshift_variant
  • stop_lost
  • initiator_codon_variant
  • inframe_insertion
  • inframe_deletion
  • missense_variant
  • transcript_amplification
  • splice_region_variant
  • incomplete_terminal_codon_variant
  • synonymous_variant
  • stop_retained_variant
  • coding_sequence_variant
  • mature_miRNA_variant
  • 5_prime_UTR_variant
  • 3_prime_UTR_variant
  • non_coding_exon_variant
  • nc_transcript_variant
  • intron_variant
  • NMD_transcript_variant
  • upstream_gene_variant
  • downstream_gene_variant
  • TFBS_ablation
  • TFBS_amplification
  • TF_binding_site_variant
  • regulatory_region_variant
  • regulatory_region_ablation
  • regulatory_region_amplification
  • feature_elongation
  • feature_truncation
  • intergenic_variant

ANNOVAR

ANNOVAR does something similar to SNPeff in that it breaks it's classification into a region and an effect

Region

  • exonic
  • splicing
  • ncRNA
  • UTR5
  • UTR3
  • intronic
  • upstream
  • downstream
  • intergenic

Effect

  • frameshift insertion
  • frameshift deletion
  • frameshift block substitution
  • stopgain
  • stoploss
  • nonframeshift insertion
  • nonframeshift deletion
  • nonframeshift block substitution
  • nonsynonymous SNV
  • synonymous SNV
  • unknown

Mapping

I can't decide what resolution the buckets should be (ie feature type, or just variant category-LOF, Missense, Syn, etc). Note that SNPeff catagorizes variants that chage the start/stop codon to another start/stop codon as both synonymous and non-synonomous. Also note that VEP doesn't differentiate between SNPs and mod 3 indels. Lastly, the annotation of splice site variants is tricky. ANNOVAR annotates anything within 2bp of a splice site (on either side). This is similar to the splice_donor_variant and splice_acceptor_variant, except what with this classification variants on the exonic side of the splice site are not counted. The splice_region_variant has no analogous match in ANNOVAR and will be ignored in this analysis.

Normalized SO Name ANNOVAR VEP SNPeff
frameshift_variant frameshift_deletion, frameshift_insertion, frameshift_block_substitution frameshift_variant FRAME_SHIFT
stop_gained stopgain stop_gained STOP_GAINED
stop_lost stoploss stop_lost STOP_LOST
splicing_variant splicing splice_donor_variant, splice_acceptor_variant SPLICE_SITE_DONOR, SPLICE_SITE_ACCEPTOR
inframe_variant nonframeshift_deletion, nonframeshift_insertion inframe_insertion, inframe_deletion CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION
nonsynonymous_variant nonsynonymous_SNV, nonframeshift_block_substitution initiator_codon_variant, missense_variant, stop_retained_variant, incomplete_terminal_codon_variant CODON_CHANGE, NON_SYNONYMOUS_CODING, NON_SYNONYMOUS_START, NON_SYNONYMOUS_STOP, START_LOST
synonymous_variant synonymous_SNV synonymous_variant SYNONYMOUS_CODING, SYNONYMOUS_START, SYNONYMOUS_STOP
3_prime_UTR_variant UTR3 3_prime_UTR_variant UTR_3_PRIME, UTR_3_DELETED
5_prime_UTR_variant UTR5 5_prime_UTR_variant UTR_5_PRIME, UTR_5_DELETED, START_GAINED
upstream_gene_variant upstream upstream_gene_variant UPSTREAM
downstream_gene_variant downstream downstream_gene_variant DOWNSTREAM
regulatory_region_variant N/A regulatory_region_variant, regulatory_region_ablation, regulatory_region_amplification REGULATION
intron_variant intronic intron_variant INTRON, INTRON_CONSERVED
intergenic_variant intergenic intergenic_variant INTERGENIC, INTERGENIC_CONSERVED
Ignored unknown, exonic, ncRNA transcript_ablation , coding_sequence_variant, splice_region_variant, feature_truncation, feature_elongation, TF_binding_site_variant, TFBS_amplification, TFBS_ablation, NMD_transcript_variant, nc_transcript_variant, non_coding_exon_variant, mature_miRNA_variant EXON, GENE, EXON_DELETED, CDS, CHROMOSOME, SPLICE_SITE_REGION, SPLICE_SITE_BRANCH, SPLICE_SITE_BRANCH_U12, MICRO_RNA, INTRAGENIC
In [1]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
snpeff_map = {"SNPeff":"Normalized SO Name",
"FRAME_SHIFT":"frameshift_variant",
"STOP_GAINED":"stop_gained",
"STOP_LOST":"stop_lost",
"SPLICE_SITE_DONOR": "splicing_variant", 
"SPLICE_SITE_ACCEPTOR": "splicing_variant", 
"CODON_INSERTION": "inframe_variant",
"CODON_CHANGE_PLUS_CODON_INSERTION": "inframe_variant",
"CODON_DELETION": "inframe_variant",
"CODON_CHANGE_PLUS_CODON_DELETION": "inframe_variant",
"NON_SYNONYMOUS_CODING": "nonsynonymous_variant",
"NON_SYNONYMOUS_START": "nonsynonymous_variant",
"NON_SYNONYMOUS_STOP": "nonsynonymous_variant",
"START_LOST": "nonsynonymous_variant",
"CODON_CHANGE": "nonsynonymous_variant",
"SYNONYMOUS_CODING": "synonymous_variant",
"SYNONYMOUS_START": "synonymous_variant",
"SYNONYMOUS_STOP": "synonymous_variant",
"UTR_3_PRIME": "3_prime_UTR_variant", 
"UTR_3_DELETED": "3_prime_UTR_variant",
"UTR_5_PRIME": "5_prime_UTR_variant", 
"UTR_5_DELETED": "5_prime_UTR_variant",
"START_GAINED": "5_prime_UTR_variant",
"UPSTREAM":"upstream_gene_variant",
"DOWNSTREAM":"downstream_gene_variant",
"REGULATION":"regulatory_region_variant",
"INTRON": "intron_variant", 
"INTRON_CONSERVED": "intron_variant", 
"INTERGENIC":"intergenic_variant", 
"INTERGENIC_CONSERVED":"intergenic_variant",
"EXON": "ignored", 
"GENE": "ignored", 
"EXON_DELETED": "ignored", 
"CDS": "ignored", 
"CHROMOSOME ": "ignored",
"INTRAGENIC":"ignored",
"MICRO_RNA":"ignored",
"SPLICE_SITE_REGION": "ignored", 
"SPLICE_SITE_BRANCH": "ignored", 
"SPLICE_SITE_BRANCH_U12": "ignored"}

#annovar is not consistent with their naming of functional consequences, hence the multiple representations of frameshift_deletion, etc.
annovar_map = {
"frameshift deletion": "frameshift_variant",
"frameshift_deletion": "frameshift_variant",
"frameshift insertion": "frameshift_variant",
"frameshift_insertion": "frameshift_variant", 
"frameshift_block_substitution": "frameshift_variant",
"stopgain": "stop_gained",
"stopgain SNV": "stop_gained",
"stoploss": "stop_lost",
"stoploss SNV": "stop_lost",
"splicing": "splicing_variant",
"nonframeshift_deletion": "inframe_variant",
"nonframeshift deletion": "inframe_variant",
"nonframeshift_insertion": "inframe_variant",
"nonframeshift insertion": "inframe_variant",
"nonframeshift_block_substitution": "nonsynonymous_variant",
"nonsynonymous_SNV": "nonsynonymous_variant",
"nonsynonymous SNV": "nonsynonymous_variant",
"synonymous_SNV": "synonymous_variant",
"synonymous SNV": "synonymous_variant",
"UTR3": "3_prime_UTR_variant",
"UTR5": "5_prime_UTR_variant",
"ncRNA": "ignored",
"ncRNA_exonic": "ignored",
"ncRNA_UTR3": "ignored",
"ncRNA_UTR5": "ignored",
"ncRNA_intronic": "ignored",
"upstream": "upstream_gene_variant",
"downstream": "downstream_gene_variant",
"intronic": "intron_variant",
"intergenic": "intergenic_variant",
"unknown": "ignored", 
"exonic": "ignored"}

vep_map= {"frameshift_variant": "frameshift_variant",
"stop_gained": "stop_gained",
"stop_lost": "stop_lost",
"splice_donor_variant": "splicing_variant",
"splice_acceptor_variant": "splicing_variant",
"inframe_insertion": "inframe_variant",
"inframe_deletion": "inframe_variant",
"initiator_codon_variant": "nonsynonymous_variant",
"missense_variant": "nonsynonymous_variant",
"stop_retained_variant": "nonsynonymous_variant",
"incomplete_terminal_codon_variant": "nonsynonymous_variant",
"synonymous_variant": "synonymous_variant",
"3_prime_UTR_variant": "3_prime_UTR_variant",
"5_prime_UTR_variant": "5_prime_UTR_variant",
"upstream_gene_variant": "upstream_gene_variant",
"downstream_gene_variant": "downstream_gene_variant",
"regulatory_region_variant": "regulatory_region_variant",
"regulatory_region_ablation": "regulatory_region_variant",
"regulatory_region_amplification": "regulatory_region_variant",
"intron_variant": "intron_variant",
"intergenic_variant": "intergenic_variant",
"transcript_ablation": "ignored",
"coding_sequence_variant": "ignored",
"splice_region_variant": "ignored",
"feature_truncation": "ignored",
"feature_elongation": "ignored",
"TF_binding_site_variant": "ignored",
"TFBS_amplification": "ignored",
"TFBS_ablation": "ignored",
"NMD_transcript_variant": "ignored",
"nc_transcript_variant": "ignored",
"non_coding_exon_variant": "ignored",
"mature_miRNA_variant": "ignored"}
In [3]:
with pd.get_store('classified_variant_store.h5') as store:
    print store.keys()
['/cftr_annovar_ensembl', '/cftr_annovar_ensembl_subset', '/cftr_annovar_refseq', '/cftr_annovar_refseq_subset', '/cftr_snpeff_ensembl', '/cftr_snpeff_ensembl_subset', '/cftr_snpeff_refseq', '/cftr_snpeff_refseq_subset', '/cftr_vep_ensembl', '/cftr_vep_ensembl_subset', '/cftr_vep_refseq', '/cftr_vep_refseq_subset', '/vep_subset_ensembl']
In [4]:
for snpeff_store in ["cftr_snpeff_ensembl", "cftr_snpeff_refseq"]:
    with pd.get_store('classified_variant_store.h5') as store:
        snpeff_df = store.get(snpeff_store)
        snpeff_df["normalized_so"] = snpeff_df.apply(lambda row: snpeff_map[row["Effect"]], axis=1)
        subset_snpeff = snpeff_df[["POS", "ID", "REF", "ALT", "Effect", "Gene_Name", "Transcript_ID", "Amino_Acid_Change"]]
        subset_snpeff.rename(columns={"Amino_Acid_Change": "hgvs_snpeff"}, inplace=True)
        subset_snpeff["normalized_so_snpeff"] = subset_snpeff.apply(lambda row: snpeff_map[row["Effect"]], axis=1)
        store[snpeff_store + "_subset"] = subset_snpeff
        print(str.title(" ".join(snpeff_store.split("_")) + " Metrics"))
        print subset_snpeff["Effect"].value_counts()
        print("\n")
        print subset_snpeff.count()
        print("\n\n")
        
del snpeff_df
del subset_snpeff
Cftr Snpeff Ensembl Metrics
INTRON                               439230
FRAME_SHIFT                          113758
DOWNSTREAM                            99185
UPSTREAM                              83809
NON_SYNONYMOUS_CODING                 30988
EXON                                  30736
SPLICE_SITE_REGION                    23430
CODON_CHANGE_PLUS_CODON_INSERTION     15905
CODON_INSERTION                       12429
UTR_5_PRIME                           10280
SYNONYMOUS_CODING                      9371
STOP_GAINED                            8127
CODON_CHANGE_PLUS_CODON_DELETION       7606
UTR_3_PRIME                            6664
CODON_DELETION                         6403
SPLICE_SITE_ACCEPTOR                   2976
SPLICE_SITE_DONOR                      2976
INTERGENIC                             1403
START_GAINED                            286
START_LOST                              125
STOP_LOST                               115
NON_SYNONYMOUS_START                      8
UTR_5_DELETED                             6
SYNONYMOUS_STOP                           5
INTRAGENIC                                3
dtype: int64


POS                     905824
ID                      905824
REF                     905824
ALT                     905824
Effect                  905824
Gene_Name               905824
Transcript_ID           905824
hgvs_snpeff             905824
normalized_so_snpeff    905824
dtype: int64



Cftr Snpeff Refseq Metrics
INTRON                               88950
FRAME_SHIFT                          35675
INTERGENIC                           32500
UPSTREAM                             14129
NON_SYNONYMOUS_CODING                 9721
DOWNSTREAM                            9173
SPLICE_SITE_REGION                    6690
CODON_CHANGE_PLUS_CODON_INSERTION     4996
CODON_INSERTION                       3890
UTR_3_PRIME                           2982
SYNONYMOUS_CODING                     2940
STOP_GAINED                           2557
CODON_CHANGE_PLUS_CODON_DELETION      2377
CODON_DELETION                        2016
UTR_5_PRIME                           1851
SPLICE_SITE_ACCEPTOR                   806
SPLICE_SITE_DONOR                      806
START_GAINED                            44
START_LOST                              31
STOP_LOST                               30
INTRAGENIC                               6
NON_SYNONYMOUS_START                     2
SYNONYMOUS_STOP                          1
dtype: int64


POS                     222173
ID                      222173
REF                     222173
ALT                     222173
Effect                  222173
Gene_Name               222173
Transcript_ID           222173
hgvs_snpeff             222173
normalized_so_snpeff    222173
dtype: int64



/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/frame.py:2175: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
  **kwargs)
In [5]:
for vep_store in ["cftr_vep_ensembl", "cftr_vep_refseq"]:
    with pd.get_store('classified_variant_store.h5') as store:
        vep_df = store.get(vep_store)
        vep_df["normalized_so"] = vep_df.apply(lambda row: vep_map[row["Consequence"]], axis=1)
        subset_vep = vep_df[["POS", "ID", "REF", "ALT", "Gene", "Consequence", "Feature", "HGVSc", "HGVSp" ]]
        subset_vep["hgvs_vep"] = subset_vep["HGVSp"].map(str) + subset_vep["HGVSc"]
        del subset_vep["HGVSp"]
        del subset_vep["HGVSc"]
        subset_vep["normalized_so_vep"] = subset_vep.apply(lambda row: vep_map[row["Consequence"]], axis=1)
        store[vep_store + "_subset"] = subset_vep
        print(str.title(" ".join(vep_store.split("_")) + " Metrics"))
        print subset_vep["Consequence"].value_counts()
        print("\n")
        print subset_vep.count()
        print("\n\n")

del vep_df
del subset_vep
Cftr Vep Ensembl Metrics
intron_variant                       434622
feature_elongation                   362882
feature_truncation                   132807
nc_transcript_variant                128229
frameshift_variant                   113706
downstream_gene_variant               99177
upstream_gene_variant                 83853
missense_variant                      30988
non_coding_exon_variant               30608
inframe_insertion                     28494
splice_region_variant                 24192
NMD_transcript_variant                23212
regulatory_region_variant             16236
inframe_deletion                      13995
5_prime_UTR_variant                   10257
synonymous_variant                     9359
3_prime_UTR_variant                    6654
stop_gained                            4877
splice_donor_variant                   2208
splice_acceptor_variant                2208
coding_sequence_variant                 590
stop_lost                               127
incomplete_terminal_codon_variant        56
initiator_codon_variant                  55
stop_retained_variant                     9
dtype: int64


POS                  1559401
ID                   1559401
REF                  1559401
ALT                  1559401
Gene                 1559401
Consequence          1559401
Feature              1559401
hgvs_vep             1559401
normalized_so_vep    1559401
dtype: int64



Cftr Vep Refseq Metrics
intron_variant               283392
feature_elongation           254590
frameshift_variant           105713
downstream_gene_variant       97429
feature_truncation            92432
upstream_gene_variant         87006
missense_variant              28780
inframe_insertion             26482
splice_region_variant         19656
regulatory_region_variant     16236
inframe_deletion              13006
5_prime_UTR_variant           11277
3_prime_UTR_variant            9279
intergenic_variant             9198
synonymous_variant             8687
stop_gained                    4582
splice_acceptor_variant        1794
splice_donor_variant           1794
coding_sequence_variant         510
initiator_codon_variant         141
stop_lost                       130
stop_retained_variant             6
dtype: int64


POS                  1072120
ID                   1072120
REF                  1072120
ALT                  1072120
Gene                 1072120
Consequence          1072120
Feature              1072120
hgvs_vep             1072120
normalized_so_vep    1072120
dtype: int64



In [7]:
for annovar_store in ["cftr_annovar_ensembl", "cftr_annovar_refseq"]:
    with pd.get_store('classified_variant_store.h5') as store:
        annovar_df = store.get(annovar_store)
        annovar_df["normalized_so"] = annovar_df.apply(lambda row: annovar_map[row["combined_effect"]], axis=1)
        subset_annovar = annovar_df[["POS", "REF", "ALT", "Gene", "combined_effect", "hgvs" ]]
        subset_annovar["normalized_so_annovar"] = subset_annovar.apply(lambda row: annovar_map[row["combined_effect"]], axis=1)
        store[annovar_store + "_subset"] = subset_annovar
        print(str.title(" ".join(annovar_store.split("_")) + " Metrics"))
        print subset_annovar["combined_effect"].value_counts()
        print("\n")
        print subset_annovar.count()
        print("\n\n")

del annovar_df
del subset_annovar
Cftr Annovar Ensembl Metrics
intronic                   86193
frameshift insertion       26252
ncRNA_intronic             15085
nonsynonymous SNV           9730
frameshift deletion         8811
nonframeshift insertion     8111
intergenic                  7924
UTR5                        5704
nonframeshift deletion      4387
UTR3                        3918
synonymous SNV              2941
stopgain SNV                2031
unknown                     1875
splicing                    1590
ncRNA_exonic                1515
upstream                    1408
ncRNA_UTR3                   969
stoploss SNV                  20
ncRNA_UTR5                     8
dtype: int64


POS                      188472
REF                      188472
ALT                      188472
Gene                     180548
combined_effect          188472
hgvs                      65454
normalized_so_annovar    188472
dtype: int64



Cftr Annovar Refseq Metrics
intronic                   87338
frameshift insertion       26252
intergenic                 13650
nonsynonymous SNV           9730
upstream                    9682
downstream                  9178
frameshift deletion         8808
nonframeshift insertion     8111
nonframeshift deletion      4390
UTR3                        2963
synonymous SNV              2941
stopgain SNV                2031
UTR5                        1848
splicing                    1456
stoploss SNV                  20
dtype: int64


POS                      188398
REF                      188398
ALT                      188398
Gene                     174748
combined_effect          188398
hgvs                      63531
normalized_so_annovar    188398
dtype: int64



/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['REF', 'ALT', 'Gene', 'combined_effect', 'hgvs', 'normalized_so_annovar']]

  warnings.warn(ws, PerformanceWarning)
In [8]:
for tx_set in ["ensembl", "refseq"]:
    with pd.get_store('classified_variant_store.h5') as store:
        snpeff_df = store.get("cftr_snpeff_" + tx_set + "_subset")
        vep_df = store.get("cftr_vep_" + tx_set + "_subset")
        annovar_df = store.get("cftr_annovar_" + tx_set + "_subset")
    vc_snpeff = snpeff_df.groupby(["normalized_so_snpeff"]).size()
    vc_snpeff.name = "SNPeff"
    vc_vep = vep_df.groupby(["normalized_so_vep"]).size()
    vc_vep.name = "VEP"
    vc_annovar = annovar_df.groupby(["normalized_so_annovar"]).size()
    vc_annovar.name = "Annovar"
    vc_df = pd.DataFrame([vc_snpeff, vc_vep, vc_annovar])
    vc_df.transpose().plot(kind="barh", fontsize=13, figsize=(16,8))
    print vc_df.transpose()
                           SNPeff     VEP  Annovar
3_prime_UTR_variant          6664    6654     3918
5_prime_UTR_variant         10572   10257     5704
downstream_gene_variant     99185   99177      NaN
frameshift_variant         113758  113706    35063
ignored                     54169  702520    19452
inframe_variant             42343   42489    12498
intergenic_variant           1403     NaN     7924
intron_variant             439230  434622    86193
nonsynonymous_variant       31121   31108     9730
regulatory_region_variant     NaN   16236      NaN
splicing_variant             5952    4416     1590
stop_gained                  8127    4877     2031
stop_lost                     115     127       20
synonymous_variant           9376    9359     2941
upstream_gene_variant       83809   83853     1408

[15 rows x 3 columns]
                           SNPeff     VEP  Annovar
3_prime_UTR_variant          2982    9279     2963
5_prime_UTR_variant          1895   11277     1848
downstream_gene_variant      9173   97429     9178
frameshift_variant          35675  105713    35060
ignored                      6696  367188      NaN
inframe_variant             13279   39488    12501
intergenic_variant          32500    9198    13650
intron_variant              88950  283392    87338
nonsynonymous_variant        9754   28927     9730
regulatory_region_variant     NaN   16236      NaN
splicing_variant             1612    3588     1456
stop_gained                  2557    4582     2031
stop_lost                      30     130       20
synonymous_variant           2941    8687     2941
upstream_gene_variant       14129   87006     9682

[15 rows x 3 columns]