Processing returned VCF's¶

Now we've got output vcf's from Snpeff and VEP. What we need to do is import them into a dataframe. Unfortunately this is not completely straightfoward since they encode their output into the Info column.

Snpeff's formatting looks like:¶

##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )' ">

An actual line might look like:

EFF=DOWNSTREAM(MODIFIER||28|||CFTR|processed_transcript|CODING|ENST00000610149||1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000429014|2|1),EXON(MODIFIER|||||CFTR|processed_transcript|CODING|ENST00000608965|3|1),INTRON(MODIFIER|||c.3207+2163A>G|1150|CTTNBP2|protein_coding|CODING|ENST00000446636|20|1|WARNING_TRANSCRIPT_NO_START_CODON),INTRON(MODIFIER|||c.4746+2163A>G|1663|CTTNBP2|protein_coding|CODING|ENST00000160373|22|1),INTRON(MODIFIER|||n.*2660+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000441556|22|1),INTRON(MODIFIER|||n.*463+2163A>G||CTTNBP2|nonsense_mediated_decay|CODING|ENST00000445366|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cTc/cCc|p.Leu155Pro/c.464T>C|155|CFTR|protein_coding|CODING|ENST00000600166|4|1|WARNING_TRANSCRIPT_NO_START_CODON)

VEP's formatting looks like:¶

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">

An actual line might look like:

CSQ=C|83992|XM_005250635.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250635.1:c.2649+2163A>G|||||,C|ENSESTG00000008055|ENSESTT00000020365|Transcript|downstream_gene_variant|||||||||||||2163|-1|||||||protein_coding||||||,C|83992|NM_033427.2|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|NM_033427.2:c.4746+2163A>G|||||,C|83992|XM_005250634.1|Transcript|intron_variant||||||||||||||-1||CTTNBP2|||||protein_coding|XM_005250634.1:c.4692+2163A>G|||||,C|CCDS5774.1|CCDS5774.1|Transcript|intron_variant||||||||||||||-1|||||||protein_coding|CCDS5774.1:c.4746+2163A>G|||||,C||ENSR00000683158|RegulatoryFeature|regulatory_region_variant|||||||||||||||||||||||||||

Annovar's formatting looks like:¶

Chr	Start	End	Ref	Alt	Func.ensgene	Gene.ensgene	ExonicFunc.ensgene	AAChange.ensgene	phastConsElements46way	esp6500si_all	1000g2012apr_all	snp138	SIFT_score	SIFT_score_converted	SIFT_pred	Polyphen2_HDIV_score	Polyphen2_HDIV_pred	Polyphen2_HVAR_score	Polyphen2_HVAR_pred	LRT_score	LRT_score_converted	LRT_pred	MutationTaster_score	MutationTaster_score_converted	MutationTaster_pred	MutationAssessor_score	MutationAssessor_score_converted	MutationAssessor_pred	FATHMM_score	FATHMM_score_converted	FATHMM_pred	RadialSVM_score	RadialSVM_score_converted	RadialSVM_pred	LR_score	LR_pred	GERP++_RS	phyloP	29way_logOdds	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo	Otherinfo

An actual line might look like:

7	117171151	117171151	-	TA	exonic	ENSG00000001626	frameshift insertion	ENSG00000001626:ENST00000003084:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000426809:exon4:c.472_473insTA:p.S158fs,ENSG00000001626:ENST00000454343:exon4:c.472_473insTA:p.S158fs	Score=645;Name=lod=560																															het	7	117171151	.	A	ATA	.	.	.	GT	0/1

Utimate objective¶

It helps to keep in mind what we are trying to accomplish with this bit of data processing. We want to be able to compare the classification of every Transcript-Variant pair between tools. There are two things we'll need to do to our data to accomplish this goal.

In [1]:

from IPython.display import HTML
import pandas as pd
import numpy as np
import os

In [5]:

# snp_eff_ensembl_df = pd.read_table("./cftr_snpeff_ensembl.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df = pd.read_table("./snpeff_test.vcf", skiprows=7, header=0, usecols=range(10))
snp_eff_ensembl_df.head()

Out[5]:

	#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
0	7	117105737	.	C	A	.	.	EFF=INTERGENIC(MODIFIER\|\|\|\|\|\|\|\|\|\|1),UPSTREAM(M...	GT	0/1
1	7	117105737	.	C	G	.	.	EFF=INTERGENIC(MODIFIER\|\|\|\|\|\|\|\|\|\|1),UPSTREAM(M...	GT	0/1
2	7	117105737	.	C	T	.	.	EFF=INTERGENIC(MODIFIER\|\|\|\|\|\|\|\|\|\|1),UPSTREAM(M...	GT	0/1
3	7	117105737	.	C	CA	.	.	EFF=INTERGENIC(MODIFIER\|\|\|\|\|\|\|\|\|\|1),UPSTREAM(M...	GT	0/1
4	7	117105737	.	C	CG	.	.	EFF=INTERGENIC(MODIFIER\|\|\|\|\|\|\|\|\|\|1),UPSTREAM(M...	GT	0/1

5 rows × 10 columns

Break classifications onto separate lines¶

Since multiple transcripts are represented on every line of our imported VCF, we'll need to break these apart.

First, we are just going to split on , and | and then trim off the remaining garbage. Next, we'll just set the column headers to what we find in the INFO header of the VCF. Finally, we'll make the index of the individual classifications correspond to our imported VCF table.

In [49]:

info_df = snp_eff_ensembl_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
var_class_df = info_df.apply(lambda x: pd.Series(x.split('(')))
detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 
                     'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 
                     'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'WARNINGS']
del var_class_df[1]
var_class_df.columns = ['Effect']
var_class_df = var_class_df.join(detail_df)
var_class_df.index = var_class_df.index.droplevel(-1)
var_class_df.head()

Out[49]:

	Effect	Effect_Impact	Codon_Change	Gene_Name	Transcript_BioType	Gene_Coding	Transcript_ID	Genotype_Number	WARNINGS
0	INTERGENIC	MODIFIER						1	NaN
0	UPSTREAM	MODIFIER	101	CFTR	processed_transcript	CODING	ENST00000546407	1	NaN
1	INTERGENIC	MODIFIER						1	NaN
1	UPSTREAM	MODIFIER	101	CFTR	processed_transcript	CODING	ENST00000546407	1	NaN
2	INTERGENIC	MODIFIER						1	NaN

Join the classifications back to the orginal table¶

This is just like an inner join on our index. The effect will be to have duplicate variants and unique classifications per row.

In [50]:

combo_df = snp_eff_ensembl_df.join(var_class_df)
combo_df.replace(to_replace='', value=np.nan, inplace=True) 
HTML(combo_df[combo_df["Effect"] == "XON"].head().to_html())

Out[50]:

Int64Index([], dtype=int64)

Empty DataFrame

Tweak the index¶

If what we did above seemed odd, here we will remove some of the confusion.

First, we'll get rid of rows not associated with a transcript. Why? It would be very hard to programically compare variants not associated with a transcript, so let's just drop the rows where Transcript_ID is null

Next, we'll reindex the table on two fields. First the genomic position and then the Transript_ID. These fields will correspond to the fields upon which we'll join the output from the other tools.

In [51]:

# combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
# combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID']) 
# HTML(combo_df_drop_nulls.head().to_html())

Get all the data¶

Now we'll import the data from VEP and annovar in a similar manner and store all this in an HDF5 data store and then we can do our analysis and export in anoter notebook. Note the use of context managers. If you don't close a hdf5 store you corrupt it.

In [3]:

store_file="classified_variant_store.h5"
try:
    os.remove(store_file)
except OSError:
    pass
store = pd.HDFStore(store_file)
store.close()

In [4]:

def build_snpeff_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for snpeff_file in ["cftr_snpeff_ensembl.vcf", "cftr_snpeff_refseq.vcf"]:
            snpeff_df = pd.read_table(snpeff_file, skiprows=7, header=0, usecols=range(10))
            var_class_df = snpeff_df["INFO"].str.replace("EFF=", "").str.split(',').apply(pd.Series, 1).stack()
            var_class_df = var_class_df.apply(lambda x: pd.Series(x.split('(')))
            detail_df = var_class_df[1].str.rstrip(")").apply(lambda x: pd.Series(x.split('|')))
            num_col = len(detail_df.columns)
            detail_df.columns = ['Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 
                                 'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 
                                 'Transcript_ID', 'Exon_Rank', 'Genotype_Number', "Warnings", "Errors"][:num_col]
            del var_class_df[1]
            var_class_df.columns = ['Effect']
            var_class_df = var_class_df.join(detail_df)
            var_class_df.index = var_class_df.index.droplevel(-1)
            var_class_df.head()
            snpeff_df = snpeff_df.join(var_class_df)
            # combo_df.replace(to_replace='', value=np.nan, inplace=True)
            # combo_df_drop_nulls = combo_df[combo_df["Transcript_ID"].isnull() == False]
            # combo_df_drop_nulls = combo_df_drop_nulls.set_index(keys=['POS', 'Transcript_ID'])
            snpeff_df.dropna(axis=1, how='all')
            store[snpeff_file.rstrip(".vcf")] = snpeff_df

build_snpeff_hdf()

/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE', 'Effect', 'Effect_Impact', 'Functional_Class', 'Codon_Change', 'Amino_Acid_Change', 'Amino_Acid_length', 'Gene_Name', 'Transcript_BioType', 'Gene_Coding', 'Transcript_ID', 'Exon_Rank', 'Genotype_Number', 'Warnings']]

  warnings.warn(ws, PerformanceWarning)

In [ ]:

def build_vep_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for vep_file in [ "cftr_vep_refseq.vcf", "cftr_vep_ensembl.vcf"]:
            vep_df = pd.read_table(vep_file, skiprows=5, header=0, usecols=range(10))
            info_df = vep_df["INFO"].str.lstrip("CSQ=").str.split(',').apply(pd.Series, 1).stack()
            detail_df = info_df.apply(lambda x: pd.Series(x.split('|')))
            headers = "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|GMAF|BIOTYPE|HGVSc|HGVSp|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF"
            headers_list = headers.split("|")
            detail_df.columns = headers_list
            consq_series = detail_df["Consequence"].str.split('&').apply(pd.Series, 1).stack()
            consq_series.index = consq_series.index.droplevel(-1)
            consq_series.name = "Consequence"
            del detail_df["Consequence"]
            detail_df = detail_df.join(consq_series)
            # Note that it is key to drop index level in the reverse order that they are built up
            detail_df.index = detail_df.index.droplevel(-1)
            vep_df = vep_df.join(detail_df)
            # vep_ensembl_df.replace(to_replace='', value=np.nan, inplace=True)
            vep_df = vep_df.dropna(axis=1, how='all')
            vep_df = vep_df.drop_duplicates()
            store[vep_file.rstrip(".vcf")] = vep_df

build_vep_hdf()

In [5]:

def combine_effects(row, suffix):
    if row["Func" + suffix] == "exonic":
        return row["ExonicFunc" + suffix]
    else:
        return row["Func" + suffix]
    
def combine_hgvs(row, suffix):
    if row["Func" + suffix] == "splicing":
        return row["hgvs"]
    else:
        return row["AAChange" + suffix]

def build_annovar_hdf():
    with pd.get_store("./classified_variant_store.h5") as store:
        for annovar_file in ["cftr_annovar_ensembl.tsv", "cftr_annovar_refseq.tsv"]:
            annovar_df = pd.read_table(annovar_file)
            if annovar_file == "cftr_annovar_ensembl.tsv":
                suffix = ".ensgene"
            else:
                suffix = ".refgene"
            func = annovar_df["Func" + suffix].str.split(";").apply(pd.Series, 1).stack()
            func.index = func.index.droplevel(-1)
            func.name = "Func" + suffix
            del annovar_df["Func" + suffix]
            annovar_df = annovar_df.join(func)
            annovar_df.loc[annovar_df["Func" + suffix] == 'intergenic', ["Gene" + suffix]] = np.nan
            gene_series = annovar_df["Gene" + suffix].str.extract(r'(\w+)\(*')
            print gene_series.head()
            gene_series.name = "Gene"
            annovar_df = annovar_df.join(gene_series)
            annovar_df["combined_effect"] = annovar_df.apply(combine_effects, axis=1, args=(suffix,))
            annovar_df.drop_duplicates(inplace=True)
            splice_hgvs_series = annovar_df["Gene" + suffix].str.findall(r'\w+\((.+)\)').apply(pd.Series).stack()
            splice_hgvs_series.index = splice_hgvs_series.index.droplevel(-1)
            splice_hgvs_series.name = "hgvs"
            annovar_df = annovar_df.join(splice_hgvs_series)
            annovar_df["hgvs"] = annovar_df.apply(combine_hgvs, axis=1, args=(suffix,))
            del annovar_df["Gene" + suffix]
            #We need to keep our original labels to join on in the next step
            annovar_df = annovar_df.rename(columns={'Otherinfo.2':'POS', "Otherinfo.4": "REF", "Otherinfo.5": "ALT"})
            annovar_df = annovar_df.dropna(axis=1, how='all')
            store[annovar_file.rstrip(".tsv")] = annovar_df        
build_annovar_hdf()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Gene.ensgene, dtype: object
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Gene.refgene, dtype: object

/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.ensgene', 'AAChange.ensgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.ensgene', 'Gene', 'combined_effect', 'hgvs']]

  warnings.warn(ws, PerformanceWarning)
/Users/andrewjesaitis/.virtualenvs/ipython/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Ref', 'Alt', 'ExonicFunc.refgene', 'AAChange.refgene', 'phastConsElements46way', 'snp138', 'SIFT_score', 'SIFT_score_converted', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_score_converted', 'LRT_pred', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_score_converted', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_score_converted', 'FATHMM_pred', 'RadialSVM_score', 'RadialSVM_score_converted', 'RadialSVM_pred', 'LR_score', 'LR_pred', 'Otherinfo', 'Otherinfo.3', 'REF', 'ALT', 'Otherinfo.6', 'Otherinfo.7', 'Otherinfo.8', 'Otherinfo.9', 'Otherinfo.10', 'Func.refgene', 'Gene', 'combined_effect', 'hgvs']]

  warnings.warn(ws, PerformanceWarning)

In [ ]: