This notebook is still a little rough around the edges (hard-coded paths and the like). To generate the calls we ran MuTect on the TCGA BAM files from CGHub. In general our calls agree well with the TCGA working group calls, despite (possibly) being generated with slightly different software versions and without a manual curation step. We use this to pick up about 2,000 additional patients for our pan-cancer analysis; while our calls may not be perfect, we expect false negatives rather than false positives. If anything, this should water down the p-values of our associations.
cd ../src/
/cellar/users/agross/TCGA_Code/TCGA/src
%pylab inline
Populating the interactive namespace from numpy and matplotlib
from Processing.Imports import *
PATH = '/cellar/data/TCGA/protected/exome_andy/p53_all/'
Read in all of the VCF files and pull out the somatic TP53 SNVs.
pts_snv = [f[:12] for f in os.listdir(PATH) if 'call_stats' in f and '~' not in f]
vcf = pd.concat({p: pd.read_table(PATH + '{}_call_stats.txt'.format(p), skiprows=[0])
for p in pts_snv})
vcf = vcf.reset_index()
vcf = vcf.rename(columns={'level_0':'barcode'})
vcf['barcode'] = vcf['barcode'].map(lambda s: s.replace('_','-'))
pts_snv = map(lambda p: p.replace('_','-'), pts_snv)
vcf = vcf[vcf.contig == 17]
tab = vcf.groupby(['tumor_name','contig']).size().unstack()
vcf = vcf[vcf.judgement == 'KEEP']
Read in all of the indel calls. I have this funky try/except because some of the files have the VCF header but no actual calls, which pandas cannot parse. For this reason I keep track of the patients with the pts_indels dict and the indel calls with the indels dict, and then put it all together.
indels = {}
pts_indels = {}
for f in os.listdir(PATH):
if not f.endswith('indels.txt'):
continue
pts_indels[f] = '-'.join(f.split('_')[:3])
try:
indels[pts_indels[f]] = pd.read_table(PATH + f, skiprows=102, header=None)
except ValueError:
pass
pts_indels = pd.Series(pts_indels)
indels = pd.concat(indels)
indels.index.names = ['barcode','num']
indels.columns = ['chromosome','pos','id','ref','alt','qual','filter','info',
'format','tumor','normal']
indels = indels.reset_index(0)
indels = indels[indels['info'] == 'SOMATIC']
Not all variant calling runs were successful. Here we only use patients with both a successful SomaticIndelDetector run and a successful MuTect run.
pts = list(set(pts_indels).intersection(set(pts_snv)))
len(pts), len(pts_indels), len(pts_snv)
(5689, 5752, 5694)
def format_indels_for_oncotator(s):
r = {}
r['chr'] = s['chromosome']
r['start'] = s['pos']
r['end'] = int(s['pos']) + len(s['ref']) - 1
r['reference_allele'] = s['ref']
r['observed_allele'] = s['alt']
return pd.Series(r)
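A quick self-contained check of the coordinate convention used above: for Oncotator input, an indel's end position is start + len(ref) - 1, with 1-based inclusive coordinates.

```python
# Coordinate arithmetic used by format_indels_for_oncotator above:
# end = start + len(ref) - 1, 1-based and inclusive.
def indel_end(start, ref):
    return int(start) + len(ref) - 1

assert indel_end(100, 'ACT') == 102  # 3-bp reference allele spans 100..102
assert indel_end(100, 'A') == 100    # single-base allele: start == end
```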
def format_snvs_for_oncotator(s):
r = {}
r['chr'] = s['contig']
r['start'] = s['position']
r['end'] = s['position']
r['reference_allele'] = s['ref_allele']
r['observed_allele'] = s['alt_allele']
r = pd.Series(r)
return r
onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
onc = onc.reset_index('barcode')
onc = onc[['chr', 'start', 'end', 'reference_allele', 'observed_allele']]
onc.to_csv('tmp.txt', sep=' ', index=False, header=False)
Yes, this is a hard-coded path.
ONCOTATOR_FILE = '/cellar/users/agross/Downloads/oncotator_output (9).txt'
Oncotator only outputs annotations for unique variants, so I create a lookup for each mutation and then map it across the whole dataset.
tab = pd.read_table(ONCOTATOR_FILE, skiprows=[0])
tt = tab[['Chromosome','Start_position','End_position','Reference_Allele', 'Tumor_Seq_Allele1']]
tt = pd.Series({i: tuple(v) for i,v in tt.iterrows()})
onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)
onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)
onc = pd.concat([onc_indel, onc_vcf])
oo = onc[['chr','start','end','reference_allele','observed_allele']]
oo = [tuple(v) for i,v in oo.iterrows()]
maf = pd.DataFrame(tab.ix[tt[tt == s].index[0]] for s in oo)
maf.index = onc.index
maf = maf.set_index('Hugo_Symbol', append=True)
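As an aside, the linear scan `tt[tt == s]` above is quadratic in the number of variants; a dict keyed on the variant tuple would do the same mapping in constant time per lookup. A minimal sketch, with an invented toy table standing in for the Oncotator output:

```python
import pandas as pd

# Toy annotation table standing in for the Oncotator output (values invented).
tab = pd.DataFrame({
    'Chromosome': ['17', '17'],
    'Start_position': [7577120, 7578406],
    'End_position': [7577120, 7578406],
    'Reference_Allele': ['C', 'C'],
    'Tumor_Seq_Allele1': ['T', 'A'],
    'Variant_Classification': ['Missense_Mutation', 'Nonsense_Mutation'],
})
key_cols = ['Chromosome', 'Start_position', 'End_position',
            'Reference_Allele', 'Tumor_Seq_Allele1']
# One pass to build the lookup, then O(1) per variant afterwards.
lookup = {tuple(v): i for i, v in tab[key_cols].iterrows()}
row = tab.loc[lookup[('17', 7577120, 7577120, 'C', 'T')]]
```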
Here I am turning the MAF file into a vector of patient TP53 mutation status.
vc = maf.Variant_Classification.copy()
vc.index = vc.index.droplevel(1)
vc = (vc.isin(['Silent', 'Intron', "3'UTR", "5'UTR"])==False).groupby(level=0).sum()
vc = vc.ix[pts].fillna(0) > 0
vc.name = 'new'
vc.value_counts()
False    3348
True     2341
dtype: int64
Save the MAF file and the TP53 mutation status vector for later analysis. This is not a proper MAF file, as I append the patients who have variant calls but wild-type TP53 to the end of the file.
wildtype = pd.Series({(p, 'wildtype'):nan for p in vc[vc==False].index})
wildtype.index = pd.MultiIndex.from_tuples(wildtype.index)
wildtype = pd.DataFrame({'Entrez_Gene_Id': wildtype})
maf_wt = maf.append(wildtype)[maf.columns]
maf_wt = maf_wt.dropna(axis=1, how='all')
maf_wt.to_csv('../Extra_Data/p53_calls_pancancer.maf')
vc.to_csv('../Extra_Data/p53_calls_pancancer.csv')
Read in variant calls from MAFS. This meta.csv file is generated in the get_all_MAFs analysis notebook.
old = pd.read_csv('../Data/MAFs/meta.csv')
old['Tumor_Sample_Barcode'] = map(lambda s: s[:12], old['Tumor_Sample_Barcode'])
p53_old = old[old.Hugo_Symbol == 'TP53'].set_index('Tumor_Sample_Barcode')['0']
p53_old = p53_old.ix[old.Tumor_Sample_Barcode.unique()].fillna(0)
p53_old = p53_old.groupby(level=0).first()
p53_old.name = 'old'
pd.crosstab(vc>0, p53_old>0)
new \ old | False | True
---|---|---
False | 2103 | 90
True | 79 | 1400
We get 94% sensitivity.
1400. / (1400 + 90)
0.9395973154362416
We get 96% specificity.
2103. / (2103 + 79)
0.963794683776352
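The two ratios above are just the standard confusion-matrix formulas; a small helper (illustrative only, using the counts from the crosstab with the old calls treated as truth) makes that explicit:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / float(tp + fn), tn / float(tn + fp)

# Counts from the crosstab above.
sens, spec = sens_spec(tp=1400, fn=90, tn=2103, fp=79)
```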