This is a notebook where I show you what data is available for analysis, roughly how it is formatted, and how to make a few plots¶

It is meant to be a "jump start" for your own exploration of the data¶

Resources¶

Figure80 article¶

LINK TO BE ADDED

Paper¶

https://science.sciencemag.org/content/366/6464/490

Preprint¶

https://www.biorxiv.org/content/10.1101/675314v1

Github and archived Zenodo of the pipeline to get from raw reads to the processed data available here¶

https://github.com/mjohnson11/TnSeq_Pipeline and https://zenodo.org/record/3402230#.Xc3brpNKhTY (this notebook is essentially a simplified version of the one in these repositories used to make the paper figures)

Requirements if using this locally:¶

I think just a normal anaconda distribution of python should be fine for everything here, which you probably already have: https://www.anaconda.com/distribution/

Importing various useful libraries / setting up plotting¶

In [1]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as pl
import seaborn as sns
sns.set_style("white")
sns.set_style("ticks")
colors = ['#FFB000', '#648FFF']
%matplotlib notebook

Reading data:¶

In [2]:

## READING DATA
def change_well_format(w):
    if '_' in w:
        plate = int(w[1:3])
        t = 'LK' + str(plate) + '-'
        n = int(w.split('_')[1])
        lets = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
        l = lets[int(np.floor((n-1)/12))]
        return t + l + str(((n-1) % 12) + 1).zfill(2)
    else:
        return w

def get_geno_matrix(seg_names):
    # Data from https://www.nature.com/articles/nature11867, can also be dowloaded here http://genomics-pubs.princeton.edu/YeastCross_BYxRM/
    d = pd.read_csv('files/BYxRM_GenoData.csv') 
    map_genos = {'B': 0, 'R': 1}
    for w in d.keys():
        if change_well_format(w) in seg_names:
            d[change_well_format(w)] = d[w].map(map_genos)
    assert len([s for s in seg_names if s in d.columns]) == len(seg_names)
    return d[['marker'] + seg_names]


# Reading information of segregant fitness in our focal environment (from Jerison et al. 2017)
x_info = pd.read_csv('files/Clones_For_Tn96_Experiment.csv')
seg_to_fit = {i[0]: i[1] for i in x_info.as_matrix(['segregant', 'initial fitness, YPD 30C'])}
# Reading data files containing fitness effect information from the small library experiment (few (~100) mutations in many genetic backgrounds)
tp_all = pd.read_csv('files/TP_data_by_edge.csv')
# This excludes the neutral controls, a few mutations that were unintentionally included in this library, and a few controls that didn't end up getting good enough coverage to really analyze
tp = tp_all.loc[tp_all['Type']=='Experiment']
# Reading same thing for large library experiment (many mutations in a few genetic backgrounds)
bt = pd.read_csv('files/BT_data_by_edge.csv')
# Reading aggregate data on DFE (distribution of fitness effect) data statistics for each segregant
tp_dfe = pd.read_csv('files/TP_DFE_statistics.csv')
bt_dfe = pd.read_csv('files/BT_DFE_statistics.csv')
# For bioinformatic reasons, the BT (large library) data has a longer edge sequence, changing so the two exps are comparable
bt['Long.Edge'] = bt['Edge']
bt['Edge'] = bt['Long.Edge'].str[:15]
# Making a few dictionaries that point from the names I am used to to the dataframes
dats = {'BT': bt, 'TP': tp, 'BT.DFE': bt_dfe, 'TP.DFE': tp_dfe}
exps = {'BT': 'E1', 'TP': 'E2'}
# Getting a list of segregants in each experiment by looking for columns like segregant.mean.s in the dataframe
segs_all = {exp: [i.split('.')[0] for i in dats[exp] if '.mean.s' in i] for exp in exps}
# Getting genotype information on these segregants using data from Bloom et al. 2013
gm = get_geno_matrix(segs_all['TP'])
# Making restricted lists of segregants that have at least 50 mutations with s measured for DFE comparisons
segs_use = {exp: [s for s in segs_all[exp] if len(dats[exp].loc[pd.notnull(dats[exp][s + '.mean.s'])])>=50] for exp in exps}
# Making some 
sorted_segs = {exp: sorted(segs_use[exp], key=lambda x: seg_to_fit[x]) for exp in exps}

/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:26: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.

Let's just take a look at all this data, to see what's what¶

for tp and bt, each row is a mutation for tp_dfe and bt_dfe, each row is a DFE statistic (or background fitness, which we can treat similarly) for gm, each row is a genotyped allele that is different between RM and BY (and takes each state in ~half the segregants)

In [3]:

tp.iloc[:3]

Out[3]:

	Edge	Type	Edge.ID	Gene.Use	Gene_ORF	Gene_ORF.nearby	briefDescription	briefDescription.nearby	chromosome	description	...	full_model_p	full_model_aic	full_model_r2_95_conf_low	full_model_r2_95_conf_high	full_model_p_values	full_model_params	full_model_coeffs	qtls	resid.qtls	full.model.qtls
0	TGATCATCACGGGAC	Experiment	888	in FLC2	FLC2	FLC2	Putative calcium channel involved in calcium r...	Putative calcium channel involved in calcium r...	chr01	Putative calcium channel involved in calcium r...	...	8.990535e-25	-1049.391200	0.484313	0.664302	3.3295688484889364e-32;1.8707009014831425e-05;...	x;locus_9649106_chr14_401893_C_T;locus_9717516...	-0.009667588479105496;-0.06737728192983725;0.3...	chr14_401893;chr14_374661;chr14_402771	chr14_470303;chr14_420065;chr14_492376	chr14_401893;chr14_374661;chr14_402771\|chr14_4...
1	TCGAAAGCACAGTAG	Experiment	377	nearby SRB2	NaN	SRB2	NaN	Subunit of the RNA polymerase II mediator complex	chr08	NaN	...	5.640931e-01	-83.796365	0.000082	0.396975	0.00020013790535836575;0.5640930907526316	x	-0.049382849134784586;-0.08592702839231847	NaN	NaN	NaN
2	GTTGAACTGGTTGTT	Experiment	197	in STM1	STM1	STM1	Protein required for optimal translation under...	Protein required for optimal translation under...	chr12	Protein required for optimal translation under...	...	2.500225e-07	-803.596564	0.103969	0.347860	1.4054000123711754e-50;0.9145877127045838;4.10...	x;locus_9680363_chr14_433150_G_A	-0.028546289780384954;-0.0024826738390553073;0...	chr14_433150;chr14_420065;chr14_485550	chr14_433150;chr14_420065;chr14_492376	chr14_433150;chr14_420065;chr14_485550

3 rows × 1696 columns

In [4]:

bt.iloc[:3]

Out[4]:

	Edge.ID	Edge	Gene.Use	Gene_ORF	Gene_ORF.nearby	briefDescription	briefDescription.nearby	chromosome	description	description.nearby	...	full_model_r2	full_model_p	full_model_aic	full_model_p_values	full_model_params	full_model_coeffs	qtls	resid.qtls	full.model.qtls	Long.Edge
0	0	TCTCCAAGGGATACT	in YMR085W	YMR085W	YMR085W	Putative protein of unknown function	Putative protein of unknown function	chr13	Putative protein of unknown function; YMR085W ...	Putative protein of unknown function; YMR085W ...	...	0.010223	0.781072	-80.547553	0.26814250806982043;0.7810722126668491	x	0.001490528225059142;0.006953626038516742	NaN	NaN	NaN	TCTCCAAGGGATACTTAACGTTATTCCTTT
1	1	TGTGTCGATTTAGTG	in RKM3	RKM3	RKM3	Ribosomal lysine methyltransferase	Ribosomal lysine methyltransferase	chr02	Ribosomal lysine methyltransferase; specific f...	Ribosomal lysine methyltransferase; specific f...	...	0.030625	0.567432	-114.446981	0.15944797128735522;0.5674322326468317	x	0.001162120166529229;-0.009006462611979207	NaN	NaN	NaN	TGTGTCGATTTAGTGTTAAAGAATGACGTC
2	2	TATGGTGCAGAAAAG	nearby YNL143C	NaN	MEP2\|YNL143C	NaN	Ammonium permease involved in regulation of ps...	chr14	NaN	Ammonium permease involved in regulation of ps...	...	0.144110	0.132863	-171.801874	0.10689612722798757;0.13286265303302433	x	0.0006108064927804045;0.012645873311052722	NaN	NaN	NaN	TATGGTGCAGAAAAGTGGCTCGGAATGAAC

3 rows × 246 columns

In [5]:

tp_dfe.iloc[:3]

Out[5]:

	DFE.statistic	LK4-G06	LK3-G08	LK2-G07	LK4-H03	LK4-D11	LK3-E06	LK1-D12	LK6-C05	LK4-C09	...	full_resid_seg_model_r2	full_resid_seg_model_p	full_resid_seg_model_r2_95_conf_low	full_resid_seg_model_r2_95_conf_high	full_resid_seg_model_p_values	full_resid_seg_model_params	full_resid_seg_model_coeffs	qtls	resid.qtls	full.model.qtls
0	background.fitness	0.017536	0.093806	-0.019195	0.022961	-0.060829	0.032745	-0.057404	-0.059662	-0.087799	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	chr14_381891;chr14_368185;chr14_393024\|chr15_1...	NaN	chr14_381891;chr14_368185;chr14_393024\|chr15_1...
1	mean	-0.021606	-0.038078	-0.022692	-0.033220	-0.011229	-0.026932	-0.010239	-0.010303	-0.016691	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	chr14_381891;chr14_371959;chr14_393024\|chr04_4...	chr14_459667;chr14_414983;chr14_481897	chr14_381891;chr14_371959;chr14_393024\|chr04_4...
2	median	-0.017210	-0.022665	-0.012889	-0.025198	-0.012115	-0.016404	-0.010974	-0.009748	-0.010328	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	chr14_376315;chr14_368185;chr14_393024	chr14_433150;chr14_414983;chr14_485550	chr14_376315;chr14_368185;chr14_393024\|chr14_4...

3 rows × 213 columns

In [6]:

bt_dfe.iloc[:3]

Out[6]:

	DFE.statistic	LK4-A04	LK3-D08	LK3-G02	LK2-F11	LK1-E09	LK4-E02	LK1-C09	LK2-A12	LK4-H11	...	full_resid_seg_model_r2	full_resid_seg_model_p	full_resid_seg_model_r2_95_conf_low	full_resid_seg_model_r2_95_conf_high	full_resid_seg_model_p_values	full_resid_seg_model_params	full_resid_seg_model_coeffs	qtls	resid.qtls	full.model.qtls
0	background.fitness	0.036559	0.023625	-0.021251	0.035349	0.001306	-0.021838	0.097835	-0.029612	-0.043350	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	mean	-0.005015	-0.005416	-0.003760	-0.004865	-0.002509	-0.003774	-0.003754	-0.002583	-0.003060	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	median	0.001086	0.000173	0.000602	0.000504	0.000646	0.000713	0.000439	0.000787	0.000182	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 85 columns

In [7]:

gm.iloc[:3]

Out[7]:

	marker	LK4-G06	LK2-G07	LK4-A04	LK1-D10	LK1-D12	...	LK3-A01	LK1-C11	LK3-D04	LK3-C06	LK2-A01	LK2-G10
0	27915_chr01_27915_T_C	1	1	1	1	1	...	1	1	1	1	1	1
1	28323_chr01_28323_G_A	1	1	1	1	1	...	1	1	1	1	1	1
2	28652_chr01_28652_G_T	1	1	1	1	1	...	1	1	1	1	1	1

3 rows × 163 columns

Here are some basic statistics on the data from the two experiments:¶

In [8]:

tp_seg_edges_measured = [len(tp.loc[pd.notnull(tp[s+'.mean.s'])]) for s in segs_all['TP']]
bt_seg_edges_measured = [len(bt.loc[pd.notnull(bt[s+'.mean.s'])]) for s in segs_all['BT']]

In [9]:

print('BT. Tried 20 segregants.')
print(len(segs_all['BT']), ' segregants included.', np.mean(bt_seg_edges_measured), 'mutations measured on average,', 'min:', np.min(bt_seg_edges_measured), 'max:', np.max(bt_seg_edges_measured))
print(len(bt), 'mutations.', len(bt.loc[bt['num.measured']>0]), 'measured in at least one seg.', 'of those, they are measured in', np.mean(bt.loc[bt['num.measured']>0]['num.measured']), 'on average.')
print('Measured in', np.nanmean(bt.loc[bt['num.measured']>0]['num.measured']), 'segs on average')
print(len(bt.loc[bt['num.sig']>0]), 'significant in at least one seg,', len(bt.loc[bt['num.measured']>0].loc[bt['num.sig']==0]), 'in none,', 
      len(bt.loc[bt['num.measured']>0].loc[bt['num.sig']==0])/len(bt.loc[bt['num.measured']>0]), 'percent')

BT. Tried 20 segregants.
18  segregants included. 413.8888888888889 mutations measured on average, min: 180 max: 614
996 mutations. 710 measured in at least one seg. of those, they are measured in 10.492957746478874 on average.
Measured in 10.492957746478874 segs on average
253 significant in at least one seg, 457 in none, 0.643661971830986 percent

In [10]:

print('TP. Tried 176 segregants.')
print(len(segs_all['TP']), ' segregants included.', np.mean(tp_seg_edges_measured), 'mutations measured on average,', 'min:', np.min(tp_seg_edges_measured), 'max:', np.max(tp_seg_edges_measured))
print(len(tp), 'mutations.', 'Measured in', np.nanmean(tp['num.measured']), 'segs on average')
print(np.min(tp['num.measured']),'is the minimum # segs w measurements for one mutation, and the min # sig effects is:', np.min(tp['num.sig']))

TP. Tried 176 segregants.
162  segregants included. 65.70987654320987 mutations measured on average, min: 1 max: 82
91 mutations. Measured in 116.97802197802197 segs on average
9 is the minimum # segs w measurements for one mutation, and the min # sig effects is: 0

Additional columns in these datasets describe the results of the modeling described in the paper, in terms of the R^2, AIC, etc. for various models of epistasis, the models are background fitness (x), QTLs (qtl), full (both), or resid versions (resid_qtl is the qtl model after regressing out x effects), and the columns like:¶

In [11]:

print([i for i in tp if 'aic' in i or i[-2:] in ['r2', '_p']])

['x_model_r2', 'x_model_p', 'x_model_aic', 'qtl_model_r2', 'qtl_model_p', 'qtl_model_aic', 'resid_qtl_model_r2', 'resid_qtl_model_p', 'resid_qtl_model_aic', 'resid_x_model_r2', 'resid_x_model_p', 'resid_x_model_aic', 'full_model_r2', 'full_model_p', 'full_model_aic']

Plot fitness effect vs. background fitness for a particular mutation¶

In [12]:

def plot_simple(sub, df_row, segs):
    measured = [seg for seg in segs if pd.notnull(df_row[seg + '.mean.s'])]
    xs = [seg_to_fit[measured[s]] for s in range(len(measured))]
    ys = [df_row[measured[s] + '.mean.s'] for s in range(len(measured))]
    ye = [df_row[measured[s] + '.stderr.s'] for s in range(len(measured))]
    sub.axhline(y=0, xmin=0, xmax=1, color='#333333', linestyle='dashed', alpha=0.5, lw=1)
    sub.errorbar(x=xs, y=ys, yerr=ye, marker='o', c='k', linestyle='')   
    sub.set_xlim([-0.16, 0.12])
    sub.set_ylim([-0.2, 0.08])
    sub.set_xlabel('X')
    sub.set_ylabel('S')
    sns.despine()

f, sub = pl.subplots(1,1)
dataframe_row = tp[tp['Gene.Use']=='in RPL16A'].iloc[0]
plot_simple(sub, dataframe_row, segs_use['TP'])

Plot a couple DFEs¶

Note in the paper we plot combined DFEs by background fitness quartiles to make them less noisey, which is why this is different:

In [13]:

def get_dfe(df, segname):
    use_df = df.loc[pd.notnull(df[segname + '.mean.s'])]
    return list(use_df[segname + '.mean.s'])

def simple_plot_dfe_compare(exp):
    f, sub = pl.subplots(1, 1, figsize=(6, 4))
    lowest_fit_dfe = get_dfe(dats[exp], sorted_segs[exp][0])
    highest_fit_dfe = get_dfe(dats[exp], sorted_segs[exp][-1])
    bin_lefts = [(-16.15+i)*0.015-0.005 for i in range(22)]
    sub.hist(lowest_fit_dfe, bins=bin_lefts, label='Least Fit Seg', histtype="step", color="r", weights=np.ones_like(lowest_fit_dfe)/float(len(lowest_fit_dfe)))
    sub.hist(highest_fit_dfe, bins=bin_lefts, label='Most Fit Seg', histtype="step", color="b", weights=np.ones_like(highest_fit_dfe)/float(len(highest_fit_dfe)))
    sub.legend(loc='upper left')
    sns.despine()
    
simple_plot_dfe_compare('BT')
simple_plot_dfe_compare('TP')

OK that concludes the jump-start!¶

If you want to get deeper into the plotting / data exploration you can also check out the (much longer but less well commented) jupyter notebook on https://github.com/mjohnson11/TnSeq_Pipeline

Please email me if you have questions: milo.s.johnson.13@gmail.com