Exporting Burst Data

This notebook is part of a tutorial series for the FRETBursts burst analysis software.

In this notebook, show a few example of how to export FRETBursts burst data to a file.

Please cite FRETBursts in publications or presentations!

Loading the software

We start by loading FRETBursts:

In [1]:
from fretbursts import *
 - Optimized (cython) burst search loaded.
 - Optimized (cython) photon counting loaded.
--------------------------------------------------------------
 You are running FRETBursts (version 0.6.5+5.g5243bc2).

 If you use this software please cite the following paper:

   FRETBursts: An Open Source Toolkit for Analysis of Freely-Diffusing Single-Molecule FRET
   Ingargiola et al. (2016). http://dx.doi.org/10.1371/journal.pone.0160716 

--------------------------------------------------------------
In [2]:
sns = init_notebook()

Downloading the data file

The full list of smFRET measurements used in the FRETBursts tutorials can be found on Figshare.

This is the file we will download:

In [3]:
url = 'http://files.figshare.com/2182601/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5'
You can change the url variable above to download your own data file. This is useful if you are executing FRETBursts online and you want to use your own data file. See First Steps.

Here, we download the data file and put it in a folder named data, inside the notebook folder:

In [4]:
download_file(url, save_dir='./data')
URL:  http://files.figshare.com/2182601/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5
File: 0023uLRpitc_NTP_20dT_0.5GndCl.hdf5
 
File already on disk: /Users/anto/src/FRETBursts/notebooks/data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5 
Delete it to re-download.

NOTE: If you modified the url variable providing an invalid URL the previous command will fail. In this case, edit the cell containing the url and re-try the download.

Loading the data file

Here, we can directly define the file name to be loaded:

In [5]:
filename = "./data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5"
filename
Out[5]:
'./data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5'

Let's check that the file exists:

In [6]:
import os
if os.path.isfile(filename):
    print("Perfect, file found!")
else:
    print("Sorry, file:\n%s not found" % filename)
Perfect, file found!
In [7]:
d = loader.photon_hdf5(filename)

μs-ALEX parameters

At this point, timestamps and detectors numbers are contained in the ph_times_t and det_t attributes of d. Let's print them:

In [8]:
d.ph_times_t, d.det_t
Out[8]:
([array([     146847,      188045,      294124, ..., 47999863658,
         47999877783, 47999955353])],
 [array([0, 1, 1, ..., 1, 1, 0], dtype=uint32)])

We need to define some ALEX parameters:

In [9]:
d.add(det_donor_accept = (0, 1), 
      alex_period = 4000,
      offset = 700,
      D_ON = (2180, 3900), 
      A_ON = (200, 1800))

Here the parameters are:

  • det_donor_accept: donor and acceptor channels
  • alex_period: length of excitation period (in timestamps units)
  • D_ON and A_ON: donor and acceptor excitation windows
  • offset: the offset between the start of alternation and start of timestamping (see also Definition of alternation periods).

To check that the above parameters are correct, we need to plot the histogram of timestamps (modulo the alternation period) and superimpose the two excitation period definitions to it:

In [10]:
bpl.plot_alternation_hist(d)

If the previous alternation histogram looks correct, the corresponding definitions of the excitation periods can be applied to the data using the following command:

In [11]:
loader.alex_apply_period(d)
# Total photons (after ALEX selection):     2,259,522
#  D  photons in D+A excitation periods:      721,537
#  A  photons in D+A excitation periods:    1,537,985
# D+A photons in  D  excitation period:     1,434,842
# D+A photons in  A  excitation period:       824,680

If the previous histogram does not look right, the parameters in the d.add(...) cell can be modified and checked by running the histogram plot cell until everything looks fine. Don't forget to apply the parameters with loader.usalex_apply_period(d) as a last step.

NOTE: After applying the ALEX parameters a new array of timestamps containing only photons inside the excitation periods is created (name d.ph_times_m). To save memory, by default, the old timestamps array (d.ph_times_t) is deleted. Therefore, in the following, when we talk about all-photon selection we always refer to all photons inside both excitation periods.

Background and burst search

In [12]:
d.calc_bg(bg.exp_fit, time_s=30, tail_min_us='auto', F_bg=1.7)
 - Calculating BG rates ... [DONE]
In [13]:
d.burst_search(L=10, m=10, F=6)
 - Performing burst search (verbose=False) ...[DONE]
 - Calculating burst periods ...[DONE]
 - Counting D and A ph and calculating FRET ... 
   - Applying background correction.
   [DONE Counting D/A]

First we filter the bursts to avoid creating big files:

In [14]:
ds = d.select_bursts(select_bursts.size, th1=60)

Exporting Burst Data

By burst-data we mean all the scalar burst parameters, e.g. size, duration, background, etc...

We can easily get a table (a pandas DataFrame) with all the burst data as follows:

In [15]:
bursts = bext.burst_data(ds, include_bg=True, include_ph_index=True)
bursts.head(5)  # display first 5 bursts
Out[15]:
E S bg_aa bg_ad bg_da bg_dd bg_period i_start i_stop na naa nd nda nt size_raw t_start t_stop width_ms spot
0 0.324223 0.571004 1.738086 2.993976 0.249776 1.795806 0 5276 5422 26.006024 60.261914 54.204194 -0.249776 140.472131 147 1.732456 1.735530 3.074012 0
1 0.420886 0.648480 1.630940 2.809410 0.234378 1.685103 0 5961 6173 56.190590 72.369060 77.314897 0.765622 205.874547 213 1.792993 1.795878 2.884513 0
2 0.451800 0.467038 1.578611 2.719270 0.226858 1.631036 0 6757 6986 47.280730 119.421389 57.368964 -0.226858 224.071083 230 2.023545 2.026337 2.791962 0
3 0.166462 0.984595 1.336586 2.302365 0.192077 1.380973 0 10403 10515 17.697635 1.663414 88.619027 -0.192077 107.980076 113 3.113880 3.116244 2.363913 0
4 0.137702 0.998409 1.887150 3.250750 0.271197 1.949821 0 10548 10625 9.749250 0.112850 61.050179 -0.271197 70.912280 78 3.129707 3.133044 3.337650 0

Once we have the DataFrame, saving it to disk in CSV format is trivial:

In [16]:
bursts.to_csv('%s_burst_data.csv' % filename.replace('.hdf5', ''))

Exporting Bursts Timestamps

Exporting timestamps and other photon-data for each bursts is a little trickier because the data is less uniform (i.e. each burst has a different number of photons). In the following example, we will save a csv file with variable-length columns. Each burst is represented by to lines: one line for timestamps and one line for the photon-stream (excitation/emission period) the timestamps belongs to.

Let's start by creating an array of photon streams containing one of the values 0, 1, 2 or 3 for each timestamp. These values will correspond to the DexDem, DexAem, AexDem, AemAem photon streams respectively.

In [17]:
ds.A_ex
Out[17]:
[array([False,  True,  True, ..., False,  True, False], dtype=bool)]
In [18]:
#{0: DexDem, 1:DexAem, 2: AexDem, 3: AemAem}
In [19]:
(ds.A_ex[0].view('int8') << 1) + ds.A_em[0].view('int8')
Out[19]:
array([1, 3, 3, ..., 1, 3, 0], dtype=int8)

Now we define an header documenting the file format. Ww will also include the filename of the measurement.

This is just an example including nanotimes:

In [20]:
header = """\
# BPH2CSV: %s
# Lines per burst: 3
# - timestamps (int64): in 12.5 ns units
# - nanotimes (int16): in raw TCSPC unit (3.3ps)
# - stream (uint8): the photon stream according to the mapping {0: DexDem, 1: DexAem, 2: AexDem, 3: AemAem}
""" % filename
print(header)
# BPH2CSV: ./data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5
# Lines per burst: 3
# - timestamps (int64): in 12.5 ns units
# - nanotimes (int16): in raw TCSPC unit (3.3ps)
# - stream (uint8): the photon stream according to the mapping {0: DexDem, 1: DexAem, 2: AexDem, 3: AemAem}

And this is header we are going to use:

In [21]:
header = """\
# BPH2CSV: %s
# Lines per burst: 2
# - timestamps (int64): in 12.5 ns units
# - stream (uint8): the photon stream according to the mapping {0: DexDem, 1: DexAem, 2: AexDem, 3: AemAem}
""" % filename
print(header)
# BPH2CSV: ./data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5
# Lines per burst: 2
# - timestamps (int64): in 12.5 ns units
# - stream (uint8): the photon stream according to the mapping {0: DexDem, 1: DexAem, 2: AexDem, 3: AemAem}

We can now save the data to disk:

In [22]:
out_fname = '%s_burst_timestamps.csv' % filename.replace('.hdf5', '')
dx = ds
ich = 0

bursts = dx.mburst[ich]
timestamps = dx.ph_times_m[ich]
stream = (dx.A_ex[ich].view('int8') << 1) + dx.A_em[ich].view('int8')
with open(out_fname, 'wt') as f:
    f.write(header)
    for times, period in zip(bl.iter_bursts_ph(timestamps, bursts),
                             bl.iter_bursts_ph(stream, bursts)):
        times.tofile(f, sep=',')
        f.write('\n')
        period.tofile(f, sep=',')
        f.write('\n')

Done!

Read the file back

For consistency check, we can read back the data we just saved. As an exercise we will put the results in a pandas DataFrame which is more convenient than an array for holding this data.

In [23]:
import pandas as pd

We start reading the header and computing some file-specific constants.

In [24]:
with open(out_fname) as f:
    lines = []
    lines.append(f.readline())
    while lines[-1].startswith('#'):
        lines.append(f.readline())
    header = ''.join(lines[:-1])
print(header)
# BPH2CSV: ./data/0023uLRpitc_NTP_20dT_0.5GndCl.hdf5
# Lines per burst: 2
# - timestamps (int64): in 12.5 ns units
# - stream (uint8): the photon stream according to the mapping {0: DexDem, 1: DexAem, 2: AexDem, 3: AemAem}

In [25]:
stream_map = {0: 'DexDem', 1: 'DexAem', 2: 'AexDem', 3: 'AemAem'}
nrows = int(header.split('\n')[1].split(':')[1].strip())
header_len = len(header.split('\n')) - 1
header_len, nrows
Out[25]:
(4, 2)

As a test, we load the data for the first burst into a dataframe, converting the numerical column "streams" into photon-stream names (strings). The new column is of type categorical, so it will take very little space:

In [26]:
burstph = (pd.read_csv(out_fname, skiprows=header_len, nrows=nrows, header=None).T
           .rename(columns={0: 'timestamp', 1: 'stream'}))
burstph.stream = (burstph.stream
                  .apply(lambda x: stream_map[pd.to_numeric(x)])
                  .astype('category', categories=['DexDem', 'DexAem', 'AexDem', 'AemAem'], ordered=True))
burstph
Out[26]:
timestamp stream
0 138596500 AemAem
1 138622303 DexDem
2 138626950 DexDem
3 138641792 AemAem
4 138643022 DexDem
... ... ...
142 138788833 AemAem
143 138790337 DexDem
144 138790703 DexAem
145 138790873 DexAem
146 138842421 DexDem

147 rows × 2 columns

For reading the whole file I use a different approach. First, I load the entire file in two lists of lists (one for timestamps and one for the stream). Next, I create a single DataFrame with a third column indicating the burst index.

In [27]:
import csv
from builtins import int  # python 2 workaround, can be removed on python 3

# Read data in two list of lists
t_list, s_list = [], []
with open(out_fname) as f:
    for i in range(header_len):
        f.readline()
    csvreader = csv.reader(f)    
    for row in csvreader:
        t_list.append([int(v) for v in row])
        s_list.append([int(v) for v in next(csvreader)])

# Turn the inner list into pandas.DataFrame
d_list = []
for ib, (t, s) in enumerate(zip(t_list, s_list)):
    d_list.append(
        pd.DataFrame({'timestamp': t, 'stream': s}, columns=['timestamp', 'stream'])
        .assign(iburst=ib)
    )

# Concatenate dataframes
burstph = pd.concat(d_list, ignore_index=True)

# Convert stream column into categorical
burstph.stream = (burstph.stream
                  .apply(lambda x: stream_map[pd.to_numeric(x)])
                  .astype('category', categories=['DexDem', 'DexAem', 'AexDem', 'AemAem'], ordered=True))
burstph
Out[27]:
timestamp stream iburst
0 138596500 AemAem 0
1 138622303 DexDem 0
2 138626950 DexDem 0
3 138641792 AemAem 0
4 138643022 DexDem 0
... ... ... ...
232327 47909435404 DexAem 987
232328 47909438825 DexAem 987
232329 47909451409 DexDem 987
232330 47909458878 DexAem 987
232331 47909462189 DexDem 987

232332 rows × 3 columns

In [28]:
burstph.dtypes
Out[28]:
timestamp       int64
stream       category
iburst          int64
dtype: object

This was just an example. There are certainly more efficient ways to read the file into a DataFrame. Please feel free to contribute new methods to illustrate a different (more efficient or simpler) approach.


Executed: Sat Nov 18 15:59:43 2017

Duration: 8 seconds.

Autogenerated from: Example - Exporting Burst Data Including Timestamps.ipynb