Advanced SolveBio Tutorial

2016-11-01 Generating survival curves by cancer type

A key strength of SolveBio is the ability to filter datasets quickly in the SolveBio cloud. You don't have to download the source data to your computer and run complicated, computationally heavy filtering to extract the data you need. This example shows how to generate Kaplan-Meier survival curves by filtering ICGC data.

First we load the solvebio and plotly modules to access, filter, and display data. Make sure you already have the SolveBio Python client installed (see https://docs.solvebio.com/docs/installation for instructions).

In [1]:

from solvebio import login, Dataset, Filter
import plotly.plotly as py
import plotly.tools as tls
from plotly.graph_objs import *

# Load local SolveBio credentials
login()

We'll use the ICGC Donor dataset. You can explore this dataset in your browser at https://my.solvebio.com/data/ICGC/2.0.0-21/Donor.

In [2]:

icgc_donor = Dataset.retrieve('ICGC/2.0.0-21/Donor')
icgc_donor.query()

Out[2]:

|                                       Fields | Data                 |
|----------------------------------------------+----------------------|
|                                          _id | AVVMqqfMiFWN82jP3-04 |
|         cancer_history_first_degree_relative |                      |
|                 cancer_type_prior_malignancy |                      |
|                 disease_status_last_followup |                      |
|                       donor_age_at_diagnosis | 62                   |
|                      donor_age_at_enrollment | 62                   |
|                   donor_age_at_last_followup |                      |
|                        donor_diagnosis_icd10 | C67.9                |
|              donor_interval_of_last_followup |                      |
|                       donor_relapse_interval |                      |
|                           donor_relapse_type | local recurrence     |
|                                    donor_sex | male                 |
|                          donor_survival_time |                      |
|              donor_tumour_stage_at_diagnosis | T2NxMx               |
| donor_tumour_stage_at_diagnosis_supplemental |                      |
|     donor_tumour_staging_system_at_diagnosis | TNM                  |
|                           donor_vital_status |                      |
|                                icgc_donor_id | DO48367              |
|                             prior_malignancy |                      |
|                                 project_code | BLCA-CN              |
|                      study_donor_involved_in |                      |
|                           submitted_donor_id | China_0002_B105      |

... 18,676 more results.

We'll set the interval size and the total follow-up period (both in days), as well as our initial filters. These filters compare survival curves between the whole ICGC dataset (every patient with survival information) and the subset of ICGC donors whose project code begins with PACA (the pancreatic cancer projects).

In [3]:

# interval sizes are in days
interval_size = 90
total_interval_to_follow = 1825

f1 = Filter()
f2 = Filter(project_code__prefix='PACA')
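
To see which days these settings produce, here is a minimal sketch that lists the checkpoints the loop in the next cell will visit. It uses only the two values defined above; the name `checkpoints` is introduced here purely for illustration. Python's `range` excludes its stop value and 1825 is not a multiple of 90, so the curve is sampled at 20 checkpoints and the last one falls at day 1800.

```python
# Interval settings copied from the cell above (in days)
interval_size = 90
total_interval_to_follow = 1825

# Hypothetical name: the days at which survival will be sampled
checkpoints = list(range(interval_size, total_interval_to_follow, interval_size))

print(len(checkpoints))
print(checkpoints[0], checkpoints[-1])
```

If you want the final sample to land exactly on the five-year mark, one option is to pick an interval size that divides the total evenly.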

Next we count the donors with survival information in each cohort, then query the SolveBio ICGC dataset at each interval to find the percentage of those donors whose recorded survival time reaches that day.

In [4]:

# Total donors with a recorded survival time in each cohort (the denominators)
f1_total = icgc_donor.query(filters=f1).filter(donor_survival_time__gt=0).facets('donor_survival_time').get('donor_survival_time')['count']
f2_total = icgc_donor.query(filters=f2).filter(donor_survival_time__gt=0).facets('donor_survival_time').get('donor_survival_time')['count']

# Each curve starts at day 0 with 100% of donors alive
f1_data = [[0, 100]]
f2_data = [[0, 100]]

for day in range(interval_size, total_interval_to_follow, interval_size):
    # Percentage of donors whose recorded survival time is at least `day`
    f1_percent_alive = 100 * float(icgc_donor.query(filters=f1).filter(donor_survival_time__gte=day).facets('donor_survival_time').get('donor_survival_time')['count'])/float(f1_total)
    f1_data += [[day, f1_percent_alive]]

    f2_percent_alive = 100 * float(icgc_donor.query(filters=f2).filter(donor_survival_time__gte=day).facets('donor_survival_time').get('donor_survival_time')['count'])/float(f2_total)
    f2_data += [[day, f2_percent_alive]]
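
To make the arithmetic concrete, the sketch below reproduces the same calculation on a small in-memory list instead of SolveBio queries. The survival times are made-up illustrative values, and `percent_alive_curve` is a hypothetical helper, not part of the SolveBio client. Like the cell above, it simply counts donors whose recorded time reaches each checkpoint; it does not separate deaths from donors who were still alive at last follow-up, which a censoring-aware Kaplan-Meier estimator would do.

```python
def percent_alive_curve(survival_times, interval_size, total_interval_to_follow):
    """Return [day, percent] pairs, mirroring the f1_data/f2_data layout."""
    # Keep only donors with a positive recorded survival time (the denominator)
    times = [t for t in survival_times if t is not None and t > 0]
    if not times:
        raise ValueError("no donors with survival information")
    total = len(times)

    # Every curve starts at day 0 with 100% of donors alive
    curve = [[0, 100.0]]
    for day in range(interval_size, total_interval_to_follow, interval_size):
        still_alive = sum(1 for t in times if t >= day)
        curve.append([day, 100.0 * still_alive / total])
    return curve


# Made-up survival times in days; None stands for a missing value
example_times = [45, 120, 200, 400, None, 0, 950, 2000]
curve = percent_alive_curve(example_times, interval_size=90, total_interval_to_follow=1825)
print(curve[:3])
```

With real data, the list comprehension and the `sum` would each be replaced by one of the SolveBio count queries shown above, so no donor records need to be downloaded.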

Finally, the cell below plots the survival curves.

In [5]:

trace1 = Scatter(
    name=str(f1),
    x=[_x[0] for _x in f1_data],
    y=[_y[1] for _y in f1_data],
    mode='lines',
    line=Line(
        shape='hv'
    ),
)

trace2 = Scatter(
    name=str(f2),
    x=[_x[0] for _x in f2_data],
    y=[_y[1] for _y in f2_data],
    mode='lines',
    line=Line(
        shape='hv'
    ),
)

data = Data([trace1, trace2])

# Add title to layout object
layout = Layout(
    title='Kaplan-Meier',
    showlegend=True,
    legend=Legend(
        x=0,
        y=1  # legend positions are in normalized plot coordinates, so 1 is the top
    ),
    xaxis=XAxis(
        title='Days',
        zeroline=True,
        showline=True,
        tick0=0,
        range=[0, total_interval_to_follow + 1],
    ),
    yaxis=YAxis(
        title='Percent Survival',
        range=[0, 101],
        tick0=0,
        dtick=10,
    )
)
# Make a figure object
fig = Figure(data=data, layout=layout)

# Send the figure to Plotly without opening a new browser tab, then embed it in the notebook
plot_url = py.plot(fig, filename='static-kaplan-meier', auto_open=False)
tls.embed(plot_url)

Out[5]:

With SolveBio's filtering options, you can quickly run analyses such as plotting survival curves for patient populations defined by additional experimental data in ICGC (for example, donors with a specific somatic mutation, overexpression of a gene, or a specific methylation signature, versus those without).
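
As a rough local illustration of comparing cohorts, the sketch below groups a few made-up donor records by predicate and computes, for each group, the percentage whose recorded survival time reaches a given day. The records, the `cohorts` dictionary, and `percent_surviving` are all hypothetical; only the field names `project_code` and `donor_survival_time` come from the dataset above. In practice each predicate would be replaced by a SolveBio `Filter`, such as the `project_code__prefix='PACA'` filter used earlier.

```python
# Made-up donor records using two field names from the ICGC Donor dataset
donors = [
    {"project_code": "PACA-AU", "donor_survival_time": 300},
    {"project_code": "PACA-CA", "donor_survival_time": 150},
    {"project_code": "BLCA-CN", "donor_survival_time": 1200},
    {"project_code": "BRCA-US", "donor_survival_time": 900},
    {"project_code": "BLCA-CN", "donor_survival_time": None},
]

# Each cohort is a name plus a predicate; startswith mimics a prefix filter
cohorts = {
    "all donors": lambda d: True,
    "PACA projects": lambda d: d["project_code"].startswith("PACA"),
}


def percent_surviving(records, predicate, day):
    """Percent of matching donors whose recorded survival time is >= day."""
    times = [
        r["donor_survival_time"]
        for r in records
        if predicate(r) and r["donor_survival_time"]  # skips None and 0
    ]
    if not times:
        return None
    return 100.0 * sum(1 for t in times if t >= day) / len(times)


results = {name: percent_surviving(donors, pred, 180) for name, pred in cohorts.items()}
for name, pct in results.items():
    print(name, pct)
```

Adding another population to the comparison then only requires one more entry in `cohorts`.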