ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to create cohorts
Author:  Lauren Hagen
Created: 2019-06-20
Purpose: Basic overview of creating cohorts
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb
Notes:   This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 1' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.

This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free trial. You can also explore the available tables and datasets before committing to creating a GCP project through the ISB-CGC BigQuery Table Searcher.

We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. There are many tutorials and samples available on GitHub (see, in particular, the datalab repo and the Google Genomics project), as well as our own Community Notebooks.

OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:

In [0]:
from google.cloud import bigquery

Next we will need to authorize our access to BigQuery and the Google Cloud. For more information, see the 'Quick Start Guide to ISB-CGC'; alternative authentication methods can be found here.

In [ ]:
!gcloud auth application-default login
In [0]:
# Create a variable with the project ID to use with the BigQuery client
project_id = 'your_project_id' # Update with your Google Cloud Project ID

if project_id == 'your_project_id':
    print('Please update project_id with your Google Cloud Project ID')
else:
    client = bigquery.Client(project_id) # Create a BigQuery client for your project

The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called TCGA_bioclin_v0, owned by the isb-cgc project. A full table identifier is of the form <project_id>.<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:

In [0]:
# Print the name of each table in the dataset
print("Tables:")
for t in list(client.list_tables('isb-cgc.TCGA_bioclin_v0')):
  print(t.table_id)
Tables:
Annotations
Biospecimen
Clinical
clinical_v1

In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you may be asking what we mean by "cohort", why you might be interested in creating one, or what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type, but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at the dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation. This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort, you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.
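
To make that concrete, here is a minimal sketch (illustrative only; the barcodes are examples drawn from the query results later in this notebook) showing that a cohort is nothing more than a list of case barcodes:

In [0]:
# A cohort is just a list of TCGA case barcodes. Case barcodes follow the
# pattern TCGA-<tissue source site code>-<participant>; sample barcodes extend
# this with additional sample/vial fields. (Illustrative example only.)
example_cohort = ["TCGA-A1-A0SH", "TCGA-A1-A0SJ"]
for barcode in example_cohort:
    tss_code = barcode.split("-")[1]  # e.g. 'A1' is the tissue source site code
    print(barcode, "-> tissue source site:", tss_code)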

In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.

Exploring the Clinical data table

Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and additional information for some tumor types only. For example, smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where it is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:

Get Table Schema

In [0]:
%%bigquery --project isb-cgc-02-0001
# The BigQuery cell magic above runs the query below in the given project and
# returns the result as a Pandas DataFrame. (The magic must be the first line
# of the cell.) Change isb-cgc-02-0001 to your project ID.
SELECT column_name
FROM `isb-cgc.TCGA_bioclin_v0.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'Clinical'

# Syntax of the above query
# SELECT * 
# FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
# WHERE table_catalog=project_name and table_schema=dataset_name and table_name=table_name
Out[0]:
column_name
0 program_name
1 case_barcode
2 case_gdc_id
3 program_dbgap_accession_number
4 project_short_name
5 project_name
6 disease_code
7 gender
8 vital_status
9 race
10 ethnicity
11 age_at_diagnosis
12 days_to_birth
13 days_to_death
14 days_to_initial_pathologic_diagnosis
15 days_to_last_followup
16 days_to_last_known_alive
17 days_to_submitted_specimen_dx
18 clinical_stage
19 clinical_T
20 clinical_N
21 clinical_M
22 pathologic_stage
23 pathologic_T
24 pathologic_N
25 pathologic_M
26 year_of_initial_pathologic_diagnosis
27 tumor_tissue_site
28 primary_neoplasm_melanoma_dx
29 anatomic_neoplasm_subdivision
... ...
43 residual_tumor
44 tumor_type
45 new_tumor_event_after_initial_treatment
46 lymphnodes_examined
47 number_of_lymphnodes_examined
48 number_of_lymphnodes_positive_by_he
49 lymphatic_invasion
50 venous_invasion
51 lymphovascular_invasion_present
52 bcr
53 batch_number
54 tss_code
55 age_began_smoking_in_years
56 year_of_tobacco_smoking_onset
57 stopped_smoking_year
58 tobacco_smoking_history
59 number_pack_years_smoked
60 height
61 weight
62 bmi
63 mononucleotide_and_dinucleotide_marker_panel_a...
64 menopause_status
65 pregnancies
66 hpv_status
67 hpv_calls
68 h_pylori_infection
69 gleason_score_combined
70 psa_value
71 colorectal_cancer
72 history_of_colon_polyps

73 rows × 1 columns

That's a lot of fields! We can also get at the schema programmatically:

Programmatically Get Schema

In [0]:
# Create a reference for the table
table_ref = "isb-cgc.TCGA_bioclin_v0.Clinical"

# Get the table
table = client.get_table(table_ref)
# Create lists of the field names and field types
fieldNames = [field.name for field in table.schema]
fieldTypes = [field.field_type for field in table.schema]
# Print the number of fields and the first 5 fields
print("This table has {} fields. ".format(len(fieldNames)))
print("The first few field names and types are: ")
for i in range(5):
  print("{} - {}".format(fieldNames[i], fieldTypes[i]))
This table has 73 fields. 
The first few field names and types are: 
program_name - STRING
case_barcode - STRING
case_gdc_id - STRING
program_dbgap_accession_number - STRING
project_short_name - STRING

Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled in (not NULL), or how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:

In [0]:
%%bigquery --project isb-cgc-02-0001
SELECT tobacco_smoking_history,
COUNT(*) AS n
FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
GROUP BY tobacco_smoking_history
ORDER BY n DESC
Out[0]:
tobacco_smoking_history n
0 NaN 8316
1 1.0 865
2 4.0 799
3 2.0 710
4 3.0 568
5 5.0 57

But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and on the maximum number of unique values (since fields such as case_barcode are unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).

Find Interesting Fields

In [0]:
# Get the number of rows (patients) in the table
numPatients = table.num_rows
# Print the total number of patients
print(" The {} table describes a total of {} patients. ".format(table.table_id, numPatients))

# Set thresholds: the minimum number of patients that a field should have
# values for, and the maximum number of unique values it may take
minNumPatients = int(numPatients*0.80)
maxNumValues = 50

# Counter for the number of interesting fields found by the loop below
numInteresting = 0
# Create a list to hold the results from the loop below
iList = []

# Loop over the fields; for each, summarize the number of values, the number
# of unique values, and (for numeric fields) the mean and standard deviation
for iField in range(len(fieldNames)):
  aField = fieldNames[iField]
  aType = fieldTypes[iField]
  qString = "SELECT {} FROM `isb-cgc.TCGA_bioclin_v0.Clinical`".format(aField)
  df = client.query(qString).result().to_dataframe()
  summary = df[str(aField)].describe()
  if ( aType == "STRING" ):
    topFrac = float(summary['freq'])/float(summary['count'])
    if ( summary['count'] >= minNumPatients ):
      if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
        if ( topFrac < 0.90 ):
          numInteresting += 1
          iList += [aField]
          print("     > {} has {} values with {} unique ({} occurs {} times)".format(aField, summary['count'], summary['unique'], summary['top'], round(summary['freq'],2)))
  else:
    if ( summary['count'] >= minNumPatients ):
      if ( summary['std'] > 0.1 ):
        numInteresting += 1
        iList += [aField]
        print("     > {} has {} values (mean={}, sigma={}) ".format(aField, summary['count'], round(summary['mean'], 2), round(summary['std'], 2)))

print(" ")
print(" Found {} potentially interesting features: ".format(numInteresting))
print("   ", iList)
 The Clinical table describes a total of 11315 patients. 
     > project_short_name has 11315 values with 33 unique (TCGA-BRCA occurs 1098 times)
     > project_name has 11315 values with 33 unique (Breast Invasive Carcinoma occurs 1098 times)
     > disease_code has 11315 values with 33 unique (BRCA occurs 1098 times)
     > gender has 11160 values with 2 unique (FEMALE occurs 5815 times)
     > vital_status has 11156 values with 2 unique (Alive occurs 7534 times)
     > race has 9835 values with 5 unique (WHITE occurs 8186 times)
     > age_at_diagnosis has 11109.0 values (mean=59.1, sigma=14.42) 
     > days_to_birth has 11041.0 values (mean=-21762.93, sigma=5266.39) 
     > days_to_last_known_alive has 11102.0 values (mean=1037.33, sigma=1040.86) 
     > year_of_initial_pathologic_diagnosis has 11030.0 values (mean=2008.03, sigma=4.32) 
     > person_neoplasm_cancer_status has 10236 values with 2 unique (TUMOR FREE occurs 6507 times)
     > batch_number has 11160.0 values (mean=203.23, sigma=134.56) 
 
 Found 12 potentially interesting features: 
    ['project_short_name', 'project_name', 'disease_code', 'gender', 'vital_status', 'race', 'age_at_diagnosis', 'days_to_birth', 'days_to_last_known_alive', 'year_of_initial_pathologic_diagnosis', 'person_neoplasm_cancer_status', 'batch_number']

The above helps us narrow down which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:

In [0]:
%%bigquery --project isb-cgc-02-0001
SELECT menopause_status, COUNT(*) AS n
FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC
Out[0]:
menopause_status n
0 Post (prior bilateral ovariectomy OR >12 mo si... 1291
1 Pre (<6 months since LMP AND no prior bilatera... 389
2 Peri (6-12 months since last menstrual period) 82
3 Indeterminate (neither Pre or Postmenopausal) 54

We might wonder which specific tumor types have menopause information:

In [0]:
%%bigquery --project isb-cgc-02-0001
SELECT project_short_name, COUNT(*) AS n
FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE menopause_status IS NOT NULL
GROUP BY project_short_name
ORDER BY n DESC
Out[0]:
project_short_name n
0 TCGA-BRCA 1007
1 TCGA-UCEC 517
2 TCGA-CESC 237
3 TCGA-UCS 55
In [0]:
%%bigquery --project isb-cgc-02-0001 
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC
Out[0]:
hpv_status hpv_calls n
0 Negative None 664
1 Positive HPV16 238
2 Positive HPV18 41
3 Positive HPV33 25
4 Positive HPV45 24

TCGA Annotations

An additional factor to consider when creating a cohort is that there may be information that leads one to exclude a particular patient. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment. Since different researchers may have different criteria for using or excluding certain patients or samples from their analyses, an overview of the annotations can be found here. These annotations have also been uploaded into a BigQuery table and can be used in conjunction with the other BigQuery tables.
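
For example, a query along the following lines (a sketch that uses the same Annotations fields that appear in the cohort query below; replace isb-cgc-02-0001 with your own project ID) gives a quick overview of the patient-level annotations:

In [0]:
%%bigquery --project isb-cgc-02-0001
-- Count patient-level annotations by classification and category
SELECT classification, category, COUNT(*) AS n
FROM `isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE entity_type = "Patient"
GROUP BY classification, category
ORDER BY n DESC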

Create a Cohort from Two Tables

Now that we have a better idea of what types of information are available in the Clinical data table, let's create a cohort consisting of female breast-cancer patients diagnosed at the age of 50 or younger.

In this next code cell, we define several queries within a WITH clause, which allows us to combine them in a final query. We then save the results to a named Pandas DataFrame so they can be analyzed later.

  • the first query, called select_on_annotations, finds all patients in the Annotations table who have either been 'redacted' or had 'unacceptable prior treatment';
  • the second query, select_on_clinical, selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and
  • the final query joins these two together and returns just those patients that meet the clinical criteria and do not meet the exclusion criteria.

Create a Query for a Cohort from Two Tables

In [0]:
%%bigquery early_onset_breast_cancer --project isb-cgc-02-0001
-- The cell magic above (which must be the first line of the cell) names the
-- destination DataFrame (early_onset_breast_cancer) and supplies the project
-- ID (--project). Change isb-cgc-02-0001 to your project ID.
WITH
  select_on_annotations AS (
  SELECT
    case_barcode,
    category AS categoryName,
    classification AS classificationName
  FROM
    `isb-cgc.TCGA_bioclin_v0.Annotations`
  WHERE
    ( entity_type="Patient"
      AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
        OR classification="Redaction" ) )
  GROUP BY
    case_barcode,
    categoryName,
    classificationName
),
--
select_on_clinical AS (
  SELECT
    case_barcode,
    vital_status,
    days_to_last_known_alive,
    ethnicity,
    histological_type,
    menopause_status,
    race
  FROM
    `isb-cgc.TCGA_bioclin_v0.Clinical`
  WHERE
    ( disease_code = "BRCA"
      AND age_at_diagnosis<=50
      AND gender="FEMALE" )
)
--
SELECT
  case_barcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    c.case_barcode
  FROM select_on_annotations AS a
  FULL JOIN select_on_clinical AS c
  ON
    a.case_barcode = c.case_barcode
  WHERE
    a.case_barcode IS NOT NULL
    OR c.case_barcode IS NOT NULL
  ORDER BY
    a.classificationName,
    a.categoryName,
    c.case_barcode
)
WHERE
  categoryName IS NULL
  AND classificationName IS NULL
  AND case_barcode IS NOT NULL
ORDER BY
  case_barcode
In [0]:
early_onset_breast_cancer.head()
Out[0]:
case_barcode
0 TCGA-3C-AALI
1 TCGA-4H-AAAK
2 TCGA-5L-AAT0
3 TCGA-A1-A0SH
4 TCGA-A1-A0SJ

Useful Tricks

Before we leave off, here are a few useful tricks for working with BigQuery:

  • If you want to see how much data and which tables will be touched by a query, you can use the "dry run" option.
  • You can also build the query as a string variable and load the results into a DataFrame yourself, instead of using the cell magic:
In [0]:
# Create a variable with the query as a string
breast_cancer_query = """
WITH
  select_on_annotations AS (
  SELECT
    case_barcode,
    category AS categoryName,
    classification AS classificationName
  FROM
    `isb-cgc.TCGA_bioclin_v0.Annotations`
  WHERE
    ( entity_type="Patient"
      AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
        OR classification="Redaction" ) )
  GROUP BY
    case_barcode,
    categoryName,
    classificationName
),
--
select_on_clinical AS (
  SELECT
    case_barcode,
    vital_status,
    days_to_last_known_alive,
    ethnicity,
    histological_type,
    menopause_status,
    race
  FROM
    `isb-cgc.TCGA_bioclin_v0.Clinical`
  WHERE
    ( disease_code = "BRCA"
      AND age_at_diagnosis<=50
      AND gender="FEMALE" )
)
--
SELECT
  case_barcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    c.case_barcode
  FROM select_on_annotations AS a
  FULL JOIN select_on_clinical AS c
  ON
    a.case_barcode = c.case_barcode
  WHERE
    a.case_barcode IS NOT NULL
    OR c.case_barcode IS NOT NULL
  ORDER BY
    a.classificationName,
    a.categoryName,
    c.case_barcode
)
WHERE
  categoryName IS NULL
  AND classificationName IS NULL
  AND case_barcode IS NOT NULL
ORDER BY
  case_barcode
"""

Since this is a large query, we might want to check the number of bytes it will process. We can use a dry run to see how many bytes would be processed without actually running the query.

In [0]:
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True
job_config.use_query_cache = False
query_job = client.query(
    (breast_cancer_query),
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
    job_config=job_config,
)  # API request

# A dry run query completes immediately.
assert query_job.state == "DONE"
assert query_job.dry_run

print("This query will process {} Kilobytes.".format(query_job.total_bytes_processed*0.001))
This query will process 788.043 Kilobytes.

For more information on price estimation, please see the Estimating storage and query costs and BigQuery best practices: Controlling costs pages.

Below, we use the same query variable to run the query and put the result into a Pandas DataFrame for later analysis.

In [0]:
query = client.query(breast_cancer_query)
breast_cancer_query_results = query.result().to_dataframe()
breast_cancer_query_results.head(5)
Out[0]:
case_barcode
0 TCGA-3C-AALI
1 TCGA-4H-AAAK
2 TCGA-5L-AAT0
3 TCGA-A1-A0SH
4 TCGA-A1-A0SJ
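
Once the cohort is in a DataFrame, you can feed it back into further queries. Below is a minimal sketch of this pattern; it assumes that the Biospecimen table carries case_barcode and sample_barcode columns (check its schema, as shown earlier, before relying on this), and passes the cohort's barcodes in as a query parameter:

In [0]:
# Pass the cohort's case barcodes into a parameterized query against the
# Biospecimen table to retrieve the corresponding sample barcodes.
# NOTE: the sample_barcode column name is an assumption; verify it against
# the table schema first.
cohort = breast_cancer_query_results['case_barcode'].tolist()

sample_query = """
SELECT case_barcode, sample_barcode
FROM `isb-cgc.TCGA_bioclin_v0.Biospecimen`
WHERE case_barcode IN UNNEST(@cohort)
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("cohort", "STRING", cohort)]
)
cohort_samples = client.query(sample_query, job_config=job_config).result().to_dataframe()
cohort_samples.head()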