ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to create a complext cohort
Author:  Lauren Hagen
Created: 2019-08-12
Purpose: More complex overview of creating cohorts
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_a_complex_cohort.ipynb
Notes:   This notebook was adapted from work by Shiela Reynolds, 'How to Create TCGA Cohorts part 3' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.

This notebook will construct a cohort for a single tumor type based on data availability, while also taking into consideration annotations about the patients or samples.

As you've seen already, in order to work with BigQuery, the first thing we need to do is import the bigquery module:

In [0]:
from google.cloud import bigquery

And we will need to Authenticate ourselves.

In [2]:
!gcloud auth application-default login
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=acrfkWr7xOoV243XaGJFWtUaFIPXXMWRTbZtsKTMmGE&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/qQGcKBsJfiBHp-IaAtot6ViuE6xFuruVxGq60zoersfxKp3re1Qfy6k

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


To take a quick anonymous survey, run:
  $ gcloud alpha survey

Just so that this doesn't get buried in the code below, we are going to specify our tumor-type of interest here. In TCGA each tumor-type is also a separate study within the TCGA project. The studies are referred to based on the 2-4 letter tumor-type abbreviation. A complete list of all study abbreviations, with the full study name can be found on this page. For this particular exercise, we will look at the "Breast invasive carcinoma" study, abbreviated BRCA:

In [0]:
studyName = "TCGA-BRCA"

More information the the BRCA study can be found here. In this notebook, we are going to wind up making use of all of the available data types, so let's have a look at the entire TCGA_hg38_data_v0 dataset in the query below. The tables and data sets available from ISB-CGC in BigQuery can also be explored without login with the ISB-CGC BigQuery Table Searcher.

In [0]:
# Create a client to access the data within BigQuery
client = bigquery.Client('isb-cgc')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project
In [5]:
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('TCGA_hg38_data_v0'))
# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
Tables:
	Copy_Number_Segment_Masked
	Copy_Number_Segment_Masked_r14
	DNA_Methylation
	DNA_Methylation_chr1
	DNA_Methylation_chr10
	DNA_Methylation_chr11
	DNA_Methylation_chr12
	DNA_Methylation_chr13
	DNA_Methylation_chr14
	DNA_Methylation_chr15
	DNA_Methylation_chr16
	DNA_Methylation_chr17
	DNA_Methylation_chr18
	DNA_Methylation_chr19
	DNA_Methylation_chr2
	DNA_Methylation_chr20
	DNA_Methylation_chr21
	DNA_Methylation_chr22
	DNA_Methylation_chr3
	DNA_Methylation_chr4
	DNA_Methylation_chr5
	DNA_Methylation_chr6
	DNA_Methylation_chr7
	DNA_Methylation_chr8
	DNA_Methylation_chr9
	DNA_Methylation_chrX
	DNA_Methylation_chrY
	Protein_Expression
	RNAseq_Gene_Expression
	Somatic_Mutation
	Somatic_Mutation_DR10
	Somatic_Mutation_DR6
	Somatic_Mutation_DR7
	miRNAseq_Expression
	miRNAseq_Isoform_Expression
	tcga_metadata_data_hg38_r14

In this next code cell, we define an SQL query called DNU_patients which finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment'. It will display the output of the query and then save the data to a Pandas DataFrame.

Now we'll use the query defined above to get the "Do Not Use" list of participants (aka patients):

Note: you will need to update 'your_project_number' with your project number before continuing with the notebook

In [0]:
%%bigquery DNU_patients --your_project_number

SELECT
  case_barcode,
  category AS categoryName,
  classification AS classificationName
FROM
  `isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE
  ( entity_type="Patient"
    AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
      OR classification="Redaction" ) )
GROUP BY
  case_barcode,
  categoryName,
  classificationName
ORDER BY
  case_barcode
In [7]:
DNU_patients.describe()
Out[7]:
case_barcode categoryName classificationName
count 212 212 212
unique 212 8 2
top TCGA-BP-5006 History of unacceptable prior treatment relate... Notification
freq 1 137 137

Now we're gong to use the query defined previously in a function that builds a "clean" list of patients in the specified study, with available molecular data, and without any disqualifying annotations.

In [0]:
def buildCleanBarcodeList ( studyName, bqDataset, barcodeType, DNUList):
  print("in buildCleanBarcodeList ... ", studyName) # Print a start statement
  ULists = [] # Define an empty list for Unique Barcodes
  print("  --> looping over data tables: ") # Print an indication of when the loop is beginning
  barcodeField = barcodeType # Set the barcodeField for each loop
  # For each dataset in the bqDataset table
  for t in bqDataset:
    currTable = "`isb-cgc.TCGA_hg38_data_v0." + t.table_id + "`" # Set the current table
    barcodeField = barcodeType # reset the barcode Field for each loop
    # Try the simple query that will get the barcode list from most of the tables
    try:
      # Set query string
      get_barcode_list = """
        SELECT {} as barcode
        FROM {}
        WHERE project_short_name="{}"
       GROUP BY {}
        ORDER BY {}
       """.format(barcodeField, currTable, studyName, barcodeField, barcodeField)
      # Query BigQuery
      results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
      # Store the results into a dataframe
      results = results.result().to_dataframe()
      # If the simple query does not run, try joining the two tables to add the
      # project_short_name to the the query as it is not in some of the Methylation tables
    except:
      try:
         get_barcode_list = """
            WITH a AS(
                SELECT {}
                FROM {}),
              b AS ( SELECT {}, project_short_name
                FROM `isb-cgc.TCGA_hg38_data_v0.Copy_Number_Segment_Masked`)
            SELECT {} as barcode
            FROM (
              SELECT a.{} AS {}, b.project_short_name AS project_short_name
              FROM a
              JOIN b
              ON a.{} = b.{})
            WHERE project_short_name='{}'
            GROUP BY {}
            ORDER BY {}
            """.format(barcodeField, currTable, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, studyName, barcodeField, barcodeField)
         # Query BigQuery
         results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
         # Store the results into a dataframe
         results = results.result().to_dataframe()
      except:
        try:
          barcodeField = "sample_barcode_tumor"
          get_barcode_list = """
            SELECT {} as barcode
            FROM {}
            WHERE project_short_name="{}"
            GROUP BY {}
            ORDER BY {}
            """.format(barcodeField, currTable, studyName, barcodeField, barcodeField)
          # Query BigQuery
          results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
          # Store the results into a dataframe
          results = results.result().to_dataframe()
        except:
          # Create an empty dataframe if none of the above queries return values
          results = pd.DataFrame()
    # If there is data in the result dataframe, add the results to ULists
    if ( len(results) > 0):
        print("      ", t.table_id, "  --> ", len(results["barcode"]), "unique barcodes.")
        ULists += [ results ]
    else:
        print("      ", t.table_id, " --> no data returned")
  print("  --> we have {} lists to merge".format(len(ULists)))
  masterList = [] # Create a list for the master list of barcodes
  # Add barcodes to the master list with no repeating barcodes
  for aList in ULists:
    for aBarcode in aList["barcode"]:
      if ( aBarcode not in masterList ):
        masterList += [aBarcode]
  print("  --> which results in a single list with {} barcodes".format(len(masterList)))
  print("  --> removing DNU barcodes")
  cleanList = []
  # For each barcode in the master check to see if it is in the DNU
  # list and then add it to the clean list
  for aBarcode in masterList:
    if ( aBarcode not in DNUList[barcodeField].tolist() ):
      cleanList += [ aBarcode ]
    else:
      print("      excluding this barcode", aBarcode)
  print("  --> returning a clean list with {} barcodes".format(len(cleanList)))
  return(cleanList)
In [14]:
barcodeType = "case_barcode"
cleanPatientList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNU_patients )
in buildCleanBarcodeList ...  TCGA-BRCA
  --> looping over data tables: 
       Copy_Number_Segment_Masked   -->  1096 unique barcodes.
       Copy_Number_Segment_Masked_r14   -->  1098 unique barcodes.
       DNA_Methylation   -->  1095 unique barcodes.
       DNA_Methylation_chr1   -->  1095 unique barcodes.
       DNA_Methylation_chr10   -->  1095 unique barcodes.
       DNA_Methylation_chr11   -->  1095 unique barcodes.
       DNA_Methylation_chr12   -->  1095 unique barcodes.
       DNA_Methylation_chr13   -->  1095 unique barcodes.
       DNA_Methylation_chr14   -->  1095 unique barcodes.
       DNA_Methylation_chr15   -->  1095 unique barcodes.
       DNA_Methylation_chr16   -->  1095 unique barcodes.
       DNA_Methylation_chr17   -->  1095 unique barcodes.
       DNA_Methylation_chr18   -->  1095 unique barcodes.
       DNA_Methylation_chr19   -->  1095 unique barcodes.
       DNA_Methylation_chr2   -->  1095 unique barcodes.
       DNA_Methylation_chr20   -->  1095 unique barcodes.
       DNA_Methylation_chr21   -->  1095 unique barcodes.
       DNA_Methylation_chr22   -->  1095 unique barcodes.
       DNA_Methylation_chr3   -->  1095 unique barcodes.
       DNA_Methylation_chr4   -->  1095 unique barcodes.
       DNA_Methylation_chr5   -->  1095 unique barcodes.
       DNA_Methylation_chr6   -->  1095 unique barcodes.
       DNA_Methylation_chr7   -->  1095 unique barcodes.
       DNA_Methylation_chr8   -->  1095 unique barcodes.
       DNA_Methylation_chr9   -->  1095 unique barcodes.
       DNA_Methylation_chrX   -->  1095 unique barcodes.
       DNA_Methylation_chrY   -->  1095 unique barcodes.
       Protein_Expression   -->  887 unique barcodes.
       RNAseq_Gene_Expression   -->  1092 unique barcodes.
       Somatic_Mutation   -->  986 unique barcodes.
       Somatic_Mutation_DR10   -->  986 unique barcodes.
       Somatic_Mutation_DR6   -->  986 unique barcodes.
       Somatic_Mutation_DR7   -->  986 unique barcodes.
       miRNAseq_Expression   -->  1079 unique barcodes.
       miRNAseq_Isoform_Expression   -->  1079 unique barcodes.
       tcga_metadata_data_hg38_r14   -->  1098 unique barcodes.
  --> we have 36 lists to merge
  --> which results in a single list with 1098 barcodes
  --> removing DNU barcodes
      excluding this barcode TCGA-5L-AAT1
      excluding this barcode TCGA-A8-A084
      excluding this barcode TCGA-A8-A08F
      excluding this barcode TCGA-A8-A08S
      excluding this barcode TCGA-A8-A09E
      excluding this barcode TCGA-A8-A09K
      excluding this barcode TCGA-AR-A2LL
      excluding this barcode TCGA-AR-A2LR
      excluding this barcode TCGA-BH-A0B6
      excluding this barcode TCGA-BH-A1F5
      excluding this barcode TCGA-D8-A146
  --> returning a clean list with 1087 barcodes

Now we are going to repeat the same process, but at the sample barcode level. Most patients will have provided two samples, a "primary tumor" sample, and a "normal blood" sample, but in some cases additional or different types of samples may have been provided, and sample-level annotations may exist that should result in samples being excluded from most downstream analyses.

In [0]:
%%bigquery DNUsamples --project your_project_number

# there are many different types of annotations that are at the "sample" level
# in the Annotations table, and most of them seem like they should be "disqualifying"
# annotations, so for now we will just return all sample barcodes with sample-level
# annotations
SELECT
  sample_barcode
FROM
  `isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE
  (entity_type="Sample")
GROUP BY
  sample_barcode
ORDER BY
  sample_barcode
In [16]:
DNUsamples.describe()
Out[16]:
sample_barcode
count 113
unique 113
top TCGA-06-0149-01A
freq 1

And now we can re-use the previously defined function get a clean list of sample-level barcodes:

In [27]:
barcodeType = "sample_barcode"
cleanSampleList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNUsamples  )
in buildCleanBarcodeList ...  TCGA-BRCA
  --> looping over data tables: 
       Copy_Number_Segment_Masked   -->  2219 unique barcodes.
       Copy_Number_Segment_Masked_r14   -->  2224 unique barcodes.
       DNA_Methylation   -->  1205 unique barcodes.
       DNA_Methylation_chr1   -->  1205 unique barcodes.
       DNA_Methylation_chr10   -->  1205 unique barcodes.
       DNA_Methylation_chr11   -->  1205 unique barcodes.
       DNA_Methylation_chr12   -->  1205 unique barcodes.
       DNA_Methylation_chr13   -->  1205 unique barcodes.
       DNA_Methylation_chr14   -->  1205 unique barcodes.
       DNA_Methylation_chr15   -->  1205 unique barcodes.
       DNA_Methylation_chr16   -->  1205 unique barcodes.
       DNA_Methylation_chr17   -->  1205 unique barcodes.
       DNA_Methylation_chr18   -->  1205 unique barcodes.
       DNA_Methylation_chr19   -->  1205 unique barcodes.
       DNA_Methylation_chr2   -->  1205 unique barcodes.
       DNA_Methylation_chr20   -->  1205 unique barcodes.
       DNA_Methylation_chr21   -->  1205 unique barcodes.
       DNA_Methylation_chr22   -->  1205 unique barcodes.
       DNA_Methylation_chr3   -->  1205 unique barcodes.
       DNA_Methylation_chr4   -->  1205 unique barcodes.
       DNA_Methylation_chr5   -->  1205 unique barcodes.
       DNA_Methylation_chr6   -->  1205 unique barcodes.
       DNA_Methylation_chr7   -->  1205 unique barcodes.
       DNA_Methylation_chr8   -->  1205 unique barcodes.
       DNA_Methylation_chr9   -->  1205 unique barcodes.
       DNA_Methylation_chrX   -->  1205 unique barcodes.
       DNA_Methylation_chrY   -->  1205 unique barcodes.
       Protein_Expression   -->  937 unique barcodes.
       RNAseq_Gene_Expression   -->  1217 unique barcodes.
       Somatic_Mutation   -->  986 unique barcodes.
       Somatic_Mutation_DR10   -->  986 unique barcodes.
       Somatic_Mutation_DR6   -->  986 unique barcodes.
       Somatic_Mutation_DR7   -->  986 unique barcodes.
       miRNAseq_Expression   -->  1202 unique barcodes.
       miRNAseq_Isoform_Expression   -->  1202 unique barcodes.
       tcga_metadata_data_hg38_r14   -->  3298 unique barcodes.
  --> we have 36 lists to merge
  --> which results in a single list with 3339 barcodes
  --> removing DNU barcodes
      excluding this barcode TCGA-B6-A1KC-01A
  --> returning a clean list with 3338 barcodes

Now we're going to double-check first that we keep only sample-level barcodes that correspond to patients in the "clean" list of patients, and then we'll also filter the list of patients in case there are patients with no samples remaining in the "clean" list of samples.

In [28]:
finalSampleList = []
for aSample in cleanSampleList:
  aPatient = aSample[:12]
  if ( aPatient not in cleanPatientList ):
    print("     excluding this sample in the final pass: ", aSample)
  else:
    finalSampleList += [aSample]
    
print(" Length of final sample list: {} ".format(len(finalSampleList)))
     excluding this sample in the final pass:  TCGA-5L-AAT1-01A
     excluding this sample in the final pass:  TCGA-5L-AAT1-10A
     excluding this sample in the final pass:  TCGA-A8-A084-01A
     excluding this sample in the final pass:  TCGA-A8-A084-10A
     excluding this sample in the final pass:  TCGA-A8-A08F-01A
     excluding this sample in the final pass:  TCGA-A8-A08F-10A
     excluding this sample in the final pass:  TCGA-A8-A08S-01A
     excluding this sample in the final pass:  TCGA-A8-A08S-10A
     excluding this sample in the final pass:  TCGA-A8-A09E-01A
     excluding this sample in the final pass:  TCGA-A8-A09E-10A
     excluding this sample in the final pass:  TCGA-A8-A09K-01A
     excluding this sample in the final pass:  TCGA-A8-A09K-10A
     excluding this sample in the final pass:  TCGA-AR-A2LL-01A
     excluding this sample in the final pass:  TCGA-AR-A2LL-10A
     excluding this sample in the final pass:  TCGA-AR-A2LR-01A
     excluding this sample in the final pass:  TCGA-AR-A2LR-10A
     excluding this sample in the final pass:  TCGA-BH-A0B6-01A
     excluding this sample in the final pass:  TCGA-BH-A0B6-10A
     excluding this sample in the final pass:  TCGA-BH-A1F5-01A
     excluding this sample in the final pass:  TCGA-BH-A1F5-11A
     excluding this sample in the final pass:  TCGA-D8-A146-01A
     excluding this sample in the final pass:  TCGA-D8-A146-10A
     excluding this sample in the final pass:  TCGA-5L-AAT1-01Z
     excluding this sample in the final pass:  TCGA-A8-A084-01Z
     excluding this sample in the final pass:  TCGA-A8-A08F-01Z
     excluding this sample in the final pass:  TCGA-A8-A08S-01Z
     excluding this sample in the final pass:  TCGA-A8-A09E-01Z
     excluding this sample in the final pass:  TCGA-A8-A09K-01Z
     excluding this sample in the final pass:  TCGA-AR-A2LL-01Z
     excluding this sample in the final pass:  TCGA-AR-A2LR-01Z
     excluding this sample in the final pass:  TCGA-BH-A0B6-01Z
     excluding this sample in the final pass:  TCGA-BH-A1F5-01Z
     excluding this sample in the final pass:  TCGA-D8-A146-01Z
 Length of final sample list: 3305 
In [29]:
finalPatientList = []
for aSample in finalSampleList:
  aPatient = aSample[:12]
  if ( aPatient not in finalPatientList ):
    finalPatientList += [ aPatient ]
    
print(" Lenth of final patient list: {} ".format(len(finalPatientList)))

for aPatient in cleanPatientList:
  if ( aPatient not in finalPatientList ):
    print("     --> patient removed in final pass: ", aPatient)
 Lenth of final patient list: 1087 

We're also interested in knowing what types of samples we have. The codes for the major types of samples are:

  • 01 : primary solid tumor
  • 02 : recurrent solid tumor
  • 03 : primary blood derived cancer
  • 06 : metastatic
  • 10 : blood derived normal
  • 11 : solid tissue normal

and a complete list of all sample type codes and their definitions can be found here.

In [30]:
sampleCounts = {}
for aSample in finalSampleList:
  sType = str(aSample[13:15])
  if ( sType not in sampleCounts ): sampleCounts[sType] = 0
  sampleCounts[sType] += 1
  
for aKey in sorted(sampleCounts):
  print("     {} samples of type {} ".format(sampleCounts[aKey], aKey))
     2145 samples of type 01 
     8 samples of type 06 
     991 samples of type 10 
     161 samples of type 11 

Now we are going to create a simple dataframe with all of the sample barcodes and the associated patient (participant) barcodes so that we can write this to a BigQuery "cohort" table.

In [31]:
import pandas as pd

patientBarcodes = []
sampleBarcodes = []
for aSample in finalSampleList:
  sampleBarcodes += [aSample]
  patientBarcodes += [aSample[:12]]
df = pd.DataFrame ( { 'ParticipantBarcode': patientBarcodes,
                      'SampleBarcode': sampleBarcodes } )
df.describe()
Out[31]:
ParticipantBarcode SampleBarcode
count 3305 3305
unique 1087 3305
top TCGA-A7-A0DB TCGA-AN-A0XT-01A
freq 5 1

As a next step you may want to consider is to put the data into your GCP. An example of how to move files in and out of GCP and BigQuery can be found here along with other tutorial notebooks.