Check out more notebooks at our Community Notebooks Repository!
Title: How to create a complext cohort
Author: Lauren Hagen
Created: 2019-08-12
Purpose: More complex overview of creating cohorts
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_a_complex_cohort.ipynb
Notes: This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 3' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.
This notebook will construct a cohort for a single tumor type based on data availability, while also taking into consideration annotations about the patients or samples.
As you've seen already, in order to work with BigQuery, the first thing we need to do is import the bigquery module:
from google.cloud import bigquery
And we will need to Authenticate ourselves.
!gcloud auth application-default login
Go to the following link in your browser: https://accounts.google.com/o/oauth2/auth?code_challenge=acrfkWr7xOoV243XaGJFWtUaFIPXXMWRTbZtsKTMmGE&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth Enter verification code: 4/qQGcKBsJfiBHp-IaAtot6ViuE6xFuruVxGq60zoersfxKp3re1Qfy6k Credentials saved to file: [/content/.config/application_default_credentials.json] These credentials will be used by any library that requests Application Default Credentials. To generate an access token for other uses, run: gcloud auth application-default print-access-token To take a quick anonymous survey, run: $ gcloud alpha survey
Just so that this doesn't get buried in the code below, we are going to specify our tumor-type of interest here. In TCGA each tumor-type is also a separate study within the TCGA project. The studies are referred to based on the 2-4 letter tumor-type abbreviation. A complete list of all study abbreviations, with the full study name can be found on this page. For this particular exercise, we will look at the "Breast invasive carcinoma" study, abbreviated BRCA:
studyName = "TCGA-BRCA"
More information the the BRCA study can be found here. In this notebook, we are going to wind up making use of all of the available data types, so let's have a look at the entire TCGA_hg38_data_v0
dataset in the query below. The tables and data sets available from ISB-CGC in BigQuery can also be explored without login with the ISB-CGC BigQuery Table Searcher.
# Create a client to access the data within BigQuery
client = bigquery.Client('isb-cgc')
# Create a variable of datasets
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('TCGA_hg38_data_v0'))
# If there are tables then print their names,
# else print that there are no tables
if tables:
for table in tables:
print("\t{}".format(table.table_id))
else:
print("\tThis dataset does not contain any tables.")
Tables: Copy_Number_Segment_Masked Copy_Number_Segment_Masked_r14 DNA_Methylation DNA_Methylation_chr1 DNA_Methylation_chr10 DNA_Methylation_chr11 DNA_Methylation_chr12 DNA_Methylation_chr13 DNA_Methylation_chr14 DNA_Methylation_chr15 DNA_Methylation_chr16 DNA_Methylation_chr17 DNA_Methylation_chr18 DNA_Methylation_chr19 DNA_Methylation_chr2 DNA_Methylation_chr20 DNA_Methylation_chr21 DNA_Methylation_chr22 DNA_Methylation_chr3 DNA_Methylation_chr4 DNA_Methylation_chr5 DNA_Methylation_chr6 DNA_Methylation_chr7 DNA_Methylation_chr8 DNA_Methylation_chr9 DNA_Methylation_chrX DNA_Methylation_chrY Protein_Expression RNAseq_Gene_Expression Somatic_Mutation Somatic_Mutation_DR10 Somatic_Mutation_DR6 Somatic_Mutation_DR7 miRNAseq_Expression miRNAseq_Isoform_Expression tcga_metadata_data_hg38_r14
In this next code cell, we define an SQL query called DNU_patients
which finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment'. It will display the output of the query and then save the data to a Pandas DataFrame.
Now we'll use the query defined above to get the "Do Not Use" list of participants (aka patients):
** Note: you will need to update 'your_project_number' with your project number before continuing with the notebook **
%%bigquery DNU_patients --your_project_number
SELECT
case_barcode,
category AS categoryName,
classification AS classificationName
FROM
`isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE
( entity_type="Patient"
AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
OR classification="Redaction" ) )
GROUP BY
case_barcode,
categoryName,
classificationName
ORDER BY
case_barcode
DNU_patients.describe()
case_barcode | categoryName | classificationName | |
---|---|---|---|
count | 212 | 212 | 212 |
unique | 212 | 8 | 2 |
top | TCGA-BP-5006 | History of unacceptable prior treatment relate... | Notification |
freq | 1 | 137 | 137 |
Now we're gong to use the query defined previously in a function that builds a "clean" list of patients in the specified study, with available molecular data, and without any disqualifying annotations.
def buildCleanBarcodeList ( studyName, bqDataset, barcodeType, DNUList):
print("in buildCleanBarcodeList ... ", studyName) # Print a start statement
ULists = [] # Define an empty list for Unique Barcodes
print(" --> looping over data tables: ") # Print an indication of when the loop is beginning
barcodeField = barcodeType # Set the barcodeField for each loop
# For each dataset in the bqDataset table
for t in bqDataset:
currTable = "`isb-cgc.TCGA_hg38_data_v0." + t.table_id + "`" # Set the current table
barcodeField = barcodeType # reset the barcode Field for each loop
# Try the simple query that will get the barcode list from most of the tables
try:
# Set query string
get_barcode_list = """
SELECT {} as barcode
FROM {}
WHERE project_short_name="{}"
GROUP BY {}
ORDER BY {}
""".format(barcodeField, currTable, studyName, barcodeField, barcodeField)
# Query BigQuery
results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
# Store the results into a dataframe
results = results.result().to_dataframe()
# If the simple query does not run, try joining the two tables to add the
# project_short_name to the the query as it is not in some of the Methylation tables
except:
try:
get_barcode_list = """
WITH a AS(
SELECT {}
FROM {}),
b AS ( SELECT {}, project_short_name
FROM `isb-cgc.TCGA_hg38_data_v0.Copy_Number_Segment_Masked`)
SELECT {} as barcode
FROM (
SELECT a.{} AS {}, b.project_short_name AS project_short_name
FROM a
JOIN b
ON a.{} = b.{})
WHERE project_short_name='{}'
GROUP BY {}
ORDER BY {}
""".format(barcodeField, currTable, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, studyName, barcodeField, barcodeField)
# Query BigQuery
results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
# Store the results into a dataframe
results = results.result().to_dataframe()
except:
try:
barcodeField = "sample_barcode_tumor"
get_barcode_list = """
SELECT {} as barcode
FROM {}
WHERE project_short_name="{}"
GROUP BY {}
ORDER BY {}
""".format(barcodeField, currTable, studyName, barcodeField, barcodeField)
# Query BigQuery
results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)
# Store the results into a dataframe
results = results.result().to_dataframe()
except:
# Create an empty dataframe if none of the above queries return values
results = pd.DataFrame()
# If there is data in the result dataframe, add the results to ULists
if ( len(results) > 0):
print(" ", t.table_id, " --> ", len(results["barcode"]), "unique barcodes.")
ULists += [ results ]
else:
print(" ", t.table_id, " --> no data returned")
print(" --> we have {} lists to merge".format(len(ULists)))
masterList = [] # Create a list for the master list of barcodes
# Add barcodes to the master list with no repeating barcodes
for aList in ULists:
for aBarcode in aList["barcode"]:
if ( aBarcode not in masterList ):
masterList += [aBarcode]
print(" --> which results in a single list with {} barcodes".format(len(masterList)))
print(" --> removing DNU barcodes")
cleanList = []
# For each barcode in the master check to see if it is in the DNU
# list and then add it to the clean list
for aBarcode in masterList:
if ( aBarcode not in DNUList[barcodeField].tolist() ):
cleanList += [ aBarcode ]
else:
print(" excluding this barcode", aBarcode)
print(" --> returning a clean list with {} barcodes".format(len(cleanList)))
return(cleanList)
barcodeType = "case_barcode"
cleanPatientList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNU_patients )
in buildCleanBarcodeList ... TCGA-BRCA --> looping over data tables: Copy_Number_Segment_Masked --> 1096 unique barcodes. Copy_Number_Segment_Masked_r14 --> 1098 unique barcodes. DNA_Methylation --> 1095 unique barcodes. DNA_Methylation_chr1 --> 1095 unique barcodes. DNA_Methylation_chr10 --> 1095 unique barcodes. DNA_Methylation_chr11 --> 1095 unique barcodes. DNA_Methylation_chr12 --> 1095 unique barcodes. DNA_Methylation_chr13 --> 1095 unique barcodes. DNA_Methylation_chr14 --> 1095 unique barcodes. DNA_Methylation_chr15 --> 1095 unique barcodes. DNA_Methylation_chr16 --> 1095 unique barcodes. DNA_Methylation_chr17 --> 1095 unique barcodes. DNA_Methylation_chr18 --> 1095 unique barcodes. DNA_Methylation_chr19 --> 1095 unique barcodes. DNA_Methylation_chr2 --> 1095 unique barcodes. DNA_Methylation_chr20 --> 1095 unique barcodes. DNA_Methylation_chr21 --> 1095 unique barcodes. DNA_Methylation_chr22 --> 1095 unique barcodes. DNA_Methylation_chr3 --> 1095 unique barcodes. DNA_Methylation_chr4 --> 1095 unique barcodes. DNA_Methylation_chr5 --> 1095 unique barcodes. DNA_Methylation_chr6 --> 1095 unique barcodes. DNA_Methylation_chr7 --> 1095 unique barcodes. DNA_Methylation_chr8 --> 1095 unique barcodes. DNA_Methylation_chr9 --> 1095 unique barcodes. DNA_Methylation_chrX --> 1095 unique barcodes. DNA_Methylation_chrY --> 1095 unique barcodes. Protein_Expression --> 887 unique barcodes. RNAseq_Gene_Expression --> 1092 unique barcodes. Somatic_Mutation --> 986 unique barcodes. Somatic_Mutation_DR10 --> 986 unique barcodes. Somatic_Mutation_DR6 --> 986 unique barcodes. Somatic_Mutation_DR7 --> 986 unique barcodes. miRNAseq_Expression --> 1079 unique barcodes. miRNAseq_Isoform_Expression --> 1079 unique barcodes. tcga_metadata_data_hg38_r14 --> 1098 unique barcodes. --> we have 36 lists to merge --> which results in a single list with 1098 barcodes --> removing DNU barcodes excluding this barcode TCGA-5L-AAT1 excluding this barcode TCGA-A8-A084 excluding this barcode TCGA-A8-A08F excluding this barcode TCGA-A8-A08S excluding this barcode TCGA-A8-A09E excluding this barcode TCGA-A8-A09K excluding this barcode TCGA-AR-A2LL excluding this barcode TCGA-AR-A2LR excluding this barcode TCGA-BH-A0B6 excluding this barcode TCGA-BH-A1F5 excluding this barcode TCGA-D8-A146 --> returning a clean list with 1087 barcodes
Now we are going to repeat the same process, but at the sample barcode level. Most patients will have provided two samples, a "primary tumor" sample, and a "normal blood" sample, but in some cases additional or different types of samples may have been provided, and sample-level annotations may exist that should result in samples being excluded from most downstream analyses.
%%bigquery DNUsamples --project your_project_number
# there are many different types of annotations that are at the "sample" level
# in the Annotations table, and most of them seem like they should be "disqualifying"
# annotations, so for now we will just return all sample barcodes with sample-level
# annotations
SELECT
sample_barcode
FROM
`isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE
(entity_type="Sample")
GROUP BY
sample_barcode
ORDER BY
sample_barcode
DNUsamples.describe()
sample_barcode | |
---|---|
count | 113 |
unique | 113 |
top | TCGA-06-0149-01A |
freq | 1 |
And now we can re-use the previously defined function get a clean list of sample-level barcodes:
barcodeType = "sample_barcode"
cleanSampleList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNUsamples )
in buildCleanBarcodeList ... TCGA-BRCA --> looping over data tables: Copy_Number_Segment_Masked --> 2219 unique barcodes. Copy_Number_Segment_Masked_r14 --> 2224 unique barcodes. DNA_Methylation --> 1205 unique barcodes. DNA_Methylation_chr1 --> 1205 unique barcodes. DNA_Methylation_chr10 --> 1205 unique barcodes. DNA_Methylation_chr11 --> 1205 unique barcodes. DNA_Methylation_chr12 --> 1205 unique barcodes. DNA_Methylation_chr13 --> 1205 unique barcodes. DNA_Methylation_chr14 --> 1205 unique barcodes. DNA_Methylation_chr15 --> 1205 unique barcodes. DNA_Methylation_chr16 --> 1205 unique barcodes. DNA_Methylation_chr17 --> 1205 unique barcodes. DNA_Methylation_chr18 --> 1205 unique barcodes. DNA_Methylation_chr19 --> 1205 unique barcodes. DNA_Methylation_chr2 --> 1205 unique barcodes. DNA_Methylation_chr20 --> 1205 unique barcodes. DNA_Methylation_chr21 --> 1205 unique barcodes. DNA_Methylation_chr22 --> 1205 unique barcodes. DNA_Methylation_chr3 --> 1205 unique barcodes. DNA_Methylation_chr4 --> 1205 unique barcodes. DNA_Methylation_chr5 --> 1205 unique barcodes. DNA_Methylation_chr6 --> 1205 unique barcodes. DNA_Methylation_chr7 --> 1205 unique barcodes. DNA_Methylation_chr8 --> 1205 unique barcodes. DNA_Methylation_chr9 --> 1205 unique barcodes. DNA_Methylation_chrX --> 1205 unique barcodes. DNA_Methylation_chrY --> 1205 unique barcodes. Protein_Expression --> 937 unique barcodes. RNAseq_Gene_Expression --> 1217 unique barcodes. Somatic_Mutation --> 986 unique barcodes. Somatic_Mutation_DR10 --> 986 unique barcodes. Somatic_Mutation_DR6 --> 986 unique barcodes. Somatic_Mutation_DR7 --> 986 unique barcodes. miRNAseq_Expression --> 1202 unique barcodes. miRNAseq_Isoform_Expression --> 1202 unique barcodes. tcga_metadata_data_hg38_r14 --> 3298 unique barcodes. --> we have 36 lists to merge --> which results in a single list with 3339 barcodes --> removing DNU barcodes excluding this barcode TCGA-B6-A1KC-01A --> returning a clean list with 3338 barcodes
Now we're going to double-check first that we keep only sample-level barcodes that correspond to patients in the "clean" list of patients, and then we'll also filter the list of patients in case there are patients with no samples remaining in the "clean" list of samples.
finalSampleList = []
for aSample in cleanSampleList:
aPatient = aSample[:12]
if ( aPatient not in cleanPatientList ):
print(" excluding this sample in the final pass: ", aSample)
else:
finalSampleList += [aSample]
print(" Length of final sample list: {} ".format(len(finalSampleList)))
excluding this sample in the final pass: TCGA-5L-AAT1-01A excluding this sample in the final pass: TCGA-5L-AAT1-10A excluding this sample in the final pass: TCGA-A8-A084-01A excluding this sample in the final pass: TCGA-A8-A084-10A excluding this sample in the final pass: TCGA-A8-A08F-01A excluding this sample in the final pass: TCGA-A8-A08F-10A excluding this sample in the final pass: TCGA-A8-A08S-01A excluding this sample in the final pass: TCGA-A8-A08S-10A excluding this sample in the final pass: TCGA-A8-A09E-01A excluding this sample in the final pass: TCGA-A8-A09E-10A excluding this sample in the final pass: TCGA-A8-A09K-01A excluding this sample in the final pass: TCGA-A8-A09K-10A excluding this sample in the final pass: TCGA-AR-A2LL-01A excluding this sample in the final pass: TCGA-AR-A2LL-10A excluding this sample in the final pass: TCGA-AR-A2LR-01A excluding this sample in the final pass: TCGA-AR-A2LR-10A excluding this sample in the final pass: TCGA-BH-A0B6-01A excluding this sample in the final pass: TCGA-BH-A0B6-10A excluding this sample in the final pass: TCGA-BH-A1F5-01A excluding this sample in the final pass: TCGA-BH-A1F5-11A excluding this sample in the final pass: TCGA-D8-A146-01A excluding this sample in the final pass: TCGA-D8-A146-10A excluding this sample in the final pass: TCGA-5L-AAT1-01Z excluding this sample in the final pass: TCGA-A8-A084-01Z excluding this sample in the final pass: TCGA-A8-A08F-01Z excluding this sample in the final pass: TCGA-A8-A08S-01Z excluding this sample in the final pass: TCGA-A8-A09E-01Z excluding this sample in the final pass: TCGA-A8-A09K-01Z excluding this sample in the final pass: TCGA-AR-A2LL-01Z excluding this sample in the final pass: TCGA-AR-A2LR-01Z excluding this sample in the final pass: TCGA-BH-A0B6-01Z excluding this sample in the final pass: TCGA-BH-A1F5-01Z excluding this sample in the final pass: TCGA-D8-A146-01Z Length of final sample list: 3305
finalPatientList = []
for aSample in finalSampleList:
aPatient = aSample[:12]
if ( aPatient not in finalPatientList ):
finalPatientList += [ aPatient ]
print(" Lenth of final patient list: {} ".format(len(finalPatientList)))
for aPatient in cleanPatientList:
if ( aPatient not in finalPatientList ):
print(" --> patient removed in final pass: ", aPatient)
Lenth of final patient list: 1087
We're also interested in knowing what types of samples we have. The codes for the major types of samples are:
and a complete list of all sample type codes and their definitions can be found here.
sampleCounts = {}
for aSample in finalSampleList:
sType = str(aSample[13:15])
if ( sType not in sampleCounts ): sampleCounts[sType] = 0
sampleCounts[sType] += 1
for aKey in sorted(sampleCounts):
print(" {} samples of type {} ".format(sampleCounts[aKey], aKey))
2145 samples of type 01 8 samples of type 06 991 samples of type 10 161 samples of type 11
Now we are going to create a simple dataframe with all of the sample barcodes and the associated patient (participant) barcodes so that we can write this to a BigQuery "cohort" table.
import pandas as pd
patientBarcodes = []
sampleBarcodes = []
for aSample in finalSampleList:
sampleBarcodes += [aSample]
patientBarcodes += [aSample[:12]]
df = pd.DataFrame ( { 'ParticipantBarcode': patientBarcodes,
'SampleBarcode': sampleBarcodes } )
df.describe()
ParticipantBarcode | SampleBarcode | |
---|---|---|
count | 3305 | 3305 |
unique | 1087 | 3305 |
top | TCGA-A7-A0DB | TCGA-AN-A0XT-01A |
freq | 5 | 1 |
As a next step you may want to consider is to put the data into your GCP. An example of how to move files in and out of GCP and BigQuery can be found here along with other tutorial notebooks.