ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Updated: 2023-08
Purpose: Painless intro to working with ISB-CGC in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:   This Quick Start Guide gives an overview of the data available in ISB-CGC and shows how to get started with a basic example in Python.

Quick Start Guide to ISB-CGC in BigQuery¶

Account Set-up¶

To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our Quick-Start Guide documentation page.

Libraries needed for the Notebook¶

This notebook requires the BigQuery API client library, which allows programmatic access to BigQuery.

In [ ]:

# GCP libraries
from google.cloud import bigquery
from google.colab import auth

Overview of ISB-CGC¶

The ISB-CGC provides interactive and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) and Proteomic Data Commons (PDC) from the National Cancer Institute (NCI) while leveraging many aspects of the Google Cloud Platform. You can also import your own data, analyze it side by side with the hosted datasets, and share it when you see fit. The ISB-CGC hosts carefully curated high-level clinical, biospecimen, and molecular datasets and tables in Google BigQuery, including data from programs such as The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and Clinical Proteomic Tumor Analysis Consortium (CPTAC). More information can be found on our Programs and Data Sets page. This data can be explored via Python, the Google Cloud Console, and/or our BigQuery Table Search tool.

Example of Accessing BigQuery Data with Python¶

Log into Google Cloud Storage and Authenticate ourselves¶

Steps to authenticate yourself:

Run the code block to authenticate yourself with your Google Cloud login
A second tab will open; if it does not, follow the link provided
Follow the prompts to authorize your account to use the Google Cloud SDK
Copy the code provided and paste it into the box under the command
Press Enter

Alternative authentication methods

In [ ]:

# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login
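
The cell above assumes the notebook is running in Google Colab. As a minimal sketch, the two authentication routes could be selected automatically. It relies on the common heuristic that a Colab runtime has `google.colab` loaded in `sys.modules`, which is an assumption and not a guarantee. The helper names `running_in_colab` and `authenticate` are hypothetical and not part of any Google library.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded.

    Assumption: a Colab runtime imports google.colab at startup, so its
    presence in sys.modules is used here as a heuristic only.
    """
    return "google.colab" in sys.modules


def authenticate() -> str:
    """Authenticate inside Colab, or point to the gcloud route elsewhere.

    Returns the name of the route that was chosen ("colab" or "gcloud").
    """
    if running_in_colab():
        # This import only succeeds inside a Colab runtime.
        from google.colab import auth

        auth.authenticate_user()
        return "colab"
    # Outside Colab, credentials come from the gcloud SDK instead.
    print("Run `gcloud auth application-default login` in a terminal first.")
    return "gcloud"
```

Calling `authenticate()` in place of the cell above would behave the same way in Colab and print a reminder everywhere else.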

Creating a client and using a billing project¶

To access BigQuery, you will need a Google Cloud Project for queries to be billed to. If you need to create a Project, instructions on how to create one can be found on our Quick-Start Guide page.

A BigQuery Client object with the billing Project needs to be created to interface with BigQuery.

Note: Any costs that you incur are charged under your current project, so you will want to make sure you are on the correct one if you are part of multiple projects.

In [ ]:

# Create a variable for which client to use with BigQuery
project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project Id

In [ ]:

# Create a BigQuery Client
if project_id == 'YOUR_PROJECT_ID_CHANGE_ME':  # check that the placeholder was replaced
    print('Please update project_id with your Google Cloud Project ID')
else:
    client = bigquery.Client(project_id)
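
If you would rather not hard-code the billing project, one option is to fall back to an environment variable. The sketch below assumes the `GOOGLE_CLOUD_PROJECT` variable, which Google's client libraries commonly read for a default project. The helper `resolve_project_id` and its `environ` parameter are hypothetical names introduced only for illustration.

```python
import os

PLACEHOLDER = "YOUR_PROJECT_ID_CHANGE_ME"


def resolve_project_id(explicit_id: str = PLACEHOLDER, environ=None) -> str:
    """Return a usable billing project ID or raise a clear error.

    Order of preference: an explicitly supplied ID, then the
    GOOGLE_CLOUD_PROJECT environment variable (an assumption of this sketch).
    """
    if environ is None:
        environ = os.environ
    # Use the explicit value unless it is empty or still the placeholder.
    if explicit_id and explicit_id != PLACEHOLDER:
        return explicit_id
    env_id = environ.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if env_id:
        return env_id
    raise ValueError(
        "No billing project found: set project_id or GOOGLE_CLOUD_PROJECT."
    )
```

With this helper, the client could be created as `client = bigquery.Client(resolve_project_id(project_id))`, which fails loudly instead of leaving `client` undefined when the placeholder was never replaced.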

View ISB-CGC Datasets and Tables in BigQuery¶

Let us look at the datasets available through ISB-CGC that are in BigQuery.

In [ ]:

# Which project to view datasets
project_with_data = 'isb-cgc-bq'

# Create a variable of datasets
datasets = list(client.list_datasets(project_with_data))

# If there are datasets available then print their names,
# else print that there are no datasets available
if datasets:
    print(f"Datasets in project {project_with_data}:")
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print(f"{project_with_data} project does not contain any datasets.")

The ISB-CGC has two datasets for each Program or source. One dataset contains the most current data, and the other contains versioned tables, which serve as an archive for reproducibility. The current tables are labeled with "_current" and are updated when new data is released. For more information, visit our ISB-CGC BigQuery Projects page. Let's see which tables are in the versioned TCGA dataset.

In [ ]:

dataset_with_data = 'TCGA_versioned'

print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables(f'{project_with_data}.{dataset_with_data}'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
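
A versioned dataset can list the same table under several releases, so it may help to group the table names by release. The sketch below assumes the release tag is whatever follows the final underscore (for example `r37` in `clinical_gdc_r37`, the table queried later in this notebook). That naming rule is an assumption and may not hold for every table. `split_table_version` and `group_by_base` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_table_version(table_id: str) -> Tuple[str, str]:
    """Split 'name_tag' into ('name', 'tag') at the final underscore.

    Assumption: the release tag is the last underscore-separated token.
    A name without any underscore is returned with an empty tag.
    """
    base, sep, version = table_id.rpartition("_")
    if not sep:
        return table_id, ""
    return base, version


def group_by_base(table_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each base table name to the release tags seen for it."""
    grouped = defaultdict(list)
    for table_id in table_ids:
        base, version = split_table_version(table_id)
        grouped[base].append(version)
    return dict(grouped)


# Usage with the `tables` list from the cell above (requires a live client):
# releases = group_by_base(table.table_id for table in tables)
```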

Query ISB-CGC BigQuery Tables¶

In this section, we will store our SQL in a string variable, send it to BigQuery, and save the result to a dataframe.

Syntax for the query¶

SELECT # Select a few columns to view
  proj__project_id, # GDC project
  submitter_id, # case barcode
  proj__name # GDC project name
FROM # Which table in BigQuery in the format of `project.dataset.table`
  `project_name.dataset_name.table_name` # From the GDC TCGA Clinical Dataset
LIMIT
  5 # Limit to 5 rows as the dataset is very large and we only want to see a few results

Note: LIMIT only limits the number of rows returned, not the number of rows that the query scans.

In [ ]:

query = ("""
  SELECT
    proj__project_id,
    submitter_id,
    proj__name
  FROM
    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
  LIMIT
    5""")
result = client.query(query).to_dataframe()  # API request
print(result)
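
Because LIMIT does not reduce what a query scans, it can be useful to check how much data a query would process before running it. The sketch below uses a BigQuery dry run, which validates the query and reports the bytes it would process without executing it. It assumes the `client` and `query` variables from the cells above. `estimate_query_bytes` and `format_bytes` are hypothetical helper names.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary units, e.g. 1536 -> '1.50 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TiB"


def estimate_query_bytes(client, sql: str) -> int:
    """Return the bytes a query would process, using a dry run.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    # Imported here so format_bytes stays usable without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_run_job = client.query(sql, job_config=job_config)
    return dry_run_job.total_bytes_processed


# Usage with the variables defined earlier in this notebook:
# print(format_bytes(estimate_query_bytes(client, query)))
```

Comparing the estimate against the free tier mentioned under Resources gives a rough sense of how many such queries fit in a month.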

Resources¶

There are several ways to access and explore the data hosted by ISB-CGC.

- ISB-CGC
  - ISB-CGC WebApp
    - Provides a graphical interface to file and case data
    - Cohort creation
    - File exploration
  - ISB-CGC BigQuery Table Search
    - Provides a table search for available ISB-CGC BigQuery Tables
  - ISB-CGC APIs
    - Provides programmatic access to metadata
- Google Cloud
  - Google Cloud Platform
    - Access and store data in Google Cloud Storage and BigQuery via User Interfaces or programmatically
- Suggested Programming Languages and Programs to use
  - SQL
    - Can be used directly in the BigQuery Console
    - Or via API in Python or R
  - Python
  - R
    - RStudio
    - RStudio.Cloud
  - Command Line Interfaces
    - Cloud Shell via Project Console
    - Cloud SDK
- Getting Started for Free:
  - Free Cloud Credits from ISB-CGC for Cancer Research
  - Google Free Tier with up to 1 TB of free queries a month

Useful ISB-CGC Links:

Useful Google Tutorials: