ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Purpose: Painless intro to working in the cloud

Quick Start Guide to ISB-CGC


This Quick Start Guide is intended to give an overview of the data available, walk you through the steps of setting up your accounts, and get you started with a basic example in Python. If you have read the R version, you can skip to the Example section.

Access Requirements

Access Suggestions

  • Favored Programming Language (R or Python)
  • Favored IDE (RStudio or Jupyter)

Outline for this Notebook

  • Quick Overview of ISB-CGC
  • About the Data on ISB-CGC
  • Overview of How to Access Data
  • Account Set-up
  • ISB-CGC Web Interface
  • Google Cloud Platform (GCP) and BigQuery Overview
  • Example of Accessing Data with Python
  • Where to go next

Overview of ISB-CGC

The ISB-CGC provides both interactive and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) of the National Cancer Institute (NCI) and the Wellcome Trust Sanger Institute while leveraging many aspects of the Google Cloud Platform. You can also import your own data to analyze it side by side with the datasets and share your data when you see fit.

In [0]:
#@title Introduction to ISB-CGC Video
#@markdown This 12-minute video gives an introduction to ISB-CGC
from IPython.display import YouTubeVideo
YouTubeVideo('RQsLKDTciWk', width=600, height=400)
#@markdown For more videos, check out the ISB-CGC Video Tutorial Series

About the Data in the Cloud

The main data hosted in the cloud comes from The Cancer Genome Atlas (TCGA), a large-scale multi-disciplinary collaboration started by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). Hosted data types and files include RNA-Seq FASTQ files, DNA-Seq and RNA-Seq BAM files, Genome-Wide SNP6 array CEL files, and variant calls in VCF files, along with a number of other datasets, including data from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and Cancer Cell Line Encyclopedia (CCLE) programs. ISB-CGC hosts several tables in BigQuery with data from TCGA, TARGET, and CCLE, along with reference tables and Catalogue Of Somatic Mutations In Cancer (COSMIC) datasets from the Wellcome Trust Sanger Institute. ISB-CGC is adding more datasets all the time, so if you have suggestions for a dataset to be added, please email: [email protected]

For more information, please visit: Programs and Data Sets and Data in BigQuery

Overview of How to Access Data

There are several ways to access the data hosted by ISB-CGC.

Account Set-up

If you have not completed these steps before reading this guide:

  1. Log in to or create a Gmail account
    • You can use your institutional email if it is a Google identity
  2. Create a GCP Project using your Gmail account
  3. Authorize your account for dbGaP in the ISB-CGC WebApp (required for viewing controlled access data)
    • To access controlled data, users must first be authenticated by NIH (via the ISB-CGC WebApp). Upon successful authentication, the user’s dbGaP authorization is verified. Both steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.
    • Please view Accessing Controlled-Access Data if you need help with this step.
  4. Register your GCP project in the ISB-CGC WebApp
  5. Enable the following required Google Cloud APIs:
    • Google Compute Engine
    • Google Genomics
    • Google BigQuery
    • Google Cloud Logging
    • Google Cloud Pub/Sub
  6. Install optional software such as:

ISB-CGC Web Interface

The ISB-CGC Web Interface is an interactive web-based application to access and explore the rich TCGA, TARGET, and CCLE datasets with more datasets being added regularly. Through the WebApp you can create Cohorts, lists of Favorite Genes, miRNA, and Variables. The Cohorts and Variables can be used in Workbooks to allow you to quickly analyze and export datasets by mixing and matching the selections. The ISB-CGC Web Interface also allows you to view and analyze available pathology and radiology images associated with selected cohort data.

Google Cloud Platform and BigQuery Overview

The Google Cloud Platform Console is the web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, access the Cloud Shell command line, etc. Click here to download a quick tour from ISB-CGC of the GCP Console. Any costs that you incur are charged to your current project, so if you belong to multiple projects, make sure you are working in the correct one. Here is how to check which project is your current project.
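
Since charges land on whichever project is active, one option is to check which project a client is bound to before running anything billable. The sketch below is illustrative only: `ensure_project` is a hypothetical helper, not part of any Google library. It relies only on the `project` attribute that a `bigquery.Client` exposes, and `my-project-id` is a placeholder.

```python
def ensure_project(client, expected_project):
    """Raise if the client is bound to a different project than expected."""
    actual = client.project
    if actual != expected_project:
        raise ValueError(
            "Client is bound to project {!r}, expected {!r}".format(
                actual, expected_project))
    return actual


# Usage sketch (needs google-cloud-bigquery and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="my-project-id")  # placeholder project ID
# ensure_project(client, "my-project-id")
```

Calling the guard once at the top of a notebook makes a wrong-project mistake fail early instead of showing up later on a bill.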

"BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in." (Source) ISB-CGC has uploaded multiple open-access cancer genomic datasets into BigQuery tables, such as TCGA and TARGET Clinical, Biospecimen, and Molecular Data, along with dataset metadata. This data can be accessed from the Google Cloud Platform Console web UI, programmatically with R, and programmatically with Python through Cloud Datalab or Colab. Check out our Community Notebook Repository for example notebooks.

Example of Accessing Data with Python

Log in to Google Cloud Storage and Authenticate Ourselves

  1. Authenticate yourself with your Google Cloud login
  2. A second tab will open; if it does not, follow the link provided
  3. Follow the prompts to authorize your account to use the Google Cloud SDK
  4. Copy the code provided and paste it into the box under the command
  5. Press Enter

Alternatives for Authentication can be found here

In [0]:
# Run a command line command with the bang (!) and gcloud
!gcloud auth application-default login 
Go to the following link in your browser:

Enter verification code: 4/XQGk8wtHV404M8mfwbkdcZjmj-DpxkeKCnUvD3hh4y8XCWa00jfNoww

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token

To take a quick anonymous survey, run:
  $ gcloud alpha survey
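
When the notebook runs in Colab, one alternative to the `gcloud` command above is Colab's built-in helper, `google.colab.auth.authenticate_user()`. The following is a rough sketch rather than the guide's own method: `running_in_colab` and `authenticate` are hypothetical helper names, and outside Colab the function only reports that the `gcloud` flow should be used instead.

```python
import importlib.util


def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent "google" package is missing or malformed
        return False


def authenticate():
    """Authenticate through Colab if available; otherwise name the fallback."""
    if running_in_colab():
        from google.colab import auth  # only available inside Colab
        auth.authenticate_user()       # opens Colab's sign-in prompt
        return "colab"
    # Outside Colab, run: gcloud auth application-default login
    return "gcloud"
```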

View Datasets and Tables in BigQuery

Let us look at the datasets available through ISB-CGC that are in BigQuery. You will need to load the BigQuery API and set the client (click here for more information).

In [0]:
# Load BigQuery API
from google.cloud import bigquery

# Create a client to access the data within BigQuery
client = bigquery.Client('isb-cgc')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then print their names,
# else print that there are no data sets available
if datasets:
    print("Datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print("{} project does not contain any datasets.".format(project))
Datasets in project isb-cgc:
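
The dataset listing can be long, so it may help to group the IDs by their leading program tag. This is a small sketch that assumes dataset IDs follow the `PROGRAM_suffix` pattern seen in `TCGA_bioclin_v0`. `group_by_prefix` is a hypothetical helper, and apart from `TCGA_bioclin_v0` the IDs in the usage comment are made up for illustration.

```python
from collections import defaultdict


def group_by_prefix(dataset_ids):
    """Group dataset IDs by the text before their first underscore."""
    groups = defaultdict(list)
    for dataset_id in dataset_ids:
        prefix = dataset_id.split("_", 1)[0]
        groups[prefix].append(dataset_id)
    return dict(groups)


# Usage sketch with the `datasets` list from the cell above:
# groups = group_by_prefix(d.dataset_id for d in datasets)
# for prefix, ids in sorted(groups.items()):
#     print(prefix, len(ids))
#
# Illustrative call with made-up IDs:
# group_by_prefix(["TCGA_bioclin_v0", "TCGA_example", "TARGET_example"])
```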

Let us see which tables are under the TCGA_bioclin_v0 dataset.

In [0]:
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('isb-cgc.TCGA_bioclin_v0'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
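
Once you know a table's name, you may want to see its columns before querying it. The following is a minimal sketch, assuming an authenticated `bigquery.Client` like the one created earlier. `list_columns` is a hypothetical helper name, and the `Clinical` table ID in the usage comment is an assumption to verify against the table listing.

```python
def list_columns(client, table_id):
    """Return (column name, column type) pairs for a BigQuery table."""
    table = client.get_table(table_id)  # one API request
    return [(field.name, field.field_type) for field in table.schema]


# Usage sketch (needs google-cloud-bigquery and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client('isb-cgc')
# table_id = 'isb-cgc.TCGA_bioclin_v0.Clinical'  # assumed table ID
# for name, field_type in list_columns(client, table_id):
#     print("\t{} ({})".format(name, field_type))
```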

Access BigQuery to call a table

First, call BigQuery with a magic command; you can then write your query in Standard SQL. Click here for more on IPython Magic Commands for BigQuery. The result will be a pandas DataFrame.

In [0]:
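%%bigquery --project PROJECT_ID
# Call to BigQuery with a cell magic (it must be the first line of the cell)
# and replace PROJECT_ID with your project ID
SELECT # Select a few columns to view
  program_name,
  case_barcode,
  project_short_name
FROM # From the TCGA Clinical table (table name assumed; check the table listing above)
  `isb-cgc.TCGA_bioclin_v0.Clinical`
LIMIT # Limit to 5 rows as the dataset is very large and we only want to see a few results
  5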

# Syntax for the above query
# FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
# Limit to the first 5 fields
program_name case_barcode project_short_name
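
A query like the one above can also be run without the cell magic by passing a SQL string to the client. This is a minimal sketch under stated assumptions: `build_preview_query` is a hypothetical helper, `PROJECT_ID` is a placeholder for your own project, and the table ID in the usage comment is assumed rather than confirmed by this guide.

```python
def build_preview_query(table_id, columns, limit=5):
    """Build a Standard SQL query that previews a few columns of a table."""
    if not columns:
        raise ValueError("at least one column is required")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    column_list = ",\n  ".join(columns)
    return "SELECT\n  {}\nFROM\n  `{}`\nLIMIT {}".format(
        column_list, table_id, limit)


# Usage sketch (needs google-cloud-bigquery, pandas, and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="PROJECT_ID")  # placeholder project ID
# sql = build_preview_query(
#     "isb-cgc.TCGA_bioclin_v0.Clinical",  # assumed table ID
#     ["program_name", "case_barcode", "project_short_name"])
# df = client.query(sql).to_dataframe()
```

Building the string in a function keeps the column list and row limit easy to change between runs.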

Now that wasn't so difficult! Have fun exploring and analyzing the ISB-CGC Data!
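
Before exploring larger tables, it can help to check how much data a query would scan, since query costs are charged to your project. The sketch below uses BigQuery's dry-run mode, which validates a query and reports the bytes it would process without running it. `estimate_bytes_processed` and `format_gib` are hypothetical helper names, and `PROJECT_ID` is a placeholder.

```python
def estimate_bytes_processed(client, sql, dry_run_config):
    """Return the number of bytes a query would scan, without running it."""
    job = client.query(sql, job_config=dry_run_config)
    return job.total_bytes_processed


def format_gib(num_bytes):
    """Format a byte count as gibibytes with two decimals."""
    return "{:.2f} GiB".format(num_bytes / 2**30)


# Usage sketch (needs google-cloud-bigquery and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="PROJECT_ID")  # placeholder project ID
# config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# sql = "..."  # your own query string
# print(format_gib(estimate_bytes_processed(client, sql, config)))
```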

Where to Go Next

Explore, discover, and analyze the data provided by ISB-CGC side by side with your own! :)

ISB-CGC Links:

Google Tutorials: