Notebook

ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   How to perform complex joins
Author:  David L Gibbs
Created: 2020-04-21
Purpose: Basic overview of complex joins in BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_complex_joins.ipynb
Notes:   This covers joining multiple tables based on a sample identifier.

This notebook will show you how to perform a complex join using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free-trial. You can also explore the available tables and data sets before commiting to creating a GCP project though the ISB-CGC BigQuery Table Searcher.

We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some links to some resources that you might find useful:

BigQuery
the BigQuery web UI
- where you can run queries interactively
Jupyter Notebooks
Google Cloud Datalab
- interactive cloud-based platform for analyzing data built on the Jupyter Notebooks
Google Colaboratory
- Free Jupyter Notebook environment that runs in your browser

There are also many tutorials and samples available on github (see, in particular, the datalab repo, the Google Genomics project), and our own Community Notebooks.

OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:

In [ ]:

!pip3 install google

In [ ]:

!pip3 install google.cloud.bigquery

In [ ]:

import google.cloud.bigquery as bigquery

Next, we need to Authorize ourselves. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

In [ ]:

!gcloud auth application-default login
# If you're in a notebook, this might not allow for a reponse to be entered.
# in that case, in Jupyter Lab you can open a terminal and do it there. (same command)

In [1]:

# Create a variable for which client to use with BigQuery
project_num = 'your_project_number' # Update with your Google Project number

if project_num == 'your_project_number':
    print('Please update the project number with your Google Cloud Project')
else:
    client = bigquery.Client(project_num) # Replace your_project_number with your project ID

Please update the project number with your Google Cloud Project

In [ ]:

%load_ext google.cloud.bigquery

BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are found in datasets TCGA_bioclin_v0, platform_reference, and TCGA_hg19_data_v0, all owned by the isb-cgc** project. A full table identifier is of the form <project_id>.<dataset_id>.<table_id>.

Join Join Join Join¶

Our strategy is going to based on linking up a series of subtables, rather than trying to do all the joins in one go. It's a little like rolling up a snowball.

Let's start by looking at the clinical data table, and we'll build up the query, table by table.

Starting at the top¶

Notice that in the 'group by', we use integers that reference the selected terms.

In [ ]:

%%bigquery --project project_num 

WITH
clinTab AS (
    select
      project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
    from
      `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
    where
      project_short_name = "TCGA-BRCA" AND
      Clin.age_at_diagnosis < 50 
    group by 
      1,2,3,4
)

select * from clinTab limit 5

First Join¶

Here, each row of the clinical table is going to get joined to each protein-and-barcode combo.

In [ ]:

%%bigquery pdf1 --project project_num

WITH
clinTab AS (
    select
      project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
    from
      `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
    where
      project_short_name = "TCGA-BRCA" AND
      Clin.age_at_diagnosis < 50 
    group by 
      1,2,3,4
),
protTab AS (
    select
         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, 
         Prot.gene_name as gene_name, Prot.protein_name,
         clinTab.age_at_diagnosis, clinTab.pathologic_stage
    from
        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
    join
         clinTab
    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
    where
        regexp_contains(Prot.gene_name, '^P')
    group by
          1,2,3,4,5,6,7
)

select * from protTab

In [ ]:

pdf1.head()

In [ ]:

pdf1.shape

Second Join¶

Now we'll bring in variant data. Notice in the 'mutTab', the features that were originally from the clinical table, are now part of Prot. We are also going to join only deletions which are called 'DEL's.

In [ ]:

%%bigquery mdf1 --project project_num 

WITH
clinTab AS (
    select
        project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
    from
        `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
    where
        project_short_name = "TCGA-BRCA" AND
        Clin.age_at_diagnosis < 50 
    group by 
        1,2,3,4
),

protTab AS (
    select
         clinTab.age_at_diagnosis, clinTab.pathologic_stage,
         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, 
         Prot.gene_name as gene_name, Prot.protein_name
    from
        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
    join
         clinTab
    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
    where
        regexp_contains(Prot.gene_name, '^P')
    group by
          1,2,3,4,5,6,7
),

mutTab AS (
    select
        prot.age_at_diagnosis, prot.pathologic_stage,
        Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, 
        Prot.gene_name as gene_name, Prot.protein_name,
        mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status 
    from
        `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut
    join
        protTab as prot
    on prot.case_barcode = mut.case_barcode
        and prot.project_short_name = mut.project_short_name
        and prot.gene_name = mut.Hugo_Symbol
    where
        mut.Variant_Type = 'DEL'
    group by 1,2,3,4,5,6,7,8,9,10
)

select * from mutTab

In [ ]:

mdf1.head()

In [ ]:

mdf1.shape

Third Join -- detour¶

Now to bring in the methylation data. This data, from Illumina methylation arrays, the Human Methylation 27k and 450k platforms, is indexed by probe IDs. That means we first need to do ane extra join to bring in gene symbols. Also, observe that the methylation annotation is a nested table. In the table schema, you will notice the UCSC field is a record with three components, RefGene_Name, RefGene_Group, and RefGene_Accession. To make our work easier, we want to take this nested table and make it a 'flat' table. For some documentation on doing that see this doc.

In [ ]:

%%bigquery prdf1 --project project_num 

SELECT
    IlmnID,
    RefGene_Name,
    RefGene_Group,
    Infinium_Design_Type
FROM
    `isb-cgc.platform_reference.methylation_annotation`,
UNNEST(UCSC) 
limit 10
    

In [ ]:

prdf1.head()

Since we made our annotation table flat, we can join it in using the regular method.

In [ ]:

%%bigquery ardf1 --project project_num 

WITH
probes AS (
  SELECT
    IlmnID,
    RefGene_Name,
    RefGene_Group,
    Infinium_Design_Type
  FROM
    `isb-cgc.platform_reference.methylation_annotation`,
    UNNEST(UCSC) ),
   
methAnnot as (
select
  Methyl.probe_id,Methyl.beta_value,Methyl.platform,
  Methyl.case_barcode, Methyl.project_short_name,
  probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type
from
  (select * from `isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` limit 100) Methyl  -- SUB-TABLE HERE!
join
  probes
on probes.IlmnID = Methyl.probe_id
where
    project_short_name = "TCGA-BRCA"
)
    
select * from methAnnot limit 10

In [ ]:

ardf1.head()

In [ ]:

ardf1.shape

Third Join -- making the turn¶

Now that we've mapped probe ids to gene symbols, we'll bring in methylation data.

In [ ]:

%%bigquery bigdf --project project_num 

WITH
clinTab AS (
    select
        project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
    from
        `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
    where
        project_short_name = "TCGA-BRCA" AND
        Clin.age_at_diagnosis < 50 
    group by 
        1,2,3,4
),

protTab AS (
    select
         clinTab.age_at_diagnosis, clinTab.pathologic_stage,
         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, 
         Prot.gene_name as gene_name, Prot.protein_name
    from
        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
    join
         clinTab
    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
    where
        regexp_contains(Prot.gene_name, '^P')
    group by
          1,2,3,4,5,6,7
),

mutTab AS (
    select
        prot.age_at_diagnosis, prot.pathologic_stage,
        Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, 
        Prot.gene_name as gene_name, Prot.protein_name,
        mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status 
    from
        `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut
    join
        protTab as prot
    on prot.case_barcode = mut.case_barcode
        and prot.project_short_name = mut.project_short_name
        and prot.gene_name = mut.Hugo_Symbol
    where
        mut.Variant_Type = 'DEL'
    group by 1,2,3,4,5,6,7,8,9,10
),

probes AS (
  SELECT
    IlmnID,
    RefGene_Name,
    RefGene_Group,
    Infinium_Design_Type
  FROM
    `isb-cgc.platform_reference.methylation_annotation`,
    UNNEST(UCSC) 
),
   
methAnnot as (
select
  Methyl.probe_id,Methyl.beta_value,Methyl.platform,
  Methyl.case_barcode, Methyl.project_short_name,
  probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type
from
  `isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` Methyl  
join
  probes
on probes.IlmnID = Methyl.probe_id
where
    project_short_name = "TCGA-BRCA"
),
    
methTab as (
select
    mutTab.age_at_diagnosis, mutTab.pathologic_stage,
    mutTab.case_barcode, mutTab.project_short_name, mutTab.protein_expression, 
    mutTab.gene_name as gene_name, mutTab.protein_name,
    mutTab.Variant_Type, mutTab.Genome_Change, mutTab.Mutation_Status,
    Methyl.probe_id, Methyl.beta_value, Methyl.platform, 
    Methyl.RefGene_Group, Methyl.Infinium_Design_Type
from
    methAnnot Methyl
join
    mutTab
on
    mutTab.case_barcode = Methyl.case_barcode
    and mutTab.project_short_name = Methyl.project_short_name
    and mutTab.gene_name = Methyl.RefGene_Name
group by
  1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
)

select * from mutTab 

In [ ]:

bigdf.head()

In [ ]:

bigdf.shape

That's it!¶

We've created a big wide table that's the result of joining 5 separate tables. Whew! Send us an email, how did it go?
feedback@isb-cgc.org