Open In Colab

ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to use PyPika to build a query
Author:  Lauren Hagen
Created: 2020-02-13
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How to use PyPika_to_create_a_BigQuery_SQL_query.ipynb
Purpose: Demonstrate query creation using PyPika with BigQuery
Notes:

Introduction

Overview

In this notebook, we are going to use the PyPika package to build a query and use that query with the ISB-CGC BigQuery tables. First we'll create a cohort by selecting clinical features, then we'll use the cohort to filter the RNA and miRNA expression tables from the TARGET data set.

What is PyPika?

PyPika is a Python package designed to construct SQL queries. This package is useful for people who want to integrate BigQuery into their python code.

Before we get started, we need to load the BigQuery module, authenticate ourselves, create a client variable, and load the necessary libraries.

In [ ]:
# Load the BigQuery Module
from google.cloud import bigquery
In [ ]:
# Authenticate ourselves
!gcloud auth application-default login
In [ ]:
# Create the client object
project_num = 'your_project_number' # Replace with your project ID
if project_num == 'your_project_number':
    print('Please update the project number with your Google Cloud Project')
else:
    client = bigquery.Client(project_num)
In [ ]:
# If needed, install PyPika
!pip install pypika
In [ ]:
# Import from PyPika
from pypika import Query, Table, Field, Order

Query Building

In this notebook, we will query each table individually first and then join them for a final query. We'll use a selection of clinical data with molecular data from the TARGET data set.

Patient Clinical Data Query

We want to write a query that filters the TARGET data set for AML with columns for the case barcode and the remission status of the patient for our cohort.

In [ ]:
# Pass the Clincal table name to the Table class to create a variable
clin_table = Table('`isb-cgc.TARGET_bioclin_v0.Clinical`')

# Create query with PyPika
clin_query = Query.from_(clin_table) \
                  .select('case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2') \
                  .where(clin_table.disease_code=='AML')
In [9]:
# Print returned query
print(clin_query)
SELECT "case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2" FROM "`isb-cgc.TARGET_bioclin_v0.Clinical`" WHERE "disease_code"='AML'

PyPika returns the query with quotation marks around some of the inputs. These need to be removed before the querying BigQuery.

In [ ]:
# Remove " from the string
clin_query_clean = str(clin_query).replace('"', "")
In [11]:
# Print the query that PyPika created to check that it looks good.
# We can also copy the query into the BigQuery UI to check for errors
print(clin_query_clean)
SELECT case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_bioclin_v0.Clinical` WHERE disease_code='AML'

We can query BigQuery, then create a data frame with the results though this isn't necessary for creating the final query.

In [ ]:
# Query BigQuery and create a data frame with the results
clin = client.query(clin_query_clean).to_dataframe()
In [13]:
# View the first several lines of the returned table
clin.head()
Out[13]:
case_barcode CR_status_at_end_of_course_1 CR_status_at_end_of_course_2
0 TARGET-20-PATDMY CR CR
1 TARGET-20-PARHSA CR CR
2 TARGET-20-PARLSW CR CR
3 TARGET-20-PASLHH CR CR
4 TARGET-20-PATKUG CR CR

Molecular Data Query

Now that we have a list of cases with some clinical information, we can join that table to one of the molecular data sets, such as the TARGET gene expression data.

We will now build the query for the molecular data set. We are not going to query BigQuery at this point, but it is good to make sure the query looks correct before joining it with another table.

In [ ]:
# Pass the Clincal data table name to the Table class to create a variable
expr_table = Table('`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`')

# Create a query with PyPika
expr_query = Query.from_(expr_table) \
                  .select('case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name') \
                  .orderby('HTSeq__FPKM_UQ')
In [ ]:
# Remove " from the string
expr_query_clean = str(expr_query).replace('"', "")
In [ ]:
# View the created query
print(expr_query_clean)
SELECT case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` ORDER BY HTSeq__FPKM_UQ

Create the Final Query

Finally, we will create a query to join the two tables. BigQuery and PyPika support all join types, though, for this query, we are using the standard inner join. This query returns a large number of lines and can be slow to bring into Collaboratory, so we will limit the number of lines returned to 100.

In [ ]:
# Create a query with PyPika
join_query = Query.from_(expr_table) \
  .join(clin_table).on(expr_table.case_barcode == clin_table.case_barcode) \
  .select('case_barcode', 'HTSeq__FPKM_UQ', 'Ensembl_gene_id', 'gene_name') \
  .select(clin_table.CR_status_at_end_of_course_1, clin_table.CR_status_at_end_of_course_2) \
  .where(clin_table.disease_code=='AML') \
  .limit(100)
In [ ]:
# Replace the " in the query
join_query_clean = str(join_query).replace('"', "")
In [ ]:
# View the created query
print(join_query_clean)
SELECT `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.HTSeq__FPKM_UQ,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.Ensembl_gene_id,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.gene_name,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_1,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` JOIN `isb-cgc.TARGET_bioclin_v0.Clinical` ON `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode=`isb-cgc.TARGET_bioclin_v0.Clinical`.case_barcode WHERE `isb-cgc.TARGET_bioclin_v0.Clinical`.disease_code='AML' LIMIT 100

There! We now have a query that joins the two tables. We can now query BigQuery and view the results.

In [ ]:
# Query BigQuery and create a data frame
final = client.query(join_query_clean).to_dataframe()
In [ ]:
# View the first several lines of the data frame
final.head()
Out[ ]:
case_barcode HTSeq__FPKM_UQ Ensembl_gene_id gene_name CR_status_at_end_of_course_1 CR_status_at_end_of_course_2
0 TARGET-20-PAPWHS 392731.987867 ENSG00000173598 NUDT4 CR CR
1 TARGET-20-PAPWHS 13617.578556 ENSG00000204745 AC083899.3 CR CR
2 TARGET-20-PAPWHS 87978.648259 ENSG00000235183 RP11-613C6.4 CR CR
3 TARGET-20-PAPWHS 258701.117197 ENSG00000177917 ARL6IP6 CR CR
4 TARGET-20-PAPWHS 8689.796007 ENSG00000181085 MAPK15 CR CR

It's that simple! Please let us know if you have any questions at [email protected]