ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   How to use PyPika to build a query
Author:  Lauren Hagen
Created: 2020-02-13
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How to use PyPika_to_create_a_BigQuery_SQL_query.ipynb
Purpose: Demonstrate query creation using PyPika with BigQuery
Notes:

Introduction¶

Overview¶

In this notebook, we are going to use the PyPika package to build a query and use that query with the ISB-CGC BigQuery tables. First we'll create a cohort by selecting clinical features, then we'll use the cohort to filter the RNA and miRNA expression tables from the TARGET data set.

What is PyPika?¶

PyPika is a Python package designed to construct SQL queries. This package is useful for people who want to integrate BigQuery into their python code.

Before we get started, we need to load the BigQuery module, authenticate ourselves, create a client variable, and load the necessary libraries.

In [ ]:

# Load the BigQuery Module
from google.cloud import bigquery

In [ ]:

# Authenticate ourselves
!gcloud auth application-default login

In [ ]:

# Create the client object
project_num = 'your_project_number' # Replace with your project ID
if project_num == 'your_project_number':
    print('Please update the project number with your Google Cloud Project')
else:
    client = bigquery.Client(project_num)

In [ ]:

# If needed, install PyPika
!pip install pypika

In [ ]:

# Import from PyPika
from pypika import Query, Table, Field, Order

Query Building¶

In this notebook, we will query each table individually first and then join them for a final query. We'll use a selection of clinical data with molecular data from the TARGET data set.

Patient Clinical Data Query¶

We want to write a query that filters the TARGET data set for AML with columns for the case barcode and the remission status of the patient for our cohort.

In [ ]:

# Pass the Clincal table name to the Table class to create a variable
clin_table = Table('`isb-cgc.TARGET_bioclin_v0.Clinical`')

# Create query with PyPika
clin_query = Query.from_(clin_table) \
                  .select('case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2') \
                  .where(clin_table.disease_code=='AML')

In [9]:

# Print returned query
print(clin_query)

SELECT "case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2" FROM "`isb-cgc.TARGET_bioclin_v0.Clinical`" WHERE "disease_code"='AML'

PyPika returns the query with quotation marks around some of the inputs. These need to be removed before the querying BigQuery.

In [ ]:

# Remove " from the string
clin_query_clean = str(clin_query).replace('"', "")

In [11]:

# Print the query that PyPika created to check that it looks good.
# We can also copy the query into the BigQuery UI to check for errors
print(clin_query_clean)

SELECT case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_bioclin_v0.Clinical` WHERE disease_code='AML'

We can query BigQuery, then create a data frame with the results though this isn't necessary for creating the final query.

In [ ]:

# Query BigQuery and create a data frame with the results
clin = client.query(clin_query_clean).to_dataframe()

In [13]:

# View the first several lines of the returned table
clin.head()

Out[13]:

	case_barcode	CR_status_at_end_of_course_1	CR_status_at_end_of_course_2
0	TARGET-20-PATDMY	CR	CR
1	TARGET-20-PARHSA	CR	CR
2	TARGET-20-PARLSW	CR	CR
3	TARGET-20-PASLHH	CR	CR
4	TARGET-20-PATKUG	CR	CR

Molecular Data Query¶

Now that we have a list of cases with some clinical information, we can join that table to one of the molecular data sets, such as the TARGET gene expression data.

We will now build the query for the molecular data set. We are not going to query BigQuery at this point, but it is good to make sure the query looks correct before joining it with another table.

In [ ]:

# Pass the Clincal data table name to the Table class to create a variable
expr_table = Table('`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`')

# Create a query with PyPika
expr_query = Query.from_(expr_table) \
                  .select('case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name') \
                  .orderby('HTSeq__FPKM_UQ')

In [ ]:

# Remove " from the string
expr_query_clean = str(expr_query).replace('"', "")

In [ ]:

# View the created query
print(expr_query_clean)

SELECT case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` ORDER BY HTSeq__FPKM_UQ

Create the Final Query¶

Finally, we will create a query to join the two tables. BigQuery and PyPika support all join types, though, for this query, we are using the standard inner join. This query returns a large number of lines and can be slow to bring into Collaboratory, so we will limit the number of lines returned to 100.

In [ ]:

# Create a query with PyPika
join_query = Query.from_(expr_table) \
  .join(clin_table).on(expr_table.case_barcode == clin_table.case_barcode) \
  .select('case_barcode', 'HTSeq__FPKM_UQ', 'Ensembl_gene_id', 'gene_name') \
  .select(clin_table.CR_status_at_end_of_course_1, clin_table.CR_status_at_end_of_course_2) \
  .where(clin_table.disease_code=='AML') \
  .limit(100)

In [ ]:

# Replace the " in the query
join_query_clean = str(join_query).replace('"', "")

In [ ]:

# View the created query
print(join_query_clean)

SELECT `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.HTSeq__FPKM_UQ,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.Ensembl_gene_id,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.gene_name,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_1,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` JOIN `isb-cgc.TARGET_bioclin_v0.Clinical` ON `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode=`isb-cgc.TARGET_bioclin_v0.Clinical`.case_barcode WHERE `isb-cgc.TARGET_bioclin_v0.Clinical`.disease_code='AML' LIMIT 100

There! We now have a query that joins the two tables. We can now query BigQuery and view the results.

In [ ]:

# Query BigQuery and create a data frame
final = client.query(join_query_clean).to_dataframe()

In [ ]:

# View the first several lines of the data frame
final.head()

Out[ ]:

	case_barcode	HTSeq__FPKM_UQ	Ensembl_gene_id	gene_name	CR_status_at_end_of_course_1	CR_status_at_end_of_course_2
0	TARGET-20-PAPWHS	392731.987867	ENSG00000173598	NUDT4	CR	CR
1	TARGET-20-PAPWHS	13617.578556	ENSG00000204745	AC083899.3	CR	CR
2	TARGET-20-PAPWHS	87978.648259	ENSG00000235183	RP11-613C6.4	CR	CR
3	TARGET-20-PAPWHS	258701.117197	ENSG00000177917	ARL6IP6	CR	CR
4	TARGET-20-PAPWHS	8689.796007	ENSG00000181085	MAPK15	CR	CR

It's that simple! Please let us know if you have any questions at feedback@isb-cgc.org.