ISB-CGC Community Notebooks¶

Title:   How to Create a Random Sample in BigQuery
Author:  Lauren Hagen
Created: 2019-10-17
Purpose: Demonstrates how to split a data set into multiple groups randomly with BigQuery

How to Create a Random Sample in BigQuery¶

In this notebook, we will be using BigQuery to create random samples for predicting an outcome with test and training data sets such as in machine learning. In this notebook, we assume that you have set up your GCP and accessed the ISB-CGC WebApp. If not, please visit the How To Get Started on ISB-CGC or the Community Notebook Repository for guides on how to get started.

We will go over two methods to create the subset data:

RAND() Function
MOD() and FARM_FINGERPRINT Functions

Before we can begin working with BigQuery, we will need to load the BigQuery module and authenticate ourselves.

In [0]:

from google.cloud import bigquery

In [ ]:

!gcloud auth application-default login

`RAND()` Function for Randomly Splitting data¶

A simple way to create a random sample with BigQuery is to use the RAND() function. The RAND() function creates a seemingly random set of numbers and then the query can select to create a random sample of rows.

We will create two queries with the RAND() function. The first will return a random sample of the data and the second will return all of a cohort with the rows labeled for which subset they belong to.

We are going to start with a simple query to create a cohort. Cohorts can be created previously in the WebApp or through other means instead of within this query. For more information on creating cohorts, please see the ISB-CGC Web Interface (Web App) documentation and the Community Notebook Repository.

** Note: you will need to update 'your_project_number' with your project number before continuing with the notebook **

In [0]:

# Create a query with the cohort information
# This can be replaced with your own cohort
%%bigquery cohort --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"

In [4]:

# View the first 5 rows of the cohort
cohort.head(5)

Out[4]:

	case_barcode	project_short_name	case_gdc_id
0	TARGET-30-PATSRD	TARGET-NBL	fef92ed0-242b-5564-ad92-6b35c21c3bd5
1	TARGET-30-PATTEF	TARGET-NBL	fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
2	TARGET-30-PARJAR	TARGET-NBL	feb97edc-ce83-5fd5-94e3-261ce244ac52
3	TARGET-30-PAUAZA	TARGET-NBL	fe831368-c7ce-5e2b-b0fd-c35216a7761d
4	TARGET-30-PAIJGC	TARGET-NBL	fe58a8bf-8306-5aaf-99d1-b65c20fedc58

In [5]:

print("This cohort has " + str(len(cohort)) + " cases (rows).")

This cohort has 1180 cases (rows).

The next part of the query is creating a random sample. This line can be adjusted to return any percent of the data table into a random sample. For this example, it is set to create a random sample of ~25% of the data.

In [0]:

%%bigquery sample --project your_project_number

--- Create Cohort
WITH table1 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL")

--- Select a random sample that is ~25% of the data
SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num
FROM table1
-- Count the number of rows in the cohort, then find how many of them will be 25%
-- of the cohort. Divid that number by the total number of rows in the data 
-- To change the %, change the 0.25 to what ever precentage you need
WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)

In [7]:

print("The random sample of the cohort with " + str(len(sample)) + " cases which is around " + str(round((len(sample)/len(cohort)*100),1)) + "% of the cohort.")

The random sample of the cohort with 300 cases which is around 25.4% of the cohort.

This query is nice if we just wanted to grab a smaller sample of the cohort to do some initial analysis before moving to a larger data set but it is not useful if you want to create two separate subsets of data for training a model and then testing the model. The final query joins the new random sample table back with the main table and preserving the split of data with a column for which subset it belongs to.

In [0]:

%%bigquery dataset --project your_project_number
--- Create Cohort
WITH table1 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"),

--- Select a random sample that is ~25% of the data
table2 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num
FROM table1
-- Count the number of rows in the cohort, then find how many of them will be 25%
-- of the cohort. Divid that number by the total number of rows in the data
-- To change the %, change the 0.25 to what ever precentage you need
WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)
)

--- Join the random sample table back to the main table
SELECT a.case_barcode, a.project_short_name, a.case_gdc_id, IFNULL(b.table_num,2) AS table_num
FROM table1 AS a
FULL OUTER JOIN table2 AS b
ON a.case_barcode = b.case_barcode

In [9]:

dataset.head(5)

Out[9]:

	case_barcode	project_short_name	case_gdc_id	table_num
0	TARGET-30-PAUELT	TARGET-NBL	e785d22c-f983-55ff-be60-554ac487cf8c	1
1	TARGET-30-PADINC	TARGET-NBL	d190884d-551b-57ca-8d95-79cd809af8db	1
2	TARGET-30-PAUYDE	TARGET-NBL	ccfab7a6-e356-51c4-a803-04082cb67d49	2
3	TARGET-30-PAPUWY	TARGET-NBL	95d7372e-2734-5068-8f7d-7597d92c1737	1
4	TARGET-30-PANYGR	TARGET-NBL	7d624f31-49fa-55d7-a0a1-6b32cc17a333	2

In [10]:

print("The final table has " + str(len(dataset)) + " cases.")

The final table has 1180 cases.

In [11]:

# Create a list with the sorted initial barcodes
case_barcode_initial = list(cohort.case_barcode.sort_values())
# Create a list with the sorted final barcodes
case_barcode_final = list(dataset.case_barcode.sort_values())
# Compare the two lists, if TRUE, no barcodes were lost
case_barcode_initial == case_barcode_final 

Out[11]:

True

Each query will have a different set of random samples because each time RAND() is run, it generates a new set of random numbers. This could be a problem if you want reproducible results each time you run the query. Another way to solve this problem is to use FARM_FINGERPRINT() with MOD() which we will cover next.

`MOD()` and `FARM_FINGERPRINT` Functions¶

FARM_FINGERPRINT() will compute a string of BYTES or STRING and will never change. The MOD() function will return the remainder of the farm fingerprint number and a number. This method will always return the same values for each subset.

For example, we want three approximately equal subsets of the data sets of our cohort. We will create a WHERE statement that has the case_barcode in the FARM_FINGERPRINT function, take the absolute value, put that number into the MOD(), and set that equal to 0.

Note: You need to run FARM_FINGERPRINT() on a column that has unique values for each row or combine two columns with CONCAT to create a unique value row.

For more information visit the BigQuery documentation:

In [12]:

%%bigquery  --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) = 0

Out[12]:

	case_barcode	project_short_name	case_gdc_id
0	TARGET-30-PATTEF	TARGET-NBL	fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
1	TARGET-30-PARJAR	TARGET-NBL	feb97edc-ce83-5fd5-94e3-261ce244ac52
2	TARGET-30-PANUVK	TARGET-NBL	fd756a5f-0f9a-57ca-9879-2b18e5fa0b54
3	TARGET-30-PARZHA	TARGET-NBL	fc9d5f9e-af43-5134-95e7-2d434831580b
4	TARGET-30-PAPRXW	TARGET-NBL	fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c
...	...	...	...
367	TARGET-30-PAHSRS	TARGET-NBL	None
368	TARGET-30-PAHYBI	TARGET-NBL	None
369	TARGET-30-PADGWJ	TARGET-NBL	None
370	TARGET-30-PADTYM	TARGET-NBL	None
371	TARGET-30-PALGNJ	TARGET-NBL	None

372 rows × 3 columns

The three in the MOD() statement is what splits the cohort into three subsets. If we wanted to do ~20% split of the data, we would change the three to a ten and then take any row that less than or equal to 1.

In [13]:

%%bigquery  --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) <= 1

Out[13]:

	case_barcode	project_short_name	case_gdc_id
0	TARGET-30-PASWIJ	TARGET-NBL	fe0b727f-3843-5b70-b9c1-8a207b837fc4
1	TARGET-30-PARABN	TARGET-NBL	fc9d0307-a5f5-51c4-ad9e-6e9ed60f0eba
2	TARGET-30-PAPRXW	TARGET-NBL	fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c
3	TARGET-30-PAREAG	TARGET-NBL	fb071e74-40dc-5f67-b4c3-e61dd2c2ef88
4	TARGET-30-PASKSX	TARGET-NBL	faaed289-3f28-5a16-b18b-5ee164f05f50
...	...	...	...
216	TARGET-30-PAMXWK	TARGET-NBL	None
217	TARGET-30-PAHZRF	TARGET-NBL	None
218	TARGET-30-PADKLJ	TARGET-NBL	None
219	TARGET-30-PAIFAU	TARGET-NBL	None
220	TARGET-30-PALGNJ	TARGET-NBL	None

221 rows × 3 columns

With this method, we can also create a query that will retrieve the remaining subset of data. For the first example, we would change the = to != as shown below:

In [14]:

%%bigquery  --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) != 0

Out[14]:

	case_barcode	project_short_name	case_gdc_id
0	TARGET-30-PATSRD	TARGET-NBL	fef92ed0-242b-5564-ad92-6b35c21c3bd5
1	TARGET-30-PAUAZA	TARGET-NBL	fe831368-c7ce-5e2b-b0fd-c35216a7761d
2	TARGET-30-PAIJGC	TARGET-NBL	fe58a8bf-8306-5aaf-99d1-b65c20fedc58
3	TARGET-30-PASWIJ	TARGET-NBL	fe0b727f-3843-5b70-b9c1-8a207b837fc4
4	TARGET-30-PANBMJ	TARGET-NBL	fdfb389d-eb9a-5014-b391-9cd5f908720d
...	...	...	...
803	TARGET-30-PAKKMP	TARGET-NBL	None
804	TARGET-30-PADKLJ	TARGET-NBL	None
805	TARGET-30-PAKHWS	TARGET-NBL	None
806	TARGET-30-PAHPEL	TARGET-NBL	None
807	TARGET-30-PAIFAU	TARGET-NBL	None

808 rows × 3 columns

For the second example, we would change the <= to > as shown below:

In [15]:

%%bigquery  --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) > 1

Out[15]:

	case_barcode	project_short_name	case_gdc_id
0	TARGET-30-PATSRD	TARGET-NBL	fef92ed0-242b-5564-ad92-6b35c21c3bd5
1	TARGET-30-PATTEF	TARGET-NBL	fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
2	TARGET-30-PARJAR	TARGET-NBL	feb97edc-ce83-5fd5-94e3-261ce244ac52
3	TARGET-30-PAUAZA	TARGET-NBL	fe831368-c7ce-5e2b-b0fd-c35216a7761d
4	TARGET-30-PAIJGC	TARGET-NBL	fe58a8bf-8306-5aaf-99d1-b65c20fedc58
...	...	...	...
954	TARGET-30-PAHXGX	TARGET-NBL	None
955	TARGET-30-PAKEVZ	TARGET-NBL	None
956	TARGET-30-PAKKMP	TARGET-NBL	None
957	TARGET-30-PAKHWS	TARGET-NBL	None
958	TARGET-30-PAHPEL	TARGET-NBL	None

959 rows × 3 columns

If you want to label the entire cohort and only create one query instead of two, a column can be added to label each case with the subset number.

In [16]:

%%bigquery  --project your_project_number
SELECT case_barcode, project_short_name, case_gdc_id, MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) as subset
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"

Out[16]:

	case_barcode	project_short_name	case_gdc_id	subset
0	TARGET-30-PATSRD	TARGET-NBL	fef92ed0-242b-5564-ad92-6b35c21c3bd5	1
1	TARGET-30-PATTEF	TARGET-NBL	fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb	0
2	TARGET-30-PARJAR	TARGET-NBL	feb97edc-ce83-5fd5-94e3-261ce244ac52	0
3	TARGET-30-PAUAZA	TARGET-NBL	fe831368-c7ce-5e2b-b0fd-c35216a7761d	2
4	TARGET-30-PAIJGC	TARGET-NBL	fe58a8bf-8306-5aaf-99d1-b65c20fedc58	2
...	...	...	...	...
1175	TARGET-30-PADKLJ	TARGET-NBL	None	2
1176	TARGET-30-PAKHWS	TARGET-NBL	None	2
1177	TARGET-30-PAHPEL	TARGET-NBL	None	2
1178	TARGET-30-PAIFAU	TARGET-NBL	None	2
1179	TARGET-30-PALGNJ	TARGET-NBL	None	0

1180 rows × 4 columns

The subsets can then be filtered out and manipulated in python. Have fun random sampling the data! Please let us know if you have questions by emailing us at feedback@isb-cgc.org.

ISB-CGC Community Notebooks¶

How to Create a Random Sample in BigQuery¶

RAND() Function for Randomly Splitting data¶

MOD() and FARM_FINGERPRINT Functions¶

`RAND()` Function for Randomly Splitting data¶

`MOD()` and `FARM_FINGERPRINT` Functions¶