# ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   One-Way ANOVA Test in BigQuery
Author:  Lauren Hagen
Created: 2019-08-02
Purpose: Demonstrate an ANOVA test within BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_an_ANOVA_test_in_BigQuery.ipynb
Notes:   This notebook was adapted from work by David L Gibbs, September 2017 Query of the Month

## Running ANOVA tests in BigQuery¶

In this notebook, we will cover how to do a one-way ANOVA test in BigQuery. This statistical test can be used to determine whether there is a statistically significant difference between the means of two or more independent groups. Although in this example, I’m only looking at two groups, it would not be difficult to extend this to any number of groups, assuming there is a reasonable number of samples within each group.

Consider the model yij = m + ai + eij, where yij is a continuous variable over samples j, in groups i, and ai is a constant for each group i, and eij is a gaussian error term with mean 0.

Using this model, we are describing the data as being sampled from groups, with each group having a mean value equal to m + ai. The null hypothesis is that each of the group means is the same (ie that the ai terms are zero), while the alternative hypothesis is that at least one of the ai terms is not zero.

We use the F-test to compare these two hypotheses. To compute the test statistic, we compute the within-group variation and the between-group variation. Recall that sample variance is defined as the sum of squared differences between observations and the mean, divided by the number of samples (normalized).

First, we will need to import bigquery module, authenticate ourselves, and create a client variable. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

In [ ]:
# Import BigQuery Module

In [ ]:
# Autheticate ourselves

In [ ]:
# Create a variable for which client to use with BigQuery
project = 'your_project_number' # Replace with your project ID
if project_num == 'your_project_number':
else:
client = bigquery.Client(project)


Let’s look at the query: Note: you will need to update 'your_project_number' with your project number before continuing with the notebook

In [ ]:
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery anova --project your_project_number
WITH
-- using standard SQL,
-- we'll select our cohort and gene expression
--
cohortExpr AS (
SELECT
sample_barcode,
LOG10(normalized_count) AS expr
FROM
isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM
WHERE
project_short_name = 'TCGA-BRCA'
AND HGNC_gene_symbol = 'TP53'
AND normalized_count IS NOT NULL
AND normalized_count > 0),
--
-- And we'll select the variant data for our cohort,
-- we're going to be comparing variant types (SNP, DEL, etc)
--
cohortVar AS (
SELECT
Variant_Type,
sample_barcode_tumor AS sample_barcode
FROM
isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3
WHERE
SYMBOL = 'TP53' ),
--
-- then we join the expression and variant data using sample barcodes
--
cohort AS (
SELECT
cohortExpr.sample_barcode AS sample_barcode,
Variant_Type AS group_name,
expr
FROM
cohortExpr
JOIN
cohortVar
ON
cohortExpr.sample_barcode = cohortVar.sample_barcode ),
--
-- First part of the calculation, the grand mean (over everything)
--
grandMeanTable AS (
SELECT
AVG(expr) AS grand_mean
FROM
cohort ),
--
-- Then we need a mean per group, and we can get a count of samples
-- per group.
--
groupMeansTable AS (
SELECT
AVG(expr) AS group_mean,
group_name,
COUNT(sample_barcode) AS n
FROM
cohort
GROUP BY
group_name),
--
-- To get the between-group variance
-- we take the difference between the grand mean
-- and the means for each group and sum over all samples
-- ... a short cut being taking the product with n.
-- Later we'll sum over the n_sq_diff
--
ssBetween AS (
SELECT
group_name,
group_mean,
grand_mean,
n,
n*POW(group_mean - grand_mean,2) AS n_diff_sq
FROM
groupMeansTable
CROSS JOIN
grandMeanTable ),
--
-- Then, to get the variance within each group
-- we have to build a table matching up the group mean
-- with the values for each group. So we join the group
-- means to the values on group name. We are going to
-- sum over this table just like ssBetween
--
ssWithin AS (
SELECT
a.group_name AS group_name,
expr,
group_mean,
b.n AS n,
POW(expr - group_mean, 2) AS s2
FROM
cohort a
JOIN
ssBetween b
ON
a.group_name = b.group_name ),
--
-- The F stat comes from a ratio, the numerator is
-- calculated using the between group variance, and
-- dividing by the number of groups (k) minus 1.
--
numerator AS (
SELECT
'dummy' AS dummy,
SUM(n_diff_sq) / (count(group_name) - 1) AS mean_sq_between
FROM
ssBetween ),
--
-- The denominator of the F stat ratio is found using the
-- variance within groups. We divide the sum of the within
-- group variance and divide it by (n-k).
--
denominator AS (
SELECT
'dummy' AS dummy,
COUNT(distinct(group_name)) AS k,
COUNT(group_name) AS n,
SUM(s2)/(COUNT(group_name)-COUNT(distinct(group_name))) AS mean_sq_within
FROM
ssWithin),
--
-- Now we're ready to calculate F!
--
Ftable AS (
SELECT
n,
k,
mean_sq_between,
mean_sq_within,
mean_sq_between / mean_sq_within AS F
FROM
numerator
JOIN
denominator
ON
numerator.dummy = denominator.dummy)

SELECT
*
FROM
Ftable

In [ ]:
anova

Out[ ]:
n k mean_sq_between mean_sq_within F
0 375 3 6.230237 0.095634 65.146988

OK, so let’s check our work. Using the BRCA cohort and TP53 as our gene, we have 375 samples with a variant in this gene. We’re going to look at whether the type of variant is related to the gene expression we observe. If we just pull down the data using the ‘cohort’ subtable (as above), we can get a small data frame, which let’s us do the standard F stat table.

In [ ]:
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery dat --project your_project_number
WITH
-- using standard SQL,
-- we'll select our cohort and gene expression
--
cohortExpr AS (
SELECT
sample_barcode,
LOG10(normalized_count) AS expr
FROM
isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM
WHERE
project_short_name = 'TCGA-BRCA'
AND HGNC_gene_symbol = 'TP53'
AND normalized_count IS NOT NULL
AND normalized_count > 0),
--
-- And we'll select the variant data for our cohort,
-- we're going to be comparing variant types (SNP, DEL, etc)
--
cohortVar AS (
SELECT
Variant_Type,
sample_barcode_tumor AS sample_barcode
FROM
isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3
WHERE
SYMBOL = 'TP53' ),
--
-- then we join the expression and variant data using sample barcodes
--
cohort AS (
SELECT
cohortExpr.sample_barcode AS sample_barcode,
Variant_Type AS group_name,
expr
FROM
cohortExpr
JOIN
cohortVar
ON
cohortExpr.sample_barcode = cohortVar.sample_barcode )
SELECT sample_barcode, group_name, expr
FROM cohort

In [ ]:
dat.head()

Out[ ]:
sample_barcode group_name expr
0 TCGA-E2-A155-01A SNP 2.982242
1 TCGA-AR-A0TV-01A SNP 3.139173
2 TCGA-A2-A0CL-01A SNP 3.340913
3 TCGA-B6-A0IK-01A SNP 3.470800
4 TCGA-AN-A0FL-01A SNP 2.312675
In [ ]:
import pandas as pd

In [ ]:
dat.groupby(['group_name']).describe()

Out[ ]:
expr
count mean std min 25% 50% 75% max
group_name
DEL 57.0 2.791941 0.322067 2.298823 2.579250 2.743779 2.913681 3.551119
INS 15.0 2.642215 0.115888 2.462272 2.546598 2.662163 2.719217 2.888580
SNP 303.0 3.218580 0.312959 2.312675 3.100479 3.274329 3.427654 3.796834
In [ ]:
from scipy import stats

In [ ]:
dat_DEL = dat[dat['group_name']=='DEL']
dat_INS = dat[dat['group_name']=='INS']
dat_SNP = dat[dat['group_name']=='SNP']
stats.f_oneway(dat_DEL['expr'],dat_INS['expr'],dat_SNP['expr'])

Out[ ]:
F_onewayResult(statistic=65.1469883103552, pvalue=5.5310088412358445e-25)

OK, if you run the above BigQuery, you’ll see the same results. We see that the F statistic is really high, which makes sense looking at the difference in mean expression values across the groups (these are log10 expression values).