ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   One-Way ANOVA Test in BigQuery
Author:  Lauren Hagen
Created: 2019-08-02
Purpose: Demonstrate an ANOVA test within BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_an_ANOVA_test_in_BigQuery.ipynb
Notes:   This notebook was adapted from work by David L Gibbs, September 2017 Query of the Month

Running ANOVA tests in BigQuery

In this notebook, we will cover how to do a one-way ANOVA test in BigQuery. This statistical test can be used to determine whether there is a statistically significant difference between the means of two or more independent groups. Although in this example, I’m only looking at two groups, it would not be difficult to extend this to any number of groups, assuming there is a reasonable number of samples within each group.

Consider the model yij = m + ai + eij, where yij is a continuous variable over samples j, in groups i, and ai is a constant for each group i, and eij is a gaussian error term with mean 0.

Using this model, we are describing the data as being sampled from groups, with each group having a mean value equal to m + ai. The null hypothesis is that each of the group means is the same (ie that the ai terms are zero), while the alternative hypothesis is that at least one of the ai terms is not zero.

We use the F-test to compare these two hypotheses. To compute the test statistic, we compute the within-group variation and the between-group variation. Recall that sample variance is defined as the sum of squared differences between observations and the mean, divided by the number of samples (normalized).

First, we will need to import bigquery module, authenticate ourselves, and create a client variable. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

In [ ]:
# Import BigQuery Module
from google.cloud import bigquery
In [ ]:
# Autheticate ourselves
!gcloud auth application-default login
In [ ]:
# Create a variable for which client to use with BigQuery
project = 'your_project_number' # Replace with your project ID
if project_num == 'your_project_number':
    print('Please update the project number with your Google Cloud Project')
else:
    client = bigquery.Client(project) 

Let’s look at the query: Note: you will need to update 'your_project_number' with your project number before continuing with the notebook

In [ ]:
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery anova --project your_project_number
WITH
  -- using standard SQL,
  -- we'll select our cohort and gene expression
  --
  cohortExpr AS (
  SELECT
    sample_barcode,
    LOG10(normalized_count) AS expr
  FROM
    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`
  WHERE
    project_short_name = 'TCGA-BRCA'
    AND HGNC_gene_symbol = 'TP53'
    AND normalized_count IS NOT NULL
    AND normalized_count > 0),
  --
  -- And we'll select the variant data for our cohort,
  -- we're going to be comparing variant types (SNP, DEL, etc)
  --
  cohortVar AS (
  SELECT
    Variant_Type,
    sample_barcode_tumor AS sample_barcode
  FROM
    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`
  WHERE
    SYMBOL = 'TP53' ),
  --
  -- then we join the expression and variant data using sample barcodes
  --
  cohort AS (
  SELECT
    cohortExpr.sample_barcode AS sample_barcode,
    Variant_Type AS group_name,
    expr
  FROM
    cohortExpr
  JOIN
    cohortVar
  ON
    cohortExpr.sample_barcode = cohortVar.sample_barcode ),
  --
  -- First part of the calculation, the grand mean (over everything)
  --
  grandMeanTable AS (
  SELECT
    AVG(expr) AS grand_mean
  FROM
    cohort ),
  --
  -- Then we need a mean per group, and we can get a count of samples
  -- per group.
  --
  groupMeansTable AS (
  SELECT
    AVG(expr) AS group_mean,
    group_name,
    COUNT(sample_barcode) AS n
  FROM
    cohort
  GROUP BY
    group_name),
  --
  -- To get the between-group variance
  -- we take the difference between the grand mean
  -- and the means for each group and sum over all samples
  -- ... a short cut being taking the product with n.
  -- Later we'll sum over the n_sq_diff
  --
  ssBetween AS (
  SELECT
    group_name,
    group_mean,
    grand_mean,
    n,
    n*POW(group_mean - grand_mean,2) AS n_diff_sq
  FROM
    groupMeansTable
  CROSS JOIN
    grandMeanTable ),
  --
  -- Then, to get the variance within each group
  -- we have to build a table matching up the group mean
  -- with the values for each group. So we join the group
  -- means to the values on group name. We are going to
  -- sum over this table just like ssBetween
  --
  ssWithin AS (
  SELECT
    a.group_name AS group_name,
    expr,
    group_mean,
    b.n AS n,
    POW(expr - group_mean, 2) AS s2
  FROM
    cohort a
  JOIN
    ssBetween b
  ON
    a.group_name = b.group_name ),
  --
  -- The F stat comes from a ratio, the numerator is
  -- calculated using the between group variance, and
  -- dividing by the number of groups (k) minus 1.
  --
  numerator AS (
  SELECT
    'dummy' AS dummy,
    SUM(n_diff_sq) / (count(group_name) - 1) AS mean_sq_between
  FROM
    ssBetween ),
  --
  -- The denominator of the F stat ratio is found using the
  -- variance within groups. We divide the sum of the within
  -- group variance and divide it by (n-k).
  --
  denominator AS (
  SELECT
    'dummy' AS dummy,
    COUNT(distinct(group_name)) AS k,
    COUNT(group_name) AS n,
    SUM(s2)/(COUNT(group_name)-COUNT(distinct(group_name))) AS mean_sq_within
  FROM
    ssWithin),
  --
  -- Now we're ready to calculate F!
  --
  Ftable AS (
  SELECT
    n,
    k,
    mean_sq_between,
    mean_sq_within,
    mean_sq_between / mean_sq_within AS F
  FROM
    numerator
  JOIN
    denominator
  ON
    numerator.dummy = denominator.dummy)

SELECT
  *
FROM
  Ftable
In [ ]:
anova
Out[ ]:
n k mean_sq_between mean_sq_within F
0 375 3 6.230237 0.095634 65.146988

OK, so let’s check our work. Using the BRCA cohort and TP53 as our gene, we have 375 samples with a variant in this gene. We’re going to look at whether the type of variant is related to the gene expression we observe. If we just pull down the data using the ‘cohort’ subtable (as above), we can get a small data frame, which let’s us do the standard F stat table.

In [ ]:
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery dat --project your_project_number
WITH
  -- using standard SQL,
  -- we'll select our cohort and gene expression
  --
  cohortExpr AS (
  SELECT
    sample_barcode,
    LOG10(normalized_count) AS expr
  FROM
    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`
  WHERE
    project_short_name = 'TCGA-BRCA'
    AND HGNC_gene_symbol = 'TP53'
    AND normalized_count IS NOT NULL
    AND normalized_count > 0),
  --
  -- And we'll select the variant data for our cohort,
  -- we're going to be comparing variant types (SNP, DEL, etc)
  --
  cohortVar AS (
  SELECT
    Variant_Type,
    sample_barcode_tumor AS sample_barcode
  FROM
    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`
  WHERE
    SYMBOL = 'TP53' ),
  --
  -- then we join the expression and variant data using sample barcodes
  --
  cohort AS (
  SELECT
    cohortExpr.sample_barcode AS sample_barcode,
    Variant_Type AS group_name,
    expr
  FROM
    cohortExpr
  JOIN
    cohortVar
  ON
    cohortExpr.sample_barcode = cohortVar.sample_barcode )
SELECT sample_barcode, group_name, expr
FROM cohort
In [ ]:
dat.head()
Out[ ]:
sample_barcode group_name expr
0 TCGA-E2-A155-01A SNP 2.982242
1 TCGA-AR-A0TV-01A SNP 3.139173
2 TCGA-A2-A0CL-01A SNP 3.340913
3 TCGA-B6-A0IK-01A SNP 3.470800
4 TCGA-AN-A0FL-01A SNP 2.312675
In [ ]:
import pandas as pd
In [ ]:
dat.groupby(['group_name']).describe()
Out[ ]:
expr
count mean std min 25% 50% 75% max
group_name
DEL 57.0 2.791941 0.322067 2.298823 2.579250 2.743779 2.913681 3.551119
INS 15.0 2.642215 0.115888 2.462272 2.546598 2.662163 2.719217 2.888580
SNP 303.0 3.218580 0.312959 2.312675 3.100479 3.274329 3.427654 3.796834
In [ ]:
from scipy import stats
In [ ]:
dat_DEL = dat[dat['group_name']=='DEL']
dat_INS = dat[dat['group_name']=='INS']
dat_SNP = dat[dat['group_name']=='SNP']
stats.f_oneway(dat_DEL['expr'],dat_INS['expr'],dat_SNP['expr'])
Out[ ]:
F_onewayResult(statistic=65.1469883103552, pvalue=5.5310088412358445e-25)

OK, if you run the above BigQuery, you’ll see the same results. We see that the F statistic is really high, which makes sense looking at the difference in mean expression values across the groups (these are log10 expression values).