1000 Genomes via bigrquery and dplyr¶

This notebook illustrates access to genotype calls via BigQuery. Extraction of a few genotype calls is illustrated. Programming patterns that efficiently collect large numbers of genotypes for analysis are not yet identified.

https://cloud.google.com/genomics/docs/how-tos/analyze-variants will have to be studied. I have just used routine bigrquery programming to poke around here. In the VCF-based notebook it was fairly easy to get all the calls on a chromosomal region, to form a matrix to which PCA could be applied. It would be nice to replicate that action in this framework.

Setup¶

"First, be sure to run notebook R environment setup in this workspace." -- This follows advice in the terra playground.

In [1]:

install_if_missing <- function(packages) {
    if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
        install.packages(setdiff(packages, rownames(installed.packages())))
    }
}

install_if_missing(c('tidyverse', 'viridis', 'ggthemes', 'qwraps2', 'pryr', 'skimr',
                     'testthat', 'reticulate', 'data.table', 'RCurl','stringr',
                    'bigrquery'))

devtools::install_github('DataBiosphere/ronaldo')

Skipping install of 'Ronaldo' from a github remote, the SHA1 (426459ff) has not changed since last install.
  Use `force = TRUE` to force installation

Then in this section we:

load the needed R packages
set the project id of the cloud project to bill for queries to BigQuery
authorize our bigquery client library to issue requests

In [2]:

# Load the libraries into memory
suppressPackageStartupMessages({
 library(bigrquery)
 library(dplyr)
 library(skimr)
 library(ggplot2)
})

In [3]:

# Authorize bigrquery client
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
bigrquery::set_service_token(Ronaldo::getServiceAccountKey())

Connect to the human genome variants in BigQuery¶

In [4]:

# Create a "connection" to a public BigQuery dataset.
dbcon <- bigrquery::src_bigquery(project = 'bigquery-public-data',
                                 dataset = 'human_genome_variants',
                                 billing = BILLING_PROJECT_ID)

In [5]:

dbcon

src:  BigQueryConnection
tbls: 1000_genomes_pedigree,
  1000_genomes_phase_3_optimized_schema_variants_20150220,
  1000_genomes_phase_3_variants_20150220, 1000_genomes_sample_info,
  platinum_genomes_deepvariant_variants_20180823,
  simons_genome_diversity_project_sample_attributes,
  simons_genome_diversity_project_sample_metadata,
  simons_genome_diversity_project_sample_variants

Examine the catalog of variants, phase 3 vintage 2015¶

In [6]:

dbcon %>% tbl("1000_genomes_phase_3_variants_20150220")

# Source:   table<1000_genomes_phase_3_variants_20150220> [?? x 30]
# Database: BigQueryConnection
   reference_name start_position end_position reference_bases alternate_bases
   <chr>                   <int>        <int> <chr>           <list>         
 1 1                      124852       124853 A               <tibble [1 × 8…
 2 1                      129671       129672 A               <tibble [1 × 8…
 3 1                       55646        55647 A               <tibble [1 × 8…
 4 1                       85106        85107 A               <tibble [1 × 8…
 5 1                      532446       532447 A               <tibble [1 × 8…
 6 1                       84699        84700 A               <tibble [1 × 8…
 7 1                      566059       566060 A               <tibble [1 × 8…
 8 1                      599981       599982 A               <tibble [1 × 8…
 9 1                      251664       251665 A               <tibble [1 × 8…
10 1                      235067       235068 A               <tibble [1 × 8…
# … with more rows, and 25 more variables: names <list>, quality <dbl>,
#   filter <list>, call <list>, CIEND <list>, CIPOS <list>, CS <chr>,
#   IMPRECISE <lgl>, MC <list>, MEINFO <list>, MEND <int>, MLEN <int>,
#   MSTART <int>, SVLEN <list>, SVTYPE <chr>, TSD <chr>, NS <int>, AN <int>,
#   DP <int>, AA <chr>, VT <list>, EX_TARGET <lgl>, MULTI_ALLELIC <lgl>,
#   OLD_VARIANT <chr>, partition_date_please_ignore <date>

In [7]:

# this shows how the chromosomes are named in the database
# dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>% select(reference_name)

What do alternate bases look like? This command is slow. The answer does not seem right.

In [8]:

(dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>% select(start_position, alternate_bases) %>% 
      as.data.frame(n=2))

start_position	alternate_bases
97245325	<CN0> , 1 , 0.000199681, 0 , 0 , 8e-04 , 0 , 0
97240180	G , 3 , 0.000599042, 0 , 0 , 0 , 0.0043 , 0

Compose a function that queries calls for a specific locus¶

In [9]:

query_loc = function(con, tablename="1000_genomes_phase_3_variants_20150220", chr, pos) {
  tmp = (con %>% tbl(tablename) %>% select(reference_name, start_position, names, call) %>%
    filter(reference_name==chr, start_position==pos) %>% as.data.frame())
  data.frame(id=as.character(tmp$call[[1]]$name), snp=as.character(tmp$names), 
             call=sapply(tmp$call[[1]]$genotype,sum))
}

In [10]:

oneloc = query_loc(dbcon, "1000_genomes_phase_3_variants_20150220", chr="17", pos=85101)
head(oneloc)
table(oneloc$call)

id	snp	call
HG00096	rs549730151	0
HG00097	rs549730151	0
HG00099	rs549730151	0
HG00100	rs549730151	0
HG00101	rs549730151	0
HG00102	rs549730151	0

   0    1 
2502    2

In this cohort, two individuals are heterozygous for this SNP; all others are homozygous for the reference allele.

Check the details of the 'call' represented in BigQuery¶

The point of this section is to show that a somewhat nonstandard data representation is used for the call information. When we use "table-to-data.frame" patterns familiar with BigQuery, the genotype field comes back as a column of class "list".

In [11]:

df1 = dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>% 
  filter(reference_name=="17", start_position == 85101) %>% 
  select(reference_name,start_position, names, call) %>% as.data.frame()

In [12]:

names(df1)

'reference_name'
'start_position'
'names'
'call'

In [13]:

library(tibble)
as_tibble(df1$call[[1]]) %>% head()

name	genotype	phaseset	CN	CNQ	GQ	FT
HG00096	0, 0	*	NA	NA	NA	NA
HG00097	0, 0	*	NA	NA	NA	NA
HG00099	0, 0	*	NA	NA	NA	NA
HG00100	0, 0	*	NA	NA	NA	NA
HG00101	0, 0	*	NA	NA	NA	NA
HG00102	0, 0	*	NA	NA	NA	NA

Examine the simons diversity project variants¶

In [14]:

dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>% 
       filter(start_position==24228721)

# Source:   lazy query [?? x 29]
# Database: BigQueryConnection
  reference_name start_position end_position reference_bases alternate_bases
  <chr>                   <int>        <int> <chr>           <list>         
1 1                    24228721     24228722 A               <tibble [1 × 5…
2 1                    24228721     24228722 A               <tibble [1 × 5…
3 1                    24228721     24228722 A               <tibble [1 × 5…
# … with 24 more variables: names <list>, quality <dbl>, filter <list>,
#   call <list>, AN <int>, BaseCounts <list>, BaseQRankSum <dbl>, DB <lgl>,
#   DP <int>, DS <lgl>, Dels <dbl>, FS <dbl>, GC <dbl>, HaplotypeScore <dbl>,
#   InbreedingCoeff <dbl>, MQ <dbl>, MQ0 <int>, MQRankSum <dbl>, QD <dbl>,
#   RPA <list>, RU <chr>, ReadPosRankSum <dbl>, STR <lgl>,
#   partition_date_please_ignore <date>

In [15]:

mydf = dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>% 
       filter(start_position==24228721) %>% select(call) %>% as.data.frame()

In [16]:

dim(mydf)

3
1

In [17]:

#str(mydf[1,])

In [18]:

chk = dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>% 
       filter(start_position>=24228721 & start_position <=24230000)

In [19]:

chk

# Source:   lazy query [?? x 29]
# Database: BigQueryConnection
   reference_name start_position end_position reference_bases alternate_bases
   <chr>                   <int>        <int> <chr>           <list>         
 1 1                    24228721     24228722 A               <tibble [1 × 5…
 2 1                    24228721     24228722 A               <tibble [1 × 5…
 3 1                    24229105     24229106 A               <tibble [1 × 5…
 4 1                    24229275     24229276 T               <tibble [1 × 5…
 5 1                    24228761     24228762 G               <tibble [1 × 5…
 6 1                    24229412     24229413 G               <tibble [1 × 5…
 7 1                    24229824     24229825 G               <tibble [1 × 5…
 8 1                    24229031     24229032 C               <tibble [1 × 5…
 9 1                    24228862     24228863 G               <tibble [1 × 5…
10 1                    24229685     24229686 G               <tibble [1 × 5…
# … with more rows, and 24 more variables: names <list>, quality <dbl>,
#   filter <list>, call <list>, AN <int>, BaseCounts <list>,
#   BaseQRankSum <dbl>, DB <lgl>, DP <int>, DS <lgl>, Dels <dbl>, FS <dbl>,
#   GC <dbl>, HaplotypeScore <dbl>, InbreedingCoeff <dbl>, MQ <dbl>, MQ0 <int>,
#   MQRankSum <dbl>, QD <dbl>, RPA <list>, RU <chr>, ReadPosRankSum <dbl>,
#   STR <lgl>, partition_date_please_ignore <date>

In [20]:

dim(chk %>% select(call) %>% as.data.frame(n=2))

2
1

In [21]:

names(mydf$call[[1]])

'name'
'genotype'
'phaseset'
'AD'
'DP'
'GQ'
'PL'
'FL'
'quality'

In [22]:

head(mydf$call[[1]])

name	genotype	phaseset	AD	DP	GQ	PL	FL	quality
LP6005442-DNA_A04	1, 1	NA	0, 1	1	3	38, 3, 0	1	39.77
LP6005441-DNA_D09	1, 1	NA	0, 1	1	3	38, 3, 0	0	39.77
LP6005443-DNA_D02	1, 1	NA	0, 1	1	3	34, 3, 0	5	35.77
LP6005443-DNA_G07	1, 1	NA	0, 1	1	3	23, 3, 0	7	24.79
LP6005677-DNA_D03	1, 1	NA	0, 1	1	3	38, 3, 0	2	39.77
LP6005441-DNA_F07	1, 1	NA	0, 1	1	3	38, 3, 0	3	39.77

In [23]:

summary(mydf$call[[1]]$DP)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.253   1.000   3.000

Provenance¶

In [24]:

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Debian GNU/Linux 9 (stretch)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2019-04-10                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version     date       lib
 askpass       1.1         2019-01-13 [2]
 assertthat    0.2.1       2019-03-21 [1]
 backports     1.1.3       2018-12-14 [2]
 base64enc     0.1-3       2015-07-28 [2]
 bigrquery   * 1.1.0       2019-02-05 [2]
 bit           1.1-14      2018-05-29 [2]
 bit64         0.9-7       2017-05-08 [2]
 callr         3.2.0       2019-03-15 [2]
 cli           1.1.0       2019-03-19 [1]
 colorspace    1.4-1       2019-03-18 [1]
 crayon        1.3.4       2017-09-16 [2]
 curl          3.3         2019-01-10 [2]
 DBI           1.0.0       2018-05-02 [2]
 dbplyr        1.3.0       2019-01-09 [2]
 desc          1.2.0       2018-05-01 [2]
 devtools      2.0.1       2018-10-26 [2]
 digest        0.6.18      2018-10-10 [2]
 dplyr       * 0.8.0.1     2019-02-15 [2]
 evaluate      0.13        2019-02-12 [2]
 fansi         0.4.0       2018-10-05 [2]
 fs            1.2.6       2018-08-23 [2]
 ggplot2     * 3.1.0       2018-10-25 [2]
 glue          1.3.1       2019-03-12 [2]
 gtable        0.3.0       2019-03-25 [1]
 htmltools     0.3.6       2017-04-28 [2]
 httr          1.4.0       2018-12-11 [2]
 IRdisplay     0.7.0       2018-11-29 [2]
 IRkernel      0.8.15.9000 2019-03-18 [2]
 jsonlite      1.6         2018-12-07 [2]
 lazyeval      0.2.2       2019-03-15 [2]
 magrittr      1.5         2014-11-22 [2]
 memoise       1.1.0       2017-04-21 [2]
 munsell       0.5.0       2018-06-12 [2]
 openssl       1.3         2019-03-22 [1]
 pbdZMQ        0.3-3       2018-05-05 [2]
 pillar        1.3.1       2018-12-15 [2]
 pkgbuild      1.0.2       2018-10-16 [2]
 pkgconfig     2.0.2       2018-08-16 [2]
 pkgload       1.0.2       2018-10-29 [2]
 plyr          1.8.4       2016-06-08 [2]
 prettyunits   1.0.2       2015-07-13 [2]
 processx      3.3.0       2019-03-10 [2]
 ps            1.3.0       2018-12-21 [2]
 purrr         0.3.2       2019-03-15 [2]
 R6            2.4.0       2019-02-14 [2]
 Rcpp          1.0.1       2019-03-17 [2]
 remotes       2.0.2       2018-10-30 [2]
 repr          0.19.2      2019-02-06 [2]
 rlang         0.3.3       2019-03-29 [1]
 Ronaldo       0.1.0       2019-03-18 [2]
 rprojroot     1.3-2       2018-01-03 [2]
 scales        1.0.0       2018-08-09 [2]
 sessioninfo   1.1.1       2018-11-05 [2]
 skimr       * 1.0.5       2019-02-25 [1]
 testthat      2.0.1       2018-10-13 [1]
 tibble      * 2.1.1       2019-03-16 [2]
 tidyselect    0.2.5       2018-10-11 [2]
 usethis       1.4.0       2018-08-14 [2]
 utf8          1.1.4       2018-05-24 [2]
 uuid          0.1-2       2015-07-28 [2]
 withr         2.1.2       2018-03-15 [2]
 source                                
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 Github (IRkernel/IRkernel@dbe3a10)    
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 Github (DataBiosphere/Ronaldo@426459f)
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        
 CRAN (R 3.5.2)                        

[1] /home/jupyter-user/.rpackages
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library