This notebook illustrates access to genotype calls via BigQuery. Extraction of a few genotype calls is illustrated. Programming patterns that efficiently collect large numbers of genotypes for analysis are not yet identified.
https://cloud.google.com/genomics/docs/how-tos/analyze-variants will have to be studied. I have just used routine bigrquery programming to poke around here. In the VCF-based notebook it was fairly easy to get all the calls on a chromosomal region, to form a matrix to which PCA could be applied. It would be nice to replicate that action in this framework.
"First, be sure to run notebook R environment setup
in this workspace." -- This follows advice in the terra playground.
install_if_missing <- function(packages) {
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))
}
}
install_if_missing(c('tidyverse', 'viridis', 'ggthemes', 'qwraps2', 'pryr', 'skimr',
'testthat', 'reticulate', 'data.table', 'RCurl','stringr',
'bigrquery'))
devtools::install_github('DataBiosphere/ronaldo')
Skipping install of 'Ronaldo' from a github remote, the SHA1 (426459ff) has not changed since last install. Use `force = TRUE` to force installation
Then in this section we:
# Load the libraries into memory
suppressPackageStartupMessages({
library(bigrquery)
library(dplyr)
library(skimr)
library(ggplot2)
})
# Authorize bigrquery client
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
bigrquery::set_service_token(Ronaldo::getServiceAccountKey())
# Create a "connection" to a public BigQuery dataset.
dbcon <- bigrquery::src_bigquery(project = 'bigquery-public-data',
dataset = 'human_genome_variants',
billing = BILLING_PROJECT_ID)
dbcon
src: BigQueryConnection tbls: 1000_genomes_pedigree, 1000_genomes_phase_3_optimized_schema_variants_20150220, 1000_genomes_phase_3_variants_20150220, 1000_genomes_sample_info, platinum_genomes_deepvariant_variants_20180823, simons_genome_diversity_project_sample_attributes, simons_genome_diversity_project_sample_metadata, simons_genome_diversity_project_sample_variants
dbcon %>% tbl("1000_genomes_phase_3_variants_20150220")
# Source: table<1000_genomes_phase_3_variants_20150220> [?? x 30] # Database: BigQueryConnection reference_name start_position end_position reference_bases alternate_bases <chr> <int> <int> <chr> <list> 1 1 124852 124853 A <tibble [1 × 8… 2 1 129671 129672 A <tibble [1 × 8… 3 1 55646 55647 A <tibble [1 × 8… 4 1 85106 85107 A <tibble [1 × 8… 5 1 532446 532447 A <tibble [1 × 8… 6 1 84699 84700 A <tibble [1 × 8… 7 1 566059 566060 A <tibble [1 × 8… 8 1 599981 599982 A <tibble [1 × 8… 9 1 251664 251665 A <tibble [1 × 8… 10 1 235067 235068 A <tibble [1 × 8… # … with more rows, and 25 more variables: names <list>, quality <dbl>, # filter <list>, call <list>, CIEND <list>, CIPOS <list>, CS <chr>, # IMPRECISE <lgl>, MC <list>, MEINFO <list>, MEND <int>, MLEN <int>, # MSTART <int>, SVLEN <list>, SVTYPE <chr>, TSD <chr>, NS <int>, AN <int>, # DP <int>, AA <chr>, VT <list>, EX_TARGET <lgl>, MULTI_ALLELIC <lgl>, # OLD_VARIANT <chr>, partition_date_please_ignore <date>
# this shows how the chromosomes are named in the database
# dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>% select(reference_name)
What do alternate bases look like? This command is slow. The answer does not seem right.
(dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>% select(start_position, alternate_bases) %>%
as.data.frame(n=2))
start_position | alternate_bases |
---|---|
97245325 | <CN0> , 1 , 0.000199681, 0 , 0 , 8e-04 , 0 , 0 |
97240180 | G , 3 , 0.000599042, 0 , 0 , 0 , 0.0043 , 0 |
query_loc = function(con, tablename="1000_genomes_phase_3_variants_20150220", chr, pos) {
tmp = (con %>% tbl(tablename) %>% select(reference_name, start_position, names, call) %>%
filter(reference_name==chr, start_position==pos) %>% as.data.frame())
data.frame(id=as.character(tmp$call[[1]]$name), snp=as.character(tmp$names),
call=sapply(tmp$call[[1]]$genotype,sum))
}
oneloc = query_loc(dbcon, "1000_genomes_phase_3_variants_20150220", chr="17", pos=85101)
head(oneloc)
table(oneloc$call)
id | snp | call |
---|---|---|
HG00096 | rs549730151 | 0 |
HG00097 | rs549730151 | 0 |
HG00099 | rs549730151 | 0 |
HG00100 | rs549730151 | 0 |
HG00101 | rs549730151 | 0 |
HG00102 | rs549730151 | 0 |
0 1 2502 2
In this cohort, two individuals are heterozygous for this SNP; all others are homozygous for the reference allele.
The point of this section is to show that a somewhat nonstandard data representation is used for the call information. When we use "table-to-data.frame" patterns familiar with BigQuery, the genotype
field comes back as a column of class "list".
df1 = dbcon %>% tbl("1000_genomes_phase_3_variants_20150220") %>%
filter(reference_name=="17", start_position == 85101) %>%
select(reference_name,start_position, names, call) %>% as.data.frame()
names(df1)
library(tibble)
as_tibble(df1$call[[1]]) %>% head()
name | genotype | phaseset | CN | CNL | CNP | CNQ | GP | GQ | FT | PL |
---|---|---|---|---|---|---|---|---|---|---|
HG00096 | 0, 0 | * | NA | NA | NA | NA | ||||
HG00097 | 0, 0 | * | NA | NA | NA | NA | ||||
HG00099 | 0, 0 | * | NA | NA | NA | NA | ||||
HG00100 | 0, 0 | * | NA | NA | NA | NA | ||||
HG00101 | 0, 0 | * | NA | NA | NA | NA | ||||
HG00102 | 0, 0 | * | NA | NA | NA | NA |
dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>%
filter(start_position==24228721)
# Source: lazy query [?? x 29] # Database: BigQueryConnection reference_name start_position end_position reference_bases alternate_bases <chr> <int> <int> <chr> <list> 1 1 24228721 24228722 A <tibble [1 × 5… 2 1 24228721 24228722 A <tibble [1 × 5… 3 1 24228721 24228722 A <tibble [1 × 5… # … with 24 more variables: names <list>, quality <dbl>, filter <list>, # call <list>, AN <int>, BaseCounts <list>, BaseQRankSum <dbl>, DB <lgl>, # DP <int>, DS <lgl>, Dels <dbl>, FS <dbl>, GC <dbl>, HaplotypeScore <dbl>, # InbreedingCoeff <dbl>, MQ <dbl>, MQ0 <int>, MQRankSum <dbl>, QD <dbl>, # RPA <list>, RU <chr>, ReadPosRankSum <dbl>, STR <lgl>, # partition_date_please_ignore <date>
mydf = dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>%
filter(start_position==24228721) %>% select(call) %>% as.data.frame()
dim(mydf)
#str(mydf[1,])
chk = dbcon %>% tbl("simons_genome_diversity_project_sample_variants") %>%
filter(start_position>=24228721 & start_position <=24230000)
chk
# Source: lazy query [?? x 29] # Database: BigQueryConnection reference_name start_position end_position reference_bases alternate_bases <chr> <int> <int> <chr> <list> 1 1 24228721 24228722 A <tibble [1 × 5… 2 1 24228721 24228722 A <tibble [1 × 5… 3 1 24229105 24229106 A <tibble [1 × 5… 4 1 24229275 24229276 T <tibble [1 × 5… 5 1 24228761 24228762 G <tibble [1 × 5… 6 1 24229412 24229413 G <tibble [1 × 5… 7 1 24229824 24229825 G <tibble [1 × 5… 8 1 24229031 24229032 C <tibble [1 × 5… 9 1 24228862 24228863 G <tibble [1 × 5… 10 1 24229685 24229686 G <tibble [1 × 5… # … with more rows, and 24 more variables: names <list>, quality <dbl>, # filter <list>, call <list>, AN <int>, BaseCounts <list>, # BaseQRankSum <dbl>, DB <lgl>, DP <int>, DS <lgl>, Dels <dbl>, FS <dbl>, # GC <dbl>, HaplotypeScore <dbl>, InbreedingCoeff <dbl>, MQ <dbl>, MQ0 <int>, # MQRankSum <dbl>, QD <dbl>, RPA <list>, RU <chr>, ReadPosRankSum <dbl>, # STR <lgl>, partition_date_please_ignore <date>
dim(chk %>% select(call) %>% as.data.frame(n=2))
names(mydf$call[[1]])
head(mydf$call[[1]])
name | genotype | phaseset | AD | DP | GQ | PL | FL | quality |
---|---|---|---|---|---|---|---|---|
LP6005442-DNA_A04 | 1, 1 | NA | 0, 1 | 1 | 3 | 38, 3, 0 | 1 | 39.77 |
LP6005441-DNA_D09 | 1, 1 | NA | 0, 1 | 1 | 3 | 38, 3, 0 | 0 | 39.77 |
LP6005443-DNA_D02 | 1, 1 | NA | 0, 1 | 1 | 3 | 34, 3, 0 | 5 | 35.77 |
LP6005443-DNA_G07 | 1, 1 | NA | 0, 1 | 1 | 3 | 23, 3, 0 | 7 | 24.79 |
LP6005677-DNA_D03 | 1, 1 | NA | 0, 1 | 1 | 3 | 38, 3, 0 | 2 | 39.77 |
LP6005441-DNA_F07 | 1, 1 | NA | 0, 1 | 1 | 3 | 38, 3, 0 | 3 | 39.77 |
summary(mydf$call[[1]]$DP)
Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 1.000 1.000 1.253 1.000 3.000
devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os Debian GNU/Linux 9 (stretch) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz Etc/UTC date 2019-04-10 ─ Packages ─────────────────────────────────────────────────────────────────── package * version date lib askpass 1.1 2019-01-13 [2] assertthat 0.2.1 2019-03-21 [1] backports 1.1.3 2018-12-14 [2] base64enc 0.1-3 2015-07-28 [2] bigrquery * 1.1.0 2019-02-05 [2] bit 1.1-14 2018-05-29 [2] bit64 0.9-7 2017-05-08 [2] callr 3.2.0 2019-03-15 [2] cli 1.1.0 2019-03-19 [1] colorspace 1.4-1 2019-03-18 [1] crayon 1.3.4 2017-09-16 [2] curl 3.3 2019-01-10 [2] DBI 1.0.0 2018-05-02 [2] dbplyr 1.3.0 2019-01-09 [2] desc 1.2.0 2018-05-01 [2] devtools 2.0.1 2018-10-26 [2] digest 0.6.18 2018-10-10 [2] dplyr * 0.8.0.1 2019-02-15 [2] evaluate 0.13 2019-02-12 [2] fansi 0.4.0 2018-10-05 [2] fs 1.2.6 2018-08-23 [2] ggplot2 * 3.1.0 2018-10-25 [2] glue 1.3.1 2019-03-12 [2] gtable 0.3.0 2019-03-25 [1] htmltools 0.3.6 2017-04-28 [2] httr 1.4.0 2018-12-11 [2] IRdisplay 0.7.0 2018-11-29 [2] IRkernel 0.8.15.9000 2019-03-18 [2] jsonlite 1.6 2018-12-07 [2] lazyeval 0.2.2 2019-03-15 [2] magrittr 1.5 2014-11-22 [2] memoise 1.1.0 2017-04-21 [2] munsell 0.5.0 2018-06-12 [2] openssl 1.3 2019-03-22 [1] pbdZMQ 0.3-3 2018-05-05 [2] pillar 1.3.1 2018-12-15 [2] pkgbuild 1.0.2 2018-10-16 [2] pkgconfig 2.0.2 2018-08-16 [2] pkgload 1.0.2 2018-10-29 [2] plyr 1.8.4 2016-06-08 [2] prettyunits 1.0.2 2015-07-13 [2] processx 3.3.0 2019-03-10 [2] ps 1.3.0 2018-12-21 [2] purrr 0.3.2 2019-03-15 [2] R6 2.4.0 2019-02-14 [2] Rcpp 1.0.1 2019-03-17 [2] remotes 2.0.2 2018-10-30 [2] repr 0.19.2 2019-02-06 [2] rlang 0.3.3 2019-03-29 [1] Ronaldo 0.1.0 2019-03-18 [2] rprojroot 1.3-2 2018-01-03 [2] scales 1.0.0 2018-08-09 [2] sessioninfo 1.1.1 2018-11-05 [2] skimr * 1.0.5 2019-02-25 [1] testthat 2.0.1 2018-10-13 [1] tibble * 2.1.1 2019-03-16 [2] tidyselect 0.2.5 2018-10-11 [2] usethis 1.4.0 2018-08-14 [2] utf8 1.1.4 2018-05-24 [2] uuid 0.1-2 2015-07-28 [2] withr 2.1.2 2018-03-15 [2] source CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) Github (IRkernel/IRkernel@dbe3a10) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) Github (DataBiosphere/Ronaldo@426459f) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) CRAN (R 3.5.2) [1] /home/jupyter-user/.rpackages [2] /usr/local/lib/R/site-library [3] /usr/lib/R/site-library [4] /usr/lib/R/library