Example using Google Nucleus Library to Generate Genomics Tensor

The cell below shows how to install Google Nucleus (https://github.com/google/nucleus) and set up a Jupyter kernel with the Google Nucleus library installed. This notebook should then be executed with the new Jupyter kernel google-nucleus.

%%capture
%%bash
. /opt/conda/bin/activate
conda create -n google-nucleus python=2.7 -y
conda activate google-nucleus
conda install -y ipykernel
pip install google-nucleus
conda install -y matplotlib
python -m ipykernel install --name google-nucleus --user

Prepare the data for generating the tensors. This step is specific to a DNAnexus project environment.

In [1]:
%%bash 
dx download -f HG002.15kb.Q20.hs37d5.pbmm2.MAPQ60.bam
dx download -f HG002.15kb.Q20.hs37d5.pbmm2.MAPQ60.bam.bai
dx download -f "/HG002_GRCh37_GIAB*"
dx download -f hs37d5.fa.gz
gunzip -f hs37d5.fa.gz
dx download -f happy_benchmarking_results.tgz
ln -sf hs37d5.fa ref.fa
tar zxvf happy_benchmarking_results.tgz > /dev/null

Quick code to split the VCF file into 36 chunks, prepending the header to each chunk

In [2]:
%%capture
%%bash
cd hap_py_results
zcat cv_happy_out.vcf.gz | grep "^#" > header
zcat cv_happy_out.vcf.gz | grep -v "^#" > body
split -n l/36 body -d
for f in x*; do cat header "$f" > "$f.vcf"; done

Boilerplate code to load the reader modules from the Google-Nucleus library

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys, os
from nucleus.io import fasta
from nucleus.io import sam
from nucleus.util import ranges
from nucleus.io import vcf
from nucleus.io.python import vcf_writer
from nucleus.protos import variants_pb2
from nucleus.protos import cigar_pb2

import numpy as np

CIGAR_OPS_TO_CHAR = {
    cigar_pb2.CigarUnit.ALIGNMENT_MATCH: 'M',
    cigar_pb2.CigarUnit.INSERT: 'I',
    cigar_pb2.CigarUnit.DELETE: 'D',
    cigar_pb2.CigarUnit.SKIP: 'N',
    cigar_pb2.CigarUnit.CLIP_SOFT: 'S',
    cigar_pb2.CigarUnit.CLIP_HARD: 'H',
    cigar_pb2.CigarUnit.PAD: 'P',
    cigar_pb2.CigarUnit.SEQUENCE_MATCH: '=',
    cigar_pb2.CigarUnit.SEQUENCE_MISMATCH: 'X',
}

BASE_IDX = { "A":0, "C":1, "G":2, "T":3}
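The tensors built below have 8 rows: rows 0-3 count matched/mismatched bases (A, C, G, T) per reference column, and rows 4-7 count inserted bases. A minimal toy sketch of that layout (illustrative values only, not part of the pipeline):

```python
import numpy as np

BASE_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

flanking_size = 2
out_tensor = np.zeros([8, flanking_size * 2 + 1])  # 8 channels x 5 columns

# Count a matched "A" at the center column (channel offset 0 = match/mismatch half)
out_tensor[BASE_IDX["A"] + 0][flanking_size] += 1
# Count an inserted "G" at the same column (channel offset 4 = insertion half)
out_tensor[BASE_IDX["G"] + 4][flanking_size] += 1

print(out_tensor[0][flanking_size], out_tensor[6][flanking_size])  # 1.0 1.0
```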

We use the VcfReader, FastaReader, and SamReader classes from the Google Nucleus library to get the variants from the VCF file and to construct alignment tensors from the BAM file and the reference. We also construct a RangeSet from Nucleus' ranges module for the high-confidence regions defined by GIAB.

In [4]:
happy_vcf_reader = vcf.VcfReader(
        "/home/dnanexus/notebooks/hap_py_results/x00.vcf")
ref_reader = fasta.FastaReader("ref.fa")
confident_regions = ranges.RangeSet.from_bed("HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed")
sam_reader = sam.SamReader("HG002.15kb.Q20.hs37d5.pbmm2.MAPQ60.bam")
header = happy_vcf_reader.header

The generate_tensor() function creates an iterator that loops through all variant sites, using the helper function fill_tensor() to accumulate base counts into the output tensor.

In [5]:
def fill_tensor(v_pos, out_tensor, seq, ref_start, 
                channel_offsets, flanking_size=32):
    # Skip sequences that end before the window around the variant starts
    if ref_start + len(seq) < v_pos - flanking_size:
        return
    pos = ref_start
    for c in seq:
        col_pos = pos - (v_pos - flanking_size)  # column within the window
        if 0 <= col_pos < flanking_size * 2 + 1:
            for channel_offset in channel_offsets:
                out_tensor[BASE_IDX[c] + channel_offset][col_pos] += 1
        pos += 1

        
def generate_tensor(vcf_reader, sam_reader, flanking_size = 32, max_output=10):
    c = 0
    for v in vcf_reader:
        
        if len(v.calls) < 2:  # need both the truth (first) and query (second) call entries
            continue

        ## Get a set of information from the variant `v`
        REF_name = v.reference_name
        VAR_position = v.start
        REF_call = v.calls[0].info["BD"].values[0].string_value
        QUERY_call = v.calls[1].info["BD"].values[0].string_value
        BLT = v.calls[1].info["BLT"].values[0].string_value
        BVT = v.calls[1].info["BVT"].values[0].string_value
        
        if REF_call == "UNK":  # ignore variants outside the high-confidence regions
            continue
        
        if REF_call != "TP" or \
           QUERY_call != "TP":  # capture the variant calls that disagree with the GIAB set
            
            c += 1
            if max_output is not None and c > max_output: 
                break
                
            v_pos = v.start
            region = ranges.make_range(v.reference_name,
                                       v_pos - flanking_size,
                                       v_pos + flanking_size)

            reads = list(sam_reader.query(region))

            out_tensor = np.zeros([8, flanking_size * 2 + 1])

            for r in reads:  # loop through the reads that overlaps with the variant
                read_pos = 0
                ref_pos = r.alignment.position.position
                read_seq = r.aligned_sequence
                
                for (op, length) in \
                    ((CIGAR_OPS_TO_CHAR[c.operation],
                      c.operation_length) for c in r.alignment.cigar):
                    if op in ("X", "=", "M"):  # match/mismatch: fill the top half of the tensor (channels 0-3)
                        offsets = [0]
                        fill_tensor(v_pos,
                                    out_tensor,
                                    read_seq[read_pos:read_pos + length],
                                    ref_pos,
                                    offsets,
                                    flanking_size)
                        read_pos += length
                        ref_pos += length
                    elif op in ("I", "S"):  # insertion/soft clip: fill the bottom half of the tensor (channels 4-7)
                        offsets = [4]
                        if op == "I":
                            fill_tensor(v_pos,
                                        out_tensor,
                                        read_seq[read_pos:read_pos + length],
                                        ref_pos,
                                        offsets,
                                        flanking_size)
                        read_pos += length
                    elif op in ("D", "N"):
                        ref_pos += length
            yield out_tensor, REF_name, VAR_position, REF_call, QUERY_call, BLT, BVT
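To make fill_tensor()'s indexing concrete, here is a self-contained toy run of the same logic (the function body is copied from above; the window and positions are illustrative):

```python
import numpy as np

BASE_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def fill_tensor(v_pos, out_tensor, seq, ref_start,
                channel_offsets, flanking_size=32):
    # Skip sequences that end before the window around the variant starts
    if ref_start + len(seq) < v_pos - flanking_size:
        return
    pos = ref_start
    for c in seq:
        col_pos = pos - (v_pos - flanking_size)  # column within the window
        if 0 <= col_pos < flanking_size * 2 + 1:
            for channel_offset in channel_offsets:
                out_tensor[BASE_IDX[c] + channel_offset][col_pos] += 1
        pos += 1

# A 5-column window around variant position 100 (flanking_size=2 -> window starts at 98)
t = np.zeros([8, 5])
fill_tensor(100, t, "ACGT", ref_start=99, channel_offsets=[0], flanking_size=2)
# "A" at ref 99 lands in column 1, "C" at 100 in column 2 (the variant column),
# "G" at 101 in column 3, "T" at 102 in column 4
print(t[:4])
```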

This shows a couple of examples of the generated tensors as heat maps.

In [6]:
import matplotlib.pyplot as plt
In [7]:
for t in generate_tensor(happy_vcf_reader, sam_reader, flanking_size=32, max_output=10):
    plt.matshow(t[0])  # matshow creates its own figure, so no plt.figure() is needed

Let's install scikit-learn for t-SNE embedding and umap-learn for UMAP embedding.

In [8]:
%%capture
%%bash
. /opt/conda/bin/activate
conda activate google-nucleus
conda install -y scikit-learn
conda install -c conda-forge umap-learn
In [9]:
import umap
from sklearn.manifold import TSNE

Let's generate the alignment and select for those sites that have indel errors.

In [10]:
tensors = []
annotation = []
for t in generate_tensor(happy_vcf_reader, sam_reader, flanking_size = 32, max_output=10000):
    tensors.append(t[0])
    annotation.append(t[1:])
In [11]:
idx = np.array(annotation)[:,-1] == "INDEL"
print(idx.shape)
(5161,)
In [12]:
err_t = np.array([ c.flatten() for c in tensors ])
err_t = err_t[idx]
print(err_t.shape)
(3252, 520)
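The 520 features per row come from flattening each 8 x 65 tensor (flanking_size=32 gives 2 * 32 + 1 = 65 columns), and the boolean mask keeps the 3252 INDEL rows. A small sketch of both steps on toy data (toy values are illustrative, not from the pipeline):

```python
import numpy as np

flanking_size = 32
n_cols = flanking_size * 2 + 1  # 65 columns per tensor
n_features = 8 * n_cols         # 8 channels flattened -> 520 features
print(n_features)  # 520

# Boolean-mask row selection, like `err_t[idx]` above
toy = np.arange(6).reshape(3, 2)       # 3 rows of 2 features
mask = np.array([True, False, True])   # keep rows 0 and 2
print(toy[mask].shape)  # (2, 2)
```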

Call UMAP to generate the UMAP embedding.

In [13]:
embedding = umap.UMAP(n_neighbors=25,
                      min_dist=0.25,
                      metric='correlation',
                      random_state=42).fit_transform(err_t)
In [14]:
plt.plot(embedding[:,0], embedding[:,1], '.', alpha=0.1)
Out[14]:
[<matplotlib.lines.Line2D at 0x7f154fba1590>]
In [15]:
plt.plot(embedding[:,0], embedding[:,1], '.', alpha=0.1)
plt.ylim(-5,7)
Out[15]:
(-5, 7)

Similarly, we can try a t-SNE embedding to represent the distribution of the alignment tensors.

In [17]:
embedded_tsne = TSNE(n_components=2, random_state=42).fit_transform(err_t)
In [19]:
plt.plot(embedded_tsne[:,0], embedded_tsne[:,1], '.', alpha=0.1)
Out[19]:
[<matplotlib.lines.Line2D at 0x7f154affd090>]

--Jason Chin, Jan. 2019