In this notebook we illustrate how to use Janggu datasets with pytorch in order to predict Oct4 and Mafk binding sites. We will only make use of a few pytorch concepts. For more information on pytorch, please consult the official documentation.
To run this tutorial you need to install pytorch and tqdm beforehand.
import os
from tqdm import tqdm
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from pkg_resources import resource_filename
from janggu.data import Bioseq
from janggu.data import Cover
from janggu.data import ReduceDim
from janggu.data import Transpose
from janggu.layers import DnaConv2D
from sklearn.metrics import roc_auc_score
from IPython.display import Image
np.random.seed(1234)
First, we need to specify the output directory in which the results are stored.
os.environ['JANGGU_OUTPUT'] = '/home/wkopp/janggu_examples'
As in the keras tutorial notebook, we start off by loading the Janggu datasets. First we specify the DNA sequence feature order. Order 1, 2 and 3 correspond to mono-, di- and tri-nucleotide based features, respectively (see Tutorial).
order = 3
binsize = 200
# load the dataset
# The pseudo genome is just a concatenation of all sequences
# in sample.fa and sample2.fa. Therefore, the results should be
# almost identical to the models obtained from classify_fasta.py.
REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa')
# ROI contains regions spanning positive and negative examples
ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')
# PEAK_FILE only contains positive examples
PEAK_FILE = resource_filename('janggu', 'resources/scores.bed')
Load the datasets for training and testing
# Training inputs and labels are defined purely via genomic coordinates
DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_TRAIN_FILE,
                                   binsize=binsize,
                                   order=order,
                                   cache=True)

# The ReduceDim wrapper transforms the dataset from a 4D object
# to a 2D table-like representation
LABELS = ReduceDim(Cover.create_from_bed('peaks', roi=ROI_TRAIN_FILE,
                                         bedfiles=PEAK_FILE,
                                         binsize=binsize,
                                         resolution=binsize,
                                         dtype='int',
                                         cache=True,
                                         storage='sparse'))

DNA_TEST = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                        roi=ROI_TEST_FILE,
                                        binsize=binsize,
                                        order=order)

LABELS_TEST = ReduceDim(Cover.create_from_bed('peaks',
                                              bedfiles=PEAK_FILE,
                                              roi=ROI_TEST_FILE,
                                              binsize=binsize,
                                              resolution=binsize,
                                              dtype='int',
                                              storage='sparse'))
print('training set:', DNA.shape, LABELS.shape)
print('test set:', DNA_TEST.shape, LABELS_TEST.shape)
training set: (7797, 198, 1, 64) (7797, 1) test set: (200, 198, 1, 64) (200, 1)
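The printed shapes can be derived from the parameters above; a quick sanity check:

```python
# the shapes printed above follow from the dataset parameters
binsize, order = 200, 3

seq_len = binsize - order + 1  # positions of a length-3 window in 200 bp
channels = 4 ** order          # number of distinct trinucleotides

print(seq_len, channels)  # → 198 64
```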
Next, we specify a neural network in pytorch that takes the DNA sequence as input and predicts the class labels that reflect Oct4 or Mafk binding.
class DNAConvNet(nn.Module):
    def __init__(self, order=1):
        super(DNAConvNet, self).__init__()
        self.conv1 = nn.Conv1d(pow(4, order), 30, 21)
        self.dense = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = F.relu(self.conv1(x))
        # emulates global max pooling
        x = F.max_pool1d(x, x.size()[-1]).flatten(1, -1)
        x = self.dense(x)
        x = self.sigmoid(x)
        return x
net = DNAConvNet(order)
print(net)
DNAConvNet( (conv1): Conv1d(64, 30, kernel_size=(21,), stride=(1,)) (dense): Linear(in_features=30, out_features=1, bias=True) (sigmoid): Sigmoid() )
Great! Now we have our model.
Before we can use the Janggu dataset objects with the pytorch model, we need to apply some transformations. First, we need to transform the 4D shape to a 3D object, since we use a Conv1d layer at the input of the model. To this end we use the ReduceDim wrapper and remove axis=2, which is merely a dummy dimension for the DNA sequence. Second, we need to change the order of the dimensions such that the channel occupies the first dimension (rather than the last, as required by tensorflow). This is achieved with the Transpose wrapper using the new axis ordering (0, 2, 1).
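The effect of these wrappers on the array shape can be sketched with plain numpy (random data standing in for the one-hot encoded sequences):

```python
import numpy as np

# hypothetical stand-in for the one-hot encoded DNA of 5 samples:
# (samples, seq_len, 1, channels), as produced by Bioseq with order=3
x = np.random.rand(5, 198, 1, 64)

# ReduceDim(..., axis=2) drops the dummy dimension
x3d = np.squeeze(x, axis=2)              # (5, 198, 64)

# Transpose(..., axis=(0, 2, 1)) moves the channel axis to the front,
# as expected by Conv1d
x_first = np.transpose(x3d, (0, 2, 1))   # (5, 64, 198)

print(x_first.shape)  # → (5, 64, 198)
```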
Finally, the numpy arrays produced by the Janggu datasets need to be converted to pytorch tensors. We introduce the ToTensor wrapper class for this purpose below:
class ToTensor(torch.utils.data.Dataset):
    def __init__(self, data, dtype='float32'):
        self.data = data
        self.dtype = dtype

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return "ToTensor({})".format(str(self.data))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # enforce type compatibility
        data = self.data[idx].astype(self.dtype)
        return torch.from_numpy(data)
After applying these transformations we can feed input data to the model to generate predictions:
DNA_ = ToTensor(Transpose(ReduceDim(DNA, axis=2), axis=(0,2,1)), dtype='float32')
DNA_
ToTensor(Transpose(ReduceDim(Bioseq("dna"))))
net(DNA_[:3])
tensor([[0.4196], [0.4227], [0.4189]], grad_fn=<SigmoidBackward>)
In the following, we set up the training infrastructure using the DataLoader class from pytorch. The DataLoader generates mini-batches which are then supplied to the network.
For more details on DataLoaders see the official pytorch documentation.
First, we introduce an additional Dataset class, termed InOutTuple, which returns tuples of input-output datapoint pairs. In this example, our InOutTuple already generates batches of datapoints rather than individual samples, and it takes care of shuffling. Both tasks could also be deferred to the DataLoader, but we do not do so for the purpose of this tutorial.
class InOutTuple(Dataset):
    def __init__(self, inputs, labels, batch_size=32, shuffle=True):
        self.inputs = inputs
        self.labels = labels
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.ridx = np.random.permutation(len(inputs))

    def __len__(self):
        return int(np.ceil(len(self.inputs) / self.batch_size))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # select the (shuffled) sample indices for this mini-batch
        ridx = self.ridx[self.batch_size*idx:(self.batch_size*(idx+1))].tolist()
        return self.inputs[ridx], self.labels[ridx]
With this, we create a dataset object representing input-output pairs and a DataLoader that generates mini-batches from IODATA.
IODATA = InOutTuple(DNA_, ToTensor(LABELS))
len(IODATA)
244
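The length corresponds to the number of mini-batches per epoch, as computed in `__len__`:

```python
import numpy as np

# 7797 training samples split into batches of 32; the last,
# smaller batch is included, hence the ceiling
n_batches = int(np.ceil(7797 / 32))
print(n_batches)  # → 244
```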
dataloader = DataLoader(IODATA, batch_size=1,
                        shuffle=False, num_workers=1)
We are now ready to implement the training loop where we make use of the binary cross-entropy loss and the Adadelta optimizer. The model is then fitted for 100 epochs.
criterion = nn.BCELoss()
optimizer = optim.Adadelta(net.parameters())

final_loss = 0.
for epoch in tqdm(range(100)):
    for i, d in enumerate(dataloader, 0):
        inputs, labels = d
        # the DataLoader introduces a dummy batch dimension
        # (batch_size=1) which we don't need;
        # with view we eliminate it.
        inputs = inputs.view(inputs.shape[1:])
        labels = labels.view(labels.shape[1:])

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        final_loss = loss.item()

print('#' * 40)
print('loss: {}'.format(final_loss))
print('#' * 40)
100%|██████████| 100/100 [06:14<00:00, 3.74s/it]
######################################## loss: 2.4993820261443034e-05 ########################################
Since we ran the model on the CPU, training is relatively time-consuming. However, pytorch can also be configured to utilize GPUs, if available.
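As a minimal sketch (not part of the original tutorial), the model and input tensors could be moved to a GPU via `torch.device`; the stand-in model below merely mirrors the convolution dimensions of DNAConvNet, and the code falls back to the CPU when no GPU is present:

```python
import torch
import torch.nn as nn

# select a GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# a stand-in model with the same conv dimensions as DNAConvNet(order=3);
# in the tutorial, `net` itself would be moved with net.to(device)
model = nn.Sequential(nn.Conv1d(64, 30, 21), nn.ReLU()).to(device)

# input batches must live on the same device as the model parameters
inputs = torch.randn(4, 64, 198).to(device)
outputs = model(inputs)

print(outputs.device, tuple(outputs.shape))  # output length: 198 - 21 + 1 = 178
```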
The predictions may be converted back to a Cover object and subsequently exported as a bigwig file in order to inspect the plausibility of the results in a genome browser.
DNA_TEST_ = ToTensor(Transpose(ReduceDim(DNA_TEST, axis=2), axis=(0,2,1)), dtype='float32')
DNA_TEST_
ToTensor(Transpose(ReduceDim(Bioseq("dna"))))
# convert the prediction to a cover object
pred = net(DNA_TEST_[:]).detach().numpy()
cov_pred = Cover.create_from_array('BindingProba', pred, LABELS_TEST.gindexer)
# predictions (or feature activities) can finally be exported to bigwig
cov_pred.export_to_bigwig(output_dir=os.environ['JANGGU_OUTPUT'])
print("AUC:", roc_auc_score(LABELS_TEST[:], pred))
AUC: 0.9993