In this notebook we illustrate how to use Janggu datasets with pytorch in order to predict Oct4 and Mafk binding sites. We will only make use of a few pytorch concepts. For more information on pytorch, please consult the official documentation.
To run this tutorial you need to install pytorch and tqdm beforehand.
import os
from tqdm import tqdm
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from pkg_resources import resource_filename
from janggu.data import Bioseq
from janggu.data import Cover
from janggu.data import ReduceDim
from janggu.data import Transpose
from janggu.layers import DnaConv2D
from sklearn.metrics import roc_auc_score
from IPython.display import Image
np.random.seed(1234)
First, we need to specify the output directory in which the results are stored.
os.environ['JANGGU_OUTPUT'] = '/home/wkopp/janggu_examples'
As in the keras tutorial notebook, we start off by loading the Janggu datasets. First we specify the DNA sequence feature order. Order 1, 2 and 3 correspond to mono-, di- and tri-nucleotide based features, respectively (see Tutorial).
order = 3
binsize = 200
# load the dataset
# The pseudo genome is just a concatenation of all sequences
# in sample.fa and sample2.fa. Therefore, the results should be
# almost identical to the models obtained from classify_fasta.py.
REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa')
# ROI contains regions spanning positive and negative examples
ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')
# PEAK_FILE only contains positive examples
PEAK_FILE = resource_filename('janggu', 'resources/scores.bed')
Load the datasets for training and testing
# Training inputs and labels are defined purely via genomic coordinates
DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_TRAIN_FILE,
                                   binsize=binsize,
                                   order=order,
                                   cache=True)

# The ReduceDim wrapper transforms the dataset from a 4D object
# to a 2D table-like representation
LABELS = ReduceDim(Cover.create_from_bed('peaks', roi=ROI_TRAIN_FILE,
                                         bedfiles=PEAK_FILE,
                                         binsize=binsize,
                                         resolution=binsize,
                                         dtype='int',
                                         cache=True,
                                         storage='sparse'))

DNA_TEST = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                        roi=ROI_TEST_FILE,
                                        binsize=binsize,
                                        order=order)

LABELS_TEST = ReduceDim(Cover.create_from_bed('peaks',
                                              bedfiles=PEAK_FILE,
                                              roi=ROI_TEST_FILE,
                                              binsize=binsize,
                                              resolution=binsize,
                                              dtype='int',
                                              storage='sparse'))
print('training set:', DNA.shape, LABELS.shape)
print('test set:', DNA_TEST.shape, LABELS_TEST.shape)
training set: (7797, 198, 1, 64) (7797, 1) test set: (200, 198, 1, 64) (200, 1)
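The printed shapes can be derived from the parameters above; a quick sanity check:

```python
# the shapes printed above follow from the dataset parameters
binsize, order = 200, 3

seq_len = binsize - order + 1  # positions of a length-3 window in 200 bp
channels = 4 ** order          # number of distinct trinucleotides

print(seq_len, channels)  # → 198 64
```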
Next, we specify a neural network in pytorch that takes the DNA sequence as input and predicts the class labels that reflect Oct4 or Mafk binding.
class DNAConvNet(nn.Module):
    def __init__(self, order=1):
        super(DNAConvNet, self).__init__()
        self.conv1 = nn.Conv1d(pow(4, order), 30, 21)
        self.dense = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = F.relu(self.conv1(x))
        # emulates global max pooling
        x = F.max_pool1d(x, x.size()[-1]).flatten(1, -1)
        x = self.dense(x)
        x = self.sigmoid(x)
        return x
net = DNAConvNet(order)
print(net)
DNAConvNet( (conv1): Conv1d(64, 30, kernel_size=(21,), stride=(1,)) (dense): Linear(in_features=30, out_features=1, bias=True) (sigmoid): Sigmoid() )
Great! Now we have our model.
Before we can use the Janggu dataset objects with the pytorch model, we need to apply some transformations. First, we need to transform the 4D shape to a 3D object, since we use a Conv1d layer at the input of the model. To this end we use the ReduceDim wrapper and remove axis=2, which is merely a dummy dimension for the DNA sequence. Second, we need to change the order of the dimensions such that the channel occupies the first dimension (rather than the last, as required by tensorflow). This is achieved with the Transpose wrapper using the new axis ordering (0, 2, 1).
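The effect of these wrappers on the array shape can be sketched with plain numpy (random data standing in for the one-hot encoded sequences):

```python
import numpy as np

# hypothetical stand-in for the one-hot encoded DNA of 5 samples:
# (samples, seq_len, 1, channels), as produced by Bioseq with order=3
x = np.random.rand(5, 198, 1, 64)

# ReduceDim(..., axis=2) drops the dummy dimension
x3d = np.squeeze(x, axis=2)              # (5, 198, 64)

# Transpose(..., axis=(0, 2, 1)) moves the channel axis to the front,
# as expected by Conv1d
x_first = np.transpose(x3d, (0, 2, 1))   # (5, 64, 198)

print(x_first.shape)  # → (5, 64, 198)
```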
Finally, the numpy arrays produced by the Janggu datasets need to be converted to pytorch tensors. We introduce the ToTensor wrapper class for this purpose below:
class ToTensor(torch.utils.data.Dataset):
    def __init__(self, data, dtype='float32'):
        self.data = data
        self.dtype = dtype

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return "ToTensor({})".format(str(self.data))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # enforce type compatibility
        data = self.data[idx].astype(self.dtype)
        return torch.from_numpy(data)
After applying these transformations we can feed input data to the model to generate predictions:
DNA_ = ToTensor(Transpose(ReduceDim(DNA, axis=2), axis=(0,2,1)), dtype='float32')
DNA_
ToTensor(Transpose(ReduceDim(Bioseq("dna"))))
net(DNA_[:3])
tensor([[0.4196], [0.4227], [0.4189]], grad_fn=<SigmoidBackward>)
In the following, we set up the training infrastructure using the DataLoader class from pytorch. The DataLoader generates mini-batches which are then supplied to the network.
For more details on DataLoaders see the official pytorch documentation.
First, we introduce an additional Dataset class, termed InOutTuple, which returns tuples of input-output datapoint pairs. In this example, our InOutTuple already generates batches of datapoints rather than individual samples, and it takes care of shuffling. Both tasks could also be deferred to the DataLoader, but we do not do so for the purpose of this tutorial.
class InOutTuple(Dataset):
    def __init__(self, inputs, labels, batch_size=32, shuffle=True):
        self.inputs = inputs
        self.labels = labels
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.ridx = np.random.permutation(len(inputs))

    def __len__(self):
        return int(np.ceil(len(self.inputs) / self.batch_size))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # select the (shuffled) sample indices for this mini-batch
        ridx = self.ridx[self.batch_size*idx:(self.batch_size*(idx+1))].tolist()
        return self.inputs[ridx], self.labels[ridx]
With this, we create a dataset object representing input-output pairs and a DataLoader that generates mini-batches from IODATA.
IODATA = InOutTuple(DNA_, ToTensor(LABELS))
len(IODATA)
244
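The length corresponds to the number of mini-batches per epoch, as computed in `__len__`:

```python
import numpy as np

# 7797 training samples split into batches of 32; the last,
# smaller batch is included, hence the ceiling
n_batches = int(np.ceil(7797 / 32))
print(n_batches)  # → 244
```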
dataloader = DataLoader(IODATA, batch_size=1,
                        shuffle=False, num_workers=1)
We are now ready to implement the training loop where we make use of the binary cross-entropy loss and the Adadelta optimizer. The model is then fitted for 100 epochs.
criterion = nn.BCELoss()
optimizer = optim.Adadelta(net.parameters())

final_loss = 0.
for epoch in tqdm(range(100)):
    for i, d in enumerate(dataloader, 0):
        inputs, labels = d
        # the DataLoader introduces a dummy batch dimension
        # (batch_size=1) which we don't need;
        # with view we eliminate it.
        inputs = inputs.view(inputs.shape[1:])
        labels = labels.view(labels.shape[1:])

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        final_loss = loss.item()

print('#' * 40)
print('loss: {}'.format(final_loss))
print('#' * 40)
100%|██████████| 100/100 [06:14<00:00, 3.74s/it]
######################################## loss: 2.4993820261443034e-05 ########################################
Since we ran the model on the CPU, training is relatively time-consuming. However, pytorch can also be configured to utilize GPUs, if available.
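As a minimal sketch (not part of the original tutorial), the model and input tensors could be moved to a GPU via `torch.device`; the stand-in model below merely mirrors the convolution dimensions of DNAConvNet, and the code falls back to the CPU when no GPU is present:

```python
import torch
import torch.nn as nn

# select a GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# a stand-in model with the same conv dimensions as DNAConvNet(order=3);
# in the tutorial, `net` itself would be moved with net.to(device)
model = nn.Sequential(nn.Conv1d(64, 30, 21), nn.ReLU()).to(device)

# input batches must live on the same device as the model parameters
inputs = torch.randn(4, 64, 198).to(device)
outputs = model(inputs)

print(outputs.device, tuple(outputs.shape))  # output length: 198 - 21 + 1 = 178
```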
The predictions may be converted back to a Cover object and subsequently exported as a bigwig file in order to inspect the plausibility of the results in a genome browser.
DNA_TEST_ = ToTensor(Transpose(ReduceDim(DNA_TEST, axis=2), axis=(0,2,1)), dtype='float32')
DNA_TEST_
ToTensor(Transpose(ReduceDim(Bioseq("dna"))))
# convert the prediction to a cover object
pred = net(DNA_TEST_[:]).detach().numpy()
cov_pred = Cover.create_from_array('BindingProba', pred, LABELS_TEST.gindexer)
# predictions (or feature activities) can finally be exported to bigwig
cov_pred.export_to_bigwig(output_dir=os.environ['JANGGU_OUTPUT'])
print("AUC:", roc_auc_score(LABELS_TEST[:], pred))
AUC: 0.9993