TDNN Approximation

This is my best attempt at approximating a TDNN for the purpose of working out the backbone angles of CDR-H3 loops. It performs better than random, at least, but overall not especially well.

This is the result of several versions (well, two previous ones) and a lot of fiddling to get things to a reasonable state. I can't say for sure yet whether the 1D convolution is doing the TDNN part of things well, but my guess is that it is close enough for this iteration.

Data wrangling

The data is initially pulled from the abdab dataset provided by Dr Andrew Martin. This data is parsed as follows:

  • Take every model and look for the residues marked 95 to 105 inclusive on the heavy chain
  • Work out the torsion angles in radians. Save these in the biological convention, i.e. clockwise and anticlockwise, -180 to +180
  • Convert this postgresql database into a set of numpy arrays
  • Change the phi and psi angles to be cos(phi), sin(phi), cos(psi), sin(psi) to smooth over discontinuities (see the sketch after this list)
  • Split the data randomly into 3 sets: 80% train, 10% test, 10% validation
  • Save this data as a pickle file
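A minimal sketch of the cos/sin step, assuming a hypothetical encode_angles helper (the real conversion happens in the data-wrangling code, not in this notebook):

import math

def encode_angles(phi, psi):
  ''' Illustrative only: turn a (phi, psi) pair in radians into the
  four stored values, smoothing over the -180/+180 discontinuity. '''
  return [math.cos(phi), math.sin(phi), math.cos(psi), math.sin(psi)]

# e.g. a residue with phi = -60 degrees and psi = 140 degrees
print(encode_angles(math.radians(-60), math.radians(140)))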

The acids are represented as a bitfield, 21 units long, e.g

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

... would represent a Valine. This is a very sparse encoding and I'm sure we can improve on it. In natural language processing, things like word vectors and continuous bag-of-words are used; we could investigate those.
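As a minimal sketch (the ACIDS ordering and the acid_to_bitfield helper here are hypothetical; the actual mapping lives in the data-wrangling code), one such bitfield could be built like this:

import numpy as np

# Illustrative ordering only - the real index of each amino acid is
# fixed by the data-wrangling code, not by this list.
ACIDS = list("ACDEFGHIKLMNPQRSTVWY*")

def acid_to_bitfield(letter):
  ''' Illustrative only: build the 21-long one-hot bitfield for a residue. '''
  bits = np.zeros(len(ACIDS), dtype=np.bool_)
  bits[ACIDS.index(letter)] = True
  return bits

print(acid_to_bitfield('V').astype(int))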

This online example uses the pickle file as I haven't uploaded the database to this server yet. I figure it's not essential.

Neural Net Design

TensorFlow doesn't really do TDNNs but, from what I can tell, a convolution operation is the one we need. It means that the weights are shared as the kernel is convolved across the input. We use a 1 × window_size 1D kernel that slides over each CDR.

CDRs are provided in batches, though I've found that just providing one CDR at a time works best. Our input is 1 × max_cdr_length × 21, 21 being the number of amino acids. The maximum CDR length is set at 31, as this was the largest I found in my database.

The problem, of course, is that neural networks don't do so well if the input size changes. In addition, TensorFlow doesn't seem to have a custom dropout function; it's just random! So what I do is create a mask for the two fully connected layers that contains a 1 if that neuron is activated, and a 0 if it is to be ignored. If a CDR is fewer than 31 residues in length, we pad the end of it with -3.0, -3.0, -3.0, -3.0 for the angles (-3.0 can never occur) and a bitfield of all zeroes for the input.
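A minimal sketch of that padding step, using a hypothetical pad_cdr helper (not part of the real code; it just illustrates the -3.0 sentinel and the all-zero bitfields described above):

import numpy as np

def pad_cdr(bitfields, angles, max_cdr_length=31, num_acids=21):
  ''' Illustrative only: pad a short CDR up to max_cdr_length with
  all-zero bitfields, and pad the angle targets with the -3.0
  sentinel that the cos/sin encoding can never produce. '''
  pad = max_cdr_length - len(bitfields)
  padded_bits = np.concatenate(
      [bitfields, np.zeros((pad, num_acids), dtype=np.bool_)])
  padded_angles = np.concatenate(
      [angles, np.full((pad, 4), -3.0, dtype=np.float32)])
  return padded_bits, padded_angles.reshape(-1)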

I have two fully connected layers of neurons. There are 124 units in each, I believe, with many, many weights coming in from the conv layer. The hidden layer could potentially be expanded to see if we can improve performance.

The cost function takes the mask into account as well. We are attempting to minimise the squared difference between the expected angles and the produced angles. We use the mask to decide which angles and inputs to include, add all the squared differences together and divide by the number of values included.
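A minimal numpy sketch of that masked mean, just to make the arithmetic concrete (the TensorFlow version is the cost() function further down):

import numpy as np

def masked_mean_squared_error(expected, produced):
  ''' Illustrative only: keep the squared differences where the
  expected value is not the -3.0 padding sentinel, then divide by
  the number of kept entries rather than by the full 124. '''
  mask = (expected != -3.0).astype(np.float32)
  return np.sum(mask * (expected - produced) ** 2) / np.sum(mask)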

Our accuracy reading is determined using the validation set, which is never trained on.

In [1]:
"""
nn02.py - Dealing with variable length input
author : Benjamin Blundell
email : [email protected]

Based on https://www.tensorflow.org/get_started/mnist/pros
and https://danijar.com/variable-sequence-lengths-in-tensorflow/

This version performs the best so far and is probably closest
to the TDNN we want to check

"""

import sys, os, math, random
import tensorflow as tf
import numpy as np

from common.util import *
from common import gen_data

Setting the flags for our network

Here is where we import the FLAGS variable (set in common) that holds the various settings for our neural net run.

In [2]:
FLAGS = NNGlobals()
# A higher learning rate seems good as we have few examples in this data set.
# Would that be correct do we think?
FLAGS.learning_rate = 0.35
FLAGS.window_size = 4
FLAGS.pickle_filename = 'pdb_martin_02.pickle'

Support Functions for various parts

Various functions and subroutines, such as:

  • Weight variable creation
  • Bias variable creation
  • 1D convolution
  • Creating a mask for our output layers
  • The custom cost function
In [3]:
def weight_variable(shape, name ):
  ''' TODO - we should check the weights here for our TDNN
  For now I use truncated normals with stddev of 0.1. Hopefully
  some of these go negative.'''
  initial = tf.truncated_normal(shape, stddev=0.1, name=name)
  return tf.Variable(initial)

def bias_variable(shape, name):
  initial = tf.constant(1.0, shape=shape, name=name)
  return tf.Variable(initial)

def conv1d(x, W):
  ''' Our convolution is what we use to replicate a TDNN though
  I suspect we need to do a lot more.'''
  return tf.nn.conv1d(x, W, stride=1, padding='SAME')

def create_mask(batch):
  ''' Create a mask for our fully connected layers: a vector of
  length max_cdr * 4, with 1s for real residues and 0s for padding.'''
  mask = []
  for model in batch:
    mm = []
    for cdr in model:
      tt = 1
      if not 1 in cdr:
        tt = 0
      for i in range(0,4):
        mm.append(tt)
    mask.append(mm)
  return np.array(mask,dtype=np.float32)
    
def cost(goutput, gtest):
  ''' Our error function which we will try to minimise'''
  # We find the absolute difference between the output angles and the training angles
  # Can't use cross entropy because that's all to do with probabilities and the like
  # Basic error of sum squares diverges to NaN due to gradient so I go with reduce mean
  # Values of -3.0 are the ones we ignore
  # This could go wrong as adding 3.0 to -3.0 is not numerically stable
  mask = tf.sign(tf.add(gtest,3.0))
  basic_error = tf.square(gtest-goutput) * mask
  
  # reduce_mean doesn't work here as we just want the numbers where mask is 1
  # We work out the mean ourselves
  basic_error = tf.reduce_sum(basic_error)
  basic_error /= tf.reduce_sum(mask)
  return basic_error

Graph creation

The actual graph that we shall run on our machine and evaluate.

In [4]:
def create_graph() :
  ''' My attempt at creating a TDNN with the conv1d operation. We have one conv layer, and two fully
  connected layers (which are quite large). We take a batch x max_cdr x amino_acid layer and output
  a max_cdr * 4 layer for angle components. We use tanh activation functions throughout.'''
  # Borrowed parts from https://stackoverflow.com/questions/41583540/custom-dropout-in-tensorflow#41584818
  graph = tf.Graph()

  with tf.device('/gpu:0'):
    with graph.as_default():
      # Input data - we use a 2D array with each 'channel' being an amino acid bitfield
      # In this case, we only use one example at a time as each is a different length
      tf_train_dataset = tf.placeholder(tf.bool, 
          [None, FLAGS.max_cdr_length, FLAGS.num_acids],name="train_input") 
      output_size = FLAGS.max_cdr_length * 4
      dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
      x = tf.cast(tf_train_dataset, dtype=tf.float32)
      
      # According to the Tensorflow tutorial, the last two vars are input channels
      # and output channels (both 21)
      W_conv0 = weight_variable([FLAGS.window_size, 
        FLAGS.num_acids, FLAGS.num_acids] , "weight_conv_0")
      b_conv0 = bias_variable([FLAGS.num_acids], "bias_conv_0")
      
      # Using tanh as an activation function as it is bounded over -1 to 1
      # Don't have to use it here but we get better accuracy
      h_conv0 = tf.tanh(conv1d(x, W_conv0) + b_conv0)
  
      # The second layer is a fully connected neural net layer.
      dim_size = FLAGS.num_acids * FLAGS.max_cdr_length
      W_f = weight_variable([dim_size, output_size], "weight_hidden")
      b_f = bias_variable([output_size], "bias_hidden")
      

      # Apparently, the convolutional layer needs to be reshaped
      # This bit might be key as our depth, our 21 amino acid neurons are being connected here
      h_conv0_flat = tf.reshape(h_conv0, [-1, dim_size])
      h_f = tf.tanh( (tf.matmul(h_conv0_flat, W_f) + b_f)) * dmask
      
      # It looks like I can't take a size < max_cdr and use it, because we have 
      # fixed sized stuff so we need to dropout the weights we don't need per sample  
      # Find the actual sequence length and only include up to that length
      # We always use dropout even after training
      # Annoyingly, TensorFlow's dropout doesn't work for us here, so I need to
      # add another variable to our h_f layer, deactivating the neurons we don't need
      test = tf.placeholder(tf.float32, [None, output_size], name="train_test")

      # Output layer - we don't strictly need this because the previous layer already has
      # the right number of variables for us, but we do get some accuracy increases with
      # another layer, so I'll add one anyway so we have three.
      W_o = weight_variable([output_size, output_size], "weight_output")
      b_o = bias_variable([output_size],"bias_output")

      # I use tanh to bound the results between -1 and 1
      y_conv = tf.tanh( ( tf.matmul(h_f, W_o) + b_o) * dmask, name="output")
      variable_summaries(y_conv, "y_conv")

  return graph

Session Running

Here is where we run the session and spit out the values we want.

In [5]:
def run_session(graph, datasets):
  ''' Run the session once we have a graph, training methodology and a dataset '''
  with tf.device('/cpu:0'):
    with tf.Session(graph=graph) as sess:
      training_input, training_output, validate_input, validate_output, test_input, test_output = datasets
      # Pull out the bits of the graph we need
      ginput = graph.get_tensor_by_name("train_input:0")
      gtest = graph.get_tensor_by_name("train_test:0")
      goutput = graph.get_tensor_by_name("output:0")
      gmask = graph.get_tensor_by_name("dmask:0")
      stepnum = 0
      # Working out the accuracy
      basic_error = cost(goutput, gtest) 
      # Setup all the logging for tensorboard 
      variable_summaries(basic_error, "Error")
      merged = tf.summary.merge_all() 
      train_writer = tf.summary.FileWriter('./summaries/train',graph)
      # So far, I have found Gradient Descent still wins out at the moment
      # https://stackoverflow.com/questions/36162180/gradient-descent-vs-adagrad-vs-momentum-in-tensorflow
      train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(basic_error)
      #train_step = tf.train.AdagradOptimizer(FLAGS.learning_rate).minimize(basic_error) 
      #train_step = tf.train.AdamOptimizer(1e-4).minimize(basic_error)
      #train_step = tf.train.MomentumOptimizer(FLAGS.learning_rate, 0.1).minimize(basic_error)
      tf.global_variables_initializer().run()
      print('Initialized')	

      while stepnum < len(training_input):
        item_is, item_os = next_item(training_input, training_output, FLAGS)
        mask = create_mask(item_is)
        summary, _ = sess.run([merged, train_step],
            feed_dict={ginput: item_is, gtest: item_os, gmask: mask})
        
        # Find the accuracy at every step, but only print every 100
        mask = create_mask(validate_input)
        train_accuracy = basic_error.eval(
            feed_dict={ginput: validate_input, gtest: validate_output,  gmask : mask}) 
        
        if stepnum % 100 == 0:
          print('step %d, training accuracy %g' % (stepnum, train_accuracy))
        
        #dm = gmask.eval(feed_dict={ginput: item_is, gtest: item_os, gmask: mask}) 
        #print(dm)
        train_writer.add_summary(summary, stepnum)
        stepnum += 1

      # save our trained net
      saver = tf.train.Saver()
      saver.save(sess, 'saved/nn02')

Run the trained net

Here we run the network on a randomly chosen example from the validation set.

In [6]:
def run_saved(datasets):
  ''' Load the saved version and then test it against the validation set '''
  with tf.Session() as sess:
    graph = sess.graph
    saver = tf.train.import_meta_graph('saved/nn02.meta')
    saver.restore(sess, 'saved/nn02')
    training_input, training_output, validate_input, validate_output, test_input, test_output = datasets
    goutput = graph.get_tensor_by_name("output:0")
    ginput = graph.get_tensor_by_name("train_input:0")
    gmask = graph.get_tensor_by_name("dmask:0")
    mask = create_mask(validate_input)
    res = sess.run([goutput], feed_dict={ginput: validate_input, gmask: mask })

    # Now let's output a random example and see how close it is, as well as working out
    # the difference in mean values. Don't adjust the weights though.
    r = random.randint(0, len(validate_input)-1)

    print("Actual              Predicted")
    for i in range(0,len(validate_input[r])):
      sys.stdout.write(bitmask_to_acid(FLAGS, validate_input[r][i]))
      phi = math.degrees(math.atan2(validate_output[r][i*4], validate_output[r][i*4+1]))
      psi = math.degrees(math.atan2(validate_output[r][i*4+2], validate_output[r][i*4+3]))
      sys.stdout.write(": " + "{0:<8}".format("{0:.3f}".format(phi)) + " ")
      sys.stdout.write("{0:<8}".format("{0:.3f}".format(psi)) + " ")
      phi = math.degrees(math.atan2(res[0][r][i*4], res[0][r][i*4+1]))
      psi = math.degrees(math.atan2(res[0][r][i*4+2], res[0][r][i*4+3]))
      sys.stdout.write(" | " + "{0:<8}".format("{0:.3f}".format(phi)) + " ")
      sys.stdout.write("{0:<8}".format("{0:.3f}".format(psi)))  
      print("")

Run the whole thing

Fairly self-explanatory, this last block. It takes roughly one minute on this server's CPU to run through the training, so do give it a bit of time. It's much faster on a GPU of course, but I don't have all the money in the world to put one in my 1U rack at Telehouse! :D

In [7]:
datasets = init_data_sets(FLAGS)
graph = create_graph()
run_session(graph, datasets)
run_saved(datasets)
Loading from pickle file
(2298, 31, 21) <class 'numpy.ndarray'> bool
(2298, 124) <class 'numpy.ndarray'> float32
Initialized
step 0, training accuracy 0.905324
step 100, training accuracy 0.509628
step 200, training accuracy 0.466601
step 300, training accuracy 0.505744
step 400, training accuracy 0.480032
step 500, training accuracy 0.527179
step 600, training accuracy 0.553449
step 700, training accuracy 0.519494
step 800, training accuracy 0.487892
step 900, training accuracy 0.55577
step 1000, training accuracy 0.593868
step 1100, training accuracy 0.505347
step 1200, training accuracy 0.482474
step 1300, training accuracy 0.51733
step 1400, training accuracy 0.49749
step 1500, training accuracy 0.543896
step 1600, training accuracy 0.466765
step 1700, training accuracy 0.509391
step 1800, training accuracy 0.48828
step 1900, training accuracy 0.504779
step 2000, training accuracy 0.525433
step 2100, training accuracy 0.493252
step 2200, training accuracy 0.502474
INFO:tensorflow:Restoring parameters from saved/nn02
Actual              Predicted
ARG: -71.565  169.057   | 40.434   164.982 
GLY: -95.550  -92.982   | -85.825  135.035 
PHE: -58.268  167.635   | -79.689  -42.095 
HIS: -71.815  -35.774   | -92.435  6.817   
GLY: -106.098 40.262    | 165.926  -94.493 
SER: -155.027 114.982   | -97.752  107.138 
TYR: -68.589  5.219     | -95.587  49.283  
SER: -66.003  140.194   | -110.704 176.270 
PHE: -86.341  98.575    | -99.566  127.951 
ALA: -79.064  -29.384   | -99.415  -37.561 
TYR: -117.519 129.273   | -121.544 136.030 
TRP: -129.219 150.560   | -125.788 148.276 
GLY: -84.466  170.754   | -76.660  161.171 
GLN: -71.271  180.000   | -73.153  178.351 
***: -135.000 -135.000  | -180.000 -180.000
***: -135.000 -135.000  | -0.000   -180.000
***: -135.000 -135.000  | -0.000   180.000 
***: -135.000 -135.000  | -0.000   180.000 
***: -135.000 -135.000  | -0.000   180.000 
***: -135.000 -135.000  | -0.000   -180.000
***: -135.000 -135.000  | -0.000   0.000   
***: -135.000 -135.000  | -0.000   180.000 
***: -135.000 -135.000  | 0.000    180.000 
***: -135.000 -135.000  | 0.000    0.000   
***: -135.000 -135.000  | 0.000    180.000 
***: -135.000 -135.000  | 0.000    0.000   
***: -135.000 -135.000  | -0.000   0.000   
***: -135.000 -135.000  | -0.000   0.000   
***: -135.000 -135.000  | 0.000    0.000   
***: -135.000 -135.000  | 0.000    0.000   
***: -135.000 -135.000  | 0.000    0.000