Functional prediction of hypothetical TFs in bacteria using supervised machine learning: Pseudomonas aeruginosa

Created by Emanuel Flores-Bautista, 2018. All code in this notebook is licensed under the Creative Commons 4.0 license.

This is the workflow we will use:

  1. Load the TF-TF network from RegulonDB as a pandas DataFrame and build a NetworkX graph from it.
  2. Apply the Louvain clustering algorithm to find the network modules. Caveat: the cluster labels change on each run, although the clusters keep the same content, so all predictions have to be made within a single run (see the seeding sketch after this list). Add a node attribute holding each cluster label, and validate the clusters' functional coherence (e.g. SoxR and SoxS, or GadE/GadW/GadX, fall into the same modules, respectively).
  3. Data preprocessing: check for missing data, normalize, and reduce dimensionality using PCA.
  4. Multi-label classification setup: assign the cluster labels to the expression DataFrame.
  5. Train a multi-label neural network on the expression data of known TFs, using the cluster labels as targets (trying an SVM first). Based on this reference: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f
  6. Validate the performance. We can try four different algorithms and then predict with the best one.
  7. Predict the labels for the TFs with unknown physiological function.
  8. Analyze the results.
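A side note on the caveat in step 2: recent versions of the python-louvain package (the `community` module used below, >= 0.13) accept a random_state argument, which makes the partition repeatable across runs. A minimal sketch, assuming that package version:

import networkx as nx
import community  # python-louvain

G = nx.karate_club_graph()  # stand-in graph, for illustration only

# Fixing random_state pins the label assignment, so predictions no longer
# have to be made within a single session.
partition = community.best_partition(G, random_state=42)
print(max(partition.values()) + 1)  # number of clusters (labels start at 0)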
In [1]:
pwd
Out[1]:
'/Users/manoloflores/jupyter-notebooks'
In [2]:
cd ../Desktop/uni/bioinfo/data/paeru/
/Users/manoloflores/Desktop/uni/bioinfo/data/paeru
In [5]:
##Render Matplotlib figures inline in the notebook
%matplotlib inline

#Render figures as SVG in the notebook
%config InlineBackend.figure_format = 'svg'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import random 
import matplotlib as mpl

sns.set_style('white')
sns.set_context("talk", font_scale=1, rc={"lines.linewidth": 2.0, 'lines.markersize': 5})
sns.set_style("ticks")
sns.set_style({"xtick.direction": "in","ytick.direction": "in"})
#mpl.rc('axes', prop_cycle=(cycler('color', ['b','g','y','m','c']) ))
mpl.rc('text', usetex=False)

tw = 1.5
sns.set_style({"xtick.major.size": 6, "ytick.major.size": 6,
               "xtick.minor.size": 4, "ytick.minor.size": 4,
               'axes.labelsize': 28,
               'xtick.major.width': tw, 'xtick.minor.width': tw,
               'ytick.major.width': tw, 'ytick.minor.width': tw})

mpl.rc('xtick', labelsize=28) 
mpl.rc('ytick', labelsize=28)
mpl.rc('axes', linewidth=1.75)
plt.gcf().subplots_adjust(bottom=0.15)
sns.set_style({'axes.labelsize': 24})
<matplotlib.figure.Figure at 0x1074d8a58>
In [5]:
ls *.csv
paeru-trn.csv
In [6]:
df_trn = pd.read_csv('paeru-trn.csv',  comment= '#')
In [7]:
df_trn.head()
Out[7]:
Regulator Operon Target gene mode of regulation Experimental Evidence PubMed Reference P. aeruginosa Strain
0 agmR NaN pqqH + lac-promoter, B-galactosidase assay and RT-PCR 19902179 ATCC
1 algR NaN hcnA - mutant and PCR 19270096 PAO1
2 anr cox coxA - Microarray and Gene Chip data analysis 19930444 PAO1
3 anr cox coxB - Microarray and Gene Chip data analysis 19930444 PAO1
4 anr cox colII - Microarray and Gene Chip data analysis 19930444 PAO1
In [8]:
#Pandas DataFrame to a NetworkX graph object

trn = nx.from_pandas_edgelist(df= df_trn, source= 'Regulator', target='Target gene',
                             edge_attr='mode of regulation')
In [9]:
#Calculating eigenvector centrality to get the hubs

eigen_cen= nx.eigenvector_centrality(trn)
hubs= sorted(eigen_cen.items(), key= lambda cc: cc[1], reverse= True)[:10]
hubs
Out[9]:
[('lasR', 0.4554885034355758),
 ('rpoN', 0.2601974253722576),
 ('rhlR', 0.2583054773711612),
 ('algU', 0.18597879411812526),
 ('algR', 0.18398061151100079),
 ('ihf', 0.17928776617070166),
 ('mexT', 0.14558365805789786),
 ('algZ', 0.14445022198989502),
 ('rhlI', 0.13157834338353888),
 ('mvfR', 0.12436859521563319)]
In [10]:
import community
In [11]:
#Running the Louvain clustering algorithm to divide the network into modules

communities_trn = community.best_partition(trn)
In [12]:
#Highest cluster label (labels are 0-indexed, so this run found 28 clusters)

max(communities_trn.values())
Out[12]:
27
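Since Louvain labels are consecutive integers starting at 0, max() understates the count by one. A more direct count, as a one-line sketch on the partition dict computed above:

n_clusters = len(set(communities_trn.values()))  # 28 here: labels 0-27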
In [13]:
#Extracting the LCC

trn_lcc = max(nx.connected_component_subgraphs(trn), key=len)
In [14]:
len(trn_lcc)
Out[14]:
644
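Note that nx.connected_component_subgraphs was removed in NetworkX 2.4. On newer versions, the same LCC can be extracted with this equivalent sketch:

# Largest connected component for NetworkX >= 2.4, where
# connected_component_subgraphs no longer exists.
largest_cc = max(nx.connected_components(trn), key=len)  # a set of node names
trn_lcc = trn.subgraph(largest_cc).copy()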
In [15]:
#Running the Louvain algorithm again, this time on the TRN's LCC

communities_trn = community.best_partition(trn_lcc)
In [16]:
#Highest cluster label within the TRN's LCC (labels 0-12, i.e. 13 clusters)

max(communities_trn.values())
Out[16]:
12
In [17]:
nx.set_node_attributes(trn_lcc, values= communities_trn, name='modularity')

Let's extract the network clusters.

In [18]:
# Group the LCC nodes by Louvain cluster label; one comprehension replaces
# the original fourteen near-identical statements. Labels run 0-12 here,
# so the original cluster14 (label 13) would have been empty.
# NB: in NetworkX >= 2.4, G.node is spelled G.nodes.

clusters = [[n for n in trn_lcc.nodes() if trn_lcc.node[n]['modularity'] == k]
            for k in range(max(communities_trn.values()) + 1)]

(cluster1, cluster2, cluster3, cluster4, cluster5, cluster6, cluster7,
 cluster8, cluster9, cluster10, cluster11, cluster12, cluster13) = clusters
In [19]:
nx.draw(trn_lcc, node_color = 'lightyellow', with_labels = True, alpha = 0.6, font_size= 8)
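To make the module structure visible in the drawing, the nodes can be colored by their Louvain labels; a sketch reusing the communities_trn dict computed above:

# Map each node's integer cluster label through a qualitative colormap.
node_colors = [communities_trn[n] for n in trn_lcc.nodes()]
nx.draw(trn_lcc, node_color=node_colors, cmap=plt.cm.tab20,
        with_labels=True, alpha=0.6, font_size=8)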
In [20]:
import csv
with open('list_tfs.txt') as inputfile:
    results = list(csv.reader(inputfile))
In [21]:
results[:5]
Out[21]:
[['PA0779'], ['PA3697'], ['PA5471'], ['agmR'], ['aguR']]
In [22]:
#Concatenating the lists with itertools
import itertools

tf_list = list(itertools.chain.from_iterable(results))
In [23]:
tf_list[:5]
Out[23]:
['PA0779', 'PA3697', 'PA5471', 'agmR', 'aguR']
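The two cells above can also be collapsed into a single pandas call; an equivalent sketch, assuming the file holds one gene name per line:

# header=None: the file has no header row; column 0 holds the names.
tf_list = pd.read_csv('list_tfs.txt', header=None)[0].tolist()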

Extracting the TF-TF network

In [24]:
tf_tf_list = []
In [25]:
for row in df_trn['Target gene']:
    if row in tf_list: 
        tf_tf_list.append(1)
    else:
        tf_tf_list.append(0)
        
In [26]:
df_trn['tf'] = tf_tf_list
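The loop in the previous two cells has a vectorized equivalent; a one-line sketch:

# 1 if the target gene is itself a TF, else 0.
df_trn['tf'] = df_trn['Target gene'].isin(tf_list).astype(int)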
In [27]:
df_trn.head()
Out[27]:
Regulator Operon Target gene mode of regulation Experimental Evidence PubMed Reference P. aeruginosa Strain tf
0 agmR NaN pqqH + lac-promoter, B-galactosidase assay and RT-PCR 19902179 ATCC 0
1 algR NaN hcnA - mutant and PCR 19270096 PAO1 0
2 anr cox coxA - Microarray and Gene Chip data analysis 19930444 PAO1 0
3 anr cox coxB - Microarray and Gene Chip data analysis 19930444 PAO1 0
4 anr cox colII - Microarray and Gene Chip data analysis 19930444 PAO1 0
In [28]:
tf_tf_net = df_trn[df_trn['tf'] == 1]
In [29]:
tf_tf_net.shape
Out[29]:
(93, 8)
In [30]:
tf_tf_net.head()
Out[30]:
Regulator Operon Target gene mode of regulation Experimental Evidence PubMed Reference P. aeruginosa Strain tf
20 bexR NaN bexR + Microarrays, chip, RT-PCR and lacZ-reporter 20041030 PAO1 1
27 fhpR NaN fhpR - lacZ-reporter and B-galactosidase assay 19767835 PAO1 1
40 lasR NaN tpbA + Transcriptome profiling and RT-PCR 19543378 PA14 1
90 mexT NaN exsA - RT-PCR, lacZ-reporter and B-galactosidase assay 19683048 PAO1 1
98 PA0779 NaN fhpR - lacZ-reporter and B-galactosidase assay 19767835 PAO1 1
In [31]:
#Pandas DataFrame to a NetworkX graph object

tf_trn = nx.from_pandas_edgelist(df= tf_tf_net, source= 'Regulator', target='Target gene',
                             edge_attr='mode of regulation')
In [32]:
#Calculating eigenvector centrality to get the hubs

eigen_cen= nx.eigenvector_centrality(tf_trn)
hubs= sorted(eigen_cen.items(), key= lambda cc: cc[1], reverse= True)[:10]
hubs
Out[32]:
[('rhlR', 0.5160165927773019),
 ('lasR', 0.5123180985978882),
 ('vfr', 0.34796505775838366),
 ('mvfR', 0.28813789014281227),
 ('algQ', 0.22507289775529424),
 ('ampR', 0.22507289775529424),
 ('tpbA', 0.15745834770732636),
 ('rpoN', 0.13009940360986172),
 ('mexR', 0.12683366531740034),
 ('rsaL', 0.11777357530685671)]
In [33]:
#NB: this cell clusters the full TRN again (it runs on `trn`, not on the
#TF-TF network); the TF-TF LCC itself is clustered a few cells below.

communities_trn = community.best_partition(trn)
In [34]:
#Highest cluster label in the full TRN (labels are 0-indexed, so 28 clusters)

max(communities_trn.values())
Out[34]:
27
In [35]:
#Extracting the LCC
tf_tf_lcc = max(nx.connected_component_subgraphs(tf_trn), key=len)
In [36]:
len(tf_tf_lcc)
Out[36]:
49
In [37]:
communities_trn = community.best_partition(tf_tf_lcc)
In [38]:
#How many clusters do we get ? 

print(max(communities_trn.values()) +1 )
8
In [39]:
nx.set_node_attributes(tf_tf_lcc, values= communities_trn, name='modularity')
In [40]:
# Group the TF-TF LCC nodes by Louvain cluster label (8 clusters, labels 0-7).
# NB: in NetworkX >= 2.4, G.node is spelled G.nodes.

tf_clusters = [[n for n in tf_tf_lcc.nodes() if tf_tf_lcc.node[n]['modularity'] == k]
               for k in range(max(communities_trn.values()) + 1)]

(cluster1, cluster2, cluster3, cluster4,
 cluster5, cluster6, cluster7, cluster8) = tf_clusters
In [41]:
cluster1
Out[41]:
['psrA', 'mexR', 'mexT', 'rpoS', 'exsA', 'exsD', 'nalC']
In [42]:
cluster2
Out[42]:
['algR', 'algZ', 'algU', 'mucB']

Now that we have our clusters of TFs, let's proceed to preprocess the expression data.

Expression data pre-processing

In [43]:
df_x = pd.read_csv("colombos_paeru_exprdata_20151029.txt", delimiter= '\t', comment= '#')
df_x.head()
Out[43]:
locustag gene name Reference description ref_GSE4026_WT_OD0.5 ref_GSE4026_WT_OD0.5.1 ref_GSE4026_WT_OD0.5.2 ref_GSE4026_WT_OD0.5.3 ref_GSE4026_WT_OD0.5.4 ref_GSE4026_WT_OD0.5.5 ref_GSE4026_WT_OD0.5.6 ... ref_GSE58862_WT ref_GSE58862_WT.1 ref_GSE58862_WT.2 ref_GSE58862_WT.3 ref_GSE58862_WT.4 ref_GSE58862_WT.5 ref_GSE58862_WT.6 ref_GSE58862_WT.7 ref_GSE58862_WT.8 ref_GSE58862_WT.9
0 PA4372 PA4372 4374 0.39906 -0.27788 -0.55782 -0.298950 0.25207 0.30557 0.015565 ... -0.416470 -0.42538 -0.171690 -0.553160 -0.508280 -0.501250 -0.476490 -0.533060 -0.385570 -0.510890
1 PA4373 PA4373 4375 -0.17034 -0.47932 -0.37914 -1.141500 -0.65243 0.25024 0.211010 ... -0.470950 -0.35411 -0.186720 -0.451170 -0.510100 -0.441120 -0.355190 -0.346370 -0.291120 -0.413110
2 PA4374 PA4374 4376 -0.67948 -0.24802 0.15095 0.235100 0.79640 -0.80024 -0.279800 ... 0.062909 0.12707 0.046114 -0.000962 0.074838 0.040787 0.098608 -0.019972 0.000630 0.047984
3 PA4375 PA4375 4377 -0.87518 -0.25627 0.81047 0.892840 0.88056 -0.89583 -0.160360 ... 0.073901 0.19379 0.058365 0.110420 0.119340 0.055979 0.094709 0.076326 0.041963 0.192550
4 PA4376 pncB2 4378 -0.10441 -0.49771 -0.49071 -0.072914 -0.22118 0.38368 -0.838580 ... 0.040202 0.17626 0.035601 0.107870 0.020122 -0.003838 0.098458 0.064412 0.040828 0.187640

5 rows × 562 columns

In [44]:
df_x.shape
Out[44]:
(4429, 562)
In [45]:
def test_missing_data(df, fname):
    """Look for missing entries."""
    assert np.all(df.notnull()), fname + ' contains missing data'
In [46]:
fname = 'colombos_paeru_exprdata_20151029.txt'
In [47]:
#This assertion would fail on the raw matrix (it still contains NaNs),
#so it is left commented out; we impute below and then re-run the test.
#test_missing_data(df_x, fname)
In [48]:
##Replacing the NaNs with the median of each column (expression condition)

df_x = df_x.fillna(df_x.median())
In [49]:
test_missing_data(df_x, fname)

We're good to go.
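As a quick sanity check of the imputation, a small sketch:

# Total remaining missing entries across the matrix; should print 0
# after the median imputation above.
print(df_x.isnull().sum().sum())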

Dividing the expression and annotation datasets

In [50]:
annot= df_x.iloc[:,:3]
exp = df_x.iloc[:,3:]
In [51]:
annot.head()
Out[51]:
locustag gene name Reference description
0 PA4372 PA4372 4374
1 PA4373 PA4373 4375
2 PA4374 PA4374 4376
3 PA4375 PA4375 4377
4 PA4376 pncB2 4378
In [52]:
from sklearn.preprocessing import StandardScaler as st
std_scaler = st() # initialize stdScaler object
df_ex = std_scaler.fit_transform(exp)  # transform data
df_ex= pd.DataFrame(df_ex)  #go from std scaler object to df
In [53]:
df_ex.describe() ## check that each column now has mean ≈ 0 and s.d. ≈ 1
Out[53]:
0 1 2 3 4 5 6 7 8 9 ... 549 550 551 552 553 554 555 556 557 558
count 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 ... 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03 4.429000e+03
mean 4.863023e-17 8.817362e-18 -2.367590e-17 8.979045e-17 -2.121932e-17 -1.170760e-16 1.698423e-16 -1.038782e-16 2.436525e-17 4.334106e-17 ... 2.306176e-17 7.800890e-17 -1.504028e-18 -3.158458e-17 1.885048e-17 -1.804833e-17 5.314231e-18 1.223276e-17 1.423813e-17 4.592298e-17
std 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 ... 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00 1.000113e+00
min -5.444778e+00 -5.402485e+00 -5.175880e+00 -5.738712e+00 -7.796913e+00 -4.816482e+00 -6.040191e+00 -6.599566e+00 -5.761099e+00 -5.464318e+00 ... -1.735807e+01 -1.683504e+01 -1.143493e+01 -6.919876e+00 -6.898993e+00 -8.566631e+00 -1.664414e+01 -6.552998e+00 -7.218859e+00 -1.653566e+01
25% -5.464152e-01 -5.354005e-01 -5.116261e-01 -4.707079e-01 -4.466736e-01 -5.495407e-01 -5.051940e-01 -5.257671e-01 -4.737837e-01 -4.758205e-01 ... -1.401863e-01 -2.081378e-01 -5.835356e-01 -4.455856e-01 -3.864561e-01 -2.756227e-01 -1.789979e-01 -4.265152e-01 -4.388975e-01 -2.143702e-01
50% -1.506575e-01 -1.576834e-01 -1.813372e-01 -1.658139e-01 -5.037828e-02 -1.536203e-01 -1.413588e-01 -1.568060e-01 -1.696845e-01 -9.344114e-02 ... 8.138446e-02 5.073202e-02 -9.720670e-03 -9.252155e-02 -1.003026e-01 -6.418285e-02 6.412598e-02 -7.475554e-02 -5.028284e-02 4.370861e-02
75% 3.657911e-01 3.390271e-01 2.900729e-01 2.170175e-01 3.526403e-01 3.772787e-01 3.187507e-01 3.482203e-01 2.236671e-01 3.457832e-01 ... 2.863057e-01 3.187308e-01 5.695573e-01 2.902662e-01 2.125584e-01 1.466850e-01 3.100478e-01 2.907780e-01 3.594328e-01 3.209232e-01
max 7.367306e+00 7.059914e+00 6.674187e+00 1.082831e+01 9.136374e+00 6.242306e+00 7.127717e+00 6.508462e+00 9.569231e+00 1.085524e+01 ... 1.137373e+01 1.302729e+01 6.264489e+00 1.023223e+01 1.065201e+01 1.510992e+01 1.214637e+01 8.938069e+00 8.472909e+00 1.188193e+01

8 rows × 559 columns

Let's now apply a PCA to our normalized data.

In [54]:
from sklearn.decomposition import PCA
# apply dimensionality reduction
pca = PCA(svd_solver='randomized', random_state = 42).fit(df_ex)
In [55]:
cum_exp_var = np.cumsum(pca.explained_variance_ratio_)

# look at it
plt.plot(cum_exp_var, color = 'salmon', alpha = 0.7)
plt.xlabel('Number of dimensions')
plt.ylabel('Cumulative explained variance ratio')
#plt.axvline(x= 14 , color='k', linestyle='--')
plt.title('PCA Explained Variance ', fontsize= 20)
plt.xlim(0,500)
#plt.savefig('PCA-var.tiff',dpi=500, bbox_inches='tight')
Out[55]:
(0, 500)
In [56]:
vr = np.cumsum(pca.explained_variance_ratio_) 
In [57]:
vr[300]
Out[57]:
0.98384021483137485
In [58]:
n_components = 300 ## captures ~98% of the total variance (see vr[300] above)
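Instead of reading the count off the curve, the number of components for a target variance fraction can be computed directly; a sketch using the cumulative ratios in vr:

# Smallest number of leading components whose cumulative explained
# variance reaches 98% (vr is monotonically increasing).
n_components = int(np.searchsorted(vr, 0.98)) + 1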
In [59]:
##Now let's transform our data into PCA space, keeping the original index so we can keep track of the genes

df_pca = pca.transform(df_ex)

df_pca = pd.DataFrame(df_pca,
                         index= df_x.index)
In [60]:
df_exp = df_pca.iloc[:,:n_components] ## We'll keep the first 300 principal components
In [61]:
##changing the col. names to PC
df_exp.columns =  ['PC' +  str(x) for x in range(1, n_components+1)] 
In [62]:
df_exp.head()
Out[62]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC291 PC292 PC293 PC294 PC295 PC296 PC297 PC298 PC299 PC300
0 -0.044517 -4.141681 -0.859863 7.736152 9.501700 2.885544 -4.538022 -4.877982 -2.525804 -3.774753 ... -0.742366 0.133380 0.237840 -0.187477 -0.268732 -0.507753 -0.335736 0.187547 -0.260069 0.159282
1 -3.900209 1.083549 0.149836 -0.598552 4.373991 4.204264 -5.794299 -6.983968 -2.089755 -0.924027 ... -0.124829 -0.318707 -0.099551 -0.316073 -0.470012 -0.152636 -0.154408 0.184063 -0.079113 0.144670
2 -0.401913 -3.623344 1.279273 2.080450 -0.382767 1.692863 -2.063468 0.249417 1.849785 0.101401 ... -0.226769 0.208678 -0.397089 0.089704 0.013919 -0.145869 -0.293011 0.088385 -0.132864 -0.089518
3 -2.783305 -0.530289 0.584059 0.663760 -0.369143 0.863311 -4.337111 3.365441 1.862070 -0.064619 ... 0.055104 -0.193268 0.201649 0.227109 -0.345883 0.419205 0.144620 0.209731 0.012862 -0.226739
4 -0.679018 -3.071262 1.836192 1.807062 1.569900 -1.816808 -3.150245 1.240399 -0.033679 -0.717069 ... -0.083740 -0.038305 -0.291922 0.222395 0.127059 -0.030585 0.096032 -0.235622 -0.268026 -0.252741

5 rows × 300 columns

Multi-label classification data preparation: assigning the cluster labels to the expression DataFrame.

In [63]:
df_trn.tail()
Out[63]:
Regulator Operon Target gene mode of regulation Experimental Evidence PubMed Reference P. aeruginosa Strain tf
1015 vfr NaN vfr + Prodoric 18974177 PAO1 1
1016 vqsM NaN pprB + Prodoric 18974177 PAO1 1
1017 vqsM NaN rpoS + Prodoric 18974177 PAO1 0
1018 vqsM NaN vqsR + Prodoric 18974177 PAO1 1
1019 vqsR NaN pprB - Prodoric 18974177 PAO1 1
In [64]:
del(df_trn['tf'])
In [65]:
df_trn.columns = ['tf', 'operon', 'tg', 'regtype', 'ev', 'PMID', 'strain']

Cluster 1:

(A note for this whole section: as warned in step 2 of the workflow, Louvain labels change between runs. The hard-coded TF sets in the cluster_N regulon filters below were written against an earlier run, so they do not line up with the clusterN lists printed in this run; the filter called cluster_1 below, for instance, uses the TFs that this run prints as cluster7.)

In [66]:
cluster1
Out[66]:
['psrA', 'mexR', 'mexT', 'rpoS', 'exsA', 'exsD', 'nalC']
In [67]:
#Now let's filter the regulons of each TF from the TRN

cluster_1 = df_trn[(df_trn['tf'] =='gbuR') | (df_trn['tf'] =='pfeR')\
                 | (df_trn['tf'] =='pchR') | (df_trn['tf'] =='fur')\
                 | (df_trn['tf'] =='agmR')| (df_trn['tf'] =='glpR')
                  ]
In [68]:
cluster_1_tgs = []

#Making a list that corresponds to the first cluster's target genes (TGs)

for row in cluster_1['tg']:
    cluster_1_tgs.append(row)
    
#Make a set to avoid repetition, and then re-make a list out of it.
    
cluster1_tgs = list(set(cluster_1_tgs))
print(len(cluster1_tgs))
88
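The repeated boolean-OR filters in this section can be compacted with Series.isin, and wrapping them in a small helper keeps the TG sets tied to whatever the current run's clusters are. A sketch (get_cluster_tgs is a name introduced here for illustration):

def get_cluster_tgs(tf_names):
    """Return the deduplicated target genes of the given regulators."""
    mask = df_trn['tf'].isin(tf_names)
    return list(set(df_trn.loc[mask, 'tg']))

# e.g. the regulon filtered above; should match the 88 TGs printed.
cluster1_tgs = get_cluster_tgs(['gbuR', 'pfeR', 'pchR', 'fur', 'agmR', 'glpR'])
print(len(cluster1_tgs))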

Cluster 2:

In [69]:
cluster2
Out[69]:
['algR', 'algZ', 'algU', 'mucB']
In [70]:
#Now let's filter the regulons of each TF from the TRN

cluster_2 = df_trn[(df_trn['tf'] =='gacA') | (df_trn['tf'] =='qscR')\
                 | (df_trn['tf'] =='ampR') | (df_trn['tf'] =='algQ')\
                 | (df_trn['tf'] =='algR4')| (df_trn['tf'] =='mvfR')\
                 | (df_trn['tf'] =='rhlR') | (df_trn['tf'] =='rsaL')\
                 | (df_trn['tf'] =='ihf')| (df_trn['tf'] =='algR3')\
                 | (df_trn['tf'] =='lasR')
                  ]
In [71]:
cluster_2_tgs = []

#Making a list that corresponds to the second cluster's target genes (TGs)

for row in cluster_2['tg']:
    cluster_2_tgs.append(row)
    
    
#Make a set to avoid repetition, and then re-make a list out of it.
    
cluster2_tgs = list(set(cluster_2_tgs))

#Let's see how many TGs does cluster 2 have 

print(len(cluster2_tgs))
140
In [72]:
cluster3
Out[72]:
['ptxS', 'rpoD', 'vfr', 'pvdS', 'ptxR', 'toxR', 'mvtA', 'fleQ', 'hu']
In [73]:
#Now let's filter the regulons of each TF from the TRN

cluster_3 = df_trn[(df_trn['tf'] =='fleQ') | (df_trn['tf'] =='vfr')\
                 | (df_trn['tf'] =='pvdS') | (df_trn['tf'] =='hu')\
                 | (df_trn['tf'] =='ptxR')| (df_trn['tf'] =='mvtA')\
                 | (df_trn['tf'] =='ptxS') | (df_trn['tf'] =='toxR')\
                 | (df_trn['tf'] =='rpoD')
                  ]
In [74]:
cluster_3_tgs = []

#Making a list that corresponds to the cluster's target genes (TGs)

for row in cluster_3['tg']:
    cluster_3_tgs.append(row)
    
#Make a set to avoid repetition, and then re-make a list out of it.

    
cluster3_tgs = list(set(cluster_3_tgs))

print(len(cluster3_tgs))
93
In [75]:
cluster4
Out[75]:
['pilR', 'fhpR', 'PA0779', 'pilA', 'rpoN']
In [76]:
#Now let's filter the regulons of each TF from the TRN

cluster_4 = df_trn[(df_trn['tf'] =='exsA') | (df_trn['tf'] =='mexR')\
                 | (df_trn['tf'] =='mexT') | (df_trn['tf'] =='nalC')\
                 | (df_trn['tf'] =='psrA')| (df_trn['tf'] =='rpoS')\
                 | (df_trn['tf'] =='exsD') 
                  ]
In [77]:
cluster_4_tgs = []

#Making a list that corresponds to the cluster's target genes (TGs)...

for row in cluster_4['tg']:
    cluster_4_tgs.append(row)
    
#Make a set to avoid repetition, and then re-make a list out of it...

    
cluster4_tgs = list(set(cluster_4_tgs))

print(len(cluster4_tgs))
81
In [78]:
cluster5
Out[78]:
['algR3',
 'ihf',
 'rhlR',
 'ampR',
 'gacA',
 'qscR',
 'rsaL',
 'algQ',
 'mvfR',
 'algR4',
 'lasR']
In [79]:
cluster_5 = df_trn[(df_trn['tf'] =='dnr') | (df_trn['tf'] =='anr')\
                 | (df_trn['tf'] =='narL') ]
In [80]:
cluster_5_tgs = []

for row in cluster_5['tg']:
    cluster_5_tgs.append(row)
    
cluster5_tgs = list(set(cluster_5_tgs))

print(len(cluster5_tgs))
61
In [81]:
cluster6
Out[81]:
['pmrA', 'tpbA', 'phoQ', 'phoP']
In [82]:
#Now let's filter the regulons of each TF from the TRN

cluster_6 = df_trn[(df_trn['tf'] =='PA0779') | (df_trn['tf'] =='pilA')\
                 | (df_trn['tf'] =='fhpR') | (df_trn['tf'] =='rpoN')\
                 | (df_trn['tf'] =='pilR')]
In [83]:
cluster_6_tgs = []

for row in cluster_6['tg']:
    cluster_6_tgs.append(row)
    
cluster6_tgs = list(set(cluster_6_tgs))

print(len(cluster6_tgs))
75
In [84]:
cluster7
Out[84]:
['agmR', 'pchR', 'gbuR', 'glpR', 'fur', 'pfeR']
In [85]:
#Now let's filter the regulons of each TF from the TRN
#(NB: 'tbpA' below appears to be a typo for 'tpbA', the TF listed in
#cluster6 above; as written, that regulon is silently excluded.)

cluster_7 = df_trn[(df_trn['tf'] =='phoQ') | (df_trn['tf'] =='pmrA')\
                 | (df_trn['tf'] =='phoP') | (df_trn['tf'] =='tbpA')]
In [86]:
cluster_7_tgs = []

for row in cluster_7['tg']:
    cluster_7_tgs.append(row)
    
cluster7_tgs = list(set(cluster_7_tgs))

print(len(cluster7_tgs))
55
In [162]:
cluster2
Out[162]:
['algR', 'algZ', 'algU', 'mucB']
In [164]:
cluster_8 = df_trn[(df_trn['tf'] =='algR') | (df_trn['tf'] =='algZ')\
                 | (df_trn['tf'] =='algU') | (df_trn['tf'] =='mucB')]
In [165]:
cluster_8_tgs = []

for row in cluster_8['tg']:
    cluster_8_tgs.append(row)
    
cluster8_tgs = list(set(cluster_8_tgs))

print(len(cluster8_tgs))
58
In [166]:
#Initializing the labels' lists

labels1 = []
labels2 = []
labels3 = []
labels4 = []
labels5 = []
labels6 = []
labels7 = []
labels8 = []
In [167]:
##Setting up the binary labels for each cluster:
##1 if a gene is a target of that cluster's TFs, 0 otherwise
##(one loop replaces the original eight near-identical blocks)

all_tgs = [cluster1_tgs, cluster2_tgs, cluster3_tgs, cluster4_tgs,
           cluster5_tgs, cluster6_tgs, cluster7_tgs, cluster8_tgs]
all_labels = [labels1, labels2, labels3, labels4,
              labels5, labels6, labels7, labels8]

for tgs, labels in zip(all_tgs, all_labels):
    for gene in df_x['gene name']:
        labels.append(1 if gene in tgs else 0)
In [168]:
# Adding the cluster labels for each cluster as extra columns.
# (df_exp is a slice of df_pca, which is why pandas raises the
# SettingWithCopyWarning below.)

df_exp['cluster 1'] = labels1
df_exp['cluster 2'] = labels2
df_exp['cluster 3'] = labels3
df_exp['cluster 4'] = labels4
df_exp['cluster 5'] = labels5
df_exp['cluster 6'] = labels6
df_exp['cluster 7'] = labels7
df_exp['cluster 8'] = labels8
/Users/manoloflores/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
(The same warning is emitted for each of the eight assignments; the remaining seven copies are omitted.)
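The warning can be avoided altogether by materializing the slice before adding columns; a sketch that produces the same DataFrame (all_labels is the list of label lists built above):

# .copy() makes df_exp an independent DataFrame rather than a view of df_pca,
# so column assignment no longer triggers SettingWithCopyWarning.
df_exp = df_pca.iloc[:, :n_components].copy()
df_exp.columns = ['PC' + str(x) for x in range(1, n_components + 1)]
for i, labels in enumerate(all_labels, start=1):
    df_exp['cluster ' + str(i)] = labels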
In [169]:
df_exp.head()
Out[169]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC300 cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 TGs cluster 8
0 -0.044517 -4.141681 -0.859863 7.736152 9.501700 2.885544 -4.538022 -4.877982 -2.525804 -3.774753 ... 0.159282 0 0 0 0 0 0 0 0 0
1 -3.900209 1.083549 0.149836 -0.598552 4.373991 4.204264 -5.794299 -6.983968 -2.089755 -0.924027 ... 0.144670 0 0 0 0 0 0 0 0 0
2 -0.401913 -3.623344 1.279273 2.080450 -0.382767 1.692863 -2.063468 0.249417 1.849785 0.101401 ... -0.089518 0 0 0 0 0 0 0 0 0
3 -2.783305 -0.530289 0.584059 0.663760 -0.369143 0.863311 -4.337111 3.365441 1.862070 -0.064619 ... -0.226739 0 0 0 0 0 0 0 0 0
4 -0.679018 -3.071262 1.836192 1.807062 1.569900 -1.816808 -3.150245 1.240399 -0.033679 -0.717069 ... -0.252741 0 0 0 0 0 0 0 0 0

5 rows × 309 columns

In [170]:
#Let's make a list to filter out the TGs in each cluster
#This will allow us to use only the genes that are regulated by TFs
#(i.e. the TGs of the TRN) as training data


#A gene counts as a TG if it appears in any cluster's TG list;
#a single set union replaces the original chain of eight elif branches.

all_cluster_tgs = set().union(cluster1_tgs, cluster2_tgs, cluster3_tgs,
                              cluster4_tgs, cluster5_tgs, cluster6_tgs,
                              cluster7_tgs, cluster8_tgs)

TGs_list = [1 if gene in all_cluster_tgs else 0 for gene in df_x['gene name']]
In [171]:
##Adding the TG list as a column to the expression data

df_exp['TGs'] = TGs_list
(The same SettingWithCopyWarning as above is raised for this assignment as well.)
In [172]:
df_exp.head()
Out[172]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC300 cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 cluster 6 cluster 7 TGs cluster 8
0 -0.044517 -4.141681 -0.859863 7.736152 9.501700 2.885544 -4.538022 -4.877982 -2.525804 -3.774753 ... 0.159282 0 0 0 0 0 0 0 0 0
1 -3.900209 1.083549 0.149836 -0.598552 4.373991 4.204264 -5.794299 -6.983968 -2.089755 -0.924027 ... 0.144670 0 0 0 0 0 0 0 0 0
2 -0.401913 -3.623344 1.279273 2.080450 -0.382767 1.692863 -2.063468 0.249417 1.849785 0.101401 ... -0.089518 0 0 0 0 0 0 0 0 0
3 -2.783305 -0.530289 0.584059 0.663760 -0.369143 0.863311 -4.337111 3.365441 1.862070 -0.064619 ... -0.226739 0 0 0 0 0 0 0 0 0
4 -0.679018 -3.071262 1.836192 1.807062 1.569900 -1.816808 -3.150245 1.240399 -0.033679 -0.717069 ... -0.252741 0 0 0 0 0 0 0 0 0

5 rows × 309 columns

In [173]:
##Keeping only the genes that are regulated by TFs (TGs == 1)

regulons_df = df_exp[df_exp['TGs'] == 1]
In [174]:
del(regulons_df['TGs'])
In [175]:
#Keeping only the genes that are NOT regulated by TFs (TGs == 0)

non_reg_df  = df_exp[df_exp['TGs'] == 0]
In [176]:
del(non_reg_df['TGs'])
In [177]:
##Making a 'noise' DataFrame by sampling 20 genes that are NOT
##regulated by any TF, at random and without replacement

noise = non_reg_df.sample(n =20, replace = False, axis = 0, random_state = 42)
In [178]:
regulons_with_noise = pd.concat([regulons_df, noise]) ## train/test pool: regulon genes plus 20 negative examples

regulons_with_noise.shape ##Let's look at the nrows and ncols
Out[178]:
(334, 308)
In [185]:
X_data = regulons_with_noise.iloc[:,:-8] #the 300 principal components
y_data = regulons_with_noise.iloc[:,-8:] #the eight binary cluster labels
In [180]:
from sklearn.model_selection import train_test_split
In [186]:
#The test subset will correspond to 30% of the data at random

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42) 
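Step 5 of the workflow suggests trying an SVM first. Because y_train is a multi-label target (one binary column per cluster), scikit-learn's one-vs-rest wrapper fits one SVM per label; a sketch, not the notebook's final model:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary SVM per cluster; predict() returns an (n_samples, 8)
# indicator matrix with the same layout as y_test.
svm_clf = OneVsRestClassifier(SVC(kernel='rbf'))
svm_clf.fit(X_train, y_train)
print(svm_clf.score(X_test, y_test))  # subset accuracy (exact-match ratio)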
In [112]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from keras.metrics import categorical_accuracy
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
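Given these imports, a minimal multi-label network for this problem could look like the sketch below. The layer sizes are illustrative assumptions, not the author's final architecture; the essential multi-label choices are the sigmoid output (one independent probability per cluster) and the binary_crossentropy loss:

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(64, activation='relu'))
# One sigmoid unit per cluster label, predicted independently.
model.add(Dense(y_train.shape[1], activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train.values, y_train.values, epochs=50, batch_size=32,
          validation_data=(X_test.values, y_test.values))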
In [104]:
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

import holoviews as hv
hv.extension('bokeh')
Loading BokehJS ...