This project is based on the Tabular Playground Series - Feb 2022 from Kaggle, which in turn is based on data from the original research article "Analysis of Identification Method for Bacterial Species and Antibiotic Resistance Genes Using Optical Data From DNA Oligomers".
The goals of this work are stated in the abstract of that article, quoted below.
Bacterial antibiotic resistance is becoming a significant health threat, and rapid identification of antibiotic-resistant bacteria is essential to save lives and reduce the spread of antibiotic resistance. This paper analyzes the ability of machine learning algorithms (MLAs) to process data from a novel spectroscopic diagnostic device to identify antibiotic-resistant genes and bacterial species by comparison to available bacterial DNA sequences. Simulation results show that the algorithms attain from 92% accuracy (for genes) up to 99% accuracy (for species). This novel approach identifies genes and species by optically reading the percentage of A, C, G, T bases in 1000s of short 10-base DNA oligomers instead of relying on conventional DNA sequencing in which the sequence of bases in long oligomers provides genetic information. The identification algorithms are robust in the presence of simulated random genetic mutations and simulated random experimental errors. Thus, these algorithms can be used to identify bacterial species, to reveal antibiotic resistance genes, and to perform other genomic analyses. Some MLAs evaluated here are shown to be better than others at accurate gene identification and avoidance of false negative identification of antibiotic resistance.
Let's begin. Load required modules.
from IPython.display import display
from IPython.display import HTML
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
import copy
import datetime
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns
import time
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
from tensorflow import keras
from tensorflow.keras import activations
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import initializers
from tensorflow.keras import Input
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
pd.options.display.max_columns = 300
pd.options.display.max_rows = 300
%%time
# Download the datasets from the link below
# https://www.kaggle.com/competitions/tabular-playground-series-feb-2022/data
train_raw = pd.read_csv("/home/mvg/Documents/Ewps/ML_DL/DL/TPS_022022/data/train.csv")
test_raw = pd.read_csv("/home/mvg/Documents/Ewps/ML_DL/DL/TPS_022022/data/test.csv")
submission_raw = pd.read_csv("/home/mvg/Documents/Ewps/ML_DL/DL/TPS_022022/data/sample_submission.csv")
train_raw.set_index("row_id", inplace=True)
test_raw.set_index("row_id", inplace=True)
# Check nan values
print("The train has {} features with nan values."\
.format(list(train_raw.isnull().sum().values > 0).count(True)))
print("The test has {} features with nan values."\
.format(list(test_raw.isnull().sum().values > 0).count(True)))
print("The sample_submission has {} features with nan values."\
.format(list(submission_raw.isnull().sum().values > 0).count(True)))
# Exploratory data analysis
# Check duplicates
train_duplicated_rows = train_raw[train_raw.duplicated()==True].shape[0]
print("\ntrain dataset contains {:,} duplicated rows out of {:,} total rows.\
 Share of duplicates = {:.2f}%".format(train_duplicated_rows,
train_raw.shape[0],
100.0*train_duplicated_rows/train_raw.shape[0]))
test_duplicated_rows = test_raw[test_raw.duplicated()==True].shape[0]
print("\ntest dataset contains {:,} duplicated rows out of {:,} total rows.\
 Share of duplicates = {:.2f}%".format(test_duplicated_rows,
test_raw.shape[0],
100.0*test_duplicated_rows/test_raw.shape[0]))
# Remove duplicates from train and test datasets
train = train_raw.drop_duplicates()
test = test_raw.drop_duplicates()
# Check the target range and dispersion - expect a roughly uniform distribution
train_target_dispersion = train.target.value_counts(normalize=True)*100
print(train_target_dispersion)
print("\nTotal classes for target: {}.".format(len(train_target_dispersion.values)))
# One-hot encode the target and separate into X and Y
train_x_all = train.iloc[:,:-1].values
train_y_all = train.iloc[:,[-1]].values
test_x_all = test.values
ohe = OneHotEncoder()
ohe.fit(train_y_all)
train_y_all_ohe = ohe.transform(train_y_all).toarray()
The train has 0 features with nan values.
The test has 0 features with nan values.
The sample_submission has 0 features with nan values.

train dataset contains 76,007 duplicated rows out of 200,000 total rows. Share of duplicates = 38.00%

test dataset contains 26,779 duplicated rows out of 100,000 total rows. Share of duplicates = 26.78%

Bacteroides_fragilis        10.098957
Campylobacter_jejuni        10.056213
Klebsiella_pneumoniae       10.016694
Streptococcus_pneumoniae    10.013469
Staphylococcus_aureus       10.012662
Streptococcus_pyogenes      10.005404
Salmonella_enterica          9.992500
Enterococcus_hirae           9.978789
Escherichia_coli             9.917495
Escherichia_fergusonii       9.907817
Name: target, dtype: float64

Total classes for target: 10.
CPU times: user 28.3 s, sys: 2.61 s, total: 30.9 s
Wall time: 30.9 s
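The one-hot round trip used above (fit on string labels, transform to a dense 0/1 matrix, later `inverse_transform` the predictions) can be checked on a toy example. This is a standalone sketch with made-up labels, not the competition data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels standing in for the bacterial species column
labels = np.array([["cat"], ["dog"], ["cat"], ["bird"]])

ohe = OneHotEncoder()
onehot = ohe.fit_transform(labels).toarray()

print(onehot.shape)  # one column per distinct class: (4, 3)
recovered = ohe.inverse_transform(onehot)
print((recovered == labels).all())  # True - the round trip is lossless
```

The same `ohe` instance must be reused for `inverse_transform` at prediction time, since it stores the category ordering learned during `fit`.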
Define a scaling function for later use.
%%time
# Scaler for array
def npscaler(x_values, scaler="ss"):
    """
    Scale/transform a numpy array.
    Possible scale/transform options for the x features:
    1. None - no scaling or transformation
    2. "ptbc" - PowerTransformer with Box-Cox
    3. "ptyj" - PowerTransformer with Yeo-Johnson
    4. "rb" - RobustScaler
    5. "ss" - StandardScaler
    To prevent data leakage, use a separate scaler/transformer instance
    for each of the train and test parts.
    Parameters
    ----------
    x_values : np.array with numeric feature values.
    scaler : None or str, optional. The default is "ss".
    Returns
    -------
    x_vals - scaled/transformed np.array
    """
    scalers = ["ptbc", "ptyj", "rb", "ss"]
    if scaler not in scalers:
        raise ValueError("Enter 'ptbc', 'ptyj', 'rb' or 'ss' for scaler!")
    x_vals = np.copy(x_values)
    mms = MinMaxScaler(feature_range=(1, 2))
    ptbc = PowerTransformer(method='box-cox')
    ptyj = PowerTransformer()
    rb = RobustScaler(unit_variance=True)
    ss = StandardScaler()
    if scaler == "ptbc":
        # Box-Cox requires strictly positive inputs, hence the MinMax shift
        x_vals = ptbc.fit_transform(mms.fit_transform(x_vals))
    elif scaler == "ptyj":
        x_vals = ptyj.fit_transform(x_vals)
    elif scaler == "rb":
        x_vals = rb.fit_transform(x_vals)
    elif scaler == "ss":
        x_vals = ss.fit_transform(x_vals)
    return x_vals
CPU times: user 4 µs, sys: 0 ns, total: 4 µs Wall time: 7.15 µs
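As a sanity check on the default `"ss"` branch of `npscaler`, `StandardScaler` should leave every column with zero mean and unit variance. A minimal standalone check on synthetic data (not the competition files):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Equivalent of npscaler(x, scaler="ss"): fit and transform in one call
x_ss = StandardScaler().fit_transform(x)

print(np.allclose(x_ss.mean(axis=0), 0.0))  # columns are centered
print(np.allclose(x_ss.std(axis=0), 1.0))   # columns have unit variance
```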
%%time
# Find the output bias vector for y: log of the pos/neg ratio per class
output_bias = []
for i in range(train_y_all_ohe.shape[1]):
    neg, pos = np.bincount(train_y_all_ohe[:, i].astype(int)) \
               / train_y_all_ohe[:, i].shape[0]
    output_bias.append(np.log(pos/neg))
output_bias = np.array(output_bias)
# Find input bias for Standard Scaler using this link
# https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full
train_features = list(train.columns)[:-1]
input_bias=[]
reg_a = (r"(A\d+)")
reg_t = (r"(T\d+)")
reg_g = (r"(G\d+)")
reg_c = (r"(C\d+)")
for name in train_features:
    int_a = int(re.findall(reg_a, name)[0].replace("A", ""))
    int_t = int(re.findall(reg_t, name)[0].replace("T", ""))
    int_g = int(re.findall(reg_g, name)[0].replace("G", ""))
    int_c = int(re.findall(reg_c, name)[0].replace("C", ""))
    # Multinomial probability of this base composition in a random 10-mer
    bias = ((1/4)**10) * math.factorial(10) / (math.factorial(int_a)
                                               * math.factorial(int_t)
                                               * math.factorial(int_g)
                                               * math.factorial(int_c))
    input_bias.append(bias)
# Append the input bias row, scale with StandardScaler, then extract it back
train_x_all_with_bias = np.append(np.array(input_bias).reshape(1,-1),
train_x_all, axis=0)
train_x_all_with_bias_ss = npscaler(train_x_all_with_bias)
input_bias_ss = train_x_all_with_bias_ss[-1,:]
CPU times: user 660 ms, sys: 223 ms, total: 883 ms Wall time: 878 ms
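The `bias` expression above is the multinomial probability of observing a given (A, T, G, C) base composition in a random 10-mer with uniform base probabilities. A quick standalone check that these probabilities form a proper distribution, i.e. sum to 1 over all compositions of length 10:

```python
import math

# Probability of a given (A, T, G, C) composition in a random 10-mer,
# matching the bias formula used for the input bias above
def composition_prob(a, t, g, c, n=10):
    assert a + t + g + c == n
    return (0.25 ** n) * math.factorial(n) / (
        math.factorial(a) * math.factorial(t)
        * math.factorial(g) * math.factorial(c))

# Enumerate every non-negative (a, t, g, c) with a + t + g + c == 10
total = sum(composition_prob(a, t, g, 10 - a - t - g)
            for a in range(11)
            for t in range(11 - a)
            for g in range(11 - a - t))
print(round(total, 12))  # 1.0
```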
%%time
# PCA analysis for train dataset
pca_train=PCA()
pca_train.fit(npscaler(train_x_all))
pca_train_cumsum = np.cumsum(pca_train.explained_variance_ratio_)
pca_train_comp_no = np.array(list(range(1,len(pca_train_cumsum)+1)))
# define number of components with 95% variance
pca_train_comp = np.argmax(pca_train_cumsum >= 0.95) + 1
pca_train_df = pd.DataFrame(data=pca_train_cumsum ,
columns=["pca_var_ratio"])
# Check PCA for test dataset
pca_test=PCA()
pca_test.fit(npscaler(test_x_all))
pca_test_cumsum = np.cumsum(pca_test.explained_variance_ratio_)
pca_test_comp_no = np.array(list(range(1,len(pca_test_cumsum)+1)))
# define number of components with 95% variance
pca_test_comp = np.argmax(pca_test_cumsum >= 0.95) + 1
pca_test_df = pd.DataFrame(data=pca_test_cumsum ,
columns=["pca_var_ratio"])
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(pca_test_comp_no, pca_test_cumsum, label="test")
ax.plot(pca_train_comp_no, pca_train_cumsum, label="train")
ax.legend()
# Set major ticks
major_xticks = np.arange(0,len(pca_train_comp_no)+15, 50)
major_yticks = np.arange(0,1.0, 0.1)
# Set minor ticks
minor_xticks = np.arange(0,len(pca_train_comp_no)+15, 5)
minor_yticks = np.arange(0,1.0, 0.025)
# Define major and minor tick parameters
ax.tick_params(axis = 'both', which = 'major', labelsize = 12)
ax.tick_params(axis = 'both', which = 'minor', labelsize = 12)
ax.set_xticks(major_xticks)
ax.set_xticks(minor_xticks, minor = True)
ax.set_yticks(major_yticks);
ax.set_yticks(minor_yticks, minor = True);
ax.grid(visible=True, which="both", axis="both")
# ax labels
ax.set_ylabel('Cumulative sum', fontsize=12, fontweight="bold")
ax.set_xlabel("Number of PCA components", fontsize=12, fontweight="bold")
ax.legend()
ax.set_title("Cumulative sum of PCA components for test and train datasets.\
\n Number of PCA components for 95% cumsum: test = {:}, train = {:}.".format(
pca_test_comp, pca_train_comp),
fontsize=13,
fontweight="bold")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()
CPU times: user 1min 10s, sys: 9.57 s, total: 1min 20s Wall time: 12.9 s
As seen above, the 95% cumulative explained variance threshold is reached at 226 components for the train dataset but at 231 for the test dataset. We keep the larger value, apply PCA to both the train and test datasets, and recalculate the input bias for the train dataset after the PCA transformation.
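As an aside, scikit-learn can pick the component count for a variance threshold directly: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance ratio reaches that threshold. A sketch on synthetic data (the shapes here are illustrative, not the competition data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 20 informative directions embedded in 50 features
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 50))
x = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

# A float n_components selects the component count automatically
pca = PCA(n_components=0.95)
x_reduced = pca.fit_transform(x)

print(x_reduced.shape[1])  # number of components chosen for 95% variance
print(np.cumsum(pca.explained_variance_ratio_)[-1] >= 0.95)  # True
```

Here two separate `PCA` instances are still fitted for train and test (as above), so the fixed `n_components=231` keeps the two transformed datasets at the same width.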
%%time
# Keep 231 PCA components and transform the data
train_pca=PCA(n_components=231, svd_solver='randomized')
train_pca_x_all=train_pca.fit_transform(npscaler(train_x_all))
# Calculate input_bias for PCA; convert with SS for further data scaling
train_pca_x_all_with_bias = train_pca.fit_transform(train_x_all_with_bias_ss)
train_pca_x_all_with_bias_ss = npscaler(train_pca_x_all_with_bias)
input_bias_pca = train_pca_x_all_with_bias_ss[-1,:]
# Transform test with PCA
test_pca=PCA(n_components=231, svd_solver='randomized')
test_pca_x_all=test_pca.fit_transform(npscaler(test_x_all))
CPU times: user 3min 9s, sys: 45.3 s, total: 3min 54s Wall time: 36.5 s
%%time
mut_inf_clf = mutual_info_classif(train_x_all, train_y_all.ravel())
mut_inf_df = pd.DataFrame(data=list(train.columns)[:-1], columns=['features'])
mut_inf_df["mi_clfs"]=mut_inf_clf
mut_inf_zero_vals = mut_inf_df[mut_inf_df["mi_clfs"]==0].count().values[1]
# Plot histogram values
fig, ax = plt.subplots(figsize=(12,8))
ax=sns.histplot(data=mut_inf_df, x="mi_clfs", kde=True, ax=ax)
ax.set_title("Histogram of mutual information values. Number of features with \
zero mutual information = {}.".format(mut_inf_zero_vals))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel("Mutual information values", fontsize=12, fontweight="bold")
plt.show()
CPU times: user 4min 19s, sys: 469 ms, total: 4min 19s Wall time: 4min 19s
As seen above, the dataset contains 286 features with continuous values and ten target classes. The PCA variance is almost evenly distributed across features. The mutual information between the features and the discrete target has no zero values and ranges from 0.1 to 1.4. This is a classical multiclass classification task, well suited to a classical Multi-Layer Perceptron (MLP).
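Mutual information behaves as expected on a toy example: a feature that tracks the label scores much higher than pure noise. A standalone sketch (synthetic data, not the competition features):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
informative = y + 0.1 * rng.normal(size=2000)  # closely tracks the label
noise = rng.normal(size=2000)                  # independent of the label
x = np.column_stack([informative, noise])

mi = mutual_info_classif(x, y, random_state=1)
print(mi[0] > mi[1])  # the informative feature scores higher: True
```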
I spent a huge amount of time on hyperparameter selection and the optimal network topology. I wrote my own cross-validation functions and tried Keras Tuner, but the results were not very good in either case: these programs ran slowly, did not use multithreading, and so on. In the end, I remembered Ockham's razor
- plurality should not be posited without necessity
. For the number of neurons I used a rule of thumb (no more than the number of features), while the number of layers I searched for with the help of Ray Tune. This file contains the code for finding the optimal topology, learning rate and Dropout. Execute it partially in an IDE with IPython, such as Spyder.
Prepare the input and output biases for TF.
%%time
# Convert input and output biases to tf constants
tf_output_bias = tf.keras.initializers.Constant(output_bias)
tf_input_bias = tf.keras.initializers.Constant(input_bias_ss)
tf_input_bias_pca = tf.keras.initializers.Constant(input_bias_pca)
CPU times: user 9 µs, sys: 1 µs, total: 10 µs Wall time: 13.4 µs
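For reference, the effect of the output bias computed earlier: with logits of log(pos/neg) and zero input contribution, the initial softmax output matches the class priors instead of starting from uniform zero logits. A pure-numpy sketch assuming perfectly balanced classes (the real priors above are close to, but not exactly, 0.1):

```python
import numpy as np

# Balanced 10-class problem: pos ~ 0.1, neg ~ 0.9 for every class
priors = np.full(10, 0.1)
bias = np.log(priors / (1.0 - priors))  # the log(pos/neg) output bias

# At initialization the layer input contributes ~0, so logits = bias;
# a softmax over equal biases reproduces the (balanced) class priors
logits = bias
probs = np.exp(logits) / np.exp(logits).sum()
print(np.allclose(probs, priors))  # True
```

Starting the final layer at the class priors keeps the initial loss near its theoretical minimum for an uninformed model, which typically shortens the early epochs.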
Define model and required functions
def npds_train_test_split(x, y, test_ratio=0.2, batch_sz=512, scaler="ss"):
    """
    Convert, shuffle and scale numpy arrays to tf.data.Dataset.
    Parameters
    ----------
    x : input np.array.
    y : input np.array.
    test_ratio : float, optional. The default is 0.2.
        Ratio of the test part to all array rows. If None, return the
        whole tf.data.Dataset.
    batch_sz : int, optional. The default is 512.
        Batch size.
    scaler : string, optional. The default is "ss".
    Returns
    -------
    tf.data.Dataset.
    """
    if test_ratio is not None and test_ratio < 1.0 and isinstance(test_ratio, float):
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_ratio, stratify=y, random_state=42)
        # Separate scaler instances per split to prevent data leakage
        x_train, x_test = npscaler(x_train, scaler), npscaler(x_test, scaler)
        ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
        ds_train = ds_train.shuffle(buffer_size=x_train.shape[0])\
            .batch(batch_sz).prefetch(1)
        ds_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
        ds_test = ds_test.shuffle(buffer_size=x_test.shape[0])\
            .batch(batch_sz).prefetch(1)
        return ds_train, ds_test
    elif test_ratio is None:
        x = npscaler(x, scaler)
        ds_all = tf.data.Dataset.from_tensor_slices((x, y))\
            .shuffle(buffer_size=x.shape[0]).batch(batch_sz)
        return ds_all
def clf_adam(shape_x=train_x_all.shape[1], learn_rate=0.005623,
             drop_out_in=0.35, drop_out_1=0.05, drop_out_2=0.0,
             drop_out_3=0.1, drop_out_4=0.1, input_bias=tf_input_bias,
             output_bias=tf_output_bias):
    # Define metrics
    metrics_short = [
        tf.keras.metrics.CategoricalAccuracy(name="categorical_accuracy")]
    # Create model
    model = Sequential()
    # Input layer
    model.add(layers.Dense(units=shape_x, input_shape=(shape_x,),
                           kernel_initializer="GlorotUniform",
                           bias_initializer=input_bias,
                           activation="relu"))
    model.add(Dropout(rate=drop_out_in))
    model.add(BatchNormalization())
    # First hidden layer
    model.add(layers.Dense(units=72, kernel_initializer="GlorotUniform",
                           activation="relu"))
    model.add(Dropout(rate=drop_out_1))
    model.add(BatchNormalization())
    # Second hidden layer
    model.add(layers.Dense(units=72, kernel_initializer="GlorotUniform",
                           activation="relu"))
    model.add(Dropout(rate=drop_out_2))
    model.add(BatchNormalization())
    # Third hidden layer
    model.add(layers.Dense(units=72, kernel_initializer="GlorotUniform",
                           activation="relu"))
    model.add(Dropout(rate=drop_out_3))
    model.add(BatchNormalization())
    # Fourth hidden layer
    model.add(layers.Dense(units=72, kernel_initializer="GlorotUniform",
                           activation="relu"))
    model.add(Dropout(rate=drop_out_4))
    model.add(BatchNormalization())
    # Final layer
    model.add(layers.Dense(units=10, bias_initializer=output_bias))
    model.add(layers.Activation(activations.softmax))
    model.compile(
        loss="categorical_crossentropy",
        optimizer=Adam(learning_rate=learn_rate),
        metrics=metrics_short)
    return model
def train_model(model, x_train, y_train_ohe, test_sz=0.2, batch_sz=1024,
                stop_no=30):
    callbacks = [EarlyStopping(monitor='categorical_accuracy', mode='max',
                               patience=stop_no, restore_best_weights=True)]
    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train_ohe,
                                                        test_size=test_sz,
                                                        stratify=y_train_ohe,
                                                        random_state=42)
    x_train, x_test = npscaler(x_train), npscaler(x_test)
    ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds_train = ds_train.shuffle(buffer_size=x_train.shape[0]).batch(batch_sz)\
        .prefetch(1)
    ds_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    ds_test = ds_test.shuffle(buffer_size=x_test.shape[0]).batch(batch_sz)\
        .prefetch(1)
    ds_start = time.time()
    model_check = copy.copy(model)
    ds_history = model_check.fit(ds_train,
                                 epochs=10000,
                                 validation_data=ds_test,
                                 callbacks=callbacks,
                                 verbose=0)
    ds_end = time.time()
    ds_total_time = datetime.timedelta(seconds=(ds_end - ds_start))
    ds_history_df = pd.DataFrame(data=ds_history.history)
    ds_history_df.sort_values(by='val_loss', ascending=True, inplace=True)
    ds_history_df["epochs"] = ds_history_df.index + 1
    ds_history_df["time"] = ds_total_time
    return model_check, ds_history_df
%%time
model_adam, model_adam_hist = train_model(clf_adam(), train_x_all,
train_y_all_ohe)
CPU times: user 24min 3s, sys: 2min 27s, total: 26min 31s Wall time: 15min 3s
%%time
# Plot accuracy and loss for all 286 features
fig, ax = plt.subplots(figsize=(18,12))
ax = sns.lineplot(data=model_adam_hist.set_index("epochs"), ax = ax)
ax.set_title(" Accuracy and Loss for all 286 features.", fontsize=16,
fontweight="bold")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# Set major ticks
major_xticks = np.arange(0,(divmod(model_adam_hist.shape[0], 50)[0]+1)*50, 50)
major_yticks = np.arange(0,1.1, 0.1)
# Set minor ticks
minor_xticks = np.arange(0,(divmod(model_adam_hist.shape[0], 50)[0]+1)*50, 10)
minor_yticks = np.arange(0,1.1, 0.02)
# Define major and minor tick parameters
ax.tick_params(axis = 'both', which = 'major', labelsize = 18)
ax.tick_params(axis = 'both', which = 'minor', labelsize = 12)
ax.set_xticks(major_xticks)
ax.set_xticks(minor_xticks, minor = True)
ax.set_yticks(major_yticks);
ax.set_yticks(minor_yticks, minor = True);
ax.grid(visible=True, which="both", axis="both")
# ax labels
ax.set_ylabel('Accuracy/Loss', fontsize=12, fontweight="bold")
ax.set_xlabel("Epochs", fontsize=12, fontweight="bold")
legend = ax.legend(shadow=True, fontsize=15)
CPU times: user 387 ms, sys: 96 ms, total: 483 ms Wall time: 349 ms
Evaluate model
%%time
ds_xy = npds_train_test_split(train_x_all, train_y_all_ohe, batch_sz=32, test_ratio=None)
model_adam_acc = model_adam.evaluate(ds_xy)
3875/3875 [==============================] - 16s 4ms/step - loss: 0.0231 - categorical_accuracy: 0.9948
CPU times: user 25.5 s, sys: 3.48 s, total: 29 s
Wall time: 17.1 s
print("Accuracy model: ",model_adam_acc[1])
print("Loss model: ",model_adam_acc[0])
Accuracy model: 0.9947658181190491
Loss model: 0.023062463849782944
%%time
model_adam_pca, model_adam_hist_pca = train_model(clf_adam(
shape_x=train_pca_x_all.shape[1], input_bias = tf_input_bias_pca),
train_pca_x_all,
train_y_all_ohe)
CPU times: user 20min 45s, sys: 1min 55s, total: 22min 41s Wall time: 12min 55s
%%time
# Plot accuracy and loss for 231 PCA features
fig, ax = plt.subplots(figsize=(18,12))
ax = sns.lineplot(data=model_adam_hist_pca.set_index("epochs"), ax = ax)
ax.set_title(" Accuracy and Loss for 231 PCA features.", fontsize=16,
fontweight="bold")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# Set major ticks
major_xticks = np.arange(0,(divmod(model_adam_hist_pca.shape[0], 50)[0]+1)*50, 50)
major_yticks = np.arange(0,1.1, 0.1)
# Set minor ticks
minor_xticks = np.arange(0,(divmod(model_adam_hist_pca.shape[0], 50)[0]+1)*50, 10)
minor_yticks = np.arange(0,1.1, 0.02)
# Define major and minor tick parameters
ax.tick_params(axis = 'both', which = 'major', labelsize = 18)
ax.tick_params(axis = 'both', which = 'minor', labelsize = 12)
ax.set_xticks(major_xticks)
ax.set_xticks(minor_xticks, minor = True)
ax.set_yticks(major_yticks);
ax.set_yticks(minor_yticks, minor = True);
ax.grid(visible=True, which="both", axis="both")
# ax labels
ax.set_ylabel('Accuracy/Loss', fontsize=12, fontweight="bold")
ax.set_xlabel("Epochs", fontsize=12, fontweight="bold")
legend = ax.legend(shadow=True, fontsize=15)
CPU times: user 357 ms, sys: 106 ms, total: 463 ms Wall time: 323 ms
Evaluate PCA model
%%time
ds_pca_xy = npds_train_test_split(train_pca_x_all, train_y_all_ohe, batch_sz=32,
test_ratio=None)
model_adam_pca_acc = model_adam_pca.evaluate(ds_pca_xy)
3875/3875 [==============================] - 16s 4ms/step - loss: 0.0257 - categorical_accuracy: 0.9942
CPU times: user 24.9 s, sys: 3.27 s, total: 28.2 s
Wall time: 16.8 s
print("Accuracy PCA model: ",model_adam_pca_acc[1])
print("Loss PCA model: ",model_adam_pca_acc[0])
Accuracy PCA model: 0.9941851496696472
Loss PCA model: 0.02573591098189354
View the final PCA model:
model_adam_pca.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_6 (Dense)              (None, 231)               53592
dropout_5 (Dropout)          (None, 231)               0
batch_normalization_5 (Batch (None, 231)               924
dense_7 (Dense)              (None, 72)                16704
dropout_6 (Dropout)          (None, 72)                0
batch_normalization_6 (Batch (None, 72)                288
dense_8 (Dense)              (None, 72)                5256
dropout_7 (Dropout)          (None, 72)                0
batch_normalization_7 (Batch (None, 72)                288
dense_9 (Dense)              (None, 72)                5256
dropout_8 (Dropout)          (None, 72)                0
batch_normalization_8 (Batch (None, 72)                288
dense_10 (Dense)             (None, 72)                5256
dropout_9 (Dropout)          (None, 72)                0
batch_normalization_9 (Batch (None, 72)                288
dense_11 (Dense)             (None, 10)                730
activation_1 (Activation)    (None, 10)                0
=================================================================
Total params: 88,870
Trainable params: 87,832
Non-trainable params: 1,038
_________________________________________________________________
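The parameter counts in the summary can be verified by hand: a Dense layer holds in_features x units weights plus one bias per unit, and a BatchNormalization layer holds 4 parameters per feature (gamma, beta, moving mean, moving variance; only the first two are trainable). A quick check against the first two layers of the summary:

```python
# Dense layer: weights (in_features x units) plus one bias per unit
in_features, units = 231, 231
dense_params = in_features * units + units
print(dense_params)  # 53592, matching dense_6

# BatchNormalization: 4 parameters per feature
bn_params = 4 * 231
print(bn_params)  # 924, matching batch_normalization_5
```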
For predictions I use the model with PCA.
%%time
y_predict = model_adam_pca.predict(npscaler(test_pca_x_all))
# Convert continuous probabilities to binary one-hot rows via argmax.
# Note: assigning to the loop variable in "for row in y_predict" would not
# modify the array, so the conversion is done in one vectorized step.
y_predict = np.where(y_predict < y_predict.max(axis=1, keepdims=True), 0, 1)
y_predict_char = ohe.inverse_transform(y_predict)
# Create pandas dataframe submission
test_drop_columns = list(test.columns)
sample_submission = test.drop(labels=test_drop_columns, axis=1)
sample_submission["target"] = y_predict_char
sample_submission.to_csv("/home/mvg/Documents/Ewps/ML_DL/DL/TPS_022022/data/sample_submission_target.csv")
sample_submission.head(10)
CPU times: user 10.8 s, sys: 856 ms, total: 11.7 s Wall time: 8.19 s
| row_id | target |
|---|---|
| 200000 | Escherichia_fergusonii |
| 200001 | Salmonella_enterica |
| 200002 | Bacteroides_fragilis |
| 200003 | Salmonella_enterica |
| 200004 | Staphylococcus_aureus |
| 200005 | Escherichia_fergusonii |
| 200006 | Salmonella_enterica |
| 200007 | Enterococcus_hirae |
| 200008 | Streptococcus_pneumoniae |
| 200009 | Bacteroides_fragilis |
Created on May 07, 2022
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment: Debian 11, Python 3.8.13, TensorFlow 2.5.3, CUDA 11.3 on localhost.
See the required and imported Python modules in cell No. 1.