Predicting Adverse Drug Reactions With Machine Learning

In [3]:

# Misc
from sklearn.model_selection import train_test_split
from pprint import pprint
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline

# Functions
from mlprocess import *
from params_by_label import *

# Fixing the seed
seed = 6
np.random.seed(seed)

""" Careful when importing saved dataframes with label names as indexes -> index = True """

Out[3]:

' Careful when importing saved dataframes with label names as indexes -> index = True '

1. Introduction

The objective of this work is to develop machine learning (ML) methods that can accurately predict adverse drug reactions (ADRs) using databases like SIDER and OFFSIDES.

2. Methods

2.1 Methods

One of the most important factors when using ML methods are the datasets used to train, validate and test the model. In this work, 3 different ones will be used at different stages, shown in table 1.

Dataset	Description
SIDER 4	1427 Approved drugs with ADRs text-mined from drug package inserts grouped into 27 system organ classes following MedDRA classification
OFFSIDES	Database of off-label side effects

When pre-processing OFFSIDES, ADRs were grouped by system organ classes following MedDRA classification and SMILES strings were obtained from PubChem using the REST API and the STITCH IDs of the compounds.

2.2 Features

Features are the set of attributes associated with the example that try to represent the dataset.

SMILES strings are commonly used to represent molecules, as is used in SIDER, and they will be the basis of this work. But, even though they are a unique representation of molecules, they are not enough to use as a feature in ML. Because of this, they will be used as a way to generate other features like fingerprints and molecular descriptors using tools like RDKit in Python. The general workflow for the datasets when in SIDER format is displayed in the following figure.

The SIDER dataset consists of a first column with the molecules' SMILES representation and twenty-seven other columns with the different SOCs. Three of these SOCs were not used since they had no real connection with the molecule and, as such, the development of ML models to predict these labels was not useful; these were 'Product Issues', 'Investigations', and 'Social circumstances'.

With the SMILES representation, it was possible to create multiple different features using RDKit, mainly fingerprints and other descriptors, for example, molecular weight, number of radical electrons, and number of valence electrons. We used these to add relevant information that complements the fingerprint.

In total, 27 descriptors were calculated for each molecule; not every descriptor was useful and, as such, some selection was required. But, since we had 24 different classification tasks, each with an independent model, and different descriptors had a different importance for each of them, this selection was done independently for each task, which resulted in 24 different DataFrames consisting in the fingerprint representation plus the 3 (after testing different values) descriptors selected for each task. An example is shown in following table:

This selection was done using the SelectKBest function from scikit-learn with ANOVA as the statistical test. When transforming OFFSIDES and after getting the SMILES from the STITCH IDs, the process is the same as described before.

2.3 Machine Learning Methods

Supervised learning is the most common ML scenario in chemoinformatics, and can be subdivided into classification and regression problems. In this type of learning, the training data has the outcome variable to guide the learning process. The objective of this type of learning is to predict the value of an outcome or to classify it. The tested models were:

2.3.1 Support Vector Machine (SVM)

One of the most popular ML methods. It maps the data into a high-dimensional space, using a non-linear kernel function, in order to optimally separate the classes. This separation is done by maximizing the margin between the closest points of the classes, support vectors, to the decision boundary, a hyperplane.

2.3.2 Random Forest (RF)

Tries to give a classification based on an ensemble of decision trees built based on the training data. It is an ensemble of tree predictors where each tree is independently constructed by using bootstrap samples of the training data and random feature selection.

After the RF is built, a prediction is made by a majority vote or averaging the predictions of all the trees.

2.3.3 Gradient Boosted Trees (GBT)

Similar to RF, as it is also an ensemble prediction method but the trees are not independent. This comes from the fact that, in GBT, at each iteration, the respective tree is constructed by fitting a simple function to current residuals.

The models tested and optimized were SVC (classification implementation of SVM) and Random Forest using scikit-learn, and Gradient Boosted Trees with XGBoost.

2.3.4 Model Development

As it is possible to see mext, the percentage of positives is very different from label to label. Because of this, the workflow for each model was base evaluation of the base model trained on the original dataset using cross-validation, followed by cross-validation with oversampling of the minority class, followed by hyperparameter optimization using random and grid search, followed by a final validation with the optimized parameters and oversampling. This process is shown next:

In [5]:

# Creating base df_molecules, df_y with the results vectors, and df_mols_descr with the descriptors
print("Creating Dataframes")
y_all, df_molecules = create_original_df(write_s=False)
df_molecules.drop("smiles", axis=1, inplace=True)
todrop = ["Product issues", "Investigations", "Social circumstances"]
y_all.drop(todrop, axis=1, inplace=True)  # No real connection with the molecule, multiple problems
out_names = y_all.columns.tolist()  # Get class labels

# Separating in a DF_mols_train and an Df_mols_test, in order to avoid data snooping and fitting the model to the test
df_mols_train, df_mols_test, y_train, y_test = train_test_split(df_molecules, y_all, test_size=0.2, random_state=seed)

Creating Dataframes

In [6]:

d = {"Positives": y_all.sum(axis=0), "Negatives": 1427 - y_all.sum(axis=0)}
countsm = pd.DataFrame(data=d)
countsm.plot(kind='bar', figsize=(16, 10), title="Adverse Drug Reactions Counts", ylim=(0, 1500), stacked=True)

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x2be87601208>

ML model development and validation:

After replicating this process for the 3 models, the best one for each label was selected and tested with the test dataset.

2.3.5 Cross-Validation (cv)

This process was done using stratified k-fold so that each set contains approximately the same percentage of a sample of each target class as the complete set.

2.3.4 Class Balancing

As is seen in figure previous figure, one of the steps when developing the ML models was balancing the dataset. This can be necessary when the classification categories are not approximately equally represented.

Class imbalance can, usually, be dealt with by re-sample the dataset, either by over-sampling the minority class and/or under-sampling the majority class. In this work, over-sampling was used, specifically an extension of Synthetic Minority Over-sampling TEchnique (SMOTE) with the imbalanced-learn package, SMOTE-NC.

With SMOTE, the minority class is over-sampled by introducing synthetic examples along the line segments joining k minority class nearest neighbours. SMOTE-NC adapts this strategy by doing something specifically for the categorical features. When generating a new sample, it picks the most frequent category of the nearest neighbours present for these features.

There are some dangers when combining class balancing with cross-validation. In order to keep the validation process valid, balancing should not be done before separating train and validation for each fold. That is, we start the cross-validation process, dividing the training set in training and validation, and only then do we over-sample the minority class (when employing over-sampling). We do this for every iteration of the process. This process will minimize possible overfitting and a change in the test distribution that would result in misleading results.

2.4 Metrics

Performance measures for classification are typically based on the confusion matrix.

In this matrix, TN are the true negatives, TP are the true positives, FP are the false positives, and FN are the false negatives.

Using this format it is possible to calculate other metrics in order to evaluate the quality of a model's predictions. In order to have a better idea of how well a model works, different metrics should be used. In this work, it was used: Recall, Precision, Average Precision, Area Under the Receiver Operating Characteristic (AUROC), and different variations of F1 score.

Precision is the ability of the classifier not to label as positive a sample that is negative and is defined by:

$$Precision = \frac{\text{TP}}{\text{TP + FP} }$$

Recall is the ability of the classifier to label as positive a sample that is positive and is defined by:

$$Recall = \frac{\text{TP}}{\text{TP + FN} }$$

Average Precision summarizes a precision-recall curve as the weighted mean of precisions archived at each threshold $(P_{n})$. The increase in recall $(R_{n})$ from the previous threshold $(R_{n-1})$ is used as the weight. It is defined by:

$$AP = \sum_{n} \left ( R_{n} - R_{n-1} \right ) P_{n}$$

AUROC is the area under the receiving operating characteristic (ROC) curve. This curve is created by plotting the fraction of TP out of the actual positives against the fraction of FP out of the actual negatives, at different thresholds. \

F1 Score is a weighted average of the precision and recall and is defined by:

$$F1 = \frac{2 \cdot precision\cdot recall}{precision + recall}$$

In this work, three types of F1-score were used. F1 binary, also represented as F1, is the F1 Score with respect only to the positive label. F1 Macro Score is the unweighted mean between both positive and negative labels. F1 Micro Score uses global TP, FN and FP and is equivalent to the accuracy metric in a binary classification task.

During this work, Average Precision, Recall and the different F1 Scores will be the main metrics used to evaluate and develop the model since they deal with imbalanced datasets better than AUROC.

3. Results and Discussion

In this work 24 models were studied, one for each SOC. Before any testing, the dataset was split in train and test and all validation and optimization tasks were done using the first, in order to prevent any type of test overfitting.

Something to have in mind when evaluating these results is the imbalance of the test dataset which is a consequence of the same imbalance of the original dataset. This greatly affects the metrics, mainly the precision and all metrics that derive from it since having a big majority of positive tests will always result in high precision scores.

3.1 Feature Generation and Selection

The first step was to choose a fingerprint type and its length. The tested possibilities were ECFP-4, MACCS key, Atom Pairs and Topological Torsion. For each of these types, different lengths between 100 and 2048 were tested and the different metrics were calculated in order to pick the best combination. In order to simplify this process, the different combinations were tested using 10-fold cross-validation with SVC and only to the label 'Hepatobiliary disorders'. The results are displayed next:

In [2]:

# Fingerprint length
all_df_results_svc = test_fingerprint_size(df_mols_train, y_train, SVC(gamma="scale", random_state=seed), makeplots=True, write=False)
# Best result with ECFP-4 at 1125 - This will be used to all results

Creating Dataframes

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [08:46<00:00, 40.24s/it]

After analysing the results, the performance appeared to be similar across the board with a small advantage when using ECFP-4. In this fingerprint type a small peak of performance was seen at length 1125, so that was the chosen combination.

In [3]:

# Create X datasets with fingerprint length
X_all, _, _, _ = createfingerprints(df_molecules, length=1125)
X_train_fp, _, _, _ = createfingerprints(df_mols_train, length=1125)
X_test_fp, _, _, _ = createfingerprints(df_mols_test, length=1125)

# Selects and create descriptors dataset
df_desc = createdescriptors(df_molecules)  # Create all descriptors

# Splits in train and test
df_desc_base_train, df_desc_base_test = train_test_split(df_desc, test_size=0.2, random_state=seed)

# Creates a dictionary with key = class label and value = dataframe with fingerprint + best K descriptors for that label
X_train_dic, X_test_dic, selected_cols = create_dataframes_dic(df_desc_base_train, df_desc_base_test, X_train_fp,
                                                               X_test_fp, y_train, out_names, score_func=f_classif, k=3)
# Creates a y dictionary for all labels
y_train_dic = {name: y_train[name] for name in out_names}
modelnamesvc = {name: "SVC" for name in out_names}
modelnamerf = {name: "RF" for name in out_names}
modelnamexgb = {name: "XGB" for name in out_names}
modelnamevot = {name: "VotingClassifier" for name in out_names}

100%|█████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 114.51it/s]

In [12]:

print("Selected descriptors by label:")
pprint(selected_cols, width=-1)

Selected descriptors by label:
{'Blood and lymphatic system disorders': ['fracsp33',
                                          'aliphcarbocycles',
                                          'numsatcarbcycles'],
 'Cardiac disorders': ['arocarbocycles',
                       'arorings',
                       'numsatcarbcycles'],
 'Congenital, familial and genetic disorders': ['maxabspartcharge',
                                                'numrade',
                                                'arocarbocycles'],
 'Ear and labyrinth disorders': ['maxpartcharge',
                                 'aliphcarbocycles',
                                 'numsatcarbcycles'],
 'Endocrine disorders': ['numrade',
                         'aliphcarbocycles',
                         'numsatcarbcycles'],
 'Eye disorders': ['nhohcount',
                   'numhdonors',
                   'numhatoms'],
 'Gastrointestinal disorders': ['aliphcarbocycles',
                                'arorings',
                                'numsatcarbcycles'],
 'General disorders and administration site conditions': ['numrade',
                                                          'aliphcarbocycles',
                                                          'numsatcarbcycles'],
 'Hepatobiliary disorders': ['arohetcycles',
                             'arorings',
                             'numhdonors'],
 'Immune system disorders': ['aliphhetcycles',
                             'numhacceptors',
                             'numsathetcycles'],
 'Infections and infestations': ['aliphrings',
                                 'numsatrings',
                                 'ringcount'],
 'Injury, poisoning and procedural complications': ['arohetcycles',
                                                    'arorings',
                                                    'ringcount'],
 'Metabolism and nutrition disorders': ['arohetcycles',
                                        'arorings',
                                        'ringcount'],
 'Musculoskeletal and connective tissue disorders': ['arohetcycles',
                                                     'arorings',
                                                     'ringcount'],
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': ['aliphcarbocycles',
                                                                         'arohetcycles',
                                                                         'ringcount'],
 'Nervous system disorders': ['arocarbocycles',
                              'arorings',
                              'ringcount'],
 'Pregnancy, puerperium and perinatal conditions': ['fracsp33',
                                                    'aliphcarbocycles',
                                                    'aliphrings'],
 'Psychiatric disorders': ['nhohcount',
                           'arohetcycles',
                           'numsatcarbcycles'],
 'Renal and urinary disorders': ['aliphcarbocycles',
                                 'arorings',
                                 'numsatcarbcycles'],
 'Reproductive system and breast disorders': ['aliphcarbocycles',
                                              'aliphrings',
                                              'ringcount'],
 'Respiratory, thoracic and mediastinal disorders': ['fracsp33',
                                                     'aliphcarbocycles',
                                                     'numsatcarbcycles'],
 'Skin and subcutaneous tissue disorders': ['numrade',
                                            'aliphrings',
                                            'ringcount'],
 'Surgical and medical procedures': ['maxpartcharge',
                                     'numval',
                                     'nhohcount'],
 'Vascular disorders': ['aliphhetcycles',
                        'arorings',
                        'ringcount']}

3.2 Model Development

3.2.1 SVC

As was said previously the first step was to test if the oversampling of the dataset improved the model.

In [14]:

print("SVC")
print("Base SVC without balancing:")
base_svc_report = cv_multi_report(X_train_dic, y_train, out_names, SVC(gamma="auto", random_state=seed), n_splits=5,
                                  n_jobs=-2, verbose=False)

SVC
Base SVC without balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [01:13<00:00,  2.84s/it]

In [18]:

print("Scores for SVC without balancing:")
base_svc_report

Scores for SVC without balancing:

Out[18]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.712	0.643	0.622	0.676	0.851	0.612	0.660
Metabolism and nutrition disorders	0.818	0.692	0.409	0.565	1.000	0.692	0.732
Eye disorders	0.763	0.624	0.430	0.597	0.984	0.622	0.687
Musculoskeletal and connective tissue disorders	0.815	0.687	0.407	0.648	1.000	0.687	0.802
Gastrointestinal disorders	0.951	0.907	0.476	0.672	1.000	0.907	0.945
Immune system disorders	0.841	0.726	0.421	0.559	1.000	0.726	0.777
Reproductive system and breast disorders	0.609	0.608	0.608	0.652	0.611	0.608	0.625
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.000	0.745	0.427	0.679	0.000	0.000	0.470
General disorders and administration site conditions	0.953	0.910	0.476	0.603	1.000	0.910	0.938
Endocrine disorders	0.000	0.786	0.440	0.631	0.000	0.000	0.391
Surgical and medical procedures	0.000	0.855	0.461	0.527	0.000	0.000	0.187
Vascular disorders	0.869	0.769	0.435	0.577	1.000	0.769	0.817
Blood and lymphatic system disorders	0.766	0.633	0.458	0.661	0.981	0.628	0.742
Skin and subcutaneous tissue disorders	0.962	0.927	0.481	0.593	1.000	0.927	0.949
Congenital, familial and genetic disorders	0.000	0.824	0.452	0.585	0.000	0.000	0.245
Infections and infestations	0.830	0.710	0.415	0.655	1.000	0.710	0.815
Respiratory, thoracic and mediastinal disorders	0.856	0.748	0.428	0.558	1.000	0.748	0.793
Psychiatric disorders	0.823	0.699	0.412	0.618	1.000	0.699	0.773
Renal and urinary disorders	0.781	0.641	0.390	0.645	1.000	0.641	0.756
Pregnancy, puerperium and perinatal conditions	0.000	0.915	0.478	0.514	0.000	0.000	0.129
Ear and labyrinth disorders	0.000	0.552	0.356	0.620	0.000	0.000	0.577
Cardiac disorders	0.819	0.693	0.409	0.641	1.000	0.693	0.795
Nervous system disorders	0.958	0.919	0.479	0.633	1.000	0.919	0.954
Injury, poisoning and procedural complications	0.799	0.665	0.399	0.568	1.000	0.665	0.730

In [19]:

print("Base SVC with balancing:")
base_bal_svc_report = cv_multi_report(X_train_dic, y_train, out_names, SVC(gamma="auto", random_state=seed),
                                      balancing=True, n_splits=5, n_jobs=-2, verbose=False, random_state=seed)

Base SVC with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [21:26<00:00, 53.81s/it]

In [28]:

print("Scores for SVC with balancing:")
base_bal_svc_report

Scores for SVC with balancing:

Out[28]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.654	0.628	0.624	0.675	0.678	0.633	0.655
Metabolism and nutrition disorders	0.701	0.603	0.553	0.587	0.676	0.730	0.758
Eye disorders	0.702	0.607	0.560	0.608	0.754	0.658	0.699
Musculoskeletal and connective tissue disorders	0.742	0.646	0.588	0.633	0.741	0.744	0.781
Gastrointestinal disorders	0.905	0.833	0.602	0.720	0.880	0.932	0.956
Immune system disorders	0.776	0.664	0.553	0.594	0.800	0.754	0.791
Reproductive system and breast disorders	0.607	0.607	0.607	0.652	0.608	0.608	0.625
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.331	0.640	0.542	0.584	0.350	0.314	0.331
General disorders and administration site conditions	0.902	0.825	0.524	0.577	0.890	0.915	0.928
Endocrine disorders	0.262	0.698	0.536	0.557	0.250	0.279	0.277
Surgical and medical procedures	0.253	0.591	0.484	0.553	0.475	0.173	0.177
Vascular disorders	0.795	0.690	0.581	0.617	0.780	0.810	0.825
Blood and lymphatic system disorders	0.727	0.651	0.620	0.658	0.759	0.699	0.737
Skin and subcutaneous tissue disorders	0.922	0.859	0.591	0.670	0.900	0.945	0.953
Congenital, familial and genetic disorders	0.177	0.671	0.486	0.501	0.199	0.160	0.192
Infections and infestations	0.743	0.651	0.599	0.657	0.711	0.779	0.819
Respiratory, thoracic and mediastinal disorders	0.797	0.689	0.563	0.596	0.817	0.779	0.805
Psychiatric disorders	0.773	0.673	0.593	0.624	0.796	0.752	0.772
Renal and urinary disorders	0.731	0.654	0.622	0.646	0.731	0.733	0.741
Pregnancy, puerperium and perinatal conditions	0.131	0.797	0.508	0.509	0.185	0.103	0.114
Ear and labyrinth disorders	0.623	0.506	0.452	0.541	0.912	0.473	0.473
Cardiac disorders	0.755	0.660	0.599	0.635	0.756	0.755	0.772
Nervous system disorders	0.910	0.840	0.594	0.733	0.882	0.941	0.964
Injury, poisoning and procedural complications	0.658	0.576	0.549	0.597	0.614	0.710	0.743

In [21]:

diff_bal_svc = base_bal_svc_report - base_svc_report
print("Changes in scores after balancing:")
diff_bal_svc

Changes in scores after balancing:

Out[21]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	-0.058	-0.015	0.002	-0.001	-0.173	0.021	-0.005
Metabolism and nutrition disorders	-0.117	-0.089	0.144	0.022	-0.324	0.038	0.026
Eye disorders	-0.061	-0.017	0.130	0.011	-0.230	0.036	0.012
Musculoskeletal and connective tissue disorders	-0.073	-0.041	0.181	-0.015	-0.259	0.057	-0.021
Gastrointestinal disorders	-0.046	-0.074	0.126	0.048	-0.120	0.025	0.011
Immune system disorders	-0.065	-0.062	0.132	0.035	-0.200	0.028	0.014
Reproductive system and breast disorders	-0.002	-0.001	-0.001	0.000	-0.003	0.000	0.000
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.331	-0.105	0.115	-0.095	0.350	0.314	-0.139
General disorders and administration site conditions	-0.051	-0.085	0.048	-0.026	-0.110	0.005	-0.010
Endocrine disorders	0.262	-0.088	0.096	-0.074	0.250	0.279	-0.114
Surgical and medical procedures	0.253	-0.264	0.023	0.026	0.475	0.173	-0.010
Vascular disorders	-0.074	-0.079	0.146	0.040	-0.220	0.041	0.008
Blood and lymphatic system disorders	-0.039	0.018	0.162	-0.003	-0.222	0.071	-0.005
Skin and subcutaneous tissue disorders	-0.040	-0.068	0.110	0.077	-0.100	0.018	0.004
Congenital, familial and genetic disorders	0.177	-0.153	0.034	-0.084	0.199	0.160	-0.053
Infections and infestations	-0.087	-0.059	0.184	0.002	-0.289	0.069	0.004
Respiratory, thoracic and mediastinal disorders	-0.059	-0.059	0.135	0.038	-0.183	0.031	0.012
Psychiatric disorders	-0.050	-0.026	0.181	0.006	-0.204	0.053	-0.001
Renal and urinary disorders	-0.050	0.013	0.232	0.001	-0.269	0.092	-0.015
Pregnancy, puerperium and perinatal conditions	0.131	-0.118	0.030	-0.005	0.185	0.103	-0.015
Ear and labyrinth disorders	0.623	-0.046	0.096	-0.079	0.912	0.473	-0.104
Cardiac disorders	-0.064	-0.033	0.190	-0.006	-0.244	0.062	-0.023
Nervous system disorders	-0.048	-0.079	0.115	0.100	-0.118	0.022	0.010
Injury, poisoning and procedural complications	-0.141	-0.089	0.150	0.029	-0.386	0.045	0.013

As we can see by some of the results that are displayed in the previous tables, even though Average Precision didn't change by much, the biggest change of all the metrics was in the F1 Macro score. As this is the unweighted mean of F1 score to both negative and positive classes, we can conclude that an improvement of the F1 score in the minority class was the main consequence of the oversampling, as was to be expected.

With these positive results, oversampling was applied when developing every model.

The next step in the model development was hyperparameter optimization. In SVC, two parameters were optimized using grid search with cross-validation - C and gamma - while using a Radial Basis Function (RBF) kernel, where C is the cost of misclassifying training examples and gamma is a specific parameter of RBF kernel and can be seen as the inverse of the radius of influence of the samples selected as support vectors.

In [26]:

# Searching best parameters
params_to_test = {"svc__kernel": ["rbf"], "svc__C": [0.01, 0.1, 1, 10],
                  "svc__gamma": [0.001, 0.01, 0.1, 1]}
d_params_to_test = {name: params_to_test for name in out_names}

"""The following code was previously executed and its output was saved"""
#best_SVC_params_by_label = multi_label_grid_search(X_train_dic, y_train, out_names[15:],
#                                                   SVC(gamma="auto", random_state=seed), d_params_to_test,
#                                                   balancing=True, n_splits=5, scoring="f1_micro", n_jobs=-2,
#                                                   verbose=True, random_state=seed)
pprint(best_SVC_params_by_label, width=-1)

{'Blood and lymphatic system disorders': {'svc__C': 10,
                                          'svc__gamma': 0.01,
                                          'svc__kernel': 'rbf'},
 'Cardiac disorders': {'svc__C': 10,
                       'svc__gamma': 0.1,
                       'svc__kernel': 'rbf'},
 'Congenital, familial and genetic disorders': {'svc__C': 0.01,
                                                'svc__gamma': 1,
                                                'svc__kernel': 'rbf'},
 'Ear and labyrinth disorders': {'svc__C': 10,
                                 'svc__gamma': 0.01,
                                 'svc__kernel': 'rbf'},
 'Endocrine disorders': {'svc__C': 0.01,
                         'svc__gamma': 1,
                         'svc__kernel': 'rbf'},
 'Eye disorders': {'svc__C': 10,
                   'svc__gamma': 0.1,
                   'svc__kernel': 'rbf'},
 'Gastrointestinal disorders': {'svc__C': 0.01,
                                'svc__gamma': 1,
                                'svc__kernel': 'rbf'},
 'General disorders and administration site conditions': {'svc__C': 0.01,
                                                          'svc__gamma': 1,
                                                          'svc__kernel': 'rbf'},
 'Hepatobiliary disorders': {'svc__C': 1,
                             'svc__gamma': 0.01,
                             'svc__kernel': 'rbf'},
 'Immune system disorders': {'svc__C': 1,
                             'svc__gamma': 0.1,
                             'svc__kernel': 'rbf'},
 'Infections and infestations': {'svc__C': 0.01,
                                 'svc__gamma': 1,
                                 'svc__kernel': 'rbf'},
 'Injury, poisoning and procedural complications': {'svc__C': 1,
                                                    'svc__gamma': 0.1,
                                                    'svc__kernel': 'rbf'},
 'Metabolism and nutrition disorders': {'svc__C': 1,
                                        'svc__gamma': 0.1,
                                        'svc__kernel': 'rbf'},
 'Musculoskeletal and connective tissue disorders': {'svc__C': 10,
                                                     'svc__gamma': 0.1,
                                                     'svc__kernel': 'rbf'},
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': {'svc__C': 0.01,
                                                                         'svc__gamma': 1,
                                                                         'svc__kernel': 'rbf'},
 'Nervous system disorders': {'svc__C': 1,
                              'svc__gamma': 1,
                              'svc__kernel': 'rbf'},
 'Pregnancy, puerperium and perinatal conditions': {'svc__C': 0.01,
                                                    'svc__gamma': 1,
                                                    'svc__kernel': 'rbf'},
 'Psychiatric disorders': {'svc__C': 1,
                           'svc__gamma': 1,
                           'svc__kernel': 'rbf'},
 'Renal and urinary disorders': {'svc__C': 1,
                                 'svc__gamma': 0.01,
                                 'svc__kernel': 'rbf'},
 'Reproductive system and breast disorders': {'svc__C': 10,
                                              'svc__gamma': 0.01,
                                              'svc__kernel': 'rbf'},
 'Respiratory, thoracic and mediastinal disorders': {'svc__C': 10,
                                                     'svc__gamma': 1,
                                                     'svc__kernel': 'rbf'},
 'Skin and subcutaneous tissue disorders': {'svc__C': 0.01,
                                            'svc__gamma': 1,
                                            'svc__kernel': 'rbf'},
 'Surgical and medical procedures': {'svc__C': 0.01,
                                     'svc__gamma': 0.1,
                                     'svc__kernel': 'rbf'},
 'Vascular disorders': {'svc__C': 10,
                        'svc__gamma': 0.1,
                        'svc__kernel': 'rbf'}}

In [27]:

print("Improved SVC with balancing:")
impr_bal_svc_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamesvc,
                                      spec_params=best_SVC_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
                                      verbose=False, random_state=seed)

Improved SVC with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [26:49<00:00, 65.60s/it]

In [30]:

print("Scores for optimized SVC with balancing:")
impr_bal_svc_report

Scores for optimized SVC with balancing:

Out[30]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.709	0.682	0.678	0.730	0.751	0.674	0.701
Metabolism and nutrition disorders	0.810	0.693	0.507	0.597	0.944	0.709	0.757
Eye disorders	0.754	0.637	0.532	0.633	0.906	0.646	0.720
Musculoskeletal and connective tissue disorders	0.812	0.704	0.553	0.669	0.932	0.720	0.798
Gastrointestinal disorders	0.951	0.907	0.476	0.617	1.000	0.907	0.929
Immune system disorders	0.839	0.734	0.531	0.612	0.958	0.747	0.793
Reproductive system and breast disorders	0.677	0.681	0.680	0.725	0.672	0.685	0.722
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.000	0.745	0.427	0.648	0.000	0.000	0.345
General disorders and administration site conditions	0.953	0.910	0.476	0.589	1.000	0.910	0.931
Endocrine disorders	0.000	0.786	0.440	0.574	0.000	0.000	0.262
Surgical and medical procedures	0.000	0.855	0.461	0.594	0.000	0.000	0.207
Vascular disorders	0.868	0.776	0.557	0.665	0.961	0.792	0.852
Blood and lymphatic system disorders	0.729	0.673	0.658	0.718	0.719	0.740	0.789
Skin and subcutaneous tissue disorders	0.962	0.927	0.481	0.626	1.000	0.927	0.941
Congenital, familial and genetic disorders	0.000	0.824	0.452	0.563	0.000	0.000	0.205
Infections and infestations	0.830	0.710	0.415	0.625	1.000	0.710	0.779
Respiratory, thoracic and mediastinal disorders	0.856	0.752	0.478	0.594	0.986	0.757	0.798
Psychiatric disorders	0.824	0.705	0.457	0.594	0.986	0.707	0.753
Renal and urinary disorders	0.757	0.670	0.620	0.673	0.804	0.715	0.767
Pregnancy, puerperium and perinatal conditions	0.019	0.916	0.488	0.504	0.010	0.200	0.107
Ear and labyrinth disorders	0.558	0.586	0.585	0.614	0.583	0.535	0.547
Cardiac disorders	0.809	0.695	0.523	0.688	0.934	0.714	0.820
Nervous system disorders	0.958	0.920	0.499	0.595	1.000	0.920	0.922
Injury, poisoning and procedural complications	0.790	0.669	0.505	0.614	0.934	0.684	0.754

3.2.2 Random Forest

The workflow to develop the RF model was the same as the SVC, as shown previously.

In [6]:

print("Random Forest")
print("Base RF without balancing:")
base_rf_report = cv_multi_report(X_train_dic, y_train, out_names,
                                 RandomForestClassifier(n_estimators=100, random_state=seed), n_splits=5, n_jobs=-2,
                                 verbose=False)

Random Forest
Base RF without balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [00:18<00:00,  1.53it/s]

In [7]:

print("Scores for RF without balancing:")
base_rf_report

Scores for RF without balancing:

Out[7]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.723	0.700	0.698	0.756	0.754	0.694	0.737
Metabolism and nutrition disorders	0.802	0.687	0.527	0.582	0.915	0.714	0.744
Eye disorders	0.747	0.660	0.613	0.655	0.823	0.685	0.733
Musculoskeletal and connective tissue disorders	0.809	0.707	0.590	0.662	0.902	0.734	0.799
Gastrointestinal disorders	0.949	0.904	0.524	0.704	0.990	0.911	0.948
Immune system disorders	0.821	0.707	0.509	0.605	0.925	0.738	0.796
Reproductive system and breast disorders	0.681	0.679	0.679	0.738	0.685	0.678	0.740
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.327	0.777	0.597	0.717	0.213	0.739	0.498
General disorders and administration site conditions	0.951	0.906	0.501	0.643	0.993	0.912	0.942
Endocrine disorders	0.263	0.792	0.571	0.693	0.172	0.568	0.425
Surgical and medical procedures	0.127	0.859	0.525	0.631	0.073	0.533	0.301
Vascular disorders	0.865	0.770	0.541	0.645	0.959	0.787	0.844
Blood and lymphatic system disorders	0.766	0.687	0.647	0.711	0.835	0.707	0.784
Skin and subcutaneous tissue disorders	0.961	0.925	0.502	0.603	0.995	0.929	0.942
Congenital, familial and genetic disorders	0.062	0.816	0.480	0.581	0.035	0.313	0.244
Infections and infestations	0.826	0.720	0.547	0.630	0.941	0.737	0.806
Respiratory, thoracic and mediastinal disorders	0.846	0.741	0.522	0.588	0.947	0.764	0.798
Psychiatric disorders	0.809	0.701	0.559	0.652	0.905	0.732	0.805
Renal and urinary disorders	0.777	0.680	0.606	0.693	0.869	0.703	0.795
Pregnancy, puerperium and perinatal conditions	0.107	0.911	0.530	0.529	0.062	0.400	0.176
Ear and labyrinth disorders	0.542	0.618	0.606	0.663	0.507	0.588	0.627
Cardiac disorders	0.808	0.702	0.569	0.670	0.906	0.730	0.812
Nervous system disorders	0.955	0.914	0.514	0.664	0.991	0.921	0.950
Injury, poisoning and procedural complications	0.776	0.660	0.532	0.601	0.888	0.690	0.753

In [8]:

print("Base RF with balancing:")
base_bal_rf_report = cv_multi_report(X_train_dic, y_train, out_names,
                                     RandomForestClassifier(n_estimators=100, random_state=seed), balancing=True,
                                     n_splits=5, n_jobs=-2, verbose=False, random_state=seed)

Base RF with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [19:11<00:00, 46.34s/it]

In [9]:

print("Scores for RF with balancing:")
base_bal_rf_report

Scores for RF with balancing:

Out[9]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.718	0.704	0.703	0.753	0.729	0.709	0.734
Metabolism and nutrition disorders	0.770	0.661	0.562	0.614	0.820	0.726	0.770
Eye disorders	0.735	0.658	0.626	0.671	0.777	0.698	0.747
Musculoskeletal and connective tissue disorders	0.781	0.685	0.608	0.671	0.818	0.749	0.809
Gastrointestinal disorders	0.942	0.892	0.597	0.732	0.962	0.922	0.953
Immune system disorders	0.804	0.699	0.576	0.640	0.851	0.761	0.815
Reproductive system and breast disorders	0.677	0.677	0.676	0.734	0.680	0.675	0.731
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.383	0.748	0.612	0.700	0.309	0.507	0.429
General disorders and administration site conditions	0.939	0.887	0.540	0.629	0.964	0.916	0.937
Endocrine disorders	0.310	0.776	0.588	0.664	0.234	0.474	0.366
Surgical and medical procedures	0.153	0.845	0.534	0.618	0.096	0.399	0.256
Vascular disorders	0.852	0.758	0.596	0.670	0.904	0.805	0.861
Blood and lymphatic system disorders	0.738	0.667	0.640	0.724	0.765	0.713	0.800
Skin and subcutaneous tissue disorders	0.952	0.909	0.579	0.627	0.967	0.937	0.948
Congenital, familial and genetic disorders	0.109	0.799	0.498	0.572	0.070	0.267	0.235
Infections and infestations	0.792	0.692	0.600	0.673	0.826	0.761	0.832
Respiratory, thoracic and mediastinal disorders	0.828	0.722	0.548	0.617	0.896	0.771	0.816
Psychiatric disorders	0.806	0.708	0.609	0.678	0.865	0.754	0.822
Renal and urinary disorders	0.762	0.670	0.613	0.695	0.822	0.710	0.800
Pregnancy, puerperium and perinatal conditions	0.175	0.911	0.564	0.560	0.113	0.410	0.192
Ear and labyrinth disorders	0.528	0.591	0.583	0.612	0.513	0.546	0.557
Cardiac disorders	0.795	0.698	0.607	0.708	0.848	0.749	0.844
Nervous system disorders	0.948	0.904	0.600	0.721	0.966	0.932	0.961
Injury, poisoning and procedural complications	0.753	0.652	0.581	0.624	0.796	0.715	0.768

It was possible to see that the base RF model performed not only better than the base SVC model for most of the classification tasks, it also out-performed the optimized SVC model in some of these. It also improved after oversampling the minority class.

The hyperparameter optimization in this model was done by cross-validation random search using at least 150 combinations of the parameters. The selection of random search instead of grid search was done because of the multiple possible combinations with every parameter.

In RF, six parameters were considered:
n_estimators - number of trees in the forest;
max_features - number of features to consider when looking for the best split;
max_depth - maximum depth of the tree;
min_samples_split - minimum number of samples required to split a node;
min_samples_leaf - minimum number of samples required to be a leaf node;
bootstrap - whether to use bootstrap samples or the whole dataset to build each tree.

In [10]:

# Searching best parameters
n_estimators = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
max_features = ["log2", "sqrt"]
max_depth = [50, 90, 130, 170, 210, 250]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2]
bootstrap = [True, False]
rf_grid = {"randomforestclassifier__n_estimators": n_estimators,
           "randomforestclassifier__max_features": max_features,
           "randomforestclassifier__max_depth": max_depth,
           "randomforestclassifier__min_samples_split": min_samples_split,
           "randomforestclassifier__min_samples_leaf": min_samples_leaf,
           "randomforestclassifier__bootstrap": bootstrap}
rf_grid_label = {name: rf_grid for name in out_names}

"""The following code was previously executed and its output was saved"""
# best_RF_params_by_label = multi_label_random_search(X_train_dic, y_train, out_names[20:],
#                                                     RandomForestClassifier(random_state=seed), rf_grid_label,
#                                                     balancing=True, n_splits=3, scoring="f1_micro", n_jobs=-2,
#                                                     verbose=True, random_state=seed, n_iter=150)
pprint(best_RF_params_by_label, width=-1)

{'Blood and lymphatic system disorders': {'randomforestclassifier__bootstrap': False,
                                          'randomforestclassifier__max_depth': 50,
                                          'randomforestclassifier__max_features': 'log2',
                                          'randomforestclassifier__min_samples_leaf': 2,
                                          'randomforestclassifier__min_samples_split': 10,
                                          'randomforestclassifier__n_estimators': 600},
 'Cardiac disorders': {'randomforestclassifier__bootstrap': False,
                       'randomforestclassifier__max_depth': 210,
                       'randomforestclassifier__max_features': 'sqrt',
                       'randomforestclassifier__min_samples_leaf': 1,
                       'randomforestclassifier__min_samples_split': 5,
                       'randomforestclassifier__n_estimators': 700},
 'Congenital, familial and genetic disorders': {'randomforestclassifier__bootstrap': False,
                                                'randomforestclassifier__max_depth': 170,
                                                'randomforestclassifier__max_features': 'sqrt',
                                                'randomforestclassifier__min_samples_leaf': 2,
                                                'randomforestclassifier__min_samples_split': 2,
                                                'randomforestclassifier__n_estimators': 300},
 'Ear and labyrinth disorders': {'randomforestclassifier__bootstrap': False,
                                 'randomforestclassifier__max_depth': 210,
                                 'randomforestclassifier__max_features': 'log2',
                                 'randomforestclassifier__min_samples_leaf': 1,
                                 'randomforestclassifier__min_samples_split': 10,
                                 'randomforestclassifier__n_estimators': 500},
 'Endocrine disorders': {'randomforestclassifier__bootstrap': False,
                         'randomforestclassifier__max_depth': 50,
                         'randomforestclassifier__max_features': 'sqrt',
                         'randomforestclassifier__min_samples_leaf': 1,
                         'randomforestclassifier__min_samples_split': 10,
                         'randomforestclassifier__n_estimators': 400},
 'Eye disorders': {'randomforestclassifier__bootstrap': False,
                   'randomforestclassifier__max_depth': 170,
                   'randomforestclassifier__max_features': 'log2',
                   'randomforestclassifier__min_samples_leaf': 1,
                   'randomforestclassifier__min_samples_split': 10,
                   'randomforestclassifier__n_estimators': 400},
 'Gastrointestinal disorders': {'randomforestclassifier__bootstrap': False,
                                'randomforestclassifier__max_depth': 90,
                                'randomforestclassifier__max_features': 'log2',
                                'randomforestclassifier__min_samples_leaf': 1,
                                'randomforestclassifier__min_samples_split': 10,
                                'randomforestclassifier__n_estimators': 900},
 'General disorders and administration site conditions': {'randomforestclassifier__bootstrap': False,
                                                          'randomforestclassifier__max_depth': 90,
                                                          'randomforestclassifier__max_features': 'log2',
                                                          'randomforestclassifier__min_samples_leaf': 1,
                                                          'randomforestclassifier__min_samples_split': 2,
                                                          'randomforestclassifier__n_estimators': 400},
 'Hepatobiliary disorders': {'randomforestclassifier__bootstrap': True,
                             'randomforestclassifier__max_depth': 250,
                             'randomforestclassifier__max_features': 'sqrt',
                             'randomforestclassifier__min_samples_leaf': 1,
                             'randomforestclassifier__min_samples_split': 10,
                             'randomforestclassifier__n_estimators': 200},
 'Immune system disorders': {'randomforestclassifier__bootstrap': True,
                             'randomforestclassifier__max_depth': 250,
                             'randomforestclassifier__max_features': 'sqrt',
                             'randomforestclassifier__min_samples_leaf': 1,
                             'randomforestclassifier__min_samples_split': 10,
                             'randomforestclassifier__n_estimators': 600},
 'Infections and infestations': {'randomforestclassifier__bootstrap': True,
                                 'randomforestclassifier__max_depth': 130,
                                 'randomforestclassifier__max_features': 'log2',
                                 'randomforestclassifier__min_samples_leaf': 1,
                                 'randomforestclassifier__min_samples_split': 5,
                                 'randomforestclassifier__n_estimators': 200},
 'Injury, poisoning and procedural complications': {'randomforestclassifier__bootstrap': True,
                                                    'randomforestclassifier__max_depth': 50,
                                                    'randomforestclassifier__max_features': 'log2',
                                                    'randomforestclassifier__min_samples_leaf': 1,
                                                    'randomforestclassifier__min_samples_split': 10,
                                                    'randomforestclassifier__n_estimators': 1000},
 'Metabolism and nutrition disorders': {'randomforestclassifier__bootstrap': False,
                                        'randomforestclassifier__max_depth': 90,
                                        'randomforestclassifier__max_features': 'sqrt',
                                        'randomforestclassifier__min_samples_leaf': 1,
                                        'randomforestclassifier__min_samples_split': 5,
                                        'randomforestclassifier__n_estimators': 400},
 'Musculoskeletal and connective tissue disorders': {'randomforestclassifier__bootstrap': False,
                                                     'randomforestclassifier__max_depth': 170,
                                                     'randomforestclassifier__max_features': 'log2',
                                                     'randomforestclassifier__min_samples_leaf': 1,
                                                     'randomforestclassifier__min_samples_split': 10,
                                                     'randomforestclassifier__n_estimators': 800},
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': {'randomforestclassifier__bootstrap': False,
                                                                         'randomforestclassifier__max_depth': 210,
                                                                         'randomforestclassifier__max_features': 'sqrt',
                                                                         'randomforestclassifier__min_samples_leaf': 1,
                                                                         'randomforestclassifier__min_samples_split': 10,
                                                                         'randomforestclassifier__n_estimators': 400},
 'Nervous system disorders': {'randomforestclassifier__bootstrap': False,
                              'randomforestclassifier__max_depth': 250,
                              'randomforestclassifier__max_features': 'sqrt',
                              'randomforestclassifier__min_samples_leaf': 1,
                              'randomforestclassifier__min_samples_split': 10,
                              'randomforestclassifier__n_estimators': 300},
 'Pregnancy, puerperium and perinatal conditions': {'randomforestclassifier__bootstrap': True,
                                                    'randomforestclassifier__max_depth': 210,
                                                    'randomforestclassifier__max_features': 'log2',
                                                    'randomforestclassifier__min_samples_leaf': 1,
                                                    'randomforestclassifier__min_samples_split': 2,
                                                    'randomforestclassifier__n_estimators': 700},
 'Psychiatric disorders': {'randomforestclassifier__bootstrap': False,
                           'randomforestclassifier__max_depth': 50,
                           'randomforestclassifier__max_features': 'log2',
                           'randomforestclassifier__min_samples_leaf': 1,
                           'randomforestclassifier__min_samples_split': 2,
                           'randomforestclassifier__n_estimators': 100},
 'Renal and urinary disorders': {'randomforestclassifier__bootstrap': False,
                                 'randomforestclassifier__max_depth': 170,
                                 'randomforestclassifier__max_features': 'log2',
                                 'randomforestclassifier__min_samples_leaf': 2,
                                 'randomforestclassifier__min_samples_split': 5,
                                 'randomforestclassifier__n_estimators': 500},
 'Reproductive system and breast disorders': {'randomforestclassifier__bootstrap': False,
                                              'randomforestclassifier__max_depth': 50,
                                              'randomforestclassifier__max_features': 'log2',
                                              'randomforestclassifier__min_samples_leaf': 1,
                                              'randomforestclassifier__min_samples_split': 10,
                                              'randomforestclassifier__n_estimators': 200},
 'Respiratory, thoracic and mediastinal disorders': {'randomforestclassifier__bootstrap': False,
                                                     'randomforestclassifier__max_depth': 210,
                                                     'randomforestclassifier__max_features': 'sqrt',
                                                     'randomforestclassifier__min_samples_leaf': 1,
                                                     'randomforestclassifier__min_samples_split': 10,
                                                     'randomforestclassifier__n_estimators': 100},
 'Skin and subcutaneous tissue disorders': {'randomforestclassifier__bootstrap': False,
                                            'randomforestclassifier__max_depth': 210,
                                            'randomforestclassifier__max_features': 'sqrt',
                                            'randomforestclassifier__min_samples_leaf': 1,
                                            'randomforestclassifier__min_samples_split': 5,
                                            'randomforestclassifier__n_estimators': 300},
 'Surgical and medical procedures': {'randomforestclassifier__bootstrap': False,
                                     'randomforestclassifier__max_depth': 210,
                                     'randomforestclassifier__max_features': 'sqrt',
                                     'randomforestclassifier__min_samples_leaf': 1,
                                     'randomforestclassifier__min_samples_split': 10,
                                     'randomforestclassifier__n_estimators': 800},
 'Vascular disorders': {'randomforestclassifier__bootstrap': False,
                        'randomforestclassifier__max_depth': 170,
                        'randomforestclassifier__max_features': 'log2',
                        'randomforestclassifier__min_samples_leaf': 1,
                        'randomforestclassifier__min_samples_split': 10,
                        'randomforestclassifier__n_estimators': 400}}

In [11]:

print("Improved RF with balancing:")
impr_bal_rf_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamerf,
                                     spec_params=best_RF_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
                                     verbose=False, random_state=seed)

Improved RF with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [18:50<00:00, 45.88s/it]

In [12]:

print("Scores for optimized RF with balancing:")
impr_bal_rf_report

Scores for optimized RF with balancing:

Out[12]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.721	0.698	0.695	0.756	0.754	0.691	0.742
Metabolism and nutrition disorders	0.776	0.666	0.557	0.610	0.838	0.724	0.761
Eye disorders	0.745	0.661	0.618	0.677	0.811	0.690	0.755
Musculoskeletal and connective tissue disorders	0.795	0.698	0.608	0.685	0.853	0.745	0.825
Gastrointestinal disorders	0.945	0.897	0.591	0.727	0.971	0.920	0.952
Immune system disorders	0.814	0.709	0.569	0.634	0.880	0.758	0.817
Reproductive system and breast disorders	0.684	0.679	0.679	0.740	0.695	0.674	0.752
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.390	0.767	0.623	0.714	0.295	0.581	0.444
General disorders and administration site conditions	0.941	0.890	0.554	0.643	0.965	0.918	0.943
Endocrine disorders	0.289	0.778	0.579	0.666	0.213	0.469	0.378
Surgical and medical procedures	0.123	0.851	0.521	0.645	0.072	0.427	0.277
Vascular disorders	0.856	0.761	0.570	0.682	0.927	0.796	0.872
Blood and lymphatic system disorders	0.739	0.664	0.634	0.727	0.775	0.707	0.815
Skin and subcutaneous tissue disorders	0.954	0.912	0.583	0.614	0.971	0.937	0.949
Congenital, familial and genetic disorders	0.089	0.804	0.490	0.563	0.055	0.246	0.227
Infections and infestations	0.798	0.697	0.595	0.672	0.843	0.757	0.833
Respiratory, thoracic and mediastinal disorders	0.839	0.736	0.551	0.613	0.919	0.772	0.819
Psychiatric disorders	0.800	0.699	0.594	0.663	0.861	0.747	0.810
Renal and urinary disorders	0.767	0.677	0.622	0.698	0.826	0.715	0.798
Pregnancy, puerperium and perinatal conditions	0.165	0.913	0.560	0.534	0.103	0.460	0.195
Ear and labyrinth disorders	0.528	0.596	0.587	0.635	0.505	0.553	0.595
Cardiac disorders	0.807	0.710	0.609	0.718	0.876	0.749	0.852
Nervous system disorders	0.952	0.910	0.591	0.708	0.975	0.930	0.958
Injury, poisoning and procedural complications	0.766	0.662	0.575	0.627	0.834	0.710	0.771

3.2.3 XGBoost Model (XGB)

The same workflow as before was applied to XGB.

In [13]:

print("XGB")
print("Base XGB without balancing:")
base_xgb_report = cv_multi_report(X_train_dic, y_train, out_names,
                                  xgb.XGBClassifier(objective="binary:logistic", random_state=seed), n_splits=5,
                                  n_jobs=-2, verbose=False)

XGB
Base XGB without balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [01:31<00:00,  3.50s/it]

In [18]:

base_xgb_report

Out[18]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.700	0.674	0.671	0.723	0.734	0.670	0.713
Metabolism and nutrition disorders	0.808	0.685	0.469	0.565	0.956	0.700	0.748
Eye disorders	0.735	0.623	0.536	0.613	0.859	0.644	0.701
Musculoskeletal and connective tissue disorders	0.803	0.689	0.531	0.653	0.922	0.711	0.795
Gastrointestinal disorders	0.951	0.908	0.545	0.754	0.992	0.914	0.961
Immune system disorders	0.823	0.706	0.482	0.602	0.940	0.732	0.806
Reproductive system and breast disorders	0.637	0.648	0.647	0.689	0.622	0.656	0.684
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.214	0.750	0.533	0.678	0.134	0.544	0.438
General disorders and administration site conditions	0.952	0.908	0.476	0.655	0.998	0.910	0.949
Endocrine disorders	0.221	0.790	0.549	0.640	0.140	0.574	0.390
Surgical and medical procedures	0.031	0.849	0.475	0.645	0.018	0.124	0.256
Vascular disorders	0.867	0.769	0.490	0.634	0.981	0.777	0.846
Blood and lymphatic system disorders	0.743	0.651	0.599	0.697	0.827	0.676	0.777
Skin and subcutaneous tissue disorders	0.963	0.929	0.504	0.631	1.000	0.929	0.951
Congenital, familial and genetic disorders	0.028	0.821	0.465	0.584	0.015	0.233	0.238
Infections and infestations	0.820	0.706	0.506	0.664	0.943	0.726	0.835
Respiratory, thoracic and mediastinal disorders	0.852	0.748	0.490	0.607	0.974	0.758	0.815
Psychiatric disorders	0.810	0.691	0.491	0.609	0.941	0.710	0.774
Renal and urinary disorders	0.754	0.639	0.536	0.648	0.866	0.668	0.754
Pregnancy, puerperium and perinatal conditions	0.037	0.906	0.494	0.538	0.021	0.250	0.146
Ear and labyrinth disorders	0.505	0.613	0.593	0.630	0.444	0.589	0.584
Cardiac disorders	0.808	0.692	0.509	0.639	0.938	0.710	0.793
Nervous system disorders	0.957	0.917	0.488	0.731	0.997	0.919	0.964
Injury, poisoning and procedural complications	0.773	0.642	0.461	0.599	0.917	0.668	0.751

In [15]:

print("Base XGB with balancing:")
base_bal_xgb_report = cv_multi_report(X_train_dic, y_train, out_names,
                                      xgb.XGBClassifier(objective="binary:logistic", random_state=seed), balancing=True,
                                      n_splits=5, n_jobs=-2, verbose=False, random_state=seed)

Base XGB with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [18:56<00:00, 47.61s/it]

In [17]:

base_bal_xgb_report

Out[17]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.691	0.677	0.676	0.724	0.698	0.687	0.713
Metabolism and nutrition disorders	0.733	0.626	0.553	0.580	0.743	0.724	0.744
Eye disorders	0.699	0.622	0.594	0.624	0.720	0.682	0.706
Musculoskeletal and connective tissue disorders	0.740	0.645	0.589	0.645	0.735	0.746	0.796
Gastrointestinal disorders	0.934	0.881	0.650	0.757	0.932	0.936	0.959
Immune system disorders	0.775	0.671	0.580	0.618	0.781	0.769	0.817
Reproductive system and breast disorders	0.624	0.634	0.633	0.683	0.611	0.640	0.677
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.382	0.699	0.591	0.655	0.364	0.405	0.413
General disorders and administration site conditions	0.929	0.869	0.547	0.622	0.940	0.917	0.941
Endocrine disorders	0.308	0.744	0.575	0.615	0.262	0.383	0.348
Surgical and medical procedures	0.188	0.791	0.534	0.607	0.169	0.221	0.222
Vascular disorders	0.812	0.707	0.570	0.611	0.827	0.799	0.827
Blood and lymphatic system disorders	0.701	0.638	0.621	0.693	0.693	0.709	0.771
Skin and subcutaneous tissue disorders	0.939	0.886	0.576	0.650	0.939	0.939	0.956
Congenital, familial and genetic disorders	0.243	0.760	0.550	0.579	0.219	0.275	0.233
Infections and infestations	0.779	0.685	0.614	0.679	0.781	0.777	0.836
Respiratory, thoracic and mediastinal disorders	0.813	0.705	0.560	0.601	0.854	0.776	0.807
Psychiatric disorders	0.779	0.677	0.589	0.623	0.815	0.746	0.777
Renal and urinary disorders	0.725	0.642	0.605	0.648	0.737	0.714	0.745
Pregnancy, puerperium and perinatal conditions	0.197	0.895	0.570	0.532	0.154	0.280	0.181
Ear and labyrinth disorders	0.563	0.578	0.577	0.610	0.609	0.525	0.571
Cardiac disorders	0.772	0.677	0.609	0.657	0.789	0.756	0.807
Nervous system disorders	0.931	0.875	0.611	0.705	0.925	0.938	0.960
Injury, poisoning and procedural complications	0.721	0.630	0.586	0.606	0.719	0.723	0.749

The hyperparameter optimization in XGB was done in a similar way to RF: cross-validation random search with at least 150 combinations of parameters. In this model, six parameters were considered:
eta - step size of the model;
min_child_weight - minimum number of instances needed to be in each node;
max_depth - maximum depth of a tree;
gamma - minimum loss reduction required to make a further partition;
subsample - subsample ratio of the training instance;
colsample_bytree - subsample ratio of features when constructing each tree.

In [16]:

eta = [0.05, 0.1, 0.2]
min_child_weight = [1, 3]
max_depth = [5, 7, 9]
gamma = [0, 0.1, 0.2, 0.3, 0.4]
subsample = [0.6, 0.7, 0.8, 0.9]
colsample_bytree = [0.6, 0.7, 0.8, 0.9]
xgb_grid = {"xgbclassifier__eta": eta,
            "xgbclassifier__min_child_weight": min_child_weight,
            "xgbclassifier__max_depth": max_depth,
            "xgbclassifier__gamma": gamma,
            "xgbclassifier__subsample": subsample,
            "xgbclassifier__colsample_bytree": colsample_bytree
            }
xgb_grid_label = {name: xgb_grid for name in out_names}

"""The following code was previously executed and its output was saved"""
# best_xgb_params_by_label = multi_label_random_search(X_train_dic, y_train, out_names[20:],
#                                                      xgb.XGBClassifier(objective="binary:logistic", random_state=seed),
#                                                      xgb_grid_label, balancing=True, n_splits=3, scoring="f1_micro",
#                                                      n_jobs=-2, verbose=True, random_state=seed, n_iter=150)

Out[16]:

'The following code was previously executed and its output was saved'

In [19]:

print("Improved XGB with balancing:")
impr_bal_xgb_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamexgb,
                                      spec_params=best_xgb_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
                                      verbose=False, random_state=seed)

Improved XGB with balancing:

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [20:04<00:00, 50.80s/it]

In [20]:

impr_bal_xgb_report

Out[20]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.688	0.678	0.678	0.729	0.686	0.692	0.707
Metabolism and nutrition disorders	0.758	0.648	0.554	0.587	0.796	0.724	0.754
Eye disorders	0.720	0.644	0.616	0.644	0.747	0.695	0.716
Musculoskeletal and connective tissue disorders	0.764	0.662	0.581	0.652	0.798	0.734	0.803
Gastrointestinal disorders	0.940	0.889	0.608	0.723	0.956	0.924	0.951
Immune system disorders	0.780	0.672	0.570	0.621	0.800	0.761	0.823
Reproductive system and breast disorders	0.665	0.667	0.667	0.729	0.662	0.669	0.733
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.422	0.739	0.626	0.680	0.374	0.484	0.445
General disorders and administration site conditions	0.936	0.881	0.534	0.665	0.958	0.915	0.949
Endocrine disorders	0.317	0.759	0.585	0.637	0.258	0.420	0.372
Surgical and medical procedures	0.171	0.818	0.534	0.605	0.133	0.248	0.233
Vascular disorders	0.826	0.723	0.570	0.626	0.859	0.797	0.836
Blood and lymphatic system disorders	0.733	0.666	0.643	0.720	0.749	0.718	0.801
Skin and subcutaneous tissue disorders	0.946	0.899	0.581	0.619	0.955	0.938	0.944
Congenital, familial and genetic disorders	0.177	0.774	0.523	0.586	0.139	0.249	0.234
Infections and infestations	0.784	0.688	0.610	0.685	0.796	0.773	0.845
Respiratory, thoracic and mediastinal disorders	0.821	0.715	0.562	0.612	0.872	0.775	0.812
Psychiatric disorders	0.791	0.691	0.597	0.627	0.838	0.749	0.775
Renal and urinary disorders	0.751	0.666	0.622	0.671	0.785	0.720	0.774
Pregnancy, puerperium and perinatal conditions	0.172	0.899	0.559	0.536	0.124	0.300	0.195
Ear and labyrinth disorders	0.550	0.585	0.582	0.620	0.566	0.536	0.577
Cardiac disorders	0.780	0.685	0.610	0.680	0.808	0.755	0.822
Nervous system disorders	0.937	0.884	0.586	0.721	0.944	0.931	0.957
Injury, poisoning and procedural complications	0.733	0.629	0.562	0.620	0.765	0.704	0.772

One observed pattern was the improvement with XGB when classifying labels where the positive examples were the minority class.

3.2.4 Final Model Selection and Evaluation

After developing these three models for each label, it was possible to observe that different models perform better for different tasks. As such, after observing the different metrics, it was manually selected, mainly by comparing Average Precision, F1 Binary and Macro, and Recall, the best model for each.

In [22]:

pprint(best_model_by_label)

{'Blood and lymphatic system disorders': 'RF',
 'Cardiac disorders': 'RF',
 'Congenital, familial and genetic disorders': 'XGB',
 'Ear and labyrinth disorders': 'SVC',
 'Endocrine disorders': 'XGB',
 'Eye disorders': 'RF',
 'Gastrointestinal disorders': 'RF',
 'General disorders and administration site conditions': 'RF',
 'Hepatobiliary disorders': 'RF',
 'Immune system disorders': 'SVC',
 'Infections and infestations': 'RF',
 'Injury, poisoning and procedural complications': 'RF',
 'Metabolism and nutrition disorders': 'RF',
 'Musculoskeletal and connective tissue disorders': 'RF',
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': 'XGB',
 'Nervous system disorders': 'RF',
 'Pregnancy, puerperium and perinatal conditions': 'XGB',
 'Psychiatric disorders': 'RF',
 'Renal and urinary disorders': 'RF',
 'Reproductive system and breast disorders': 'RF',
 'Respiratory, thoracic and mediastinal disorders': 'RF',
 'Skin and subcutaneous tissue disorders': 'RF',
 'Surgical and medical procedures': 'XGB',
 'Vascular disorders': 'SVC'}

In [26]:

scores_best_model = cv_multi_report(X_train_dic, y_train, out_names, modelname=best_model_by_label,
                                    spec_params=best_model_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
                                    verbose=False, random_state=seed)

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [19:05<00:00, 47.24s/it]

In [38]:

print("CV scores for best model by label")
scores_best_model

CV scores for best model by label

Out[38]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Precision
Hepatobiliary disorders	0.721	0.698	0.695	0.756	0.754	0.691	0.742
Metabolism and nutrition disorders	0.776	0.666	0.557	0.610	0.838	0.724	0.761
Eye disorders	0.745	0.661	0.618	0.677	0.811	0.690	0.755
Musculoskeletal and connective tissue disorders	0.795	0.698	0.608	0.685	0.853	0.745	0.825
Gastrointestinal disorders	0.945	0.897	0.591	0.727	0.971	0.920	0.952
Immune system disorders	0.839	0.734	0.531	0.612	0.958	0.747	0.793
Reproductive system and breast disorders	0.684	0.679	0.679	0.740	0.695	0.674	0.752
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.422	0.739	0.626	0.680	0.374	0.484	0.445
General disorders and administration site conditions	0.941	0.890	0.554	0.643	0.965	0.918	0.943
Endocrine disorders	0.317	0.759	0.585	0.637	0.258	0.420	0.372
Surgical and medical procedures	0.171	0.818	0.534	0.605	0.133	0.248	0.233
Vascular disorders	0.868	0.776	0.557	0.665	0.961	0.792	0.852
Blood and lymphatic system disorders	0.739	0.664	0.634	0.727	0.775	0.707	0.815
Skin and subcutaneous tissue disorders	0.954	0.912	0.583	0.614	0.971	0.937	0.949
Congenital, familial and genetic disorders	0.177	0.774	0.523	0.586	0.139	0.249	0.234
Infections and infestations	0.798	0.697	0.595	0.672	0.843	0.757	0.833
Respiratory, thoracic and mediastinal disorders	0.839	0.736	0.551	0.613	0.919	0.772	0.819
Psychiatric disorders	0.800	0.699	0.594	0.663	0.861	0.747	0.810
Renal and urinary disorders	0.767	0.677	0.622	0.698	0.826	0.715	0.798
Pregnancy, puerperium and perinatal conditions	0.172	0.899	0.559	0.536	0.124	0.300	0.195
Ear and labyrinth disorders	0.558	0.586	0.585	0.614	0.583	0.535	0.547
Cardiac disorders	0.807	0.710	0.609	0.718	0.876	0.749	0.852
Nervous system disorders	0.952	0.910	0.591	0.708	0.975	0.930	0.958
Injury, poisoning and procedural complications	0.766	0.662	0.575	0.627	0.834	0.710	0.771

After this selection, the respective models were tested using the test dataset and the results of some of the metrics are shown in following table.

In [4]:

print("Test scores for best model by label")
test_scores_best_model = test_score_multi_report(X_train_dic, y_train, X_test_dic, y_test, out_names,
                                                 modelname=best_model_by_label, spec_params=best_model_params_by_label,
                                                 random_state=seed, verbose=False, balancing=True, n_jobs=-2, plot=False)

Test scores for best model by label

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [23:27<00:00, 58.07s/it]

<Figure size 432x288 with 0 Axes>

In [6]:

test_scores_best_model

Out[6]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Prec-Rec
Hepatobiliary disorders	0.720	0.678	0.671	0.671	0.771	0.674	0.769
Metabolism and nutrition disorders	0.825	0.727	0.603	0.597	0.893	0.767	0.812
Eye disorders	0.770	0.685	0.635	0.634	0.858	0.699	0.812
Musculoskeletal and connective tissue disorders	0.844	0.755	0.635	0.624	0.892	0.802	0.859
Gastrointestinal disorders	0.957	0.920	0.608	0.579	0.985	0.932	0.977
Immune system disorders	0.821	0.713	0.551	0.568	0.959	0.718	0.792
Reproductive system and breast disorders	0.721	0.689	0.685	0.684	0.737	0.706	0.806
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.389	0.692	0.592	0.588	0.329	0.475	0.481
General disorders and administration site conditions	0.925	0.860	0.486	0.498	0.965	0.888	0.929
Endocrine disorders	0.415	0.734	0.622	0.613	0.342	0.529	0.502
Surgical and medical procedures	0.092	0.794	0.488	0.501	0.064	0.167	0.222
Vascular disorders	0.881	0.794	0.547	0.547	0.948	0.823	0.865
Blood and lymphatic system disorders	0.768	0.696	0.663	0.663	0.770	0.766	0.846
Skin and subcutaneous tissue disorders	0.957	0.920	0.698	0.662	0.977	0.937	0.943
Congenital, familial and genetic disorders	0.296	0.801	0.590	0.579	0.231	0.414	0.342
Infections and infestations	0.799	0.699	0.601	0.597	0.872	0.737	0.814
Respiratory, thoracic and mediastinal disorders	0.836	0.738	0.592	0.589	0.927	0.761	0.837
Psychiatric disorders	0.851	0.762	0.632	0.621	0.890	0.815	0.917
Renal and urinary disorders	0.769	0.689	0.646	0.642	0.822	0.722	0.789
Pregnancy, puerperium and perinatal conditions	0.103	0.878	0.518	0.518	0.071	0.182	0.116
Ear and labyrinth disorders	0.571	0.549	0.548	0.548	0.581	0.562	0.586
Cardiac disorders	0.813	0.713	0.600	0.598	0.904	0.739	0.856
Nervous system disorders	0.942	0.892	0.552	0.542	0.984	0.903	0.955
Injury, poisoning and procedural complications	0.754	0.647	0.563	0.566	0.829	0.692	0.763

In [25]:

test_scores_best_model_sorted = test_scores_best_model.sort_values(by=["F1 Binary"], ascending=True)
ax = test_scores_best_model_sorted.plot(kind="barh",
                                        y=["F1 Binary", "F1 Macro", "F1 Micro"],
                                        title="Test scores by label (SIDER)",
                                        xticks=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
                                        legend="reverse", xlim=(0, 1), figsize = (10,16))
for p in ax.patches: ax.annotate("{:.3f}".format(round(p.get_width(), 3)), (p.get_x() + p.get_width(), p.get_y()),
                                 xytext=(30, 0), textcoords='offset points', horizontalalignment='right')

As we can see, the results vary widely from label to label. In order to better understand the reason why, we can further analyse the three different categories of results: high F1, low F1 Macro; similar F1 and F1 Macro; and low F1 and F1 Macro. It is possible to observe these three results, for example, in "General disorders and administration site conditions", "Hepatobiliary disorders", and "Congenital, familial and genetic disorders", respectively. One of the possible explanations for this is the distribution of positive and negative examples in the dataset, observable in the following table:

In [31]:

df_perc = countsm / 1427
df_filt = df_perc.loc[["General disorders and administration site conditions","Hepatobiliary disorders",
                       "Congenital, familial and genetic disorders"]]*100
df_filt.apply(lambda x: round(x, 1)).rename(columns = lambda x: "% "+x)

Out[31]:

	% Positives	% Negatives
General disorders and administration site conditions	90.5	9.5
Hepatobiliary disorders	52.1	47.9
Congenital, familial and genetic disorders	17.7	82.3

In the second type of result, for "Hepatobiliary disorders", similar values can be observed across the metrics. This is the result of the more balanced distribution of the label in the dataset, and, as such, we have a good F1 score for both positive and negative labels, shown by the F1 Macro score. However, for the other two types of results, "General disorders..." and "Congenital...", it is possible to see a much bigger difference across the metrics. With the first one, as almost 91% of the examples in the dataset are positive, and while oversampling helped, it is still possible to observe that the model tends to classify most of the validation examples as positive. It has a very high recall, so every positive example was correctly classified as positive and the high precision could lead us to believe that this is mostly correct but, as it is possible to see by the low F1 Macro score, the F1 for the negative label is very low. This means that the model is classifying most of the examples as positive, but because of the imbalance in the dataset in favour of the positive examples, the precision is still high. The opposite is true when looking at "Congenital...", in this case, the imbalance is in favour of the negative label so the model tends to classify samples as negative.

There are two conclusions that can already be made about the performance of the models: RF had the overall best performance across the board with a majority of labels having this model as the best, and, in labels with very few positive examples, the best model was always the XGB.

When looking at the performance for each label, it is clear that the best performance came from the more balanced labels. While these labels didn't have the best F1 Binary Score, and, as such, the recall is lower than others, they had the best performance when separating positive and negative classifications, showed by having a similar F1 Binary and Macro score.

3.2.5 OFFSIDES

This dataset was used in order to try to expand the number of examples and, consequently, improve the models' performance.

One of the first apparent problems when analysing this dataset was the even greater imbalance in the dataset, as seen next:

In [38]:

mod_off = pd.read_csv("./datasets/offside_socs_modified.csv")
df = pd.read_csv("./datasets/sider.csv")
todrop = ["Product issues", "Investigations", "Social circumstances"]
df.drop(todrop, axis=1, inplace=True)

# 1332 Rows in Total
df_y_2 = mod_off.drop("smiles", axis=1)
d2 = {"Positives": df_y_2.sum(axis=0), "Negatives": 1332 - df_y_2.sum(axis=0)}
counts = pd.DataFrame(data=d2)
counts.plot(kind='bar', figsize=(16, 10), title="OFFSIDES Adverse Drug Reactions Counts", ylim=(0, 1400), stacked=True)

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b408bf44e0>

This dataset was then combined with SIDER, considering a label present in both datasets as positive when it was positive in either OFFSIDES or SIDER, and the new global counts are present next:

In [39]:

df_all = pd.read_csv("./dataframes/df_all.csv")  # (2043, 25)

# New counts (SIDER + OFFSIDES)
df_all_y = df_all.drop("smiles", axis=1)
da2 = {"Positives": df_all_y.sum(axis=0), "Negatives": 2043 - df_all_y.sum(axis=0)}
counts = pd.DataFrame(data=da2)
counts.plot(kind='bar', figsize=(16, 10), title="Adverse Drug Reactions Counts (SIDER + OFFSIDES)", ylim=(0, 2100),
            stacked=True)

Out[39]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b46cca3c88>

As it is possible to see, most of the labels became even more imbalanced with the exception of the labels that had a minority in the positive examples.

In [40]:

# Repeat process of dataframe transformation with dataframe SIDER + OFFSIDES
df_off_y, df_off_mols = create_original_df(usedf=True, file=df_all, write_s=False, write_off=False)
df_off_mols.drop("smiles", axis=1, inplace=True)

df_off_mols_train, df_off_mols_test, y_off_train, y_off_test = train_test_split(df_off_mols, df_off_y, test_size=0.2,
                                                                                random_state=seed)

# Create X datasets with fingerprint length
X_off_all, _, _, _ = createfingerprints(df_off_mols, length=1125)
X_off_train_fp, _, _, _ = createfingerprints(df_off_mols_train, length=1125)
X_off_test_fp, _, _, _ = createfingerprints(df_off_mols_test, length=1125)

# Selects and create descriptors dataset
df_off_desc = createdescriptors(df_off_mols)  # Create all descriptors

# Splits in train and test
df_off_desc_base_train, df_off_desc_base_test = train_test_split(df_off_desc, test_size=0.2, random_state=seed)

# Creates a dictionary with key = class label and value = dataframe with fingerprint + best K descriptors for that label
X_off_train_dic, X_off_test_dic, selected_off_cols = create_dataframes_dic(df_off_desc_base_train,
                                                                           df_off_desc_base_test, X_off_train_fp,
                                                                           X_off_test_fp, y_off_train, out_names,
                                                                           score_func=f_classif, k=3)

100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 94.78it/s]

In [41]:

test_scores_sioff = test_score_multi_report(X_off_train_dic, y_off_train, X_off_test_dic, y_off_test, out_names,
                                            modelname=best_model_by_label, spec_params=best_model_params_by_label,
                                            random_state=seed, verbose=False, balancing=True, n_jobs=-2, plot=False)

100%|█████████████████████████████████████████████████████████████████████████████████| 24/24 [38:21<00:00, 100.86s/it]

In [42]:

print("Test scores for SIDER + OFFSIDE")
test_scores_sioff

Test scores for SIDER + OFFSIDE

Out[42]:

	F1 Binary	F1 Micro	F1 Macro	ROC_AUC	Recall	Precision	Average Prec-Rec
Hepatobiliary disorders	0.852	0.758	0.594	0.587	0.877	0.828	0.871
Metabolism and nutrition disorders	0.910	0.839	0.571	0.562	0.935	0.886	0.928
Eye disorders	0.875	0.785	0.552	0.547	0.914	0.839	0.855
Musculoskeletal and connective tissue disorders	0.907	0.834	0.549	0.543	0.946	0.872	0.909
Gastrointestinal disorders	0.966	0.934	0.547	0.544	0.969	0.962	0.980
Immune system disorders	0.828	0.714	0.493	0.514	0.921	0.751	0.792
Reproductive system and breast disorders	0.774	0.672	0.588	0.585	0.813	0.740	0.795
Neoplasms benign, malignant and unspecified (incl cysts and polyps)	0.789	0.667	0.504	0.505	0.812	0.767	0.782
General disorders and administration site conditions	0.968	0.939	0.521	0.517	0.982	0.955	0.966
Endocrine disorders	0.607	0.538	0.523	0.525	0.646	0.573	0.604
Surgical and medical procedures	0.679	0.557	0.484	0.485	0.690	0.668	0.697
Vascular disorders	0.934	0.878	0.550	0.542	0.962	0.908	0.915
Blood and lymphatic system disorders	0.862	0.765	0.538	0.536	0.877	0.847	0.911
Skin and subcutaneous tissue disorders	0.961	0.924	0.480	0.487	0.974	0.947	0.966
Congenital, familial and genetic disorders	0.656	0.575	0.549	0.550	0.695	0.622	0.643
Infections and infestations	0.932	0.873	0.518	0.517	0.957	0.908	0.945
Respiratory, thoracic and mediastinal disorders	0.915	0.844	0.473	0.486	0.950	0.882	0.926
Psychiatric disorders	0.902	0.826	0.585	0.574	0.929	0.876	0.892
Renal and urinary disorders	0.871	0.778	0.540	0.537	0.897	0.845	0.893
Pregnancy, puerperium and perinatal conditions	0.461	0.548	0.536	0.536	0.467	0.454	0.439
Ear and labyrinth disorders	0.640	0.575	0.560	0.560	0.640	0.640	0.660
Cardiac disorders	0.897	0.817	0.518	0.521	0.945	0.854	0.905
Nervous system disorders	0.967	0.936	0.484	0.487	0.975	0.960	0.978
Injury, poisoning and procedural complications	0.896	0.814	0.516	0.515	0.911	0.881	0.924

In [49]:

diff_offsides = test_scores_sioff - test_scores_best_model
ax2 = diff_offsides.plot(kind="barh", y=["F1 Binary", "F1 Macro", "F1 Micro"], 
                         title="Changes in test scores by label with OFFSIDES",legend="reverse", figsize = (10,16))

As we can see from the previous results, the biggest and only improvements when combining SIDER and OFFSIDES were in classes that had very low recall since the addition OFFSIDES dataset greatly increased the number of positives examples, thus balancing the dataset. But, even with these improvements, the overall performance of the models was worse after the merge of the datasets.

3.3 Future Work

As can be seen during this paper, the main problem faced during this work was the imbalance of the datasets, so this could be the focus of improvement in future related work.

One of the possibilities to try to avoid this could be to break down the used SOCs in their components. SOCs are the highest level in the MedDRA hierarchy and they encompass multiple more specific reactions so, in theory breaking down these SOCs in their lower level would reduce the number of positive examples in each of them.

The other possibility could be in the model development by trying other types of models such as deep learning or methods of over or under-sampling, for example using the "class weights" in some of the scikit-learn models.