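# Misc
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from pprint import pprint
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline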
# Functions
from mlprocess import *
from params_by_label import *
# Fixing the seed
seed = 6
np.random.seed(seed)
""" Careful when importing saved dataframes with label names as indexes -> index = True """
The objective of this work is to develop machine learning (ML) methods that can accurately predict adverse drug reactions (ADRs) using databases like SIDER and OFFSIDES.
One of the most important factors when using ML methods is the choice of datasets used to train, validate and test the model. In this work, 3 different ones will be used at different stages, as shown in Table 1.
Dataset | Description |
---|---|
SIDER 4 | 1427 Approved drugs with ADRs text-mined from drug package inserts grouped into 27 system organ classes following MedDRA classification |
OFFSIDES | Database of off-label side effects |
When pre-processing OFFSIDES, ADRs were grouped by system organ classes following MedDRA classification and SMILES strings were obtained from PubChem using the REST API and the STITCH IDs of the compounds.
Features are the attributes associated with each example; together they form the representation of the dataset that the model sees.
SMILES strings are commonly used to represent molecules, as in SIDER, and they are the basis of this work. However, even though a SMILES string describes a molecule unambiguously, it cannot be used directly as a feature in ML. Instead, SMILES strings are used to generate other features, such as fingerprints and molecular descriptors, using tools like RDKit in Python. The general workflow for the datasets when in SIDER format is displayed in the following figure.
The SIDER dataset consists of a first column with the molecules' SMILES representation and twenty-seven other columns with the different SOCs. Three of these SOCs were not used since they had no real connection with the molecule and, as such, the development of ML models to predict these labels was not useful; these were 'Product Issues', 'Investigations', and 'Social circumstances'.
With the SMILES representation, it was possible to create multiple different features using RDKit, mainly fingerprints and other descriptors, for example, molecular weight, number of radical electrons, and number of valence electrons. We used these to add relevant information that complements the fingerprint.
In total, 27 descriptors were calculated for each molecule. Not every descriptor was useful, so some selection was required. Since there were 24 different classification tasks, each with an independent model, and the importance of each descriptor differed between tasks, the selection was done independently for each task. This resulted in 24 different DataFrames, each consisting of the fingerprint representation plus the 3 descriptors selected for that task (the value 3 was chosen after testing different values). An example is shown in the following table:
This selection was done using the SelectKBest function from scikit-learn with ANOVA as the statistical test. When transforming OFFSIDES and after getting the SMILES from the STITCH IDs, the process is the same as described before.
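The selection step above can be illustrated with a minimal sketch. The notebook itself relies on scikit-learn's `SelectKBest` with `f_classif`; the plain-Python functions below (`anova_f`, `select_k_best`) are hypothetical re-implementations written only to show the idea, and the toy descriptor values are made up.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the classes in labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    # Variation between class means and variation inside each class
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_k_best(columns, labels, k=3):
    """Return the names of the k columns with the highest F statistic."""
    scores = {name: anova_f(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy descriptor table (illustrative values, not taken from SIDER)
toy_columns = {
    "ringcount": [1, 1, 2, 4, 5, 5],
    "numhdonors": [2, 3, 2, 3, 2, 3],
    "arorings": [0, 1, 1, 2, 3, 3],
    "numrade": [0, 0, 1, 0, 1, 0],
}
toy_labels = [0, 0, 0, 1, 1, 1]
selected = select_k_best(toy_columns, toy_labels, k=2)
print(selected)
```

Descriptors whose values differ strongly between the positive and negative class get a high F statistic and are kept; descriptors with the same mean in both classes score close to zero.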
Supervised learning is the most common ML scenario in chemoinformatics, and can be subdivided into classification and regression problems. In this type of learning, the training data has the outcome variable to guide the learning process. The objective of this type of learning is to predict the value of an outcome or to classify it. The tested models were:
Support Vector Machines (SVM): one of the most popular ML methods. It maps the data into a high-dimensional space, using a non-linear kernel function, in order to optimally separate the classes. This separation is done by maximizing the margin between the decision boundary, a hyperplane, and the points of each class closest to it, the support vectors.
Random Forest (RF): an ensemble of decision trees in which each tree is built independently from a bootstrap sample of the training data, using random feature selection. After the RF is built, a prediction is made by a majority vote (classification) or by averaging the predictions of all the trees (regression).
Gradient Boosted Trees (GBT): similar to RF in that it is also an ensemble of trees, but the trees are not independent. In GBT, at each iteration, a new tree is constructed by fitting a simple function to the current residuals.
The models tested and optimized were SVC (classification implementation of SVM) and Random Forest using scikit-learn, and Gradient Boosted Trees with XGBoost.
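As a rough illustration of the two Random Forest ingredients described above, the sketch below shows bootstrap sampling and majority voting in plain Python. The helper names (`bootstrap_sample`, `majority_vote`) and the vote values are hypothetical; the notebook itself uses `RandomForestClassifier` from scikit-learn, which handles both steps internally.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as is done for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most common class label among the per-tree votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(6)
rows = list(range(10))            # stand-in for training row indices
sample = bootstrap_sample(rows, rng)

tree_votes = [1, 0, 1, 1, 0]      # made-up predictions of five trees for one molecule
prediction = majority_vote(tree_votes)
print(sample, prediction)
```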
As can be seen next, the percentage of positives differs greatly from label to label. Because of this, the workflow for each model was: evaluation of the base model trained on the original dataset using cross-validation, followed by cross-validation with oversampling of the minority class, followed by hyperparameter optimization using random and grid search, followed by a final validation with the optimized parameters and oversampling. This process is shown next:
# Creating base df_molecules, df_y with the results vectors, and df_mols_descr with the descriptors
print("Creating Dataframes")
y_all, df_molecules = create_original_df(write_s=False)
df_molecules.drop("smiles", axis=1, inplace=True)
todrop = ["Product issues", "Investigations", "Social circumstances"]
y_all.drop(todrop, axis=1, inplace=True) # No real connection with the molecule, multiple problems
out_names = y_all.columns.tolist() # Get class labels
# Split into df_mols_train and df_mols_test to avoid data snooping and fitting the model to the test set
df_mols_train, df_mols_test, y_train, y_test = train_test_split(df_molecules, y_all, test_size=0.2, random_state=seed)
Creating Dataframes
d = {"Positives": y_all.sum(axis=0), "Negatives": len(y_all) - y_all.sum(axis=0)}  # len(y_all) is the number of drugs (1427)
countsm = pd.DataFrame(data=d)
countsm.plot(kind='bar', figsize=(16, 10), title="Adverse Drug Reactions Counts", ylim=(0, 1500), stacked=True)
ML model development and validation:
After replicating this process for the 3 models, the best one for each label was selected and tested with the test dataset. This process was done using stratified k-fold cross-validation, so that each fold contains approximately the same percentage of samples of each target class as the complete set.
As seen in the previous figure, one of the steps when developing the ML models was balancing the dataset. This can be necessary when the classification categories are not approximately equally represented.
Class imbalance can usually be dealt with by re-sampling the dataset, either by over-sampling the minority class and/or under-sampling the majority class. In this work, over-sampling was used, specifically SMOTE-NC, an extension of the Synthetic Minority Over-sampling TEchnique (SMOTE), as implemented in the imbalanced-learn package.
With SMOTE, the minority class is over-sampled by introducing synthetic examples along the line segments joining a minority class sample to its k nearest minority class neighbours. SMOTE-NC adapts this strategy to datasets that contain categorical features: when generating a new sample, each categorical feature takes the most frequent category among the nearest neighbours.
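The generation of a single synthetic sample can be sketched as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_nc_sample` is hypothetical, the nearest-neighbour search is assumed to have been done already, and each sample is represented as a pair of lists (continuous features, categorical features).

```python
import random
from collections import Counter

def smote_nc_sample(sample, neighbours, rng):
    """Create one synthetic sample from a minority sample and its nearest minority neighbours."""
    cont, _ = sample
    neighbour_cont, _ = rng.choice(neighbours)
    gap = rng.random()  # position along the line segment, in [0, 1)
    # Continuous features: interpolate between the sample and the chosen neighbour
    new_cont = [c + gap * (n - c) for c, n in zip(cont, neighbour_cont)]
    # Categorical features: most frequent category among the neighbours
    new_cat = [
        Counter(values).most_common(1)[0][0]
        for values in zip(*(cat for _, cat in neighbours))
    ]
    return new_cont, new_cat

rng = random.Random(6)
minority_sample = ([1.0, 2.0], [0, 1])
nearest = [([2.0, 4.0], [1, 1]), ([3.0, 2.0], [1, 0]), ([1.0, 3.0], [1, 1])]
new_cont, new_cat = smote_nc_sample(minority_sample, nearest, rng)
print(new_cont, new_cat)
```

The continuous part of the new sample always lies between the original sample and one of its neighbours, while the categorical part is decided by the neighbourhood as a whole.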
There are some dangers when combining class balancing with cross-validation. In order to keep the validation process valid, balancing should not be done before separating training and validation data for each fold. That is, we start the cross-validation process by dividing the training set into a training part and a validation part, and only then over-sample the minority class of the training part (when employing over-sampling). We do this for every iteration of the process. This minimizes possible overfitting and avoids a change in the distribution of the evaluation data that would produce misleading results.
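The order of operations described above can be sketched in plain Python. To keep the example self-contained, random duplication of minority samples is used as a stand-in for SMOTE-NC, and both helper functions (`stratified_folds`, `oversample_minority`) are hypothetical; the point is only that balancing touches the training part of each fold and never the validation part.

```python
import random

def stratified_folds(labels, n_splits, rng):
    """Assign indices to n_splits folds, keeping class proportions roughly equal."""
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

def oversample_minority(indices, labels, rng):
    """Randomly duplicate minority-class indices until both classes have equal size."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

rng = random.Random(6)
labels = [1] * 4 + [0] * 16       # toy imbalanced label vector
folds = stratified_folds(labels, 4, rng)

splits = []
for k, val_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Balance the training part only; the validation fold keeps its original distribution
    train_bal = oversample_minority(train_idx, labels, rng)
    splits.append((train_bal, val_idx))
```

Because the duplicated samples are drawn only from the training indices, no copy of a validation sample can leak into the data the model is fitted on.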
Performance measures for classification are typically based on the confusion matrix.
In this matrix, TN are the true negatives, TP are the true positives, FP are the false positives, and FN are the false negatives.
Using this format it is possible to calculate other metrics in order to evaluate the quality of a model's predictions. To get a better idea of how well a model works, different metrics should be used. In this work, the following were used: Recall, Precision, Average Precision, Area Under the Receiver Operating Characteristic curve (AUROC), and different variations of the F1 score.
Precision is the ability of the classifier not to label as positive a sample that is negative and is defined by:
$$Precision = \frac{\text{TP}}{\text{TP + FP} }$$
Recall is the ability of the classifier to label as positive a sample that is positive and is defined by:
$$Recall = \frac{\text{TP}}{\text{TP + FN} }$$
Average Precision summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold $(P_{n})$. The increase in recall $(R_{n})$ from the previous threshold $(R_{n-1})$ is used as the weight. It is defined by:
$$AP = \sum_{n} \left ( R_{n} - R_{n-1} \right ) P_{n}$$
AUROC is the area under the receiver operating characteristic (ROC) curve. This curve is created by plotting the fraction of TP out of the actual positives against the fraction of FP out of the actual negatives, at different thresholds.
F1 Score is the harmonic mean of precision and recall and is defined by:
$$F1 = \frac{2 \cdot precision\cdot recall}{precision + recall}$$
In this work, three types of F1 score were used. F1 Binary, also represented as F1, is the F1 score with respect only to the positive label. F1 Macro is the unweighted mean of the F1 scores of the positive and negative labels. F1 Micro uses the global TP, FN and FP counts and is equivalent to the accuracy metric in a binary classification task.
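The definitions above can be checked numerically with a small sketch. The function names (`f1`, `binary_scores`) are hypothetical and the counts are made up; they describe a classifier that labels every sample as positive on a dataset with 70% positives.

```python
def f1(tp, fp, fn):
    """F1 score from counts, equal to 2PR / (P + R); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binary_scores(tp, fp, fn, tn):
    """Precision, recall and the three F1 variants for a binary task."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_binary = f1(tp, fp, fn)
    # For the negative class the roles swap: TN act as its true positives
    f1_negative = f1(tn, fn, fp)
    # Micro: pool TP, FP and FN over both classes before computing F1
    f1_micro = f1(tp + tn, fp + fn, fn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1_binary": f1_binary,
        "f1_macro": (f1_binary + f1_negative) / 2,
        "f1_micro": f1_micro,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Made-up counts: every sample predicted positive, 70 of 100 actually positive
scores = binary_scores(tp=70, fp=30, fn=0, tn=0)
print(scores)
```

In this degenerate case recall is 1.0 and precision equals the fraction of positives, F1 Micro matches accuracy, and F1 Macro is half of F1 Binary because the negative class contributes an F1 of zero.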
During this work, Average Precision, Recall and the different F1 Scores will be the main metrics used to evaluate and develop the model since they deal with imbalanced datasets better than AUROC.
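Since Average Precision is one of the main metrics, a minimal sketch of the sum $AP = \sum_{n} (R_{n} - R_{n-1}) P_{n}$ may help. The function `average_precision` is a hypothetical plain-Python version that places one threshold at each ranked sample and assumes there are no tied scores and at least one positive; the labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Sum of (R_n - R_{n-1}) * P_n, with one threshold per ranked sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        precision = tp / rank       # P_n: precision among the top-ranked samples
        recall = tp / n_pos         # R_n: fraction of positives recovered so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(y_true, y_score)
print(round(ap, 3))
```

Only the ranks at which a positive is found change the recall, so only those ranks contribute to the sum; a negative ranked above a positive lowers the precision recorded at the next positive.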
In this work 24 models were studied, one for each SOC. Before any testing, the dataset was split into training and test sets, and all validation and optimization tasks were done using the training set only, in order to prevent any type of overfitting to the test set.
Something to keep in mind when evaluating these results is the imbalance of the test dataset, which is a consequence of the imbalance of the original dataset. This greatly affects the metrics, mainly precision and all metrics derived from it: when a large majority of the samples are positive, even a model that labels every sample as positive obtains a precision equal to the fraction of positives, so precision scores will be high regardless of model quality.
The first step was to choose a fingerprint type and its length. The tested possibilities were ECFP-4, MACCS keys, Atom Pairs and Topological Torsion. For each of these types, different lengths between 100 and 2048 were tested and the different metrics were calculated in order to pick the best combination. In order to simplify this process, the different combinations were tested using 10-fold cross-validation with SVC and only for the label 'Hepatobiliary disorders'. The results are displayed next:
# Fingerprint length
all_df_results_svc = test_fingerprint_size(df_mols_train, y_train, SVC(gamma="scale", random_state=seed), makeplots=True, write=False)
# Best result with ECFP-4 at 1125 - This will be used for all results
Creating Dataframes
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [08:46<00:00, 40.24s/it]
After analysing the results, the performance appeared to be similar across the board with a small advantage when using ECFP-4. In this fingerprint type a small peak of performance was seen at length 1125, so that was the chosen combination.
# Create X datasets with fingerprint length
X_all, _, _, _ = createfingerprints(df_molecules, length=1125)
X_train_fp, _, _, _ = createfingerprints(df_mols_train, length=1125)
X_test_fp, _, _, _ = createfingerprints(df_mols_test, length=1125)
# Selects and create descriptors dataset
df_desc = createdescriptors(df_molecules) # Create all descriptors
# Splits in train and test
df_desc_base_train, df_desc_base_test = train_test_split(df_desc, test_size=0.2, random_state=seed)
# Creates a dictionary with key = class label and value = dataframe with fingerprint + best K descriptors for that label
X_train_dic, X_test_dic, selected_cols = create_dataframes_dic(df_desc_base_train, df_desc_base_test, X_train_fp,
X_test_fp, y_train, out_names, score_func=f_classif, k=3)
# Creates a y dictionary for all labels
y_train_dic = {name: y_train[name] for name in out_names}
modelnamesvc = {name: "SVC" for name in out_names}
modelnamerf = {name: "RF" for name in out_names}
modelnamexgb = {name: "XGB" for name in out_names}
modelnamevot = {name: "VotingClassifier" for name in out_names}
100%|█████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 114.51it/s]
print("Selected descriptors by label:")
pprint(selected_cols, width=-1)
Selected descriptors by label: {'Blood and lymphatic system disorders': ['fracsp33', 'aliphcarbocycles', 'numsatcarbcycles'], 'Cardiac disorders': ['arocarbocycles', 'arorings', 'numsatcarbcycles'], 'Congenital, familial and genetic disorders': ['maxabspartcharge', 'numrade', 'arocarbocycles'], 'Ear and labyrinth disorders': ['maxpartcharge', 'aliphcarbocycles', 'numsatcarbcycles'], 'Endocrine disorders': ['numrade', 'aliphcarbocycles', 'numsatcarbcycles'], 'Eye disorders': ['nhohcount', 'numhdonors', 'numhatoms'], 'Gastrointestinal disorders': ['aliphcarbocycles', 'arorings', 'numsatcarbcycles'], 'General disorders and administration site conditions': ['numrade', 'aliphcarbocycles', 'numsatcarbcycles'], 'Hepatobiliary disorders': ['arohetcycles', 'arorings', 'numhdonors'], 'Immune system disorders': ['aliphhetcycles', 'numhacceptors', 'numsathetcycles'], 'Infections and infestations': ['aliphrings', 'numsatrings', 'ringcount'], 'Injury, poisoning and procedural complications': ['arohetcycles', 'arorings', 'ringcount'], 'Metabolism and nutrition disorders': ['arohetcycles', 'arorings', 'ringcount'], 'Musculoskeletal and connective tissue disorders': ['arohetcycles', 'arorings', 'ringcount'], 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': ['aliphcarbocycles', 'arohetcycles', 'ringcount'], 'Nervous system disorders': ['arocarbocycles', 'arorings', 'ringcount'], 'Pregnancy, puerperium and perinatal conditions': ['fracsp33', 'aliphcarbocycles', 'aliphrings'], 'Psychiatric disorders': ['nhohcount', 'arohetcycles', 'numsatcarbcycles'], 'Renal and urinary disorders': ['aliphcarbocycles', 'arorings', 'numsatcarbcycles'], 'Reproductive system and breast disorders': ['aliphcarbocycles', 'aliphrings', 'ringcount'], 'Respiratory, thoracic and mediastinal disorders': ['fracsp33', 'aliphcarbocycles', 'numsatcarbcycles'], 'Skin and subcutaneous tissue disorders': ['numrade', 'aliphrings', 'ringcount'], 'Surgical and medical procedures': ['maxpartcharge', 
'numval', 'nhohcount'], 'Vascular disorders': ['aliphhetcycles', 'arorings', 'ringcount']}
print("SVC")
print("Base SVC without balancing:")
base_svc_report = cv_multi_report(X_train_dic, y_train, out_names, SVC(gamma="auto", random_state=seed), n_splits=5,
n_jobs=-2, verbose=False)
SVC Base SVC without balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [01:13<00:00, 2.84s/it]
print("Scores for SVC without balancing:")
base_svc_report
Scores for SVC without balancing:
SOC | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.712 | 0.643 | 0.622 | 0.676 | 0.851 | 0.612 | 0.660 |
Metabolism and nutrition disorders | 0.818 | 0.692 | 0.409 | 0.565 | 1.000 | 0.692 | 0.732 |
Eye disorders | 0.763 | 0.624 | 0.430 | 0.597 | 0.984 | 0.622 | 0.687 |
Musculoskeletal and connective tissue disorders | 0.815 | 0.687 | 0.407 | 0.648 | 1.000 | 0.687 | 0.802 |
Gastrointestinal disorders | 0.951 | 0.907 | 0.476 | 0.672 | 1.000 | 0.907 | 0.945 |
Immune system disorders | 0.841 | 0.726 | 0.421 | 0.559 | 1.000 | 0.726 | 0.777 |
Reproductive system and breast disorders | 0.609 | 0.608 | 0.608 | 0.652 | 0.611 | 0.608 | 0.625 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.000 | 0.745 | 0.427 | 0.679 | 0.000 | 0.000 | 0.470 |
General disorders and administration site conditions | 0.953 | 0.910 | 0.476 | 0.603 | 1.000 | 0.910 | 0.938 |
Endocrine disorders | 0.000 | 0.786 | 0.440 | 0.631 | 0.000 | 0.000 | 0.391 |
Surgical and medical procedures | 0.000 | 0.855 | 0.461 | 0.527 | 0.000 | 0.000 | 0.187 |
Vascular disorders | 0.869 | 0.769 | 0.435 | 0.577 | 1.000 | 0.769 | 0.817 |
Blood and lymphatic system disorders | 0.766 | 0.633 | 0.458 | 0.661 | 0.981 | 0.628 | 0.742 |
Skin and subcutaneous tissue disorders | 0.962 | 0.927 | 0.481 | 0.593 | 1.000 | 0.927 | 0.949 |
Congenital, familial and genetic disorders | 0.000 | 0.824 | 0.452 | 0.585 | 0.000 | 0.000 | 0.245 |
Infections and infestations | 0.830 | 0.710 | 0.415 | 0.655 | 1.000 | 0.710 | 0.815 |
Respiratory, thoracic and mediastinal disorders | 0.856 | 0.748 | 0.428 | 0.558 | 1.000 | 0.748 | 0.793 |
Psychiatric disorders | 0.823 | 0.699 | 0.412 | 0.618 | 1.000 | 0.699 | 0.773 |
Renal and urinary disorders | 0.781 | 0.641 | 0.390 | 0.645 | 1.000 | 0.641 | 0.756 |
Pregnancy, puerperium and perinatal conditions | 0.000 | 0.915 | 0.478 | 0.514 | 0.000 | 0.000 | 0.129 |
Ear and labyrinth disorders | 0.000 | 0.552 | 0.356 | 0.620 | 0.000 | 0.000 | 0.577 |
Cardiac disorders | 0.819 | 0.693 | 0.409 | 0.641 | 1.000 | 0.693 | 0.795 |
Nervous system disorders | 0.958 | 0.919 | 0.479 | 0.633 | 1.000 | 0.919 | 0.954 |
Injury, poisoning and procedural complications | 0.799 | 0.665 | 0.399 | 0.568 | 1.000 | 0.665 | 0.730 |
print("Base SVC with balancing:")
base_bal_svc_report = cv_multi_report(X_train_dic, y_train, out_names, SVC(gamma="auto", random_state=seed),
balancing=True, n_splits=5, n_jobs=-2, verbose=False, random_state=seed)
Base SVC with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [21:26<00:00, 53.81s/it]
print("Scores for SVC with balancing:")
base_bal_svc_report
Scores for SVC with balancing:
SOC | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.654 | 0.628 | 0.624 | 0.675 | 0.678 | 0.633 | 0.655 |
Metabolism and nutrition disorders | 0.701 | 0.603 | 0.553 | 0.587 | 0.676 | 0.730 | 0.758 |
Eye disorders | 0.702 | 0.607 | 0.560 | 0.608 | 0.754 | 0.658 | 0.699 |
Musculoskeletal and connective tissue disorders | 0.742 | 0.646 | 0.588 | 0.633 | 0.741 | 0.744 | 0.781 |
Gastrointestinal disorders | 0.905 | 0.833 | 0.602 | 0.720 | 0.880 | 0.932 | 0.956 |
Immune system disorders | 0.776 | 0.664 | 0.553 | 0.594 | 0.800 | 0.754 | 0.791 |
Reproductive system and breast disorders | 0.607 | 0.607 | 0.607 | 0.652 | 0.608 | 0.608 | 0.625 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.331 | 0.640 | 0.542 | 0.584 | 0.350 | 0.314 | 0.331 |
General disorders and administration site conditions | 0.902 | 0.825 | 0.524 | 0.577 | 0.890 | 0.915 | 0.928 |
Endocrine disorders | 0.262 | 0.698 | 0.536 | 0.557 | 0.250 | 0.279 | 0.277 |
Surgical and medical procedures | 0.253 | 0.591 | 0.484 | 0.553 | 0.475 | 0.173 | 0.177 |
Vascular disorders | 0.795 | 0.690 | 0.581 | 0.617 | 0.780 | 0.810 | 0.825 |
Blood and lymphatic system disorders | 0.727 | 0.651 | 0.620 | 0.658 | 0.759 | 0.699 | 0.737 |
Skin and subcutaneous tissue disorders | 0.922 | 0.859 | 0.591 | 0.670 | 0.900 | 0.945 | 0.953 |
Congenital, familial and genetic disorders | 0.177 | 0.671 | 0.486 | 0.501 | 0.199 | 0.160 | 0.192 |
Infections and infestations | 0.743 | 0.651 | 0.599 | 0.657 | 0.711 | 0.779 | 0.819 |
Respiratory, thoracic and mediastinal disorders | 0.797 | 0.689 | 0.563 | 0.596 | 0.817 | 0.779 | 0.805 |
Psychiatric disorders | 0.773 | 0.673 | 0.593 | 0.624 | 0.796 | 0.752 | 0.772 |
Renal and urinary disorders | 0.731 | 0.654 | 0.622 | 0.646 | 0.731 | 0.733 | 0.741 |
Pregnancy, puerperium and perinatal conditions | 0.131 | 0.797 | 0.508 | 0.509 | 0.185 | 0.103 | 0.114 |
Ear and labyrinth disorders | 0.623 | 0.506 | 0.452 | 0.541 | 0.912 | 0.473 | 0.473 |
Cardiac disorders | 0.755 | 0.660 | 0.599 | 0.635 | 0.756 | 0.755 | 0.772 |
Nervous system disorders | 0.910 | 0.840 | 0.594 | 0.733 | 0.882 | 0.941 | 0.964 |
Injury, poisoning and procedural complications | 0.658 | 0.576 | 0.549 | 0.597 | 0.614 | 0.710 | 0.743 |
diff_bal_svc = base_bal_svc_report - base_svc_report
print("Changes in scores after balancing:")
diff_bal_svc
Changes in scores after balancing:
SOC | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | -0.058 | -0.015 | 0.002 | -0.001 | -0.173 | 0.021 | -0.005 |
Metabolism and nutrition disorders | -0.117 | -0.089 | 0.144 | 0.022 | -0.324 | 0.038 | 0.026 |
Eye disorders | -0.061 | -0.017 | 0.130 | 0.011 | -0.230 | 0.036 | 0.012 |
Musculoskeletal and connective tissue disorders | -0.073 | -0.041 | 0.181 | -0.015 | -0.259 | 0.057 | -0.021 |
Gastrointestinal disorders | -0.046 | -0.074 | 0.126 | 0.048 | -0.120 | 0.025 | 0.011 |
Immune system disorders | -0.065 | -0.062 | 0.132 | 0.035 | -0.200 | 0.028 | 0.014 |
Reproductive system and breast disorders | -0.002 | -0.001 | -0.001 | 0.000 | -0.003 | 0.000 | 0.000 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.331 | -0.105 | 0.115 | -0.095 | 0.350 | 0.314 | -0.139 |
General disorders and administration site conditions | -0.051 | -0.085 | 0.048 | -0.026 | -0.110 | 0.005 | -0.010 |
Endocrine disorders | 0.262 | -0.088 | 0.096 | -0.074 | 0.250 | 0.279 | -0.114 |
Surgical and medical procedures | 0.253 | -0.264 | 0.023 | 0.026 | 0.475 | 0.173 | -0.010 |
Vascular disorders | -0.074 | -0.079 | 0.146 | 0.040 | -0.220 | 0.041 | 0.008 |
Blood and lymphatic system disorders | -0.039 | 0.018 | 0.162 | -0.003 | -0.222 | 0.071 | -0.005 |
Skin and subcutaneous tissue disorders | -0.040 | -0.068 | 0.110 | 0.077 | -0.100 | 0.018 | 0.004 |
Congenital, familial and genetic disorders | 0.177 | -0.153 | 0.034 | -0.084 | 0.199 | 0.160 | -0.053 |
Infections and infestations | -0.087 | -0.059 | 0.184 | 0.002 | -0.289 | 0.069 | 0.004 |
Respiratory, thoracic and mediastinal disorders | -0.059 | -0.059 | 0.135 | 0.038 | -0.183 | 0.031 | 0.012 |
Psychiatric disorders | -0.050 | -0.026 | 0.181 | 0.006 | -0.204 | 0.053 | -0.001 |
Renal and urinary disorders | -0.050 | 0.013 | 0.232 | 0.001 | -0.269 | 0.092 | -0.015 |
Pregnancy, puerperium and perinatal conditions | 0.131 | -0.118 | 0.030 | -0.005 | 0.185 | 0.103 | -0.015 |
Ear and labyrinth disorders | 0.623 | -0.046 | 0.096 | -0.079 | 0.912 | 0.473 | -0.104 |
Cardiac disorders | -0.064 | -0.033 | 0.190 | -0.006 | -0.244 | 0.062 | -0.023 |
Nervous system disorders | -0.048 | -0.079 | 0.115 | 0.100 | -0.118 | 0.022 | 0.010 |
Injury, poisoning and procedural complications | -0.141 | -0.089 | 0.150 | 0.029 | -0.386 | 0.045 | 0.013 |
As the previous tables show, even though Average Precision did not change by much, for most labels the largest improvement was in the F1 Macro score. As this is the unweighted mean of the F1 scores of the negative and positive classes, we can conclude that an improvement of the F1 score of the minority class was the main consequence of the oversampling, as was to be expected.
Given these positive results, oversampling was applied when developing every model.
The next step in the model development was hyperparameter optimization. In SVC, two parameters, C and gamma, were optimized using grid search with cross-validation while using a Radial Basis Function (RBF) kernel. C is the cost of misclassifying training examples. Gamma is a parameter specific to the RBF kernel and can be seen as the inverse of the radius of influence of the samples selected as support vectors.
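To make the role of gamma more concrete, the sketch below evaluates the RBF kernel, $K(x, y) = \exp(-\gamma \lVert x - y \rVert^{2})$, for two toy binary vectors over the gamma values of the grid. The function `rbf_kernel` and the vectors are illustrative assumptions, not part of the notebook's code.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two toy fingerprint-like bit vectors that differ in 4 positions
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 1, 0, 1, 0, 1, 0, 0]

similarities = {gamma: rbf_kernel(fp_a, fp_b, gamma) for gamma in [0.001, 0.01, 0.1, 1]}
for gamma, value in similarities.items():
    print(gamma, round(value, 4))
```

With a small gamma the two vectors are still considered almost identical, so each support vector influences a wide region; with a large gamma the similarity drops quickly with distance and the influence of each support vector becomes local.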
# Searching best parameters
params_to_test = {"svc__kernel": ["rbf"], "svc__C": [0.01, 0.1, 1, 10],
"svc__gamma": [0.001, 0.01, 0.1, 1]}
d_params_to_test = {name: params_to_test for name in out_names}
"""The following code was previously executed and its output was saved"""
#best_SVC_params_by_label = multi_label_grid_search(X_train_dic, y_train, out_names[15:],
# SVC(gamma="auto", random_state=seed), d_params_to_test,
# balancing=True, n_splits=5, scoring="f1_micro", n_jobs=-2,
# verbose=True, random_state=seed)
pprint(best_SVC_params_by_label, width=-1)
{'Blood and lymphatic system disorders': {'svc__C': 10, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}, 'Cardiac disorders': {'svc__C': 10, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Congenital, familial and genetic disorders': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Ear and labyrinth disorders': {'svc__C': 10, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}, 'Endocrine disorders': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Eye disorders': {'svc__C': 10, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Gastrointestinal disorders': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'General disorders and administration site conditions': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Hepatobiliary disorders': {'svc__C': 1, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}, 'Immune system disorders': {'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Infections and infestations': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Injury, poisoning and procedural complications': {'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Metabolism and nutrition disorders': {'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Musculoskeletal and connective tissue disorders': {'svc__C': 10, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Nervous system disorders': {'svc__C': 1, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Pregnancy, puerperium and perinatal conditions': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Psychiatric disorders': {'svc__C': 1, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Renal and urinary disorders': {'svc__C': 1, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}, 'Reproductive system and breast disorders': {'svc__C': 10, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}, 'Respiratory, thoracic and mediastinal disorders': {'svc__C': 10, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Skin and 
subcutaneous tissue disorders': {'svc__C': 0.01, 'svc__gamma': 1, 'svc__kernel': 'rbf'}, 'Surgical and medical procedures': {'svc__C': 0.01, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}, 'Vascular disorders': {'svc__C': 10, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}}
print("Improved SVC with balancing:")
impr_bal_svc_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamesvc,
spec_params=best_SVC_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
verbose=False, random_state=seed)
Improved SVC with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [26:49<00:00, 65.60s/it]
print("Scores for optimized SVC with balancing:")
impr_bal_svc_report
Scores for optimized SVC with balancing:
SOC | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.709 | 0.682 | 0.678 | 0.730 | 0.751 | 0.674 | 0.701 |
Metabolism and nutrition disorders | 0.810 | 0.693 | 0.507 | 0.597 | 0.944 | 0.709 | 0.757 |
Eye disorders | 0.754 | 0.637 | 0.532 | 0.633 | 0.906 | 0.646 | 0.720 |
Musculoskeletal and connective tissue disorders | 0.812 | 0.704 | 0.553 | 0.669 | 0.932 | 0.720 | 0.798 |
Gastrointestinal disorders | 0.951 | 0.907 | 0.476 | 0.617 | 1.000 | 0.907 | 0.929 |
Immune system disorders | 0.839 | 0.734 | 0.531 | 0.612 | 0.958 | 0.747 | 0.793 |
Reproductive system and breast disorders | 0.677 | 0.681 | 0.680 | 0.725 | 0.672 | 0.685 | 0.722 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.000 | 0.745 | 0.427 | 0.648 | 0.000 | 0.000 | 0.345 |
General disorders and administration site conditions | 0.953 | 0.910 | 0.476 | 0.589 | 1.000 | 0.910 | 0.931 |
Endocrine disorders | 0.000 | 0.786 | 0.440 | 0.574 | 0.000 | 0.000 | 0.262 |
Surgical and medical procedures | 0.000 | 0.855 | 0.461 | 0.594 | 0.000 | 0.000 | 0.207 |
Vascular disorders | 0.868 | 0.776 | 0.557 | 0.665 | 0.961 | 0.792 | 0.852 |
Blood and lymphatic system disorders | 0.729 | 0.673 | 0.658 | 0.718 | 0.719 | 0.740 | 0.789 |
Skin and subcutaneous tissue disorders | 0.962 | 0.927 | 0.481 | 0.626 | 1.000 | 0.927 | 0.941 |
Congenital, familial and genetic disorders | 0.000 | 0.824 | 0.452 | 0.563 | 0.000 | 0.000 | 0.205 |
Infections and infestations | 0.830 | 0.710 | 0.415 | 0.625 | 1.000 | 0.710 | 0.779 |
Respiratory, thoracic and mediastinal disorders | 0.856 | 0.752 | 0.478 | 0.594 | 0.986 | 0.757 | 0.798 |
Psychiatric disorders | 0.824 | 0.705 | 0.457 | 0.594 | 0.986 | 0.707 | 0.753 |
Renal and urinary disorders | 0.757 | 0.670 | 0.620 | 0.673 | 0.804 | 0.715 | 0.767 |
Pregnancy, puerperium and perinatal conditions | 0.019 | 0.916 | 0.488 | 0.504 | 0.010 | 0.200 | 0.107 |
Ear and labyrinth disorders | 0.558 | 0.586 | 0.585 | 0.614 | 0.583 | 0.535 | 0.547 |
Cardiac disorders | 0.809 | 0.695 | 0.523 | 0.688 | 0.934 | 0.714 | 0.820 |
Nervous system disorders | 0.958 | 0.920 | 0.499 | 0.595 | 1.000 | 0.920 | 0.922 |
Injury, poisoning and procedural complications | 0.790 | 0.669 | 0.505 | 0.614 | 0.934 | 0.684 | 0.754 |
print("Random Forest")
print("Base RF without balancing:")
base_rf_report = cv_multi_report(X_train_dic, y_train, out_names,
RandomForestClassifier(n_estimators=100, random_state=seed), n_splits=5, n_jobs=-2,
verbose=False)
Random Forest Base RF without balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [00:18<00:00, 1.53it/s]
print("Scores for RF without balancing:")
base_rf_report
Scores for RF without balancing:
SOC | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.723 | 0.700 | 0.698 | 0.756 | 0.754 | 0.694 | 0.737 |
Metabolism and nutrition disorders | 0.802 | 0.687 | 0.527 | 0.582 | 0.915 | 0.714 | 0.744 |
Eye disorders | 0.747 | 0.660 | 0.613 | 0.655 | 0.823 | 0.685 | 0.733 |
Musculoskeletal and connective tissue disorders | 0.809 | 0.707 | 0.590 | 0.662 | 0.902 | 0.734 | 0.799 |
Gastrointestinal disorders | 0.949 | 0.904 | 0.524 | 0.704 | 0.990 | 0.911 | 0.948 |
Immune system disorders | 0.821 | 0.707 | 0.509 | 0.605 | 0.925 | 0.738 | 0.796 |
Reproductive system and breast disorders | 0.681 | 0.679 | 0.679 | 0.738 | 0.685 | 0.678 | 0.740 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.327 | 0.777 | 0.597 | 0.717 | 0.213 | 0.739 | 0.498 |
General disorders and administration site conditions | 0.951 | 0.906 | 0.501 | 0.643 | 0.993 | 0.912 | 0.942 |
Endocrine disorders | 0.263 | 0.792 | 0.571 | 0.693 | 0.172 | 0.568 | 0.425 |
Surgical and medical procedures | 0.127 | 0.859 | 0.525 | 0.631 | 0.073 | 0.533 | 0.301 |
Vascular disorders | 0.865 | 0.770 | 0.541 | 0.645 | 0.959 | 0.787 | 0.844 |
Blood and lymphatic system disorders | 0.766 | 0.687 | 0.647 | 0.711 | 0.835 | 0.707 | 0.784 |
Skin and subcutaneous tissue disorders | 0.961 | 0.925 | 0.502 | 0.603 | 0.995 | 0.929 | 0.942 |
Congenital, familial and genetic disorders | 0.062 | 0.816 | 0.480 | 0.581 | 0.035 | 0.313 | 0.244 |
Infections and infestations | 0.826 | 0.720 | 0.547 | 0.630 | 0.941 | 0.737 | 0.806 |
Respiratory, thoracic and mediastinal disorders | 0.846 | 0.741 | 0.522 | 0.588 | 0.947 | 0.764 | 0.798 |
Psychiatric disorders | 0.809 | 0.701 | 0.559 | 0.652 | 0.905 | 0.732 | 0.805 |
Renal and urinary disorders | 0.777 | 0.680 | 0.606 | 0.693 | 0.869 | 0.703 | 0.795 |
Pregnancy, puerperium and perinatal conditions | 0.107 | 0.911 | 0.530 | 0.529 | 0.062 | 0.400 | 0.176 |
Ear and labyrinth disorders | 0.542 | 0.618 | 0.606 | 0.663 | 0.507 | 0.588 | 0.627 |
Cardiac disorders | 0.808 | 0.702 | 0.569 | 0.670 | 0.906 | 0.730 | 0.812 |
Nervous system disorders | 0.955 | 0.914 | 0.514 | 0.664 | 0.991 | 0.921 | 0.950 |
Injury, poisoning and procedural complications | 0.776 | 0.660 | 0.532 | 0.601 | 0.888 | 0.690 | 0.753 |
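The helper `cv_multi_report` is imported from `mlprocess` and its body is not shown here. As a rough sketch only, a per-label report with the same seven columns could be assembled with scikit-learn's `cross_validate`. The function name `cv_report_sketch`, the synthetic data, and the single shared feature matrix are assumptions made for illustration (the notebook passes a dictionary of feature sets, `X_train_dic`).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Report column -> scikit-learn scorer string.
SCORERS = {
    "F1 Binary": "f1",
    "F1 Micro": "f1_micro",
    "F1 Macro": "f1_macro",
    "ROC_AUC": "roc_auc",
    "Recall": "recall",
    "Precision": "precision",
    "Average Precision": "average_precision",
}

def cv_report_sketch(X, Y, label_names, model, n_splits=5, seed=6):
    """Hypothetical stand-in for cv_multi_report: one binary task per label."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = {}
    for j, name in enumerate(label_names):
        # cross_validate clones the model for every fold.
        scores = cross_validate(model, X, Y[:, j], cv=cv, scoring=SCORERS)
        rows[name] = {m: scores[f"test_{m}"].mean() for m in SCORERS}
    return pd.DataFrame.from_dict(rows, orient="index").round(3)

# Synthetic stand-in data: 200 samples, 2 labels (not the SIDER features).
X, y0 = make_classification(n_samples=200, n_features=20, random_state=6)
rng = np.random.default_rng(6)
y1 = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
Y = np.column_stack([y0, y1])

report = cv_report_sketch(X, Y, ["label A", "label B"],
                          RandomForestClassifier(n_estimators=50, random_state=6))
print(report)
```

Each row is the mean over the folds, so a single table can hide considerable fold-to-fold variation for labels with few positive examples.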
print("Base RF with balancing:")
base_bal_rf_report = cv_multi_report(X_train_dic, y_train, out_names,
RandomForestClassifier(n_estimators=100, random_state=seed), balancing=True,
n_splits=5, n_jobs=-2, verbose=False, random_state=seed)
Base RF with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [19:11<00:00, 46.34s/it]
print("Scores for RF with balancing:")
base_bal_rf_report
Scores for RF with balancing:
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.718 | 0.704 | 0.703 | 0.753 | 0.729 | 0.709 | 0.734 |
Metabolism and nutrition disorders | 0.770 | 0.661 | 0.562 | 0.614 | 0.820 | 0.726 | 0.770 |
Eye disorders | 0.735 | 0.658 | 0.626 | 0.671 | 0.777 | 0.698 | 0.747 |
Musculoskeletal and connective tissue disorders | 0.781 | 0.685 | 0.608 | 0.671 | 0.818 | 0.749 | 0.809 |
Gastrointestinal disorders | 0.942 | 0.892 | 0.597 | 0.732 | 0.962 | 0.922 | 0.953 |
Immune system disorders | 0.804 | 0.699 | 0.576 | 0.640 | 0.851 | 0.761 | 0.815 |
Reproductive system and breast disorders | 0.677 | 0.677 | 0.676 | 0.734 | 0.680 | 0.675 | 0.731 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.383 | 0.748 | 0.612 | 0.700 | 0.309 | 0.507 | 0.429 |
General disorders and administration site conditions | 0.939 | 0.887 | 0.540 | 0.629 | 0.964 | 0.916 | 0.937 |
Endocrine disorders | 0.310 | 0.776 | 0.588 | 0.664 | 0.234 | 0.474 | 0.366 |
Surgical and medical procedures | 0.153 | 0.845 | 0.534 | 0.618 | 0.096 | 0.399 | 0.256 |
Vascular disorders | 0.852 | 0.758 | 0.596 | 0.670 | 0.904 | 0.805 | 0.861 |
Blood and lymphatic system disorders | 0.738 | 0.667 | 0.640 | 0.724 | 0.765 | 0.713 | 0.800 |
Skin and subcutaneous tissue disorders | 0.952 | 0.909 | 0.579 | 0.627 | 0.967 | 0.937 | 0.948 |
Congenital, familial and genetic disorders | 0.109 | 0.799 | 0.498 | 0.572 | 0.070 | 0.267 | 0.235 |
Infections and infestations | 0.792 | 0.692 | 0.600 | 0.673 | 0.826 | 0.761 | 0.832 |
Respiratory, thoracic and mediastinal disorders | 0.828 | 0.722 | 0.548 | 0.617 | 0.896 | 0.771 | 0.816 |
Psychiatric disorders | 0.806 | 0.708 | 0.609 | 0.678 | 0.865 | 0.754 | 0.822 |
Renal and urinary disorders | 0.762 | 0.670 | 0.613 | 0.695 | 0.822 | 0.710 | 0.800 |
Pregnancy, puerperium and perinatal conditions | 0.175 | 0.911 | 0.564 | 0.560 | 0.113 | 0.410 | 0.192 |
Ear and labyrinth disorders | 0.528 | 0.591 | 0.583 | 0.612 | 0.513 | 0.546 | 0.557 |
Cardiac disorders | 0.795 | 0.698 | 0.607 | 0.708 | 0.848 | 0.749 | 0.844 |
Nervous system disorders | 0.948 | 0.904 | 0.600 | 0.721 | 0.966 | 0.932 | 0.961 |
Injury, poisoning and procedural complications | 0.753 | 0.652 | 0.581 | 0.624 | 0.796 | 0.715 | 0.768 |
The base RF model not only performed better than the base SVC model on most of the classification tasks, it also outperformed the optimized SVC model on some of them. Oversampling the minority class mainly improved the balance between the classes: F1 Macro rose for 21 of the 24 labels, although F1 Binary fell slightly for most labels.
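The balancing step inside `cv_multi_report` is not shown, so the exact oversampling method is unknown. The sketch below assumes plain random oversampling (duplicating minority-class rows) and illustrates one point that matters regardless of the method: only the training fold is balanced, while the validation fold keeps its original class distribution. The helper `oversample_minority` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10))
# Imbalanced synthetic label in which positives are the clear majority.
y = (X[:, 0] + 0.5 * rng.normal(size=300) > -1.1).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6)
macro_f1 = []
for train_idx, val_idx in cv.split(X, y):
    # Balance only the training fold to avoid leaking duplicated rows into validation.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], rng)
    model = RandomForestClassifier(n_estimators=50, random_state=6).fit(X_bal, y_bal)
    macro_f1.append(f1_score(y[val_idx], model.predict(X[val_idx]), average="macro"))

print(f"mean F1 Macro over folds: {np.mean(macro_f1):.3f}")
```

Oversampling before splitting would place copies of the same row in both the training and validation folds and inflate the scores.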
Hyperparameters for this model were optimized with a cross-validated random search over at least 150 parameter combinations. Random search was chosen over grid search because of the large number of possible combinations: the search space defined below contains 1,440 of them.
In RF, six parameters were considered:
n_estimators - number of trees in the forest;
max_features - number of features to consider when looking for the best split;
max_depth - maximum depth of the tree;
min_samples_split - minimum number of samples required to split a node;
min_samples_leaf - minimum number of samples required at a leaf node;
bootstrap - whether to use bootstrap samples or the whole dataset to build each tree.
# Searching best parameters
n_estimators = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
max_features = ["log2", "sqrt"]
max_depth = [50, 90, 130, 170, 210, 250]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2]
bootstrap = [True, False]
rf_grid = {"randomforestclassifier__n_estimators": n_estimators,
"randomforestclassifier__max_features": max_features,
"randomforestclassifier__max_depth": max_depth,
"randomforestclassifier__min_samples_split": min_samples_split,
"randomforestclassifier__min_samples_leaf": min_samples_leaf,
"randomforestclassifier__bootstrap": bootstrap}
rf_grid_label = {name: rf_grid for name in out_names}
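The helper `multi_label_random_search` used below is not shown. As a hedged sketch, a random search of this kind can be written with scikit-learn's `RandomizedSearchCV` on a pipeline; the `randomforestclassifier__` prefix in the grid keys is the step name that `make_pipeline` derives from the class name. The reduced `small_grid`, the synthetic data, and the small `n_iter` are assumptions chosen so that the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Full search space, with the same values as the grid defined above.
full_grid = {
    "randomforestclassifier__n_estimators": list(range(100, 1001, 100)),
    "randomforestclassifier__max_features": ["log2", "sqrt"],
    "randomforestclassifier__max_depth": [50, 90, 130, 170, 210, 250],
    "randomforestclassifier__min_samples_split": [2, 5, 10],
    "randomforestclassifier__min_samples_leaf": [1, 2],
    "randomforestclassifier__bootstrap": [True, False],
}
n_combinations = len(ParameterGrid(full_grid))
print(n_combinations)  # 10 * 2 * 6 * 3 * 2 * 2 = 1440

# Reduced space for a quick run (illustrative values only).
small_grid = dict(full_grid)
small_grid["randomforestclassifier__n_estimators"] = [20, 40]

X, y = make_classification(n_samples=200, n_features=20, random_state=6)
# make_pipeline names each step after the lower-cased class name.
pipe = make_pipeline(RandomForestClassifier(random_state=6))
search = RandomizedSearchCV(pipe, small_grid, n_iter=5, cv=3,
                            scoring="f1_micro", random_state=6)
search.fit(X, y)
print(search.best_params_)
```

With 150 iterations, a random search evaluates roughly a tenth of the 1,440 combinations that an exhaustive grid search would need.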
"""The following code was previously executed and its output was saved"""
# best_RF_params_by_label = multi_label_random_search(X_train_dic, y_train, out_names[20:],
# RandomForestClassifier(random_state=seed), rf_grid_label,
# balancing=True, n_splits=3, scoring="f1_micro", n_jobs=-2,
# verbose=True, random_state=seed, n_iter=150)
pprint(best_RF_params_by_label, width=-1)
{'Blood and lymphatic system disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 600}, 'Cardiac disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__n_estimators': 700}, 'Congenital, familial and genetic disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 170, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__n_estimators': 300}, 'Ear and labyrinth disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 500}, 'Endocrine disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 400}, 'Eye disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 170, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 400}, 'Gastrointestinal disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 90, 'randomforestclassifier__max_features': 
'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 900}, 'General disorders and administration site conditions': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 90, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__n_estimators': 400}, 'Hepatobiliary disorders': {'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 250, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 200}, 'Immune system disorders': {'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 250, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 600}, 'Infections and infestations': {'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 130, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__n_estimators': 200}, 'Injury, poisoning and procedural complications': {'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 1000}, 'Metabolism and nutrition disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 90, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 5, 
'randomforestclassifier__n_estimators': 400}, 'Musculoskeletal and connective tissue disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 170, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 800}, 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 400}, 'Nervous system disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 250, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 300}, 'Pregnancy, puerperium and perinatal conditions': {'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__n_estimators': 700}, 'Psychiatric disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__n_estimators': 100}, 'Renal and urinary disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 170, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__n_estimators': 500}, 'Reproductive system and breast 
disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 200}, 'Respiratory, thoracic and mediastinal disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 100}, 'Skin and subcutaneous tissue disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__n_estimators': 300}, 'Surgical and medical procedures': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 210, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 800}, 'Vascular disorders': {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__max_depth': 170, 'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 400}}
print("Improved RF with balancing:")
impr_bal_rf_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamerf,
spec_params=best_RF_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
verbose=False, random_state=seed)
Improved RF with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [18:50<00:00, 45.88s/it]
print("Scores for optimized RF with balancing:")
impr_bal_rf_report
Scores for optimized RF with balancing:
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.721 | 0.698 | 0.695 | 0.756 | 0.754 | 0.691 | 0.742 |
Metabolism and nutrition disorders | 0.776 | 0.666 | 0.557 | 0.610 | 0.838 | 0.724 | 0.761 |
Eye disorders | 0.745 | 0.661 | 0.618 | 0.677 | 0.811 | 0.690 | 0.755 |
Musculoskeletal and connective tissue disorders | 0.795 | 0.698 | 0.608 | 0.685 | 0.853 | 0.745 | 0.825 |
Gastrointestinal disorders | 0.945 | 0.897 | 0.591 | 0.727 | 0.971 | 0.920 | 0.952 |
Immune system disorders | 0.814 | 0.709 | 0.569 | 0.634 | 0.880 | 0.758 | 0.817 |
Reproductive system and breast disorders | 0.684 | 0.679 | 0.679 | 0.740 | 0.695 | 0.674 | 0.752 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.390 | 0.767 | 0.623 | 0.714 | 0.295 | 0.581 | 0.444 |
General disorders and administration site conditions | 0.941 | 0.890 | 0.554 | 0.643 | 0.965 | 0.918 | 0.943 |
Endocrine disorders | 0.289 | 0.778 | 0.579 | 0.666 | 0.213 | 0.469 | 0.378 |
Surgical and medical procedures | 0.123 | 0.851 | 0.521 | 0.645 | 0.072 | 0.427 | 0.277 |
Vascular disorders | 0.856 | 0.761 | 0.570 | 0.682 | 0.927 | 0.796 | 0.872 |
Blood and lymphatic system disorders | 0.739 | 0.664 | 0.634 | 0.727 | 0.775 | 0.707 | 0.815 |
Skin and subcutaneous tissue disorders | 0.954 | 0.912 | 0.583 | 0.614 | 0.971 | 0.937 | 0.949 |
Congenital, familial and genetic disorders | 0.089 | 0.804 | 0.490 | 0.563 | 0.055 | 0.246 | 0.227 |
Infections and infestations | 0.798 | 0.697 | 0.595 | 0.672 | 0.843 | 0.757 | 0.833 |
Respiratory, thoracic and mediastinal disorders | 0.839 | 0.736 | 0.551 | 0.613 | 0.919 | 0.772 | 0.819 |
Psychiatric disorders | 0.800 | 0.699 | 0.594 | 0.663 | 0.861 | 0.747 | 0.810 |
Renal and urinary disorders | 0.767 | 0.677 | 0.622 | 0.698 | 0.826 | 0.715 | 0.798 |
Pregnancy, puerperium and perinatal conditions | 0.165 | 0.913 | 0.560 | 0.534 | 0.103 | 0.460 | 0.195 |
Ear and labyrinth disorders | 0.528 | 0.596 | 0.587 | 0.635 | 0.505 | 0.553 | 0.595 |
Cardiac disorders | 0.807 | 0.710 | 0.609 | 0.718 | 0.876 | 0.749 | 0.852 |
Nervous system disorders | 0.952 | 0.910 | 0.591 | 0.708 | 0.975 | 0.930 | 0.958 |
Injury, poisoning and procedural complications | 0.766 | 0.662 | 0.575 | 0.627 | 0.834 | 0.710 | 0.771 |
print("XGB")
print("Base XGB without balancing:")
base_xgb_report = cv_multi_report(X_train_dic, y_train, out_names,
xgb.XGBClassifier(objective="binary:logistic", random_state=seed), n_splits=5,
n_jobs=-2, verbose=False)
XGB
Base XGB without balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [01:31<00:00, 3.50s/it]
base_xgb_report
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.700 | 0.674 | 0.671 | 0.723 | 0.734 | 0.670 | 0.713 |
Metabolism and nutrition disorders | 0.808 | 0.685 | 0.469 | 0.565 | 0.956 | 0.700 | 0.748 |
Eye disorders | 0.735 | 0.623 | 0.536 | 0.613 | 0.859 | 0.644 | 0.701 |
Musculoskeletal and connective tissue disorders | 0.803 | 0.689 | 0.531 | 0.653 | 0.922 | 0.711 | 0.795 |
Gastrointestinal disorders | 0.951 | 0.908 | 0.545 | 0.754 | 0.992 | 0.914 | 0.961 |
Immune system disorders | 0.823 | 0.706 | 0.482 | 0.602 | 0.940 | 0.732 | 0.806 |
Reproductive system and breast disorders | 0.637 | 0.648 | 0.647 | 0.689 | 0.622 | 0.656 | 0.684 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.214 | 0.750 | 0.533 | 0.678 | 0.134 | 0.544 | 0.438 |
General disorders and administration site conditions | 0.952 | 0.908 | 0.476 | 0.655 | 0.998 | 0.910 | 0.949 |
Endocrine disorders | 0.221 | 0.790 | 0.549 | 0.640 | 0.140 | 0.574 | 0.390 |
Surgical and medical procedures | 0.031 | 0.849 | 0.475 | 0.645 | 0.018 | 0.124 | 0.256 |
Vascular disorders | 0.867 | 0.769 | 0.490 | 0.634 | 0.981 | 0.777 | 0.846 |
Blood and lymphatic system disorders | 0.743 | 0.651 | 0.599 | 0.697 | 0.827 | 0.676 | 0.777 |
Skin and subcutaneous tissue disorders | 0.963 | 0.929 | 0.504 | 0.631 | 1.000 | 0.929 | 0.951 |
Congenital, familial and genetic disorders | 0.028 | 0.821 | 0.465 | 0.584 | 0.015 | 0.233 | 0.238 |
Infections and infestations | 0.820 | 0.706 | 0.506 | 0.664 | 0.943 | 0.726 | 0.835 |
Respiratory, thoracic and mediastinal disorders | 0.852 | 0.748 | 0.490 | 0.607 | 0.974 | 0.758 | 0.815 |
Psychiatric disorders | 0.810 | 0.691 | 0.491 | 0.609 | 0.941 | 0.710 | 0.774 |
Renal and urinary disorders | 0.754 | 0.639 | 0.536 | 0.648 | 0.866 | 0.668 | 0.754 |
Pregnancy, puerperium and perinatal conditions | 0.037 | 0.906 | 0.494 | 0.538 | 0.021 | 0.250 | 0.146 |
Ear and labyrinth disorders | 0.505 | 0.613 | 0.593 | 0.630 | 0.444 | 0.589 | 0.584 |
Cardiac disorders | 0.808 | 0.692 | 0.509 | 0.639 | 0.938 | 0.710 | 0.793 |
Nervous system disorders | 0.957 | 0.917 | 0.488 | 0.731 | 0.997 | 0.919 | 0.964 |
Injury, poisoning and procedural complications | 0.773 | 0.642 | 0.461 | 0.599 | 0.917 | 0.668 | 0.751 |
print("Base XGB with balancing:")
base_bal_xgb_report = cv_multi_report(X_train_dic, y_train, out_names,
xgb.XGBClassifier(objective="binary:logistic", random_state=seed), balancing=True,
n_splits=5, n_jobs=-2, verbose=False, random_state=seed)
Base XGB with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [18:56<00:00, 47.61s/it]
base_bal_xgb_report
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.691 | 0.677 | 0.676 | 0.724 | 0.698 | 0.687 | 0.713 |
Metabolism and nutrition disorders | 0.733 | 0.626 | 0.553 | 0.580 | 0.743 | 0.724 | 0.744 |
Eye disorders | 0.699 | 0.622 | 0.594 | 0.624 | 0.720 | 0.682 | 0.706 |
Musculoskeletal and connective tissue disorders | 0.740 | 0.645 | 0.589 | 0.645 | 0.735 | 0.746 | 0.796 |
Gastrointestinal disorders | 0.934 | 0.881 | 0.650 | 0.757 | 0.932 | 0.936 | 0.959 |
Immune system disorders | 0.775 | 0.671 | 0.580 | 0.618 | 0.781 | 0.769 | 0.817 |
Reproductive system and breast disorders | 0.624 | 0.634 | 0.633 | 0.683 | 0.611 | 0.640 | 0.677 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.382 | 0.699 | 0.591 | 0.655 | 0.364 | 0.405 | 0.413 |
General disorders and administration site conditions | 0.929 | 0.869 | 0.547 | 0.622 | 0.940 | 0.917 | 0.941 |
Endocrine disorders | 0.308 | 0.744 | 0.575 | 0.615 | 0.262 | 0.383 | 0.348 |
Surgical and medical procedures | 0.188 | 0.791 | 0.534 | 0.607 | 0.169 | 0.221 | 0.222 |
Vascular disorders | 0.812 | 0.707 | 0.570 | 0.611 | 0.827 | 0.799 | 0.827 |
Blood and lymphatic system disorders | 0.701 | 0.638 | 0.621 | 0.693 | 0.693 | 0.709 | 0.771 |
Skin and subcutaneous tissue disorders | 0.939 | 0.886 | 0.576 | 0.650 | 0.939 | 0.939 | 0.956 |
Congenital, familial and genetic disorders | 0.243 | 0.760 | 0.550 | 0.579 | 0.219 | 0.275 | 0.233 |
Infections and infestations | 0.779 | 0.685 | 0.614 | 0.679 | 0.781 | 0.777 | 0.836 |
Respiratory, thoracic and mediastinal disorders | 0.813 | 0.705 | 0.560 | 0.601 | 0.854 | 0.776 | 0.807 |
Psychiatric disorders | 0.779 | 0.677 | 0.589 | 0.623 | 0.815 | 0.746 | 0.777 |
Renal and urinary disorders | 0.725 | 0.642 | 0.605 | 0.648 | 0.737 | 0.714 | 0.745 |
Pregnancy, puerperium and perinatal conditions | 0.197 | 0.895 | 0.570 | 0.532 | 0.154 | 0.280 | 0.181 |
Ear and labyrinth disorders | 0.563 | 0.578 | 0.577 | 0.610 | 0.609 | 0.525 | 0.571 |
Cardiac disorders | 0.772 | 0.677 | 0.609 | 0.657 | 0.789 | 0.756 | 0.807 |
Nervous system disorders | 0.931 | 0.875 | 0.611 | 0.705 | 0.925 | 0.938 | 0.960 |
Injury, poisoning and procedural complications | 0.721 | 0.630 | 0.586 | 0.606 | 0.719 | 0.723 | 0.749 |
The hyperparameter optimization in XGB was done in a similar way to RF: cross-validation random search with at least 150 combinations of parameters. In this model, six parameters were considered:
eta - step size shrinkage applied after each boosting step (learning rate);
min_child_weight - minimum sum of instance weights needed in a child node;
max_depth - maximum depth of a tree;
gamma - minimum loss reduction required to make a further partition on a leaf node;
subsample - fraction of the training instances sampled for each tree;
colsample_bytree - fraction of features sampled when constructing each tree.
eta = [0.05, 0.1, 0.2]
min_child_weight = [1, 3]
max_depth = [5, 7, 9]
gamma = [0, 0.1, 0.2, 0.3, 0.4]
subsample = [0.6, 0.7, 0.8, 0.9]
colsample_bytree = [0.6, 0.7, 0.8, 0.9]
xgb_grid = {"xgbclassifier__eta": eta,
"xgbclassifier__min_child_weight": min_child_weight,
"xgbclassifier__max_depth": max_depth,
"xgbclassifier__gamma": gamma,
"xgbclassifier__subsample": subsample,
"xgbclassifier__colsample_bytree": colsample_bytree
}
xgb_grid_label = {name: xgb_grid for name in out_names}
"""The following code was previously executed and its output was saved"""
# best_xgb_params_by_label = multi_label_random_search(X_train_dic, y_train, out_names[20:],
# xgb.XGBClassifier(objective="binary:logistic", random_state=seed),
# xgb_grid_label, balancing=True, n_splits=3, scoring="f1_micro",
# n_jobs=-2, verbose=True, random_state=seed, n_iter=150)
print("Improved XGB with balancing:")
impr_bal_xgb_report = cv_multi_report(X_train_dic, y_train, out_names, modelname=modelnamexgb,
spec_params=best_xgb_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
verbose=False, random_state=seed)
Improved XGB with balancing:
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [20:04<00:00, 50.80s/it]
impr_bal_xgb_report
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.688 | 0.678 | 0.678 | 0.729 | 0.686 | 0.692 | 0.707 |
Metabolism and nutrition disorders | 0.758 | 0.648 | 0.554 | 0.587 | 0.796 | 0.724 | 0.754 |
Eye disorders | 0.720 | 0.644 | 0.616 | 0.644 | 0.747 | 0.695 | 0.716 |
Musculoskeletal and connective tissue disorders | 0.764 | 0.662 | 0.581 | 0.652 | 0.798 | 0.734 | 0.803 |
Gastrointestinal disorders | 0.940 | 0.889 | 0.608 | 0.723 | 0.956 | 0.924 | 0.951 |
Immune system disorders | 0.780 | 0.672 | 0.570 | 0.621 | 0.800 | 0.761 | 0.823 |
Reproductive system and breast disorders | 0.665 | 0.667 | 0.667 | 0.729 | 0.662 | 0.669 | 0.733 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.422 | 0.739 | 0.626 | 0.680 | 0.374 | 0.484 | 0.445 |
General disorders and administration site conditions | 0.936 | 0.881 | 0.534 | 0.665 | 0.958 | 0.915 | 0.949 |
Endocrine disorders | 0.317 | 0.759 | 0.585 | 0.637 | 0.258 | 0.420 | 0.372 |
Surgical and medical procedures | 0.171 | 0.818 | 0.534 | 0.605 | 0.133 | 0.248 | 0.233 |
Vascular disorders | 0.826 | 0.723 | 0.570 | 0.626 | 0.859 | 0.797 | 0.836 |
Blood and lymphatic system disorders | 0.733 | 0.666 | 0.643 | 0.720 | 0.749 | 0.718 | 0.801 |
Skin and subcutaneous tissue disorders | 0.946 | 0.899 | 0.581 | 0.619 | 0.955 | 0.938 | 0.944 |
Congenital, familial and genetic disorders | 0.177 | 0.774 | 0.523 | 0.586 | 0.139 | 0.249 | 0.234 |
Infections and infestations | 0.784 | 0.688 | 0.610 | 0.685 | 0.796 | 0.773 | 0.845 |
Respiratory, thoracic and mediastinal disorders | 0.821 | 0.715 | 0.562 | 0.612 | 0.872 | 0.775 | 0.812 |
Psychiatric disorders | 0.791 | 0.691 | 0.597 | 0.627 | 0.838 | 0.749 | 0.775 |
Renal and urinary disorders | 0.751 | 0.666 | 0.622 | 0.671 | 0.785 | 0.720 | 0.774 |
Pregnancy, puerperium and perinatal conditions | 0.172 | 0.899 | 0.559 | 0.536 | 0.124 | 0.300 | 0.195 |
Ear and labyrinth disorders | 0.550 | 0.585 | 0.582 | 0.620 | 0.566 | 0.536 | 0.577 |
Cardiac disorders | 0.780 | 0.685 | 0.610 | 0.680 | 0.808 | 0.755 | 0.822 |
Nervous system disorders | 0.937 | 0.884 | 0.586 | 0.721 | 0.944 | 0.931 | 0.957 |
Injury, poisoning and procedural complications | 0.733 | 0.629 | 0.562 | 0.620 | 0.765 | 0.704 | 0.772 |
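One observed pattern was that XGB improved on RF for the labels where positive examples are the minority class: the optimized XGB reached a higher F1 Binary than the optimized RF for "Neoplasms..." (0.422 vs 0.390), "Endocrine disorders" (0.317 vs 0.289), "Surgical and medical procedures" (0.171 vs 0.123), "Congenital..." (0.177 vs 0.089) and "Pregnancy..." (0.172 vs 0.165). The model selected for each label was: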
pprint(best_model_by_label)
{'Blood and lymphatic system disorders': 'RF', 'Cardiac disorders': 'RF', 'Congenital, familial and genetic disorders': 'XGB', 'Ear and labyrinth disorders': 'SVC', 'Endocrine disorders': 'XGB', 'Eye disorders': 'RF', 'Gastrointestinal disorders': 'RF', 'General disorders and administration site conditions': 'RF', 'Hepatobiliary disorders': 'RF', 'Immune system disorders': 'SVC', 'Infections and infestations': 'RF', 'Injury, poisoning and procedural complications': 'RF', 'Metabolism and nutrition disorders': 'RF', 'Musculoskeletal and connective tissue disorders': 'RF', 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': 'XGB', 'Nervous system disorders': 'RF', 'Pregnancy, puerperium and perinatal conditions': 'XGB', 'Psychiatric disorders': 'RF', 'Renal and urinary disorders': 'RF', 'Reproductive system and breast disorders': 'RF', 'Respiratory, thoracic and mediastinal disorders': 'RF', 'Skin and subcutaneous tissue disorders': 'RF', 'Surgical and medical procedures': 'XGB', 'Vascular disorders': 'SVC'}
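The selection criterion behind `best_model_by_label` is not stated in this section. The reported tables are consistent with picking, for each label, the model with the highest cross-validated F1 Binary, so the sketch below assumes that rule. It uses only three labels and the optimized RF and XGB scores copied from the tables above; SVC is left out because its table is not part of this section.

```python
import pandas as pd

labels = [
    "Hepatobiliary disorders",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)",
    "Congenital, familial and genetic disorders",
]

# CV F1 Binary of the optimized, balanced models (values from the tables above).
f1_binary = pd.DataFrame(
    {"RF": [0.721, 0.390, 0.089], "XGB": [0.688, 0.422, 0.177]},
    index=labels,
)

# Assumed rule: keep the model with the highest F1 Binary for each label.
best = f1_binary.idxmax(axis=1).to_dict()
print(best)
```

Under this assumed rule the three labels map to RF, XGB and XGB, which matches the corresponding entries printed above.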
scores_best_model = cv_multi_report(X_train_dic, y_train, out_names, modelname=best_model_by_label,
spec_params=best_model_params_by_label, balancing=True, n_splits=5, n_jobs=-2,
verbose=False, random_state=seed)
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [19:05<00:00, 47.24s/it]
print("CV scores for best model by label")
scores_best_model
CV scores for best model by label
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Precision |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.721 | 0.698 | 0.695 | 0.756 | 0.754 | 0.691 | 0.742 |
Metabolism and nutrition disorders | 0.776 | 0.666 | 0.557 | 0.610 | 0.838 | 0.724 | 0.761 |
Eye disorders | 0.745 | 0.661 | 0.618 | 0.677 | 0.811 | 0.690 | 0.755 |
Musculoskeletal and connective tissue disorders | 0.795 | 0.698 | 0.608 | 0.685 | 0.853 | 0.745 | 0.825 |
Gastrointestinal disorders | 0.945 | 0.897 | 0.591 | 0.727 | 0.971 | 0.920 | 0.952 |
Immune system disorders | 0.839 | 0.734 | 0.531 | 0.612 | 0.958 | 0.747 | 0.793 |
Reproductive system and breast disorders | 0.684 | 0.679 | 0.679 | 0.740 | 0.695 | 0.674 | 0.752 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.422 | 0.739 | 0.626 | 0.680 | 0.374 | 0.484 | 0.445 |
General disorders and administration site conditions | 0.941 | 0.890 | 0.554 | 0.643 | 0.965 | 0.918 | 0.943 |
Endocrine disorders | 0.317 | 0.759 | 0.585 | 0.637 | 0.258 | 0.420 | 0.372 |
Surgical and medical procedures | 0.171 | 0.818 | 0.534 | 0.605 | 0.133 | 0.248 | 0.233 |
Vascular disorders | 0.868 | 0.776 | 0.557 | 0.665 | 0.961 | 0.792 | 0.852 |
Blood and lymphatic system disorders | 0.739 | 0.664 | 0.634 | 0.727 | 0.775 | 0.707 | 0.815 |
Skin and subcutaneous tissue disorders | 0.954 | 0.912 | 0.583 | 0.614 | 0.971 | 0.937 | 0.949 |
Congenital, familial and genetic disorders | 0.177 | 0.774 | 0.523 | 0.586 | 0.139 | 0.249 | 0.234 |
Infections and infestations | 0.798 | 0.697 | 0.595 | 0.672 | 0.843 | 0.757 | 0.833 |
Respiratory, thoracic and mediastinal disorders | 0.839 | 0.736 | 0.551 | 0.613 | 0.919 | 0.772 | 0.819 |
Psychiatric disorders | 0.800 | 0.699 | 0.594 | 0.663 | 0.861 | 0.747 | 0.810 |
Renal and urinary disorders | 0.767 | 0.677 | 0.622 | 0.698 | 0.826 | 0.715 | 0.798 |
Pregnancy, puerperium and perinatal conditions | 0.172 | 0.899 | 0.559 | 0.536 | 0.124 | 0.300 | 0.195 |
Ear and labyrinth disorders | 0.558 | 0.586 | 0.585 | 0.614 | 0.583 | 0.535 | 0.547 |
Cardiac disorders | 0.807 | 0.710 | 0.609 | 0.718 | 0.876 | 0.749 | 0.852 |
Nervous system disorders | 0.952 | 0.910 | 0.591 | 0.708 | 0.975 | 0.930 | 0.958 |
Injury, poisoning and procedural complications | 0.766 | 0.662 | 0.575 | 0.627 | 0.834 | 0.710 | 0.771 |
After this selection, the respective models were evaluated on the test dataset. The results for some of the metrics are shown in the following table.
print("Test scores for best model by label")
test_scores_best_model = test_score_multi_report(X_train_dic, y_train, X_test_dic, y_test, out_names,
modelname=best_model_by_label, spec_params=best_model_params_by_label,
random_state=seed, verbose=False, balancing=True, n_jobs=-2, plot=False)
Test scores for best model by label
100%|██████████████████████████████████████████████████████████████████████████████████| 24/24 [23:27<00:00, 58.07s/it]
test_scores_best_model
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Prec-Rec |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.720 | 0.678 | 0.671 | 0.671 | 0.771 | 0.674 | 0.769 |
Metabolism and nutrition disorders | 0.825 | 0.727 | 0.603 | 0.597 | 0.893 | 0.767 | 0.812 |
Eye disorders | 0.770 | 0.685 | 0.635 | 0.634 | 0.858 | 0.699 | 0.812 |
Musculoskeletal and connective tissue disorders | 0.844 | 0.755 | 0.635 | 0.624 | 0.892 | 0.802 | 0.859 |
Gastrointestinal disorders | 0.957 | 0.920 | 0.608 | 0.579 | 0.985 | 0.932 | 0.977 |
Immune system disorders | 0.821 | 0.713 | 0.551 | 0.568 | 0.959 | 0.718 | 0.792 |
Reproductive system and breast disorders | 0.721 | 0.689 | 0.685 | 0.684 | 0.737 | 0.706 | 0.806 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.389 | 0.692 | 0.592 | 0.588 | 0.329 | 0.475 | 0.481 |
General disorders and administration site conditions | 0.925 | 0.860 | 0.486 | 0.498 | 0.965 | 0.888 | 0.929 |
Endocrine disorders | 0.415 | 0.734 | 0.622 | 0.613 | 0.342 | 0.529 | 0.502 |
Surgical and medical procedures | 0.092 | 0.794 | 0.488 | 0.501 | 0.064 | 0.167 | 0.222 |
Vascular disorders | 0.881 | 0.794 | 0.547 | 0.547 | 0.948 | 0.823 | 0.865 |
Blood and lymphatic system disorders | 0.768 | 0.696 | 0.663 | 0.663 | 0.770 | 0.766 | 0.846 |
Skin and subcutaneous tissue disorders | 0.957 | 0.920 | 0.698 | 0.662 | 0.977 | 0.937 | 0.943 |
Congenital, familial and genetic disorders | 0.296 | 0.801 | 0.590 | 0.579 | 0.231 | 0.414 | 0.342 |
Infections and infestations | 0.799 | 0.699 | 0.601 | 0.597 | 0.872 | 0.737 | 0.814 |
Respiratory, thoracic and mediastinal disorders | 0.836 | 0.738 | 0.592 | 0.589 | 0.927 | 0.761 | 0.837 |
Psychiatric disorders | 0.851 | 0.762 | 0.632 | 0.621 | 0.890 | 0.815 | 0.917 |
Renal and urinary disorders | 0.769 | 0.689 | 0.646 | 0.642 | 0.822 | 0.722 | 0.789 |
Pregnancy, puerperium and perinatal conditions | 0.103 | 0.878 | 0.518 | 0.518 | 0.071 | 0.182 | 0.116 |
Ear and labyrinth disorders | 0.571 | 0.549 | 0.548 | 0.548 | 0.581 | 0.562 | 0.586 |
Cardiac disorders | 0.813 | 0.713 | 0.600 | 0.598 | 0.904 | 0.739 | 0.856 |
Nervous system disorders | 0.942 | 0.892 | 0.552 | 0.542 | 0.984 | 0.903 | 0.955 |
Injury, poisoning and procedural complications | 0.754 | 0.647 | 0.563 | 0.566 | 0.829 | 0.692 | 0.763 |
test_scores_best_model_sorted = test_scores_best_model.sort_values(by=["F1 Binary"], ascending=True)
ax = test_scores_best_model_sorted.plot(kind="barh",
y=["F1 Binary", "F1 Macro", "F1 Micro"],
title="Test scores by label (SIDER)",
xticks=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
legend="reverse", xlim=(0, 1), figsize = (10,16))
for p in ax.patches: ax.annotate("{:.3f}".format(round(p.get_width(), 3)), (p.get_x() + p.get_width(), p.get_y()),
xytext=(30, 0), textcoords='offset points', horizontalalignment='right')
As we can see, the results vary widely from label to label. To better understand why, we can analyse three categories of results: high F1 Binary but low F1 Macro; similar F1 Binary and F1 Macro; and low F1 Binary and F1 Macro. These three categories can be observed, for example, in "General disorders and administration site conditions", "Hepatobiliary disorders", and "Congenital, familial and genetic disorders", respectively. One possible explanation is the distribution of positive and negative examples in the dataset, shown in the following table:
df_perc = countsm / 1427
df_filt = df_perc.loc[["General disorders and administration site conditions","Hepatobiliary disorders",
"Congenital, familial and genetic disorders"]]*100
df_filt.apply(lambda x: round(x, 1)).rename(columns = lambda x: "% "+x)
Label | % Positives | % Negatives |
---|---|---|
General disorders and administration site conditions | 90.5 | 9.5 |
Hepatobiliary disorders | 52.1 | 47.9 |
Congenital, familial and genetic disorders | 17.7 | 82.3 |
In the second type of result, for "Hepatobiliary disorders", the values are similar across the metrics. This follows from the more balanced distribution of the label in the dataset: the F1 score is good for both the positive and the negative class, as shown by the F1 Macro score. For the other two types of results, "General disorders..." and "Congenital...", the difference across the metrics is much larger.

In the first case, almost 91% of the examples in the dataset are positive and, although oversampling helped, the model still tends to classify most examples as positive. Recall is very high (0.965), so nearly every positive example was correctly classified as positive, and the high precision (0.888) could lead us to believe that the predictions are mostly correct. However, the low F1 Macro score shows that the F1 for the negative class is very low. In other words, the model labels most examples as positive, and precision stays high only because the dataset is imbalanced in favour of the positive examples. The opposite is true for "Congenital...": here the imbalance favours the negative class, so the model tends to classify samples as negative.
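The gap between F1 Binary and F1 Macro on an imbalanced label can be reproduced with a small synthetic example. The sketch below uses made-up labels and predictions (90 positives, 10 negatives, and a classifier that predicts positive almost everywhere), not the SIDER results, together with scikit-learn's `f1_score`, `precision_score`, and `recall_score`:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Synthetic, illustrative data only: 90 positive and 10 negative examples.
y_true = np.array([1] * 90 + [0] * 10)

# A classifier biased towards the majority class:
# it misses 2 positives and recognises only 1 of the 10 negatives.
y_pred = np.array([1] * 88 + [0] * 2 + [1] * 9 + [0] * 1)

precision = precision_score(y_true, y_pred)            # positive class
recall = recall_score(y_true, y_pred)                  # positive class
f1_binary = f1_score(y_true, y_pred, average="binary") # F1 of the positive class only
f1_negative = f1_score(y_true, y_pred, pos_label=0)    # F1 of the negative class only
f1_macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean of both class F1s
f1_micro = f1_score(y_true, y_pred, average="micro")   # equals accuracy in this setting

print(f"Precision {precision:.3f} | Recall {recall:.3f}")
print(f"F1 Binary {f1_binary:.3f} | F1 negative class {f1_negative:.3f}")
print(f"F1 Macro {f1_macro:.3f} | F1 Micro {f1_micro:.3f}")
```

Precision, recall, and F1 Binary all look strong here, while F1 Macro is pulled down by the negative class, which is the same pattern seen for "General disorders and administration site conditions".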
Two conclusions can already be drawn about the performance of the models: RF had the best overall performance, being the best model for the majority of labels, and, for labels with very few positive examples, the best model was always XGB.
When looking at the performance for each label, it is clear that the best performance came from the more balanced labels. While these labels did not have the best F1 Binary score, and their recall is therefore lower than that of other labels, they performed best at separating positive and negative classifications, as shown by their similar F1 Binary and F1 Macro scores.
One of the first apparent problems when analysing the OFFSIDES dataset was its even greater class imbalance, as seen next:
mod_off = pd.read_csv("./datasets/offside_socs_modified.csv")
df = pd.read_csv("./datasets/sider.csv")
todrop = ["Product issues", "Investigations", "Social circumstances"]
df.drop(todrop, axis=1, inplace=True)
# 1332 Rows in Total
df_y_2 = mod_off.drop("smiles", axis=1)
d2 = {"Positives": df_y_2.sum(axis=0), "Negatives": 1332 - df_y_2.sum(axis=0)}
counts = pd.DataFrame(data=d2)
counts.plot(kind='bar', figsize=(16, 10), title="OFFSIDES Adverse Drug Reactions Counts", ylim=(0, 1400), stacked=True)
This dataset was then combined with SIDER, considering a label positive for a molecule when it was positive in either OFFSIDES or SIDER. The new global counts are shown next:
df_all = pd.read_csv("./dataframes/df_all.csv") # (2043, 25)
# New counts (SIDER + OFFSIDES)
df_all_y = df_all.drop("smiles", axis=1)
da2 = {"Positives": df_all_y.sum(axis=0), "Negatives": 2043 - df_all_y.sum(axis=0)}
counts = pd.DataFrame(data=da2)
counts.plot(kind='bar', figsize=(16, 10), title="Adverse Drug Reactions Counts (SIDER + OFFSIDES)", ylim=(0, 2100),
stacked=True)
As can be seen, most of the labels became even more imbalanced, with the exception of the labels in which the positive examples were the minority.
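The merge rule used for the combined dataset (a label is positive when it is positive in either source) can be sketched with pandas. This is only an illustration on toy data, not the code that produced `df_all.csv`: the two small frames and their values are invented, and it assumes that identical SMILES strings identify the same molecule, which in practice would require canonicalising the SMILES first.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values).
sider_toy = pd.DataFrame({
    "smiles": ["CCO", "CCN"],
    "Cardiac disorders": [1, 0],
    "Eye disorders": [0, 0],
})
offsides_toy = pd.DataFrame({
    "smiles": ["CCN", "CCC"],
    "Cardiac disorders": [1, 0],
    "Eye disorders": [0, 1],
})

# Stack both datasets and keep, per molecule, the maximum of each 0/1 label:
# the maximum is 1 as soon as either source marks the label as positive.
combined_toy = (
    pd.concat([sider_toy, offsides_toy], ignore_index=True)
    .groupby("smiles", as_index=False)
    .max()
)

# Positives and negatives per label, as plotted above for the real data.
labels_toy = combined_toy.drop("smiles", axis=1)
counts_toy = pd.DataFrame({
    "Positives": labels_toy.sum(axis=0),
    "Negatives": len(labels_toy) - labels_toy.sum(axis=0),
})
print(combined_toy)
print(counts_toy)
```

Using `len(labels_toy)` instead of a hard-coded row count keeps the negatives correct if the number of molecules changes.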
# Repeat process of dataframe transformation with dataframe SIDER + OFFSIDES
df_off_y, df_off_mols = create_original_df(usedf=True, file=df_all, write_s=False, write_off=False)
df_off_mols.drop("smiles", axis=1, inplace=True)
df_off_mols_train, df_off_mols_test, y_off_train, y_off_test = train_test_split(df_off_mols, df_off_y, test_size=0.2,
random_state=seed)
# Create X datasets with fingerprint length
X_off_all, _, _, _ = createfingerprints(df_off_mols, length=1125)
X_off_train_fp, _, _, _ = createfingerprints(df_off_mols_train, length=1125)
X_off_test_fp, _, _, _ = createfingerprints(df_off_mols_test, length=1125)
# Selects and create descriptors dataset
df_off_desc = createdescriptors(df_off_mols) # Create all descriptors
# Splits in train and test
df_off_desc_base_train, df_off_desc_base_test = train_test_split(df_off_desc, test_size=0.2, random_state=seed)
# Creates a dictionary with key = class label and value = dataframe with fingerprint + best K descriptors for that label
X_off_train_dic, X_off_test_dic, selected_off_cols = create_dataframes_dic(df_off_desc_base_train,
df_off_desc_base_test, X_off_train_fp,
X_off_test_fp, y_off_train, out_names,
score_func=f_classif, k=3)
test_scores_sioff = test_score_multi_report(X_off_train_dic, y_off_train, X_off_test_dic, y_off_test, out_names,
modelname=best_model_by_label, spec_params=best_model_params_by_label,
random_state=seed, verbose=False, balancing=True, n_jobs=-2, plot=False)
print("Test scores for SIDER + OFFSIDES")
test_scores_sioff
Test scores for SIDER + OFFSIDES
Label | F1 Binary | F1 Micro | F1 Macro | ROC_AUC | Recall | Precision | Average Prec-Rec |
---|---|---|---|---|---|---|---|
Hepatobiliary disorders | 0.852 | 0.758 | 0.594 | 0.587 | 0.877 | 0.828 | 0.871 |
Metabolism and nutrition disorders | 0.910 | 0.839 | 0.571 | 0.562 | 0.935 | 0.886 | 0.928 |
Eye disorders | 0.875 | 0.785 | 0.552 | 0.547 | 0.914 | 0.839 | 0.855 |
Musculoskeletal and connective tissue disorders | 0.907 | 0.834 | 0.549 | 0.543 | 0.946 | 0.872 | 0.909 |
Gastrointestinal disorders | 0.966 | 0.934 | 0.547 | 0.544 | 0.969 | 0.962 | 0.980 |
Immune system disorders | 0.828 | 0.714 | 0.493 | 0.514 | 0.921 | 0.751 | 0.792 |
Reproductive system and breast disorders | 0.774 | 0.672 | 0.588 | 0.585 | 0.813 | 0.740 | 0.795 |
Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 0.789 | 0.667 | 0.504 | 0.505 | 0.812 | 0.767 | 0.782 |
General disorders and administration site conditions | 0.968 | 0.939 | 0.521 | 0.517 | 0.982 | 0.955 | 0.966 |
Endocrine disorders | 0.607 | 0.538 | 0.523 | 0.525 | 0.646 | 0.573 | 0.604 |
Surgical and medical procedures | 0.679 | 0.557 | 0.484 | 0.485 | 0.690 | 0.668 | 0.697 |
Vascular disorders | 0.934 | 0.878 | 0.550 | 0.542 | 0.962 | 0.908 | 0.915 |
Blood and lymphatic system disorders | 0.862 | 0.765 | 0.538 | 0.536 | 0.877 | 0.847 | 0.911 |
Skin and subcutaneous tissue disorders | 0.961 | 0.924 | 0.480 | 0.487 | 0.974 | 0.947 | 0.966 |
Congenital, familial and genetic disorders | 0.656 | 0.575 | 0.549 | 0.550 | 0.695 | 0.622 | 0.643 |
Infections and infestations | 0.932 | 0.873 | 0.518 | 0.517 | 0.957 | 0.908 | 0.945 |
Respiratory, thoracic and mediastinal disorders | 0.915 | 0.844 | 0.473 | 0.486 | 0.950 | 0.882 | 0.926 |
Psychiatric disorders | 0.902 | 0.826 | 0.585 | 0.574 | 0.929 | 0.876 | 0.892 |
Renal and urinary disorders | 0.871 | 0.778 | 0.540 | 0.537 | 0.897 | 0.845 | 0.893 |
Pregnancy, puerperium and perinatal conditions | 0.461 | 0.548 | 0.536 | 0.536 | 0.467 | 0.454 | 0.439 |
Ear and labyrinth disorders | 0.640 | 0.575 | 0.560 | 0.560 | 0.640 | 0.640 | 0.660 |
Cardiac disorders | 0.897 | 0.817 | 0.518 | 0.521 | 0.945 | 0.854 | 0.905 |
Nervous system disorders | 0.967 | 0.936 | 0.484 | 0.487 | 0.975 | 0.960 | 0.978 |
Injury, poisoning and procedural complications | 0.896 | 0.814 | 0.516 | 0.515 | 0.911 | 0.881 | 0.924 |
diff_offsides = test_scores_sioff - test_scores_best_model
ax2 = diff_offsides.plot(kind="barh", y=["F1 Binary", "F1 Macro", "F1 Micro"],
title="Changes in test scores by label with OFFSIDES",legend="reverse", figsize = (10,16))
As we can see from the previous results, the biggest improvements when combining SIDER and OFFSIDES were in classes that previously had very low recall (for example, the F1 Binary of "Surgical and medical procedures" rose from 0.092 to 0.679), since the addition of the OFFSIDES dataset greatly increased the number of positive examples for these labels, thus balancing them. But, even with these improvements, the overall performance of the models was worse after merging the datasets: F1 Macro and ROC AUC dropped for most labels.
One possibility to avoid this would be to break the SOCs down into their components. SOCs are the highest level of the MedDRA hierarchy and each encompasses many more specific reactions, so, in theory, modelling these lower levels would reduce the number of positive examples in each label.
The other possibility lies in model development: trying other types of models, such as deep learning, or other ways of handling class imbalance, such as different over- or under-sampling methods or the `class_weight` parameter available in some scikit-learn models.
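As a rough illustration of the class-weight idea, the sketch below shows how scikit-learn computes "balanced" weights and how they can be passed to a random forest. The data are random and purely illustrative (90 positives, 10 negatives, five meaningless features), and the hyperparameters are arbitrary choices, not the ones used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Illustrative data only: random features and a 90/10 label distribution.
rng = np.random.default_rng(6)
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([1] * 90 + [0] * 10)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))

# The same weighting can be requested directly from the estimator:
# errors on the minority class then count more during training.
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=6)
clf.fit(X_toy, y_toy)
predictions = clf.predict(X_toy)
```

Unlike over- or under-sampling, this approach leaves the training set unchanged and only reweights the contribution of each class during fitting.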