import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, \
    learning_curve, validation_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.decomposition import PCA
from scipy.stats import mannwhitneyu
sns.set(font_scale=1.5)
pd.options.display.max_columns = 50
df = pd.read_csv('data.csv')
df.head()
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
The data used is available through https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
The task here is to predict whether the cancer is benign or malignant based on 30 real-valued features.
Attribute Information:
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
All feature values are recorded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
target = pd.DataFrame(df['diagnosis'])
data = df.drop(['diagnosis'], axis=1)
General data overview:
data.info()
Drop the constant column **Unnamed: 32** and the id column, which is useless for analysis:
data.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
Check data for missing values:
print("Are there missing values:", data.isnull().values.any())
General data statistics overview:
data.describe()
Conclusion: here we can see very different min/max values across features, for example area_mean and smoothness_mean. Thus we should check for outliers (a box plot is a good option for that).
Check whether the difference in feature mean values between the classes is statistically significant. We will use the Mann-Whitney U test, because it is insensitive to outliers and to differing sample distributions.
for column in data.columns:
    m = data[column][target['diagnosis'] == 'M']
    b = data[column][target['diagnosis'] == 'B']
    statistic, pvalue = mannwhitneyu(m, b)
    print('Column:', column, 'Significant:', pvalue < 0.05)
Conclusion: the differences in almost all features are statistically significant, so they should contribute useful information to classification.
Number of examples for each class:
target['diagnosis'].value_counts()
Let's check the ratio of examples belonging to each class:
target['diagnosis'].value_counts() / target['diagnosis'].size
Conclusion: there are considerably more examples of the benign class, but the imbalance is not large enough to pose a skewed-classes problem.
For informative data visualization we need to standardize and scale the features, because some features have very different ranges.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
data_scaled = pd.DataFrame(scaled_data, columns=data.columns)
data_scaled['diagnosis'] = target['diagnosis']
Helper function for plotting feature correlations:
def plot_corr(data):
    plt.figure(figsize=[40, 40])
    ax = sns.heatmap(data.corr(), annot=True, fmt='.1f', linewidths=.5)
    ax.set_xticklabels(ax.get_xticklabels(), size='xx-large')
    ax.set_yticklabels(ax.get_yticklabels(), size='xx-large')
    plt.show()
Data correlations:
plot_corr(data)
Conclusion: there are several groups of correlated features.
data_z = pd.melt(data_scaled, id_vars="diagnosis", var_name="features", value_name='value')
plt.figure(figsize=(20, 10));
ax = sns.boxplot(x='features', y='value', hue='diagnosis', data=data_z);
ax.set_xticklabels(ax.get_xticklabels());
plt.xticks(rotation=90);
Conclusion: there are a lot of variables with outliers, so we have to handle them before training.
plt.figure(figsize=(30, 20));
ax = sns.violinplot(x="features", y="value", hue="diagnosis", data=data_z, split=True, inner="quartile");
ax.set_xticklabels(ax.get_xticklabels(), size='large');
plt.xticks(rotation=90);
Conclusion: for some features, like radius_mean and texture_mean, the medians of the two classes are well separated, so they can be useful for classification. Other features, like smoothness_se, are less separated and may be less useful for classification. Most of the features have a normal-like distribution with a long tail.
Apply PCA for dimensionality reduction:
pca = PCA(random_state=24)
pca.fit(scaled_data)
plt.figure(figsize=(10, 10))
plt.plot(pca.explained_variance_ratio_, linewidth=2)
plt.xlabel('Number of components');
plt.ylabel('Explained variance ratio');
Conclusion: according to the elbow method, 3 components may be chosen.
Check how many components are needed to explain the data variance:
components = range(1, pca.n_components_ + 1)
plt.figure(figsize=(15, 5));
plt.bar(components, np.cumsum(pca.explained_variance_ratio_));
plt.hlines(y = .95, xmin=0, xmax=len(components), colors='green');
Conclusion: the first two components explain 0.6324 of the variance. We need 10 principal components to explain more than 0.95 of the variance and 17 to explain more than 0.99.
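As a cross-check of these numbers, PCA also accepts a float n_components as a variance target and keeps the smallest number of components reaching it. A minimal sketch, using synthetic correlated data in place of the notebook's scaled_data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated features stand in for the notebook's scaled_data
rng = np.random.RandomState(24)
X = rng.randn(200, 30) @ rng.randn(30, 30)

# A float n_components is treated as a variance target: PCA keeps the
# smallest number of components whose cumulative ratio reaches it
pca_95 = PCA(n_components=0.95, random_state=24)
pca_95.fit(X)
print('Components for 95% of variance:', pca_95.n_components_)
print('Variance explained:', pca_95.explained_variance_ratio_.sum())
```

On the notebook's scaled_data this should agree with the 10 components read off the cumulative-sum plot.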
Reduce dimensions of data and plot it:
pca_two_comp = PCA(n_components=2, random_state=24)
two_comp_data = pca_two_comp.fit_transform(scaled_data)
plt.scatter(x=two_comp_data[:, 0], y=two_comp_data[:, 1],
            c=target['diagnosis'].map({'M': 'red', 'B': 'green'}))
plt.show()
Conclusion: the data is reasonably separable using only two components.
Data summary:
Predicting whether the cancer is benign or malignant is a binary classification task. Here we don't face the problem of skewed classes, so the accuracy metric is a good choice for model evaluation. This metric is also simple and therefore highly interpretable.
For the test set we will also calculate precision and recall.
Logistic regression was selected as the model because it is simple and highly interpretable, in line with the metric choice above.
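As a sanity check for the accuracy metric, it helps to know the majority-class baseline any model must beat. A minimal sketch rebuilt from the reported class counts (the DataFrame itself is not needed):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the reported class distribution: 357 benign, 212 malignant
y = np.array(['B'] * 357 + ['M'] * 212)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class ('B')
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print('Baseline accuracy:', baseline.score(X, y))  # 357/569 ≈ 0.6274
```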
Drop the constant column Unnamed: 32 and the id column, which is useless for classification.
X = df.drop(['id', 'Unnamed: 32', 'diagnosis'], axis=1)
y = df['diagnosis'].map(lambda x: 1 if x=='M' else 0)
Split the data into train/test sets in a 0.7/0.3 proportion, which is a common split for this amount of data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)
print('Train size:', X_train.shape[0])
print('Test size:', X_test.shape[0])
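One possible refinement (not used above): passing stratify=y keeps the benign/malignant ratio equal across train and test. A sketch with synthetic labels matching the reported class counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's X and y (357 benign, 212 malignant)
y = np.array([0] * 357 + [1] * 212)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class ratio in both halves of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=24, stratify=y)
print('Train malignant ratio:', y_tr.mean())
print('Test malignant ratio:', y_te.mean())
```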
First of all we should handle multicollinearity. From each group of correlated features we will select only one feature. Columns to drop:
corr_columns = ['perimeter_mean','radius_mean','compactness_mean',
'concave points_mean','radius_se','perimeter_se',
'radius_worst','perimeter_worst','compactness_worst',
'concave points_worst','compactness_se','concave points_se',
'texture_worst','area_worst',
'concavity_mean']
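The list above was read off the correlation heatmap by hand. A hypothetical helper, sketched below, finds candidate pairs programmatically with a correlation threshold; the small synthetic frame stands in for X_train:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |r| above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Tiny demo frame: 'perimeter' is almost a linear function of 'radius'
rng = np.random.RandomState(24)
r = rng.randn(100)
demo = pd.DataFrame({'radius': r,
                     'perimeter': 6.28 * r + 0.01 * rng.randn(100),
                     'texture': rng.randn(100)})
print(correlated_pairs(demo))  # only the radius/perimeter pair
```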
Drop correlated columns from train data:
X_train = X_train.drop(corr_columns, axis=1)
Drop correlated columns from test data:
X_test = X_test.drop(corr_columns, axis=1)
Check number of features left:
print('Current number of features:', X_train.shape[1])
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Use 3 splits, because we don't have a large amount of training data, and shuffle the samples in random order.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=24)
Model:
model = LogisticRegression(solver='liblinear', random_state=24)
Model parameters:
model_parameters = {'penalty': ['l1', 'l2'],
                    'C': np.linspace(.1, 1, 10)}
To find the best hyperparameters we will use grid search, as it is quite simple and efficient enough.
grig_search = GridSearchCV(model, model_parameters, n_jobs=-1, cv=cv, scoring='accuracy')
%%time
grig_search.fit(X_train_scaled, y_train);
Best model parameters:
grig_search.best_params_
Best cv score:
print('Accuracy:', grig_search.best_score_)
Helper function for applying a map operation to data frame attributes:
def apply_cat_op(data, attrs, operation, prefix):
    """
    Apply one operation to data attributes.
    """
    series = [data[attr].map(operation) for attr in attrs]
    _data = pd.concat(series, axis=1).add_prefix(prefix)
    new_attrs = _data.columns.values
    return _data, new_attrs
Creating new features based on medicine requires strong domain knowledge, so we will create them based on the mathematical nature of the current features. A basic approach for numerical features is to add the squares of features in order to capture non-linear dependencies.
Square function:
sq_operation = lambda x: x**2
Create a squared feature for each column and test it with the model:
for column in X_train.columns:
    X_train_sq, sq_attr = apply_cat_op(X_train, [column], sq_operation, 'sq_')
    data = pd.concat([X_train, X_train_sq], axis=1)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data)
    grig_search = GridSearchCV(model, model_parameters, n_jobs=-1, cv=cv, scoring='accuracy')
    grig_search.fit(data_scaled, y_train)
    print('Column:', column, ' ',
          'Accuracy:', grig_search.best_score_, ' ',
          'Best params:', grig_search.best_params_)
As we can see, squaring the fractal_dimension_mean feature improves the score, with params {'C': 0.2, 'penalty': 'l2'}.
Add new feature to train data:
X_train_sq, atr = apply_cat_op(X_train, ['fractal_dimension_mean'], sq_operation, 'sq_')
X_train = pd.concat([X_train, X_train_sq], axis=1)
Add new feature to test data:
X_test_sq, atr = apply_cat_op(X_test, ['fractal_dimension_mean'], sq_operation, 'sq_')
X_test = pd.concat([X_test, X_test_sq], axis=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
final_model = LogisticRegression(penalty='l2', C=0.2)
final_model.fit(X_train_scaled, y_train)
Function for plotting the learning curve:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
plot_learning_curve(final_model, 'Logistic regression',
X_train_scaled, y_train, cv=cv);
Conclusion: such a gap between the training and validation curves indicates overfitting. But we can see that the validation curve rises as the number of training examples grows, so more data is likely to help reduce overfitting.
Function for plotting the validation curve:
def plot_validation_curve(estimator, title, X, y, param_name, param_range,
                          cv=None, scoring=None, ylim=None, n_jobs=None):
    """
    Generate a simple plot of training and validation scores for
    different parameter values.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    param_name : string
        Name of the parameter that will be varied.
    param_range : array-like, shape (n_values,)
        The values of the parameter that will be evaluated.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)``.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    """
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=cv, scoring=scoring, n_jobs=n_jobs)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.figure()
    plt.grid()
    plt.title(title)
    plt.xlabel(param_name)
    plt.ylabel("Score")
    if ylim is not None:
        plt.ylim(*ylim)
    plt.semilogx(param_range, train_scores_mean, 'o-', label="Training score",
                 color="darkorange")
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange")
    plt.semilogx(param_range, test_scores_mean, 'o-', label="Cross-validation score",
                 color="navy")
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2,
                     color="navy")
    plt.legend(loc="best")
    return plt
Plot validation curve for model complexity parameter:
plot_validation_curve(final_model, 'Logistic regression', X_train_scaled, y_train,
'C', model_parameters['C'],
cv=cv, scoring='accuracy');
Conclusion: the gap between the training and validation curves indicates overfitting. The best C parameter is 0.2.
Make predictions for test samples:
test_predictions = final_model.predict(X_test_scaled)
print('Accuracy test score:', accuracy_score(y_test, test_predictions))
Conclusion: the results on the test samples are comparable to the results on cross-validation, even slightly better. Thus our validation scheme is valid.
test_confusion_matrix = confusion_matrix(y_test, test_predictions)
sns.heatmap(test_confusion_matrix, annot=True, fmt='d');
From the confusion matrix we can see that we made only a few wrong predictions.
print('Precision:', precision_score(y_test, test_predictions))
print('Recall:', recall_score(y_test, test_predictions))
Although we tried a simple model, it gives 98% accuracy, 98% precision and 97% recall on the test set. There are several (3-5) features that are most important for classification, which could indicate that our data is not representative or is biased. So it's a good option to try the model on more data. Feature generation based on medical knowledge for such data is quite challenging, so we built features based on their mathematical nature.
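The claim about a handful of dominant features can be checked by ranking the absolute values of the logistic regression coefficients (meaningful here because the inputs were standardized). A sketch on synthetic data; in the notebook, final_model.coef_[0] together with X_train.columns would be used instead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data where only f0 and f2 actually drive the label
rng = np.random.RandomState(24)
X = rng.randn(300, 4)
y = (2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.randn(300) > 0).astype(int)

# On standardized inputs, |coefficient| is a reasonable importance proxy
model = LogisticRegression().fit(X, y)
importance = pd.Series(np.abs(model.coef_[0]),
                       index=['f0', 'f1', 'f2', 'f3']).sort_values(ascending=False)
print(importance)  # f0 and f2 dominate
```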