One key question planetary scientists strive to answer is the long-term stability of exoplanetary systems. That is, they would like to know whether, over billions of orbits, planets will collide or be ejected from the system. Due to the chaotic nature of planetary systems, the "answer" of whether a particular planetary system is long-term stable can only be explored statistically. For example, Laskar & Gastineau (2009) investigated the long-term stability of the Solar System by performing 2,501 N-body simulations, each 5 billion years in length, and found that 1% of solutions lead to a large, destabilizing increase in Mercury's eccentricity. However, this study would have taken roughly 200 years to complete on a standard workstation (they had access to a very large computing cluster), motivating the exploration of other methods to speed up the process.
One such method is to use machine learning to predict the long-term behaviour of a planetary system based on its initial conditions. Once the model is trained, it can take as little as a second to generate new predictions. Such a method is described and presented in Tamayo, Silburt, et al. (2016). In this notebook, we will explore a simplified version of this work, using a dataset of 25,000 simulated 3-planet systems to train and test a variety of machine learning models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.metrics import precision_recall_curve, roc_curve
pd.options.mode.chained_assignment = None
# load the dataset
df = pd.read_csv('Stability_Data_inial_conditions.csv', index_col=0)
df.tail()
runstring | a1 | e1 | omega1 | inc1 | m1 | Omega1 | true_anom1 | mean_anom1 | a2 | ... | a3 | e3 | omega3 | inc3 | m3 | Omega3 | true_anom3 | mean_anom3 | Stable | instability_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24995 | 0024995.bin | 1.0 | 0.001983 | -0.103719 | 0.034756 | 6.384047e-07 | 1.891501 | 2.323505 | 2.320607 | 1.171907 | ... | 1.387978 | 0.136176 | 2.998349 | 0.013797 | 0.000001 | -1.687570 | -0.102447 | -0.077197 | 0.0 | 1.843085e+04 |
24996 | 0024996.bin | 1.0 | 0.000435 | -0.528805 | 0.009183 | 4.383173e-06 | -1.320371 | 2.943623 | 2.943452 | 1.118284 | ... | 1.220488 | 0.000275 | 2.799539 | 0.006669 | 0.000033 | -0.174781 | -4.296110 | 1.986572 | 0.0 | 1.868255e+03 |
24997 | 0024997.bin | 1.0 | 0.000159 | 0.995542 | 0.001832 | 3.176214e-05 | 1.857300 | 0.520523 | 0.520365 | 1.579004 | ... | 1.697671 | 0.014659 | 0.939074 | 0.090331 | 0.000011 | -2.313583 | 1.605318 | 1.576007 | 0.0 | 6.939622e+04 |
24998 | 0024998.bin | 1.0 | 0.042915 | 2.803428 | 0.034103 | 4.817579e-07 | 0.320698 | -2.821058 | -2.793167 | 1.055424 | ... | 1.286817 | 0.002571 | -1.435588 | 0.002550 | 0.000011 | -0.858794 | 1.509610 | 1.504478 | 0.0 | 3.230627e+04 |
24999 | 0024999.bin | 1.0 | 0.000022 | -0.388908 | 0.019481 | 4.806837e-05 | 0.756854 | 2.089563 | 2.089526 | 1.316695 | ... | 1.849302 | 0.001253 | -2.367310 | 0.005140 | 0.000041 | -0.267279 | 0.119263 | 0.118965 | 1.0 | 1.000000e+09 |
5 rows × 27 columns
Above, we have loaded the raw initial conditions for each system using `pandas`. For each of the three planets (identified `1`, `2`, `3`):

- `a` = $a$ = semi-major axis.
- `e` = $e$ = eccentricity.
- `omega` = $\omega$ = argument of periapsis.
- `Omega` = $\Omega$ = longitude of ascending node.
- `inc` = $i$ = inclination.
- `m` = $m$ = planet mass.
- `true_anom` = $\nu$ = true anomaly.
- `mean_anom` = $M$ = mean anomaly.

In addition:

- `runstring` = identifier for each simulation.
- `Stable` = whether the system is stable over a billion orbits of the inner planet. This is the quantity we want to train our machine learning algorithm to predict!
- `instability_time` = the number of years the system remained stable. If `instability_time` $= 10^9$, then `Stable = 1`. This is an alternative quantity we could train a regression model on, but we won't do that here (it is a much harder problem).

For reference, the orbital elements are shown here:
Let's plot the data to get a sense of what it looks like. Try plotting other variables to see if you can better separate the stable (blue) systems from the unstable (red) systems!
# x and y plot variables
x = df['a2'].values/df['a1'].values
y = df['a3'].values/df['a1'].values
# plotting
fig, ax = plt.subplots(figsize=(12,8))
stable = df['Stable'] == 1
unstable = df['Stable'] == 0
ax.scatter(x[stable], y[stable], color='blue', s=5, label='Stable')
ax.scatter(x[unstable], y[unstable], color='red', s=5, label='Unstable')
ax.set_xlim([1,2.25])
ax.set_ylim([1,3.25])
ax.set_xlabel('a2/a1', fontsize=20)
ax.set_ylabel('a3/a1', fontsize=20)
ax.legend()
A machine learning model requires an input of features, `X`, and ground-truth outputs (i.e. the answers), `y`, to train.
cols = df.columns.drop(['runstring', 'Stable', 'instability_time'])
# input
X = df[cols]
# output (target)
y = df['Stable']
X.head()
a1 | e1 | omega1 | inc1 | m1 | Omega1 | true_anom1 | mean_anom1 | a2 | e2 | ... | true_anom2 | mean_anom2 | a3 | e3 | omega3 | inc3 | m3 | Omega3 | true_anom3 | mean_anom3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.005031 | 1.400202 | 0.008978 | 3.414016e-05 | 1.770839 | -1.534327 | -1.524274 | 1.221502 | 0.059391 | ... | -0.567081 | -0.505606 | 1.589043 | 0.002515 | 1.084278 | 0.065498 | 1.826900e-06 | -1.192440 | -0.614709 | -0.611814 |
1 | 1.0 | 0.010350 | -2.070391 | 0.001541 | 2.529838e-07 | -1.493638 | 0.013233 | 0.012961 | 1.190647 | 0.013048 | ... | 1.437355 | 1.411526 | 1.671403 | 0.072450 | 0.275475 | 0.046939 | 1.955797e-05 | -0.619350 | 0.192203 | 0.165935 |
2 | 1.0 | 0.051912 | -1.710828 | 0.016289 | 7.380789e-05 | 2.705977 | 2.472633 | 2.406234 | 1.125660 | 0.042326 | ... | -0.317958 | -0.292271 | 1.936762 | 0.003395 | 0.624064 | 0.014535 | 1.477931e-07 | 2.795101 | 1.685406 | 1.678659 |
3 | 1.0 | 0.000152 | -2.300890 | 0.003301 | 5.174804e-07 | -1.027543 | 2.993015 | 2.992970 | 1.216684 | 0.000026 | ... | -2.294238 | -2.294199 | 1.503164 | 0.057380 | 2.166845 | 0.098014 | 1.287851e-06 | -2.995941 | -1.625763 | -1.510968 |
4 | 1.0 | 0.001457 | 0.653565 | 0.033910 | 5.106746e-07 | 1.084883 | 0.667169 | 0.665367 | 1.034058 | 0.020334 | ... | -1.074927 | -1.039417 | 1.059346 | 0.010325 | 0.038984 | 0.011840 | 1.542321e-06 | -1.253823 | 1.215392 | 1.196085 |
5 rows × 24 columns
Good features are essential to an accurate model, which often requires domain expertise. For example, if you were trying to train a model to distinguish cars from pickup-trucks, a "number of wheels" feature would not be very useful since they both have 4 wheels. A better feature might be "size", since pickup-trucks are generally bigger than cars.
For our problem, let's create a "Mutual Hill Radius" feature, which describes the dynamical spacing between planets $i$ and $j$, and is calculated according to: $$ R_{H,ij} = \left(\frac{a_i + a_j}{2}\right)\left(\frac{m_i + m_j}{3M_*}\right)^{1/3} $$ The mutual Hill radius can be thought of as a "normalized" spacing parameter - a system with two Earth-mass planets will interact just as strongly as a system with two Jupiter-mass planets if the planets are separated by the same number of mutual Hill radii.
# note: planet masses are in units of the stellar mass, so M* = 1 in the formula above
X['Rhill_12'] = 0.5*(X['a1'] + X['a2'])*((X['m1'] + X['m2'])/3.)**(1./3.) # mutual Hill spacing between planets 1, 2
X['Rhill_23'] = 0.5*(X['a2'] + X['a3'])*((X['m2'] + X['m3'])/3.)**(1./3.) # mutual Hill spacing between planets 2, 3
X.head()
a1 | e1 | omega1 | inc1 | m1 | Omega1 | true_anom1 | mean_anom1 | a2 | e2 | ... | a3 | e3 | omega3 | inc3 | m3 | Omega3 | true_anom3 | mean_anom3 | Rhill_12 | Rhill_23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.005031 | 1.400202 | 0.008978 | 3.414016e-05 | 1.770839 | -1.534327 | -1.524274 | 1.221502 | 0.059391 | ... | 1.589043 | 0.002515 | 1.084278 | 0.065498 | 1.826900e-06 | -1.192440 | -0.614709 | -0.611814 | 0.028916 | 0.026716 |
1 | 1.0 | 0.010350 | -2.070391 | 0.001541 | 2.529838e-07 | -1.493638 | 0.013233 | 0.012961 | 1.190647 | 0.013048 | ... | 1.671403 | 0.072450 | 0.275475 | 0.046939 | 1.955797e-05 | -0.619350 | 0.192203 | 0.165935 | 0.024868 | 0.037600 |
2 | 1.0 | 0.051912 | -1.710828 | 0.016289 | 7.380789e-05 | 2.705977 | 2.472633 | 2.406234 | 1.125660 | 0.042326 | ... | 1.936762 | 0.003395 | 0.624064 | 0.014535 | 1.477931e-07 | 2.795101 | 1.685406 | 1.678659 | 0.038585 | 0.043732 |
3 | 1.0 | 0.000152 | -2.300890 | 0.003301 | 5.174804e-07 | -1.027543 | 2.993015 | 2.992970 | 1.216684 | 0.000026 | ... | 1.503164 | 0.057380 | 2.166845 | 0.098014 | 1.287851e-06 | -2.995941 | -1.625763 | -1.510968 | 0.012972 | 0.016725 |
4 | 1.0 | 0.001457 | 0.653565 | 0.033910 | 5.106746e-07 | 1.084883 | 0.667169 | 0.665367 | 1.034058 | 0.020334 | ... | 1.059346 | 0.010325 | 0.038984 | 0.011840 | 1.542321e-06 | -1.253823 | 1.215392 | 1.196085 | 0.006304 | 0.008739 |
5 rows × 26 columns
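As a quick sanity check, the mutual-Hill-radius formula can be wrapped in a small standalone function. The input values below are arbitrary, chosen only for illustration (masses in units of the stellar mass):

```python
def mutual_hill_radius(a_i, a_j, m_i, m_j, M_star=1.0):
    """Mutual Hill radius of two planets with semi-major axes a_i, a_j
    and masses m_i, m_j, orbiting a star of mass M_star."""
    return 0.5 * (a_i + a_j) * ((m_i + m_j) / (3.0 * M_star)) ** (1.0 / 3.0)

# two Earth-mass planets (~3e-6 solar masses), both at 1 AU
print(mutual_hill_radius(1.0, 1.0, 3e-6, 3e-6))  # ~0.0126
```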
To test the accuracy of an algorithm, we need to see how it performs on unseen data. This is the test set, which is reserved until the model is fully trained.
# use subset of data - don't need 25,000 for initial model building
frac_data = 0.7
# randomly split into train/test data
train_frac = 0.8
N = int(len(X)*frac_data)
# build training/test datasets
rN = np.arange(0,N)
np.random.shuffle(rN) # randomly shuffle data
train_i, test_i = rN[0: int(train_frac*N)], rN[int(train_frac*N):]
Xtrain, Xtest = X.iloc[train_i], X.iloc[test_i]
ytrain, ytest = y.iloc[train_i], y.iloc[test_i]
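For reference, scikit-learn's `train_test_split` accomplishes the same random split in a single call. A sketch on placeholder arrays (swap in our `X` and `y` to use it on the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data standing in for our feature matrix and labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

# 80/20 train/test split, matching train_frac above
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(Xtr.shape, Xte.shape)  # (8, 2) (2, 2)
```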
Algorithm: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
In this cell we are performing:

- k-fold cross validation, using `KFold`.
- a grid search over the hyperparameters `C` and `solver`.

To determine the best set of hyperparameters, both cross validation and grid search are necessary.
from sklearn.linear_model import LogisticRegression
fold = KFold(n_splits=3, shuffle=False, random_state=None)
# hyparameters to grid search over
grid = {
'C': [1, 1e-2], # regularization penalty term
'solver': ['newton-cg']
}
# initialize algorithm
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
# Search over grid
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
# perform grid search, might take a minute!
gs.fit(Xtrain, ytrain)
print ('gs.best_score_:', gs.best_score_)
print ('gs.best_params_:', gs.best_params_)
gs.best_score_: 0.87783448545759668 gs.best_params_: {'C': 1, 'solver': 'newton-cg'}
# Load and train best model
LR_best = LogisticRegression(C=gs.best_params_['C'], solver=gs.best_params_['solver'],
penalty='l2', random_state=777, max_iter=10000, tol=10)
LR_best.fit(Xtrain, ytrain)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=10000, multi_class='ovr', n_jobs=1, penalty='l2', random_state=777, solver='newton-cg', tol=10, verbose=0, warm_start=False)
Precision-Recall and ROC curves are both used to measure the performance of an algorithm. In particular, a classification algorithm is likely to make mistakes in the form of false positives (negative cases incorrectly classified as positive) and false negatives (positive cases incorrectly classified as negative). Most algorithms output a class probability and the user must specify what the threshold for a positive classification is. Precision-Recall and ROC curves generate performance scores for each possible threshold (between 0 and 1), giving a sense of overall performance. These are statistics calculated from the number of True Positives, $T_P$, True Negatives $T_N$, False Positives, $F_P$, and False Negatives, $F_N$.
For each threshold, a False Positive Rate, $R_{FP}$ and True Positive Rate, $R_{TP}$ are calculated and plotted on an ROC curve according to:
$$R_{FP} = \frac{F_P}{F_P + T_N}, \ \ R_{TP} = \frac{T_P}{F_N + T_P}$$
For each threshold, a Precision, $P$, and Recall, $R$, value is calculated and plotted on a precision-recall curve according to:
$$P = \frac{T_P}{F_P + T_P}, \ \ R = \frac{T_P}{F_N + T_P}$$
The $y=x$ line on an ROC plot corresponds to random guessing (i.e. terrible performance), and the closer an ROC curve lies to the upper left corner the better the performance. By calculating the area under an ROC curve (`AUC_ROC`), one can get a measure of absolute performance.
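To make the definitions above concrete, here is a tiny worked example (toy labels invented purely for illustration) computing precision, recall, and the true/false positive rates at a single fixed threshold:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])  # predictions at one fixed threshold

# confusion_matrix rows = true class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (fp + tp)  # P = TP / (FP + TP)
recall    = tp / (fn + tp)  # R = TP / (FN + TP), identical to the true positive rate
fpr       = fp / (fp + tn)  # R_FP = FP / (FP + TN)
print(tp, fp, fn, tn)          # 3 1 1 3
print(precision, recall, fpr)  # 0.75 0.75 0.25
```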
def PR_ROC_curve(model, Xtest, ytest, test_i):
ypred = model.predict_proba(Xtest)[:,1]
index = np.isfinite(ypred) # occasionally a prediction -> inf/nan
ytest, ypred = ytest[index], ypred[index]
precision, recall, thresholds = precision_recall_curve(ytest, ypred)
fpr, tpr, thresholds = roc_curve(ytest, ypred)
auc = metrics.roc_auc_score(ytest, ypred)
# best model
best = pd.read_csv('XGB_model_preds.csv', index_col=0)
ytest_best, ypred_best = best['Stable'][test_i], best['preds'][test_i]
index = np.isfinite(ypred_best) # occasionally a prediction -> inf/nan
ytest_best, ypred_best = ytest_best[index], ypred_best[index]
precision_best, recall_best, thresholds_best = precision_recall_curve(ytest_best, ypred_best)
fpr_best, tpr_best, thresholds_best = roc_curve(ytest_best, ypred_best)
auc_best = metrics.roc_auc_score(ytest_best, ypred_best)
f, ax = plt.subplots(1, 2, figsize=(15,6))
ax[0].plot(recall, precision)
ax[0].plot(recall_best, precision_best)
ax[0].set_xlim([0.0, 1.0])
ax[0].set_ylim([0.0, 1.0])
ax[0].set_xlabel('Recall',fontsize=12)
ax[0].set_ylabel('Precision',fontsize=12)
ax[0].set_title("Precision-Recall Curve")
ax[1].plot(fpr, tpr, label='auc, your model: %f'%auc)
ax[1].plot(fpr_best, tpr_best, label='auc, best model: %f'%auc_best)
ax[1].plot([0, 1], [0, 1], 'k--')
ax[1].set_xlim([0.0, 1.0])
ax[1].set_ylim([0.0, 1.0])
ax[1].set_xlabel('False Positive Rate')
ax[1].set_ylabel('True Positive Rate')
ax[1].set_title("Receiver Operating Characteristic Curve")
ax[1].legend()
# Generate PR/ROC curves
PR_ROC_curve(LR_best, Xtest, ytest, test_i)
Algorithm: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
As you will notice, training an SVM is considerably slower than logistic regression. This is because the training time with a non-linear kernel (the default for `SVC` is the non-linear `rbf` kernel) is approximately $O(\max(n, d) \cdot \min(n, d)^2)$, where $n$ is the number of data points and $d$ is the number of features. Source: Chapelle, Olivier. "Training a support vector machine in the primal." Neural Computation 19.5 (2007): 1155-1178.
from sklearn.svm import SVC # SVC = Support Vector for Classification
fold = KFold(n_splits=3, shuffle=False, random_state=None)
grid = {
'C': [1, 1e-3], # regularization penalty term
'kernel': ['rbf','linear']
}
clf = SVC()
# Search over grid
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
# perform grid search, might take a minute!
gs.fit(Xtrain, ytrain)
print ('gs.best_score_:', gs.best_score_)
print ('gs.best_params_:', gs.best_params_)
gs.best_score_: 0.885465880352 gs.best_params_: {'C': 1, 'kernel': 'linear'}
# Load and train best model on full training set (no cross validation)
SVC_best = SVC(C=gs.best_params_['C'], kernel=gs.best_params_['kernel'], probability=True)
SVC_best.fit(Xtrain, ytrain)
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
# Generate PR/ROC curves
PR_ROC_curve(SVC_best, Xtest, ytest, test_i)
Now that you know the basics of how to load data, engineer features, train models, and evaluate their performance, it's time to explore this notebook and try to achieve the best possible performance on the test set - perhaps even competing with the "best model" performance on this dataset! This can be done by, for example, engineering new features, tuning hyperparameters, or trying different algorithms.
The best model was developed by Dan Tamayo, Naireen Hussain and Ari Silburt, and uses sophisticated hand-crafted features to generate predictions with the ML algorithm XGBoost. Many of these features are generated from the early evolution of the planetary system. For example, the minimum, maximum, standard deviation and mean eccentricity of each planet over the first 10,000 orbits are used as features. As a result, the 'best model' uses both the initial conditions and the early evolution of a planetary system to generate a prediction. These sophisticated features can be found by loading "Stability_Data_BestModel_Features.csv", provided in the folder. Feel free to try it out!
Since "Stability_Data_inial_conditions.csv" contains only the initial conditions, it is unlikely that you would be able to match/beat the performance of the best model. Please let us know if you even get close, as maybe you have created an important feature we didn't think of!!
We will briefly analyze this dataset from a clustering perspective. If we plot histograms of `instability_time` and `log10(instability_time)`, we can see that clusters emerge. In particular, when binning `instability_time`, systems cluster around $<0.1$ or $1$ billion years. Although it is very common to analyze the stability of planetary systems in terms of `log10(instability_time)`, we will instead try to cluster the two groups in `instability_time` space.
f, ax = plt.subplots(1, 3, figsize=(14,4), sharey=True)
ax[0].hist(np.log10(df['instability_time']));
ax[1].hist(df['instability_time']);
ax[2].hist(df['Stable']);
ax[0].set_ylabel('counts');
ax[0].set_xlabel('log10(instability_time)');
ax[1].set_xlabel('instability_time (years)');
ax[2].set_xlabel('Stability (binary)')
Spectral clustering can do a good job of separating intertwined/messy classes where other methods struggle. However, spectral clustering is a very slow algorithm that scales poorly with the amount of data, so we will attempt to cluster the stable/unstable systems using only a very small subset of the data.
from sklearn.cluster import SpectralClustering
clf = SpectralClustering(n_clusters=2, affinity='rbf', gamma=1.0, n_init=10)
clf
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3, eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None, n_clusters=2, n_init=10, n_jobs=1, n_neighbors=10, random_state=None)
frac_data = 0.03
N = len(X)
X_clf = X[0:int(frac_data*N)]
y_clf = y[0:int(frac_data*N)]
X_clf.tail()
a1 | e1 | omega1 | inc1 | m1 | Omega1 | true_anom1 | mean_anom1 | a2 | e2 | ... | a3 | e3 | omega3 | inc3 | m3 | Omega3 | true_anom3 | mean_anom3 | Rhill_12 | Rhill_23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
744 | 1.0 | 0.005276 | -2.280183 | 0.003580 | 8.450904e-07 | -0.521908 | 4.171273 | -2.102849 | 1.046257 | 0.007543 | ... | 1.059626 | 0.022941 | -1.575458 | 0.004367 | 1.018926e-06 | 2.793342 | 1.065387 | 1.025576 | 0.012080 | 0.012576 |
745 | 1.0 | 0.007163 | 1.572270 | 0.018378 | 4.589269e-07 | -1.765554 | -1.770789 | -1.756734 | 1.414292 | 0.004847 | ... | 1.454929 | 0.017930 | -1.248724 | 0.072972 | 1.261339e-06 | 3.067091 | 1.311294 | 1.276756 | 0.018345 | 0.022342 |
746 | 1.0 | 0.045097 | 1.730455 | 0.061635 | 1.867993e-06 | 2.250357 | -3.620189 | 2.620180 | 1.155877 | 0.012655 | ... | 1.671401 | 0.010231 | -1.424115 | 0.002171 | 2.352218e-05 | -1.661276 | 0.842566 | 0.827372 | 0.009669 | 0.028202 |
747 | 1.0 | 0.005446 | -0.626045 | 0.081258 | 7.862898e-07 | 2.284007 | 2.358184 | 2.350475 | 1.125095 | 0.015416 | ... | 1.503039 | 0.021079 | -3.070809 | 0.007092 | 2.187174e-05 | 1.096155 | 2.599894 | 2.577860 | 0.014999 | 0.028160 |
748 | 1.0 | 0.004916 | -2.321251 | 0.013209 | 1.638547e-07 | 0.331210 | 0.190913 | 0.189054 | 1.349109 | 0.000609 | ... | 1.761103 | 0.184905 | -0.267640 | 0.020949 | 8.206347e-07 | 2.488754 | -2.440396 | -2.174502 | 0.033689 | 0.044742 |
5 rows × 26 columns
# This might take a minute!
y_clf_pred = clf.fit_predict(X_clf) # clustering is unsupervised - no labels are passed
metrics.adjusted_rand_score(y_clf, y_clf_pred)
-0.0005843978451994811
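The adjusted Rand score compares two partitions while ignoring the arbitrary cluster label names: a perfect clustering scores 1 even if the labels are swapped, while an uninformative assignment scores 0. A quick illustration with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1]

# same grouping, label names swapped -> perfect score
print(adjusted_rand_score(truth, [1, 1, 0, 0]))  # 1.0

# everything lumped into one cluster -> uninformative, score 0
print(adjusted_rand_score(truth, [0, 0, 0, 0]))  # 0.0
```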
list(zip(y_clf, y_clf_pred))[:20] # first 20 (ground truth, predicted cluster) pairs
[(0.0, 0), (0.0, 0), (0.0, 0), (1.0, 0), (0.0, 0), (1.0, 0), (0.0, 0), (0.0, 0), (0.0, 0), (0.0, 0), (1.0, 0), (1.0, 0), (0.0, 0), (1.0, 1), (1.0, 0), (0.0, 0), (1.0, 0), (1.0, 0), (1.0, 0), (0.0, 0)]
From the adjusted Rand score, and from comparing the ground-truth and predicted class labels with `zip(y_clf, y_clf_pred)`, we can see that the clustering doesn't do a great job. The algorithm predicts mostly zeros, and few samples with `y_clf=1` have a corresponding `y_clf_pred=1` value. This highlights the importance of choosing the right machine learning algorithm for your problem, and also supports the [no free lunch theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem).
There are many other clustering algorithms available in scikit-learn - try implementing a few different ones, as they may perform better! For example, DBSCAN scales much better with the amount of data and uses a different approach to cluster. Perhaps it will better separate the two classes!
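As a starting point, here is a minimal DBSCAN sketch on synthetic two-moons data (the `eps` and `min_samples` values below are rough guesses that happen to work on this easy example; they would need tuning for the stability dataset):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two intertwined half-circles, a shape that k-means struggles with
X_demo, y_demo = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps = neighbourhood radius, min_samples = density threshold
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)

# near-perfect agreement with the true moons on this toy problem
print(adjusted_rand_score(y_demo, labels))
```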