#!/usr/bin/env python
# coding: utf-8

# # Machine Learning (ML) to Predict the Stability of Planetary Systems

# One key question planetary scientists strive to understand is the long-term stability of exoplanetary systems. That is, they would like to know whether, over billions of orbits, planets will collide or be ejected from the system. Due to the [chaotic](https://en.wikipedia.org/wiki/Chaos_theory) nature of planetary systems, the "answer" of whether a particular planetary system is long-term stable can only be explored statistically. For example, [Laskar & Gastineau (2009)](https://www.nature.com/articles/nature08096) studied the long-term stability of the Solar System by performing 2,501 N-body simulations, each 5 billion years in length, and found that 1% of solutions lead to a large unstable increase in Mercury's eccentricity. However, this study would have taken roughly **200 years** to complete on a standard workstation (they had access to a very large computing cluster), motivating the exploration of other methods to speed up the process.
#
# One such method is to use machine learning to predict the long-term behaviour of a planetary system based on its initial conditions. Once the model is trained, it can generate a new prediction in as little as a second. Such a method is described and presented in [Tamayo, Silburt, et al. (2016)](https://arxiv.org/abs/1610.05359). In this notebook, we will explore a simplified version of this work, using a dataset of 25,000 simulated 3-planet systems to train and test a variety of machine learning models.

# In[2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.metrics import precision_recall_curve, roc_curve

pd.options.mode.chained_assignment = None

# In[3]:

# load the dataset
df = pd.read_csv('Stability_Data_inial_conditions.csv', index_col=0)
df.tail()

# Above, we have loaded the raw initial conditions for each system using ```pandas```. For each of the three planets (identified ```1```, ```2```, ```3```):
# - ```a``` = $a$ = semi-major axis.
# - ```e``` = $e$ = eccentricity.
# - ```omega``` = $\omega$ = argument of periapsis.
# - ```Omega``` = $\Omega$ = longitude of ascending node.
# - ```inc``` = $i$ = inclination.
# - ```m``` = $m$ = planet mass.
# - ```true_anom``` = $\nu$ = true anomaly.
# - ```mean_anom``` = $M$ = mean anomaly.
#
# In addition:
# - ```runstring``` = identifier for each simulation.
# - ```Stable``` = whether the system is stable over a billion orbits of the inner planet. *This is the quantity we want to train our machine learning algorithm to predict!*
# - ```instability_time``` = the physical number of years the system was stable for. If ```instability_time``` $= 10^9$, then ```Stable = 1```. This is an alternative quantity we could train a *regression* model on, but we won't do that here (it is a much harder problem).
# - Stellar mass = $M_*$ = 1 for all systems.
#
# For reference, the orbital elements are shown here:

# ## Plotting
# Let's plot the data to get a sense of what it looks like. Try plotting other variables to see if you can better separate the stable (blue) systems from the unstable (red) systems!
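# Before making any plots, it is also useful to check how balanced the two classes are, since a strong class imbalance affects which performance metrics are informative later on. The cell below is a minimal sketch (not part of the original analysis) that simply counts stable vs. unstable systems in the ```df``` loaded above.

# In[ ]:

# count how many systems fall into each class (1 = stable, 0 = unstable),
# and express the split as a fraction of the full dataset
class_counts = df['Stable'].value_counts()
print(class_counts)
print(class_counts / len(df))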
# In[4]:

# x and y plot variables
x = df['a2'].values/df['a1'].values
y = df['a3'].values/df['a1'].values

# plotting
fig, ax = plt.subplots(figsize=(12,8))
stable = df['Stable'] == 1
unstable = df['Stable'] == 0
ax.scatter(x[stable], y[stable], color='blue', s=5, label='Stable')
ax.scatter(x[unstable], y[unstable], color='red', s=5, label='Unstable')
ax.set_xlim([1,2.25])
ax.set_ylim([1,3.25])
ax.set_xlabel('a2/a1', fontsize=20)
ax.set_ylabel('a3/a1', fontsize=20)
ax.legend()

# ## Get Input and Target
# A machine learning model requires an input (features), ```X```, and a ground-truth output (i.e. the answer), ```y```, to train on.

# In[5]:

cols = df.columns.drop('runstring').drop('Stable').drop('instability_time')

# input
X = df[cols]

# output (target)
y = df['Stable']

X.head()

# ## Feature Engineering
# Good features are essential to an accurate model, and creating them often requires domain expertise. For example, if you were trying to train a model to distinguish cars from pickup trucks, a "number of wheels" feature would not be very useful since both have 4 wheels. A better feature might be "size", since pickup trucks are generally bigger than cars.
#
# For our problem, let's create a "Mutual Hill Radius" feature, which describes the *dynamical* spacing between a pair of planets $i$ and $j$, and is calculated according to:
# $$ R_{H_{i,j}} = \left(\frac{a_i + a_j}{2}\right)\left(\frac{m_i + m_j}{3M_*}\right)^{1/3} $$
# The mutual Hill radius can be thought of as a "normalized" spacing parameter - a system with two Earth-sized planets will be just as interactive as a system with two Jupiter-sized planets if they are separated by the same number of mutual Hill radii.

# In[6]:

X['Rhill_12'] = 0.5*(X['a1'] + X['a2'])*((X['m1'] + X['m2'])/3.)**(1./3.)   # mutual Hill spacing between planets 1, 2
X['Rhill_23'] = 0.5*(X['a3'] + X['a2'])*((X['m3'] + X['m2'])/3.)**(1./3.)   # mutual Hill spacing between planets 2, 3
X.head()

# ## Split into Train/Test data
# To test the accuracy of an algorithm, we need to see how it performs on unseen data. This is the test set, which is reserved until the model is fully trained.

# In[6]:

# use a subset of the data - we don't need all 25,000 systems for initial model building
frac_data = 0.7

# randomly split into train/test data
train_frac = 0.8
N = int(len(X)*frac_data)

# build training/test datasets
rN = np.arange(0,N)
np.random.shuffle(rN)   # randomly shuffle data
train_i, test_i = rN[0: int(train_frac*N)], rN[int(train_frac*N):]
Xtrain, Xtest = X.iloc[train_i], X.iloc[test_i]
ytrain, ytest = y.iloc[train_i], y.iloc[test_i]

# ## Logistic Regression
# Algorithm: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
#
# In this cell we are:
# - performing cross validation (CV) using ```KFold```.
# - performing a grid search over the hyperparameters ```C``` and ```solver```.
# - applying an "l2" regularization term to help prevent overfitting.
#
# In order to determine the best set of hyperparameters, both cross validation and a grid search are necessary.

# In[7]:

from sklearn.linear_model import LogisticRegression

fold = KFold(n_splits=3, shuffle=False, random_state=None)

# hyperparameters to grid search over
grid = {
    'C': [1, 1e-2],              # regularization penalty term
    'solver': ['newton-cg']
}

# initialize algorithm
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)

# Search over grid
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)

# In[8]:

# perform grid search, might take a minute!
gs.fit(Xtrain, ytrain)

print('gs.best_score_:', gs.best_score_)
print('gs.best_params_:', gs.best_params_)

# In[9]:

# Load and train best model
LR_best = LogisticRegression(C=gs.best_params_['C'], solver=gs.best_params_['solver'],
                             penalty='l2', random_state=777, max_iter=10000, tol=10)
LR_best.fit(Xtrain, ytrain)

# ## Code to generate precision-recall and receiver operating characteristic (ROC) curves
# Precision-recall and ROC curves are both used to measure the performance of an algorithm. In particular, a classification algorithm is likely to make mistakes in the form of false positives (negative cases incorrectly classified as positive) and false negatives (positive cases incorrectly classified as negative). Most algorithms output a class probability, and the user must specify the threshold for a positive classification. Precision-recall and ROC curves generate performance scores for each possible threshold (between 0 and 1), giving a sense of overall performance. These are statistics calculated from the number of True Positives, $T_P$, True Negatives, $T_N$, False Positives, $F_P$, and False Negatives, $F_N$.
#
# For each threshold, a False Positive Rate, $R_{FP}$, and True Positive Rate, $R_{TP}$, are calculated and plotted on an ROC curve according to:
# $$R_{FP} = \frac{F_P}{F_P + T_N}, \ \ R_{TP} = \frac{T_P}{F_N + T_P}$$
#
# For each threshold, a Precision, $P$, and Recall, $R$, value is calculated and plotted on a precision-recall curve according to:
# $$P = \frac{T_P}{F_P + T_P}, \ \ R = \frac{T_P}{F_N + T_P}$$
#
# The $y=x$ line on an ROC plot corresponds to random guessing (i.e. terrible performance), and the closer an ROC curve lies to the upper left corner, the better the performance. By calculating the area under an ROC curve (```AUC_ROC```), one gets a single-number measure of overall performance.
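# To make the definitions above concrete, the cell below is a small illustrative sketch (not part of the original analysis) that computes these quantities by hand at a single threshold of 0.5. The arrays ```y_true``` and ```y_prob``` are toy values made up purely for illustration.

# In[ ]:

from sklearn.metrics import confusion_matrix

# toy ground-truth labels and predicted probabilities (illustrative only)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2])

# apply a single classification threshold of 0.5
y_hat = (y_prob >= 0.5).astype(int)

# for binary labels, confusion_matrix().ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print('R_FP =', fp / (fp + tn))   # false positive rate
print('R_TP =', tp / (fn + tp))   # true positive rate (= recall)
print('P    =', tp / (fp + tp))   # precision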
# In[10]:

def PR_ROC_curve(model, Xtest, ytest, test_i):
    ypred = model.predict_proba(Xtest)[:,1]
    index = np.isfinite(ypred)                  # occasionally a prediction -> inf/nan
    ytest, ypred = ytest[index], ypred[index]
    precision, recall, thresholds = precision_recall_curve(ytest, ypred)
    fpr, tpr, thresholds = roc_curve(ytest, ypred)
    auc = metrics.roc_auc_score(ytest, ypred)

    # best model
    best = pd.read_csv('XGB_model_preds.csv', index_col=0)
    ytest_best, ypred_best = best['Stable'][test_i], best['preds'][test_i]
    index = np.isfinite(ypred_best)             # occasionally a prediction -> inf/nan
    ytest_best, ypred_best = ytest_best[index], ypred_best[index]
    precision_best, recall_best, thresholds_best = precision_recall_curve(ytest_best, ypred_best)
    fpr_best, tpr_best, thresholds_best = roc_curve(ytest_best, ypred_best)
    auc_best = metrics.roc_auc_score(ytest_best, ypred_best)

    f, ax = plt.subplots(1, 2, figsize=(15,6))
    ax[0].plot(recall, precision)
    ax[0].plot(recall_best, precision_best)
    ax[0].set_xlim([0.0, 1.0])
    ax[0].set_ylim([0.0, 1.0])
    ax[0].set_xlabel('Recall', fontsize=12)
    ax[0].set_ylabel('Precision', fontsize=12)
    ax[0].set_title("Precision-Recall Curve")

    ax[1].plot(fpr, tpr, label='auc, your model: %f'%auc)
    ax[1].plot(fpr_best, tpr_best, label='auc, best model: %f'%auc_best)
    ax[1].plot([0, 1], [0, 1], 'k--')
    ax[1].set_xlim([0.0, 1.0])
    ax[1].set_ylim([0.0, 1.0])
    ax[1].set_xlabel('False Positive Rate')
    ax[1].set_ylabel('True Positive Rate')
    ax[1].set_title("Receiver Operating Characteristic Curve")
    ax[1].legend()

# In[11]:

# Generate PR/ROC curves
PR_ROC_curve(LR_best, Xtest, ytest, test_i)

# ## Support Vector Machine (SVM)
# Algorithm: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
#
# As you will notice, training an SVM is considerably slower than logistic regression. This is because the training time using a non-linear kernel (the default for ```SVC``` is the non-linear ```rbf``` kernel) is approximately $O(\max(n, d) \cdot \min(n, d)^2)$, where $n$ is the number of data points and $d$ is the number of features. Source: Chapelle, Olivier. "Training a support vector machine in the primal." Neural Computation 19.5 (2007): 1155-1178.

# In[13]:

from sklearn.svm import SVC   # SVC = Support Vector Classification

fold = KFold(n_splits=3, shuffle=False, random_state=None)

grid = {
    'C': [1, 1e-3],            # regularization penalty term
    'kernel': ['rbf','linear']
}

clf = SVC()

# Search over grid
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)

# In[16]:

# perform grid search, might take a minute!
gs.fit(Xtrain, ytrain)

print('gs.best_score_:', gs.best_score_)
print('gs.best_params_:', gs.best_params_)

# In[17]:

# Load and train best model on full training set (no cross validation)
SVC_best = SVC(C=gs.best_params_['C'], kernel=gs.best_params_['kernel'], probability=True)
SVC_best.fit(Xtrain, ytrain)

# In[47]:

# Generate PR/ROC curves
PR_ROC_curve(SVC_best, Xtest, ytest, test_i)

# ## Improving the models:
# Now that you know the basics of how to:
# - Use the stability dataset.
# - Split the data into train/test sets.
# - Generate new features.
# - Train/test a model.
# - Measure its performance.
#
# It's time to explore this notebook and try to achieve the best possible performance on the test set, and perhaps even compete with the "best model" performance on this dataset! This can be done by:
# 1. Trying different machine learning algorithms. There are many easy options available here - http://scikit-learn.org/stable/supervised_learning.html#supervised-learning.
# 2. Increasing the amount of training data.
# 3. Performing a more thorough grid search over the hyperparameters of a model. Is there a better set of values?
# 4. Feature engineering: are there better ways to express the data that more clearly delineate stable from unstable systems?
# 5. Ensembling: averaging the test-set predictions from many models. This tends to improve predictions, since the aggregate opinion of many models generally has lower variance (i.e. is less noisy) than the opinion of a single model. Each algorithm has unique weaknesses, which can be averaged out by the other models, which generally do not share the same weaknesses.
# 6. Standardizing the features (see the sketch just before the clustering code below).
#
# ## Questions:
# 1. Does one machine learning model perform much better than the others? Why?
# 2. Which features are most important to the performance? Does this make sense from a physics perspective?

# ## Information About 'Best Model'
# The best model was developed by [Dan Tamayo](https://github.com/dtamayo), [Naireen Hussain](https://github.com/naireen) and [Ari Silburt](https://github.com/silburt), and uses sophisticated hand-crafted features to generate predictions using the ML algorithm XGBoost. Many of these features are generated from the early evolution of the planetary system. For example, the minimum, maximum, standard deviation and mean eccentricity of each planet over the first 10,000 orbits are used as features. As a result, the 'best model' uses the initial conditions *and* the early evolution of a planetary system to generate a prediction. These sophisticated features can be found by loading "Stability_Data_BestModel_Features.csv", provided in the folder. Feel free to try it out!
#
# Since "Stability_Data_inial_conditions.csv" contains only the initial conditions, it is unlikely that you will be able to match/beat the performance of the best model. Please let us know if you even get close, as maybe you have created an important feature we didn't think of!!

# # Clustering
# We will briefly analyze this dataset from a clustering perspective. If we plot histograms of ```instability_time``` and ```log10(instability_time)```, we can see that clusters emerge. In particular, it appears that when binning ```instability_time```, systems cluster around $<0.1$ or $1$ billion years. Although it is very common to analyze the stability of planetary systems in terms of ```log10(instability_time)```, we will instead try to cluster the two groups in ```instability_time``` space.

# In[7]:

f, ax = plt.subplots(1, 3, figsize=(14,4), sharey=True)
ax[0].hist(np.log10(df['instability_time']));
ax[1].hist(df['instability_time']);
ax[2].hist(df['Stable']);
ax[0].set_ylabel('counts');
ax[0].set_xlabel('log10(instability_time)');
ax[1].set_xlabel('instability_time (years)');
ax[2].set_xlabel('Stability (binary)')

# ## Spectral Clustering
# Algorithm: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
#
# From [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html) we can see that spectral clustering can do a good job of separating intertwined/messy classes, while many other methods do not do as well. However, from [here](http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html) we can see that spectral clustering ("sklearn spectral" in the first figure) is a very slow algorithm and scales poorly with the amount of data.
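# A side note before clustering: both the RBF affinity used by spectral clustering below and suggestion 6 above (standardizing features) relate to feature scales, since distance-based methods can be dominated by whichever features span the largest numeric range. The cell below is a minimal, optional sketch using scikit-learn's ```StandardScaler```; ```X_scaled``` is a new name introduced here and is not used by the original analysis that follows, but you could pass it to the models or clusterers in place of ```X``` to see whether it helps.

# In[ ]:

from sklearn.preprocessing import StandardScaler

# rescale each feature to zero mean and unit variance;
# keep the result as a DataFrame so the column names are preserved
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X_scaled.describe().loc[['mean', 'std']]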
# We will nevertheless attempt to cluster the stable/unstable systems using this algorithm on a very small subset of the data.

# In[1]:

from sklearn.cluster import SpectralClustering

clf = SpectralClustering(n_clusters=2, affinity='rbf', gamma=1.0, n_init=10)
clf

# In[7]:

frac_data = 0.03
N = len(X)

X_clf = X[0:int(frac_data*N)]
y_clf = y[0:int(frac_data*N)]
X_clf.tail()

# In[8]:

# This might take a minute! (clustering is unsupervised, so no labels are passed)
y_clf_pred = clf.fit_predict(X_clf)

# In[17]:

metrics.adjusted_rand_score(y_clf, y_clf_pred)

# In[19]:

list(zip(y_clf, y_clf_pred))

# From the adjusted Rand score, and from comparing the ground-truth and predicted class labels with ```list(zip(y_clf, y_clf_pred))```, we can see that it doesn't do a great job. The clustering algorithm predicts mostly zeros, and few samples with ```y_clf=1``` have a corresponding ```y_clf_pred=1``` value. This highlights the importance of choosing the right machine learning algorithm for your problem, and also illustrates the [no free lunch theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem).
#
# There are many other clustering algorithms [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html); try implementing a few different ones, as they may perform better! For example, DBSCAN scales much better with the amount of data and uses a different approach to clustering. Perhaps it will better separate the two classes!

# In[ ]:
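# As a starting point for that suggestion, the cell below is a minimal sketch (not part of the original notebook) showing how ```DBSCAN``` and ```KMeans``` could be run on the same small subset and scored with the adjusted Rand index. The ```eps``` and ```min_samples``` values are arbitrary guesses and would need tuning; note that DBSCAN labels noise points as -1.

# In[ ]:

from sklearn.cluster import DBSCAN, KMeans

# DBSCAN: density-based clustering; eps/min_samples chosen arbitrarily here
db_pred = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_clf)
print('DBSCAN adjusted Rand score:', metrics.adjusted_rand_score(y_clf, db_pred))

# KMeans with 2 clusters, for comparison
km_pred = KMeans(n_clusters=2, n_init=10, random_state=777).fit_predict(X_clf)
print('KMeans adjusted Rand score:', metrics.adjusted_rand_score(y_clf, km_pred))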