In this module, I look at time series light intensity data provided by the Kepler Space Telescope to detect the presence of exoplanets.
The fields are explained here:
LABEL
is either 1 or 2, where 1 means the star has no orbiting planet and 2 means it has at least one.
FLUX.1 : FLUX.3197
are the light intensity values recorded for each star over some time interval.
I intend to analyze the plots, then train a predictor that learns the relationships between the light intensity values and determines whether or not a star has at least one orbiting planet.
For our purposes, we will assume that there are only two types of stars, ones with orbiting planets and ones with none.
The data is retrieved from Kaggle; check out their publicly available datasets for machine learning and data analysis.
import pandas as pd # data analysis
import scipy # stats and dependencies
import numpy as np # linear algebra
import matplotlib.pyplot as plt # plotting library
import seaborn as sns # statistics library
import IPython # shell commands
np.random.seed(42)
%matplotlib inline
df = pd.read_csv('exoTrain.csv')
y = df['LABEL']
# show the first 5 rows
df.head()
LABEL | FLUX.1 | FLUX.2 | FLUX.3 | FLUX.4 | FLUX.5 | FLUX.6 | FLUX.7 | FLUX.8 | FLUX.9 | ... | FLUX.3188 | FLUX.3189 | FLUX.3190 | FLUX.3191 | FLUX.3192 | FLUX.3193 | FLUX.3194 | FLUX.3195 | FLUX.3196 | FLUX.3197 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 93.85 | 83.81 | 20.10 | -26.98 | -39.56 | -124.71 | -135.18 | -96.27 | -79.89 | ... | -78.07 | -102.15 | -102.15 | 25.13 | 48.57 | 92.54 | 39.32 | 61.42 | 5.08 | -39.54 |
1 | 2 | -38.88 | -33.83 | -58.54 | -40.09 | -79.31 | -72.81 | -86.55 | -85.33 | -83.97 | ... | -3.28 | -32.21 | -32.21 | -24.89 | -4.86 | 0.76 | -11.70 | 6.46 | 16.00 | 19.93 |
2 | 2 | 532.64 | 535.92 | 513.73 | 496.92 | 456.45 | 466.00 | 464.50 | 486.39 | 436.56 | ... | -71.69 | 13.31 | 13.31 | -29.89 | -20.88 | 5.06 | -11.80 | -28.91 | -70.02 | -96.67 |
3 | 2 | 326.52 | 347.39 | 302.35 | 298.13 | 317.74 | 312.70 | 322.33 | 311.31 | 312.42 | ... | 5.71 | -3.73 | -3.73 | 30.05 | 20.03 | -12.67 | -8.77 | -17.31 | -17.35 | 13.98 |
4 | 2 | -1107.21 | -1112.59 | -1118.95 | -1095.10 | -1057.55 | -1034.48 | -998.34 | -1022.71 | -989.57 | ... | -594.37 | -401.66 | -401.66 | -357.24 | -443.76 | -438.54 | -399.71 | -384.65 | -411.79 | -510.54 |
5 rows × 3198 columns
The following plots show some summary statistics for the two classes. Blue corresponds to label 1 (stars with no exoplanets) and red to label 2 (stars with at least one exoplanet).
import warnings
warnings.filterwarnings("ignore") # suppress warning messages in the output
sns.set()
flux = df.drop('LABEL', axis=1) # compute the per-star statistics over the flux columns only
means1 = flux[y==1].mean(axis=1)
medians1 = flux[y==1].median(axis=1)
std1 = flux[y==1].std(axis=1)
maxval1 = flux[y==1].max(axis=1)
minval1 = flux[y==1].min(axis=1)
skew1 = flux[y==1].skew(axis=1)
means2 = flux[y==2].mean(axis=1)
medians2 = flux[y==2].median(axis=1)
std2 = flux[y==2].std(axis=1)
maxval2 = flux[y==2].max(axis=1)
minval2 = flux[y==2].min(axis=1)
skew2 = flux[y==2].skew(axis=1)
# plot each category
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(231)
ax.hist(means1,alpha=0.8,bins=50,color='b',density=True,range=(-250,250),label='No planets')
ax.hist(means2,alpha=0.8,bins=50,color='r',density=True,range=(-250,250),label='With planets')
ax.legend()
ax.set_xlabel('Mean Intensity')
ax.set_ylabel('Density')
ax = fig.add_subplot(232)
ax.hist(medians1,alpha=0.8,bins=50,color='b',density=True,range=(-0.1,0.1))
ax.hist(medians2,alpha=0.8,bins=50,color='r',density=True,range=(-0.1,0.1))
ax.set_xlabel('Median Intensity')
ax = fig.add_subplot(233)
ax.hist(std1,alpha=0.8,bins=50,density=True,color='b',range=(0,4000))
ax.hist(std2,alpha=0.8,bins=50,density=True,color='r',range=(0,4000))
ax.set_xlabel('Intensity Standard Deviation')
ax = fig.add_subplot(234)
ax.hist(maxval1,alpha=0.8,bins=50,density=True,color='b',range=(-10000,50000))
ax.hist(maxval2,alpha=0.8,bins=50,density=True,color='r',range=(-10000,50000))
ax.set_xlabel('Maximum Intensity')
ax = fig.add_subplot(235)
ax.hist(minval1,alpha=0.8,bins=50,density=True,color='b',range=(-50000,10000))
ax.hist(minval2,alpha=0.8,bins=50,density=True,color='r',range=(-50000,10000))
ax.set_xlabel('Minimum Intensity')
ax = fig.add_subplot(236)
ax.hist(skew1,alpha=0.8,bins=50,density=True,color='b',range=(-40,60))
ax.hist(skew2,alpha=0.8,bins=50,density=True,color='r',range=(-40,60))
ax.set_xlabel('Intensity Skewness')
plt.show()
The next cells sample 12 stars from each category and plot their light intensities over time.
# 12 stars with at least one exoplanet (red)
fig = plt.figure(figsize=(12,40))
t = np.array(range(3197))
for i in range(12):
    ax = fig.add_subplot(13,3,i+1)
    ax.scatter(t, flux[y==2].iloc[i], s=1, c='r')
# 12 stars with no exoplanets (blue)
fig = plt.figure(figsize=(12,40))
for i in range(12):
    ax = fig.add_subplot(13,3,i+1)
    ax.scatter(t, flux[y==1].iloc[i], s=1, c='b')
Notice that the time-varying data is far too complex for simple boolean rules. We need to employ machine learning algorithms and train models on this data, then test the models against new data to check the accuracy of each ML model.
This will look at the percentage of stars that contain planets versus the ones that don't. If the dataset is imbalanced, then we need to revise the training data, since we want the ML model to discern the differences between the classes of stars rather than just assign everything to the majority class.
df['LABEL'].value_counts() # print the sums of each unique value
1    5050
2      37
Name: LABEL, dtype: int64
This seems pretty imbalanced. Let's look at the percentages:
print('Stars with no planets are', round(df['LABEL'].value_counts()[1]/len(df) * 100,2), '% of the dataset')
print('Stars with orbiting planets are', round(df['LABEL'].value_counts()[2]/len(df) * 100,2), '% of the dataset')
Stars with no planets are 99.27 % of the dataset
Stars with orbiting planets are 0.73 % of the dataset
The data is far too imbalanced. We need to balance the dataset before applying machine learning, or else the model will just classify every star as having no exoplanets instead of learning the differences.
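To make that concrete, a majority-class baseline already scores about 99% accuracy on this training set without learning anything about the light curves. A minimal sketch (the DummyClassifier check is my own illustration, not part of the original analysis):
from sklearn.dummy import DummyClassifier
# a classifier that always predicts the most frequent label (1 = no planets)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(df.drop('LABEL', axis=1), df['LABEL'])
# ~99.3% accuracy while never detecting a single exoplanet
print('Majority-class baseline accuracy:', baseline.score(df.drop('LABEL', axis=1), df['LABEL']))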
sns.countplot(x='LABEL', data=df)
plt.title('Label Distributions \n (1: No planets || 2: With Planets)', fontsize=14);
Undersample the majority class to balance the dataset.
label_1 = df[df['LABEL'] == 1] # stars with no planets (majority class)
label_2 = df[df['LABEL'] == 2] # stars with at least one planet (minority class)
sample_size = sum(df['LABEL'] == 2) # match the size of the minority class
# randomly pick as many majority-class rows as there are minority-class rows
random_indices = np.random.choice(label_1.index, sample_size, replace=False)
under_sample_idx = np.concatenate([random_indices, label_2.index])
under_sample_idx = np.random.permutation(under_sample_idx) # shuffle the combined indices
under_sample = df.loc[under_sample_idx]
sns.countplot(x='LABEL', data=under_sample)
plt.title('Label Distributions \n (1: No planets || 2: With Planets)', fontsize=14);
X = under_sample.drop(['LABEL'], axis = 1) # features
y = under_sample['LABEL']
# import testing data
test = pd.read_csv('exoTest.csv')
X_test = test.drop(['LABEL'], axis = 1)
y_test = test['LABEL']
Here I will work with scikit-learn and print the prediction accuracy on the testing file exoTest.csv,
which will be uploaded with this document.
The first step is to normalize the inputs and map the labels to 0 and 1 for the binary classifiers.
# scale each star's light curve to unit L2 norm
X_train = np.array([v/np.linalg.norm(v) for v in X.values])
X_test = np.array([v/np.linalg.norm(v) for v in X_test.values])
# map the labels {1, 2} to {0, 1}
y_test = y_test.values - 1
y_train = y.values - 1
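The list comprehension above scales each light curve to unit L2 norm. As a sanity check (my own, not part of the original notebook), it matches scikit-learn's row-wise normalize:
from sklearn.preprocessing import normalize
# normalize() divides each row by its L2 norm, the same operation as the comprehension above
# (assuming no row is all zeros)
X_train_alt = normalize(X.values, norm='l2')
print(np.allclose(X_train, X_train_alt)) # expected: True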
# this is Decision Tree Classifier method
from sklearn import tree
# use the data from before 'exoTrain.csv' for training
decTree = tree.DecisionTreeClassifier().fit(X_train, y_train)
# this is to check the accuracy versus testing data
from sklearn import metrics
decTree_score = metrics.accuracy_score(y_test, decTree.predict(X_test))
print('Accuracy using Decision Tree Classifier is ' + f'{decTree_score *100} %')
Accuracy using Decision Tree Classifier is 52.45614035087719 %
This is what the fitted decision tree looks like:
# export to a dot file in the same directory as the notebook
tree.export_graphviz(decTree, out_file='tree.dot')
!dot -Tpng tree.dot -o tree.png # convert to PNG (requires Graphviz)
import matplotlib.image as mpimg
img=mpimg.imread('tree.png')
plt.imshow(img)
plt.subplots_adjust(left=0.0, right=2.0, bottom=0.0, top=3.0)
plt.title('Decision Tree Classifier')
plt.show()
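If the Graphviz dot executable is not available, newer versions of scikit-learn (0.21+) can draw the same tree directly with matplotlib; a minimal alternative sketch:
# alternative rendering that does not need Graphviz
fig, ax = plt.subplots(figsize=(20,10))
tree.plot_tree(decTree, max_depth=3, filled=True, ax=ax) # only the top levels, for readability
plt.show()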
Here I use an SVM classifier with a Gaussian RBF kernel and run a grid search to find the best hyperparameters.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
params = {'C': [1, 2, 5, 7, 10, 100, 150, 200, 500], 'gamma': [7e-4, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]}
svm_rbf = GridSearchCV(SVC(kernel = 'rbf'), params, scoring = 'accuracy')
svm_rbf.fit(X_train, y_train)
y_score_rbf = svm_rbf.decision_function(X_test)
y_pred_rbf = svm_rbf.predict(X_test)
rbf_acc = metrics.accuracy_score(y_pred = y_pred_rbf, y_true = y_test)
rbf_prec = metrics.precision_score(y_pred = y_pred_rbf, y_true = y_test)
rbf_recall = metrics.recall_score(y_pred = y_pred_rbf, y_true = y_test)
rbf_f1 = metrics.f1_score(y_true = y_test, y_pred = y_pred_rbf)
rbf_auc = metrics.roc_auc_score(y_true = y_test, y_score = y_score_rbf)
print('Using the best parameters for Gaussian RBF', svm_rbf.best_params_)
print('Accuracy {:2.2%}'.format(rbf_acc))
print('Precision score {:2.2%}'.format(rbf_prec))
print('Recall score {:2.2%}'.format(rbf_recall))
print('F1 score {:2.2%}'.format(rbf_f1))
Using the best parameters for Gaussian RBF {'C': 5, 'gamma': 0.5}
Accuracy 73.51%
Precision score 1.97%
Recall score 60.00%
F1 score 3.82%
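The very low precision next to a decent accuracy means the RBF SVM flags many stars that have no planets. A confusion matrix over the imbalanced test set makes this explicit; a quick sketch using the variables already defined above:
# rows are true labels (0 = no planets, 1 = with planets), columns are predictions
print(metrics.confusion_matrix(y_test, y_pred_rbf))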
# This is the KNN or K-Nearest-Neighbors
from sklearn.neighbors import KNeighborsClassifier
# leave it at the default k=5
neigh = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
neigh_score = metrics.accuracy_score(y_test, neigh.predict(X_test))
print('Accuracy using K-Nearest-Neighbors (k = 5) is ' + f'{neigh_score *100} %')
Accuracy using K-Nearest-Neighbors (k = 5) is 69.2982456140351 %
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
rfc_score = metrics.accuracy_score(y_test, rfc.predict(X_test))
print('Accuracy using Random Forest Classifier is nearly ' + f'{rfc_score *100} %')
Accuracy using Random Forest Classifier is nearly 78.0701754385965 %
Here I will try a multi-layer perceptron, a simple deep neural network from scikit-learn, to find a possibly better estimator. I will use the Adam optimizer for training and ReLU as the activation function.
from sklearn.neural_network import MLPClassifier
dnn = MLPClassifier(hidden_layer_sizes=(67, 4,), solver='adam', alpha = 8.0e-4,
max_iter = 500, activation='relu', random_state = 42)
dnn.fit(X_train, y_train)
dnn_score = dnn.score(X_test, y_test)
print('Accuracy using this Multi-layer Perceptron is ' + f'{dnn_score * 100} %')
Accuracy using this Multi-layer Perceptron is 74.91228070175438 %
We can predict the presence of at least one exoplanet with an accuracy of around 78% using the random forest. A neural network with a suitable architecture might be able to beat that score, but more work is needed to find the optimal layer sizes for this dataset.
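One way to continue that search is to grid-search over the hidden layer sizes, the same way as for the SVM above. A rough sketch, where the candidate layer sizes and regularization values are illustrative guesses rather than tuned settings:
# hypothetical hyperparameter search for the MLP; the grids below are examples, not tuned values
mlp_params = {'hidden_layer_sizes': [(32,), (64,), (67, 4), (64, 16), (128, 32)],
              'alpha': [1e-4, 8e-4, 1e-3]}
mlp_search = GridSearchCV(MLPClassifier(solver='adam', activation='relu',
                                        max_iter=500, random_state=42),
                          mlp_params, scoring='accuracy', cv=5)
mlp_search.fit(X_train, y_train)
print('Best MLP parameters:', mlp_search.best_params_)
print('Test accuracy: {:2.2%}'.format(mlp_search.score(X_test, y_test)))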