Breakout: Exploring Machine Learning¶

In this breakout, we'll explore the use of scikit-learn on some astronomical data.

This will use astroML, a package for machine learning in astronomy which includes several convenient loaders for various astronomical datasets.

Recall that if you don't have astroML installed, you can install it easily from the command-line with pip:

$ pip install astroML

In [1]:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()

1. Classification: Labeling Photometric Sources¶

Here we'll do some automated classification of photometric sources. First the data, which can be fetched via astroML. If you don't have astroML installed, use pip install astroML

In [2]:

from astroML.datasets import fetch_rrlyrae_combined
from sklearn.cross_validation import train_test_split

X, y = fetch_rrlyrae_combined()

# For now, we'll only fit the first two colors
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [3]:

N_plot = 5000
plt.scatter(X[-N_plot:, 0], X[-N_plot:, 1], c=y[-N_plot:],
            edgecolors='none', cmap='RdBu')
plt.xlabel('u-g color')
plt.ylabel('g-r color')
plt.xlim(0.7, 1.4)
plt.ylim(-0.2, 0.4);

Blue points are RR-Lyrae, Red points are main sequence stars.

Now we'll do a simple and fast $K$-neighbors classification.

Note that we train on part of the data and test on another part. Otherwise, an estimator which simply remembers the entire dataset would obtain a perfect classification!

In [4]:

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Now let's check how we're doing. You might be tempted to simply check the accuracy: i.e. in what fraction of points the predictions match the true value. Let's see this:

In [5]:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Out[5]:

0.99643562655672935

In [6]:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred,
                            target_names=['MS star', 'RR Lyrae']))

             precision    recall  f1-score   support

    MS star       1.00      1.00      1.00     23154
   RR Lyrae       0.69      0.67      0.68       132

avg / total       1.00      1.00      1.00     23286

Note on the meaning of these terms:¶

$$ {\rm precision} \equiv \frac{\rm true~positives}{\rm true~positives + false~positives} $$$$ {\rm recall} \equiv \frac{\rm true~positives}{\rm true~positives + false~negatives} $$$$ F_1 \equiv \frac{\rm precision \cdot recall}{\rm precision + recall} $$

The range for precision, recall, and F1 score is 0 to 1, with 1 being perfect. Often in astronomy, we call the recall the completeness, and (1 - precision) the contamination.

Because this is an unbalanced dataset, we see that the RR Lyrae stars are completely overwhelmed by the larger number of normal Main Sequence stars.

Classification Exercise¶

Use another classifier and try to improve on this precision/recall score.

There are several you might try:

sklearn.naive_bayes.GaussianNB (fast but inaccurate)
sklearn.lda.LDA (fast but inaccurate)
sklearn.svm.SVC (slow but accurate)
sklearn.svm.SVC with kernel='rbf' (slow but accurate)
sklearn.ensemble.RandomForestClassifier (fast & potentially accurate with tuning)

For the slower algorithms, it might be a good idea to use only part of the dataset as you experiment, i.e. when training do something like the following:

clf.fit(X_train[::5], y_train[::5])

Once you're happy, you can run the training on the full dataset.

What's the best precision/recall you can obtain for the RR-Lyrae data?

In [6]:

2. Regression: Photometric Redshifts¶

The photometric redshift problem is a classic Regression problem

In [7]:

from astroML.datasets import fetch_sdss_specgals

In [8]:

data = fetch_sdss_specgals()

# put magnitudes in a matrix
X = np.vstack([data['modelMag_%s' % f] for f in 'ugriz']).T
y = data['z']

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [9]:

from sklearn.linear_model import LinearRegression
est = LinearRegression()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)

In [10]:

plt.plot(y_test, y_pred, ',k')
plt.plot([0, 1], [0, 1], ':k')
plt.xlim(0, 0.6)
plt.ylim(0, 0.6)

Out[10]:

(0, 0.6)

Evidently, a simple linear model is not a very good fit!

Let's compute the RMS deviation to see how poor this actually is:

In [11]:

rms = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(rms)

0.0322454288984

Regression Exercise¶

Try to improve on this result using another regression algorithm. Some potential choices:

sklearn.neighbors.KNeighborsRegressor (fast-ish, but often inaccurate)
sklearn.svm.SVR (potentially accurate, but slow)
sklearn.ensemble.RandomForestRegressor (fast, and accurate when well-tuned)
sklearn.ensemble.GradientBoostingRegressor (accurate with tuning, can be very slow depending on parameters)

Again, sub-sampling the data can help as you explore some of the slower techniques.

In [12]:

from sklearn.ensemble import RandomForestClassifier

# Use a question mark in IPython to find help:
RandomForestClassifier?

In [12]: