Statistical Programming DC
10/23/14
<h1>Modeling with scikit-learn: common workflows<br>
for reproducible results</h1>
sklearn
workflows:The `caret` package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. [...] There are many different modeling functions in R. Some have different syntax for model training and/or prediction. The package started off as a way to provide a uniform interface the functions themselves, as well as a way to standardize common tasks (such parameter tuning and variable importance).
scikit-learn
's approach¶Consistent API with a cohesive "grammar" of modeling actions:
Nouns
Verbs
fit
transform
fit_transform
score
predict
We'll be using data on blood donations from the UC Irvine Machine Learning Repository. Given some data on some blood donors' previous donations, we'll be trying to predict a binary outcome of whether the donor gave blood in a certain time period.
Here's the official description:
To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build a FRMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency - months since last donation), F (Frequency - total number of donation), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).
Source: https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center
From beginning to end, we'll treat the raw data as immutable and base all of our analyses on views and transformations of this "ground truth" data set.
!wget -nc --directory-prefix data \
https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data
File ‘data/transfusion.data’ already there; not retrieving.
!head data/transfusion.data
pandas
DataFrame¶import numpy as np
import pandas as pd
df = pd.read_csv('data/transfusion.data')
df.head()
Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 | |
---|---|---|---|---|---|
0 | 2 | 50 | 12500 | 98 | 1 |
1 | 0 | 13 | 3250 | 28 | 1 |
2 | 1 | 16 | 4000 | 35 | 1 |
3 | 2 | 20 | 5000 | 45 | 1 |
4 | 1 | 24 | 6000 | 77 | 0 |
df.shape
(748, 5)
df.dtypes
Recency (months) int64 Frequency (times) int64 Monetary (c.c. blood) int64 Time (months) int64 whether he/she donated blood in March 2007 int64 dtype: object
# save the current names in case we need them later
original_column_names = df.columns
# make the names less ugly
names = ['recency', 'frequency', 'cc', 'time', 'donated']
df.columns = names
df.head()
recency | frequency | cc | time | donated | |
---|---|---|---|---|---|
0 | 2 | 50 | 12500 | 98 | 1 |
1 | 0 | 13 | 3250 | 28 | 1 |
2 | 1 | 16 | 4000 | 35 | 1 |
3 | 2 | 20 | 5000 | 45 | 1 |
4 | 1 | 24 | 6000 | 77 | 0 |
# import our graphics tools
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns # nice defaults for matplotlib styles
set2 = sns.color_palette('Set2')
# add on some settings from 'Bayesian Methods for Hackers'
plt.style.use('bmh')
# set larger default fonts for presentation-friendliness
mpl.rc('figure', figsize=(10, 8))
mpl.rc('axes', labelsize=16, titlesize=20)
from pandas.tools.plotting import scatter_matrix
axeslist = scatter_matrix(df, alpha=0.8, figsize=(10, 10))
for ax in axeslist.flatten():
ax.grid(False)
import numpy as np
# create a figure with 4 subplots
fig, axs = plt.subplots(nrows=2, ncols=2)
feature_column_names = df.columns[:-1]
label_column_name = df.columns[-1]
for i, col in enumerate(feature_column_names):
# get the current subplot to work on
ax = axs.ravel()[i]
# create some random y jitter to add
jitter = np.random.uniform(low=-0.05, high=0.05, size=len(df))
# plot the data
ax.scatter(x=df[col], y=df[label_column_name] + jitter,
c=df.donated, cmap='coolwarm', alpha=0.5)
# label the axes
ax.set_xlabel(col)
ax.set_ylabel(label_column_name)
plt.tight_layout()
plt.show()
scikit-learn
:¶from sklearn.cross_validation import train_test_split
# using conventional sklearn variable names
X = df[feature_column_names].astype(float)
y = df.donated.ravel()
# break up the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
X_train.shape
(374, 4)
y_train.shape
(374,)
X_test.shape
(374, 4)
y_test.shape
(374,)
The basic modeling workflow in sklearn
involves instantiating a model object with the desired settings, and then fitting it to given data.
from sklearn.tree import DecisionTreeClassifier
clf_tree = DecisionTreeClassifier(max_depth=3)
clf_tree.fit(X_train, y_train)
DecisionTreeClassifier(compute_importances=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, random_state=None, splitter='best')
print 'Score:', clf_tree.score(X_test, y_test)
Score: 0.794117647059
# viz adapted from http://scikit-learn.org/stable/modules/tree.html
import pydot
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz
dot_data = StringIO()
export_graphviz(clf_tree, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('output/decision_tree.png')
True
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2', fit_intercept=True)
clf.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
print 'Score:', clf.score(X_test, y_test)
Score: 0.772727272727
sklearn
tries to use the same syntax for most modeling tasks, so it's pretty easy to plug and play.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
other_clfs = {
'Random forest': RandomForestClassifier(),
'AdaBoost': AdaBoostClassifier(),
'Naive Bayes': MultinomialNB(),
'Linear SVC': LinearSVC(),
'KNN': KNeighborsClassifier(5),
}
# iterating through all of these models we want to fit ...
for name, other_clf in other_clfs.iteritems():
# fit the model with the training data
print other_clf.fit(X_train, y_train)
# cross validation score
print '---\nScore:', other_clf.score(X_test, y_test)
print '\n'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform') --- Score: 0.756684491979 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0) --- Score: 0.299465240642 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) --- Score: 0.729946524064 RandomForestClassifier(bootstrap=True, compute_importances=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0) --- Score: 0.756684491979 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0, n_estimators=50, random_state=None) --- Score: 0.775401069519
scikit-learn
model¶The object itself:
clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
Coefficients (if fit):
clf.coef_
array([[ -9.74764830e-02, 1.49194232e-06, 3.72985587e-04, -1.17108498e-02]])
Intercept (if fit):
clf.intercept_
array([-0.53074433])
Model parameters:
clf.get_params()
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'penalty': 'l2', 'random_state': None, 'tol': 0.0001}
Making predictions for $\hat y$:
clf.predict(X_test)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Predicting probabilities, $p(y_i = 1)$
pd.DataFrame(clf.predict_proba(X_test))\
.head(10)
0 | 1 | |
---|---|---|
0 | 0.691294 | 0.308706 |
1 | 0.952584 | 0.047416 |
2 | 0.726055 | 0.273945 |
3 | 0.770998 | 0.229002 |
4 | 0.680134 | 0.319866 |
5 | 0.705628 | 0.294372 |
6 | 0.637364 | 0.362636 |
7 | 0.888941 | 0.111059 |
8 | 0.559386 | 0.440614 |
9 | 0.936608 | 0.063392 |
The default metric for logistic regression is mean accuracy:
$$\mathrm{accuracy}(y, \hat y) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(\hat y_i = y_i)$$clf.score(X_test, y_test)
0.77272727272727271
But what if we want a more holistic view of how well this model works?
Our data set here is fairly small $(N=748)$, so what if we happened to randomly pick a non-representative chunk for testing?
Instead of summarizing the performance just on the test data we want to use $k$-fold cross-validation, splitting the data randomly into chunks — "folds" — and evaluating on each to get an idea of the variation in performance.
from sklearn import cross_validation
# come up with random folds of the data
kf = cross_validation.KFold(len(X), n_folds=5, shuffle=True)
def plot_scores(scores):
N = len(scores)
plt.bar(np.arange(1, N + 1) - 0.4, scores, color=set2[2])
plt.title('{}-fold cross-validation scores'.format(N), fontsize=18)
plt.xlabel('fold', fontsize=14)
plt.ylabel('score', fontsize=14)
plt.xlim(0.5, N + 0.5)
plt.ylim(0, 1)
plt.show()
# evaluate the fitted model on each fold in turn, returns a score for each fold
scores = cross_validation.cross_val_score(clf, X, y, cv=kf, n_jobs=1)
print 'scores:', scores
print 'average score:', np.mean(scores)
plot_scores(scores)
scores: [ 0.78 0.76666667 0.81333333 0.75167785 0.73154362] average score: 0.768644295302
We can try all sorts of other metrics, though, such as log loss:
$$\textrm{LogLoss}(y, \hat y) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$from sklearn.metrics import log_loss
log_loss(y_test, clf.predict_proba(X_test))
0.45710527540996404
Or the F1 score:
$$\textrm{F1} = 2 \frac{\textrm{precision} * \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$from sklearn.metrics import f1_score
f1_score(y_test, clf.predict(X_test))
0.12371134020618556
And if we want more than a single quantity summarizing score, sklearn
comes with all sorts of helpful tools like confusion_matrix
:
from itertools import permutations
from sklearn.metrics import confusion_matrix
# get the raw confusion matrix
cm = confusion_matrix(y_test, clf.predict(X_test))
# create a dataframe
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'pred {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
cmdf
pred 0 | pred 1 | |
---|---|---|
actual 0 | 283 | 5 |
actual 1 | 80 | 6 |
there are many available ...
metrics.accuracy_score, metrics.auc, metrics.average_precision_score, metrics.classification_report, metrics.confusion_matrix, metrics.f1_score, metrics.fbeta_score, metrics.hamming_loss, metrics.hinge_loss, metrics.jaccard_similarity_score, metrics.log_loss, metrics.matthews_corrcoef, metrics.precision_recall_curve, metrics.precision_recall_fscore_support, metrics.precision_score, metrics.recall_score, metrics.roc_auc_score, metrics.roc_curve, metrics.zero_one_loss, metrics.explained_variance_score, metrics.mean_absolute_error, metrics.mean_squared_error, metrics.r2_score, metrics.adjusted_mutual_info_score, metrics.adjusted_rand_score, metrics.completeness_score, metrics.homogeneity_completeness_v_measure, metrics.homogeneity_score, metrics.mutual_info_score, metrics.normalized_mutual_info_score, metrics.silhouette_score, metrics.silhouette_samples, metrics.v_measure_score, metrics.consensus_score, metrics.pairwise.additive_chi2_kernel, metrics.pairwise.chi2_kernel, metrics.pairwise.distance_metrics, metrics.pairwise.euclidean_distances, metrics.pairwise.kernel_metrics, metrics.pairwise.linear_kernel, metrics.pairwise.manhattan_distances, metrics.pairwise.pairwise_distances, metrics.pairwise.pairwise_kernels, metrics.pairwise.polynomial_kernel, metrics.pairwise.rbf_kernel, metrics.pairwise_distances, metrics.pairwise_distances_argmin, metrics.pairwise_distances_argmin_min
... and they all use the same basic syntax.
The regularization constant, $C$, is unknown. Let's say we want to try some different orders of magnitude as well experimenting with L1 regularization (default for sklearn
logistic regression is L2):
We know that the outcome can be different depending on what settings we choose.
from IPython.html.widgets import interact
from sklearn.metrics import roc_curve, auc
def plot_roc_curve(y_test, probas):
# Compute ROC curve and area the curve
fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.clf()
plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.legend(loc="lower right", fontsize=20)
plt.show()
def fit_model(penalty, C):
clf = LogisticRegression(penalty=penalty, C=C)
clf.fit(X_train, y_train)
plot_roc_curve(y_test, clf._predict_proba_lr(X_test))
interact(fit_model, penalty=('l2','l1'), C=(0.01, 100, 1))
<function __main__.fit_model>
From sklearn
, we get built in grid search for exploring the space. Using grid search cross validation, we can try all combinations, and automatically keep the most performant.
from sklearn.grid_search import GridSearchCV
params_to_try = {
'C': [0.01, 0.1, 1, 10, 100, 100],
'penalty': ['l1', 'l2']
}
gs = GridSearchCV(clf, param_grid=params_to_try, cv=5)
gs.fit(X, y)
GridSearchCV(cv=5, estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100, 100]}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None, verbose=0)
print "Best parameters:", gs.best_params_
print "Best score:", gs.best_score_
Best parameters: {'penalty': 'l1', 'C': 0.01} Best score: 0.791443850267
One thing we might want to do is scale all of the data before fitting any models. As the sklearn docs point out:
Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual feature do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
It's painful and error prone to wing it every time we want to do common transformations. The sklearn
toolbox comes with many useful tools for doing this in a consistent way.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train.astype(np.float))
X_train_standardized
array([[ 1.15376999, 0.35655147, 0.35655147, 2.38280942], [-0.56644764, 6.41792654, 6.41792654, 2.50245107], [ 0.22749896, -0.60050775, -0.60050775, -0.9671566 ], ..., [-0.03714991, -0.12197814, -0.12197814, -0.64811222], [ 0.62447226, -0.44099788, -0.44099788, -0.01002345], [-0.83109651, 0.03753173, 0.03753173, -0.56835112]])
print 'column means:', np.round(X_train_standardized.mean(axis=0))
print 'column variances:', np.round(X_train_standardized.var(axis=0))
column means: [-0. 0. 0. 0.] column variances: [ 1. 1. 1. 1.]
Now that the scaler is fit on the training data, we can use it to transform new data:
X_new = np.array([25., 35., 9200., 90.])
scaler.transform(X_new)
array([ 2.08004102, 4.66331797, 4.95043573, 2.18340669])
from sklearn.decomposition import PCA
# instantiate the PCA transformation object
pca = PCA(n_components=2, whiten=True)
# fit the PCA object on and transform the training data
X_train_pca = pca.fit_transform(X_train_standardized)
# create a 3d figure
fig = plt.figure()
ax = fig.add_subplot(111)
# scatterplot the PCA points
ax.scatter(*np.hsplit(X_train_pca, 2), c=y_train, s=40,
cmap='coolwarm')
# annotate and show the figure
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')
plt.show()
We have all seen the data science version of this:
But there's no reason for data_aug_13_v2_scaled_pca_3_dims_WHITENED_plus_some_tweaks_final_V4_FINAL.csv, right?
We shouldn't have to do that.
Pipeline
¶One big takeaway: it's all about documented, repeatable data-to-model pipelines.
At some point, especially for tasks we will be doing more than once, we might want to codify all of the exploratory work we've done. The standard way to express a beginning-to-end data transformation and model fitting workflow in sklearn
is the Pipeline.
from sklearn.pipeline import Pipeline
# define a pipeline with some transforms and a simple classifier
pipeline = Pipeline([
('scale', StandardScaler()),
('reduce_dim', PCA()),
('clf', LogisticRegression()),
])
# enumerate all of the different settings we wish to try out
parameters = {
'reduce_dim__n_components': (1, 2, 3),
'reduce_dim__whiten': (True, False),
'clf__penalty': ('l1', 'l2'),
'clf__C': (1e-3, 1e-2, 1, 1e1, 1e2, 1e3),
}
# grid search the parameter space
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
print("Performing grid search...\n")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
grid_search.fit(X, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.0s [Parallel(n_jobs=1)]: Done 50 jobs | elapsed: 0.2s [Parallel(n_jobs=1)]: Done 200 jobs | elapsed: 0.7s [Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed: 0.8s finished
Performing grid search... ('pipeline:', ['scale', 'reduce_dim', 'clf']) parameters: {'clf__penalty': ('l1', 'l2'), 'clf__C': (0.001, 0.01, 1, 10.0, 100.0, 1000.0), 'reduce_dim__whiten': (True, False), 'reduce_dim__n_components': (1, 2, 3)} Fitting 3 folds for each of 72 candidates, totalling 216 fits Best score: 0.777 Best parameters set: clf__C: 0.01 clf__penalty: 'l2' reduce_dim__n_components: 2 reduce_dim__whiten: True
sklearn
in the real world¶sklearn
Why choose? Use whichever one is best for the task at hand.
# load the %R cell magic extension
%load_ext rmagic
# send the dataframe over to the R instance
%Rpush df
%%R
library(ggplot2)
qplot(log(time), log(cc), data=df, color=donated)
%%R
blood.glm <- glm(donated ~ log(cc) + log(time), data=df, family="binomial")
print(summary(blood.glm))
par(mfrow=c(2, 2))
plot(blood.glm)
Call: glm(formula = donated ~ log(cc) + log(time), family = "binomial", data = df) Deviance Residuals: Min 1Q Median 3Q Max -1.5214 -0.7665 -0.5352 -0.2782 2.5447 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -7.9523 0.8749 -9.089 < 2e-16 *** log(cc) 1.4448 0.1693 8.535 < 2e-16 *** log(time) -1.0278 0.1454 -7.069 1.56e-12 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 820.89 on 747 degrees of freedom Residual deviance: 728.63 on 745 degrees of freedom AIC: 734.63 Number of Fisher Scoring iterations: 5
... and we can pull data back from R to Python:
r_coeffs = %R coef(blood.glm)
r_coeffs
array([-7.95228363, 1.44481591, -1.02784654])
Even better for those in the Julia community ... you can replicate this whole experience when working in an IJulia notebook:
In fact - IPython notebooks are not just for Python anymore. The core functionality has been generalized as Project Jupyter, and language kernels are available or under active development for Julia, Haskell, and R.
Currently in beta — launching our first for-prize competitions in the next few weeks.