from IPython.display import Image
Image('https://kaggle2.blob.core.windows.net/competitions/kaggle/3966/media/overview-AfSIS-kaggle.png', width=700)
The challenge of this Kaggle competition is to predict physical and chemical properties of soil from spectral measurements.
Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management. Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum. The measurement can be typically performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.
Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.
This competition asks us to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
%pylab inline
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import cPickle
from datetime import datetime
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import svm
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['indices', 'axes', 'mean', 'datetime', 'mod', 'test'] `%matplotlib` prevents importing * from pylab and numpy
# reading the csv file and setting the first column as index
soil = pd.read_csv('./train/training.csv', index_col = 0)
# printing size of data frame
print 'Size of data frame: ', soil.shape
# summary of the data by column
soil.describe()
Size of data frame: (1157, 3599)
m7497.96 | m7496.04 | m7494.11 | m7492.18 | m7490.25 | m7488.32 | m7486.39 | m7484.46 | m7482.54 | m7480.61 | ... | REF3 | REF7 | RELI | TMAP | TMFI | Ca | P | pH | SOC | Sand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | ... | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 | 1157.000000 |
mean | 0.245666 | 0.240454 | 0.235631 | 0.238994 | 0.248176 | 0.251674 | 0.243996 | 0.235162 | 0.232874 | 0.232518 | ... | -0.661642 | -0.638464 | 0.276786 | 0.563194 | 0.746303 | 0.006442 | -0.014524 | -0.028543 | 0.080414 | -0.012646 |
std | 0.114439 | 0.114804 | 0.115288 | 0.115075 | 0.114185 | 0.113603 | 0.113974 | 0.114723 | 0.115031 | 0.115021 | ... | 0.365572 | 0.326460 | 1.074667 | 0.649622 | 0.825242 | 1.070541 | 0.995469 | 0.920224 | 1.141989 | 0.988520 |
min | -0.042260 | -0.048559 | -0.055518 | -0.052353 | -0.040608 | -0.034516 | -0.042619 | -0.053856 | -0.057699 | -0.058482 | ... | -1.265010 | -1.115423 | -0.639823 | -0.670742 | -0.862741 | -0.535828 | -0.418309 | -1.886946 | -0.857863 | -1.493378 |
25% | 0.171156 | 0.166020 | 0.161043 | 0.164470 | 0.173065 | 0.175476 | 0.169058 | 0.161094 | 0.159238 | 0.158868 | ... | -0.917184 | -0.881048 | -0.452939 | 0.190708 | 0.056843 | -0.451077 | -0.345681 | -0.717841 | -0.615639 | -0.899649 |
50% | 0.252899 | 0.247918 | 0.244594 | 0.247920 | 0.255784 | 0.258029 | 0.251061 | 0.243775 | 0.241991 | 0.241599 | ... | -0.753623 | -0.740423 | -0.130139 | 0.316667 | 0.729111 | -0.348682 | -0.269595 | -0.175376 | -0.349974 | -0.134651 |
75% | 0.315508 | 0.310354 | 0.304742 | 0.309540 | 0.317786 | 0.320834 | 0.314091 | 0.304301 | 0.303235 | 0.302438 | ... | -0.445135 | -0.432460 | 0.532450 | 0.955935 | 1.414215 | -0.042654 | -0.089755 | 0.376442 | 0.275121 | 0.786391 |
max | 0.730793 | 0.725493 | 0.720711 | 0.723293 | 0.731205 | 0.733872 | 0.726075 | 0.717652 | 0.716443 | 0.716307 | ... | 0.366460 | 0.290323 | 5.612300 | 2.161892 | 2.976315 | 9.645815 | 13.266841 | 3.416117 | 7.619989 | 2.251685 |
8 rows × 3598 columns
# printing the first five rows of the whole data frame
soil.head()
m7497.96 | m7496.04 | m7494.11 | m7492.18 | m7490.25 | m7488.32 | m7486.39 | m7484.46 | m7482.54 | m7480.61 | ... | REF7 | RELI | TMAP | TMFI | Depth | Ca | P | pH | SOC | Sand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PIDN | |||||||||||||||||||||
XNhoFZW5 | 0.302553 | 0.301137 | 0.299748 | 0.300354 | 0.302679 | 0.303799 | 0.301702 | 0.298936 | 0.298126 | 0.298120 | ... | -0.646673 | 1.687734 | 0.190708 | 0.056843 | Topsoil | -0.295749 | -0.041336 | -1.129366 | 0.353258 | 1.269748 |
9XNspFTd | 0.270192 | 0.268555 | 0.266964 | 0.267938 | 0.271013 | 0.272346 | 0.269870 | 0.266976 | 0.266544 | 0.266766 | ... | -0.646673 | 1.687734 | 0.190708 | 0.056843 | Subsoil | -0.387442 | -0.231552 | -1.531538 | -0.264023 | 1.692209 |
WDId41qG | 0.317433 | 0.316265 | 0.314948 | 0.315224 | 0.316942 | 0.317764 | 0.316067 | 0.313874 | 0.313301 | 0.313296 | ... | -0.814516 | 1.806660 | 0.190708 | 0.056843 | Topsoil | -0.248601 | -0.224635 | -0.259551 | 0.064152 | 2.091835 |
JrrJf1mN | 0.261116 | 0.259767 | 0.258384 | 0.259001 | 0.261310 | 0.262417 | 0.260534 | 0.258039 | 0.257246 | 0.257124 | ... | -0.814516 | 1.806660 | 0.190708 | 0.056843 | Subsoil | -0.332195 | -0.318014 | -0.577548 | -0.318719 | 2.118477 |
ZoIitegA | 0.260038 | 0.258425 | 0.256544 | 0.257030 | 0.259602 | 0.260786 | 0.258717 | 0.256352 | 0.255902 | 0.255822 | ... | -0.780242 | 0.430513 | 0.190708 | 0.056843 | Topsoil | -0.438350 | -0.010210 | -0.699135 | -0.310905 | 2.164148 |
5 rows × 3599 columns
# checking whether index is unique
soil.index.is_unique
True
# defining a function to compute missing values
def count_missing(frame):
    return (frame.shape[0] * frame.shape[1]) - frame.count().sum()
# checking for missing values in data frame
print 'Missing Values: ', count_missing(soil)
Missing Values: 0
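An equivalent, arguably more idiomatic, way to get the same count is via `isnull()`. A minimal sketch on a toy frame (the toy data is illustrative, not from the soil set):

```python
import numpy as np
import pandas as pd

# toy frame with two missing cells
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 3.0]})

# count_missing from above: total cells minus non-null cells
n_missing = (df.shape[0] * df.shape[1]) - df.count().sum()

# same answer with isnull()
assert n_missing == df.isnull().sum().sum() == 2
```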
The first 3578 columns represent IR measurements: plotting them for a given row yields the infrared spectrum of that soil sample. As it is infeasible to check that for the whole dataset, we plot 20 random rows. If you keep re-running the following commands you'll get different images every time.
# getting a list of all the columns starting with 'm'
wavecolumns = [wn for wn in soil.columns.values if wn.startswith('m')]
# getting 20 random rows
randomrows = np.random.random_integers(0,soil.shape[0]-1,20)
# getting the indices of the random rows selected
indices = [soil.index.values[index] for index in randomrows]
# plotting
fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(18,70), sharex=False)
fig.subplots_adjust(hspace = .08, wspace=.1)
axes = axes.ravel()
for i, index in enumerate(indices):
    (soil.loc[index, wavecolumns[0]:wavecolumns[-1]]).plot(ax=axes[i], lw=2.)
    axes[i].legend((index,), loc='upper left')
    axes[i].set_ylim(0, 2.5)
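A side note on the sampling above: `np.random.random_integers` samples with replacement (the same row can be drawn twice) and is deprecated in newer NumPy. A sketch of sampling distinct row positions with `np.random.choice` instead (the seed and row count are illustrative):

```python
import numpy as np

n_rows = 1157  # number of rows in the soil frame
rng = np.random.RandomState(0)

# sample 20 distinct row positions; replace=False rules out duplicates
randomrows = rng.choice(n_rows, size=20, replace=False)

assert len(set(randomrows)) == 20
assert randomrows.min() >= 0 and randomrows.max() < n_rows
```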
The last 21 columns are what remains after the IR measurements. The final five are the prediction targets, while the others (listed by the following command) are:
BSA : average long-term Black Sky Albedo measurements from MODIS satellite images (BSAN = near-infrared, BSAS = shortwave, BSAV = visible)
CTI: compound topographic index calculated from Shuttle Radar Topography Mission elevation data
ELEV: Shuttle Radar Topography Mission elevation data
EVI: average long-term Enhanced Vegetation Index from MODIS satellite images.
LST: average long-term Land Surface Temperatures from MODIS satellite images (LSTD = day time temperature, LSTN = night time temperature)
Ref: average long-term Reflectance measurements from MODIS satellite images (Ref1 = blue, Ref2 = red, Ref3 = near-infrared, Ref7 = mid-infrared)
Reli: topographic Relief calculated from Shuttle Radar Topography Mission elevation data
TMAP & TMFI: average long-term Tropical Rainfall Monitoring Mission data (TMAP = mean annual precipitation, TMFI = modified Fournier index)
Depth: Depth of the soil sample (2 categories: "Topsoil", "Subsoil"). We are going to focus on this categorical variable after having inspected the other ones.
soil.columns[-21:]
Index([u'BSAN', u'BSAS', u'BSAV', u'CTI', u'ELEV', u'EVI', u'LSTD', u'LSTN', u'REF1', u'REF2', u'REF3', u'REF7', u'RELI', u'TMAP', u'TMFI', u'Depth', u'Ca', u'P', u'pH', u'SOC', u'Sand'], dtype='object')
The following commands inspect the listed variables in a little more detail.
One aspect worth checking is the distribution of each of these features, so we plot histograms to visualize what's going on. The regression targets are plotted in red.
histog = [col for col in soil.columns[-21:].values if col != 'Depth']
fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(18,70), sharex=False)
fig.subplots_adjust(hspace = .08, wspace=.1)
axes = axes.ravel()
for i, col in enumerate(histog):
    if i > 14:
        soil[col].hist(ax=axes[i], bins=18, color='red')
        leg = col + ' (target of regression)'
        axes[i].legend((leg,), loc='upper right')
    else:
        soil[col].hist(ax=axes[i], bins=18)
        axes[i].legend((col,), loc='upper right')
Now it's time to check the categorical variable (Depth). To do that we generate boxplots of all the features except the IR ones.
fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(18,70), sharex=False)
fig.subplots_adjust(hspace = .2, wspace=.07)
axes = axes.ravel()
for i, col in enumerate(histog):
    soil.boxplot(ax=axes[i], column=col, by='Depth')
    if i > 14:
        leg = col + ' (target of regression)'
        axes[i].legend((leg,), loc='upper right')
    else:
        axes[i].legend((col,), loc='upper right')
    # scatter jittered points over each group's own box
    # (boxplot by='Depth' places Subsoil at x=1 and Topsoil at x=2)
    for position, soiltype in enumerate(['Subsoil', 'Topsoil'], start=1):
        y = soil[col][soil.Depth == soiltype]
        x = np.random.normal(position, 0.04, size=len(y))
        axes[i].plot(x, y, 'r.', alpha=0.08)
The distributions of a single feature split by Depth do not seem to differ much. Let's generate two dummy variables out of the Depth column and go through some further checks.
# generating dummy variables out of Depth
soildepth = pd.get_dummies(soil['Depth'])
soildepth.head(10)
Subsoil | Topsoil | |
---|---|---|
PIDN | ||
XNhoFZW5 | 0 | 1 |
9XNspFTd | 1 | 0 |
WDId41qG | 0 | 1 |
JrrJf1mN | 1 | 0 |
ZoIitegA | 0 | 1 |
QkhF9azr | 1 | 0 |
kFpRoFpt | 0 | 1 |
QH0y7Gc0 | 1 | 0 |
yTfzYmjL | 0 | 1 |
9my11nPJ | 1 | 0 |
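On a binary column, `pd.get_dummies` produces two complementary indicator columns, one per category. A toy sketch (the Depth values here are illustrative, not rows from the soil frame):

```python
import pandas as pd

depth = pd.Series(['Topsoil', 'Subsoil', 'Topsoil'], name='Depth')
dummies = pd.get_dummies(depth)

# one indicator column per category, in sorted order
assert list(dummies.columns) == ['Subsoil', 'Topsoil']
# the two columns are complementary: exactly one indicator is set per row
assert (dummies.sum(axis=1) == 1).all()
```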
The following figure plots 50 entries (1000 to 1050) of the newly created 'Subsoil' column. There is definitely something going on: the rows of the data frame appear to be coupled. Two consecutive entries of the table are simply soil samples taken at different depths (Subsoil, Topsoil) from the same geographical area. Further evidence comes from plotting some of the other features and noticing that consecutive values are paired rather than random. For the sake of simplicity we'll focus on REF3 only.
# plotting 50 consecutive entries of Subsoil
soildepth.Subsoil[1000:1050].plot(figsize=(10, 8))
<matplotlib.axes.AxesSubplot at 0x7f5a1deca450>
# plotting the first 30 entries of the REF3 feature. As you can see consecutive values are paired, probably meaning
# that soil samples are collected at the exact same point at two different depths. This also explains why boxplots
# of the same predictor split by Depth are so similar to each other
soil.REF3[:30].plot(figsize=(10, 8))
<matplotlib.axes.AxesSubplot at 0x7f5a1c3e5210>
# plotting the whole REF3 feature.
soil.REF3.plot(figsize=(10, 8))
<matplotlib.axes.AxesSubplot at 0x7f5a1d53d310>
# rather than the paired pattern above we would like something like the following: randomized data.
# we'll take care of the shuffling during the Machine Learning pipeline.
pd.DataFrame(soil.REF3).apply(np.random.permutation).plot(figsize=(10, 8))
<matplotlib.axes.AxesSubplot at 0x7f5a1d91e210>
def computeMCRMSE(y_actual, y_predicted):
    """
    computes the mean columnwise root mean squared error given
    5-column prediction and 5-column actual value
    """
    n = len(y_actual[0])
    summ = 0
    for i in range(len(y_actual)):
        summ += np.sqrt(((np.array(y_actual[i]) - np.array(y_predicted[i]))**2).sum() / n)
    return summ / len(y_actual)
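The loop above can also be written as a single vectorized expression. A sketch with a hand-checked toy case (same list-of-columns layout as `computeMCRMSE`):

```python
import numpy as np

def mcrmse(y_actual, y_predicted):
    """Mean columnwise RMSE: average the per-column RMSEs."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    # each row of the input is one target column, so reduce along axis=1
    per_column = np.sqrt(((y_actual - y_predicted) ** 2).mean(axis=1))
    return per_column.mean()

# toy check: per-column errors are constantly 1 and 3, so MCRMSE = (1 + 3) / 2 = 2
y_true = [[0.0, 0.0], [0.0, 0.0]]
y_pred = [[1.0, 1.0], [3.0, 3.0]]
assert np.isclose(mcrmse(y_true, y_pred), 2.0)
```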
def performParameterSelection(model_name, X, y):
    """
    performs GridSearchCV on the train set with a specific model and parameter grid
    """
    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs=-1, cv=5, verbose=1)
    gs.fit(X, y)
    pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))
    fname_out = '{}-{}-{}.pickle'.format(model_name, y.name, datetime.now())
    with open(fname_out, 'wb') as fout:
        cPickle.dump(gs, fout, -1)
    print "Saved model to {}".format(fname_out)
def getDummiedSoilDepth(df):
    """
    get dummy variables from the binary categorical Depth column (Subsoil, Topsoil),
    then drop the categorical column.
    """
    soildepth = pd.get_dummies(df['Depth'])
    df.insert(0, 'Subsoil', soildepth['Subsoil'])
    df.insert(0, 'Topsoil', soildepth['Topsoil'])
    df = df.drop('Depth', 1)
    return df
def shuffleData(df):
    """
    randomly shuffle the dataframe because consecutive entries are paired by Depth
    """
    df = df.reindex(np.random.permutation(df.index))
    return df
def splitTrainTest(dataframe, percentage):
    """
    splits the shuffled dataframe into train and test sets according to percentage (float).
    """
    index = int(np.floor(dataframe.shape[0] * percentage))
    predictors = dataframe.iloc[:, :-5]
    predictors = pd.DataFrame(preprocessing.scale(predictors), index=predictors.index, columns=predictors.columns)
    output = dataframe.iloc[:, -5:]
    X_train = predictors[:index]
    y_train = output[:index]
    Ca_train = y_train['Ca']
    P_train = y_train['P']
    pH_train = y_train['pH']
    SOC_train = y_train['SOC']
    Sand_train = y_train['Sand']
    X_test = predictors[index:]
    y_test = output[index:]
    Ca_test = y_test['Ca']
    P_test = y_test['P']
    pH_test = y_test['pH']
    SOC_test = y_test['SOC']
    Sand_test = y_test['Sand']
    return X_train, Ca_train, P_train, pH_train, SOC_train, Sand_train, X_test, Ca_test, P_test, pH_test, SOC_test, Sand_test
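One caveat about `splitTrainTest`: it scales the predictors on the full frame before splitting, so test-set statistics leak into the scaler. A leak-free variant fits the scaler on the training rows only. A sketch on synthetic data (`StandardScaler` is the estimator behind `preprocessing.scale`; the array shapes are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(5.0, 2.0, size=(100, 3))

index = int(np.floor(X.shape[0] * 0.8))
X_train, X_test = X[:index], X[index:]

# fit the scaling parameters on the training rows only, then apply to both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# training columns are centred; the test set merely reuses the train statistics
assert np.allclose(X_train_s.mean(axis=0), 0.0)
assert X_test_s.shape == (20, 3)
```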
def performSVR(X_train, y_train, X_test, parameters):
    """
    SVR regression
    """
    C, epsilon, kernel = parameters
    clf = svm.SVR(C=C, epsilon=epsilon, kernel=kernel)
    clf.fit(X_train, y_train)
    fname_out = '{}-{}final.pickle'.format(y_train.name, datetime.now())
    with open(fname_out, 'wb') as f:
        cPickle.dump(clf, f, -1)
    return clf.predict(X_test)
def performRidge(X_train, y_train, X_test, parameters):
    """
    Ridge regression
    """
    alpha = parameters[0]
    clf = linear_model.Ridge(alpha=alpha)
    clf.fit(X_train, y_train)
    fname_out = '{}-{}final.pickle'.format(y_train.name, datetime.now())
    with open(fname_out, 'wb') as f:
        cPickle.dump(clf, f, -1)
    return clf.predict(X_test)
def performLasso(X_train, y_train, X_test, parameters):
    """
    Lasso regression
    """
    alpha = parameters[0]
    clf = linear_model.Lasso(alpha=alpha)
    clf.fit(X_train, y_train)
    fname_out = '{}-{}final.pickle'.format(y_train.name, datetime.now())
    with open(fname_out, 'wb') as f:
        cPickle.dump(clf, f, -1)
    return clf.predict(X_test)
def performENET(X_train, y_train, X_test, parameters):
    """
    ElasticNet regression
    """
    alpha, l1_ratio = parameters
    clf = linear_model.ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    clf.fit(X_train, y_train)
    fname_out = '{}-{}final.pickle'.format(y_train.name, datetime.now())
    with open(fname_out, 'wb') as f:
        cPickle.dump(clf, f, -1)
    return clf.predict(X_test)
Given an algorithm and a parameter grid, GridSearchCV generates all possible combinations of the parameters and performs K-fold cross-validation on the training set with each combination. It then returns the best model, which we save to disk.
From a technical point of view we are facing a data frame in which the number of features exceeds the number of entries. We should pick algorithms suited to this situation, which is prone to overfitting, so it is unwise to turn to ensemble methods or SVMs with radial kernels. The intuition is to keep things simple and not add extra flexibility to the model. Penalized linear models are a good starting point, together with SVR with a linear kernel. The regularization term introduced by Lasso, Ridge and Elastic Net (a compromise between the previous two) is essential in this situation.
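A quick illustration of why regularization matters when there are more features than samples: ordinary least squares can then interpolate the training noise perfectly, while Ridge keeps the coefficients small. This is a sketch on synthetic data, not the competition set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
n, p = 40, 200                            # fewer samples than features, like the soil frame
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)    # only the first feature actually matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS drives training error to ~0 by fitting the noise; Ridge shrinks coefficients
assert ols.score(X, y) > 0.999
assert np.linalg.norm(ridge.coef_) <= np.linalg.norm(ols.coef_) + 1e-9
```

The near-perfect training score of OLS here is a symptom of overfitting, not of a good model, which is the reason the notebook sticks to penalized estimators.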
# Scikit-Learn Algorithm Cheat Sheet
from IPython.core.display import HTML
HTML("<iframe src=http://scikit-learn.org/stable/tutorial/machine_learning_map/ width=1000 height=600></iframe>")
MODELS = {
    'lasso': (
        linear_model.Lasso(),
        {'alpha': [0.001, 0.01, 0.1, 1]},
    ),
    'ridge': (
        linear_model.Ridge(),
        {'alpha': [0.001, 0.01, 0.1, 1]},
    ),
    'elasticnet': (
        linear_model.ElasticNet(),
        {'alpha': [0.001, 0.01, 0.1, 1],
         'l1_ratio': [0.001, 0.01, 0.1, 1]},
    ),
    'svr': (
        svm.SVR(),
        {'C': [0.001, 0.01, 0.1, 1],
         'epsilon': [0.001, 0.01, 0.1, 1],
         'kernel': ['linear']},
    ),
}
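Each MODELS entry plugs straight into GridSearchCV. A minimal end-to-end sketch on synthetic data (this uses the modern `sklearn.model_selection` import path; the notebook's sklearn version used `sklearn.grid_search`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + 0.1 * rng.normal(size=60)

# same (estimator, grid) shape as the 'ridge' entry of MODELS
model, param_grid = Ridge(), {'alpha': [0.001, 0.01, 0.1, 1]}
gs = GridSearchCV(model, param_grid, cv=5)
gs.fit(X, y)

# the chosen alpha is one of the candidates, and the refit model can predict
assert gs.best_params_['alpha'] in param_grid['alpha']
assert gs.predict(X[:3]).shape == (3,)
```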
# reread the original csv file to overwrite the soil data frame and start everything again
soil = pd.read_csv('./train/training.csv', index_col = 0)
# generates dummy variables out of Depth feature
soil = getDummiedSoilDepth(soil)
# setting seed and shuffling data frame
np.random.seed(2013)
soil = shuffleData(soil)
# generate all test and train set with target variables
X_train, Ca_train, P_train, pH_train, SOC_train, Sand_train, X_test, Ca_test, P_test, pH_test, SOC_test, Sand_test = splitTrainTest(soil, 0.8)
trainCV = [Ca_train, P_train, pH_train, SOC_train, Sand_train]
# for each target variable a CV on the train set will be performed.
# GridSearchCV will print to screen and at the same time a log file with all results will be saved.
# The best model for each algorithm and target will be saved.
#######################################################################################################
# SEVERAL LINES ARE COMMENTED OUT IN THE FOLLOWING CODE TO PREVENT WRITING TO FILE.
# UNCOMMENT THEM IF YOU WANT TO RUN THE FULL GRID SEARCH.
#######################################################################################################
for y_tr in trainCV:
    logname = './{}-{}.txt'.format(y_tr.name, datetime.now())
    #sys.stdout = open(logname, 'w')
    for model in MODELS.keys():
        pass
        #print '======================================='
        #print y_tr.name, model
        #print '======================================='
        #performParameterSelection(model, X_train, y_tr)
This is an example of what the log file for the Ca target variable looks like.
from IPython.core.display import Image
Image(filename='cv.png', width=700)
The previous command generates 5 log files along with 4 pickled best-model files for each target.
These are the results of the grid search (the best model for each target variable):
# test set with all the test targets previously generated
test = [X_test, Ca_test, P_test, pH_test, SOC_test, Sand_test]
# best models
best_Ca = './Sep25/svr-Ca-2014-09-25 08:11:09.934405.pickle'
best_P = './Sep25/svr-P-2014-09-25 08:31:53.611168.pickle'
best_pH = './Sep25/ridge-pH-2014-09-25 08:48:25.247992.pickle'
best_SOC = './Sep25/ridge-SOC-2014-09-25 09:06:50.757604.pickle'
best_Sand = './Sep25/svr-Sand-2014-09-25 09:23:22.520904.pickle'
best_models = [best_Ca, best_P, best_pH, best_SOC, best_Sand]
y_predicted = []
# each best model is applied to the test set to predict its target value
for model in best_models:
    with open(model, 'rb') as fin:
        mod = cPickle.load(fin)
    y_predicted.append(mod.predict(X_test))
# compute the mean columnwise root mean squared error
print 'MCRMSE: ', computeMCRMSE(test[1:], y_predicted)
MCRMSE: 0.410190847722
The purpose of this notebook was to show the potential of Pandas and Scikit-Learn in an IPython environment on a real dataset. From a machine learning point of view the approach we followed is fairly simple and naive: no proper feature selection was performed, beyond the regularization built into the linear models. Nevertheless we obtained a result and built a pipeline that can be reused and improved in the future.