In [39]:
from IPython.display import Image
Image('https://kaggle2.blob.core.windows.net/competitions/kaggle/3966/media/overview-AfSIS-kaggle.png', width=700)
Out[39]:

Africa Soil Property Prediction Challenge (Kaggle)

The challenge of this Kaggle competition is to predict physical and chemical properties of soil from spectral measurements.

Advances in rapid, low-cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data-sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management. Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low-cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum. The measurement can typically be performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.

Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.
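
The calibration workflow described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the array sizes, the noise level, the ridge penalty `alpha`, and the helper `fit_ridge` are all hypothetical and are not part of the competition material. A model is fitted on the subset of samples that received the reference test and then applied to the whole sample set.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in for spectra: 200 samples x 20 wavebands (assumed sizes).
n_samples, n_bands = 200, 20
spectra = rng.normal(size=(n_samples, n_bands))

# Hypothetical linear relationship, used only to generate reference values.
true_coef = rng.normal(size=n_bands)
reference = spectra.dot(true_coef) + rng.normal(scale=0.1, size=n_samples)

# Only a subset of samples receives the slow, expensive reference test.
calib_idx = rng.choice(n_samples, size=80, replace=False)

def fit_ridge(X, y, alpha=0.1):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T.dot(Xc) + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T.dot(yc))
    intercept = y_mean - x_mean.dot(coef)
    return coef, intercept

# Calibrate on the subset, then predict for the whole sample set.
coef, intercept = fit_ridge(spectra[calib_idx], reference[calib_idx])
predicted = spectra.dot(coef) + intercept
rmse = np.sqrt(np.mean((predicted - reference) ** 2))
```

Ridge regression is swapped in here only because it has a short closed form; the text does not say which calibration model is used in practice.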

This competition asks participants to predict five target soil functional properties from diffuse reflectance infrared spectroscopy measurements:

  • SOC: Soil organic carbon
  • pH: pH values
  • Ca: Mehlich-3 extractable Calcium
  • P: Mehlich-3 extractable Phosphorus
  • Sand: Sand content

Importing modules

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
%pylab inline
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import cPickle
from datetime import datetime
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import svm
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['indices', 'axes', 'mean', 'datetime', 'mod', 'test']
`%matplotlib` prevents importing * from pylab and numpy

Loading Data and Performing Basic Checks

In [41]:
# reading the csv file and setting the first column as index
soil = pd.read_csv('./train/training.csv', index_col = 0)
# printing size of data frame
print 'Size of data frame: ', soil.shape 
# summary of the data by column
soil.describe()
Size of data frame:  (1157, 3599)
Out[41]:
m7497.96 m7496.04 m7494.11 m7492.18 m7490.25 m7488.32 m7486.39 m7484.46 m7482.54 m7480.61 ... REF3 REF7 RELI TMAP TMFI Ca P pH SOC Sand
count 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 ... 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000 1157.000000
mean 0.245666 0.240454 0.235631 0.238994 0.248176 0.251674 0.243996 0.235162 0.232874 0.232518 ... -0.661642 -0.638464 0.276786 0.563194 0.746303 0.006442 -0.014524 -0.028543 0.080414 -0.012646
std 0.114439 0.114804 0.115288 0.115075 0.114185 0.113603 0.113974 0.114723 0.115031 0.115021 ... 0.365572 0.326460 1.074667 0.649622 0.825242 1.070541 0.995469 0.920224 1.141989 0.988520
min -0.042260 -0.048559 -0.055518 -0.052353 -0.040608 -0.034516 -0.042619 -0.053856 -0.057699 -0.058482 ... -1.265010 -1.115423 -0.639823 -0.670742 -0.862741 -0.535828 -0.418309 -1.886946 -0.857863 -1.493378
25% 0.171156 0.166020 0.161043 0.164470 0.173065 0.175476 0.169058 0.161094 0.159238 0.158868 ... -0.917184 -0.881048 -0.452939 0.190708 0.056843 -0.451077 -0.345681 -0.717841 -0.615639 -0.899649
50% 0.252899 0.247918 0.244594 0.247920 0.255784 0.258029 0.251061 0.243775 0.241991 0.241599 ... -0.753623 -0.740423 -0.130139 0.316667 0.729111 -0.348682 -0.269595 -0.175376 -0.349974 -0.134651
75% 0.315508 0.310354 0.304742 0.309540 0.317786 0.320834 0.314091 0.304301 0.303235 0.302438 ... -0.445135 -0.432460 0.532450 0.955935 1.414215 -0.042654 -0.089755 0.376442 0.275121 0.786391
max 0.730793 0.725493 0.720711 0.723293 0.731205 0.733872 0.726075 0.717652 0.716443 0.716307 ... 0.366460 0.290323 5.612300 2.161892 2.976315 9.645815 13.266841 3.416117 7.619989 2.251685

8 rows × 3598 columns

In [42]:
# printing the first five rows of the whole data frame
soil.head()
Out[42]:
m7497.96 m7496.04 m7494.11 m7492.18 m7490.25 m7488.32 m7486.39 m7484.46 m7482.54 m7480.61 ... REF7 RELI TMAP TMFI Depth Ca P pH SOC Sand
PIDN
XNhoFZW5 0.302553 0.301137 0.299748 0.300354 0.302679 0.303799 0.301702 0.298936 0.298126 0.298120 ... -0.646673 1.687734 0.190708 0.056843 Topsoil -0.295749 -0.041336 -1.129366 0.353258 1.269748
9XNspFTd 0.270192 0.268555 0.266964 0.267938 0.271013 0.272346 0.269870 0.266976 0.266544 0.266766 ... -0.646673 1.687734 0.190708 0.056843 Subsoil -0.387442 -0.231552 -1.531538 -0.264023 1.692209
WDId41qG 0.317433 0.316265 0.314948 0.315224 0.316942 0.317764 0.316067 0.313874 0.313301 0.313296 ... -0.814516 1.806660 0.190708 0.056843 Topsoil -0.248601 -0.224635 -0.259551 0.064152 2.091835
JrrJf1mN 0.261116 0.259767 0.258384 0.259001 0.261310 0.262417 0.260534 0.258039 0.257246 0.257124 ... -0.814516 1.806660 0.190708 0.056843 Subsoil -0.332195 -0.318014 -0.577548 -0.318719 2.118477
ZoIitegA 0.260038 0.258425 0.256544 0.257030 0.259602 0.260786 0.258717 0.256352 0.255902 0.255822 ... -0.780242 0.430513 0.190708 0.056843 Topsoil -0.438350 -0.010210 -0.699135 -0.310905 2.164148

5 rows × 3599 columns

In [43]:
# checking whether index is unique
soil.index.is_unique
Out[43]:
True
In [44]:
# defining a function to count missing values
def count_missing(frame):
    return (frame.shape[0] * frame.shape[1]) - frame.count().sum()
# checking for missing values in data frame
print 'Missing Values: ', count_missing(soil)
Missing Values:  0
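
A total count of zero is enough here, but when a frame does contain gaps it helps to know which columns they are in. The sketch below shows one way to get a per-column breakdown with `isnull()`. The toy frame and its values are hypothetical and serve only to make the counts visible.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two gaps.
toy = pd.DataFrame({'m1': [0.1, np.nan, 0.3],
                    'Depth': ['Topsoil', None, 'Subsoil'],
                    'pH': [0.5, 0.2, 0.1]},
                   columns=['m1', 'Depth', 'pH'])

# Total number of missing cells, same quantity as size minus non-null count.
total_missing = int(toy.isnull().sum().sum())
same_as_count = int(toy.size - toy.count().sum())

# Per-column counts, and the names of the columns that have gaps.
missing_by_column = toy.isnull().sum()
columns_with_gaps = missing_by_column[missing_by_column > 0].index.tolist()
```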

Exploratory Data Analysis and Plotting

The first 3578 columns hold the infrared (IR) measurements, so plotting them for a single row gives the IR spectrum of that soil sample. Since inspecting all 1157 spectra this way is impractical, we plot 20 randomly chosen rows instead. The selection is not seeded, so re-running the following cell produces a different set of plots each time.
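
The spectral column names appear to encode the measurement position, e.g. `m7497.96` for wavenumber 7497.96 (in cm⁻¹, as far as I recall from the competition's data description; treat this as an assumption). A minimal sketch of recovering numeric values from the names, which could then serve as an x-axis, is shown below. The helper `wavenumbers_from_columns` and the example column list are hypothetical.

```python
def wavenumbers_from_columns(columns):
    """Return (name, value) pairs for columns shaped like 'm7497.96'."""
    pairs = []
    for name in columns:
        if name.startswith('m'):
            try:
                pairs.append((name, float(name[1:])))
            except ValueError:
                # Starts with 'm' but the rest is not numeric: skip it.
                pass
    return pairs

# Example with a few column names taken from the frame shown earlier.
example_columns = ['m7497.96', 'm7496.04', 'REF7', 'Depth', 'pH']
pairs = wavenumbers_from_columns(example_columns)
```

Parsing the suffix, rather than only testing for a leading `m`, also guards against a non-spectral column that happens to start with that letter.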

In [45]:
# getting a list of all the columns starting with 'm'
wavecolumns = [wn for wn in soil.columns.values if wn.startswith('m')]
# getting 20 random row positions (sampled with replacement; upper bound is exclusive)
randomrows = np.random.randint(0, soil.shape[0], 20)
# getting the indices of the random rows selected 
indices = [soil.index.values[index] for index in randomrows]
# plotting
fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(18,70), sharex=False)
fig.subplots_adjust(hspace = .08, wspace=.1)
axes = axes.ravel()
for i, index in enumerate(indices):
    (soil.loc[index,wavecolumns[0]:wavecolumns[-1]]).plot(ax=axes[i], lw=2.)
    axes[i].legend((index,),loc='upper left')
    axes[i].set_ylim(0,2.5)
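
The cell above draws row positions with replacement and without a seed, so a row can appear twice and the plots change on every run. If reproducible, distinct rows are wanted, one possible variant is sketched below. The seed value 42 is arbitrary, and `n_rows` is set to the 1157 rows reported earlier; inside the notebook it would be `soil.shape[0]`.

```python
import numpy as np

n_rows = 1157  # number of rows reported for the training frame

# A seeded generator makes the selection repeatable across runs.
rng = np.random.RandomState(42)

# replace=False guarantees 20 distinct row positions.
random_positions = rng.choice(n_rows, size=20, replace=False)

# Re-creating the generator with the same seed yields the same selection.
repeat_positions = np.random.RandomState(42).choice(n_rows, size=20, replace=False)
```

The resulting positions can be used in place of `randomrows` when building `indices`.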