Annotated Example: Using Random Forests on Images

This tutorial is an annotated version of http://www.kaggle.com/c/datasciencebowl/details/tutorial, including commentary and some minor alterations to compare alternative approaches.

It's good practice to put all your import statements at the top of the file. If for no other reason, this is "what everyone does", so conforming to the convention will help others read your code, and help you better understand examples.

In [1]:
#Import libraries for doing image analysis
from skimage.io import imread
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier as RF
import glob
import os
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedKFold as KFold
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
from matplotlib import colors
from pylab import cm
from skimage import segmentation
from skimage.morphology import watershed
from skimage import measure
from skimage import morphology
import numpy as np
import pandas as pd
from scipy import ndimage
from skimage.feature import peak_local_max

The original tutorial included this later. That makes sense in terms of the progression of ideas, but it also makes the flow more complex. I'm going to get this method out of the way early: this is the scoring method. You don't really need to understand it now. Although optimising the submission according to the scoring function is relevant eventually, you really want to focus on the integrity of your solution at the moment. Just forget about it for now...

In [2]:
def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    https://www.kaggle.com/wiki/MultiClassLogLoss

    Parameters
    ----------
    y_true : array, shape = [n_samples]
            true class, integers in [0, n_classes)
    y_pred : array, shape = [n_samples, n_classes]
            predicted class probabilities

    Returns
    -------
    loss : float
    """
    predictions = np.clip(y_pred, eps, 1 - eps)

    # normalize row sums to 1
    predictions /= predictions.sum(axis=1)[:, np.newaxis]

    actual = np.zeros(y_pred.shape)
    n_samples = actual.shape[0]
    actual[np.arange(n_samples), y_true.astype(int)] = 1
    vectsum = np.sum(actual * np.log(predictions))
    loss = -1.0 / n_samples * vectsum
    return loss
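
To get a feel for what this function returns, here's a tiny check you can run (not part of the original tutorial): two samples, three classes, with fairly confident and correct predictions.

toy_true = np.array([0, 2])
toy_pred = np.array([[0.8, 0.1, 0.1],
                     [0.1, 0.1, 0.8]])
print multiclass_log_loss(toy_true, toy_pred)   # roughly 0.22; perfect predictions would score close to 0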

You'll need to set up a variable to point to the location of your data on disk. The data isn't included in the STML repository, since it's not licensed for such redistribution. However, it is readily obtained from Kaggle.

The second step in the cell below also suppresses a lot of warnings raised by the underlying code. This shouldn't be needed -- it's the result of various package authors not yet having had the time to fully resolve and respond to issues. It's usually safe enough, but if you experience a particularly strange problem, you might want to skip this step to reveal more debugging information.

In [3]:
BASE_DIR = '../data/'
import warnings
warnings.filterwarnings("ignore")

print os.listdir(BASE_DIR)
['.ipynb_checkpoints', 'clean_output.py', 'expected_headers.csv', 'found_headers.csv', 'plankton_identification.pdf', 'resize_train', 'sampleSubmission.csv', 'sampleSubmission.csv.zip', 'sub1.csv.gz', 'submission.csv', 'test', 'test.zip', 'train', 'train.zip']

Something I'm hoping to do at some point is put together a better framework for solving machine learning problems in Python. This code is taking a directory layout like:

/data
--/category_one_label/example 1 ... example n
--/category_two_label/example 1 ... example n

and fetching the labels of all the categories. You need to be super-careful about the order of the labels in this case. For example, if you sort them, you might lose the association with the relevant image files and mis-label your learned categories. Tiny mistakes like this are a real problem, particularly if you need to spend an hour on the CPU to re-learn with adjusted labels. Lowering the barrier to entry for working with data will mean writing software which takes care of this kind of low-level plumbing.

In [4]:
# get the classnames from the directory structure
directory_names = list(set(glob.glob(os.path.join(BASE_DIR,"train", "*"))\
 ).difference(set(glob.glob(os.path.join(BASE_DIR,"train","*.*")))))
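
One small habit that guards against the ordering problem described above (a sketch, not part of the original tutorial): keep each label physically attached to its file path, so that sorting one can never separate it from the other. This assumes, as is the case for this dataset, that the images sit directly inside each class folder.

examples = []
for label, folder in enumerate(sorted(directory_names)):
    for path in glob.glob(os.path.join(folder, "*.jpg")):
        examples.append((path, label))   # the path and its label travel as one record

Sorting the folder names up front, before any labels are assigned, is safe and makes the numbering deterministic between runs; the danger described above is sorting one list after the pairing has already been made.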

Look at the docstrings in the next two functions; I've put the commentary there instead of here. When building your own library, you should document the functions you include to record any tips and tricks for using them.

In [5]:
def getLargestRegion(props=None, labelmap=None, imagethres=None):
    '''
    This wasn't documented in the original tutorial, and it's doing something a bit magical. Reading the code,
    it returns the region (from props) with the largest filled area, skipping any region whose pixels are less
    than 50% nonzero in the thresholded image. In other words: find the biggest blob that is mostly foreground.
    '''
    regionmaxprop = None
    for regionprop in props:
        # check to see if the region is at least 50% nonzero
        if sum(imagethres[labelmap == regionprop.label])*1.0/regionprop.area < 0.50:
            continue
        if regionmaxprop is None:
            regionmaxprop = regionprop
        if regionmaxprop.filled_area < regionprop.filled_area:
            regionmaxprop = regionprop
    return regionmaxprop


def getMinorMajorRatio(image):
    '''
    This also wasn't documented, but its functionality is more obvious. This is getting the aspect 
    ratio of the underlying image. This is the first piece of 'insight' being applied. The information 
    is inherent in the image itself, so a sufficiently advanced machine learning algorithm should be able 
    to discover this feature for itself. A neural network, for example, might not need this step. The 
    author has recognised, however, that the aspect ratio is a useful characteristic of an image, and 
    therefore it is useful information to supply to the algorithm.

    How this is done is actually incredibly interesting. The aspect ratio value is NOT distinguished 
    from the rest of the image pixel data. As far as the random forest algorithm is concerned, this 
    might just be another pixel value to consider. The semantic difference between "the aspect ratio" 
    and "an image intensity value" is discarded completely.

    This method is highly generic and would apply to any image. You should create yourself a library 
    containing common methods for image analysis, and keep a copy there. 
    '''
    
    image = image.copy()
    # Create the thresholded image to eliminate some of the background
    imagethr = np.where(image > np.mean(image),0.,1.0)

    #Dilate the image
    imdilated = morphology.dilation(imagethr, np.ones((4,4)))

    # Create the label list
    label_list = measure.label(imdilated)
    label_list = imagethr*label_list
    label_list = label_list.astype(int)
    
    region_list = measure.regionprops(label_list)
    maxregion = getLargestRegion(region_list, label_list, imagethr)
    
    # guard against cases where the segmentation fails by providing zeros
    ratio = 0.0
    if maxregion is not None and maxregion.major_axis_length != 0.0:
        ratio = maxregion.minor_axis_length * 1.0 / maxregion.major_axis_length
    return ratio
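
As a sanity check (not part of the original tutorial), you can call this on a synthetic image: a dark 10x30 rectangle on a light background should give a minor/major axis ratio of roughly one third.

toy = np.ones((50, 70))
toy[20:30, 20:50] = 0.0   # a dark, elongated 10x30 'plankton' on a light background
print getMinorMajorRatio(toy)   # expect a value close to 0.33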

Handling Image Sizes

Images come in all shapes and sizes. As humans, we are all used to seeing things: the real world; images on computers and images on physical media. Somehow, we have no problem coping with concepts like rotation, distance and size. Our brain just takes care of it. If I reflect, I'm aware of an ability to somewhat 'zoom in' on an object, or at least focus my attention on a small area of it.

The computational equivalents are image resizing and image transformation.

The basic random forest algorithm doesn't really have an elegant way to handle missing data; put another way, it expects every example to arrive as a feature vector of the same fixed length, so it doesn't really have an elegant way to handle images of varying sizes. Traditional image techniques such as resizing and transformation will be used instead. Resizing images to a small, fixed size is also a technique to speed up learning. On the other hand, it also risks losing key information. In the supplied tutorial, images are rescaled to 25x25 squares regardless of their initial size or aspect ratio.
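
Concretely, the rescaling amounts to one call (a sketch, assuming image is a 2-D greyscale array as returned by imread): whatever the original dimensions, the result can be flattened into a fixed-length vector of 625 pixel intensities.

small = resize(image, (25, 25))
features = small.ravel()   # 25 * 25 = 625 values, one row's worth of pixel features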

Later on, we will revisit this method to include additional features beyond the aspect ratio.

In [6]:
# Rescale the images and create the combined metrics and training labels

#get the total training images
def countImages():
    numberofImages = 0
    for folder in directory_names:
        for fileNameDir in os.walk(folder):   
            for fileName in fileNameDir[2]:
                # Only read in the images
                if fileName[-4:] != ".jpg":
                    continue
                numberofImages += 1
    return numberofImages

%time numberofImages = countImages()
CPU times: user 234 ms, sys: 451 ms, total: 685 ms
Wall time: 1.18 s
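
Since the images sit directly inside each class folder for this dataset, the same count can also be obtained with glob alone (a sketch, not part of the original tutorial):

numberofImages = sum(len(glob.glob(os.path.join(folder, "*.jpg")))
                     for folder in directory_names)
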
In [7]:
def create_X_and_Y():
    '''
    This is basically a big block that only needs to be executed once (or so we might think at first).
    Putting it in a method still serves a couple of purposes.

      1) Makes it clear which variables have lasting value (they are returned)
         as opposed to those which are only used temporarily (no need to pay attention to those)
      2) Allows us to easily run timing functions on the execution of the code
      3) Lets us put the code into a library later if it turns out to be relevant
    '''
    
    # We'll rescale the images to be 25x25
    maxPixel = 25   # This is a good candidate for a function argument
    imageSize = maxPixel * maxPixel
    num_rows = numberofImages # one row for each image in the training dataset
    num_features = imageSize + 1 # for our ratio
    num_features += 1 # for the size

    # X is the feature vector with one row of features per image
    # consisting of the pixel values and our metric
    X = np.zeros((num_rows, num_features), dtype=float)
    # y is the numeric class label 
    y = np.zeros((num_rows))
    
    files = []
    # Generate training data
    i = 0    
    label = 0
    # List of string of class names
    namesClasses = list()
    
    print "Reading images"
    # Navigate through the list of directories
    for folder in directory_names:
        # Append the string class name for each class
        currentClass = folder.split(os.sep)[-1]  # note: os.sep (the path separator), not os.pathsep
        namesClasses.append(currentClass)
        for fileNameDir in os.walk(folder):   
            for fileName in fileNameDir[2]:
                # Only read in the images
                if fileName[-4:] != ".jpg":
                    continue
                
                # Read in the images and create the features
                nameFileImage = "{0}{1}{2}".format(fileNameDir[0], os.sep, fileName)            
                image = imread(nameFileImage, as_grey=True)
                size = image.size  # THIS IS ME GETTING THE IMAGE SIZE
                files.append(nameFileImage)
                axisratio = getMinorMajorRatio(image)
                image = resize(image, (maxPixel, maxPixel))
                
                # Store the rescaled image pixels and the axis ratio
                X[i, 0:imageSize] = np.reshape(image, (1, imageSize))
                X[i, imageSize] = axisratio
                X[i, imageSize + 1] = size  # THIS IS ME ADDING IT TO THE INPUT LAYER
                
                # Store the classlabel
                y[i] = label
                i += 1
                # report progress for each 5% done  
                report = [int((j+1)*num_rows/20.) for j in range(20)]
                if i in report: 
                    print np.ceil(i *100.0 / num_rows), "% done    ", 
        label += 1
        
    return (X, y)

%time X, y = create_X_and_Y()
Reading images
5.0 % done     10.0 % done     15.0 % done     20.0 % done     25.0 % done     30.0 % done     35.0 % done     40.0 % done     45.0 % done     50.0 % done     55.0 % done     60.0 % done     65.0 % done     70.0 % done     75.0 % done     80.0 % done     85.0 % done     90.0 % done     95.0 % done     100.0 % done    CPU times: user 1min 47s, sys: 5.18 s, total: 1min 52s
Wall time: 2min 3s

The original tutorial included a plot of the aspect ratio as it varied between classes. For the sake of brevity, that has been omitted here.

The next step is to proceed with training. This is not the simplest possible method of training: it actually trains five times, using a different subset of the input data each time, and scores each model only on the portion it never saw during training. The goal here is to get an honest estimate of how well the model generalises, and so guard against overfitting, which is where you home in too much on the details of the samples provided rather than the factors which apply more generally to examples in the wider population.
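
If you want to see what the folds look like before committing to a long run, here is a toy illustration (not part of the original tutorial) using the same StratifiedKFold class imported above as KFold: with six samples from two classes, every test fold receives a balanced slice, and every sample is tested exactly once.

toy_y = np.array([0, 0, 0, 1, 1, 1])
for train_idx, test_idx in KFold(toy_y, n_folds=3):
    print train_idx, test_idx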

In [8]:
def perform_k_fold():

    # Get the probability predictions for computing the log-loss function
    kf = KFold(y, n_folds=5)
    # prediction probabilities number of samples, by number of classes
    y_pred = np.zeros((len(y),len(set(y))))
    for train, test in kf:
        X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]
        clf = RF(n_estimators=100, n_jobs=3)
        clf.fit(X_train, y_train)
        y_pred[test] = clf.predict_proba(X_test)
        
    return y_pred
        
%time y_pred =  perform_k_fold()
CPU times: user 10min 4s, sys: 33.4 s, total: 10min 38s
Wall time: 4min 4s
In [9]:
multiclass_log_loss(y, y_pred)
Out[9]:
3.5233668828943854

Success! (Kinda)

The log loss from our previous attempt was approximately 3.71, so this is an improvement of about 0.19. The actual results will vary slightly between runs because random forests are non-deterministic: each tree is grown from a random bootstrap sample of the data and considers a random subset of the features at each split.

So, we have achieved an improvement of approximately 0.19 as a result of this change. However, I have to say that doesn't feel like very much to me. Fortunately (and this is the great advantage of open competitions) we have an immediate benchmark to compare against: the Kaggle leaderboard. For more discussion of the significance of our gains and what it might mean going forward, return to the blog post...

In [ ]: