In this article we present our Cats-Dogs-Classifier, which can tell whether a given image shows a dog or a cat with an accuracy of 80%. We achieve this by reproducing the results of the paper "Machine Learning Attacks Against the Asirra CAPTCHA" by Philippe Golle. The cats-vs-dogs classification problem attracted a lot of interest in the context of the Kaggle "Dogs vs. Cats" competition. Our classifier is built on top of the Python scientific ecosystem. We assume the reader is familiar with the paper mentioned above.
In the spirit of open science, we make our source code easily available so that you can play with it and reproduce the results. The whole source code lives in a GitHub repository. You can use this notebook to explore the data interactively, but we also provide it as a static HTML page for those who just want a quick look.
Alongside this documentation, you will find the following files in the repository:
We performed our calculations on a 64-core/512 GB compute server with four 16-core AMD "Abu Dhabi" 6376 CPUs (2.3 GHz base clock).
We used the following software versions:
import os
import multiprocessing
import re
import time
import random
from glob import glob
import itertools
import pickle
import numpy as np
import skimage
from skimage import io
from sklearn import cross_validation
from sklearn import svm
from sklearn import preprocessing
from sklearn.linear_model.logistic import LogisticRegression
We use the data from the Kaggle competition as input. It comes as images of varying sizes. Since the features we will use are based on fixed-size images and the Asirra challenge also presents images of a fixed size, we resize all images to 250×250 pixels. If a picture is not square, we pad it with a white background. This appears to be the same way the pictures are presented in the Asirra challenge, so our data should be very similar to the data used in the paper. The script "resize.sh" does this job for us. We shuffle the list of filenames randomly and take the first 10,000 files for all our calculations.
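Conceptually, the padding step amounts to centering the image on a white 250×250 canvas. Below is a minimal NumPy sketch of that idea (illustrative only; it assumes the image has already been scaled so that its longer side is at most 250 pixels, and the actual preprocessing is done by resize.sh):

```python
import numpy as np

def pad_to_square(img, size=250, fill=255):
    """Center a (grayscale or RGB) image on a white size x size canvas."""
    h, w = img.shape[:2]
    out = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    top = (size - h) // 2
    left = (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out

img = np.zeros((100, 200), dtype=np.uint8)  # a non-square dummy image
padded = pad_to_square(img)
print(padded.shape)  # (250, 250)
```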
def build_file_list(dir):
    """ Given a directory, it builds a shuffled list of the files """
    random.seed(42)
    image_filenames = glob('{}/*.jpg'.format(dir))
    image_filenames.sort() # make the result independent of the order the operating system returns the files
    random.shuffle(image_filenames)
    return image_filenames
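Sorting before the seeded shuffle is what makes the file order, and with it all downstream results, reproducible across machines. A small illustration with hypothetical filenames:

```python
import random

def shuffled(seq, seed=42):
    # sort first, then shuffle with a fixed seed: the resulting order is deterministic
    items = sorted(seq)
    random.seed(seed)
    random.shuffle(items)
    return items

a = shuffled(["cat.%d.jpg" % i for i in range(10)])  # hypothetical filenames
b = shuffled(["cat.%d.jpg" % i for i in range(10)])
print(a == b)  # True: the same seed always yields the same order
```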
def build_labels(file_list,n_samples=None):
    """ build the labels from the filenames: a cat corresponds to 1, a dog corresponds to -1 """
    if n_samples is None: n_samples=len(file_list)
    n_samples=min(n_samples,len(file_list)) # never take more files than we have
    file_list = file_list[:n_samples]
    y = np.zeros(n_samples,dtype=np.int32)
    for (i,f) in enumerate(file_list):
        if "dog" in str(f):
            y[i]=-1
        else:
            y[i]=1
            assert("cat" in str(f))
    return y
file_list = build_file_list("data/train_resized")
pickle.dump(file_list, open("file_list.pkl","wb"))
y=build_labels(file_list,n_samples=None)
np.save('y',y)
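As a quick sanity check, the labeling rule (cat maps to 1, dog to -1) can be exercised on a few hypothetical filenames following the Kaggle naming scheme:

```python
import numpy as np

# hypothetical filenames in the Kaggle naming scheme (cat.<n>.jpg / dog.<n>.jpg)
files = ["data/train_resized/cat.0.jpg", "data/train_resized/dog.0.jpg",
         "data/train_resized/dog.1.jpg", "data/train_resized/cat.1.jpg"]

# same rule as build_labels above: dog -> -1, cat -> 1
y_demo = np.array([-1 if "dog" in f else 1 for f in files], dtype=np.int32)
print(y_demo)  # [ 1 -1 -1  1]
```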
The following functions build the feature matrices for the color features exactly as described in the paper.
def file_to_rgb(filename):
    """ return an image in rgb format: a gray scale image will be converted, an rgb image will be left untouched """
    bild = io.imread(filename)
    if (bild.ndim==2):
        rgb_bild = skimage.color.gray2rgb(bild)
    else:
        rgb_bild = bild
    return rgb_bild
def hsv_to_feature(hsv,N,C_h,C_s,C_v):
    """ Takes an hsv picture and returns a feature vector for it.
        The vector is built as described in the paper 'Machine Learning Attacks Against the Asirra CAPTCHA' """
    res = np.zeros((N,N,C_h,C_s,C_v))
    cell_size = 250//N # integer division: the cells tile the image exactly only if N divides 250
    h_range = np.arange(0.0,1.0,1.0/C_h)
    h_range = np.append(h_range,1.0)
    s_range = np.arange(0.0,1.0,1.0/C_s)
    s_range = np.append(s_range,1.0)
    v_range = np.arange(0.0,1.0,1.0/C_v)
    v_range = np.append(v_range,1.0)
    for i in range(N):
        for j in range(N):
            cell = hsv[i*cell_size:(i+1)*cell_size,j*cell_size:(j+1)*cell_size,:]
            for h in range(C_h):
                h_cell = np.logical_and(cell[:,:,0]>=h_range[h],cell[:,:,0]<h_range[h+1])
                for s in range(C_s):
                    s_cell = np.logical_and(cell[:,:,1]>=s_range[s],cell[:,:,1]<s_range[s+1])
                    for v in range(C_v):
                        v_cell = np.logical_and(cell[:,:,2]>=v_range[v],cell[:,:,2]<v_range[v+1])
                        gesamt = np.logical_and(np.logical_and(h_cell,s_cell),v_cell) # all three bands match
                        res[i,j,h,s,v] = gesamt.any()
    return np.asarray(res).reshape(-1)
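The length of the resulting feature vector follows directly from the parameters: each of the N×N cells contributes one boolean per (h, s, v) band combination, i.e. N·N·C_h·C_s·C_v entries. For the three parameter sets used below this gives:

```python
def color_feature_dim(N, C_h, C_s, C_v):
    # one boolean per (cell, h-band, s-band, v-band) combination
    return N * N * C_h * C_s * C_v

dims = [color_feature_dim(1, 10, 10, 10),  # F1: 1000
        color_feature_dim(3, 10, 8, 8),    # F2: 5760
        color_feature_dim(5, 10, 6, 6)]    # F3: 9000
print(dims)        # [1000, 5760, 9000]
print(sum(dims))   # 15760 columns in the stacked "union" matrix
```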
def build_color_featurevector(pars):
    """ Takes a jpeg file and the parameters of the feature vector and builds such a vector """
    filename,N,C_h,C_s,C_v = pars
    rgb_bild = file_to_rgb(filename)
    assert (rgb_bild.shape[2]==3)
    return hsv_to_feature(skimage.color.rgb2hsv(rgb_bild),N,C_h,C_s,C_v)
def build_color_featurematrix(file_list,N,C_h,C_s,C_v):
    """ Builds the feature matrix of the jpegs in file_list.
        Returns a feature matrix where the i-th row corresponds to the feature vector of the i-th image in the file list
    """
    pool = multiprocessing.Pool()
    x = [(f,N,C_h,C_s,C_v) for f in file_list]
    res = pool.map(build_color_featurevector,x)
    return np.array(res)
def build_color_feature_matrices_or_load(file_list):
    try:
        F1 = np.load("F1.npy")
    except IOError:
        F1 = build_color_featurematrix(file_list,1,10,10,10)
    try:
        F2 = np.load("F2.npy")
    except IOError:
        F2 = build_color_featurematrix(file_list,3,10,8,8)
    try:
        F3 = np.load("F3.npy")
    except IOError:
        F3 = build_color_featurematrix(file_list,5,10,6,6)
    return F1,F2,F3
file_list = pickle.load(open("file_list.pkl","rb"))
%time F1,F2,F3 =build_color_feature_matrices_or_load(file_list[:10000])
np.save("F1",F1)
np.save("F2",F2)
np.save("F3",F3)
CPU times: user 14.6 s, sys: 9.18 s, total: 23.8 s Wall time: 3min 8s
Building the feature matrices from scratch took approximately 25 minutes on our compute server.
def classify_color_feature(F,y):
    start = time.time()
    clf = svm.SVC(kernel='rbf',gamma=0.001)
    scores = cross_validation.cross_val_score(clf, F, y, cv=5,n_jobs=-1)
    time_diff = time.time() - start
    print "Accuracy: %.1f +- %.1f (calculated in %.1f seconds)" % (np.mean(scores)*100,np.std(scores)*100,time_diff)
F1=np.load("F1.npy")
F2=np.load("F2.npy")
F3=np.load("F3.npy")
y=np.load("y.npy")
union = np.hstack((F1,F2,F3))
classify_color_feature(F1[:5000],y[:5000])
classify_color_feature(F2[:5000],y[:5000])
classify_color_feature(F3[:5000],y[:5000])
classify_color_feature(F3[:10000],y[:10000])
classify_color_feature(union[:5000],y[:5000])
classify_color_feature(union[:10000],y[:10000])
Accuracy: 66.6 +- 1.1 (calculated in 226.3 seconds)
Accuracy: 75.4 +- 0.6 (calculated in 2305.1 seconds)
Accuracy: 74.4 +- 0.9 (calculated in 1577.5 seconds)
Accuracy: 75.6 +- 1.0 (calculated in 7111.8 seconds)
Accuracy: 75.8 +- 0.9 (calculated in 1014.7 seconds)
Accuracy: 77.1 +- 0.8 (calculated in 5067.6 seconds)
These results are very close to those reported in the paper.
The following functions build the texture feature matrices as described in the paper. We found this process to be computationally very heavy. We therefore took a shortcut, so our texture features are not as fine-grained as those described in the paper.
def texture_texture_distance(T1,T2):
    """ Returns the distance between two 5x5 tiles. """
    y = np.linalg.norm(T1-T2,axis=2) # per-pixel euclidean distance in color space
    assert(y.shape==(5,5))
    return np.mean(y)
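As a quick check of this metric (restated below so that the snippet is self-contained): a tile has distance zero to itself, and two maximally different tiles, pure white and pure black in 8-bit RGB, are 255·√3 ≈ 441.7 apart:

```python
import numpy as np

def texture_texture_distance(T1, T2):
    # per-pixel euclidean distance in color space, averaged over the 5x5 tile
    return np.mean(np.linalg.norm(T1 - T2, axis=2))

T = np.random.rand(5, 5, 3)
print(texture_texture_distance(T, T))  # 0.0

white = np.full((5, 5, 3), 255.0)
black = np.zeros((5, 5, 3))
print(texture_texture_distance(white, black))  # 255*sqrt(3), about 441.7
```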
def build_tiles(number_of_tiles,files,threshold):
    """ Returns a number_of_tiles*5*5*3 matrix, where every 5x5 texture tile has at least distance threshold to all others """
    current = 0
    textures = np.zeros((number_of_tiles,5,5,3))
    while(current<number_of_tiles):
        file_index = random.randint(0,len(files)-1)
        i = random.randint(0,49)
        j = random.randint(0,49)
        bild = io.imread(files[file_index])
        if (bild.ndim==2):
            rgb_bild = skimage.color.gray2rgb(bild)
        else:
            rgb_bild = bild
        cell = rgb_bild[i*5:i*5+5,j*5:j*5+5,:]
        close = False
        for k in range(current): # do not reuse i here, it still indexes the candidate tile
            T = textures[k,:,:]
            if(texture_texture_distance(cell,T)<threshold):
                close = True
                break
        if(not close):
            textures[current,:,:] = cell
            current += 1
    return textures
def build_textures_or_load(number_of_tiles,files,threshold):
    try:
        textures = np.load("textures.npy")
    except IOError:
        textures = build_tiles(number_of_tiles,files,threshold)
    return textures
def texture_image_distance_simple(rgb,T):
    """ Returns the distance between an image and a tile.
        This is a simplified version of the distance described in the paper. Instead of using every possible
        upper-left corner, we only use those which are multiples of five """
    assert(rgb.shape==(250,250,3))
    assert(T.shape==(5,5,3))
    bigtile = np.tile(T,(50,50,1))
    distances = np.linalg.norm(rgb-bigtile,axis=2)
    assert(distances.shape==(250,250))
    splitted = [np.hsplit(x,50) for x in np.vsplit(distances,50)] # split into 50*50 sub-matrices of size 5x5
    merged = list(itertools.chain.from_iterable(splitted)) # flatten the list of lists
    assert(len(merged)==50*50)
    maxvalues = [np.max(x) for x in merged]
    return np.min(maxvalues)
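The hsplit/vsplit combination above can equivalently be written as a single reshape, with axes 1 and 3 indexing the positions inside each 5×5 block; a check on random data (not part of the original code) confirms that both variants agree:

```python
import itertools
import numpy as np

distances = np.random.rand(250, 250)

# original approach: split into 50x50 sub-blocks of 5x5, max per block, min over blocks
splitted = [np.hsplit(x, 50) for x in np.vsplit(distances, 50)]
merged = list(itertools.chain.from_iterable(splitted))
v1 = np.min([np.max(x) for x in merged])

# reshape-based equivalent: element [a,b,c,d] is distances[a*5+b, c*5+d],
# so axes 1 and 3 run over the pixels inside each 5x5 block
blocks = distances.reshape(50, 5, 50, 5)
v2 = blocks.max(axis=(1, 3)).min()

print(np.isclose(v1, v2))  # True
```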
def build_texture_feature_vector(pars):
    filename,textures = pars
    rgb = file_to_rgb(filename) # reuse the conversion helper defined above
    res = []
    for t in textures:
        res.append(texture_image_distance_simple(rgb,t))
    return res
def build_texture_feature_matrix(file_list,textures):
    """ Builds the texture feature matrix of the jpegs in file_list.
        Returns a matrix where the i-th row holds the tile distances of the i-th image """
    pool = multiprocessing.Pool()
    res = pool.map(build_texture_feature_vector,[(f,textures) for f in file_list])
    return np.array(res)
def build_texture_feature_matrix_or_load(file_list,textures):
    try:
        G = np.load("G.npy")
    except IOError:
        print "Building matrix"
        G = build_texture_feature_matrix(file_list,textures)
    return G
file_list = pickle.load(open("file_list.pkl","rb"))
%time textures = build_textures_or_load(5000,file_list,40)
np.save("textures",textures)
CPU times: user 20min 15s, sys: 42.6 s, total: 20min 57s Wall time: 20min 37s
It took approximately 20 minutes to build the textures.
textures=np.load("textures.npy")
file_list = pickle.load(open("file_list.pkl","rb"))
%time G=build_texture_feature_matrix_or_load(file_list[:10000],textures)
np.save("G",G)
CPU times: user 4 ms, sys: 556 ms, total: 560 ms Wall time: 574 ms
Computing the matrix G from scratch took very long: roughly a whole day. (The timing above is so short because it merely loads the cached matrix.)
We found that the classifier reported in the paper performed very poorly on our texture features. Therefore, we switched to logistic regression. Nevertheless, we do not reach the values reported in the paper.
def classify_texture_feature(G,y):
    start = time.time()
    scores = cross_validation.cross_val_score(LogisticRegression(), G, y, cv=5,n_jobs=-1)
    time_diff = time.time() - start
    print "Accuracy: %.1f +- %.1f (calculated in %.1f seconds)" % (np.mean(scores)*100,np.std(scores)*100,time_diff)
G=np.load("G.npy")
y=np.load("y.npy")
classify_texture_feature(G[:5000,:1000],y[:5000])
classify_texture_feature(G[:5000,:5000],y[:5000])
classify_texture_feature(G[:10000,:5000],y[:10000])
Accuracy: 66.8 +- 0.7 (calculated in 145.1 seconds)
Accuracy: 68.6 +- 1.4 (calculated in 134.9 seconds)
Accuracy: 68.0 +- 0.5 (calculated in 1279.2 seconds)
As we did not reach the paper's values for the texture features, we do not expect our combined classifier to reach the reported values either. Nevertheless, we implement the combined classifier as described in the paper and report its results.
class Combined:
    def __init__(self,clf1,clf2):
        self.clf1=clf1
        self.clf2=clf2
    def predict(self,F,G):
        y1=self.clf1.predict_proba(F)
        y2=self.clf2.predict_proba(G)
        y_out= 2*y1/3+y2/3 # the color features get twice the weight of the texture features
        m=np.argmax(y_out, axis=1) # column 0 is class -1 (dog), column 1 is class 1 (cat)
        m[m==0]=-1 # map class index 0 back to the label -1; index 1 already equals label 1
        return m
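The 2/3 vs. 1/3 weighting and the mapping from column indices back to the labels ±1 can be illustrated on hypothetical probability outputs (scikit-learn orders the columns by ascending class label, so column 0 is -1 and column 1 is 1):

```python
import numpy as np

# hypothetical predicted probabilities for three images, columns ordered [-1 (dog), 1 (cat)]
p_color   = np.array([[0.90, 0.10], [0.20, 0.80], [0.45, 0.55]])
p_texture = np.array([[0.60, 0.40], [0.70, 0.30], [0.40, 0.60]])

blend = 2 * p_color / 3 + p_texture / 3  # color counts twice as much as texture
labels = np.where(np.argmax(blend, axis=1) == 1, 1, -1)
print(labels)  # [-1  1  1]
```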
F1=np.load("F1.npy")
F2=np.load("F2.npy")
F3=np.load("F3.npy")
union = np.hstack((F1,F2,F3))
G=np.load("G.npy")
y=np.load("y.npy")
clf_color = svm.SVC(kernel='rbf',gamma=0.001,probability=True)
clf_color.fit(union[:1000],y[:1000])
clf_texture = LogisticRegression()
clf_texture.fit(G[:1000],y[:1000])
combined=Combined(clf_color,clf_texture)
print "Accuracy: ", np.mean(combined.predict(union[1000:2000],G[1000:2000])==y[1000:2000])
Accuracy: 0.708
Please take some time to help me improve my findings and help me learn. I am especially interested in the following questions:
Furthermore, I would be very happy to receive criticism on the organisation of the paper, on my Python style, and any other hints I can learn from. Thank you very much!