
Predicting Pet Adoption Time Using Decision Tree Ensembles


The goal of this project is to develop an algorithm that predicts the adoptability of pets: specifically, how quickly is a pet adopted? The dataset comes from the PetFinder.my Adoption Prediction competition.

The final ensemble model used four different classifier algorithms (Gradient Boosting, AdaBoost, Random Forest, and XGBoost). The final prediction classes were determined as the average of all four models weighted equally.
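
One way to read the equally weighted average is sketched below: each model predicts an integer class per pet, the predictions are averaged, and the mean is rounded to the nearest class. This is an assumption for illustration, since the averaging code is not shown in this part of the notebook; the function name and the sample predictions are hypothetical.

```python
import numpy as np

def average_class_predictions(predictions):
    """Equal-weight average of integer class predictions, rounded to the nearest class."""
    # Stack into shape (n_models, n_samples) and average across models
    stacked = np.vstack(predictions).astype(float)
    # np.rint rounds exact halves to the nearest even integer
    return np.rint(stacked.mean(axis=0)).astype(int)

# Hypothetical predictions from four models for five pets
preds = [
    np.array([4, 2, 1, 3, 0]),
    np.array([4, 2, 2, 3, 1]),
    np.array([3, 2, 1, 3, 1]),
    np.array([4, 3, 1, 4, 1]),
]
print(average_class_predictions(preds))
```

Averaging predicted class probabilities and taking the most likely class would be another reasonable reading of the same sentence.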


Table of Contents

  1. Data Columns
  2. Dataset Overview
  3. Missing Data
  4. Variable Visualizations
  5. Caution Variables
  6. Data Cleaning
  7. Tree Ensemble Modelling
  8. Predictions & Submission


Model Accuracy: Quadratic Weighted Kappa = 0.33809

Top submissions on the public leaderboard are around 0.45.


Competition Description

Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. If homes can be found for them, many precious lives can be saved — and more happy families created.

PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Animal adoption rates are strongly correlated to the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in their photos.

In this competition you will be developing algorithms to predict the adoptability of pets - specifically, how quickly is a pet adopted? If successful, they will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization.

Top participants may be invited to collaborate on implementing their solutions into AI tools for assessing and improving pet adoption performance, which will benefit global animal welfare.

Competition Evaluation

Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters).

In the event that there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the scores which are expected/known and the predicted scores.
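
To make the metric concrete, here is a minimal from-scratch sketch of the quadratic weighted kappa for integer labels. The function name and the toy inputs are hypothetical; the notebook itself relies on scikit-learn's cohen_kappa_score further down, and this sketch does not guard against the degenerate case where every label is identical.

```python
def quadratic_weighted_kappa(actual, predicted, n_classes=5):
    """Quadratic weighted kappa for integer labels in 0..n_classes-1."""
    n = len(actual)
    # Observed confusion matrix: rows are actual classes, columns are predicted classes
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        observed[a][p] += 1
    # Marginal counts of each class among the actual and predicted labels
    hist_actual = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement penalty grows with the squared distance between classes
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_actual[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected

print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # perfect agreement
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0]))  # reversed labels
```

The reversed example shows how the score drops below 0 when predictions disagree more than chance would.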

Data Columns

Source

  • PetID - Unique hash ID of pet profile
  • AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
  • Type - Type of animal (1 = Dog, 2 = Cat)
  • Name - Name of pet (Empty if not named)
  • Age - Age of pet when listed, in months
  • Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
  • Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
  • Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
  • Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
  • Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
  • Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
  • MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
  • FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
  • Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
  • Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
  • Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
  • Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
  • Quantity - Number of pets represented in profile
  • Fee - Adoption fee (0 = Free)
  • State - State location in Malaysia (Refer to StateLabels dictionary)
  • RescuerID - Unique hash ID of rescuer
  • VideoAmt - Total uploaded videos for this pet
  • PhotoAmt - Total uploaded photos for this pet
  • Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

AdoptionSpeed

  • Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:
  • 0 - Pet was adopted on the same day as it was listed.
  • 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
  • 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
  • 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
  • 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

Data Overview

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import missingno as msno
import statsmodels.api as sm
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats is available

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier, RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV

%matplotlib inline
sns.set_style("whitegrid") 
plt.rcParams["figure.figsize"] = [10,5]



train = pd.read_csv('../input/train/train.csv')
test = pd.read_csv('../input/test/test.csv')
states = pd.read_csv('../input/state_labels.csv')
In [2]:
def count_overview(variable, title):
    """Horizontal bar chart of the counts held in a pandas Series."""
    plt.figure(figsize=(10,4))
    ax = sns.barplot(x=variable, y=variable.index, palette="Spectral")
    ax.set_title(title)
    plt.xlabel('Count')
    plt.tight_layout()
    plt.show()

def stacked_barplot(tab, title):
    """Stacked horizontal bars showing each row of a crosstab as percentages."""
    # Normalize every row so that its classes sum to 1
    tab_percent = tab.apply(lambda r: r/r.sum(), axis=1)
    ax = tab_percent.plot.barh(stacked=True, cmap='Spectral', alpha=0.8)
    ax.set_title(title)
    ax.xaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), shadow=True, ncol=5)
    plt.tight_layout()
    plt.show()

Training Dataset Overview

In [3]:
print("Training Dataset Shape: ", train.shape)
display(train.head(2))
display(train.info())
display(train.describe())
Training Dataset Shape:  (14993, 24)
Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State RescuerID VideoAmt Description PetID PhotoAmt AdoptionSpeed
0 2 Nibble 3 299 0 1 1 7 0 1 1 2 2 2 1 1 100 41326 8480853f516546f6cf33aa88cd76c379 0 Nibble is a 3+ month old ball of cuteness. He ... 86e1089a3 1.0 2
1 2 No Name Yet 1 265 0 1 1 2 0 2 2 3 3 3 1 1 0 41401 3082c7125d8fb66f7dd4bff4192c8b14 0 I just found it alone yesterday near my apartm... 6296e909a 2.0 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
Type             14993 non-null int64
Name             13736 non-null object
Age              14993 non-null int64
Breed1           14993 non-null int64
Breed2           14993 non-null int64
Gender           14993 non-null int64
Color1           14993 non-null int64
Color2           14993 non-null int64
Color3           14993 non-null int64
MaturitySize     14993 non-null int64
FurLength        14993 non-null int64
Vaccinated       14993 non-null int64
Dewormed         14993 non-null int64
Sterilized       14993 non-null int64
Health           14993 non-null int64
Quantity         14993 non-null int64
Fee              14993 non-null int64
State            14993 non-null int64
RescuerID        14993 non-null object
VideoAmt         14993 non-null int64
Description      14981 non-null object
PetID            14993 non-null object
PhotoAmt         14993 non-null float64
AdoptionSpeed    14993 non-null int64
dtypes: float64(1), int64(19), object(4)
memory usage: 2.7+ MB
None
Type Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State VideoAmt PhotoAmt AdoptionSpeed
count 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000 14993.000000
mean 1.457614 10.452078 265.272594 74.009738 1.776162 2.234176 3.222837 1.882012 1.862002 1.467485 1.731208 1.558727 1.914227 1.036617 1.576069 21.259988 41346.028347 0.056760 3.889215 2.516441
std 0.498217 18.155790 60.056818 123.011575 0.681592 1.745225 2.742562 2.984086 0.547959 0.599070 0.667649 0.695817 0.566172 0.199535 1.472477 78.414548 32.444153 0.346185 3.487810 1.177265
min 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 41324.000000 0.000000 0.000000 0.000000
25% 1.000000 2.000000 265.000000 0.000000 1.000000 1.000000 0.000000 0.000000 2.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 0.000000 41326.000000 0.000000 2.000000 2.000000
50% 1.000000 3.000000 266.000000 0.000000 2.000000 2.000000 2.000000 0.000000 2.000000 1.000000 2.000000 1.000000 2.000000 1.000000 1.000000 0.000000 41326.000000 0.000000 3.000000 2.000000
75% 2.000000 12.000000 307.000000 179.000000 2.000000 3.000000 6.000000 5.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 0.000000 41401.000000 0.000000 5.000000 4.000000
max 2.000000 255.000000 307.000000 307.000000 3.000000 7.000000 7.000000 7.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 20.000000 3000.000000 41415.000000 8.000000 30.000000 4.000000

Test Dataset

In [4]:
print("Test Dataset Shape: ", test.shape)
display(test.head(2))
display(test.info())
display(test.describe())
Test Dataset Shape:  (3948, 23)
Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State RescuerID VideoAmt Description PetID PhotoAmt
0 1 Puppy 2 307 0 1 1 0 0 2 2 2 2 2 1 1 150 41326 4475f31553f0170229455e3c5645644f 0 Puppy is calm for a young dog, but he becomes ... 378fcc4fc 3.0
1 2 London 24 266 0 1 2 7 0 2 1 1 1 1 1 1 0 41326 4475f31553f0170229455e3c5645644f 0 Urgently seeking adoption. Please contact for ... 73c10e136 1.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3948 entries, 0 to 3947
Data columns (total 23 columns):
Type            3948 non-null int64
Name            3645 non-null object
Age             3948 non-null int64
Breed1          3948 non-null int64
Breed2          3948 non-null int64
Gender          3948 non-null int64
Color1          3948 non-null int64
Color2          3948 non-null int64
Color3          3948 non-null int64
MaturitySize    3948 non-null int64
FurLength       3948 non-null int64
Vaccinated      3948 non-null int64
Dewormed        3948 non-null int64
Sterilized      3948 non-null int64
Health          3948 non-null int64
Quantity        3948 non-null int64
Fee             3948 non-null int64
State           3948 non-null int64
RescuerID       3948 non-null object
VideoAmt        3948 non-null int64
Description     3946 non-null object
PetID           3948 non-null object
PhotoAmt        3948 non-null float64
dtypes: float64(1), int64(18), object(4)
memory usage: 709.5+ KB
None
Type Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State VideoAmt PhotoAmt
count 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000 3948.000000
mean 1.526089 11.564590 263.033435 57.359422 1.782675 2.232016 3.355623 2.061297 1.824468 1.466819 1.703647 1.506079 1.889311 1.043566 1.626393 27.346251 41351.019250 0.062817 3.809524
std 0.499382 18.568429 59.178121 112.086810 0.692633 1.736614 2.700144 3.041357 0.569772 0.613308 0.664200 0.682930 0.587995 0.218539 1.609914 88.416045 34.708648 0.391324 3.627959
min 1.000000 0.000000 2.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 41324.000000 0.000000 0.000000
25% 1.000000 2.000000 265.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 0.000000 41326.000000 0.000000 2.000000
50% 2.000000 4.000000 266.000000 0.000000 2.000000 2.000000 3.000000 0.000000 2.000000 1.000000 2.000000 1.000000 2.000000 1.000000 1.000000 0.000000 41326.000000 0.000000 3.000000
75% 2.000000 12.000000 307.000000 0.000000 2.000000 3.000000 6.000000 6.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 0.000000 41401.000000 0.000000 5.000000
max 2.000000 180.000000 307.000000 307.000000 3.000000 7.000000 7.000000 7.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 20.000000 2400.000000 41401.000000 9.000000 30.000000

Let's concatenate the datasets to make the EDA easier

In [5]:
train_target = train['AdoptionSpeed']
test_id = test['PetID']
all_data = pd.concat((train, test), sort=True).reset_index(drop=True)

Missing Data

In [6]:
msno.bar(all_data.drop('AdoptionSpeed', axis=1), color=sns.color_palette("Spectral", 5)[4])
plt.show()
  • The only variables with missing values are "Name" and, for a handful of rows, "Description".
  • My intuition is that people are more likely to respond positively to listed pets with existing names. Imagine a little girl saying:
    • "Aww, look how cute 'Peanut' is!"

Let's test the hypothesis that having a name plays a role in the Adoption Speed. First, we define a new variable ("NameorNot"), then conduct a chi-squared test of independence. Lastly, we examine the standardized residuals to see which classes drive any effect.

In [7]:
train['NameorNot'] = np.where(train['Name'].isnull(), 'No Name', 'Has a Name')
test['NameorNot'] = np.where(test['Name'].isnull(), 'No Name', 'Has a Name')


df_plot = train.groupby(['NameorNot', 'AdoptionSpeed']).size().reset_index().pivot(
    columns='AdoptionSpeed', index='NameorNot', values=0)
print("Number of Animals by Name Status")
display(df_plot)

ggDF = train['NameorNot'].value_counts()
count_overview(ggDF, 'NameorNot Count')

tab = pd.crosstab(train['NameorNot'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)

print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Effects of a Name on Adoption Speed")
plt.show()
Number of Animals by Name Status
AdoptionSpeed    0     1     2     3     4
NameorNot
Has a Name     384  2819  3725  3043  3765
No Name         26   271   312   216   432
Chi-Square Test of Independence: p-value = 1.076E-07
AdoptionSpeed          0          1          2        3          4
NameorNot
Has a Name      1.513055  -0.869614   1.757704   4.0889  -5.259132
No Name        -1.513055   0.869614  -1.757704  -4.0889   5.259132

Looking at the standardized residuals table and the bottom plot, we can see that pets that are listed without a name are more likely to NOT be adopted.

  • As a rule of thumb, standardized residuals below -2 or above 2 indicate a lower or higher than expected frequency, respectively.
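
As a cross-check on how these residuals arise, the sketch below recomputes them by hand from the NameorNot counts shown above. It assumes the usual adjusted-residual formula, (observed - expected) / sqrt(expected * (1 - row share) * (1 - column share)), which reproduces the statsmodels output here; the function name is hypothetical.

```python
import numpy as np

def adjusted_residuals(table):
    """Adjusted (standardized) Pearson residuals of a contingency table."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    row_share = observed.sum(axis=1, keepdims=True) / n
    col_share = observed.sum(axis=0, keepdims=True) / n
    # Expected counts under independence of rows and columns
    expected = n * row_share * col_share
    return (observed - expected) / np.sqrt(expected * (1 - row_share) * (1 - col_share))

# Counts copied from the NameorNot table above
name_counts = [
    [384, 2819, 3725, 3043, 3765],  # Has a Name
    [26, 271, 312, 216, 432],       # No Name
]
resid = adjusted_residuals(name_counts)
print(np.round(resid, 3))
```

In a two-row table the residuals in each column are mirror images of each other, which is why the two rows above differ only in sign.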

Variable Visualizations

Let's split the training dataset into categorical and numeric subsets to make it easier to visualize

Target Variable (Adoption Speed)

In [8]:
speed_labels = {0: '0 (Same Day)', 1: '1 (1-7 Days)', 2: '2 (8-30 Days)', 3: '3 (31-90 Days)', 4: '4 (No Adoption After 100 Days)'}
# Order the bars by class label instead of hard-coding the frequency ranking
ggDF = train['AdoptionSpeed'].value_counts().sort_index().rename(speed_labels)
count_overview(ggDF, 'AdoptionSpeed (Target Variable)')

Cats vs Dogs

In [9]:
df_plot = train.groupby(['Type', 'AdoptionSpeed']).size().reset_index().pivot(columns='AdoptionSpeed', index='Type', values=0).rename({1: 'Dog', 2: 'Cat'})
print("Number of Animals by Type")
display(df_plot)

ggDF = train['Type'].value_counts().rename({1: 'Dog', 2: 'Cat'})
count_overview(ggDF, 'Type Count')

tab = pd.crosstab(train['Type'], train['AdoptionSpeed']).rename({1: 'Dog', 2: 'Cat'})
table = sm.stats.Table(tab)

print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Adoption Speed by Type")
plt.show()
Number of Animals by Type
AdoptionSpeed    0     1     2     3     4
Type
Dog            170  1435  2164  1949  2414
Cat            240  1655  1873  1310  1783
Chi-Square Test of Independence: p-value = 5.010E-34
AdoptionSpeed          0        1          2          3          4
Type
Dog            -5.264749  -9.7657  -0.946594   7.208134   5.024245
Cat             5.264749   9.7657   0.946594  -7.208134  -5.024245

It seems cats are more likely to be adopted sooner, compared to dogs. This may be due to the larger responsibility & commitment associated with owning a dog.

Gender

In [10]:
df_plot = train.groupby(['Gender', 'AdoptionSpeed']).size().reset_index().pivot(
    columns='AdoptionSpeed', index='Gender', values=0).rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
print("Number of Animals by Gender")
display(df_plot)

ggDF = train['Gender'].value_counts().rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
count_overview(ggDF, 'Gender Count')

tab = pd.crosstab(train['Gender'], train['AdoptionSpeed']).rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Adoption Speed by Gender")
plt.show()
Number of Animals by Gender
AdoptionSpeed             0     1     2     3     4
Gender
Male                    160  1283  1578  1109  1406
Female                  204  1366  1911  1671  2125
Mixed (Groups of Pets)   46   441   548   479   666
Chi-Square Test of Independence: p-value = 1.871E-13
AdoptionSpeed                   0          1          2          3          4
Gender
Male                     0.893609   5.942875   3.333695  -3.871096  -5.416289
Female                   0.501221  -5.403628  -1.782910   3.534435   3.200882
Mixed (Groups of Pets)  -1.934038  -0.474798  -2.036183   0.288581   2.876952
In [11]:
plt.figure(figsize=(12,5))
sns.distplot(train['Age'], kde=False, bins=100, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('Age Distribution')
plt.show()

There seems to be a large number of outliers (very old pets), but the median value (50th percentile) is the same for both cats and dogs. The median age of pets up for adoption is 3 months, while the mean is about 10.5 months because the outliers pull it upward.

  • Fun fact: This coincides with the end of the socialization period and the start of the juvenile period in dogs. Cats, on the other hand, are still considered kittens for another 3 months.
In [12]:
display(train[['AdoptionSpeed', 'Age']].groupby('AdoptionSpeed').describe())
sns.barplot(x=train.AdoptionSpeed, y=train.Age, palette = 'Spectral')
plt.show()
Age
AdoptionSpeed   count       mean        std  min  25%  50%   75%    max
0               410.0  10.451220  17.775118  0.0  2.0  3.0  12.0  120.0
1              3090.0   8.488350  15.746187  0.0  2.0  2.0   6.0  147.0
2              4037.0   8.823631  16.779013  0.0  2.0  3.0   6.0  156.0
3              3259.0  10.189936  18.672104  0.0  2.0  3.0   9.0  212.0
4              4197.0  13.667858  20.177460  0.0  3.0  6.0  15.0  255.0

As expected, pets that are not adopted after 100 days are, on average, much older. Interestingly, pets adopted on the same day span a wide range of ages, so another factor must be driving those immediate adoptions.

PhotoAmt - Can you have too many cute pictures?

Spoiler Alert: You can't!
In [13]:
plt.figure(figsize=(10,5))
sns.distplot(train['PhotoAmt'], kde=False, bins=30, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('PhotoAmt Distribution')
plt.show()

It's highly unlikely that the difference between 7 and 8 pictures will be the deciding factor for adoption, but we can group the number of photos into a new variable (PhotosType).

The thresholds for each class were chosen somewhat arbitrarily, but the histogram above shows a steep drop in PhotoAmt after 5. The PhotosType classes correspond to the following number of pictures:

  • No Photos: 0 photos available (the PhotoAmt > 1 comparison used below also assigns profiles with exactly 1 photo to this class)
  • Few Photos: Between 1-5 photos (2-5 as coded below)
  • Many Photos: More than 5 photos
In [14]:
train['PhotosType'] = "No Photos"
train['PhotosType'] = np.where(train['PhotoAmt']>1, 'Few Photos', train['PhotosType'])
train['PhotosType'] = np.where(train['PhotoAmt']>5, 'Many Photos', train['PhotosType'])

test['PhotosType'] = "No Photos"
test['PhotosType'] = np.where(test['PhotoAmt']>1, 'Few Photos', test['PhotosType'])
test['PhotosType'] = np.where(test['PhotoAmt']>5, 'Many Photos', test['PhotosType'])
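
The three classes described above can also be written as a single binning step. The sketch below uses pandas.cut and assumes that a profile with exactly one photo should count as "Few Photos", which differs from the PhotoAmt > 1 comparison in the cell above; the helper name and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def photo_class(photo_amt):
    """Bin photo counts into No Photos / Few Photos / Many Photos."""
    return pd.cut(
        photo_amt,
        bins=[-np.inf, 0, 5, np.inf],  # (-inf, 0], (0, 5], (5, inf)
        labels=['No Photos', 'Few Photos', 'Many Photos'],
    ).astype(str)

# Hypothetical PhotoAmt values covering each boundary
demo = pd.Series([0.0, 1.0, 5.0, 6.0, 30.0])
print(photo_class(demo).tolist())
```

Applying the same helper to both train and test keeps the thresholds in one place. Switching to it would change the class counts reported below, so the tables would need to be regenerated.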
In [15]:
df_plot = train.groupby(['PhotosType', 'AdoptionSpeed']).size().reset_index().pivot(
    columns='AdoptionSpeed', index='PhotosType', values=0)
print("Number of Pets Adopted by Number of Photos Available")
display(df_plot)

ggDF = train['PhotosType'].value_counts()
count_overview(ggDF, 'Photos Type Count')

tab = pd.crosstab(train['PhotosType'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Adoption Speed by Number of Photos")
plt.show()
Number of Pets Adopted by Number of Photos Available
AdoptionSpeed    0     1     2     3     4
PhotosType
Few Photos     268  2003  2460  1818  2508
Many Photos     43   438   758   815   466
No Photos       99   649   819   626  1223
Chi-Square Test of Independence: p-value = 3.302E-76
AdoptionSpeed          0          1          2          3           4
PhotosType
Few Photos      2.081322   5.630655   0.802700  -6.101690   -1.016624
Many Photos    -3.470047  -4.393020   3.912735  14.150472  -11.647052
No Photos       0.666870  -2.648652  -4.424175  -5.501059   11.568863

Having no photos seems to be a big factor associated with pets not being adopted after 100 days.

Location (Location, Location)

In [16]:
states = pd.read_csv('../input/state_labels.csv')
states.index = states['StateID']
states_dict = states.to_dict()

state_counts = pd.crosstab(train['State'], train['AdoptionSpeed'])
state_counts.index = state_counts.index.map(states_dict['StateName'])

state_totals = state_counts.copy(deep=True)
state_totals['Total'] = state_counts.sum(axis=1)
display(state_totals)


plt.figure(figsize=(10,6))
ax = sns.barplot(x=state_totals.Total, y = state_totals.index, palette="Spectral")
ax.set_title("Adoptions by State")
plt.xlabel('Count')
plt.tight_layout()
plt.show()


stacked_barplot(state_counts, 'Adoption Speed by State')
AdoptionSpeed      0     1     2     3     4  Total
State
Melaka             4    18    23    12    80    137
Kedah              3    14    34    23    36    110
Selangor         246  1877  2435  2004  2152   8714
Pulau Pinang       8   122   216   197   300    843
Perak              3    48   111   117   141    420
Negeri Sembilan    4    36    63    42   108    253
Pahang             3    29    14    16    23     85
Johor             23   113   136   103   132    507
Sarawak            1     1     0     2     9     13
Sabah              1     6     3     4     8     22
Terengganu         0     9     2     6     9     26
Kelantan           2     3     3     1     6     15
Kuala Lumpur     112   814   996   731  1192   3845
Labuan             0     0     1     1     1      3

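There is a clear difference in adoption speeds between the states. Sarawak and Melaka have the highest proportions of pets left unadopted after 100 days (about 69% and 58%), while Selangor and Johor have the lowest (about 25% and 26%). Several states, including Sarawak, have very few listings, so their proportions should be read with caution.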

Purebreed vs Mudblood Mutts?

In [17]:
train['Purebreed'] = np.where(train['Breed2'] == 0, 'Pure Breed', 'Mixed Breed')
test['Purebreed'] = np.where(test['Breed2'] == 0, 'Pure Breed', 'Mixed Breed')

df_plot = train.groupby(['Purebreed', 'AdoptionSpeed']).size().reset_index().pivot(
    columns='AdoptionSpeed', index='Purebreed', values=0)
print("Number of Pets by Breed Purity")
display(df_plot)

ggDF = train['Purebreed'].value_counts()
count_overview(ggDF, 'Purebreed Count')

tab = pd.crosstab(train['Purebreed'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Adoption Speed by Breed Purity")
plt.show()
Number of Pets by Breed Purity
AdoptionSpeed    0     1     2     3     4
Purebreed
Mixed Breed    157   876  1133   992  1073
Pure Breed     253  2214  2904  2267  3124
Chi-Square Test of Independence: p-value = 9.746E-09
AdoptionSpeed       0          1          2          3          4
Purebreed
Mixed Breed     4.595   0.179756  -0.255053   3.181495  -4.501907
Pure Breed     -4.595  -0.179756   0.255053  -3.181495   4.501907

Mixed breeds seem to be adopted faster than pure breeds.

What's better than free?

First, let's see the distribution of fees

In [18]:
ax = sns.distplot(train['Fee'], kde=False, bins=50, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4])
ax.set_title('Fee Distribution (Log Scale)')
ax.set_yscale('log')
  • Note: The count on the y-axis is in log scale, so there are several thousand more pets up for adoption with no cost than at any given price.

Let's convert the fee variable into a new categorical variable (PayorNot). This group is composed of:

  • Free - No fee associated with adoption
  • Paid Adoption - Fee associated with adoption
In [19]:
train['PayorNot'] = np.where(train['Fee'] == 0, 'Free', 'Paid Adoption')
test['PayorNot'] = np.where(test['Fee'] == 0, 'Free', 'Paid Adoption')


df_plot = train.groupby(['PayorNot', 'AdoptionSpeed']).size().reset_index().pivot(
    columns='AdoptionSpeed', index='PayorNot', values=0)
print("Number of Pets by Fee Status")
display(df_plot)

ggDF = train['PayorNot'].value_counts()
count_overview(ggDF, 'PayorNot Count')

tab = pd.crosstab(train['PayorNot'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)

stacked_barplot(tab, "Adoption Speed by Fee Status")
plt.show()
Number of Pets by Fee Status
AdoptionSpeed    0     1     2     3     4
PayorNot
Free           357  2611  3430  2789  3476
Paid Adoption   53   479   607   470   721
Chi-Square Test of Independence: p-value = 5.575E-03
AdoptionSpeed          0          1         2          3          4
PayorNot
Free            1.481222   0.067103   1.03537   1.993102  -3.452485
Paid Adoption  -1.481222  -0.067103  -1.03537  -1.993102   3.452485
  • The effect of a fee seems small, but there is some evidence that pets not adopted after 100 days are more likely to have a fee associated with them. This might be due to the adoption shelters attempting to recoup their costs.
In [20]:
plt.figure(figsize=(12,5))
sns.distplot(train[train['PayorNot'] != 'Free']['Fee'], kde=False, bins=30, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('Fee Distribution (Paid Adoptions Only)')
plt.show()

Length of Description

In [21]:
train['DescriptionLength'] = train['Description'].apply(lambda x: len(str(x)))
sns.distplot(train['DescriptionLength'], kde=False, bins=50, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4])
plt.show()

test['DescriptionLength'] = test['Description'].apply(lambda x: len(str(x)))
In [22]:
sns.boxplot(x = train['AdoptionSpeed'], y = train['DescriptionLength'])
plt.show()
In [23]:
gr = train.groupby('AdoptionSpeed').DescriptionLength
for label, arr in gr:
    sns.kdeplot(arr, label=label, shade=True)

Caution Variables

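Some of the variables, such as MaturitySize or FurLength, include labels that are either "Not Specified" (coded 0) or "Not Sure" (coded 3). We will impute the "Not Specified" values with the mode (the most common value). Note that the code below only replaces 0, so "Not Sure" remains its own category.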

In [24]:
caution_variables = ['MaturitySize', 'FurLength', 'Health', 'Vaccinated', 'Dewormed', 'Sterilized']

for variable in caution_variables:
    # Replace the "Not Specified" code (0) with each dataset's most common value
    train[variable] = train[variable].replace(0, train[variable].mode()[0])
    test[variable] = test[variable].replace(0, test[variable].mode()[0])

Data Cleaning

In [25]:
clean_train = train.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
clean_test = test.drop(columns=['Name', 'RescuerID', 'Description', 'PetID'])

Label Encoding

In [26]:
cat_cols = ['PayorNot', 'NameorNot','PhotosType','Purebreed']

for col in cat_cols:
    label = LabelEncoder()
    label.fit(list(clean_train[col].values))
    clean_train[col] = label.transform(list(clean_train[col].values))
    clean_test[col] = label.transform(list(clean_test[col].values))
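
For intuition, LabelEncoder assigns integer codes to the distinct labels in sorted order, and the mapping learned from the training column is then reused for the test column. The sketch below mimics that behaviour in plain Python; the helper name and the sample labels are hypothetical.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Hypothetical sample of the NameorNot column
train_labels = ['Has a Name', 'No Name', 'Has a Name']
mapping = fit_label_mapping(train_labels)

# Reuse the training mapping on new data; an unseen label would raise KeyError here
encoded = [mapping[v] for v in train_labels]
print(mapping, encoded)
```

Because the mapping comes from the training data only, a category that appears only in the test set would fail to encode, which is worth checking before calling transform.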

Tree Ensemble Modelling

Let's start by defining the evaluation criterion (quadratic weighted kappa). We will use scikit-learn's cohen_kappa_score function for cross-validation.

In [27]:
def metric(y1,y2):
    return cohen_kappa_score(y1, y2, weights = 'quadratic')

# Make scorer for scikit-learn
scorer = make_scorer(metric)

Models:

  • XGBoost (xgboost.sklearn.XGBClassifier)
  • AdaBoost (sklearn.ensemble.AdaBoostClassifier)
  • Random Forest (sklearn.ensemble.RandomForestClassifier)
  • Gradient Boosting (sklearn.ensemble.GradientBoostingClassifier)
  • Extra Trees (sklearn.ensemble.ExtraTreesClassifier)
In [28]:
gbm_model = GradientBoostingClassifier()
gbm_grid = {
    'loss' : ['deviance'],
    'learning_rate' : [0.1],
    'n_estimators' : [100],
    'subsample' : [0.8, 0.9],
    'min_samples_split' : [5,10],
    'random_state' : [1234],
    'max_depth' : [5, 7],
    'max_features' : ['auto'],
    'min_samples_leaf': [15,20]
}

gbm_gridsearch = GridSearchCV(estimator = gbm_model,
                              param_grid = gbm_grid, 
                              cv = 3, 
                              n_jobs = -1, 
                              verbose = 1, 
                              scoring = scorer)

gbm_gridsearch.fit(clean_train, train_target)
gbm_gridsearch.best_params_
Fitting 3 folds for each of 16 candidates, totalling 48 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  4.5min finished
Out[28]:
{'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 5,
 'max_features': 'auto',
 'min_samples_leaf': 15,
 'min_samples_split': 5,
 'n_estimators': 100,
 'random_state': 1234,
 'subsample': 0.9}
In [29]:
ada_model = AdaBoostClassifier()
ada_grid = {
    'random_state' : [1234],
    'n_estimators' : [135, 150],
    'learning_rate' : [0.35, 0.5],
    'algorithm' : ['SAMME.R']
}

ada_gridsearch = GridSearchCV(estimator = ada_model,
                              param_grid = ada_grid, 
                              cv = 3, 
                              n_jobs = -1, 
                              verbose = 1, 
                              scoring = scorer)

ada_gridsearch.fit(clean_train, train_target)
ada_gridsearch.best_params_
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    7.8s finished
Out[29]:
{'algorithm': 'SAMME.R',
 'learning_rate': 0.35,
 'n_estimators': 150,
 'random_state': 1234}
In [30]:
ets_model = ExtraTreesClassifier()
ets_grid = {
    'random_state' : [1234],
    'n_estimators' : [50, 150, 300],
    'criterion' : ['gini'],
    'max_depth' : [10, 25, 50],
    'max_features' : ['auto'],
    'min_samples_split' : [5, 10, 15, 30],
    'min_samples_leaf' : [5, 10, 15, 30],
    'bootstrap' : [True]
}

ets_gridsearch = GridSearchCV(estimator = ets_model,
                              param_grid = ets_grid,
                              cv = 3,
                              n_jobs = -1,
                              verbose = 1,
                              scoring = scorer)

ets_gridsearch.fit(clean_train, train_target)
print(ets_gridsearch.best_score_)
ets_gridsearch.best_params_
Fitting 3 folds for each of 144 candidates, totalling 432 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   16.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed:  3.0min finished
0.2928920722473285
Out[30]:
{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 50,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 5,
 'n_estimators': 300,
 'random_state': 1234}
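The "144 candidates, totalling 432 fits" line in the log above follows directly from the grid: the search tries every combination of the listed values, and each combination is fitted once per fold. Below is a minimal sketch of that arithmetic, using `itertools.product` as a stand-in for the enumeration. The helper name `count_candidates` is hypothetical and not part of scikit-learn.

```python
from itertools import product


def count_candidates(param_grid):
    """Count the parameter combinations in an exhaustive grid."""
    return sum(1 for _ in product(*param_grid.values()))


# Mirrors ets_grid above; only the length of each list matters here.
ets_grid = {
    'random_state': [1234],
    'n_estimators': [50, 150, 300],
    'criterion': ['gini'],
    'max_depth': [10, 25, 50],
    'max_features': ['auto'],
    'min_samples_split': [5, 10, 15, 30],
    'min_samples_leaf': [5, 10, 15, 30],
    'bootstrap': [True],
}

n_candidates = count_candidates(ets_grid)  # 1 * 3 * 1 * 3 * 1 * 4 * 4 * 1
n_fits = n_candidates * 3                  # one fit per fold, cv = 3
print(n_candidates, n_fits)
```

Running the same count before launching a search gives a quick estimate of how long it will take, since the fit count grows multiplicatively with every value added to the grid.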
In [31]:
rfc_model = RandomForestClassifier()
rfc_grid = {
    'random_state' : [1234],
    'n_estimators' : [250, 300, 350],
    'max_depth' : [35, 50, 75],
    'max_features' : ['auto'],
    'min_samples_leaf': [5, 10],
    'min_samples_split': [10, 15]
}

rfc_gridsearch = GridSearchCV(estimator = rfc_model,
                              param_grid = rfc_grid, 
                              cv = 3, 
                              n_jobs = -1, 
                              verbose = 1, 
                              scoring = scorer)

rfc_gridsearch.fit(clean_train, train_target)
rfc_gridsearch.best_params_
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   54.8s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  2.3min finished
Out[31]:
{'max_depth': 35,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 15,
 'n_estimators': 300,
 'random_state': 1234}
In [32]:
xgb_model = XGBClassifier()
xgb_grid = {
    'n_estimators' : [150, 200],
    'random_state' : [1234],
    'max_depth': [10,12,15],
    'min_child_weight': [2,3],
    'learning_rate': [0.1],
    'gamma': [0.6, 0.7, 0.8]
}

xgb_gridsearch = GridSearchCV(estimator = xgb_model,
                              param_grid = xgb_grid, 
                              cv = 3, 
                              n_jobs = -1, 
                              verbose = 1, 
                              scoring = scorer)

xgb_gridsearch.fit(clean_train, train_target)
xgb_gridsearch.best_params_
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 21.3min finished
Out[32]:
{'gamma': 0.7,
 'learning_rate': 0.1,
 'max_depth': 10,
 'min_child_weight': 2,
 'n_estimators': 150,
 'random_state': 1234}
In [33]:
print("Best Model Parameters: \n")

print("AdaBoost Classifier Score: {:.4f}\n{}\n".format(ada_gridsearch.best_score_, ada_gridsearch.best_params_))
print("GradientBoosting Classifier Score: {:.4f}\n{}\n".format(gbm_gridsearch.best_score_, gbm_gridsearch.best_params_))
print("Extra Trees Classifier Score: {:.4f}\n{}\n".format(ets_gridsearch.best_score_, ets_gridsearch.best_params_))
print("Random Forest Classifier Score: {:.4f}\n{}\n".format(rfc_gridsearch.best_score_, rfc_gridsearch.best_params_))
print("XGBoost Classifier Score: {:.4f}\n{}\n".format(xgb_gridsearch.best_score_, xgb_gridsearch.best_params_))
Best Model Parameters: 

AdaBoost Classifier Score: 0.3252
{'algorithm': 'SAMME.R', 'learning_rate': 0.35, 'n_estimators': 150, 'random_state': 1234}

GradientBoosting Classifier Score: 0.3538
{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 15, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 1234, 'subsample': 0.9}

Extra Trees Classifier Score: 0.2929
{'bootstrap': 'true', 'criterion': 'gini', 'max_depth': 50, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 300, 'random_state': 1234}

Random Forest Classifier Score: 0.3449
{'max_depth': 35, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 15, 'n_estimators': 300, 'random_state': 1234}

XGBoost Classifier Score: 0.3526
{'gamma': 0.7, 'learning_rate': 0.1, 'max_depth': 10, 'min_child_weight': 2, 'n_estimators': 150, 'random_state': 1234}
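The scores above come from the `scorer` object passed to each grid search; its definition lies outside this section. Assuming it wraps quadratic weighted kappa (the competition metric), the sketch below shows what that number measures: agreement between true and predicted classes, with larger misses penalised by the squared class distance. The function name and the default of five classes (AdoptionSpeed 0 to 4) are assumptions for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer labels in range(n_classes).

    Assumes the labels are not all concentrated in one single class,
    otherwise the expected disagreement is zero and kappa is undefined.
    """
    n = len(y_true)

    # Observed confusion matrix: rows are true classes, columns predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    observed_cost = 0.0
    expected_cost = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            observed_cost += weight * observed[i][j]
            # Counts expected if predictions were independent of the truth.
            expected_cost += weight * true_hist[i] * pred_hist[j] / n

    return 1.0 - observed_cost / expected_cost


labels = [0, 1, 2, 3, 4]
print(quadratic_weighted_kappa(labels, labels))        # perfect agreement
print(quadratic_weighted_kappa(labels, labels[::-1]))  # reversed ordering
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. In scikit-learn the same quantity is available as `cohen_kappa_score(y_true, y_pred, weights='quadratic')`, which can be wrapped with `make_scorer` for use in `GridSearchCV`.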

Predictions & Submission

In [36]:
# GridSearchCV.predict uses the best estimator, refit on the full training set
pred1 = ada_gridsearch.predict(clean_test)
pred2 = gbm_gridsearch.predict(clean_test)
# pred3 = ets_gridsearch.predict(clean_test)  # Extra Trees left out: lowest CV score (0.2929)
pred4 = rfc_gridsearch.predict(clean_test)
pred5 = xgb_gridsearch.predict(clean_test)
In [37]:
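# One column per model, one row per test pet
final_preds = pd.DataFrame(data=[pred1, pred2, pred4, pred5]).transpose()
final_preds.columns = ['AdaBoost', 'GradientBoosting', 'Random Forest', 'XGBoost']
# Equal-weight average of the four predicted classes, rounded to the nearest class
final_preds['Average'] = final_preds.mean(axis=1).round().astype(int)
final_preds.head()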
Out[37]:
   AdaBoost  GradientBoosting  Random Forest  XGBoost  Average
0         2                 2              2        2        2
1         4                 4              4        4        4
2         4                 4              4        4        4
3         4                 4              4        3        4
4         4                 4              4        4        4
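Averaging class labels treats AdoptionSpeed as an ordinal scale, so a single dissenting model can pull the result to a class that no model predicted. The sketch below compares the rounded equal-weight mean used in this notebook with a plain majority vote. The majority vote is an alternative shown only for comparison, and both helper names, as well as the third example row, are hypothetical.

```python
from collections import Counter


def rounded_mean(votes):
    """Equal-weight average of class labels, rounded to the nearest class."""
    return int(round(sum(votes) / len(votes)))


def majority_vote(votes):
    """Most common label; ties go to the label that appears first."""
    return Counter(votes).most_common(1)[0][0]


rows = [
    [2, 2, 2, 2],  # all four models agree
    [4, 4, 4, 3],  # one model is off by a single class
    [0, 4, 4, 4],  # hypothetical row with one strong outlier
]

for votes in rows:
    print(votes, rounded_mean(votes), majority_vote(votes))
```

The two rules agree on the first two rows. On the last row the mean is 3.0, so the rounded average returns class 3 even though no model voted for it, while the majority vote returns 4.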
In [38]:
submission_df = pd.DataFrame(data = {'PetID' : test['PetID'], 
                                     'AdoptionSpeed' : final_preds.Average})
submission_df.to_csv('submission.csv', index = False)
In [39]:
submission_df.head(3)
Out[39]:
       PetID  AdoptionSpeed
0  378fcc4fc              2
1  73c10e136              4
2  72000c4c5              4