The final ensemble model combined four classifier algorithms (Gradient Boosting, AdaBoost, Random Forest, and XGBoost). The final prediction class for each pet was the equally weighted average of the four models' predicted classes, rounded to the nearest class.
Model Score: quadratic weighted kappa = 0.33809
Top submissions on the public leaderboard score around 0.45
Competition Description
Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. If homes can be found for them, many precious lives can be saved — and more happy families created.
PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.
Animal adoption rates are strongly correlated to the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in their photos.
In this competition you will develop algorithms to predict the adoptability of pets - specifically, how quickly a pet is adopted. If successful, these algorithms will be adapted into AI tools that guide shelters and rescuers around the world in improving their pet profiles' appeal, reducing animal suffering and euthanization.
Top participants may be invited to collaborate on implementing their solutions into AI tools for assessing and improving pet adoption performance, which will benefit global animal welfare.
Competition Evaluation
Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters).
In the event that there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the scores which are expected/known and the predicted scores.
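As a quick illustration of the metric (toy labels, not competition data), scikit-learn's `cohen_kappa_score` computes both the unweighted and the quadratic weighted variants:

```python
from sklearn.metrics import cohen_kappa_score

actual    = [0, 1, 2, 3, 4, 4]
predicted = [0, 1, 2, 4, 3, 4]

# Unweighted kappa treats every disagreement equally; quadratic weights
# penalize predictions in proportion to the squared distance from the true class.
plain = cohen_kappa_score(actual, predicted)
qwk = cohen_kappa_score(actual, predicted, weights='quadratic')
print(plain, qwk)
```

Both scores are bounded above by 1 (perfect agreement) and can dip below 0 when agreement is worse than chance.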
*AdoptionSpeed*
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import missingno as msno
import statsmodels.api as sm
import scipy as sp
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier, RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = [10,5]
train = pd.read_csv('../input/train/train.csv')
test = pd.read_csv('../input/test/test.csv')
states = pd.read_csv('../input/state_labels.csv')
def count_overview(variable, title):
    plt.figure(figsize=(10,4))
    ax = sns.barplot(x=variable, y=variable.index, palette="Spectral")
    ax.set_title(title)
    plt.xlabel('Count')
    plt.tight_layout()
    plt.show()
def stacked_barplot(tab, title):
    tab_percent = tab.apply(lambda r: r / r.sum(), axis=1)
    ax = tab_percent.plot.barh(stacked=True, cmap='Spectral', alpha=0.8)
    ax.set_title(title)
    ax.xaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), shadow=True, ncol=5)
    plt.tight_layout()
    plt.show()
print("Training Dataset Shape: ", train.shape)
display(train.head(2))
display(train.info())
display(train.describe())
Training Dataset Shape: (14993, 24)
Type | Name | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | FurLength | Vaccinated | Dewormed | Sterilized | Health | Quantity | Fee | State | RescuerID | VideoAmt | Description | PetID | PhotoAmt | AdoptionSpeed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | Nibble | 3 | 299 | 0 | 1 | 1 | 7 | 0 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 100 | 41326 | 8480853f516546f6cf33aa88cd76c379 | 0 | Nibble is a 3+ month old ball of cuteness. He ... | 86e1089a3 | 1.0 | 2 |
1 | 2 | No Name Yet | 1 | 265 | 0 | 1 | 1 | 2 | 0 | 2 | 2 | 3 | 3 | 3 | 1 | 1 | 0 | 41401 | 3082c7125d8fb66f7dd4bff4192c8b14 | 0 | I just found it alone yesterday near my apartm... | 6296e909a | 2.0 | 0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
Type             14993 non-null int64
Name             13736 non-null object
Age              14993 non-null int64
Breed1           14993 non-null int64
Breed2           14993 non-null int64
Gender           14993 non-null int64
Color1           14993 non-null int64
Color2           14993 non-null int64
Color3           14993 non-null int64
MaturitySize     14993 non-null int64
FurLength        14993 non-null int64
Vaccinated       14993 non-null int64
Dewormed         14993 non-null int64
Sterilized       14993 non-null int64
Health           14993 non-null int64
Quantity         14993 non-null int64
Fee              14993 non-null int64
State            14993 non-null int64
RescuerID        14993 non-null object
VideoAmt         14993 non-null int64
Description      14981 non-null object
PetID            14993 non-null object
PhotoAmt         14993 non-null float64
AdoptionSpeed    14993 non-null int64
dtypes: float64(1), int64(19), object(4)
memory usage: 2.7+ MB
None
Type | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | FurLength | Vaccinated | Dewormed | Sterilized | Health | Quantity | Fee | State | VideoAmt | PhotoAmt | AdoptionSpeed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 | 14993.000000 |
mean | 1.457614 | 10.452078 | 265.272594 | 74.009738 | 1.776162 | 2.234176 | 3.222837 | 1.882012 | 1.862002 | 1.467485 | 1.731208 | 1.558727 | 1.914227 | 1.036617 | 1.576069 | 21.259988 | 41346.028347 | 0.056760 | 3.889215 | 2.516441 |
std | 0.498217 | 18.155790 | 60.056818 | 123.011575 | 0.681592 | 1.745225 | 2.742562 | 2.984086 | 0.547959 | 0.599070 | 0.667649 | 0.695817 | 0.566172 | 0.199535 | 1.472477 | 78.414548 | 32.444153 | 0.346185 | 3.487810 | 1.177265 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 41324.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1.000000 | 2.000000 | 265.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41326.000000 | 0.000000 | 2.000000 | 2.000000 |
50% | 1.000000 | 3.000000 | 266.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 0.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41326.000000 | 0.000000 | 3.000000 | 2.000000 |
75% | 2.000000 | 12.000000 | 307.000000 | 179.000000 | 2.000000 | 3.000000 | 6.000000 | 5.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41401.000000 | 0.000000 | 5.000000 | 4.000000 |
max | 2.000000 | 255.000000 | 307.000000 | 307.000000 | 3.000000 | 7.000000 | 7.000000 | 7.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 20.000000 | 3000.000000 | 41415.000000 | 8.000000 | 30.000000 | 4.000000 |
print("Test Dataset Shape: ", test.shape)
display(test.head(2))
display(test.info())
display(test.describe())
Test Dataset Shape: (3948, 23)
Type | Name | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | FurLength | Vaccinated | Dewormed | Sterilized | Health | Quantity | Fee | State | RescuerID | VideoAmt | Description | PetID | PhotoAmt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Puppy | 2 | 307 | 0 | 1 | 1 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 150 | 41326 | 4475f31553f0170229455e3c5645644f | 0 | Puppy is calm for a young dog, but he becomes ... | 378fcc4fc | 3.0 |
1 | 2 | London | 24 | 266 | 0 | 1 | 2 | 7 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 41326 | 4475f31553f0170229455e3c5645644f | 0 | Urgently seeking adoption. Please contact for ... | 73c10e136 | 1.0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3948 entries, 0 to 3947
Data columns (total 23 columns):
Type             3948 non-null int64
Name             3645 non-null object
Age              3948 non-null int64
Breed1           3948 non-null int64
Breed2           3948 non-null int64
Gender           3948 non-null int64
Color1           3948 non-null int64
Color2           3948 non-null int64
Color3           3948 non-null int64
MaturitySize     3948 non-null int64
FurLength        3948 non-null int64
Vaccinated       3948 non-null int64
Dewormed         3948 non-null int64
Sterilized       3948 non-null int64
Health           3948 non-null int64
Quantity         3948 non-null int64
Fee              3948 non-null int64
State            3948 non-null int64
RescuerID        3948 non-null object
VideoAmt         3948 non-null int64
Description      3946 non-null object
PetID            3948 non-null object
PhotoAmt         3948 non-null float64
dtypes: float64(1), int64(18), object(4)
memory usage: 709.5+ KB
None
Type | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | FurLength | Vaccinated | Dewormed | Sterilized | Health | Quantity | Fee | State | VideoAmt | PhotoAmt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 | 3948.000000 |
mean | 1.526089 | 11.564590 | 263.033435 | 57.359422 | 1.782675 | 2.232016 | 3.355623 | 2.061297 | 1.824468 | 1.466819 | 1.703647 | 1.506079 | 1.889311 | 1.043566 | 1.626393 | 27.346251 | 41351.019250 | 0.062817 | 3.809524 |
std | 0.499382 | 18.568429 | 59.178121 | 112.086810 | 0.692633 | 1.736614 | 2.700144 | 3.041357 | 0.569772 | 0.613308 | 0.664200 | 0.682930 | 0.587995 | 0.218539 | 1.609914 | 88.416045 | 34.708648 | 0.391324 | 3.627959 |
min | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 41324.000000 | 0.000000 | 0.000000 |
25% | 1.000000 | 2.000000 | 265.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41326.000000 | 0.000000 | 2.000000 |
50% | 2.000000 | 4.000000 | 266.000000 | 0.000000 | 2.000000 | 2.000000 | 3.000000 | 0.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41326.000000 | 0.000000 | 3.000000 |
75% | 2.000000 | 12.000000 | 307.000000 | 0.000000 | 2.000000 | 3.000000 | 6.000000 | 6.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 41401.000000 | 0.000000 | 5.000000 |
max | 2.000000 | 180.000000 | 307.000000 | 307.000000 | 3.000000 | 7.000000 | 7.000000 | 7.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 20.000000 | 2400.000000 | 41401.000000 | 9.000000 | 30.000000 |
Let's concatenate the train and test datasets to make the EDA easier.
train_target = train['AdoptionSpeed']
test_id = test['PetID']
all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
msno.bar(all_data.drop('AdoptionSpeed', axis=1), color=sns.color_palette("Spectral", 5)[4])
plt.show()
Let's test the hypothesis that having a name plays a role in adoption speed. First, we define a new variable ("NameorNot"), then conduct a chi-squared test of independence. Lastly, we examine the standardized residuals to see which cells drive any effect.
train['NameorNot'] = np.where(train['Name'].isnull(), 'No Name', 'Has a Name')
test['NameorNot'] = np.where(test['Name'].isnull(), 'No Name', 'Has a Name')
df_plot = train.groupby(['NameorNot', 'AdoptionSpeed']).size().reset_index().pivot(
columns='AdoptionSpeed', index='NameorNot', values=0)
print("Number of Animals by Type")
display(df_plot)
ggDF = train['NameorNot'].value_counts()
count_overview(ggDF, 'NameorNot Count')
tab = pd.crosstab(train['NameorNot'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Effects of a Name on Adoption Speed")
plt.show()
Number of Animals by Type
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
NameorNot | |||||
Has a Name | 384 | 2819 | 3725 | 3043 | 3765 |
No Name | 26 | 271 | 312 | 216 | 432 |
Chi-Square Test of Independence: p-value = 1.076E-07
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
NameorNot | |||||
Has a Name | 1.513055 | -0.869614 | 1.757704 | 4.0889 | -5.259132 |
No Name | -1.513055 | 0.869614 | -1.757704 | -4.0889 | 5.259132 |
Looking at the standardized residuals table and the bottom plot, we can see that pets that are listed without a name are more likely to NOT be adopted.
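The standardized residuals reported by statsmodels can be reproduced by hand: the adjusted residual for each cell is (observed - expected) / sqrt(expected * (1 - row share) * (1 - column share)), and absolute values beyond roughly 2 flag cells that deviate from independence more than chance alone would suggest. A minimal sketch with an illustrative 2x2 table (not the PetFinder counts):

```python
import numpy as np

# Hypothetical 2x2 contingency table (counts are illustrative only)
obs = np.array([[30.0, 10.0],
                [20.0, 40.0]])

n = obs.sum()
row = obs.sum(axis=1, keepdims=True)   # row totals
col = obs.sum(axis=0, keepdims=True)   # column totals
expected = row @ col / n               # counts expected under independence

# Standardized (adjusted) Pearson residuals
std_resid = (obs - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))
print(std_resid)
```

In a 2x2 table all four residuals share the same magnitude; in the larger tables above, each cell gets its own magnitude and sign.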
Let's look at the distribution of the target variable (*AdoptionSpeed*), then examine each feature's relationship with it.
ggDF = pd.DataFrame(train['AdoptionSpeed'].value_counts().rename({0: '0 (Same Day)', 1: '1 (1-7 Days)', 2: '2 (8-30 Days)', 3: '3 (31-90 Days)', 4: '4 (No Adoption After 100 Days)'}))
ggDF['Order'] = [4, 2, 3, 1, 0]
ggDF.sort_values('Order', inplace=True, ascending=True)
count_overview(ggDF['AdoptionSpeed'], 'AdoptionSpeed (Target Variable)')
df_plot = train.groupby(['Type', 'AdoptionSpeed']).size().reset_index().pivot(columns='AdoptionSpeed', index='Type', values=0).rename({1: 'Dog', 2: 'Cat'})
print("Number of Animals by Type")
display(df_plot)
ggDF = train['Type'].value_counts().rename({1: 'Dog', 2: 'Cat'})
count_overview(ggDF, 'Type Count')
tab = pd.crosstab(train['Type'], train['AdoptionSpeed']).rename({1: 'Dog', 2: 'Cat'})
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Adoption Speed by Type")
plt.show()
Number of Animals by Type
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Type | |||||
Dog | 170 | 1435 | 2164 | 1949 | 2414 |
Cat | 240 | 1655 | 1873 | 1310 | 1783 |
Chi-Square Test of Independence: p-value = 5.010E-34
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Type | |||||
Dog | -5.264749 | -9.7657 | -0.946594 | 7.208134 | 5.024245 |
Cat | 5.264749 | 9.7657 | 0.946594 | -7.208134 | -5.024245 |
It seems cats are more likely to be adopted sooner, compared to dogs. This may be due to the larger responsibility & commitment associated with owning a dog.
df_plot = train.groupby(['Gender', 'AdoptionSpeed']).size().reset_index().pivot(
columns='AdoptionSpeed', index='Gender', values=0).rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
print("Number of Animals by Gender")
display(df_plot)
ggDF = train['Gender'].value_counts().rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
count_overview(ggDF, 'Gender Count')
tab = pd.crosstab(train['Gender'], train['AdoptionSpeed']).rename({1: 'Male', 2: 'Female', 3: 'Mixed (Groups of Pets)'})
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Adoption Speed by Gender")
plt.show()
Number of Animals by Gender
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Gender | |||||
Male | 160 | 1283 | 1578 | 1109 | 1406 |
Female | 204 | 1366 | 1911 | 1671 | 2125 |
Mixed (Groups of Pets) | 46 | 441 | 548 | 479 | 666 |
Chi-Square Test of Independence: p-value = 1.871E-13
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Gender | |||||
Male | 0.893609 | 5.942875 | 3.333695 | -3.871096 | -5.416289 |
Female | 0.501221 | -5.403628 | -1.782910 | 3.534435 | 3.200882 |
Mixed (Groups of Pets) | -1.934038 | -0.474798 | -2.036183 | 0.288581 | 2.876952 |
plt.figure(figsize=(12,5))
sns.distplot(train['Age'], kde=False, bins=100, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('Age Distribution')
plt.show()
There seems to be a large number of outliers (very old pets), but the median age (50th percentile) is the same for both cats and dogs: pets up for adoption are typically around 3 months old.
display(train[['AdoptionSpeed', 'Age']].groupby('AdoptionSpeed').describe())
sns.barplot(x=train.AdoptionSpeed, y=train.Age, palette = 'Spectral')
plt.show()
Age | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
AdoptionSpeed | ||||||||
0 | 410.0 | 10.451220 | 17.775118 | 0.0 | 2.0 | 3.0 | 12.0 | 120.0 |
1 | 3090.0 | 8.488350 | 15.746187 | 0.0 | 2.0 | 2.0 | 6.0 | 147.0 |
2 | 4037.0 | 8.823631 | 16.779013 | 0.0 | 2.0 | 3.0 | 6.0 | 156.0 |
3 | 3259.0 | 10.189936 | 18.672104 | 0.0 | 2.0 | 3.0 | 9.0 | 212.0 |
4 | 4197.0 | 13.667858 | 20.177460 | 0.0 | 3.0 | 6.0 | 15.0 | 255.0 |
As expected, pets that are not adopted after 100 days are, on average, much older. Interestingly, pets adopted on the same day spanned all ages, so there must be some other factor that drives same-day adoptions.
plt.figure(figsize=(10,5))
sns.distplot(train['PhotoAmt'], kde=False, bins=30, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('PhotoAmt Distribution')
plt.show()
It's highly unlikely that the difference between 7 and 8 pictures will be the deciding factor for adoption, so we can group the number of photos into a new categorical variable (*PhotosType*).
The thresholds for the classes were chosen somewhat arbitrarily, but the histogram above shows a steep drop in PhotoAmt after 5. The *PhotosType* classes correspond to the following numbers of pictures:
- No Photos: 0 photos available
- Few Photos: Between 1-5 photos
- Many Photos: More than 5 photos
train['PhotosType'] = "No Photos"
train['PhotosType'] = np.where(train['PhotoAmt'] > 0, 'Few Photos', train['PhotosType'])
train['PhotosType'] = np.where(train['PhotoAmt'] > 5, 'Many Photos', train['PhotosType'])
test['PhotosType'] = "No Photos"
test['PhotosType'] = np.where(test['PhotoAmt'] > 0, 'Few Photos', test['PhotosType'])
test['PhotosType'] = np.where(test['PhotoAmt'] > 5, 'Many Photos', test['PhotosType'])
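The chained `np.where` calls can also be written as a single binning step. A sketch using `pd.cut`, assuming the 0 / 1-5 / >5 grouping described above:

```python
import pandas as pd

photo_amt = pd.Series([0, 1, 3, 5, 6, 12], name='PhotoAmt')  # illustrative values

# Bin edges: (-inf, 0] -> No Photos, (0, 5] -> Few Photos, (5, inf) -> Many Photos
photos_type = pd.cut(photo_amt,
                     bins=[-float('inf'), 0, 5, float('inf')],
                     labels=['No Photos', 'Few Photos', 'Many Photos'])
print(photos_type.tolist())
# ['No Photos', 'Few Photos', 'Few Photos', 'Few Photos', 'Many Photos', 'Many Photos']
```

`pd.cut` keeps the class boundaries in one place, which makes them easier to tune later.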
df_plot = train.groupby(['PhotosType', 'AdoptionSpeed']).size().reset_index().pivot(
columns='AdoptionSpeed', index='PhotosType', values=0)
print("Number of Pets Adopted by Number of Photos Available")
display(df_plot)
ggDF = train['PhotosType'].value_counts()
count_overview(ggDF, 'Photos Type Count')
tab = pd.crosstab(train['PhotosType'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Adoption Speed by Number of Photos")
plt.show()
Number of Pets Adopted by Number of Photos Available
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
PhotosType | |||||
Few Photos | 268 | 2003 | 2460 | 1818 | 2508 |
Many Photos | 43 | 438 | 758 | 815 | 466 |
No Photos | 99 | 649 | 819 | 626 | 1223 |
Chi-Square Test of Independence: p-value = 3.302E-76
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
PhotosType | |||||
Few Photos | 2.081322 | 5.630655 | 0.802700 | -6.101690 | -1.016624 |
Many Photos | -3.470047 | -4.393020 | 3.912735 | 14.150472 | -11.647052 |
No Photos | 0.666870 | -2.648652 | -4.424175 | -5.501059 | 11.568863 |
Having no photos seems to be a big factor associated with pets not being adopted after 100 days.
states = pd.read_csv('../input/state_labels.csv')
states.index = states['StateID']
states_dict = states.to_dict()
state_counts = pd.crosstab(train['State'], train['AdoptionSpeed'])
state_counts.index = state_counts.index.map(states_dict['StateName'])
state_totals = state_counts.copy(deep=True)
state_totals['Total'] = state_counts.sum(axis=1)
display(state_totals)
plt.figure(figsize=(10,6))
ax = sns.barplot(x=state_totals.Total, y = state_totals.index, palette="Spectral")
ax.set_title("Adoptions by State")
plt.xlabel('Count')
plt.tight_layout()
plt.show()
stacked_barplot(state_counts, 'Adoption Speed by State')
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 | Total |
---|---|---|---|---|---|---|
State | ||||||
Melaka | 4 | 18 | 23 | 12 | 80 | 137 |
Kedah | 3 | 14 | 34 | 23 | 36 | 110 |
Selangor | 246 | 1877 | 2435 | 2004 | 2152 | 8714 |
Pulau Pinang | 8 | 122 | 216 | 197 | 300 | 843 |
Perak | 3 | 48 | 111 | 117 | 141 | 420 |
Negeri Sembilan | 4 | 36 | 63 | 42 | 108 | 253 |
Pahang | 3 | 29 | 14 | 16 | 23 | 85 |
Johor | 23 | 113 | 136 | 103 | 132 | 507 |
Sarawak | 1 | 1 | 0 | 2 | 9 | 13 |
Sabah | 1 | 6 | 3 | 4 | 8 | 22 |
Terengganu | 0 | 9 | 2 | 6 | 9 | 26 |
Kelantan | 2 | 3 | 3 | 1 | 6 | 15 |
Kuala Lumpur | 112 | 814 | 996 | 731 | 1192 | 3845 |
Labuan | 0 | 0 | 1 | 1 | 1 | 3 |
There is a clear difference in adoption speeds between the states. Sarawak and Melaka have the highest proportions of pets left unadopted after 100 days, while Selangor and Pahang have the lowest.
train['Purebreed'] = np.where(train['Breed2'] == 0, 'Pure Breed', 'Mixed Breed')
test['Purebreed'] = np.where(test['Breed2'] == 0, 'Pure Breed', 'Mixed Breed')
df_plot = train.groupby(['Purebreed', 'AdoptionSpeed']).size().reset_index().pivot(
columns='AdoptionSpeed', index='Purebreed', values=0)
print("Number of Animals by Breed Purity")
display(df_plot)
ggDF = train['Purebreed'].value_counts()
count_overview(ggDF, 'Purebreed Count')
tab = pd.crosstab(train['Purebreed'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Adoption Speed by Breed Purity")
plt.show()
Number of Animals by Breed Purity
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Purebreed | |||||
Mixed Breed | 157 | 876 | 1133 | 992 | 1073 |
Pure Breed | 253 | 2214 | 2904 | 2267 | 3124 |
Chi-Square Test of Independence: p-value = 9.746E-09
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Purebreed | |||||
Mixed Breed | 4.595 | 0.179756 | -0.255053 | 3.181495 | -4.501907 |
Pure Breed | -4.595 | -0.179756 | 0.255053 | -3.181495 | 4.501907 |
Mixed breeds seem to be adopted faster than pure breeds.
First, let's see the distribution of fees.
ax = sns.distplot(train['Fee'], kde=False, bins=50, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4])
ax.set_title('Fee Distribution (Log Scale)')
ax.set_yscale('log')
Let's convert the fee variable into a new categorical variable (*PayorNot*) with two levels:
- Free - No fee associated with adoption
- Paid Adoption - Fee associated with adoption
train['PayorNot'] = np.where(train['Fee'] == 0, 'Free', 'Paid Adoption')
test['PayorNot'] = np.where(test['Fee'] == 0, 'Free', 'Paid Adoption')
df_plot = train.groupby(['PayorNot', 'AdoptionSpeed']).size().reset_index().pivot(
columns='AdoptionSpeed', index='PayorNot', values=0)
print("Number of Animals by Fee Type")
display(df_plot)
ggDF = train['PayorNot'].value_counts()
count_overview(ggDF, 'PayorNot Count')
tab = pd.crosstab(train['PayorNot'], train['AdoptionSpeed'])
table = sm.stats.Table(tab)
print("Chi-Square Test of Independence: p-value = {:.3E}".format(sp.stats.chi2_contingency(tab)[1]))
display(table.standardized_resids)
stacked_barplot(tab, "Adoption Speed by Fee")
plt.show()
Number of Animals by Fee Type
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
PayorNot | |||||
Free | 357 | 2611 | 3430 | 2789 | 3476 |
Paid Adoption | 53 | 479 | 607 | 470 | 721 |
Chi-Square Test of Independence: p-value = 5.575E-03
AdoptionSpeed | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
PayorNot | |||||
Free | 1.481222 | 0.067103 | 1.03537 | 1.993102 | -3.452485 |
Paid Adoption | -1.481222 | -0.067103 | -1.03537 | -1.993102 | 3.452485 |
plt.figure(figsize=(12,5))
sns.distplot(train[train['PayorNot'] != 'Free']['Fee'], kde=False, bins=30, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4]).set_title('Fee Distribution (Paid Adoptions Only)')
plt.show()
train['DescriptionLength'] = train['Description'].apply(lambda x: len(str(x)))
sns.distplot(train['DescriptionLength'], kde=False, bins=50, hist_kws=dict(alpha=0.85), color=sns.color_palette("Spectral", 5)[4])
plt.show()
test['DescriptionLength'] = test['Description'].apply(lambda x: len(str(x)))
sns.boxplot(x = train['AdoptionSpeed'], y = train['DescriptionLength'])
plt.show()
gr = train.groupby('AdoptionSpeed').DescriptionLength
for label, arr in gr:
    sns.kdeplot(arr, label=label, shade=True)
plt.show()
Some of the variables, such as MaturitySize or FurLength, include labels that are either "Not Specified" or "Not Sure". We will impute these unknown values with the mode (the most common value).
caution_variables = ['MaturitySize', 'FurLength', 'Health', 'Vaccinated', 'Dewormed', 'Sterilized']
for variable in caution_variables:
    train[variable].replace(0, train[variable].mode()[0], inplace=True)
    test[variable].replace(0, test[variable].mode()[0], inplace=True)
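One caveat: the loop above computes the mode separately on each split (and includes the 0 "unknown" codes themselves when taking the mode). A leakage-free alternative learns the fill value from the non-zero training rows only and applies it to both splits. A sketch with illustrative data standing in for `train` and `test`:

```python
import pandas as pd

# Illustrative frames; 0 encodes "Not Specified" as in the real data
train_df = pd.DataFrame({'FurLength': [1, 1, 2, 0, 3]})
test_df  = pd.DataFrame({'FurLength': [0, 2, 0, 1]})

for col in ['FurLength']:
    # Mode of the *known* training values only, then applied to both splits
    fill = train_df.loc[train_df[col] != 0, col].mode()[0]
    train_df[col] = train_df[col].replace(0, fill)
    test_df[col] = test_df[col].replace(0, fill)

print(train_df[col].tolist())  # [1, 1, 2, 1, 3]
print(test_df[col].tolist())   # [1, 2, 1, 1]
```

This keeps the test set from influencing the imputation value, which matters more once pipelines are cross-validated.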
clean_train = train.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
clean_test = test.drop(columns=['Name', 'RescuerID', 'Description', 'PetID'])
cat_cols = ['PayorNot', 'NameorNot','PhotosType','Purebreed']
for col in cat_cols:
    label = LabelEncoder()
    label.fit(list(clean_train[col].values))
    clean_train[col] = label.transform(list(clean_train[col].values))
    clean_test[col] = label.transform(list(clean_test[col].values))
Let's start by defining the evaluation criterion (quadratic weighted kappa). We will use scikit-learn's cohen_kappa_score function with quadratic weights for cross-validation.
def metric(y1, y2):
    return cohen_kappa_score(y1, y2, weights='quadratic')

# Make scorer for scikit-learn
scorer = make_scorer(metric)
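Before grid searching, the scorer can be sanity-checked with a plain cross-validation run. A sketch on synthetic data, where `make_classification` stands in for `clean_train` / `train_target`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic multi-class data (illustrative stand-in for the cleaned training set)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

qwk_scorer = make_scorer(lambda a, b: cohen_kappa_score(a, b, weights='quadratic'))
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, y, cv=3, scoring=qwk_scorer)
print(scores.mean())
```

A mean well above 0 confirms the scorer is wired up correctly before committing to a long grid search.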
Models:
gbm_model = GradientBoostingClassifier()
gbm_grid = {
'loss' : ['deviance'],
'learning_rate' : [0.1],
'n_estimators' : [100],
'subsample' : [0.8, 0.9],
'min_samples_split' : [5,10],
'random_state' : [1234],
'max_depth' : [5, 7],
'max_features' : ['auto'],
'min_samples_leaf': [15,20]
}
gbm_gridsearch = GridSearchCV(estimator = gbm_model,
param_grid = gbm_grid,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring = scorer)
gbm_gridsearch.fit(clean_train, train_target)
gbm_gridsearch.best_params_
Fitting 3 folds for each of 16 candidates, totalling 48 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 48 out of 48 | elapsed: 4.5min finished
{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 15, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 1234, 'subsample': 0.9}
ada_model = AdaBoostClassifier()
ada_grid = {
'random_state' : [1234],
'n_estimators' : [135, 150],
'learning_rate' : [0.35, 0.5],
'algorithm' : ['SAMME.R']
}
ada_gridsearch = GridSearchCV(estimator = ada_model,
param_grid = ada_grid,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring = scorer)
ada_gridsearch.fit(clean_train, train_target)
ada_gridsearch.best_params_
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 12 out of 12 | elapsed: 7.8s finished
{'algorithm': 'SAMME.R', 'learning_rate': 0.35, 'n_estimators': 150, 'random_state': 1234}
ets_model = ExtraTreesClassifier()
ets_grid = {
'random_state' : [1234],
'n_estimators' : [50, 150, 300],
'criterion' : ['gini'],
'max_depth' : [10, 25, 50],
'max_features' : ['auto'],
'min_samples_split' : [5, 10, 15, 30],
'min_samples_leaf' : [5, 10, 15, 30],
'bootstrap' : ['true']
}
ets_gridsearch = GridSearchCV(estimator = ets_model,
param_grid = ets_grid,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring = scorer)
ets_gridsearch.fit(clean_train, train_target)
print(ets_gridsearch.best_score_)
ets_gridsearch.best_params_
Fitting 3 folds for each of 144 candidates, totalling 432 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 16.3s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 1.3min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed: 3.0min finished
0.2928920722473285
{'bootstrap': 'true', 'criterion': 'gini', 'max_depth': 50, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 300, 'random_state': 1234}
rfc_model = RandomForestClassifier()
rfc_grid = {
'random_state' : [1234],
'n_estimators' : [250, 300, 350],
'max_depth' : [35, 50, 75],
'max_features' : ['auto'],
'min_samples_leaf': [5, 10],
'min_samples_split': [10, 15]
}
rfc_gridsearch = GridSearchCV(estimator = rfc_model,
param_grid = rfc_grid,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring = scorer)
rfc_gridsearch.fit(clean_train, train_target)
rfc_gridsearch.best_params_
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 54.8s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 2.3min finished
{'max_depth': 35, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 15, 'n_estimators': 300, 'random_state': 1234}
xgb_model = XGBClassifier()
xgb_grid = {
'n_estimators' : [150, 200],
'random_state' : [1234],
'max_depth': [10,12,15],
'min_child_weight': [2,3],
'learning_rate': [0.1],
'gamma': [0.6, 0.7, 0.8]
}
xgb_gridsearch = GridSearchCV(estimator = xgb_model,
param_grid = xgb_grid,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring = scorer)
xgb_gridsearch.fit(clean_train, train_target)
xgb_gridsearch.best_params_
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 8.3min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 21.3min finished
{'gamma': 0.7, 'learning_rate': 0.1, 'max_depth': 10, 'min_child_weight': 2, 'n_estimators': 150, 'random_state': 1234}
print("Best Model Parameters: \n")
print("AdaBoost Classifier Score: {:.4f}\n{}\n".format(ada_gridsearch.best_score_, ada_gridsearch.best_params_))
print("GradientBoosting Classifier Score: {:.4f}\n{}\n".format(gbm_gridsearch.best_score_, gbm_gridsearch.best_params_))
print("Extra Trees Classifier Score: {:.4f}\n{}\n".format(ets_gridsearch.best_score_, ets_gridsearch.best_params_))
print("Random Forest Classifier Score: {:.4f}\n{}\n".format(rfc_gridsearch.best_score_, rfc_gridsearch.best_params_))
print("XGBoost Classifier Score: {:.4f}\n{}\n".format(xgb_gridsearch.best_score_, xgb_gridsearch.best_params_))
Best Model Parameters:

AdaBoost Classifier Score: 0.3252
{'algorithm': 'SAMME.R', 'learning_rate': 0.35, 'n_estimators': 150, 'random_state': 1234}

GradientBoosting Classifier Score: 0.3538
{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 15, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 1234, 'subsample': 0.9}

Extra Trees Classifier Score: 0.2929
{'bootstrap': 'true', 'criterion': 'gini', 'max_depth': 50, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 300, 'random_state': 1234}

Random Forest Classifier Score: 0.3449
{'max_depth': 35, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 15, 'n_estimators': 300, 'random_state': 1234}

XGBoost Classifier Score: 0.3526
{'gamma': 0.7, 'learning_rate': 0.1, 'max_depth': 10, 'min_child_weight': 2, 'n_estimators': 150, 'random_state': 1234}
pred1 = ada_gridsearch.predict(clean_test)
pred2 = gbm_gridsearch.predict(clean_test)
#pred3 = ets_gridsearch.predict(clean_test) # Did not perform well
pred4 = rfc_gridsearch.predict(clean_test)
pred5 = xgb_gridsearch.predict(clean_test)
final_preds = pd.DataFrame(data=[pred1, pred2, pred4, pred5]).transpose()
final_preds.columns = (['AdaBoost', 'GradientBoosting', 'Random Forest', 'XGBoost'])
final_preds['Average'] = round(final_preds.mean(axis=1)).astype(int)
final_preds.head()
AdaBoost | GradientBoosting | Random Forest | XGBoost | Average | |
---|---|---|---|---|---|
0 | 2 | 2 | 2 | 2 | 2 |
1 | 4 | 4 | 4 | 4 | 4 |
2 | 4 | 4 | 4 | 4 | 4 |
3 | 4 | 4 | 4 | 3 | 4 |
4 | 4 | 4 | 4 | 4 | 4 |
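Rounding the mean treats AdoptionSpeed as ordinal, which suits the quadratic kappa; a majority vote is a common alternative that ignores class ordering. A sketch comparing the two on illustrative predictions (not the actual model outputs):

```python
import pandas as pd

# Hypothetical per-model class predictions for four pets
preds = pd.DataFrame({
    'AdaBoost':         [2, 4, 4, 4],
    'GradientBoosting': [2, 4, 4, 4],
    'RandomForest':     [2, 4, 4, 4],
    'XGBoost':          [2, 4, 3, 4],
})

# Rounded mean (the approach used above): respects the ordinal scale
mean_vote = preds.mean(axis=1).round().astype(int)

# Majority vote: most frequent class per row, ignoring ordering;
# mode(axis=1) lists ties in sorted order, so take the first column
majority_vote = preds.mode(axis=1)[0].astype(int)

print(mean_vote.tolist())      # [2, 4, 4, 4]
print(majority_vote.tolist())  # [2, 4, 4, 4]
```

The two agree here, but they diverge when the models scatter across non-adjacent classes, where the rounded mean can land on a class no model predicted.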
submission_df = pd.DataFrame(data = {'PetID' : test['PetID'],
'AdoptionSpeed' : final_preds.Average})
submission_df.to_csv('submission.csv', index = False)
submission_df.head(3)
PetID | AdoptionSpeed | |
---|---|---|
0 | 378fcc4fc | 2 |
1 | 73c10e136 | 4 |
2 | 72000c4c5 | 4 |