We aim to fit a logistic regression model to the shelter animal data from Kaggle using the Ruby gems daru and statsample-glm.
Let's first load the data.
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
p shelter_data.shape
shelter_data.head(3)
[26711, 10]
Daru::DataFrame(3x10)

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 |

We need to tell Daru which vectors are categorical. We can do this with #to_category.
shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil
We create a 0/1 indicator for whether the animal was adopted. We will then fit a logistic model to predict whether an animal is adopted or not.
shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3
Daru::DataFrame(3x11)

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 | 0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 | 0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 | 1 |

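To make the indicator column concrete, here is a minimal plain-Ruby sketch of the same idea, without daru. The helper name `indicator_for` is hypothetical and only illustrates the encoding; it is not how daru's contrast coding is implemented.

```ruby
# Plain-Ruby sketch of a 0/1 indicator column (no gems required).
# `indicator_for` is a hypothetical helper, not part of daru.
def indicator_for(values, target)
  # 1 where the value equals the target category, 0 otherwise.
  values.map { |value| value == target ? 1 : 0 }
end

# Outcome types of the three rows shown above.
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption']
p indicator_for(outcomes, 'Adoption') # => [0, 0, 1]
```

The result matches the OutcomeType_Adoption values of the three rows shown above.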
Before we create a model, let's do some preprocessing so that the model is effective.
I am using only the first 600 rows for this demo because statsample-glm is a bit slow.
small = shelter_data.head 600
small.head 3
Daru::DataFrame(3x11)

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 | 0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 | 0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 | 1 |

p small['Breed'].categories.size, small['Color'].categories.size
1380 366
[1380, 366]
Since the number of categories in 'Breed' and 'Color' is large, we need to club some of these categories together.
Let's have a look at the distribution.
small['Breed'].frequencies.sort(ascending: false).head(10)
Daru::Vector(10) | |
---|---|
Breed | |
Domestic Shorthair Mix | 204 |
Chihuahua Shorthair Mix | 47 |
Pit Bull Mix | 38 |
Labrador Retriever Mix | 33 |
Domestic Medium Hair Mix | 17 |
Siamese Mix | 11 |
Domestic Longhair Mix | 11 |
German Shepherd Mix | 10 |
Australian Cattle Dog Mix | 8 |
Dachshund Mix | 7 |
Let's merge the infrequently occurring categories (those with fewer than 10 occurrences) into a single category, 'other', so that we have fewer categories to deal with.
Here we use #rename_categories, which accepts a hash mapping old categories to new ones.
other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Breed'].rename_categories other_cats_hash
small['Breed'].frequencies
Daru::Vector(9) | |
---|---|
Breed | |
Domestic Shorthair Mix | 204 |
Pit Bull Mix | 38 |
German Shepherd Mix | 10 |
Chihuahua Shorthair Mix | 47 |
Labrador Retriever Mix | 33 |
Domestic Longhair Mix | 11 |
Siamese Mix | 11 |
Domestic Medium Hair Mix | 17 |
other | 229 |
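The merging step above can also be sketched in plain Ruby on an ordinary array. The helper `lump_rare` and the toy breed list are assumptions made for illustration; the notebook itself does this through daru's #rename_categories.

```ruby
# Hypothetical helper: replace every value seen fewer than `min_count`
# times with a single catch-all label.
def lump_rare(values, min_count:, other: 'other')
  # Count how often each value occurs.
  counts = values.each_with_object(Hash.new(0)) { |value, tally| tally[value] += 1 }
  # Keep frequent values, relabel the rare ones.
  values.map { |value| counts[value] < min_count ? other : value }
end

# Toy data, not taken from the shelter set.
breeds = ['Pit Bull Mix'] * 3 + ['Siamese Mix'] * 2 + ['Pug', 'Beagle']
lumped = lump_rare(breeds, min_count: 2)
p lumped.uniq # => ["Pit Bull Mix", "Siamese Mix", "other"]
```

The threshold is a judgment call: a higher `min_count` gives fewer, better-populated categories at the cost of losing detail about the rare ones.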
And let's set the base category to 'other'.
small['Breed'].base_category = 'other'
"other"
We now do the same with 'Color'.
p small['Color'].categories.size
small['Color'].frequencies.sort(ascending: false).head 10
366
Daru::Vector(10) | |
---|---|
Color | |
Black/White | 66 |
Black | 52 |
Brown Tabby | 37 |
Tricolor | 22 |
Brown/White | 21 |
Brown Tabby/White | 20 |
Calico | 19 |
White | 19 |
Tan/White | 18 |
Brown | 16 |
other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Color'].rename_categories other_cats_hash
small['Color'].frequencies
Daru::Vector(24) | |
---|---|
Color | |
Brown/White | 21 |
Blue/White | 12 |
Tan | 11 |
Black/Tan | 14 |
Blue Tabby | 10 |
Brown Tabby | 37 |
White | 19 |
Black | 52 |
Brown | 16 |
Orange Tabby/White | 14 |
Black/White | 66 |
Brown Brindle/White | 10 |
Orange Tabby | 15 |
Chocolate/White | 11 |
Blue | 10 |
Calico | 19 |
Brown/Black | 11 |
Tricolor | 22 |
White/Black | 10 |
Tortie | 13 |
Tan/White | 18 |
Brown Tabby/White | 20 |
White/Brown | 13 |
other | 156 |
small['Color'].base_category = 'other'
"other"
small['SexuponOutcome'].frequencies
Daru::Vector(6) | |
---|---|
SexuponOutcome | |
Neutered Male | 216 |
Spayed Female | 205 |
Intact Male | 78 |
Intact Female | 77 |
Unknown | 24 |
nil | 0 |
The last row shows that the vector has a nil category, although no row in this subset uses it. Let's rename this category to 'Unknown', since 'Unknown' already holds all the unknown values.
p small['SexuponOutcome'].categories
small['SexuponOutcome'].rename_categories nil => 'Unknown'
small['SexuponOutcome'].categories
["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown", nil]
["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown"]
train = small.head 500
test = small.tail 100
p train.size, test.size
500 100
[500, 100]
Now that the data is in an appropriate form, we can fit the logistic regression model with statsample-glm. As a baseline, we first compute the trivial accuracy on the test set.
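A useful reference point before fitting is the trivial accuracy: the accuracy obtained by always predicting the more common class. The following is a minimal plain-Ruby sketch of that calculation on made-up labels; `baseline_accuracy` is a hypothetical helper name, and the notebook computes the same quantity from the daru vector's mean.

```ruby
# Hypothetical helper: accuracy of always predicting the majority class
# for an array of 0/1 labels.
def baseline_accuracy(labels)
  positive_rate = labels.sum.to_f / labels.size
  # Predicting all 1s scores positive_rate; predicting all 0s scores the rest.
  [positive_rate, 1 - positive_rate].max
end

# Toy labels, for illustration only: 3 adopted, 7 not adopted.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p baseline_accuracy(labels)
```

A fitted model is only interesting if its test accuracy beats this number.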
m = test['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"
"Trivial accuracy = 0.5900000000000001"
require 'statsample-glm'
formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.df_for_regression.head 5
glm_adoption.model.coefficients :hash
{:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
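To read these numbers, it helps to recall that logistic regression coefficients are conventionally on the log-odds scale. The sketch below assumes that convention holds here; it does not reproduce statsample-glm's internals, and the helper names `sigmoid` and `odds_ratio` are hypothetical.

```ruby
# Standard logistic (sigmoid) function: maps log-odds to a probability.
def sigmoid(log_odds)
  1.0 / (1.0 + Math.exp(-log_odds))
end

# Multiplicative change in the odds for a one-unit increase in a predictor,
# assuming the coefficient is expressed in log-odds.
def odds_ratio(coefficient)
  Math.exp(coefficient)
end

# Age coefficient copied from the hash above.
age_coef = -0.006959545305620043
puts "Odds ratio per extra week of age: #{odds_ratio(age_coef).round(4)}"
puts "Probability at log-odds 0: #{sigmoid(0)}"
```

Under this reading, the negative age coefficient gives an odds ratio slightly below 1, meaning each additional week of age is associated with slightly lower odds of adoption, holding the other predictors fixed.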
We can also use the model we just created to predict on the test set, turning each predicted probability into a 0/1 label with a 0.5 cutoff.
predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
Daru::Vector(5) | |
---|---|
0 | 0 |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 0 |
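A natural next step is to compare such predictions with the true labels and set the result against the trivial accuracy computed earlier. The following plain-Ruby sketch shows the thresholding and the accuracy calculation on made-up numbers; the helpers `threshold` and `accuracy` are hypothetical and are not part of daru or statsample-glm.

```ruby
# Hypothetical helpers, independent of daru and statsample-glm.

# Turn probabilities into 0/1 labels using a cutoff.
def threshold(probabilities, cutoff = 0.5)
  probabilities.map { |prob| prob < cutoff ? 0 : 1 }
end

# Share of predictions that match the true labels.
def accuracy(predicted, actual)
  raise ArgumentError, 'length mismatch' unless predicted.size == actual.size

  hits = predicted.zip(actual).count { |pred, truth| pred == truth }
  hits.to_f / actual.size
end

# Made-up probabilities and labels, for illustration only.
probabilities = [0.1, 0.4, 0.8, 0.3, 0.6]
actual        = [0, 0, 1, 1, 1]
labels = threshold(probabilities)
p labels # => [0, 0, 1, 0, 1]
p accuracy(labels, actual)
```

With the notebook's data, the same helper could presumably be applied as `accuracy(predict.to_a, test['OutcomeType_Adoption'].to_a)`, assuming both vectors convert cleanly to arrays of 0/1 values.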