This notebook explains the formula language of Statsample-GLM and its ability to handle categorical data in regression.

This notebook is based on a notebook created by Alexej.

Logistic regression with categorical data¶

We aim to fit a logistic regression model to the shelter animal data from Kaggle using the Ruby gems daru and statsample-glm.

Let's first load the data.

In [1]:

require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
p shelter_data.shape
shelter_data.head(3)

[26711, 10]

Out[1]:

Daru::DataFrame(3x10)
	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	Breed	Color	AgeuponOutcome(Weeks)
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner		Dog	Neutered Male	Shetland Sheepdog Mix	Brown/White	52.0
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	Domestic Shorthair Mix	Cream Tabby	52.0
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	Pit Bull Mix	Blue/White	104.0

We need to tell Daru which vectors are categorical. We can do this with #to_category.

In [2]:

shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil

We create a 0-1-valued indicator for whether the animal got adopted. We will then create a logistic model to predict whether an animal got adopted or not.

In [3]:

shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3

Out[3]:

Daru::DataFrame(3x11)
	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	Breed	Color	AgeuponOutcome(Weeks)	OutcomeType_Adoption
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner		Dog	Neutered Male	Shetland Sheepdog Mix	Brown/White	52.0	0
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	Domestic Shorthair Mix	Cream Tabby	52.0	0
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	Pit Bull Mix	Blue/White	104.0	1
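
To make the indicator column less of a black box, here is a minimal plain-Ruby sketch of dummy coding: every category except a base category becomes its own 0-1 column. It is an illustration only, not daru's implementation; the helper dummy_code and the choice of 'Return_to_owner' as base category are assumptions made for this example.

```ruby
# Plain-Ruby illustration of dummy coding (hypothetical helper, not daru's code).
# Each non-base category becomes one 0/1 column named "<prefix>_<category>".
def dummy_code(values, prefix:, base:)
  categories = values.uniq - [base]
  categories.to_h do |category|
    ["#{prefix}_#{category}", values.map { |v| v == category ? 1 : 0 }]
  end
end

# OutcomeType values of the three rows shown above.
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption']

# Assumption: 'Return_to_owner' is treated as the base category here.
coded = dummy_code(outcomes, prefix: 'OutcomeType', base: 'Return_to_owner')

adoption = coded['OutcomeType_Adoption']
p adoption # prints [0, 0, 1]
```

The resulting values match the OutcomeType_Adoption column in the table above: only the third animal was adopted.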

Before we create a model, let's do some preprocessing to make it more effective.

Some data preprocessing¶

I am using only the first 600 rows for this demo because Statsample-GLM is a bit slow to compute.

In [4]:

small = shelter_data.head 600
small.head 3

Out[4]:

Daru::DataFrame(3x11)
	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	Breed	Color	AgeuponOutcome(Weeks)	OutcomeType_Adoption
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner		Dog	Neutered Male	Shetland Sheepdog Mix	Brown/White	52.0	0
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	Domestic Shorthair Mix	Cream Tabby	52.0	0
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	Pit Bull Mix	Blue/White	104.0	1

In [5]:

p small['Breed'].categories.size, small['Color'].categories.size

1380
366

Out[5]:

[1380, 366]

Since the number of categories in 'Breed' and 'Color' is large, we need to club some of these categories together.

Grouping Breeds¶

Let's have a look at the distribution.

In [6]:

small['Breed'].frequencies.sort(ascending: false).head(10)

Out[6]:

Daru::Vector(10)
	Breed
Domestic Shorthair Mix	204
Chihuahua Shorthair Mix	47
Pit Bull Mix	38
Labrador Retriever Mix	33
Domestic Medium Hair Mix	17
Siamese Mix	11
Domestic Longhair Mix	11
German Shepherd Mix	10
Australian Cattle Dog Mix	8
Dachshund Mix	7

Let's merge the infrequently occurring categories (those with fewer than 10 entries) into a single category 'other', so that we have fewer categories to deal with.

Here we've used #rename_categories, which accepts a hash mapping old categories to new ones.

In [7]:

other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Breed'].rename_categories other_cats_hash
small['Breed'].frequencies

Out[7]:

Daru::Vector(9)
	Breed
Domestic Shorthair Mix	204
Pit Bull Mix	38
German Shepherd Mix	10
Chihuahua Shorthair Mix	47
Labrador Retriever Mix	33
Domestic Longhair Mix	11
Siamese Mix	11
Domestic Medium Hair Mix	17
other	229
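
The grouping logic can also be sketched in plain Ruby on a toy column. The data, the cutoff of 2, and the variable names below are hypothetical; the sketch only mirrors the idea of building a hash from rare categories to 'other' and applying it.

```ruby
# Hypothetical toy column standing in for small['Breed'].
breeds = ['Pit Bull Mix'] * 3 + ['Siamese Mix'] * 2 + ['Pug Mix', 'Beagle Mix']
threshold = 2 # hypothetical cutoff; the notebook uses 10

# Count every category in a single pass.
counts = breeds.tally

# Categories seen fewer than `threshold` times are mapped to 'other'.
rare = counts.select { |_, n| n < threshold }.keys
rename_map = rare.to_h { |category| [category, 'other'] }

# Apply the mapping; categories that are not in the map stay unchanged.
grouped = breeds.map { |breed| rename_map.fetch(breed, breed) }
p grouped.tally
```

Counting once with tally avoids rescanning the whole column for every category, which may matter when there are more than a thousand categories.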

And let's set the base category to 'other'.

In [8]:

small['Breed'].base_category = 'other'

Out[8]:

"other"

We now do the same with 'Color'.

Grouping colors¶

In [9]:

p small['Color'].categories.size
small['Color'].frequencies.sort(ascending: false).head 10

Out[9]:

Daru::Vector(10)
	Color
Black/White	66
Black	52
Brown Tabby	37
Tricolor	22
Brown/White	21
Brown Tabby/White	20
Calico	19
White	19
Tan/White	18
Brown	16

In [10]:

other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Color'].rename_categories other_cats_hash
small['Color'].frequencies

Out[10]:

Daru::Vector(24)
	Color
Brown/White	21
Blue/White	12
Tan	11
Black/Tan	14
Blue Tabby	10
Brown Tabby	37
White	19
Black	52
Brown	16
Orange Tabby/White	14
Black/White	66
Brown Brindle/White	10
Orange Tabby	15
Chocolate/White	11
Blue	10
Calico	19
Brown/Black	11
Tricolor	22
White/Black	10
Tortie	13
Tan/White	18
Brown Tabby/White	20
White/Brown	13
other	156

In [11]:

small['Color'].base_category = 'other'

Out[11]:

"other"

Looking at SexuponOutcome¶

In [12]:

small['SexuponOutcome'].frequencies

Out[12]:

Daru::Vector(6)
	SexuponOutcome
Neutered Male	216
Spayed Female	205
Intact Male	78
Intact Female	77
Unknown	24
	0

The last row tells us that there is a category 'nil' (with a count of 0 in this subset). Let's rename this category to 'Unknown', because 'Unknown' already stores all the unknown values.

In [13]:

p small['SexuponOutcome'].categories
small['SexuponOutcome'].rename_categories nil => 'Unknown'
small['SexuponOutcome'].categories

["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown", nil]

Out[13]:

["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown"]

Split to train and test¶

In [14]:

train = small.head 500
test = small.tail 100
p train.size, test.size

500
100

Out[14]:

[500, 100]

Model fit¶

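Now, having put the data in an appropriate form, we can fit the logistic regression model with statsample-glm. As a baseline, we first compute the trivial accuracy, that is, the accuracy obtained by always predicting the more common outcome in the test set.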

In [16]:

m = test['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"

Out[16]:

"Trivial accuracy = 0.5900000000000001"

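The expression [m, 1-m].max used above is simply the share of the majority class. A small plain-Ruby check on hypothetical labels (not the shelter data) makes this explicit:

```ruby
# Hypothetical 0/1 labels: 1 = adopted, 0 = not adopted.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Mean of a 0/1 vector = share of adopted animals.
m = labels.sum.to_f / labels.size
trivial_accuracy = [m, 1 - m].max

# Same quantity, computed by always predicting the most frequent label.
majority = labels.tally.max_by { |_, n| n }.first
always_majority = labels.count(majority).to_f / labels.size

puts "majority class: #{majority}"
puts "both ways agree: #{(trivial_accuracy - always_majority).abs < 1e-12}"
```

Any model worth keeping should beat this number on the test set.
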
In [17]:

require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.df_for_regression.head 5
glm_adoption.model.coefficients :hash

Out[17]:

{:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
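
One way to read these numbers, sketched in plain Ruby: in a logistic model the coefficients act on the log-odds scale, so exponentiating a coefficient gives an odds ratio relative to the base category, with the other predictors held fixed. The sketch assumes the standard logit link; the helper logistic is defined here for illustration and is not part of statsample-glm.

```ruby
# Logistic (sigmoid) function: maps a linear predictor to a probability.
def logistic(eta)
  1.0 / (1.0 + Math.exp(-eta))
end

# Coefficient for AnimalType_Cat, copied from the output above.
cat_coefficient = 0.8376443692275163

# exp(coefficient) is the odds ratio relative to the base category.
odds_ratio = Math.exp(cat_coefficient)
puts format('odds ratio for Cat: %.2f', odds_ratio)

# A linear predictor of 0 corresponds to a probability of 0.5,
# which is why 0.5 is a natural classification threshold.
puts logistic(0.0)
```

Under this reading, the fitted odds of adoption for a cat are roughly 2.3 times those of the base animal type in this 500-row training sample.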

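We can also use the model we just created to make predictions on the test set. The predicted probabilities are thresholded at 0.5 to obtain 0-1 class labels.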

In [18]:

predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5

Out[18]:

Daru::Vector(5)
0	0
1	0
2	1
3	0
4	0
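
To compare the model with the trivial accuracy, the thresholded predictions can be scored against the true labels. Here is a minimal plain-Ruby sketch; the probabilities and labels are hypothetical, not taken from the shelter data.

```ruby
# Hypothetical predicted probabilities and true 0/1 labels.
probabilities = [0.12, 0.40, 0.81, 0.35, 0.66]
actual        = [0, 0, 1, 1, 1]

# Same thresholding rule as in the cell above.
predicted = probabilities.map { |prob| prob < 0.5 ? 0 : 1 }

# Accuracy = share of predictions that match the true label.
correct = predicted.zip(actual).count { |pred, truth| pred == truth }
accuracy = correct.to_f / actual.size

puts "accuracy = #{accuracy}"
```

With the real data, the same comparison would be made between the thresholded predictions and the 'OutcomeType_Adoption' values of the test set.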