This notebook shows how to work with categorical data in daru. Its applicability is currently limited because regression is not yet supported for categorical data.
require 'daru'
true
This is animal shelter data taken from a Kaggle competition.
It covers animals that were given up by their owners to a shelter. Let's gain some insight into this data.
shelter_data = Daru::DataFrame.from_csv '../data/animal_shelter_train.csv'
shelter_data.head(3)
Daru::DataFrame(3x10)
| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 |
shelter_data.shape
[26711, 10]
We are not interested in DateTime, AnimalID and OutcomeSubtype, so we will delete them.
Since OutcomeType, AnimalType, SexuponOutcome, Breed and Color are qualitative variables, we'll convert them to type category.
shelter_data.delete_vectors 'DateTime', 'AnimalID', 'OutcomeSubtype'
shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
shelter_data.first 5
Daru::DataFrame(5x7)
| | Name | OutcomeType | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) |
|---|---|---|---|---|---|---|---|
| 0 | Hambone | Return_to_owner | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 |
| 1 | Emily | Euthanasia | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 |
| 2 | Pearce | Adoption | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 |
| 3 | | Transfer | Cat | Intact Male | Domestic Shorthair Mix | Blue Cream | 3.0 |
| 4 | | Transfer | Dog | Neutered Male | Lhasa Apso/Miniature Poodle | Tan | 104.0 |
We'll categorize AgeuponOutcome(Weeks) to get a quick summary of the ages (as we will see later).
shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut(
  [0, 1, 4, 52, 260, 1500],
  labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than__five_years]
)
shelter_data.delete_vector 'AgeuponOutcome(Weeks)'
nil
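To make the binning step concrete, here is a minimal plain-Ruby sketch of what cutting a numeric column into labelled intervals amounts to. It does not use daru: `age_bucket`, `AGE_EDGES` and `AGE_LABELS` are hypothetical names, and the sketch assumes each bin is half-open, [lower, upper). daru's own handling of the bin edges may differ, so check its documentation before relying on this.

```ruby
# Plain-Ruby sketch of binning ages (in weeks) into labelled intervals.
# Assumption: each bin is half-open, [lower, upper). This is NOT daru's
# implementation; its edge handling may differ.
AGE_EDGES  = [0, 1, 4, 52, 260, 1500].freeze
AGE_LABELS = %i[less_than_week less_than_month less_than_year
                one_to_five_years more_than__five_years].freeze

# Hypothetical helper: returns the label of the bin containing `weeks`,
# or nil when the value is missing or falls outside every bin.
def age_bucket(weeks, edges = AGE_EDGES, labels = AGE_LABELS)
  return nil if weeks.nil?

  edges.each_cons(2).with_index do |(lower, upper), i|
    return labels[i] if weeks >= lower && weeks < upper
  end
  nil
end

sample_ages = [0.5, 3.0, 52.0, 104.0, 300.0]
p sample_ages.map { |weeks| age_bucket(weeks) }
```

Under the half-open assumption, an age of exactly 52.0 weeks lands in :one_to_five_years rather than :less_than_year.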
Let's look at the categories we have formed.
shelter_data['AgeuponOutcome'].frequencies.sort ascending: false
Daru::Vector(5) | |
---|---|
one_to_five_years | 10605 |
less_than_year | 9965 |
more_than__five_years | 4216 |
less_than_month | 1505 |
less_than_week | 420 |
Say we are interested in the percentage of each animal type in the shelter.
shelter_data['AnimalType'].frequencies :percentage
Daru::Vector(2) | |
---|---|
AnimalType | |
Dog | 58.38044251431994 |
Cat | 41.61955748568006 |
This tells us that about 58% of the animals in our dataset are dogs and about 42% are cats. Let's explore further.
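The percentages above appear to be each category's count divided by the number of rows, times 100. As a rough illustration of that arithmetic, here is a plain-Ruby sketch; `percentage_frequencies` is a hypothetical helper, not a daru method, and the input is toy data rather than the shelter dataset. It needs Ruby 2.7 or later for `tally`.

```ruby
# Plain-Ruby sketch of a percentage frequency table.
# `percentage_frequencies` is a hypothetical helper, not part of daru.
def percentage_frequencies(values)
  total = values.size.to_f
  # tally counts occurrences of each distinct value (Ruby 2.7+).
  values.tally.transform_values { |count| 100.0 * count / total }
end

# Toy data for illustration only (not taken from the shelter dataset).
toy_animals = %w[Dog Cat Dog Dog Cat]
p percentage_frequencies(toy_animals)
```

For the toy data this gives 60.0 for Dog and 40.0 for Cat, and the values always sum to 100.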
Let's look at the possible outcomes along with their frequencies.
shelter_data['OutcomeType'].frequencies
Daru::Vector(5) | |
---|---|
OutcomeType | |
Return_to_owner | 4786 |
Euthanasia | 1553 |
Adoption | 10769 |
Transfer | 9406 |
Died | 197 |
So a large share of these animals (10,769 of 26,711, about 40%) are adopted, which is great.
Let's get some insight into the animals that died.
died = shelter_data.where shelter_data['OutcomeType'].eq('Died')
died['AnimalType'].frequencies :percentage
Daru::Vector(2) | |
---|---|
AnimalType | |
Dog | 25.380710659898476 |
Cat | 74.61928934010153 |
Hmm.. Cats are over-represented among the animals that died: they make up about 42% of the dataset but about 75% of the deaths, so cats are more prone to die than dogs.
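A share of deaths is not the same as a death rate, so it helps to divide by the number of animals of each type. The sketch below does that in plain Ruby. The counts are not printed anywhere in this notebook: they are derived by applying the percentages above to the 26,711 rows and the 197 deaths and rounding, so treat them as approximate. `outcome_rate` is a hypothetical helper.

```ruby
# Per-species death rate from counts derived (by rounding) from the
# outputs above: 26,711 rows split 58.38% / 41.62%, and 197 deaths
# split 25.38% / 74.62%. Approximate, not read from the raw data.
TOTAL_BY_TYPE = { 'Dog' => 15_594, 'Cat' => 11_117 }.freeze
DIED_BY_TYPE  = { 'Dog' => 50, 'Cat' => 147 }.freeze

# Hypothetical helper: percentage of each group that had the outcome.
def outcome_rate(outcome_counts, group_totals)
  outcome_counts.to_h do |type, count|
    [type, (100.0 * count / group_totals.fetch(type)).round(2)]
  end
end

outcome_rate(DIED_BY_TYPE, TOTAL_BY_TYPE).each do |type, rate|
  puts "#{type}: #{rate}% died"
end
```

With these derived counts, roughly 0.3% of dogs and 1.3% of cats died, so the rate for cats is about four times higher.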
Let's get some insight into the ages of the cats and dogs that died.
died.where(died['AnimalType'].eq 'Dog')['AgeuponOutcome'].frequencies :percentage
Daru::Vector(5) | |
---|---|
less_than_week | 12.0 |
less_than_month | 4.0 |
less_than_year | 24.0 |
one_to_five_years | 40.0 |
more_than__five_years | 20.0 |
died.where(died['AnimalType'].eq 'Cat')['AgeuponOutcome'].frequencies :percentage
Daru::Vector(5) | |
---|---|
less_than_week | 11.564625850340136 |
less_than_month | 12.244897959183673 |
less_than_year | 57.14285714285714 |
one_to_five_years | 12.244897959183673 |
more_than__five_years | 6.802721088435375 |
Also, the cats that died were mostly young: about 81% of them were under a year old, compared with 40% of the dogs that died.
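The comparison between the two age tables is easier to see when the three youngest categories are added up. Here is a small plain-Ruby sketch of that sum, with the percentages copied from the two outputs above; `share_in` and the constant names are hypothetical.

```ruby
# Share of deaths that occurred before one year of age, per species.
# Percentages are copied from the two frequency tables above.
DOG_DEATH_AGE_PCT = {
  less_than_week: 12.0, less_than_month: 4.0, less_than_year: 24.0,
  one_to_five_years: 40.0, more_than__five_years: 20.0
}.freeze
CAT_DEATH_AGE_PCT = {
  less_than_week: 11.564625850340136, less_than_month: 12.244897959183673,
  less_than_year: 57.14285714285714, one_to_five_years: 12.244897959183673,
  more_than__five_years: 6.802721088435375
}.freeze
UNDER_ONE_YEAR = %i[less_than_week less_than_month less_than_year].freeze

# Hypothetical helper: sums the percentages of the given categories.
def share_in(percentages, categories)
  percentages.values_at(*categories).sum.round(1)
end

puts "Dogs under one year: #{share_in(DOG_DEATH_AGE_PCT, UNDER_ONE_YEAR)}%"
puts "Cats under one year: #{share_in(CAT_DEATH_AGE_PCT, UNDER_ONE_YEAR)}%"
```

Note that this only describes the animals that died. Saying how likely a young cat is to die would also need the age distribution of all cats, which is not computed here.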
Let's turn our attention to the animals that were adopted.
adopted = shelter_data.where shelter_data['OutcomeType'].eq('Adoption')
adopted['AnimalType'].frequencies :percentage
Daru::Vector(2) | |
---|---|
AnimalType | |
Dog | 60.33057851239669 |
Cat | 39.66942148760331 |
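Hmm... Dogs account for about 60% of adoptions, slightly more than their 58% share of the dataset, so they are only marginally more likely to be adopted than cats. That may explain part, but probably not much, of why so many cats die.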
Let's now look at the animals that were returned to their owners.
owner = shelter_data.where shelter_data['OutcomeType'].eq('Return_to_owner')
owner['AnimalType'].frequencies :percentage
Daru::Vector(2) | |
---|---|
AnimalType | |
Dog | 89.55286251567071 |
Cat | 10.447137484329295 |
Astonishingly, about 90% of the animals returned to their owners are dogs, while only about 10% are cats.
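The tables above give the split of each outcome by animal type. To ask instead how likely a dog or a cat is to end up with a given outcome, the counts have to be divided by the number of animals of that type. The plain-Ruby sketch below does this for adoptions and returns. All counts are derived by applying the percentages above to the outcome totals and rounding, so they are approximate, and `rate_within_species` is a hypothetical helper.

```ruby
# Rate within a species, as opposed to share within an outcome.
# Counts are derived (by rounding) from the outputs above, not read
# directly from the raw data, so treat them as approximate.
SPECIES_TOTALS = { 'Dog' => 15_594, 'Cat' => 11_117 }.freeze
OUTCOME_COUNTS = {
  'Adoption'        => { 'Dog' => 6_497, 'Cat' => 4_272 },
  'Return_to_owner' => { 'Dog' => 4_286, 'Cat' => 500 }
}.freeze

# Hypothetical helper: percentage of one species ending with each outcome.
def rate_within_species(outcome_counts, species_totals, species)
  outcome_counts.transform_values do |by_species|
    (100.0 * by_species.fetch(species) / species_totals.fetch(species)).round(1)
  end
end

SPECIES_TOTALS.each_key do |species|
  puts "#{species}: #{rate_within_species(OUTCOME_COUNTS, SPECIES_TOTALS, species)}"
end
```

With these derived counts, about 27.5% of dogs were returned to their owners against about 4.5% of cats, while the adoption rates are much closer, at about 41.7% for dogs and 38.4% for cats.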