This notebook covers Daru's support for categorical data.
require 'daru'
true
To create a vector whose data is categorical, specify type: :category when initializing it.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category
Daru::Vector(5) | |
---|---|
0 | a |
1 | 1 |
2 | a |
3 | 1 |
4 | c |
dv.frequencies
Daru::Vector(3) | |
---|---|
a | 2 |
1 | 2 |
c | 1 |
You can initialize the vector with predefined categories, even ones that do not occur in the data, using the categories option.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category, categories: [:a, :b, :c, 1]
Daru::Vector(5) | |
---|---|
0 | a |
1 | 1 |
2 | a |
3 | 1 |
4 | c |
The categories option initializes new categories and also specifies the order in which they occur. The frequency table is now ordered the way you specified.
dv.frequencies
Daru::Vector(4) | |
---|---|
a | 2 |
b | 0 |
c | 1 |
1 | 2 |
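To make the effect of the categories option concrete, here is a minimal plain-Ruby sketch of how such a frequency table could be computed. It does not use Daru and is not Daru's implementation; the helper name frequency_table is hypothetical.

```ruby
# Hypothetical helper, not part of Daru: count how often each category occurs,
# reporting categories in the given order and including unused ones with 0.
def frequency_table(data, categories)
  counts = categories.map { |category| [category, 0] }.to_h
  data.each do |value|
    raise ArgumentError, "Invalid category #{value}" unless counts.key?(value)
    counts[value] += 1
  end
  counts
end

table = frequency_table([:a, 1, :a, 1, :c], [:a, :b, :c, 1])
table.each { |category, count| puts "#{category} | #{count}" }
```

Because a Ruby Hash keeps insertion order, seeding it from the categories list is enough to fix the row order, and :b appears with a count of 0.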
Since categorical data can be ordered or unordered, you can specify which one the vector is by passing ordered: true or ordered: false during initialization.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], categories: [:a, :b, :c, 1], ordered: false, type: :category
Daru::Vector(5) | |
---|---|
0 | a |
1 | 1 |
2 | a |
3 | 1 |
4 | c |
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
As you can see, you can't do the comparison if the vector is not ordered. Let's make it ordered.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], ordered: true, categories: [:a, :b, :c, 1], type: :category
Daru::Vector(5) | |
---|---|
0 | a |
1 | 1 |
2 | a |
3 | 1 |
4 | c |
dv.min
:a
dv.sort!
Daru::Vector(5) | |
---|---|
0 | a |
2 | a |
4 | c |
1 | 1 |
3 | 1 |
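The ordering used by min and sort! above follows the position of each value's category in the categories list, not Ruby's own comparison of the values. The following plain-Ruby sketch illustrates that idea under this assumption; it is not Daru's implementation, and the variable names are hypothetical.

```ruby
# Hypothetical sketch, not Daru's implementation: rank every value by the
# position of its category in the categories list.
categories = [:a, :b, :c, 1]
data = [:a, 1, :a, 1, :c]

rank = categories.each_with_index.to_h # maps each category to its position

smallest = data.min_by { |value| rank[value] }
largest = data.max_by { |value| rank[value] }

# Sort positions by category rank, using the original position to break ties,
# so each value keeps its original index label.
sorted_positions = data.each_index.sort_by { |i| [rank[data[i]], i] }
sorted_values = sorted_positions.map { |i| data[i] }

puts smallest.inspect
puts sorted_positions.inspect
puts sorted_values.inspect
```

Note that symbols and integers cannot be compared directly in Ruby, which is why an explicit category order is needed for mixed data like this.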
Besides setting the categories during initialization, you can also set them after the vector has been initialized.
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.categories = [:a, :b, :c, 1]
[:a, :b, :c, 1]
You can also check all the categories associated with the vector.
dv.categories
[:a, :b, :c, 1]
You can also specify whether the vector should be treated as ordered after it has been initialized.
Note: By default, the vector is unordered.
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.ordered?
false
dv.ordered = true
dv.ordered?
true
Here are a few measures that summarize a categorical vector.
dv = Daru::Vector.new [:a, :a, :a, :b, :b, :c], type: :category
dv.summary
Daru::Vector(6) | |
---|---|
size | 6 |
categories | 3 |
max_freq | 3 |
max_category | a |
min_freq | 1 |
min_category | c |
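The fields of the summary can be reproduced by hand from a frequency count. The sketch below does so in plain Ruby for the same data; it is only an illustration of what the numbers mean, not how Daru computes them.

```ruby
# Hypothetical sketch, not Daru's implementation: derive the summary fields
# from a simple frequency count.
data = [:a, :a, :a, :b, :b, :c]

counts = data.each_with_object(Hash.new(0)) { |value, tally| tally[value] += 1 }

max_category, max_freq = counts.max_by { |_, count| count }
min_category, min_freq = counts.min_by { |_, count| count }

summary = {
  size: data.size,
  categories: counts.size,
  max_freq: max_freq,
  max_category: max_category,
  min_freq: min_freq,
  min_category: min_category
}

summary.each { |field, value| puts "#{field} | #{value}" }
```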
frequencies gives the frequency of each category, listed in the order of the categories rather than in the order the values appear in the data.
dv = Daru::Vector.new ['third']*3 + ['second']*2 + ['first'], type: :category, categories: ['first', 'second', 'third']
dv.frequencies
Daru::Vector(3) | |
---|---|
first | 1 |
second | 2 |
third | 3 |
Note: min, max and sort! only apply if the vector is ordered.
dv
Daru::Vector(6) | |
---|---|
0 | third |
1 | third |
2 | third |
3 | second |
4 | second |
5 | first |
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
dv.ordered = true
true
dv.min
"first"
dv.max
"third"
dv.sort!
Daru::Vector(6) | |
---|---|
5 | first |
3 | second |
4 | second |
0 | third |
1 | third |
2 | third |
#add_category associates a new category with the vector.
Note: To insert a new categorical value, you first need to call #add_category so that the category is registered in the vector. For example:
dv
Daru::Vector(6) | |
---|---|
5 | first |
3 | second |
4 | second |
0 | third |
1 | third |
2 | third |
dv[0] = 'fourth'
ArgumentError: Invalid category fourth, to add a new category use #add_category
dv.add_category 'fourth'
dv[0] = 'fourth'
dv
Daru::Vector(6) | |
---|---|
5 | first |
3 | second |
4 | second |
0 | fourth |
1 | third |
2 | third |
dv.categories
["first", "second", "third", "fourth"]
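The behaviour above, where assignment is rejected until the category is registered, can be sketched with a small stand-in class. TinyCategorical is hypothetical and only mimics the idea; it indexes by position rather than by Daru's index labels, and it is not how Daru stores categorical data.

```ruby
# Hypothetical stand-in, not part of Daru: a list that only accepts values
# whose category has been registered beforehand.
class TinyCategorical
  attr_reader :categories

  def initialize(data, categories = data.uniq)
    @data = data.dup
    @categories = categories.dup
  end

  # Register a new category; registering an existing one is a no-op.
  def add_category(category)
    @categories << category unless @categories.include?(category)
  end

  # Assign by position, refusing values from unregistered categories.
  def []=(position, value)
    unless @categories.include?(value)
      raise ArgumentError, "Invalid category #{value}, to add a new category use #add_category"
    end
    @data[position] = value
  end

  def to_a
    @data.dup
  end
end

vec = TinyCategorical.new(%w[third third third second second first], %w[first second third])

rejected = begin
  vec[0] = 'fourth'
  false
rescue ArgumentError => e
  puts e.message
  true
end

vec.add_category('fourth')
vec[0] = 'fourth'
puts vec.categories.inspect
```

Validating on assignment keeps the list of categories authoritative, so a typo in a value surfaces as an error instead of silently creating a new category.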
You can rename a subset of the existing categories by passing a hash that maps the old names to the new ones.
dv = Daru::Vector.new [1, 2, 'third', 2, 1], type: :category
Daru::Vector(5) | |
---|---|
0 | 1 |
1 | 2 |
2 | third |
3 | 2 |
4 | 1 |
dv.rename_categories 1 => 'first', 2 => 'second'
dv
Daru::Vector(5) | |
---|---|
0 | first |
1 | second |
2 | third |
3 | second |
4 | first |
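Conceptually, a partial rename applies the hash to every category and leaves unmapped ones untouched. A minimal plain-Ruby sketch of that mapping, with hypothetical variable names and no claim about Daru's internals:

```ruby
# Hypothetical sketch, not Daru's implementation: rename only the categories
# that appear as keys in the mapping and keep the rest as they are.
mapping = { 1 => 'first', 2 => 'second' }
data = [1, 2, 'third', 2, 1]
categories = data.uniq

renamed_categories = categories.map { |category| mapping.fetch(category, category) }
renamed_data = data.map { |value| mapping.fetch(value, value) }

puts renamed_categories.inspect
puts renamed_data.inspect
```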
Indexing works much as it does for an ordinary vector, so you can expect these methods to behave the same way. Here are a few examples:
dv = Daru::Vector.new [1, 1, 2, 2, 3, 1], index: :a..:f, type: :category
Daru::Vector(6) | |
---|---|
a | 1 |
b | 1 |
c | 2 |
d | 2 |
e | 3 |
f | 1 |
dv[0..2]
Daru::Vector(3) | |
---|---|
a | 1 |
b | 1 |
c | 2 |
dv.at -1
1
dv.set_at [0, 1], 3
dv
Daru::Vector(6) | |
---|---|
a | 3 |
b | 3 |
c | 2 |
d | 2 |
e | 3 |
f | 1 |
Daru uses Arel-like syntax for querying data.
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.ordered = true
dv.frequencies
Daru::Vector(4) | |
---|---|
I | 3 |
II | 2 |
III | 1 |
IV | 1 |
dv.where(dv.eq('I'))
Daru::Vector(3) | |
---|---|
0 | I |
2 | I |
5 | I |
dv.where(dv.gt('II'))
Daru::Vector(2) | |
---|---|
3 | III |
4 | IV |
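A query such as gt('II') can be thought of as building a boolean mask from the category order and then keeping the rows where the mask is true. The plain-Ruby sketch below illustrates this under that assumption; it does not use Daru, and the variable names are hypothetical.

```ruby
# Hypothetical sketch, not Daru's implementation: compare values by the
# position of their category, build a boolean mask, then filter with it.
categories = %w[I II III IV]
data = %w[I II I III IV I II]

rank = categories.each_with_index.to_h

# true where the value's category comes after 'II' in the category order
mask = data.map { |value| rank[value] > rank['II'] }

# Keep [index, value] pairs for the rows selected by the mask.
selected = data.each_with_index
               .select { |_, i| mask[i] }
               .map { |value, i| [i, value] }

selected.each { |i, value| puts "#{i} | #{value}" }
```

Comparing by category position rather than by string value is what lets a vector define an order that differs from the alphabetical one.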
df = Daru::DataFrame.new({
a: (1..7).to_a,
b: ('a'..'g').to_a,
c: ['I', 'II', 'I', 'III', 'IV', 'I', 'II']
})
Daru::DataFrame(7x3) | |||
---|---|---|---|
a | b | c | |
0 | 1 | a | I |
1 | 2 | b | II |
2 | 3 | c | I |
3 | 4 | d | III |
4 | 5 | e | IV |
5 | 6 | f | I |
6 | 7 | g | II |
df.c = df.c.to_category
df
Daru::DataFrame(7x3) | |||
---|---|---|---|
a | b | c | |
0 | 1 | a | I |
1 | 2 | b | II |
2 | 3 | c | I |
3 | 4 | d | III |
4 | 5 | e | IV |
5 | 6 | f | I |
6 | 7 | g | II |
df.where(df.c.gt('I') & df.c.lt('IV'))
Daru::DataFrame(3x3) | |||
---|---|---|---|
a | b | c | |
1 | 2 | b | II |
3 | 4 | d | III |
6 | 7 | g | II |
Categorical data supports four contrast coding schemes. The first example below shows dummy coding, with the first category, 'I', as the base:
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.name = 'Rank'
dv.contrast_code
Daru::DataFrame(7x3) | |||
---|---|---|---|
Rank_II | Rank_III | Rank_IV | |
0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 0 |
3 | 0 | 1 | 0 |
4 | 0 | 0 | 1 |
5 | 0 | 0 | 0 |
6 | 1 | 0 | 0 |
You can set the base category using #base_category=. The base category gets no column of its own, and its rows are all zeros.
dv.base_category = 'IV'
dv.contrast_code
Daru::DataFrame(7x3) | |||
---|---|---|---|
Rank_I | Rank_II | Rank_III | |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 0 |
5 | 1 | 0 | 0 |
6 | 0 | 1 | 0 |
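The table above can be reproduced by hand: every non-base category gets a column holding 1 where the value matches and 0 elsewhere. A plain-Ruby sketch of this dummy coding follows; the helper dummy_code is hypothetical and is not part of Daru.

```ruby
# Hypothetical helper, not part of Daru: dummy-code data against a base
# category, producing one 0/1 column per non-base category.
def dummy_code(data, categories, base, name)
  (categories - [base]).map do |category|
    ["#{name}_#{category}", data.map { |value| value == category ? 1 : 0 }]
  end.to_h
end

categories = %w[I II III IV]
data = %w[I II I III IV I II]

coded = dummy_code(data, categories, 'IV', 'Rank')
coded.each { |column, values| puts "#{column}: #{values.inspect}" }
```

Row 4, which holds the base category 'IV', is 0 in every column.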
To use a different coding scheme, set it using #coding_scheme=
dv.coding_scheme = :deviation
dv.contrast_code
Daru::DataFrame(7x3) | |||
---|---|---|---|
Rank_I | Rank_II | Rank_III | |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | -1 | -1 | -1 |
5 | 1 | 0 | 0 |
6 | 0 | 1 | 0 |
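In the deviation-coded output above, the only difference from dummy coding is that rows belonging to the base category hold -1 in every column instead of 0. The plain-Ruby sketch below reproduces the table on that basis; the helper deviation_code is hypothetical and is not part of Daru.

```ruby
# Hypothetical helper, not part of Daru: deviation-code data against a base
# category. Base-category rows are -1 in every column.
def deviation_code(data, categories, base, name)
  (categories - [base]).map do |category|
    column = data.map do |value|
      if value == base
        -1
      elsif value == category
        1
      else
        0
      end
    end
    ["#{name}_#{category}", column]
  end.to_h
end

categories = %w[I II III IV]
data = %w[I II I III IV I II]

coded = deviation_code(data, categories, 'IV', 'Rank')
coded.each { |column, values| puts "#{column}: #{values.inspect}" }
```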