Discretizers

This package provides discretization methods and functions for mapping between raw values and integer labels.

Installation

In [ ]:
Pkg.add("Discretizers")

Once the installation is complete, you can load the package in any session by running

In [1]:
using Discretizers

Discretization

Categorical Labels

You can construct an object for mapping labels to integer indices

In [2]:
data = [:cat, :dog, :dog, :cat, :cat, :elephant]
catdisc = CategoricalDiscretizer(data);

The resulting object can be used to encode your source labels as integer category indices

In [3]:
println(":cat becomes: ", encode(catdisc, :cat))
println(":dog becomes: ", encode(catdisc, :dog))
println("data becomes: ", encode(catdisc, data))
:cat becomes: 1
:dog becomes: 2
data becomes: [1,2,2,1,1,3]

You can also transform back

In [4]:
println("1 becomes: ", decode(catdisc, 1))
println("2 becomes: ", decode(catdisc, 2))
println("[1,2,3] becomes: ", decode(catdisc, [1,2,3]))
1 becomes: cat
2 becomes: dog
[1,2,3] becomes: [:cat,:dog,:elephant]
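
Conceptually, this kind of encoder is a pair of lookup tables. The sketch below is a plain-Julia illustration, not the package's implementation; `build_label_maps` is a hypothetical helper, and it assumes indices are assigned in order of first appearance, which is consistent with the output above.

```julia
# Hypothetical helper, not part of Discretizers: give each distinct label
# the next integer index, in order of first appearance.
function build_label_maps(labels)
    label_to_index = Dict{Any,Int}()
    index_to_label = Any[]
    for label in labels
        if !haskey(label_to_index, label)
            push!(index_to_label, label)
            label_to_index[label] = length(index_to_label)
        end
    end
    return label_to_index, index_to_label
end

labels = [:cat, :dog, :dog, :cat, :cat, :elephant]
label_to_index, index_to_label = build_label_maps(labels)

encoded = [label_to_index[l] for l in labels]    # labels -> integers
decoded = [index_to_label[i] for i in encoded]   # integers -> labels
```

Encoding is a dictionary lookup and decoding is an array lookup, so the round trip returns the original labels.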

The CategoricalDiscretizer works with any object type

In [5]:
CategoricalDiscretizer(["A", "B", "C"])
CategoricalDiscretizer([5000, 1200, 100])
CategoricalDiscretizer([:dog, "hello world", NaN]);

Linear Discretization

Linear discretization into a series of bins is supported as well

Here we construct a linear discretizer that maps $[0,0.5) \rightarrow 1$ and $[0.5,1] \rightarrow 2$

In [6]:
bin_edges = [0.0,0.5,1.0]
lindisc = LinearDiscretizer(bin_edges);

Encoding works the same way

In [7]:
println("0.2 becomes: ", encode(lindisc, 0.2))
println("0.7 becomes: ", encode(lindisc, 0.7))
println("0.5 becomes: ", encode(lindisc, 0.5))
println("it works on arrays: ", encode(lindisc, [0.0,0.8,0.2]))
0.2 becomes: 1
0.7 becomes: 2
0.5 becomes: 2
it works on arrays: [1,2,1]
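
The bin lookup can be pictured as a search over the sorted edges. The following is a minimal sketch in plain Julia, not the package's implementation; `encode_linear` is a hypothetical name, and rejecting values outside the edges is an assumption that may differ from what `LinearDiscretizer` does.

```julia
# Hypothetical helper, not part of Discretizers.
function encode_linear(bin_edges, x)
    nbins = length(bin_edges) - 1
    if x < bin_edges[1] || x > bin_edges[end]
        error("value lies outside the bin edges")  # assumption: reject outliers
    end
    # searchsortedlast gives the index of the last edge that is <= x;
    # clamping to nbins makes the final bin closed on the right.
    return min(searchsortedlast(bin_edges, x), nbins)
end

bin_edges = [0.0, 0.5, 1.0]
indices = [encode_linear(bin_edges, x) for x in [0.2, 0.7, 0.5, 1.0]]
```

With these edges, 0.5 falls in the second bin because each bin is closed on the left, and 1.0 also falls in the second bin because of the clamp.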

Decoding is a bit different: the index identifies a bin, and a value is sampled uniformly from within that bin, so repeated calls return different results

In [8]:
println("1 becomes: ", decode(lindisc, 1))
println("2 becomes: ", decode(lindisc, 2))
println("it works on arrays: ", decode(lindisc, [2,1,2]))
1 becomes: 0.1887493587068908
2 becomes: 0.5460050358882282
it works on arrays: [0.8104340883570698,0.454060686695556,0.9457654175718269]
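
Sampling from a bin amounts to scaling a uniform random number to the bin's interval. This is a sketch under that assumption, with `decode_linear` as a hypothetical name rather than the package's code.

```julia
# Hypothetical helper, not part of Discretizers.
function decode_linear(bin_edges, i)
    lo, hi = bin_edges[i], bin_edges[i + 1]
    return lo + rand() * (hi - lo)   # rand() is uniform on [0, 1)
end

# Every sample decoded from bin 2 should land between 0.5 and 1.0.
samples = [decode_linear([0.0, 0.5, 1.0], 2) for k in 1:1000]
```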

A few helper functions are also available

In [9]:
println("number of labels: ", nlabels(catdisc), "  ", nlabels(lindisc))
println("bin centers:      ", bincenters(lindisc))
println("extrema of a bin: ", extrema(lindisc, 2))
number of labels: 3  2
bin centers:      [0.25,0.75]
extrema of a bin: (0.5,1.0)

Both discretizers can be constructed to map to other integer types

In [10]:
catdisc = CategoricalDiscretizer(data, Int32)
lindisc = LinearDiscretizer(bin_edges, UInt8)
encode(lindisc, 0.2)
Out[10]:
0x01

Discretization Algorithms

In many cases one would like to determine the bin edges for a LinearDiscretizer automatically from data. This package supports several algorithms that do this.

  • Uniform Width

    DiscretizeUniformWidth(nbins) - divides the domain into nbins bins of equal width

  • Uniform Count

    DiscretizeUniformCount(nbins) - divides the domain into nbins bins, each containing approximately the same number of samples

  • Bayesian Blocks

    DiscretizeBayesianBlocks() - determines an appropriate number of bins, and their generally non-uniform edges, by maximizing a Bayesian fitness criterion. See this website for an overview.
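
To make the difference between the first two strategies concrete, here is a plain-Julia sketch. The function names are hypothetical and the edge placement for equal counts (midway between neighbouring samples) is an assumption; the package may place edges differently.

```julia
# Hypothetical helpers, not part of Discretizers.

# Equal-width bins: nbins + 1 evenly spaced edges between the extremes.
function uniform_width_edges(data, nbins)
    lo, hi = extrema(data)
    return [lo + (hi - lo) * k / nbins for k in 0:nbins]
end

# Equal-count bins: each interior edge sits midway between the two sorted
# samples that straddle a k/nbins split of the data.
function uniform_count_edges(data, nbins)
    sorted = sort(data)
    n = length(sorted)
    n >= nbins || error("need at least as many samples as bins")
    edges = Float64[sorted[1]]
    for k in 1:nbins-1
        i = div(k * n, nbins)   # index of the last sample in the k-th group
        push!(edges, (sorted[i] + sorted[i + 1]) / 2)
    end
    push!(edges, sorted[end])
    return edges
end

skewed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
width_edges = uniform_width_edges(skewed, 3)
count_edges = uniform_count_edges(skewed, 3)
```

On this skewed sample the equal-width edges leave eight of the nine values in the first bin, while the equal-count edges put three values in each bin.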

In [11]:
nbins = 3
data  = randn(1000)
edges = binedges(DiscretizeUniformWidth(nbins), data)
Out[11]:
4-element Array{Float64,1}:
 -2.85658 
 -0.598978
  1.65863 
  3.91623 
In [12]:
using PGFPlots
using Distributions

# draw samples from a mixture of Cauchy distributions and
# keep only the values within [-15, 15]
srand(0)
data = [rand(Cauchy(-5, 1.8), 500);
        rand(Cauchy(-4, 0.8), 2000);
        rand(Cauchy(-1, 0.3), 500);
        rand(Cauchy( 2, 0.8), 1000);
        rand(Cauchy( 4, 1.5), 500)]
data = filter!(x->-15.0 <= x <= 15.0, data)

g = GroupPlot(3, 1, groupStyle = "horizontal sep = 1.75cm")

discalgs = [("Uniform Width", DiscretizeUniformWidth(15)),
            ("Uniform Count", DiscretizeUniformCount(15)),
            ("Bayesian Blocks", DiscretizeBayesianBlocks())]

for (name, discalg) in discalgs
    disc = LinearDiscretizer(binedges(discalg, data))
    counts = get_discretization_counts(disc, data)    
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"), 
          ymin=0, xlabel="x", ylabel="pdf(x)", title=name))
end

g
Out[12]:

Automatically Determine Number of Uniform-Width Bins

Several algorithms exist for determining the number of uniform-width bins. Pass the algorithm name as a symbol to DiscretizeUniformWidth.

In [13]:
g = GroupPlot(3, 3, groupStyle = "horizontal sep = 1.75cm, vertical sep = 1.5cm")

discalgs = [:sqrt, # used by Excel and others for its simplicity and speed
            :sturges, # R's default method, only good for near-Gaussian data
            :rice, # commonly overestimates the number of bins required
            :doane, # improves Sturges’ for non-normal datasets.
            :scott, # less robust estimator that takes into account data variability and data size.
            :fd, # Freedman Diaconis Estimator, robust
            :auto, # max between :fd and :sturges. Good all-round performance
            ]

for discalg in discalgs
    disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))
    counts = get_discretization_counts(disc, data)    
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"), 
          ymin=0, title=string(discalg)))
end

g
Out[13]:
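
For reference, the three simplest rules depend only on the sample size n. The sketch below uses their textbook definitions in plain Julia; the helper names are hypothetical, and the package's own implementation may differ in rounding or edge cases.

```julia
# Hypothetical helpers using the textbook formulas, not the package's code.
nbins_sqrt(n)    = ceil(Int, sqrt(n))        # square-root rule
nbins_sturges(n) = ceil(Int, log2(n)) + 1    # Sturges' rule
nbins_rice(n)    = ceil(Int, 2 * cbrt(n))    # Rice rule

n = 500
rule_counts = (nbins_sqrt(n), nbins_sturges(n), nbins_rice(n))
```

The remaining rules (:doane, :scott, :fd) also use statistics of the data, such as its spread or skewness, so they cannot be computed from n alone.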

An additional algorithm, MODL, finds optimal bins given both a continuous data set and a labelled discrete data set.

In [14]:
data = [randn(100); randn(100) .+ 1.0]
labels = [fill(:cat, 100); fill(:dog, 100)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)
Out[14]:
4-element Array{AbstractFloat,1}:
 -2.58175 
 -0.229589
  1.87765 
  2.6983  

More information on MODL can be found here.
