Discretizers¶

This package supports discretization methods and mapping functions.

Installation¶

In [ ]:
Pkg.add("Discretizers")

Once the installation is complete, you can load the package from any Julia session by running

In :
using Discretizers

Discretization¶

Categorical Labels¶

You can construct an object for mapping labels to integer indices

In :
data = [:cat, :dog, :dog, :cat, :cat, :elephant]
catdisc = CategoricalDiscretizer(data);

The resulting object encodes your source labels as their integer indices

In :
println(":cat becomes: ", encode(catdisc, :cat))
println(":dog becomes: ", encode(catdisc, :dog))
println("data becomes: ", encode(catdisc, data))
:cat becomes: 1
:dog becomes: 2
data becomes: [1,2,2,1,1,3]

You can also transform back

In :
println("1 becomes: ", decode(catdisc, 1))
println("2 becomes: ", decode(catdisc, 2))
println("[1,2,3] becomes: ", decode(catdisc, [1,2,3]))
1 becomes: cat
2 becomes: dog
[1,2,3] becomes: [:cat,:dog,:elephant]
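
To make the two-way mapping concrete, here is a minimal sketch of how such a label-to-index map could be built with a pair of dictionaries. It is not the package's implementation: `LabelMap`, `label_to_index` and `index_to_label` are hypothetical names, the code targets Julia 1.x, and it assumes indices are assigned in order of first appearance, which is consistent with the output above.

```julia
# Hypothetical stand-in for a categorical discretizer (not the package's code):
# one dictionary per direction.
struct LabelMap{T}
    to_index::Dict{T,Int}
    to_label::Dict{Int,T}
end

function LabelMap(data::AbstractVector{T}) where {T}
    to_index = Dict{T,Int}()
    to_label = Dict{Int,T}()
    for x in data
        if !haskey(to_index, x)
            i = length(to_index) + 1   # next unused index, in order of first appearance
            to_index[x] = i
            to_label[i] = x
        end
    end
    return LabelMap{T}(to_index, to_label)
end

# Forward and backward lookups; both throw a KeyError for unknown inputs.
label_to_index(m::LabelMap, x) = m.to_index[x]
index_to_label(m::LabelMap, i::Integer) = m.to_label[i]

labels = [:cat, :dog, :dog, :cat, :cat, :elephant]
m = LabelMap(labels)
println([label_to_index(m, x) for x in labels])   # prints [1, 2, 2, 1, 1, 3]
println(index_to_label(m, 3))                     # prints elephant
```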

The CategoricalDiscretizer works with any object type

In :
CategoricalDiscretizer(["A", "B", "C"])
CategoricalDiscretizer([5000, 1200, 100])
CategoricalDiscretizer([:dog, "hello world", NaN]);

Linear Discretization¶

Linear discretization into a series of bins is supported as well

Here we construct a linear discretizer that maps $[0,0.5) \rightarrow 1$ and $[0.5,1] \rightarrow 2$

In :
bin_edges = [0.0,0.5,1.0]
lindisc = LinearDiscretizer(bin_edges);

Encoding works the same way

In :
println("0.2 becomes: ", encode(lindisc, 0.2))
println("0.7 becomes: ", encode(lindisc, 0.7))
println("0.5 becomes: ", encode(lindisc, 0.5))
println("it works on arrays: ", encode(lindisc, [0.0,0.8,0.2]))
0.2 becomes: 1
0.7 becomes: 2
0.5 becomes: 2
it works on arrays: [1,2,1]
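
The bin lookup shown above can be sketched with a binary search over the sorted edges. This is an illustration only, written for Julia 1.x: `encode_linear` is a hypothetical helper, and raising an error for out-of-range values is an assumption of the sketch, not a statement about how the package behaves.

```julia
# Hypothetical helper (not the package's code): map x to the index of its bin.
# Bins are half-open [edges[i], edges[i+1]), except the last, which is closed.
function encode_linear(edges::AbstractVector{<:Real}, x::Real)
    edges[1] <= x <= edges[end] ||
        throw(DomainError(x, "value lies outside the bin edges"))
    # searchsortedlast returns the index of the last edge that is <= x
    i = searchsortedlast(edges, x)
    # x == edges[end] would give length(edges); fold it into the last bin
    return min(i, length(edges) - 1)
end

edges = [0.0, 0.5, 1.0]
println([encode_linear(edges, x) for x in (0.2, 0.7, 0.5, 1.0)])   # prints [1, 2, 2, 2]
```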

Decoding is a bit different: the index selects a bin, and a value is sampled uniformly from within that bin, so repeated calls return different values

In :
println("1 becomes: ", decode(lindisc, 1))
println("2 becomes: ", decode(lindisc, 2))
println("it works on arrays: ", decode(lindisc, [2,1,2]))
1 becomes: 0.1887493587068908
2 becomes: 0.5460050358882282
it works on arrays: [0.8104340883570698,0.454060686695556,0.9457654175718269]
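
Sampling uniformly from a bin amounts to scaling a uniform random number to the bin's width. A minimal sketch for Julia 1.x, with `decode_linear` as a hypothetical name rather than the package's function:

```julia
# Hypothetical helper (not the package's code): draw a value uniformly
# from bin i, whose bounds are edges[i] and edges[i + 1].
function decode_linear(edges::AbstractVector{<:Real}, i::Integer)
    lo, hi = edges[i], edges[i + 1]
    return lo + rand() * (hi - lo)   # rand() is uniform on [0, 1)
end

edges = [0.0, 0.5, 1.0]
x = decode_linear(edges, 2)
println(0.5 <= x < 1.0)   # prints true
```

Because the draw is random, encoding followed by decoding recovers only the bin, not the original value.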

Some other functions are supported

In :
println("number of labels: ", nlabels(catdisc), "  ", nlabels(lindisc))
println("bin centers:      ", bincenters(lindisc))
println("extrema of a bin: ", extrema(lindisc, 2))
number of labels: 3  2
bin centers:      [0.25,0.75]
extrema of a bin: (0.5,1.0)

Both discretizers can be constructed to map to other integer types

In :
catdisc = CategoricalDiscretizer(data, Int32)
lindisc = LinearDiscretizer(bin_edges, UInt8)
encode(lindisc, 0.2)
Out:
0x01

Discretization Algorithms¶

In many cases one would like to determine the bin edges for a LinearDiscretizer automatically from data. This package supports several algorithms to do just that.

• Uniform Width

DiscretizeUniformWidth(nbins) - divide the domain evenly into nbins

• Uniform Count

DiscretizeUniformCount(nbins) - divide the domain into nbins where each bin has approximately equal count

• Bayesian Blocks

DiscretizeBayesianBlocks() - determines the number and placement of bins by maximizing a fitness function that includes a Bayesian prior on the number of bins. See this website for an overview.
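
To illustrate the first two strategies, the sketch below computes uniform-width and uniform-count edges directly, without the package. The function names are hypothetical and the code targets Julia 1.x. Placing each interior uniform-count edge midway between neighbouring sorted samples is an assumption of this sketch and may differ from what the package does.

```julia
# Hypothetical helpers (not the package's code).

# Uniform width: nbins + 1 evenly spaced edges from the minimum to the maximum.
function uniform_width_edges(data::AbstractVector{<:Real}, nbins::Integer)
    lo, hi = extrema(data)
    return collect(range(lo, stop=hi, length=nbins + 1))
end

# Uniform count: edges chosen so each bin holds roughly length(data) / nbins samples.
function uniform_count_edges(data::AbstractVector{<:Real}, nbins::Integer)
    length(data) >= nbins || throw(ArgumentError("need at least nbins samples"))
    sorted = sort(data)
    n = length(sorted)
    edges = Vector{Float64}(undef, nbins + 1)
    edges[1] = sorted[1]
    edges[end] = sorted[end]
    for k in 1:nbins-1
        j = round(Int, k * n / nbins)                   # samples at or below the k-th interior edge
        edges[k + 1] = (sorted[j] + sorted[j + 1]) / 2  # midpoint between neighbouring samples
    end
    return edges
end

println(uniform_width_edges([0.0, 1.0, 3.0], 3))       # prints [0.0, 1.0, 2.0, 3.0]
println(uniform_count_edges(collect(1.0:12.0), 3))     # prints [1.0, 4.5, 8.5, 12.0]
```

With heavily repeated values the uniform-count sketch can produce duplicate edges, so a production implementation would need to handle ties.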

In :
nbins = 3
data  = randn(1000)
edges = binedges(DiscretizeUniformWidth(nbins), data)
Out:
4-element Array{Float64,1}:
-2.85658
-0.598978
1.65863
3.91623
In :
using PGFPlots
using Distributions

# draw a set of variables and
# filter values to a reasonable range
srand(0)
data = [rand(Cauchy(-5, 1.8), 500);
        rand(Cauchy(-4, 0.8), 2000);
        rand(Cauchy(-1, 0.3), 500);
        rand(Cauchy( 2, 0.8), 1000);
        rand(Cauchy( 4, 1.5), 500)]
data = filter!(x->-15.0 <= x <= 15.0, data)

g = GroupPlot(3, 1, groupStyle = "horizontal sep = 1.75cm")

discalgs = [("Uniform Width", DiscretizeUniformWidth(15)),
            ("Uniform Count", DiscretizeUniformCount(15)),
            ("Bayesian Blocks", DiscretizeBayesianBlocks())]

for (name, discalg) in discalgs
    disc = LinearDiscretizer(binedges(discalg, data))
    counts = get_discretization_counts(disc, data)
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"),
                  ymin=0, xlabel="x", ylabel="pdf(x)", title=name))
end

g
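Out:
[Figure: three side-by-side density histograms of data, titled Uniform Width, Uniform Count, and Bayesian Blocks]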

Automatically Determine Number of Uniform-Width Bins¶

Several algorithms exist for determining the number of uniform-width bins. Pass the algorithm's name as a symbol to DiscretizeUniformWidth.

In :
g = GroupPlot(3, 3, groupStyle = "horizontal sep = 1.75cm, vertical sep = 1.5cm")

discalgs = [:sqrt,    # used by Excel and others for its simplicity and speed
            :sturges, # R's default method, only good for near-Gaussian data
            :rice,    # commonly overestimates the number of bins required
            :doane,   # improves on Sturges' rule for non-normal datasets
            :scott,   # less robust estimator that takes into account data variability and data size
            :fd,      # Freedman-Diaconis estimator, robust
            :auto,    # max of :fd and :sturges; good all-round performance
           ]

for discalg in discalgs
    disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))
    counts = get_discretization_counts(disc, data)
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"),
                  ymin=0, title=string(discalg)))
end

g
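Out:
[Figure: grid of density histograms of data, one per bin-count rule, titled sqrt, sturges, rice, doane, scott, fd, and auto]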
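
For a sense of scale, the three simplest rules depend only on the sample size n. The sketch below uses their common textbook forms; the function names are hypothetical, the code targets Julia 1.x, and the package's exact formulas and rounding may differ.

```julia
# Textbook bin-count rules (hypothetical helpers, not the package's code).
nbins_sqrt(n::Integer)    = ceil(Int, sqrt(n))        # square-root rule
nbins_sturges(n::Integer) = ceil(Int, log2(n)) + 1    # Sturges' rule
nbins_rice(n::Integer)    = ceil(Int, 2 * cbrt(n))    # Rice rule

n = 500
println("sqrt: ", nbins_sqrt(n), "  sturges: ", nbins_sturges(n), "  rice: ", nbins_rice(n))
# prints sqrt: 23  sturges: 10  rice: 16
```

The remaining rules (:doane, :scott, :fd) also use the spread or shape of the data, so they cannot be computed from n alone.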

A further algorithm, MODL, finds optimal bins given both a continuous data set and a corresponding set of discrete labels.

In :
data = [randn(100); randn(100) .+ 1.0]
labels = [fill(:cat, 100); fill(:dog, 100)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)
Out:
4-element Array{AbstractFloat,1}:
-2.58175
-0.229589
1.87765
2.6983

More information on MODL can be found here.
