Learning from continuous data is almost identical to learning from discrete data. Instead of computing local scores with BDeu scoring, we use BGe scoring, which implicitly assumes the data is sampled from some (unknown) multivariate Gaussian distribution.
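To make the Gaussian assumption concrete, here is a minimal sketch (using only numpy, not pygobnilp) that samples data from a tiny linear Gaussian network and writes it out. The file name toy_gaussian.dat and the whitespace-separated layout with variable names in the first row are illustrative assumptions, not part of pygobnilp's documented behaviour.

import numpy as np

# Illustrative only: sample from a tiny linear Gaussian network
# A -> C <- B, i.e. C = 2*A - B + noise, so BGe's Gaussian
# assumption holds by construction.
rng = np.random.default_rng(0)
n = 1000
A = rng.normal(0, 1, n)
B = rng.normal(0, 1, n)
C = 2 * A - B + rng.normal(0, 0.5, n)

# Write a whitespace-separated file with variable names in the
# first row (an assumed layout for continuous data files).
with open('toy_gaussian.dat', 'w') as f:
    f.write('A B C\n')
    for row in zip(A, B, C):
        f.write(' '.join(f'{x:.6f}' for x in row) + '\n')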
First we import the Gobnilp class and create a Gobnilp object as usual.
from pygobnilp.gobnilp import Gobnilp
m = Gobnilp()
Using license file /home/james/gurobi.lic
Academic license - for non-commercial use only
Changed value of parameter PreCrush to 1
   Prev: 0  Min: 0  Max: 1  Default: 0
Changed value of parameter CutPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter GomoryPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter MIPFocus to 2
   Prev: 0  Min: 0  Max: 3  Default: 0
Changed value of parameter ZeroHalfCuts to 2
   Prev: -1  Min: -1  Max: 2  Default: -1
Changed value of parameter MIPGap to 0.0
   Prev: 0.0001  Min: 0.0  Max: inf  Default: 0.0001
Changed value of parameter MIPGapAbs to 0.0
   Prev: 1e-10  Min: 0.0  Max: inf  Default: 1e-10
The method learn does everything we need: it reads in the data, computes local scores, creates a MIP model and then solves it. However, we have to explicitly declare that we are reading continuous data and using BGe scoring (since the default is discrete data with BDeu scoring). Here we use the data file gaussian.dat. This data is called 'gaussian.test' in bnlearn.
m.learn('gaussian.dat', data_type='continuous', score='BGe')
**********
BN has score -54052.41081344564
**********
A<- -7124.782936593152
B<- -12656.351445396509
C<-A,B -3743.043565645632
D<-B -1548.939409177594
E<-C,F,G -7312.558564066843
F<-A,D,G -11136.569374437633
G<- -10530.165518128275
**********
bnlearn modelstring = [A][B][C|B:A][D|B][E|F:C:G][F|D:A:G][G]
**********
CPDAG:
Vertices: A,B,C,D,E,F,G
A->C
A->F
B->C
B-D
C->E
D->F
F->E
G->E
G->F
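The printed summary is convenient, but we can also inspect the result programmatically. The sketch below assumes the Gobnilp object exposes the learned network via a learned_bn attribute (a networkx DiGraph subclass in pygobnilp); consult the pygobnilp API documentation if your version differs.

# A sketch for inspecting the result programmatically; assumes the
# learned network is available as `m.learned_bn`, a networkx
# DiGraph subclass, so standard networkx methods apply.
bn = m.learned_bn
print('Arrows in the learned DAG:')
for parent, child in bn.edges():
    print(f'{parent} -> {child}')
print('Parents of F:', sorted(bn.predecessors('F')))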
bnlearn's hill-climbing algorithm actually finds a higher-scoring network from the same data. The problem is that we have been using the default parent set size limit of 3. This default exists to keep learning practical on datasets with many variables, but here we have only 7 variables, so truly optimal learning is easy and we can run with no limit on parent set size.
m = Gobnilp()
m.learn('gaussian.dat', data_type='continuous', score='BGe', palim=None)
Changed value of parameter PreCrush to 1
   Prev: 0  Min: 0  Max: 1  Default: 0
Changed value of parameter CutPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter GomoryPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter MIPFocus to 2
   Prev: 0  Min: 0  Max: 3  Default: 0
Changed value of parameter ZeroHalfCuts to 2
   Prev: -1  Min: -1  Max: 2  Default: -1
Changed value of parameter MIPGap to 0.0
   Prev: 0.0001  Min: 0.0  Max: inf  Default: 0.0001
Changed value of parameter MIPGapAbs to 0.0
   Prev: 1e-10  Min: 0.0  Max: inf  Default: 1e-10
**********
BN has score -53258.94161814058
**********
A<- -7124.782936593152
B<- -12656.351445396509
C<-A,B -3743.043565645632
D<-B -1548.939409177594
E<- -10545.851006239516
F<-A,D,E,G -7109.807736959905
G<- -10530.165518128275
**********
bnlearn modelstring = [A][B][C|B:A][D|B][E][F|G:D:A:E][G]
**********
CPDAG:
Vertices: A,B,C,D,E,F,G
A->C
A->F
B->C
B-D
D->F
E->F
G->F
OK, that's a better network. In fact, bnlearn's hill-climbing also learns a network with this score.
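If we want to see how the parent set limit trades solving effort against score, a simple sweep over palim values works. This sketch uses only the calls demonstrated above and relies on learn printing each optimal network's score.

from pygobnilp.gobnilp import Gobnilp

# Sweep the parent set limit; learn() prints the score of each
# optimal network, so the printed summaries can be compared directly.
for palim in (1, 2, 3, None):
    print('palim =', palim)
    m = Gobnilp()
    m.learn('gaussian.dat', data_type='continuous', score='BGe', palim=palim)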