Learning from continuous data is almost identical to learning from discrete data. Instead of computing local scores with BDeu scoring, we use BGe scoring, which implicitly assumes the data is sampled from some (unknown) multivariate Gaussian distribution.
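To make the Gaussian assumption concrete, here is a minimal sketch (using only numpy, not pygobnilp) that samples data from a tiny linear Gaussian network and writes it out. The file name toy_gaussian.dat and the whitespace-separated layout with variable names in the first row are illustrative assumptions, not part of pygobnilp's documented behaviour.

import numpy as np

# Illustrative only: sample from a tiny linear Gaussian network
# A -> C <- B, i.e. C = 2*A - B + noise, so BGe's Gaussian
# assumption holds by construction.
rng = np.random.default_rng(0)
n = 1000
A = rng.normal(0, 1, n)
B = rng.normal(0, 1, n)
C = 2 * A - B + rng.normal(0, 0.5, n)

# Write a whitespace-separated file with variable names in the
# first row (an assumed layout for continuous data files).
with open('toy_gaussian.dat', 'w') as f:
    f.write('A B C\n')
    for row in zip(A, B, C):
        f.write(' '.join(f'{x:.6f}' for x in row) + '\n')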
First we import the Gobnilp class and create a Gobnilp object as usual.
from pygobnilp.gobnilp import Gobnilp
m = Gobnilp()
Using license file /home/james/gurobi.lic
Academic license - for non-commercial use only
Changed value of parameter PreCrush to 1
   Prev: 0  Min: 0  Max: 1  Default: 0
Changed value of parameter CutPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter GomoryPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter MIPFocus to 2
   Prev: 0  Min: 0  Max: 3  Default: 0
Changed value of parameter ZeroHalfCuts to 2
   Prev: -1  Min: -1  Max: 2  Default: -1
Changed value of parameter MIPGap to 0.0
   Prev: 0.0001  Min: 0.0  Max: inf  Default: 0.0001
Changed value of parameter MIPGapAbs to 0.0
   Prev: 1e-10  Min: 0.0  Max: inf  Default: 1e-10
The method learn does everything we need: it reads in the data, computes local scores, creates a MIP model and then solves it. However, we have to explicitly declare that we are reading continuous data and using BGe scoring (since the default is discrete data with BDeu scoring). Here we use the data file gaussian.dat. This data is called 'gaussian.test' in bnlearn.
m.learn('gaussian.dat', data_type='continuous', score='BGe')
**********
BN has score -54052.41081344564
**********
A<- -7124.782936593152
B<- -12656.351445396509
C<-A,B -3743.043565645632
D<-B -1548.939409177594
E<-C,F,G -7312.558564066843
F<-A,D,G -11136.569374437633
G<- -10530.165518128275
**********
bnlearn modelstring = [A][B][C|B:A][D|B][E|F:C:G][F|D:A:G][G]
**********
CPDAG:
Vertices: A,B,C,D,E,F,G
A->C
A->F
B->C
B-D
C->E
D->F
F->E
G->E
G->F
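The printed summary is convenient, but we can also inspect the result programmatically. The sketch below assumes the Gobnilp object exposes the learned network via a learned_bn attribute (a networkx DiGraph subclass in pygobnilp); consult the pygobnilp API documentation if your version differs.

# A sketch for inspecting the result programmatically; assumes the
# learned network is available as `m.learned_bn`, a networkx
# DiGraph subclass, so standard networkx methods apply.
bn = m.learned_bn
print('Arrows in the learned DAG:')
for parent, child in bn.edges():
    print(f'{parent} -> {child}')
print('Parents of F:', sorted(bn.predecessors('F')))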
bnlearn's hill-climbing algorithm actually finds a higher-scoring network from the same data. The problem is that we have been using the default parent set size limit of 3. This default exists to keep learning practical on datasets with many variables, but here we have only 7 variables, so truly optimal learning is easy and we can run with no limit on parent set size.
m = Gobnilp()
m.learn('gaussian.dat', data_type='continuous', score='BGe', palim=None)
Changed value of parameter PreCrush to 1
   Prev: 0  Min: 0  Max: 1  Default: 0
Changed value of parameter CutPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter GomoryPasses to 100000
   Prev: -1  Min: -1  Max: 2000000000  Default: -1
Changed value of parameter MIPFocus to 2
   Prev: 0  Min: 0  Max: 3  Default: 0
Changed value of parameter ZeroHalfCuts to 2
   Prev: -1  Min: -1  Max: 2  Default: -1
Changed value of parameter MIPGap to 0.0
   Prev: 0.0001  Min: 0.0  Max: inf  Default: 0.0001
Changed value of parameter MIPGapAbs to 0.0
   Prev: 1e-10  Min: 0.0  Max: inf  Default: 1e-10
**********
BN has score -53258.94161814058
**********
A<- -7124.782936593152
B<- -12656.351445396509
C<-A,B -3743.043565645632
D<-B -1548.939409177594
E<- -10545.851006239516
F<-A,D,E,G -7109.807736959905
G<- -10530.165518128275
**********
bnlearn modelstring = [A][B][C|B:A][D|B][E][F|G:D:A:E][G]
**********
CPDAG:
Vertices: A,B,C,D,E,F,G
A->C
A->F
B->C
B-D
D->F
E->F
G->F
OK, that's a better network. In fact, bnlearn's hill-climbing also learns a network with this score.
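If we want to see how the parent set limit trades solving effort against score, a simple sweep over palim values works. This sketch uses only the calls demonstrated above and relies on learn printing each optimal network's score.

from pygobnilp.gobnilp import Gobnilp

# Sweep the parent set limit; learn() prints the score of each
# optimal network, so the printed summaries can be compared directly.
for palim in (1, 2, 3, None):
    print('palim =', palim)
    m = Gobnilp()
    m.learn('gaussian.dat', data_type='continuous', score='BGe', palim=palim)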