If you deal with machine learning on the JVM you should not forget about good old Weka. It was primarily designed to be used through its Swing GUI, but it can also be used as an API. In terms of documentation I can recommend the manual and the Javadoc.
In my demo I use the NaiveBayesMultinomial classifier with the iris dataset, which is loaded directly from the Internet.
First we add the necessary Maven dependency:
%%classpath add mvn
nz.ac.waikato.cms.weka:weka-stable:3.8.3
We can use the CSVLoader to load the data from the Internet. We set the index of the class (category) attribute, which holds the classification result. Finally we reshuffle the data. The dataset contains 150 records.
import weka.core.converters.CSVLoader
import java.net.URL

var url = new URL("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
var loader = new CSVLoader()
loader.setSource(url.openStream)
var dataSet = loader.getDataSet()
dataSet.setClassIndex(4)
// We could use Collections.shuffle(dataSet) or use Weka's resample functionality
dataSet = dataSet.resample(new java.util.Random())
dataSet.size
150
dataSet.getClass.getName
weka.core.Instances
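If you want the shuffle to be reproducible across runs, you can pass a fixed seed to the Random. A small sketch (the seed value 42 is an arbitrary example):

```scala
// Resample with a fixed seed so the shuffle - and therefore the
// train/test split derived from it - is identical on every run.
dataSet = dataSet.resample(new java.util.Random(42))
```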
Just to double-check the data, we print the first 10 records:
import scala.collection.JavaConversions._
dataSet.subList(0,10).foreach(println(_))
5,3.4,1.5,0.2,Setosa
6.7,3,5,1.7,Versicolor
6.2,2.2,4.5,1.5,Versicolor
4.9,2.4,3.3,1,Versicolor
5.7,3.8,1.7,0.3,Setosa
6.7,3.3,5.7,2.1,Virginica
6.7,3.1,4.7,1.5,Versicolor
5.5,2.4,3.7,1,Versicolor
5.5,2.6,4.4,1.2,Versicolor
5,3.6,1.4,0.2,Setosa
null
We want to split the data into a training and a testing dataset. We will use 90% of the data for training, so we calculate the number of training records:
(dataSet.size * 0.9).toInt
135
With this we can split the data into new Instances objects:
import weka.core.Instances;
var trainingDataSet = new Instances(dataSet,0,135)
var testingDataSet = new Instances(dataSet,135,15)
dataSet.size + " = " +trainingDataSet.size + " / " + testingDataSet.size
150 = 135 / 15
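As a sketch of an alternative, the weka.core.Instances class also offers the trainCV and testCV methods, which split the data into cross-validation folds. With 10 folds, fold 0 gives roughly the same 90/10 split:

```scala
// Alternative split via cross-validation folds from weka.core.Instances.
// With 10 folds, each test fold holds about 10% of the data.
var cvTrain = dataSet.trainCV(10, 0) // 9 folds -> ~135 training records
var cvTest  = dataSet.testCV(10, 0)  // 1 fold  -> ~15 testing records
```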
We create a new NaiveBayesMultinomial object and train it by calling its buildClassifier method with the training data. The classifier prints some basic information about the model:
import weka.classifiers.bayes.NaiveBayesMultinomial;
var classifier = new NaiveBayesMultinomial()
classifier.buildClassifier(trainingDataSet)
classifier
The independent probability of a class
--------------------------------------
Setosa        0.36
Versicolor    0.33
Virginica     0.31

The probability of a word given the class
-----------------------------------------
              Setosa  Versicolor  Virginica
sepal.length  0.49    0.41        0.38
sepal.width   0.34    0.19        0.18
petal.length  0.14    0.3         0.32
petal.width   0.03    0.09        0.12
Finally we double-check how well our classifier performs by testing it with our test data:
import weka.classifiers.Evaluation;
var eval = new Evaluation(trainingDataSet)
eval.evaluateModel(classifier, testingDataSet)
eval.toSummaryString()
Correctly Classified Instances          15      100      %
Incorrectly Classified Instances         0        0      %
Kappa statistic                          1
Mean absolute error                      0.2738
Root mean squared error                  0.3341
Relative absolute error                 61.5984 %
Root relative squared error             70.7939 %
Total Number of Instances               15
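With only 15 test records this is a rather small sample. A more robust estimate is 10-fold cross-validation, which the Evaluation class supports directly. A sketch (note that crossValidateModel re-trains its own copies of the classifier for every fold, so we pass a fresh instance):

```scala
import weka.classifiers.Evaluation
import weka.classifiers.bayes.NaiveBayesMultinomial

// 10-fold cross-validation over the full dataset; the classifier argument
// serves as a template and is re-trained for each of the 10 folds.
var cvEval = new Evaluation(dataSet)
cvEval.crossValidateModel(new NaiveBayesMultinomial(), dataSet, 10, new java.util.Random(1))
println(cvEval.toSummaryString())
```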
Finally I demonstrate how you can process new data, because this is a little bit tricky. You need to create a new DenseInstance with the attributes from the original dataset and populate the numerical input values.
The prediction instance needs to be assigned a dataset! We use the original dataset in our example, but we could also use the training set.
Then we use the classifier's classifyInstance method to predict the numerical class value, which needs to be converted back to a String with the help of the value function on the attribute:
import weka.core.DenseInstance

for (rec <- testingDataSet) {
  print(rec)
  // Build a fresh instance carrying the four numerical input attributes
  var predict = new DenseInstance(dataSet.numAttributes())
  predict.setValue(0, rec.value(0))
  predict.setValue(1, rec.value(1))
  predict.setValue(2, rec.value(2))
  predict.setValue(3, rec.value(3))
  // The instance must be assigned a dataset so Weka knows the attribute structure
  predict.setDataset(dataSet)
  var index = classifier.classifyInstance(predict).toInt
  var className = trainingDataSet.attribute(4).value(index)
  println(" -> " + className)
}
5.7,2.8,4.1,1.3,Versicolor -> Virginica
5.9,3.2,4.8,1.8,Versicolor -> Virginica
6.5,3,5.2,2,Virginica -> Virginica
6,2.2,4,1,Versicolor -> Virginica
6.2,2.8,4.8,1.8,Virginica -> Virginica
6.7,2.5,5.8,1.8,Virginica -> Virginica
6,2.7,5.1,1.6,Versicolor -> Virginica
4.6,3.2,1.4,0.2,Setosa -> Setosa
5,3,1.6,0.2,Setosa -> Setosa
5.6,3,4.1,1.3,Versicolor -> Virginica
6.5,3,5.5,1.8,Virginica -> Virginica
7.7,3.8,6.7,2.2,Virginica -> Virginica
6.7,3,5.2,2.3,Virginica -> Virginica
4.3,3,1.1,0.1,Setosa -> Setosa
6.4,3.1,5.5,1.8,Virginica -> Virginica
null
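If you want to reuse the trained classifier later without retraining, Weka's SerializationHelper can write it to disk and read it back. A minimal sketch (the file name iris-nbm.model is just an example):

```scala
import weka.core.SerializationHelper
import weka.classifiers.bayes.NaiveBayesMultinomial

// Persist the trained model to disk...
SerializationHelper.write("iris-nbm.model", classifier)
// ...and load it back later, casting to the concrete classifier type.
var restored = SerializationHelper.read("iris-nbm.model").asInstanceOf[NaiveBayesMultinomial]
```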
Finally - just for our peace of mind - we double-check that we didn't change our original dataSet:
dataSet.size
150