If you deal with machine learning on the JVM you should not forget about good old Weka. It was primarily designed to be used through its Swing GUI, but it can also be used as an API. In terms of documentation I can recommend the manual and the Javadoc.
In my demo I use the NaiveBayesMultinomial classifier with the iris dataset, which is loaded directly from the Internet.
First we add the necessary Maven dependency:
%%classpath add mvn
nz.ac.waikato.cms.weka:weka-stable:3.8.3
We can use the CSVLoader to load the data from the Internet. We set the index of the class (category) attribute, which holds the classification result. Finally we reshuffle the data. The dataset contains 150 records.
import weka.core.converters.CSVLoader
import java.net.URL

var url = new URL("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
var loader = new CSVLoader()
loader.setSource(url.openStream)
var dataSet = loader.getDataSet()
dataSet.setClassIndex(4)
// We could use Collections.shuffle(dataSet) or use Weka's resample functionality
dataSet = dataSet.resample(new java.util.Random())
dataSet.size
150
dataSet.getClass.getName
weka.core.Instances
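If you want the shuffle to be reproducible across runs, you can pass a fixed seed to the Random. A small sketch (the seed value 42 is an arbitrary example):

```scala
// Resample with a fixed seed so the shuffle - and therefore the
// train/test split derived from it - is identical on every run.
dataSet = dataSet.resample(new java.util.Random(42))
```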
Just to double-check the data, we print the first 10 records:
import scala.collection.JavaConversions._
dataSet.subList(0,10).foreach(println(_))
5,3.4,1.5,0.2,Setosa
6.7,3,5,1.7,Versicolor
6.2,2.2,4.5,1.5,Versicolor
4.9,2.4,3.3,1,Versicolor
5.7,3.8,1.7,0.3,Setosa
6.7,3.3,5.7,2.1,Virginica
6.7,3.1,4.7,1.5,Versicolor
5.5,2.4,3.7,1,Versicolor
5.5,2.6,4.4,1.2,Versicolor
5,3.6,1.4,0.2,Setosa
null
We want to split the data into a training and a testing dataset. We will use 90% of the data for training, so we calculate the number of training records:
(dataSet.size * 0.9).toInt
135
With this we can split the data into new Instances objects:
import weka.core.Instances;
var trainingDataSet = new Instances(dataSet,0,135)
var testingDataSet = new Instances(dataSet,135,15)
dataSet.size + " = " +trainingDataSet.size + " / " + testingDataSet.size
150 = 135 / 15
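As a sketch of an alternative, the weka.core.Instances class also offers the trainCV and testCV methods, which split the data into cross-validation folds. With 10 folds, fold 0 gives roughly the same 90/10 split:

```scala
// Alternative split via cross-validation folds from weka.core.Instances.
// With 10 folds, each test fold holds about 10% of the data.
var cvTrain = dataSet.trainCV(10, 0) // 9 folds -> ~135 training records
var cvTest  = dataSet.testCV(10, 0)  // 1 fold  -> ~15 testing records
```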
We create a new NaiveBayesMultinomial object and train it by calling its buildClassifier method with the training data. The classifier prints some basic information about the model:
import weka.classifiers.bayes.NaiveBayesMultinomial;
var classifier = new NaiveBayesMultinomial()
classifier.buildClassifier(trainingDataSet)
classifier
The independent probability of a class
--------------------------------------
Setosa        0.36
Versicolor    0.33
Virginica     0.31

The probability of a word given the class
-----------------------------------------
              Setosa  Versicolor  Virginica
sepal.length  0.49    0.41        0.38
sepal.width   0.34    0.19        0.18
petal.length  0.14    0.3         0.32
petal.width   0.03    0.09        0.12
Finally we double-check how well our classifier performs by testing it with our test data:
import weka.classifiers.Evaluation;
var eval = new Evaluation(trainingDataSet)
eval.evaluateModel(classifier, testingDataSet)
eval.toSummaryString()
Correctly Classified Instances          15      100      %
Incorrectly Classified Instances         0        0      %
Kappa statistic                          1
Mean absolute error                      0.2738
Root mean squared error                  0.3341
Relative absolute error                 61.5984 %
Root relative squared error             70.7939 %
Total Number of Instances               15
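With only 15 test records this is a rather small sample. A more robust estimate is 10-fold cross-validation, which the Evaluation class supports directly. A sketch (note that crossValidateModel re-trains its own copies of the classifier for every fold, so we pass a fresh instance):

```scala
import weka.classifiers.Evaluation
import weka.classifiers.bayes.NaiveBayesMultinomial

// 10-fold cross-validation over the full dataset; the classifier argument
// serves as a template and is re-trained for each of the 10 folds.
var cvEval = new Evaluation(dataSet)
cvEval.crossValidateModel(new NaiveBayesMultinomial(), dataSet, 10, new java.util.Random(1))
println(cvEval.toSummaryString())
```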
Finally I demonstrate how you can process new data, because this is a little bit tricky. You need to create a new DenseInstance with the attributes from the original dataset and populate the numerical input values.
The prediction instance needs to be assigned a dataset! We use the original dataset in our example, but we could also use the training set.
Then we use the classifier's classifyInstance method to predict the numerical class value, which needs to be converted back to a String with the help of the value function on the attribute:
import weka.core.DenseInstance

for (rec <- testingDataSet) {
  print(rec)
  // Build a fresh instance carrying the four numerical input attributes
  var predict = new DenseInstance(dataSet.numAttributes())
  predict.setValue(0, rec.value(0))
  predict.setValue(1, rec.value(1))
  predict.setValue(2, rec.value(2))
  predict.setValue(3, rec.value(3))
  // The instance must be assigned a dataset so Weka knows the attribute structure
  predict.setDataset(dataSet)
  var index = classifier.classifyInstance(predict).toInt
  var className = trainingDataSet.attribute(4).value(index)
  println(" -> " + className)
}
5.7,2.8,4.1,1.3,Versicolor -> Virginica
5.9,3.2,4.8,1.8,Versicolor -> Virginica
6.5,3,5.2,2,Virginica -> Virginica
6,2.2,4,1,Versicolor -> Virginica
6.2,2.8,4.8,1.8,Virginica -> Virginica
6.7,2.5,5.8,1.8,Virginica -> Virginica
6,2.7,5.1,1.6,Versicolor -> Virginica
4.6,3.2,1.4,0.2,Setosa -> Setosa
5,3,1.6,0.2,Setosa -> Setosa
5.6,3,4.1,1.3,Versicolor -> Virginica
6.5,3,5.5,1.8,Virginica -> Virginica
7.7,3.8,6.7,2.2,Virginica -> Virginica
6.7,3,5.2,2.3,Virginica -> Virginica
4.3,3,1.1,0.1,Setosa -> Setosa
6.4,3.1,5.5,1.8,Virginica -> Virginica
null
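If you want to reuse the trained classifier later without retraining, Weka's SerializationHelper can write it to disk and read it back. A minimal sketch (the file name iris-nbm.model is just an example):

```scala
import weka.core.SerializationHelper
import weka.classifiers.bayes.NaiveBayesMultinomial

// Persist the trained model to disk...
SerializationHelper.write("iris-nbm.model", classifier)
// ...and load it back later, casting to the concrete classifier type.
var restored = SerializationHelper.read("iris-nbm.model").asInstanceOf[NaiveBayesMultinomial]
```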
Finally - just for our peace of mind - we double-check that we didn't change our original dataSet:
dataSet.size
150