The basic goal of a 'sentiment analysis' is to classify a given text into positive, negative or neutral.
SentiWordNet is a lexical resource for opinion mining. It assigns sentiments to each synset of WordNet which makes it possible to "calculate" an overall sentiment for a text. A SentiWordNet implementation can be found in DL4J in the deeplearning4j-nlp-uima artifact.
This demo has been implemented in Scala using the Jupyter BeakerX Notebook.
%classpath config resolver maven-public http://192.168.1.10:8081/repository/maven-public/
%%classpath add mvn
org.deeplearning4j:deeplearning4j-nlp-uima:1.0.0-beta2
Added new repo: maven-public
org.deeplearning4j:deeplearning4j-core:1.0.0-beta2 org.deeplearning4j:deeplearning4j-nlp-uima:1.0.0-beta2 org.deeplearning4j:deeplearning4j-nlp:1.0.0-beta2 org.nd4j:nd4j-native-platform:1.0.0-beta2 com.github.habernal:confusion-matrix:1.0
In order to classify a text we just need to call the classify method on the SWN3 class. Here is an example:
import org.deeplearning4j.text.corpora.sentiwordnet.SWN3
var svn3 = new SWN3()
var txt = "For years Apple was the innovative leader in personal computers. Not anymore."
var result = svn3.classify(txt)
s"$txt ==> $result"
For years Apple was the innovative leader in personal computers. Not anymore. ==> weak_negative
In quite a few Blogs the Sentiment140 dataset (with 1.6 Million Twitter entries) is used as input for the training of Sentiment Analyisis models. So we would like to double check how good this dataset is relating to the predictions done with Sentiwordnet.
import scala.io.Source
def getLabel(number:String):String = {
var v = number
if (v=="\"0\"") v = "negative"
else if (v=="\"2\"") v = "neutral"
else if (v=="\"4\"") v = "positive"
return v
}
var dataList = Source.fromFile("training.1600000.processed.noemoticon.csv","ISO-8859-1")
.getLines()
.map(str => str.split(","))
.map(array => (array(5),getLabel(array(0))))
.toList
dataList.length
1600000
We can easily caluclate the accuracy by dividing the correctly classified entries with the total entries. For doing this we need to remove the strong_ and weak_ prefixes which are generated by the SentiWordNet.
import scala.util.Random
def stdLabel(label:String):String = {
var result = label.replace("strong_","")
result = result.replace("weak_","")
return result
}
dataList = Random.shuffle(dataList)
val resultCompareList = dataList.slice(0, 20000).par.map(r => (r._2, stdLabel(svn3.classify(r._1))))
resultCompareList.size
20000
We just compare the classification result values which should be identical
var swnSet = resultCompareList.map(_._2).toSet
var s140Set = resultCompareList.map(_._1).toSet
s"$swnSet <=> $s140Set"
ParSet(negative, positive, neutral) <=> ParSet(negative, positive)
...and we calculate the Accuracy
val correctCount:Double = resultCompareList.count(v => v._1 == v._2)
s"Accuracy = ${correctCount / resultCompareList.size}"
Accuracy = 0.4147
val neutralCount:Double = resultCompareList.count(v => v._2 == "neutral")
neutralCount / resultCompareList.size
0.32095
The accuracy is surprsingly low. This can partly be explained by the fact that the Sentiment140 data provides almost no neutral classifications.