Naive Bayes is a simple technique for constructing classifiers: models that assign class labels, drawn from some finite set, to problem instances represented as vectors of feature values. Despite its simplicity, it is a straightforward and powerful algorithm for classification, and it scales well: even on a data set with millions of records, Naive Bayes is worth trying.

In several of my previous blogs, I have written about the Naive Bayes classifier: what it is and how it works. Today we will use the KSAI library to build a Naive Bayes model. But before that, let's explore: what is KSAI?

What is KSAI?

KSAI is an open-source machine learning library that contains various algorithms for classification, regression, clustering, and more. It is an attempt to build machine learning algorithms in Scala. The Breeze library, which is also built on Scala, is used for the mathematical operations.

KSAI mainly uses Scala's built-in case classes, Futures, and other language features. It also uses Akka in some places and tries to do things asynchronously. The test cases are a good starting point for exploring the library. Right now the library may not be easy to use because of limited documentation and an unsettled API, but the committers plan to improve both in the near future.

How to use it?

You can add the KSAI library to your project with the dependency below.

  1. For an sbt project, add the following dependency to build.sbt.
    libraryDependencies += "io.github.knolduslabs" %% "ksai" % "0.0.4"
  2. For a Maven project, add the corresponding dependency to pom.xml.
  3. For Gradle Groovy DSL, Gradle Kotlin DSL, Apache Ivy, Groovy Grape, and other build tools, you can find the related dependencies here.


The KSAI Naive Bayes classifier can be used with three models:

  • General/Gaussian: used for classification with continuous features; it assumes the features follow a normal distribution.
  • Multinomial: used for discrete counts, such as word frequencies.
  • Bernoulli: useful when the feature vectors are binary (i.e. zeros and ones).
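To make the multinomial model concrete, here is a small self-contained sketch (plain Scala, independent of KSAI) that scores a word-count vector against two classes using log-likelihoods with add-one (Laplace) smoothing. The three-word vocabulary and the per-class counts are made up purely for illustration:

```scala
object MultinomialNBSketch {
  // Hypothetical word counts per class, summed over a toy training set.
  // Vocabulary: Array("outstanding", "awful", "boring")
  val posCounts = Array(8.0, 1.0, 1.0)
  val negCounts = Array(1.0, 7.0, 6.0)

  // Log-likelihood of a count vector under one class, with add-one smoothing
  // so that unseen words never produce a zero probability.
  def logLikelihood(doc: Array[Double], classCounts: Array[Double]): Double = {
    val total = classCounts.sum + classCounts.length // +1 per vocabulary word
    doc.indices.map(i => doc(i) * math.log((classCounts(i) + 1.0) / total)).sum
  }

  // Predict 1 (pos) or 0 (neg), assuming equal class priors.
  def predict(doc: Array[Double]): Int =
    if (logLikelihood(doc, posCounts) >= logLikelihood(doc, negCounts)) 1 else 0
}
```

For example, a review counted as `Array(2.0, 0.0, 0.0)` (two occurrences of "outstanding") scores higher under the positive class, while `Array(0.0, 1.0, 2.0)` scores higher under the negative one.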

For a better understanding, let's build something using KSAI's Naive Bayes classifier.

In this example, I'll use the data file movie.txt to demonstrate the application. The file is included in the GitHub repository linked below, so you can play around with it as well.

As the name suggests, movie.txt contains movie reviews, each labeled neg (negative review) or pos (positive review). We first convert the labels into numeric values: 1 for each positive record and 0 for each negative record.

import scala.io.Source

val resource = Source.fromFile("src/test/resources/movie.txt").getLines().toArray

val movieX = new Array[Array[Double]](2000)
val movieY = new Array[Int](2000)

val x = new Array[Array[String]](2000)
resource.indices.foreach { itr =>
  val value = resource(itr)
  val words = value.trim.split(" ")
  // The first word of each line is the class label: pos -> 1, neg -> 0.
  if (words(0).equalsIgnoreCase("pos")) {
    movieY(itr) = 1
  } else if (words(0).equalsIgnoreCase("neg")) {
    movieY(itr) = 0
  } else println("Invalid class label: " + words(0))
  x(itr) = words
}

In our example, we will use a small set of indicative words as features; the algorithm will predict based on these:

val feature: Array[String] = Array(
  "outstanding", "wonderfully", "wasted", "lame", "awful", "poorly",
  "ridiculous", "waste", "worst", "bland", "unfunny", "stupid", "dull",
  "fantastic", "laughable", "mess", "pointless", "terrific", "memorable",
  "superb", "boring", "badly", "subtle", "terrible", "excellent",
  "perfectly", "masterpiece", "realistic", "flaws")

Now, using these features, we convert the whole data set into numeric values so that we can apply the Naive Bayes classifier to it.

val (featureMap, _) = feature.foldLeft((Map.empty[String, Int], 0)) {
  case ((map, k), word) if !map.contains(word) => (map + (word -> k), k + 1)
  case (tuple, _) => tuple
}

// Convert a review into a bag-of-words count vector over the feature list.
def toFeatureVector(words: Array[String]): Array[Double] = {
  val bag = new Array[Double](feature.length)
  words.foreach { word =>
    featureMap.get(word).foreach { f => bag(f) = bag(f) + 1 }
  }
  bag
}

x.indices.foreach { itr =>
  movieX(itr) = toFeatureVector(x(itr))
}
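As a quick sanity check, the same bag-of-words idea can be exercised in isolation. This is a self-contained sketch with a tiny made-up vocabulary, not the full feature list above:

```scala
object BagOfWordsSketch {
  val vocabulary = Array("outstanding", "awful", "boring")
  val index: Map[String, Int] = vocabulary.zipWithIndex.toMap

  // Count how often each vocabulary word occurs in the review;
  // words outside the vocabulary are simply ignored.
  def toCounts(review: Array[String]): Array[Double] = {
    val bag = new Array[Double](vocabulary.length)
    review.foreach(word => index.get(word).foreach(i => bag(i) += 1))
    bag
  }
}
```

For instance, the review "an outstanding film not boring at all boring" maps to `Array(1.0, 0.0, 2.0)`: one "outstanding", no "awful", two "boring", and everything else dropped.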

Our data set is ready. Now we slice the data: part of it trains the algorithm, and the rest is used for prediction so that we can check the algorithm's accuracy.
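Conceptually, 10-fold cross-validation partitions the record indices into ten disjoint folds; each iteration trains on nine folds and tests on the remaining one. Here is a minimal plain-Scala sketch of that split (a round-robin assignment for illustration only, not KSAI's own CrossValidation, whose internals may differ):

```scala
object FoldsSketch {
  // Partition indices 0 until n into k disjoint round-robin folds.
  def folds(n: Int, k: Int): IndexedSeq[IndexedSeq[Int]] =
    (0 until k).map(f => (0 until n).filter(_ % k == f))

  // Train indices for fold f: every index not in the test fold.
  def trainIndices(n: Int, k: Int, f: Int): IndexedSeq[Int] =
    (0 until n).filterNot(_ % k == f)
}
```

With 2000 records and 10 folds, every fold holds 200 test indices and the corresponding training set holds the other 1800, so each record is tested exactly once across the ten iterations.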

val startTime = new java.util.Date().getTime // for logging time only
val crossValidation = CrossValidation(movieX.length, 10)

var total = 0
var error = 0
var success = 0

(0 until 10).foreach { itr =>
  val trainX = LOOCV.slice(movieX, crossValidation.train(itr)).toArray
  val trainY = LOOCV.slice(movieY, crossValidation.train(itr)).toArray

  val naiveBayes = NaiveBayes(model = MULTINOMIAL, classCount = 2, independentVariablesCount = feature.length)
  naiveBayes.learn(trainX, trainY)

  val testX = LOOCV.slice(movieX, crossValidation.test(itr)).toArray
  val testY = LOOCV.slice(movieY, crossValidation.test(itr)).toArray

  testX.indices.foreach { j =>
    val label = naiveBayes.predict(testX(j))
    if (label != -1) {
      total = total + 1
      if (testY(j) != label) {
        error = error + 1
      } else {
        success = success + 1
      }
    }
  }
}

info(s"Time taken: ${new java.util.Date().getTime - startTime} millis")
info(s"Multinomial error is $error and success is $success of total $total")

Here is the link to the sample code. Please explore it for more details.

That’s it for this blog. You can find many more interesting algorithms in KSAI right here.

Thanks for reading!


Written by 

Nitin Aggarwal is a software consultant at Knoldus Software INC with more than 1.5 years of experience. Nitin likes to explore new technologies and learn new things every day. He loves watching cricket, Marvel movies, playing the guitar, and exploring new places. Nitin is familiar with programming languages such as Java, Scala, C, C++, HTML, and CSS; technologies like Lagom, Akka, Kafka, and Spark; databases like Cassandra, MySQL, and PostgreSQL; and graph databases like Titan DB.
