Concept Learning: Find-S implementation with Scala


In our previous blog, we discussed the basic theory of concept learning, highlighting the Find-S algorithm, which is one of the basic algorithms of machine learning. In this blog, we discuss how to implement Find-S in Scala, a scalable language well suited to concurrent programming. Machine learning algorithms can be implemented in several ways; the objective of this application is to build an understanding of Find-S. A few things have been added to make the algorithm usable, and a few others can still be improved in terms of performance and implementation. Currently, the algorithm works on a Scala Map so that the set of features stays dynamic. One important improvement would be a mapping step that converts string values into integers or longs to improve performance. Let’s go through the sample application, which uses Find-S to make predictions for a simple scenario.


As we discussed in our previous blog, from the implementation perspective there are three major parts of Find-S:

  1. Training
  2. Validation
  3. Prediction/Classification

To achieve this, we can divide the entire process into multiple modules:

  1. learning: Contains classes/traits related to machine learning algorithms.
  2. common: Contains common classes used by different modules.
  3. persistence: Contains classes and traits for hypothesis persistence logic.
  4. examples: Contains sample applications to test the algorithms.

Here are some of the important classes/traits that define the major components of the algorithm. We have tried to make some of them generic so that the same pattern can be reused for other concept learning algorithms. Here is a brief description of them:

1. Model: An abstraction of any concept learning algorithm.
2. Trainer: Responsible for making an algorithm learn the concept.
3. Examples: Sample applications to test the learning and prediction of an algorithm.

Apart from these, there are a few helper classes that support learning and prediction, such as file readers/writers and JSON parsers that parse JSON into Map[String, Any].

1. Trainer:

The trainer is one of the most important parts of the Find-S implementation. Its basic responsibilities are listed below:

1. Accepting/reading the training data to process
2. Dividing it into two parts:

a. Training data
b. Validation data

3. Passing the training data to the Find-S model to be learned
4. Validating the learned hypothesis with the validation data

For any machine learning algorithm, it is good practice to hold back part of the sample data to validate the trained model. A common guideline is to use more than 60% of the data for training and the rest for validation; some setups use up to 90% for training.
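The split between training and validation data can be sketched as follows. This is a minimal illustration, not the code from the repository: the object name TrainingSplit and the explicit trainingRatio parameter are assumptions made for this example, and the samples are split in their original order without shuffling.

```scala
object TrainingSplit {
  type Sample = Map[String, Any]

  // Splits the data into (training, validation) parts.
  // trainingRatio is the fraction of samples used for training, in the range 0.0 to 1.0.
  // Assumption: samples are taken in their original order, no shuffling is done.
  def separate(data: List[Sample], trainingRatio: Double): (List[Sample], List[Sample]) = {
    require(trainingRatio >= 0.0 && trainingRatio <= 1.0,
      "trainingRatio must be between 0 and 1")
    val trainingSize = math.round(data.size * trainingRatio).toInt
    data.splitAt(trainingSize)
  }
}
```

With a ratio of 1.0, as used in the example later in this post, every sample goes to training and the validation part is empty.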

Here is a trait which defines a Trainer:

trait Trainer {

  //Train the algorithm
  def train: Boolean

  //Read data from file
  protected def read: List[Map[String, Any]]

  //separate data into two parts: (training data, validation data)
  protected def separate(data: List[Map[String, Any]]): (List[Map[String, Any]], List[Map[String, Any]])

  //pass training data to algorithm
  protected def training(sample: List[Map[String, Any]]): Boolean

  //Validate final hypothesis
  protected def validate(validationData: List[Map[String, Any]]): Boolean

}

2. Model:

We have tried to define a standard model for concept learning algorithms to keep things easy to understand and implement.

/**
  * Model to be trained
  */
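trait Model {
  // Key that identifies the result among the keys of a sample
  val resKey: String
  // Learn from a single training sample
  def training(sample: scala.collection.immutable.Map[String, Any]): Boolean
  // Hypothesis learned so far
  def getHypothesis: Any
  // Classify a data object as positive (true) or negative (false)
  def predict(dataObject: scala.collection.immutable.Map[String, Any]): Boolean
  // Store the hypothesis so that it can be reused later
  def persist: Boolean
  // Whether the model has already been trained
  def trained: Boolean
}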
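To make the idea behind such a model concrete, here is a minimal, self-contained sketch of the core Find-S update over Map[String, Any] samples. It is not the FindS class from the repository. The names FindSSketch, Constraint, Exactly and AnyValue are hypothetical, and the sketch assumes that a positive sample carries the Boolean value true under the result key.

```scala
object FindSSketch {
  type Sample = Map[String, Any]

  sealed trait Constraint
  case object AnyValue extends Constraint                 // "?": any value is acceptable
  final case class Exactly(value: Any) extends Constraint // only this value is acceptable

  // None stands for the most specific hypothesis (no positive sample seen yet).
  type Hypothesis = Option[Map[String, Constraint]]

  // Generalize the hypothesis just enough to cover one positive sample.
  def generalize(h: Hypothesis, features: Sample): Hypothesis = h match {
    case None =>
      // First positive sample: require exactly its values.
      Some(features.map { case (k, v) => k -> (Exactly(v): Constraint) })
    case Some(constraints) =>
      val updated: Map[String, Constraint] = constraints.map {
        case (k, Exactly(v)) if features.get(k).contains(v) => k -> Exactly(v)
        case (k, _) => k -> AnyValue // mismatch or already general
      }
      Some(updated)
  }

  // Find-S ignores negative samples and folds over the positive ones.
  // Assumption: a sample is positive when sample(resKey) == true.
  def train(samples: List[Sample], resKey: String): Hypothesis =
    samples.foldLeft(Option.empty[Map[String, Constraint]]) { (h, sample) =>
      if (sample.get(resKey).contains(true)) generalize(h, sample - resKey) else h
    }

  // A data object is positive when it satisfies every constraint.
  def predict(h: Hypothesis, features: Sample): Boolean = h match {
    case None => false
    case Some(constraints) =>
      constraints.forall {
        case (_, AnyValue)   => true
        case (k, Exactly(v)) => features.get(k).contains(v)
      }
  }
}
```

Representing the hypothesis as an Option keeps the "nothing learned yet" state explicit, so the first positive sample can simply be copied into the hypothesis.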

3. Example:

Here is an example to find target hypothesis using the Find-S algorithm. This example is divided into multiple parts:

1. Training data generation: Since concept learning works on past experience, we need training data ready for the learning process. This step generates training data for a simple example. To test the application you can create your own data; the application currently generates test data as Map[String, Any].

2. Trainer initialization: This step creates a Trainer with a model (Find-S) and some basic configuration such as the training ratio (the ratio between training samples and validation samples, represented as a double in the range 0 to 1, where 0 means 0% and 1 means 100%).

3. Training: The trainer is fully responsible for making the model learn the concept from the training samples, but we need to trigger it by calling the trainer function ‘train’. Once it is triggered, the trainer divides the samples into training data and validation data based on the training ratio, and passes the training samples to the model synchronously.

4. Trained model: After the training process finishes, we can use the trained model to make predictions, and we can inspect the final hypothesis (hypothesis set) using the ‘getHypothesis’ function.

5. Testing: To test the trained model, we can pass a sample object to the model using the predict function and compare the actual output with the expected output.

/**
  * Find-S example
  */
object FindSExample extends App with LogHelper {

  /** ******************************
    * TRAINING DATA GENERATION
    * *******************************/
  val trainingDataFilePath = ConceptLearningTrainingDataGenerator
                             .randomTrainingData

  /** ******************************
    * TRAINER INITIALIZATION
    * *******************************/
  val path = "/tmp/find_s"
  val jsonHelper = new FileHelper {}.reset(path)
  val trainer = new FindSTrainer {
    val trainingSampleFilePath = trainingDataFilePath
    val model: Model = new FindS("result", path)
    override val trainingRatio = 1.0
  }

  /** ******************************
    * TRAINING
    * *******************************/
  if (!trainer.model.trained) {
    trainer.train
  } else {
    info("Model is trained, skipping training process")
  }

  /** ******************************
    * TRAINED MODEL
    * *******************************/
  val trainedModel = trainer.model

  info(s"***Hypothesis: ${trainedModel.getHypothesis}")

  /** ***********************************
    * TESTING
    * ***********************************/
  val testDataObject = Map("sky" -> "Sunny", "airtemp" -> "Cool",
    "humidity" -> "Warm", "wind" -> "Weak",
    "water" -> "Cool", "forecast" -> "Change")
  info(s"***Testing new positive data object: $testDataObject")
  val status = trainedModel.predict(testDataObject)
  if (status) {
    info("***THE DATA OBJECT IS ... : +POSITIVE")
  } else {
    info("***THE DATA OBJECT IS ... : -NEGATIVE")
  }
}

As with any algorithm that learns from training data, the accuracy of the model depends heavily on that data. The standard Find-S algorithm does not detect errors in the training data. However, this implementation throws a few exceptions if anything goes wrong during learning or prediction, so that the user of the algorithm can tell when there is an error in the training data.
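As an illustration of this kind of check, the sketch below rejects a sample that has no result key or whose features differ from the expected ones. It is only an assumption about what such a check could look like: the object name SampleValidation, its parameters and the use of require (which throws IllegalArgumentException) are not taken from the repository.

```scala
object SampleValidation {
  type Sample = Map[String, Any]

  // Returns the sample unchanged when it is usable for training,
  // otherwise throws IllegalArgumentException (raised by require).
  def validate(sample: Sample, resKey: String, expectedFeatures: Set[String]): Sample = {
    require(sample.contains(resKey),
      s"Missing result key '$resKey' in sample: $sample")
    val features = sample.keySet - resKey
    require(features == expectedFeatures,
      s"Expected features $expectedFeatures but found $features")
    sample
  }
}
```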

Running the sample application:

To test the application and find the code, you can clone the git repo from here: find-s algorithm implementation.

The application is designed to work with different features and values dynamically, so for now you can test it using training data in a JSON file.

The current implementation lets you use the algorithm in your own test applications. Here are a few points regarding the sample test application:

1. The application can read data from a JSON file only for now.
2. We have to provide a result key that identifies the result among the set of keys.
3. The Find-S algorithm can store the target concept in a file and reuse it the next time the algorithm is initialized.
4. After cloning the git repo, you can run the sample application using the following command:

sbt "project examples" run

5. To understand how the application is working you can play with test cases using the following command:

sbt test

Limitations of the sample application:

This application was developed to show the concept behind the Find-S algorithm.

  • A few things can be improved, such as mapping feature values from strings to doubles.
  • To find a conjunctive concept, we just compare the values of the features.
  • The application is Linux-based for now.
  • You can find the Find-S hypothesis under the /tmp folder (the example above uses /tmp/find_s).
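The mapping of feature values mentioned in the first point could look like the sketch below, which assigns an integer code to each distinct value of a feature. It is an assumption about one possible approach, not part of the repository; the name FeatureEncoding and both functions are hypothetical, and integer codes are used here instead of doubles.

```scala
object FeatureEncoding {
  // For every feature, assign each distinct value an integer code
  // in order of first appearance (0, 1, 2, ...).
  def buildCodes(samples: Seq[Map[String, String]]): Map[String, Map[String, Int]] =
    samples
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (feature, pairs) =>
        feature -> pairs.map(_._2).distinct.zipWithIndex.toMap
      }

  // Replace string values with their codes.
  // Throws NoSuchElementException for a feature or value that was never seen.
  def encode(sample: Map[String, String],
             codes: Map[String, Map[String, Int]]): Map[String, Int] =
    sample.map { case (feature, value) => feature -> codes(feature)(value) }
}
```

Comparing integer codes avoids repeated string comparisons during training and prediction, at the cost of keeping the code table alongside the hypothesis.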

About Girish Bharti

Girish is a Sr. Software Consultant at Knoldus Software LLP. He is a Scala developer and is passionate about the Scala ecosystem. He has also done many projects in other languages such as Java and ASP.NET. He is self-motivated, dedicated and focused on his work, and he believes in developing quality products. He wants to work on different projects and in different domains, and he is curious to learn about new domains and to provide solutions that use resources well and improve performance. His personal interests include reading books, video games, cricket and social networking. He holds a Masters in Computer Applications from Lal Bahadur Shastri Institute of Management, New Delhi.