Spark MLLib

Tale of Apache Spark

Reading Time: 6 minutes. Data is being produced at an enormous rate in today’s world, and it will be generated even more rapidly in the future. 90% of the world’s data was produced in the last two years alone, and it is estimated that in 2020 the world’s total data will reach 45 ZB, with the data generated each day being enough that if we try to store it …

Understanding Support Vector Machines

Reading Time: 3 minutes. [Contributed by Raghu from Knoldus, Canada] The Support Vector Machine (SVM) is an important and popular classification technique among machine learning algorithms. It is also called large margin classification. An SVM produces a hyperplane that separates samples, and hence classifies them, into two distinct classes. The hyperplane it finds not only separates the samples but does so with the maximum possible separation.
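
The "maximum separation" idea can be made concrete with a small sketch. This is not code from the post and it does not train an SVM; it only measures the geometric margin of two candidate hyperplanes and picks the wider one. The toy samples, the candidate names, and the helper `geometric_margin` are all hypothetical and chosen for illustration.

```python
import math

def geometric_margin(w, b, samples):
    """Smallest signed distance from any labelled sample to the hyperplane w.x + b = 0.

    A positive result means every sample lies on the side that matches its label.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in samples
    )

# Hypothetical toy data: two classes labelled +1 and -1.
samples = [
    ((2.0, 2.0), 1),
    ((3.0, 3.0), 1),
    ((0.0, 0.0), -1),
    ((-1.0, 0.0), -1),
]

# Two hypothetical separating hyperplanes, each given as (w, b).
# Both classify every sample correctly, but with different margins.
candidates = {
    "diagonal": ((1.0, 1.0), -2.0),
    "vertical": ((1.0, 0.0), -1.0),
}

margins = {name: geometric_margin(w, b, samples) for name, (w, b) in candidates.items()}

# A large margin classifier prefers the separator with the widest margin.
best = max(margins, key=margins.get)
print(best, margins[best])
```

Both candidates separate the two classes, so plain correctness cannot choose between them. The margin can, and an SVM searches over all hyperplanes for the one that maximizes it rather than comparing a fixed list as done here.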

Build your personalized movie recommender with Scala and Spark

Reading Time: 3 minutes. In this blog I will explain what a recommendation engine is in general, and how to build a personalized recommendation model using Scala and Spark’s collaborative filtering algorithm. What is a Recommendation Engine? I assume you’ve shopped online for books or visited movie review sites to pick top-rated movies to watch. You must have seen top-rated movie lists which have been voted …

Email spam detection using apache spark mllib

Reading Time: 2 minutes. In this blog we will look at a real use case of Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will detect whether an email goes to the spam folder or the primary folder. So let’s jump into the programming and see how it is implemented. First we will load the training data from the spam dataset …

A sample ML Pipeline for Clustering in Spark

Reading Time: 2 minutes. A machine learning task often involves several steps, such as extracting features from raw data, creating learning models to train on those features, and running predictions on the trained models. The pipeline API provided by Spark makes it easier to combine and tune multiple ML algorithms in a single workflow. What is in the blog? We will create a sample ML pipeline …
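
As a rough illustration of the pipeline idea, the sketch below chains feature extraction, a toy model, and prediction into one workflow in plain Python. It is not Spark's pipeline API and not the pipeline built in the post; every class here (`Tokenizer`, `LengthFeature`, `MeanThreshold`, `Pipeline`) is a hypothetical stand-in, and the "model" is a simple threshold rather than a clustering algorithm.

```python
class Tokenizer:
    """Feature extraction stage: split raw text into lowercase words."""
    def fit(self, rows):
        return self  # stateless, nothing to learn
    def transform(self, rows):
        return [row.lower().split() for row in rows]

class LengthFeature:
    """Feature extraction stage: turn a word list into a single number."""
    def fit(self, rows):
        return self  # stateless, nothing to learn
    def transform(self, rows):
        return [float(len(words)) for words in rows]

class MeanThreshold:
    """Toy model stage: learn the mean, then label values above it as 1."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self
    def transform(self, rows):
        return [1 if value > self.mean else 0 for value in rows]

class Pipeline:
    """Run the stages in order so the whole workflow is fit and applied as one unit."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        for stage in self.stages:
            # Each stage learns from the output of the previous stage.
            rows = stage.fit(rows).transform(rows)
        return self
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline = Pipeline([Tokenizer(), LengthFeature(), MeanThreshold()])
pipeline.fit(["spark is fast", "hello", "a sample ml pipeline for clustering"])
predictions = pipeline.transform(["one two", "one two three four five"])
print(predictions)
```

The point of the design is that training and prediction both go through the same ordered list of stages, so a stage can be swapped or tuned without rewriting the glue code around it.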

Meetup: Introduction to Spark with Scala

Reading Time: < 1 minute. Knoldus organized a Meetup on Wednesday, 1 April 2015. In this Meetup, we gave a brief introduction to Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. A wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and gained insight into Apache Spark. The examples shown in the slides above can be downloaded from …

Play with Spark: Building Spark MLLib in a Play Spark Application

Reading Time: 2 minutes. In the last post of our Play with Spark! series, we saw how to integrate Spark SQL into a Play Scala application. In this blog we will see how to add Spark MLlib to a Play Scala application. Spark MLlib is a new component under active development. It was first released with Spark 0.8.0. It contains some common machine learning algorithms and utilities, including classification, regression, clustering, …