Author: Manish Mishra

AMPS: Empowering real time message driven applications.

Reading Time: 3 minutes Greetings!! In this blog, we will talk about AMPS, a pub-sub engine which delivers messages on a subject of interest in real time. AMPS is mainly used by financial institutions as an enterprise message bus. We will also demonstrate, with an example, how we can use AMPS to publish and subscribe to messages. So, let’s start by introducing AMPS. What is AMPS? Advanced Message Processing System Continue Reading

Apache Ignite

Sharing RDD’s states across Spark applications with Apache Ignite

Reading Time: 4 minutes Apache Ignite offers an abstraction over native Spark RDDs such that the state of RDDs can be shared across Spark jobs, workers, and applications, which is not possible with native Spark RDDs. In this blog, we will walk through the steps to share RDDs between two Spark applications. Preparing Ingredients To test Apache Ignite with an Apache Spark application, we need at least one master Continue Reading

Controlling RDD Partitions in Apache Spark

Reading Time: 2 minutes In this blog, we will discuss what RDD partitioning is, why partitioning is important, and how to create and use Spark partitioners to minimize shuffle operations across the nodes of a distributed Spark application. What is Partitioning? Partitioning is a transformation operation which is available on all key-value pair RDDs in Apache Spark. It is required when we try to group values on the basis Continue Reading
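To make the idea of key-based partitioning concrete, here is a minimal sketch in plain Java. It does not use Spark; the class `HashPartitionSketch` and its methods are hypothetical names chosen for illustration. It assumes a simple hash scheme in which a key's hash code, taken modulo the number of partitions, decides where a key-value pair goes, so pairs with the same key always land in the same partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashPartitionSketch {

    // Map a key to a partition index in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Distribute {key, value} pairs into numPartitions buckets by key.
    static List<List<String[]>> partition(List<String[]> pairs, int numPartitions) {
        List<List<String[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] pair : pairs) {
            partitions.get(partitionFor(pair[0], numPartitions)).add(pair);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[]{"a", "1"},
                new String[]{"b", "2"},
                new String[]{"a", "3"});
        List<List<String[]>> partitions = partition(pairs, 3);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + " holds " + partitions.get(i).size() + " pair(s)");
        }
    }
}
```

Because both pairs with key `a` end up in the same bucket, grouping their values needs no data movement between buckets, which is the intuition behind using a partitioner to reduce shuffling.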

Build your personalized movie recommender with Scala and Spark

Reading Time: 3 minutes In this blog, I will explain what a recommendation engine is in general, and how to build a personalized recommendation model using Scala and Spark's collaborative filtering algorithm. What is a Recommendation Engine? I assume you’ve shopped online for books or visited movie review sites to pick top-rated movies to watch. You must have seen top-rated movie lists which have been voted Continue Reading

Introduction to Java 8

Reading Time: < 1 minute The Functional Features of Java 8 Java 8 was a major release in terms of language and APIs. The language includes several ideas from functional programming, like behavior parameterization, passing lambda expressions to methods, processing data with stream pipelines, etc. The following presentation describes the functional programming additions in Java 8. We will be introducing lambda expressions, Functional Interfaces, Default methods and the Stream API in Java Continue Reading
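As a quick taste of the features named above, here is a minimal sketch that touches each of them: a lambda expression, a functional interface, a default method, and a stream pipeline. The class `Java8FeaturesSketch`, the `Greeter` interface, and the `filterNames` helper are hypothetical names made up for illustration; they are not taken from the presentation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class Java8FeaturesSketch {

    // A functional interface: exactly one abstract method, so a lambda can implement it.
    @FunctionalInterface
    interface Greeter {
        String greet(String name);

        // A default method: an interface method with a body, new in Java 8.
        default String greetLoudly(String name) {
            return greet(name).toUpperCase();
        }
    }

    // Behavior parameterization: the filtering rule is passed in as a Predicate.
    static List<String> filterNames(List<String> names, Predicate<String> rule) {
        return names.stream()               // start a stream pipeline
                .filter(rule)               // keep only elements matching the rule
                .map(String::toUpperCase)   // transform each element (method reference)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A lambda expression implementing the single abstract method of Greeter.
        Greeter greeter = name -> "Hello, " + name;
        System.out.println(greeter.greet("Java 8"));        // prints "Hello, Java 8"
        System.out.println(greeter.greetLoudly("Java 8"));  // prints "HELLO, JAVA 8"

        List<String> names = Arrays.asList("lambda", "stream", "default", "optional");
        // The lambda n -> n.length() > 6 is the behavior handed to filterNames.
        System.out.println(filterNames(names, n -> n.length() > 6)); // prints "[DEFAULT, OPTIONAL]"
    }
}
```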

Broadcast variables in Spark, how and when to use them?

Reading Time: 2 minutes As the documentation for Spark broadcast variables states, they are immutable shared variables which are cached on each worker node of a Spark cluster. In this blog, we will demonstrate a simple use case of broadcast variables. When to use a Broadcast variable? Think of a problem such as counting grammar elements for any random English paragraph, document, or file. Suppose you have the Map of each word as specific Continue Reading

Aggregating Neighboring vertices with Apache Spark GraphX Library

Reading Time: 2 minutes To understand the problems addressed by “Neighborhood Aggregation”, we can think of queries like: “Who has the maximum number of followers under 20 on Twitter?” In this blog, we will learn how to aggregate properties of neighboring vertices on a graph with Apache Spark’s GraphX Library. The Spark shell will be enough to understand the code example. So, let us get back to the problem statement. Let Continue Reading

A sample ML Pipeline for Clustering in Spark

Reading Time: 2 minutes Often a machine learning task contains several steps, such as extracting features out of raw data, creating learning models to train on features, and running predictions on trained models. With the help of the pipeline API provided by Spark, it is easier to combine and tune multiple ML algorithms into a single workflow. What is in the blog? We will create a sample ML pipeline Continue Reading

Introduction to Machine Learning with Spark (Clustering)

Reading Time: 2 minutes In this blog, we will learn how to group similar data objects using K-means clustering offered by the Spark Machine Learning Library. Prerequisites The code example needs only the Spark shell to execute. What is Clustering? Clustering is like grouping data objects into some random clusters (with no initial class of group defined) on the basis of similarity or the natural closeness to each other. The “closeness” Continue Reading

Testing Scala Applications with In-memory mongoDB

Reading Time: 2 minutes Is your test suite taking a large amount of time to run just because your methods need some database queries to be handled? Testing with in-memory databases can save a hell of a lot of time. An in-memory database makes the queries readily available to methods in a matter of milliseconds. In case you are using MongoDB, this blog can help you run your test suite much faster Continue Reading

Configuring SonarQube with Scoverage plug-in : The Complete Guide

Reading Time: 4 minutes This blog will guide you through the successful configuration of the Scoverage plug-in with SonarQube for Scala source code statement coverage analysis. How Does it Work? The Scoverage plug-in for SonarQube reads the report generated by the sbt scoverage plug-in and generates several reports, such as: Statement Coverage % Analysis, Lines covered by test, and Drilling down report to the file level. The greatest advantage of SonarQube is the Continue Reading