Apache Hadoop vs Apache Spark

The term Big Data has created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks. In this blog, we will cover the differences between Spark and Hadoop MapReduce.


Spark – It is an open-source Big Data framework that provides a faster and more general-purpose data processing engine. Spark is basically designed for fast computation. It also covers a wide range of workloads, for example batch, interactive, iterative, and streaming.

Hadoop MapReduce – It is also an open-source framework for writing applications. It processes structured and unstructured data stored in HDFS. Hadoop MapReduce is designed to process large volumes of data on a cluster of commodity hardware, and it processes data in batch mode.

Data processing:

Hadoop: Apache Hadoop provides batch processing. The Hadoop ecosystem has invested a great deal in creating new algorithms and component stacks to improve access to large-scale batch processing.

MapReduce is Hadoop’s native batch processing engine. Several components or layers (like YARN, HDFS, etc.) in modern versions of Hadoop allow easy processing of batch data. Since MapReduce works with permanent storage, it stores data on disk, which means it can handle very large datasets. MapReduce is scalable and has proved its efficacy on clusters of tens of thousands of nodes. However, Hadoop’s data processing is slow, as MapReduce operates in sequential steps, reading from and writing to disk between steps.
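The map and reduce phases themselves are a simple idea. As an illustration only (plain Scala collections, not Hadoop's API), a word count can be sketched like this:

```scala
// Illustrates the map -> shuffle -> reduce model with plain Scala
// collections. This shows the idea behind MapReduce, not Hadoop's API.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))       // map phase: emit every word
      .groupBy(identity)           // shuffle phase: group identical words
      .map { case (word, occurrences) =>
        word -> occurrences.size   // reduce phase: count each group
      }
}
```

In real Hadoop, each phase runs distributed across the cluster, and intermediate results are written to disk between phases, which is exactly where the sequential-step slowness comes from.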

Spark: Apache Spark is a good fit for both batch processing and stream processing, meaning it’s a hybrid processing framework. Spark speeds up batch processing via in-memory computation and processing optimization. It’s a nice alternative for streaming workloads, interactive queries, and machine learning. Spark can also work with Hadoop and its modules. This real-time data processing capability makes Spark a top choice for big data analytics.

The Resilient Distributed Dataset (RDD) allows Spark to transparently store data in memory and send it to disk only when needed. As a result, a lot of the time otherwise spent on disk reads and writes is saved.
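As a minimal sketch of what this looks like in practice (assuming Spark 2.x with a local master; the dataset and names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("rdd-cache-sketch"))

  // cache() asks Spark to keep this RDD in memory after it is first computed,
  // so subsequent actions reuse it instead of recomputing it from scratch.
  val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n).cache()

  val total = squares.sum()   // first action: computes and caches the RDD
  val count = squares.count() // second action: served from memory

  println(s"sum = $total, count = $count")
  sc.stop()
}
```

Without the cache() call, each action would trigger the full computation again, which is the disk-bound pattern MapReduce is stuck with.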

Real-time analysis

Spark – It can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, e.g. Twitter data or Facebook sharing/posting. Spark’s strength is its ability to process live streams efficiently.

Hadoop MapReduce – MapReduce fails when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.

Continue reading

Posted in apache spark, big data, Scala

Getting started with TensorFlow: Writing your first program

In my previous blog, we saw what TensorFlow is and some of its terminology. In this blog, we are going to implement a very basic program in TensorFlow using Python to see it in action.

To import the TensorFlow library, use import tensorflow as tf

The computation in TensorFlow consists of two stages –

  1. Building the computational graph
  2. Running the computational graph

Computational graphs are nothing but the data flow graphs that I mentioned in my previous blog. Each node of the data flow graph represents an operation that contributes towards evaluating the TensorFlow computation, hence the name computational graph. In TensorFlow, each node takes zero or more tensors as inputs and produces a tensor as an output.

Continue reading

Posted in machine learning

Lagom: Consuming a service part-2

So here’s the situation: We are using Lagom framework to develop our micro-services and we need to consume data from other services. What should we do?

Well, it’s not going to be a problem for us. Lagom provides a very easy way to consume a service. The first way to consume a service is to use the unmanagedServices parameter provided by the Lagom plugin. To understand the use of unmanagedServices, please refer to the following link.

Now wait a second. In the above blog, we created an interface for the unmanaged service that could be used to communicate with the service from which we are trying to consume data.
But if we are consuming a service that is already implemented the Lagom way, wouldn’t that be boilerplate for us? Of course it would. So what should we do now?
Here’s the solution from Lagom:
Prerequisite: the service from which you want to consume data should be published via a build.
Now let’s consume the service: the following 4 steps will help you if you need to consume from another Lagom microservice.

Continue reading

Posted in Java, Microservices

Internal working of Writing and Reading in Cassandra

Apache Cassandra is a fast, distributed database built for high availability and linear scalability, with predictable performance, no single point of failure (SPOF), multi-datacenter support, and easy management. Cassandra does not follow a master-slave architecture; it uses peer-to-peer technology. In terms of the CAP theorem, Cassandra favours Availability and Partition tolerance.

We can provide a replication factor for fault tolerance, which we set while creating the keyspace. When the replication factor is set, data is automatically copied to every replica, and this replication works asynchronously. If any node goes down, Cassandra saves the missed writes as hints (in Cassandra we call this Hinted Handoff) and then replays all the writes when the node comes back and rejoins the cluster.

Cassandra is really fast at read and write operations. We will discuss the working of Cassandra’s writes and reads.

Cassandra Write Operation

When we start writing data into Cassandra, it follows these steps:

Continue reading

Posted in Cassandra, NoSql, Scala

Knolders review of #venkat_50_50_tour at #DelhiJUG17 meetup


Dr. Venkat Subramaniam celebrated his 50th birthday by distributing his smiles throughout the world’s Java User Group (JUG) meetups, conferences, and events. Recently, he was in India, and the Delhi-NCR Java User Group organized a one-stop meetup for him. Needless to say, true to our culture of knowledge sharing and caring, Knolders were a part of the event. In the meetup, Venkat spoke about:

Designing Functional Programs

Java 8 streams and why functional programming is so important. He mentioned, “All functional is declarative, but not vice versa“.

From Functional to Reactive Programming

Why reactive applications are required and how RxJava helps us build a reactive application using Java 8 Functional style. The important points are:


Continue reading

Posted in event, Functional Programming, Java, Reactive

One-way & two-way streaming in a Lagom application

Nowadays, streaming is a buzzword, and you will have heard of many types of streaming by now, e.g. Kafka streaming, Spark streaming, etc. But in this blog we will look at a new type of streaming: Lagom streaming.

Lagom streaming internally uses Akka Streams, with the help of which we will see one-way and two-way streaming. But before going forward, it would be good to understand the difference between one-way and two-way streaming, so let’s get that difference first and then move on.

One-way streaming: In this type of streaming, the request is a regular (strict) message, but the response is streamed.

Two-way streaming: In this type of streaming, both the request and the response are streamed.

Now that we have the difference clear, I will not waste your time on the theory. Let’s move ahead to the implementation in Lagom. We will look at both types of streaming together so that we can compare them easily and understand the difference properly.

API implementation:

One-way streaming:

    ServiceCall<ProductRequest, Source<ProductResponse, ?>> oneWayStreaming();

Two-way streaming:

    ServiceCall<Source<ProductRequest, ?>, Source<ProductResponse, ?>> twoWayStreaming();

Full code for API :

public interface StreamingService extends Service {

    ServiceCall<ProductRequest, Source<ProductResponse, ?>> oneWayStreaming();

    ServiceCall<Source<ProductRequest, ?>, Source<ProductResponse, ?>> twoWayStreaming();

    default Descriptor descriptor() {
        return named("streaming").withCalls(
                Service.pathCall("/api/streaming/oneWay", this::oneWayStreaming),
                Service.pathCall("/api/streaming/twoWay", this::twoWayStreaming)
        );
    }
}

Continue reading

Posted in Akka, Best Practices, big data, Functional Programming, github, Java, knoldus, Messages, Reactive, Scala, Streaming, Web Services

Assimilation of Spark Streaming With Kafka

As we know, Spark is used at a wide range of organizations to process large datasets. It seems like Spark is becoming mainstream. In this blog we will talk about the integration of Kafka with Spark Streaming. So, let’s get started.

How Kafka can be integrated with Spark?

Kafka provides a messaging and integration platform for Spark Streaming. Kafka acts as the central hub for real-time streams of data, which are processed using complex algorithms in Spark Streaming. Once the data is processed, Spark Streaming can publish the results into yet another Kafka topic.

Let’s see how to configure Spark Streaming to receive data from Kafka by first creating an SBT project and adding the following dependencies in build.sbt.

val sparkCore = "org.apache.spark" % "spark-core_2.11" % "2.2.0"
val sparkSqlKafka = "org.apache.spark" % "spark-sql-kafka-0-10_2.11" % "2.2.0"
val sparkSql = "org.apache.spark" % "spark-sql_2.11" % "2.2.0"

libraryDependencies ++= Seq(sparkCore, sparkSql, sparkSqlKafka)
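With those dependencies in place, a minimal sketch of reading from Kafka with Structured Streaming might look like the following (the broker address and topic name are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("kafka-streaming-sketch")
    .getOrCreate()

  // Subscribe to a Kafka topic; each row carries key, value, topic,
  // partition, offset and timestamp columns.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()

  // Kafka keys and values arrive as binary; cast before processing.
  val values = df.selectExpr("CAST(value AS STRING)")

  // Write the processed stream to the console (or to another Kafka topic).
  val query = values.writeStream
    .format("console")
    .start()

  query.awaitTermination()
}
```

Running this requires a Kafka broker at the configured address; the "events" topic name is just a placeholder for your own topic.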

Continue reading

Posted in Apache Kafka, apache spark, Scala

DevOps | A Basic view of DC/OS and Containers.

I had heard the term Data Center Operating System many times and had tried to understand it from various websites and blogs, but couldn’t get the basic idea of it.

What is DC/OS in simple terms and why do we need it?



So, I went through some tutorials and understood the term as explained below.

DC/OS (the DataCenter Operating System) is an open-source, distributed operating system based on the Apache Mesos distributed systems kernel.

A system which runs over a cluster of different machines/processors.

In technical words, it includes a group of agent nodes that are coordinated by a group of master nodes. It uses the required resources from all the attached servers, as per requirement.

For instance, say we build a small cluster of 10 laptops/machines with powerful processors, large RAM and storage, and of course a strong network connection. We can now call this system a distributed system, and we can install an application/software that runs on the cluster.

Now we can install DC/OS on it and take the advantages of below explained features.

  • DC/OS is an excellent open-source platform to run containers, micro-services and stateful Big Data applications.
  • DC/OS manages multiple distributed systems (i.e. containers, services, etc.), providing fast installation of these complex distributed systems, ready to run in production.
  • Installed by hand, Big Data applications like Cassandra, Jenkins, Kafka and many more take a lot of effort; on DC/OS, installation is a click-of-a-button task.
  • DC/OS has frameworks built in for high availability and fault tolerance, so we don’t need to reinvent the wheel for every new application we build.
  • DC/OS helps to simplify your datacenter, e.g. it can turn 1000 machines into a single logical computer. Provided with a simple GUI, automated placement of tasks and intelligent workload scheduling, you can increase utilisation and drastically reduce the costs, and hence the manpower, needed to operate your DC.

You can imagine the architecture by looking at the image below. It shows how things are managed in DC/OS.


Besides that DC/OS also provides a nice interface to manage and monitor your service and applications.

With everything written about DC/OS, you will have noticed the word “containers”. And I found that the container is an idea which changed everything in the DevOps world a few years ago.

Then, what’s a container?


So, the legacy system says: we have virtual machines, we install a server on them (that may be Apache Tomcat, WebLogic, WebSphere, etc.), and then we deploy our service on it, given all the integral resources like the file system or databases. As simple as that. Then why do we need containers? I’ll explain the difference here, and then how it changed the idea of DevOps and how it is related to DC/OS.

 A container ( if I say in very simple words ) is a sandbox for a process.

A sandbox that gives the process a separate namespace and uses cgroups (a Linux kernel feature that limits, accounts for, and isolates the resource usage of a collection of processes) to restrict what the process is able to do.

The life cycle of the process is tightly coupled with the life cycle of its container, and it can only see the processes and resources that lie in the same container. The process starts when the container is started, and the container ends when the process exits. That is typically what a container is supposed to be.

For instance, we have a processor with different processes running on it. These processes share almost everything on this processor, i.e. the address namespace, the process namespace, etc.

Now let’s separate out these processes and give each of them a container, so the scene would look like this:


And every time I say “container”, I mean a container image. A container image is something like a binary representation of some bits of a file system written somewhere on a disk, in the same way as a Virtual Machine Disk (VMDK) or an Open Virtual Appliance (OVA).

Every image can have a child image, a grand-child image, and so on, depending upon the layering requirements.

With that said, images can be arranged in a specific hierarchy, much like the notion of binary snapshots, which gives some further advantages. For instance, it allows you to share images: you can pull any binary state from the hierarchy and reuse the same state for other things, and you don’t need to shift entire application stacks around in a single file system, which would be a disaster.

Another big advantage is that it allows you to concentrate on specific things in specific places and keep track of where they are.

In 2013, a beautiful Linux-based container technology called Docker was introduced by dotCloud, which later became Docker Inc. To understand Docker and get a good basic idea of how to start, you can visit Deploying Microservices on Docker.

So, as a conclusion, I hope we can now relate things and differentiate between the previous system we were working on and the new DC/OS-style containerisation.



Posted in Scala

Scala Best Practices: SAY NO TO RETURN

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” – Martin Fowler

Writing readable and understandable code is often made out to be a big deal, but I feel it’s not. One should just start following the code conventions right from the start of their career. They’re simple rules, just like good habits.

Talking about Scala coding conventions, the most basic practice we often come across is: MUST NOT USE RETURN. But rarely do they tell you WHY NOT! So, let’s dig into the world of RETURN and find out why not to use it.
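As a small taste of the problem (a minimal sketch; the full reasoning is in the rest of the post): return inside a lambda is not a local exit. It compiles to a non-local return from the enclosing method, implemented with an exception, which can silently change what your code computes:

```scala
object ReturnPitfall {
  // Looks like it skips negatives and sums the rest, but the `return`
  // inside the lambda is a NON-LOCAL return: it exits sumPositives
  // itself at the first negative number, dropping everything after it.
  def sumPositives(xs: List[Int]): Int =
    xs.foldLeft(0) { (acc, x) =>
      if (x < 0) return acc // exits the whole method, not just the lambda
      acc + x
    }

  // The idiomatic version: no return, just an expression.
  def sumPositivesIdiomatic(xs: List[Int]): Int =
    xs.filter(_ >= 0).sum
}
```

For List(1, 2, -1, 10) the first version stops at -1 and yields 3, while the idiomatic version yields 13.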

Continue reading

Posted in Scala

Store Semantic Web Triples into Cassandra

The Semantic Web is the next level of web searching, where data is more important and should be well defined. The Semantic Web is needed to make web search more intelligent and intuitive, getting at what the user actually requires. You can find some interesting points on the Semantic Web here.

A triple is the atomic entity in RDF. It is composed of subject-predicate-object, linking a subject and an object with the help of a predicate. You can find some interesting points on triples here.
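As a minimal sketch (the case class and data are illustrative, not Quetzal's actual model), a triple and a simple lookup over a set of triples can be written as:

```scala
// An RDF triple: subject-predicate-object, where the predicate
// links the subject to the object.
case class Triple(subject: String, predicate: String, obj: String)

object TripleSketch {
  val triples = List(
    Triple("Alice", "knows", "Bob"),
    Triple("Alice", "livesIn", "Delhi"),
    Triple("Bob", "livesIn", "Pune")
  )

  // Follow a predicate from a subject to its objects.
  def objectsOf(subject: String, predicate: String): List[String] =
    triples.collect { case Triple(`subject`, `predicate`, o) => o }
}
```

A SPARQL query is essentially a pattern match of this kind over a large store of such triples.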

RDF stands for Resource Description Framework. It is a framework used for representing all information about a resource as a graph. An RDF store is used for storing triples and is queried with SPARQL; internally, the RDF store creates some tables and, on the basis of those tables, converts SPARQL queries into normal SQL queries. For that, it uses Quetzal.

Now there will be some questions, and one of them is: what are we doing here?

We are trying to store the triples into Cassandra, in the same way that Quetzal stores them into Postgres after creating the tables. Quetzal creates lots of tables for storing the triples on the basis of certain conditions. The tables created by Quetzal: Continue reading

Posted in Scala, Cassandra, akka-http