If you are following the Big Data space, especially from a Scala perspective, you would have noticed a slew of blogs, tweets and more blogs comparing the two. The two being Spark and Flink. That said, you would also find comparisons of these two with Samza and Storm. Incidentally, all of them are top-level Apache projects. For the purpose of this blog, let us stick to the shinier toys.

At Knoldus, we have been doing quite some work in the Big Data space. Historically, we have been a Scala company, and even before Spark started getting attention, our Big Data products were based on home-grown Scala and Akka implementations. If you dig into our blogs from the 2011-12 time frame, I bet you would see quite a few of our adventures in the Big Data space without the options available today. Cut to today, most of our implementations inherently use Spark. We are Lightbend and Databricks partners and have implemented quite a few complex solutions on the mentioned tech stacks. That said, with the kind of attention Flink seems to be getting, we decided to look into it further and see in which scenarios Flink would help us better than Spark, and vice versa. In a series of blogs which hopefully follow, we would try to touch upon various aspects of Flink.

For starters, let us talk about what the two have in common:


  1. Both guarantee that every record will be processed exactly once.
  2. Both provide high throughput with low latency as compared to Storm.
  3. Both achieve fault tolerance with low overhead.
  4. Both are capable of running in standalone mode.
  5. Both have support for machine learning and graph processing (MLlib and GraphX in Spark, FlinkML and Gelly in Flink).
  6. Both can do batch processing.

The main difference that you would hear and read about is the way in which streaming is handled, primarily the computation model. While Spark has adopted micro-batching, Flink has adopted a continuous-flow, operator-based streaming model.

The difference between the two is that while Spark is pseudo real time, Flink is real time. Spark collects data into small buckets and processes each bucket, which gives us the feel of continuous processing, whereas Flink does real continuous processing, handling each record as it arrives.
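As a sketch of that difference, here is how the same trivial pipeline looks in the two programming models. This is an illustrative example, not production code; the socket source on port 9999 is a hypothetical setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.flink.streaming.api.scala._

object MicroBatchVsContinuous {

  def sparkVersion(): Unit = {
    // Spark Streaming groups incoming records into one-second batches;
    // nothing is processed until a batch interval elapses, which puts a
    // floor on latency.
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.socketTextStream("localhost", 9999)
       .filter(_.contains("ALERT"))
       .print()
    ssc.start()
    ssc.awaitTermination()
  }

  def flinkVersion(): Unit = {
    // Flink pushes each record through the operator pipeline as soon as
    // it arrives; there is no batch interval involved.
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)
       .filter(_.contains("ALERT"))
       .print()
    env.execute("continuous")
  }
}
```

The code is nearly identical; the latency characteristics are not, and that is the whole point of the comparison.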

Stream imperfections like out-of-order events are handled in Flink using the framework's support for event-time processing and watermarks.
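A minimal sketch of that support, assuming a hypothetical MineEvent type that carries its own source timestamp:

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type: the timestamp records when the event happened
// at the source, not when it reached the pipeline.
case class MineEvent(sensorId: String, timestamp: Long, reading: Double)

object EventTimeExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val events: DataStream[MineEvent] = env.fromElements(
      MineEvent("s1", 1000L, 0.5),
      MineEvent("s1",  800L, 0.7)) // arrives late, out of order

    // Watermarks tolerate events that are up to 5 seconds out of order;
    // windows fire on event time, so the late event still lands in the
    // right window.
    events.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[MineEvent](Time.seconds(5)) {
        override def extractTimestamp(e: MineEvent): Long = e.timestamp
      })
      .print()

    env.execute("event-time")
  }
}
```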

How does this matter, you may ask?

This is a real scenario which we encountered with a mining-industry ERP system that was processing a lot of mine events to make predictions. A lot of events are analysed in real time and action taken on the basis of the results. These streams of events were coming from various departments and business units. For instance, one stream of events comes from a machine and helps in deciding when it should go into maintenance. This might reasonably be done with micro-batching.

Now consider a stream of events leading to the prediction of a fire in the mine; this would not really tolerate the latency of micro-batching. As soon as a fire event is predicted, a lot of real-time alerts need to be sent out to all departments to stop work and begin evacuation. This needs to happen in real time, or lives are lost.

Here, we had to fall back on a home-grown solution with Akka actors and PinnedDispatchers to handle this scenario effectively. This is just one case where real-time benefits would outweigh micro-batching.
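The essence of that fallback looks something like this (the actor and dispatcher names are hypothetical): a PinnedDispatcher dedicates a single thread to the alert actor, so an evacuation alert is never queued behind other work in a shared thread pool.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class AlertActor extends Actor {
  def receive = {
    // Fan out real-time alerts; println stands in for the actual
    // notification channels.
    case alert: String => println(s"EVACUATE: $alert")
  }
}

object FireAlerts extends App {
  // Dispatcher configuration that would normally live in application.conf:
  // a PinnedDispatcher requires the thread-pool executor.
  val config = ConfigFactory.parseString(
    """alert-pinned-dispatcher {
      |  type = PinnedDispatcher
      |  executor = "thread-pool-executor"
      |}""".stripMargin)

  val system = ActorSystem("mine", config.withFallback(ConfigFactory.load()))
  val alerter = system.actorOf(
    Props[AlertActor].withDispatcher("alert-pinned-dispatcher"),
    "alerter")

  alerter ! "fire predicted in shaft 3"
}
```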

That said, there would be voices in the community suggesting that though Spark does micro-batching, it does not affect application performance. In latency-critical scenarios like the one above, that is not true: the batch interval puts a floor on how quickly a record can be acted upon.

The use cases that Flink handles better than Spark are those where we need real-time streaming. That is the key differentiator. Cases like fraud detection, stock monitoring, traffic management and online recommendation engines are better handled with Flink than with the micro-batching of Spark.

So is it the death knell for Spark?

Not at all! Spark shines in its maturity, its data source integrations, the SQL-like interface of Spark SQL, and its iterative processing, which lends itself easily to machine learning. Last but not least, industry adoption and applications in production make Spark a real winner at this time. That said, if your use case is based on real-time streaming, then you would have to side with Flink despite all the advantages of Spark.

We would compare and contrast the two as we go ahead. As Big Data integrators, we would compare them on their technical capability to solve business problems. Stay tuned.

About the Author: Vikas Hazrati

Vikas is the Founding Partner @ Knoldus which is a group of software industry veterans who have joined hands to add value to the art of software development. Knoldus does niche Reactive and Big Data product development on Scala, Spark and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). To know more, send a mail to hello@knoldus.com or visit www.knoldus.com
