If you are following the Big Data space, especially from a Scala perspective, you will have noticed a flood of blogs, tweets and more blogs comparing the two. The two being Spark and Flink. You will also find these two compared with Samza and Storm. Incidentally, all of them are top-level Apache projects. For the purpose of this blog, let us stick to the shinier toys.
At Knoldus, we have been doing quite some work in the Big Data space. Historically, we have been a Scala company, and even before Spark started getting attention, our Big Data products were based on home-grown Scala and Akka implementations. If you dig into our blogs from the 2011-12 time frame, I bet you would see quite a few of our adventures in the Big Data space without the options available today. Cut to today, most of our implementations inherently use Spark. We are Lightbend and Databricks partners and have implemented quite a few complex solutions on these tech stacks. That said, with the kind of attention Flink seems to be getting, we decided to look into it further and see in which scenarios Flink would serve us better than Spark, and vice versa. In what will hopefully be a series of blogs, we will try to touch upon various aspects of Flink.
For starters, let us talk about what the two have in common:
- Both provide a guarantee that every record would be processed exactly once.
- Both provide high throughput with low latency as compared to Storm.
- Both have a low overhead for fault tolerance.
- Both are capable of running in standalone mode.
- Both have support for machine learning and graph processing.
- Both can do batch processing
The main difference that you would hear and read about between the two is the way in which streaming is handled, primarily the computation model. While Spark has adopted micro-batching, Flink has adopted a continuous-flow, operator-based streaming model.
The practical difference is that while Spark is pseudo real time, Flink is real time. Spark collects incoming data into small batches and processes each batch only once it is complete, which gives the feel of continuous processing, whereas Flink processes each event as it arrives.
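To make the latency difference concrete, here is a minimal sketch in plain Python. It is not Spark or Flink code; it only models the idea that a micro-batch engine handles an event when its batch closes, while a per-event engine handles it on arrival. The function names, the 500 ms batch interval and the arrival times are assumptions made for illustration, and processing time itself is ignored.

```python
from typing import List


def micro_batch_latencies(arrivals_ms: List[int], batch_interval_ms: int) -> List[int]:
    """Waiting time of each event when work only starts at batch boundaries."""
    latencies = []
    for t in arrivals_ms:
        # An event arriving at time t belongs to the batch that closes at the
        # next multiple of the batch interval (hypothetical, simplified model).
        batch_close = ((t // batch_interval_ms) + 1) * batch_interval_ms
        latencies.append(batch_close - t)
    return latencies


def per_event_latencies(arrivals_ms: List[int]) -> List[int]:
    """Waiting time of each event when it is handled as soon as it arrives."""
    return [0 for _ in arrivals_ms]


# Hypothetical arrival times in milliseconds.
arrivals = [0, 120, 480, 510, 990]

print(micro_batch_latencies(arrivals, 500))  # [500, 380, 20, 490, 10]
print(per_event_latencies(arrivals))         # [0, 0, 0, 0, 0]
```

In this simplified model an event can wait up to one full batch interval before any work starts on it, which is the overhead that matters when an alert has to go out immediately.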
Stream imperfections like out-of-order events are handled in Flink using the framework’s event time processing support.
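The idea behind event-time processing can be sketched in a few lines of plain Python. This is a simplified model and not Flink's API: events are assigned to tumbling windows by the timestamp they carry rather than by arrival order, and a "watermark" (here assumed to be the largest timestamp seen so far minus a tolerated out-of-orderness) decides when a window is complete. The function name, the 10 ms window, the 5 ms tolerance and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict


def event_time_windows(events, window_ms, max_out_of_order_ms):
    """Group (event_time_ms, value) pairs, given in arrival order, by event time.

    Returns (fired, late): fired is a list of (window_start, values) in the
    order the windows were emitted; late holds events whose window had
    already been emitted when they arrived.
    """
    open_windows = defaultdict(list)
    fired, late = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // window_ms) * window_ms
        if start + window_ms <= watermark:
            # The window for this event has already been emitted.
            late.append((event_time, value))
            continue
        open_windows[start].append(value)
        # Simplified watermark: newest timestamp seen minus the tolerance.
        watermark = max(watermark, event_time - max_out_of_order_ms)
        # Emit every open window that ends at or before the watermark.
        for s in sorted(w for w in open_windows if w + window_ms <= watermark):
            fired.append((s, open_windows.pop(s)))

    # End of input: emit whatever is still open.
    for s in sorted(open_windows):
        fired.append((s, open_windows.pop(s)))
    return fired, late


# Hypothetical events in arrival order; 'c' and 'e' arrive out of order.
events = [(1, "a"), (12, "b"), (7, "c"), (16, "d"), (3, "e"), (25, "f")]
fired, late = event_time_windows(events, window_ms=10, max_out_of_order_ms=5)
print(fired)  # [(0, ['a', 'c']), (10, ['b', 'd']), (20, ['f'])]
print(late)   # [(3, 'e')]
```

Event 'c' arrives after 'b' but still lands in the first window because it is within the tolerance, while 'e' arrives after its window has been emitted and is reported as late. A real engine offers more options for such late data than this sketch does.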
How does it matter, you may ask?
Here is a real scenario which we encountered with a mining industry ERP system that was processing a lot of mine events to make predictions. A lot of events are analysed in real time and action is taken on the basis of the results. These streams of events were coming from various departments and business units. For instance, a stream of events coming from a machine helps in deciding when it should go into maintenance. This could well be done with micro-batching.
Now consider a stream of events leading to the prediction of a fire in the mine. This would not sit well with the latency of micro-batching. As soon as a fire event is predicted, a lot of real-time alerts need to be sent out to all departments to stop work and begin evacuation. This needs to happen in real time or lives are lost.
Here, we had to fall back on a home-grown solution using Akka actors with PinnedDispatchers to handle this scenario effectively. This is just one case where the benefits of real-time processing outweigh micro-batching.
That said, there are voices in the community which suggest that though Spark does micro-batching, it does not affect application performance. For latency-critical cases like the one above, this is incorrect.
The use cases that Flink handles better than Spark are those where we need real-time streaming. That is a perfect differentiation. Cases like fraud detection, stock monitoring, traffic management and online recommendation engines are better handled with Flink than with the micro-batching of Spark.
So is it the death knell for Spark?
Not at all! Spark shines in its maturity, its data source integration, its SQL-like interface with Spark SQL, and its iterative processing, which lends itself easily to machine learning. Last but not least, industry adoption and applications in production make Spark a real winner at this time. That said, if your use case is based on real-time streaming, then you would have to side with Flink despite all the advantages of Spark.
We will compare and contrast the two as we go ahead. As Big Data integrators, we will compare them on their technical capability to solve business problems. Stay tuned.