Distributed stream processing engines have been on the rise in recent years: first Hadoop became popular as a batch processing engine, then the focus shifted towards stream processing engines. Stream processing engines make it easier than ever to process data that arrives via a stream, and by using clustering they can process larger data sets in a timely manner. In this blog, we will briefly compare Spark Streaming, Flink, Kafka Streams, and Akka Streams.
What is Stream Processing?
- Stream processing is the processing of data in motion; in other words, computing on data directly as it is produced or received.
- Before stream processing, a database, a file system, or another form of mass storage was responsible for storing data. Applications would query the data or compute over it as needed.
- Stream Processing turns this paradigm around: The application logic, analytics, and queries exist continuously, and data flows through them continuously.
Apache Spark Streaming:
Spark is an open-source, distributed, general-purpose cluster computing framework. Spark’s in-memory data processing engine conducts analytics, ETL, machine learning, and graph processing on data in motion or at rest. It offers high-level APIs for Python, Java, Scala, R, and SQL.
The Apache Spark architecture is founded on Resilient Distributed Datasets (RDDs). These are immutable, distributed collections of data, which are split into partitions and allocated to workers. Because an RDD is immutable, the worker nodes cannot alter it; they process the data and output results.
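As an illustrative sketch (not from the original post), a minimal Spark Streaming word count might look like the following, assuming the spark-streaming dependency is on the classpath and a placeholder text source on localhost:9999:

```scala
// Hypothetical sketch of a Spark Streaming job; the host, port, and
// batch interval are assumptions for illustration.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
    // A 2-second micro-batch interval -- the micro-batch model is the
    // source of Spark Streaming's few-second latency.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999) // assumed source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each micro-batch is processed as a small RDD job, which is how Spark gets its fault tolerance by default.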
Pros:
- Apache Spark is a mature product with a large community, proven in production for many use cases, and it readily supports SQL querying.
- Supports the Lambda architecture, which comes free with Spark.
- High throughput, good for many use cases where sub-second latency is not required.
- Fault tolerant by default due to its micro-batch nature.
Cons:
- Latency of a few seconds, which rules out some real-time analytics use cases.
- Spark can be complex to set up and implement.
- Stateless by nature.
- Lags behind Flink in many advanced features.
Use cases:
- Processing social media feeds in real time to perform sentiment analysis.
- If latency is not a significant issue and you are looking for flexibility in terms of source compatibility, then Spark Streaming is the best option. It can run in its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.
Apache Flink:
Flink is based on the concept of streams and transformations. Data comes into the system via a source and leaves via a sink. We can use Apache Maven to build a Flink job. Like Spark, it also supports the Lambda architecture, but the implementation is quite the opposite of Spark’s. While Spark is essentially a batch engine, with Spark Streaming doing micro-batching as a special case of Spark batch, Flink is essentially a true streaming engine that treats batch as a special case of streaming with bounded data. Though the APIs in both frameworks are similar, the implementations have little in common. In Flink, each function like map, filter, reduce, etc. is implemented as a long-running operator.
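To make the source-transformation-sink pipeline concrete, here is a hypothetical sketch of a Flink streaming job (assuming flink-streaming-scala is on the classpath; the socket source on localhost:9999 is a placeholder):

```scala
// Illustrative Flink DataStream job; each chained operator below runs
// as a continuous, long-running operator rather than a micro-batch.
import org.apache.flink.streaming.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Source -> transformations -> sink.
    env.socketTextStream("localhost", 9999) // assumed source
      .flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)
      .print() // sink

    env.execute("FlinkWordCount")
  }
}
```

Because every record flows through the operators as it arrives, latency stays low without a fixed batch interval.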
Pros:
- The first true streaming framework, with advanced features such as event-time processing and watermarks.
- Low latency with high throughput, configurable according to requirements.
- It does not require manual optimization or adjustment to the data it processes.
Cons:
- A relatively new project with fewer production deployments than other frameworks.
- The community is not as big as Spark’s, but it is growing at a fast pace now.
- No known adoption of Flink Batch as of now; Flink is popular only for streaming.
Use cases:
- Detection and prevention of fraudulent credit card transactions in real time, and many other event-driven applications, suit Flink best.
- Flink provides very good support for continuous streaming as well as batch analytics.
- It provides good support for various applications related to data analytics.
Apache Kafka Streams:
Kafka is actually a message broker with really good performance, so all your data can flow through it before being redistributed to applications; Kafka works as a data pipeline. Kafka Streams typically supports one-record-at-a-time processing with millisecond latency. Due to its lightweight nature, it can be used in microservices-type architectures. It is no match for Flink in terms of performance, but it also does not need a separate cluster to run, and it is very handy and easy to deploy and start working with. One major advantage of Kafka Streams is that its processing is exactly-once end to end.
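As a hypothetical sketch of how lightweight this is, the following embeds a Kafka Streams topology in an ordinary JVM application (assuming kafka-streams is on the classpath; the application id, broker address, and topic names are placeholder assumptions):

```scala
// Illustrative Kafka Streams app: reads from one topic, transforms,
// writes to another -- all inside this single process.
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object UppercaseApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")     // assumed id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    // Build a small topology: read, transform, write back to Kafka.
    val builder = new StreamsBuilder()
    builder.stream[String, String]("input-topic")
      .mapValues(new ValueMapper[String, String] {
        override def apply(v: String): String = v.toUpperCase
      })
      .to("output-topic")

    // Runs as a plain JVM process -- no separate processing cluster needed.
    new KafkaStreams(builder.build(), props).start()
  }
}
```

Scaling out is just a matter of starting more instances of this same process with the same application id; Kafka rebalances the partitions among them.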
Pros:
- Very lightweight library, good for microservices and IoT applications.
- It requires no separate processing cluster.
- Scales easily by just adding Java processes; no reconfiguration is required.
- Supports stream joins; internally uses RocksDB for maintaining state.
Cons:
- Tightly coupled with Kafka; cannot be used without Kafka in the picture.
- Not meant for the heavy lifting done by streaming engines like Spark Streaming and Flink.
Use cases:
- Microservices and stand-alone applications that need embedded stream processing capabilities without depending on complex clusters.
- If latency is a significant concern and you must stick to real-time processing with millisecond time frames, then you should consider Kafka Streams.
- In cases of high scalability requirements, Kafka suits best, as it is hyper-scalable.
Akka Streams:
Akka Streams is a module built on top of Akka Actors to make the ingestion and processing of streams easy. It provides easy-to-use APIs to create streams that leverage the power of the Akka toolkit without explicitly defining actor behaviors and messages. This allows you to focus on your logic and forget about the boilerplate code required to manage actors. Akka Streams follows the Reactive Streams manifesto, which defines a standard for asynchronous stream processing.
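A minimal sketch of this API (not from the original post, and assuming akka-stream is on the classpath) shows how a stream is composed without any explicit actor code:

```scala
// Illustrative Akka Streams pipeline: a Source, some operators, a Sink.
// No actor behaviors or messages are defined by hand; backpressure
// between stages is handled automatically.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object StreamDemo {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("demo")

    Source(1 to 10)          // emit the numbers 1..10
      .map(_ * 2)            // transform each element
      .filter(_ % 3 == 0)    // keep only multiples of 3
      .runWith(Sink.foreach(println))
      .onComplete(_ => system.terminate())(system.dispatcher)
  }
}
```

Under the hood each stage is materialized onto actors, but the boilerplate of creating and wiring them is handled by the library.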
Pros:
- Akka Streams implements the Reactive Streams manifesto, which is great for achieving really low latency.
- Akka Streams provides a lot of operators for writing transformations over streams declaratively and easily.
- One of the best approaches to use when working with real-time or reactive systems.
Cons:
- Using Akka Streams imposes some overhead in getting up and running. It requires learning the DSL and also understanding what is going on under the hood to some extent.
- Unlike heavier “streaming data processing” frameworks, Akka Streams is neither “deployed” nor automatically distributed.
Use cases:
- Any system that needs high throughput and low latency is a good candidate for Akka Streams.
- If our application is best characterized as “real-time” or “reactive”, then we prefer Akka Streams.
- Akka Streams is also especially effective at making producer-consumer systems more resilient through the use of backpressure.
- Akka Streams is a higher-level way of writing Akka programs. You can express your application concisely by composing small, modular “graph stages” (i.e. sources, flows, and sinks) into an asynchronous streaming application.
Hence, which streaming engine should we use for the best output? The answer is that it simply depends on the use case. It is important to keep in mind that no single processing framework is a silver bullet for every use case; every framework has strengths and limitations. If we understand the strengths and limitations of the frameworks, along with our own use cases, it becomes easier to pick the right option, or at least to narrow down the available choices.
Hope this blog was helpful.