Streaming in Spark, Flink and Kafka


There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka.

Both Spark Streaming and Flink provide an exactly-once guarantee: every record is processed exactly once, eliminating any duplicates. Both deliver very high throughput compared to other processing systems such as Storm, and in both engines the overhead of fault tolerance is low. Kafka clients, by contrast, can be configured for at-most-once, at-least-once, or exactly-once message processing needs (a configuration sketch follows the list below). Kafka gets used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data.
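
Which delivery guarantee a Kafka client gives you is largely a matter of configuration. As a minimal sketch of the at-least-once style mentioned above, assuming the plain kafka-clients library, a broker at localhost:9092, a consumer group, and a topic named events (all invented for illustration): disabling auto-commit and committing offsets only after the records are processed yields at-least-once consumption.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object AtLeastOnceSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "sketch-group")            // assumed consumer group
    props.put("enable.auto.commit", "false")         // we commit manually instead
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events")) // assumed topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach(r => println(s"${r.key} -> ${r.value}"))
      // Committing only after processing means a crash before this line
      // re-delivers the records: at-least-once, never silently lost.
      consumer.commitSync()
    }
  }
}
```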

There is a need for real-time stream processing because much data arrives as a continuous flow of events: cars in motion emitting GPS signals, financial transactions, signals interchanged between cell phone towers, web traffic (session tracking, understanding user behaviour on websites), and measurements from industrial sensors. For all these kinds of data, stream processing is the better way to deal with them. Stream processing is challenging, though. Consistency and fault tolerance are hard to maintain: given the dynamism of this kind of data generation, the system must keep up with the flow and handle interruptions of connectivity. You also need to consume results from the stream processor, which means answering complex queries over windows, so you need rich windowing definitions and different ways to pull out, roll up and aggregate information. And because you don’t want the system to get bogged down, a stream processor needs low latency and high throughput.

The point where Spark Streaming and Flink differ is their computation model: Spark has adopted micro-batching, while Flink has adopted a continuous, operator-based streaming model. As far as window criteria go, Spark offers time-based windows, whereas Flink additionally supports record-based or custom user-defined windows, as the sketch below illustrates.
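
As a hedged example of a record-based (count) window in Flink’s Scala DataStream API, with invented data and window size: the window closes after a number of elements per key, regardless of how much wall-clock time those elements span, something Spark’s purely time-based windows do not express directly.

```scala
import org.apache.flink.streaming.api.scala._

object CountWindowSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events: DataStream[(String, Int)] =
      env.fromElements(("a", 1), ("a", 2), ("b", 5), ("a", 3), ("b", 7))

    // Record-based window: emit a sum after every 2 elements per key,
    // independent of how long it takes those elements to arrive.
    events
      .keyBy(_._1)
      .countWindow(2)
      .sum(1)
      .print()

    env.execute("count-window-sketch")
  }
}
```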

Flink and Spark are both general-purpose data processing platforms and top-level projects of the Apache Software Foundation (ASF). They have a wide field of application and are usable for dozens of big data scenarios. Both are capable of running in standalone mode, yet many users run them on top of Hadoop (YARN, HDFS). They share strong performance due to their in-memory nature.

Let’s have a look at Spark, Flink and Kafka, and their advantages.

Apache Spark :

Spark is an open source cluster computing framework with a large global user base. It is written mostly in Scala, with APIs in Scala, Java, Python and R, and gives programmers an Application Programming Interface (API) built on a fault-tolerant, read-only multiset of distributed data items (the Resilient Distributed Dataset, or RDD). In the short time since its 1.0 release (May 2014), it has seen wide acceptance for real-time, in-memory, advanced analytics, owing to its speed, ease of use and ability to handle sophisticated analytical requirements.

Advantages of Spark :

Apache Spark has several advantages over traditional Big Data and MapReduce-based technologies:

  • Speed – Spark can execute batch processing jobs 10 to 100 times faster than MapReduce.
  • Ease of Use – Apache Spark has easy-to-use APIs built for operating on large datasets.
  • Unified Engine – Spark can run on top of Hadoop, making use of its cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). However, it can also run independently of Hadoop, joining hands with other cluster managers and storage platforms (the likes of Cassandra and Amazon S3). It also comes with higher-level libraries that support SQL queries, data streaming, machine learning and graph processing.
  • Choice of Language – Spark doesn’t tie you down to a particular language and lets you choose from popular ones such as Java, Scala, Python, R and even Clojure.
  • In-memory Data Sharing – Different jobs can share data within memory, which makes Spark an ideal choice for iterative, interactive and event stream processing tasks; a sketch follows this list.
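
To make the in-memory sharing concrete, here is a minimal sketch (local mode and made-up data are assumptions for illustration) where one cached dataset feeds two separate jobs:

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // cache() keeps the computed data in memory across jobs
    val squares = sc.parallelize(1L to 1000000L).map(n => n * n).cache()

    // Both actions below launch separate jobs, but the second one
    // reads the cached data instead of recomputing the squares.
    println(s"sum of squares = ${squares.sum()}")
    println(s"even squares   = ${squares.filter(_ % 2 == 0).count()}")

    spark.stop()
  }
}
```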

Apache Flink :

Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.

Advantages of Flink :

  • Flink processes data streams as true streams: data elements are immediately “pipelined” through a streaming program as soon as they arrive. This allows flexible window operations on streams.
  • Better memory management – Explicit memory management avoids the occasional spikes found in the Spark framework.
  • Speed – Flink achieves faster speeds by allowing iterative processing to take place on the same node rather than having the cluster run each iteration as a separate job. Its performance can be tuned further by configuring it to re-process only the part of the data that has changed rather than the entire set, which offers up to a five-fold boost in speed compared to the standard processing algorithm. A sketch of native iteration follows this list.
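
As a hedged illustration of that native iteration support, here is a toy square-root computation on Flink’s batch DataSet API (the update rule and iteration count are invented); the loop runs inside one Flink job rather than as repeated job submissions from a driver:

```scala
import org.apache.flink.api.scala._

object IterateSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Ten rounds of Newton's method for sqrt(2), executed as ONE job:
    // the iteration lives inside the dataflow itself.
    val result = env.fromElements(1.0).iterate(10) { current =>
      current.map(x => (x + 2.0 / x) / 2.0)
    }

    result.print() // print() also triggers execution for DataSet programs
  }
}
```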

Apache Spark is considered a replacement for the batch-oriented Hadoop system, but it also includes a streaming component, Apache Spark Streaming. Contrast this with Apache Flink, a big data processing tool known for processing big data quickly, with low data latency and high fault tolerance, on large-scale distributed systems. Its defining feature is its ability to process streaming data in real time.

Apache Kafka :

Apache Kafka is a distributed streaming platform. For more complex transformations, Kafka provides a fully integrated Streams API. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.
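
As a minimal sketch of such a transformation, assuming the kafka-streams-scala wrapper library, a broker at localhost:9092, and topic names input-topic/output-topic of our own invention:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object UppercaseStream {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-sketch")  // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker

    val builder = new StreamsBuilder()
    // Consume an input stream, transform each record, produce an output stream
    builder
      .stream[String, String]("input-topic")
      .mapValues(_.toUpperCase)
      .to("output-topic")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```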

Kafka’s Streams API helps with many of the hard problems a stream processing application faces:

  • handling out-of-order data,
  • reprocessing input as code changes,
  • performing stateful computations, etc.

The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka itself for stateful storage, and uses the same consumer group mechanism for fault tolerance among the stream processor instances. A sketch of such a stateful computation follows.
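
Here is a hedged sketch of a stateful word count, reusing the imports and StreamsBuilder setup from the earlier Streams example; the topic names sentences and word-counts are invented:

```scala
// Reuses the StreamsBuilder and imports from the earlier sketch.
// The running counts live in a local state store whose changelog
// is stored back in Kafka itself, which is how failures are recovered.
builder
  .stream[String, String]("sentences")
  .flatMapValues(_.toLowerCase.split("\\W+"))
  .groupBy((_, word) => word) // re-key the records by word
  .count()                    // stateful: one running counter per word
  .toStream
  .to("word-counts")
```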

Comparison between Spark and Flink:

Flink and Spark are in-memory data processing frameworks, not databases: they can write their data to permanent storage, but the whole point of streaming is to keep data in memory and analyze it while it is current. All of this lets programmers write big data programs over streaming data. They can take data in whatever format it arrives in, join different sets, map it to key-value pairs, and then run calculations over those pairs to reduce them to some final calculated value. They can also feed these data items into machine learning algorithms to make projections (predictive models) or discover patterns (classification models). A minimal sketch of that map-then-reduce pattern follows.
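
A hedged Spark version of the pattern, with made-up page-visit data and local mode assumed for illustration:

```scala
import org.apache.spark.sql.SparkSession

object PageCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("page-counts")
      .master("local[*]") // assumption: local mode for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Map raw records to key-value pairs, then reduce the pairs per key
    val visits = sc.parallelize(Seq("home", "cart", "home", "checkout", "home"))
    val counts = visits.map(page => (page, 1)).reduceByKey(_ + _)

    counts.collect().foreach(println) // e.g. (home,3), (cart,1), (checkout,1)
    spark.stop()
  }
}
```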

Given below is a comparison between Spark and Flink:

  • Data processing

Spark processes data in micro-batches while Flink processes streaming data in real time: Spark processes chunks of data, known as RDDs, while Flink processes data row after row as it arrives. So, while a minimum latency is always there with Spark, it is not so with Flink.

  • Memory Management

Flink can automatically adapt to varied datasets, but Spark needs its jobs to be optimized and adjusted manually for individual datasets. Spark also does manual partitioning and caching, so expect some delay in processing. Flink takes a different approach to memory management: it pages out to disk when memory is full, which is what happens with Windows and Linux too. Spark can crash a node when it runs out of memory, but it does not lose data, since it is fault tolerant.

  • Data Flow

Flink is able to provide intermediate results of its data processing whenever required. While Spark follows a procedural programming model, Flink follows a distributed data flow approach. So, whenever intermediate results are required, broadcast variables are used to distribute the pre-calculated results to all the worker nodes, as the sketch below shows.
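
A minimal sketch of the broadcast mechanism in Spark; the lookup table and exchange rates are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-sketch")
      .master("local[*]") // assumption: local mode for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Pre-calculated result shipped once to every worker node
    val rates = sc.broadcast(Map("EUR" -> 1.1, "GBP" -> 1.3)) // invented rates

    val amounts = sc.parallelize(Seq(("EUR", 100.0), ("GBP", 50.0)))
    val inUsd = amounts.map { case (ccy, amt) =>
      amt * rates.value.getOrElse(ccy, 1.0) // workers read the broadcast copy
    }

    inUsd.collect().foreach(println)
    spark.stop()
  }
}
```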

  • Command Line Interface (CLI)

Spark has CLIs in Scala, Python, and R. Flink does not really have a CLI, but the distinction is subtle.

To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier. Walking through data, running map and reduce processes, and doing that in stages, is how data scientists work.

Flink has a Scala CLI too, but it is not exactly the same. With Flink, you write code and then run print() to submit it in batch mode and wait for the output.  
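
For instance, a typical spark-shell session looks something like this (output abbreviated; the data is invented):

```scala
// In spark-shell, a SparkContext is already available as sc:
scala> val nums = sc.parallelize(1 to 10)
scala> nums.filter(_ % 2 == 0).collect()
// res1: Array[Int] = Array(2, 4, 6, 8, 10)
```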

  • Support for Other Streaming Products

Both Flink and Spark work with Kafka, the streaming platform originally developed at LinkedIn. Flink can also run Storm topologies.

  • Flink vs. the Kafka Streams API

The fundamental differences between a Flink program and a Streams API program lie in the way they are deployed and managed and in how the parallel processing, including fault tolerance, is coordinated:

  • Flink is a cluster framework: the framework takes care of deploying the application, either in standalone Flink clusters or using YARN, Mesos, or containers (Docker, Kubernetes).
  • The Streams API is a library that any standard Java application can embed, and hence it does not attempt to dictate a deployment method; you can deploy Streams applications with essentially any deployment technology.
  • The biggest difference between the two systems with respect to distributed coordination is that Flink has a dedicated master node for coordination, while the Streams API relies on the Kafka broker for distributed coordination and fault tolerance. In Apache Flink, fault tolerance, scaling, and even the distribution of state are globally coordinated by that dedicated master node.

Conclusion :

Besides the fact that the API of Apache Flink is easier to use than the API of Apache Spark, it has a more flexible windowing system than Apache Spark. It is also much faster than Apache Spark when network-attached storage (NAS) is used in the computing cluster. In terms of batch processing, Apache Flink is also faster, about twice as fast as Apache Spark with NAS. And Apache Flink has almost no latency in processing elements from a stream, compared to Apache Spark.

In summary, while there certainly is an overlap between the Streams API in Kafka and Flink, they differ in their architecture, and thus we see them as complementary systems. The Streams API makes stream processing accessible as an application programming model that applications built as microservices can benefit from; Flink, on the other hand, is a great fit for applications that are deployed in existing clusters and benefit from its throughput, latency, and batch processing capabilities.

 


 

