Is Apache Flink the future of Real-time Streaming?

Reading Time: 5 minutes

In our last blog, we discussed the latest version of Spark, i.e. 2.4, and the new features it comes with.
While exploring various approaches to improve our performance, we got the chance to look into one of the major contenders in the race: Apache Flink.

Apache Flink is an open-source streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It started as a research project called Stratosphere; a fork of Stratosphere became what we now know as Apache Flink. In 2014, it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project.

It is a scalable data analytics framework that is fully compatible with Hadoop, and it can handle both stream processing and batch processing with ease.
In this blog, we will try to get some idea of what Apache Flink is and how it differs from Apache Spark.
If you want to know more about Apache Spark, you can go through some of our blogs on Spark RDDs and Spark Streaming.

 

 

Apache Flink vs Apache Spark:

 

Apache Spark and Apache Flink are both open-source, distributed processing frameworks that were built to overcome the latency of Hadoop MapReduce in fast data processing.
Both Spark and Flink support in-memory processing, which gives them a distinct speed advantage over other frameworks.

By the time Flink came along, Apache Spark was already the de facto framework for fast, in-memory big data analytics in a number of organizations around the world, which made Flink appear superfluous. But keep in mind that Apache Flink is closing this gap by the minute; more and more projects are choosing it as it becomes a more mature project.


When to choose Apache Spark:

  • Even when it comes to real-time processing of incoming data, Flink does not yet stand up against Spark in practice, though it certainly has the capability to carry out real-time processing tasks.
  • If you don't need bleeding-edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project, it has a bigger user base, more training materials, and more third-party libraries.
  • If you want to move over to a somewhat "more reliable" technology, you would choose Spark, as it has a wider and more active community that is constantly growing.

When to choose Apache Flink:

  • When it comes to speed, Flink gets the upper hand, as it can be programmed to process only the data that has changed. Its stream processing engine handles rows upon rows of data in real time, rather than in the micro-batches of Spark's processing model, and this is what makes Flink faster than Spark for such workloads.
  • If you need to do complex stream processing, then using Apache Flink would be highly recommended.
  • Moreover, if you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.

 

 

Features of Apache Flink:

 

  • Stream processing
    Flink is a true streaming engine that can process live streams with sub-second latency.
  • Easy and understandable programming APIs
    Flink's APIs are designed to cover all the common operations, so programmers can use them efficiently.
  • Low latency and high performance
    Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate, and it processes data at such speed that it is often referred to as the 4G of Big Data.
  • Fault tolerance
    The fault tolerance mechanism provided by Apache Flink is based on Chandy-Lamport distributed snapshots and gives strong consistency guarantees. A failure of hardware, a node, software or a process therefore does not bring the whole cluster down.
  • Ease of use
    Flink's APIs are easier to program against than MapReduce, and Flink applications are easier to test compared to Hadoop.
  • Memory management
    The memory management in Apache Flink provides control over how much memory is used for certain runtime operations. Since Flink works on managed memory, it rarely runs into out-of-memory exceptions.
  • Iterative processing
    Apache Flink also provides dedicated support for iterative algorithms (machine learning, graph processing).
  • Scalable
    Flink is highly scalable; as requirements grow, the Flink cluster can be scaled out.
  • Easy integration
    Apache Flink integrates easily with the rest of the open-source data processing ecosystem: it can work with Hadoop, stream data from Kafka, and run on YARN.
  • Exactly-once semantics and custom state
    Another important feature of Apache Flink is that it can maintain custom state during computation, and its checkpointing mechanism keeps that state consistent with exactly-once guarantees (see the sketch after this list).
  • Rich set of operators
    Flink has lots of pre-defined operators to process data; all the common operations can be done using these operators.
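The exactly-once guarantee and the custom state mentioned above go hand in hand: Flink periodically checkpoints operator state and restores it consistently after a failure. Below is a minimal, hypothetical sketch (the object name, job name and numbers are made up for illustration) of how checkpointing can be enabled and how a per-word running count can be kept in Flink-managed keyed state, using the same flink-streaming-scala dependency as the WordCount example later in this post:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object StatefulCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // take a checkpoint every 10 seconds with exactly-once guarantees,
    // so the keyed state below survives failures without double counting
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

    env
      .fromElements("flink", "spark", "flink")
      .keyBy(word => word)
      // mapWithState keeps a running count per key in Flink-managed state
      .mapWithState[(String, Long), Long] { (word, countSoFar) =>
        val next = countSoFar.getOrElse(0L) + 1
        ((word, next), Some(next))
      }
      .print()

    env.execute("StatefulCountSketch")
  }
}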

 

Simple WordCount Program in Scala

 

 


The example project contains a WordCount implementation, the "Hello World" of Big Data processing systems. The goal of WordCount is to determine the frequencies of words in a text.

First, we need to add the dependency for Flink streaming in Scala to build.sbt:

libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.7.0"

Note: in our example, we are using the latest version of the dependency, i.e. 1.7.0.
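For reference, a minimal build.sbt for this demo could look something like the snippet below (the project name and Scala version are just assumptions; Flink 1.7.0 ships artifacts for both Scala 2.11 and 2.12, so pick whichever matches your project):

name := "flink-wordcount-demo"

version := "0.1"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.7.0"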
Next, import the required classes into your code:

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

Here, ParameterTool is used because it provides utility methods for reading and parsing program arguments from different sources.
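As a quick, hypothetical illustration of what ParameterTool gives us (the object name and paths below are made up), it lets us check for and read arguments with optional defaults:

import org.apache.flink.api.java.utils.ParameterTool

object ParamsDemo {
  def main(args: Array[String]): Unit = {
    // e.g. run with: --input /tmp/words.txt
    val params = ParameterTool.fromArgs(args)
    val input = params.get("input", "no input given") // falls back to a default value
    println(s"has input? ${params.has("input")}, input = $input")
  }
}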

Once we have made all the imports, we need two major things:

  • Input Parameter
val params = ParameterTool.fromArgs(args)
  • Execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

Once we have the environment and the params, we need to make these parameters globally available to the job by adding the following:

env.getConfig.setGlobalJobParameters(params)

Now it is time to get the input data for streaming; we can either provide an input file or fall back to a predefined data set.





val text =
  // read the text file from the given input path
  if (params.has("input")) {
    env.readTextFile(params.get("input"))
  } else {
    println("Executing WordCount example with default input data set.")
    println("Use --input to specify file input.")
    // get the default test text data
    env.fromElements(WordCountData.WORDS: _*)
  }
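If you do not have the original WordCountData object handy, a minimal stand-in (the sample lines below are just placeholder text, not the data set shipped with Flink's examples) is enough to run the demo:

object WordCountData {
  // a tiny, made-up default data set used when no --input is given
  val WORDS: Array[String] = Array(
    "to be or not to be that is the question",
    "whether tis nobler in the mind to suffer"
  )
}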

Once we have the input data, we can apply various operations on it to obtain the count for each word.





val counts: DataStream[(String, Int)] = text
  // split up the lines into pairs (2-tuples) containing: (word, 1)
  .flatMap(_.toLowerCase.split("\\W+"))
  .filter(_.nonEmpty)
  .map((_, 1))
  // group by the tuple field "0" and sum up tuple field "1"
  .keyBy(0)
  .sum(1)
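Since this is an unbounded stream, the counts above are emitted as continuously updating running totals. If you would rather count words per fixed interval, a windowed variant is a small change to the same pipeline (the 5-second window size here is an arbitrary choice for illustration):

import org.apache.flink.streaming.api.windowing.time.Time

val windowedCounts: DataStream[(String, Int)] = text
  .flatMap(_.toLowerCase.split("\\W+"))
  .filter(_.nonEmpty)
  .map((_, 1))
  .keyBy(0)
  // count words per 5-second processing-time window instead of a running total
  .timeWindow(Time.seconds(5))
  .sum(1)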

Now that we have the counts for each word, we can either write the output to a file or print it to the console.





if (params.has("output")) {
counts.writeAsText(params.get("output"))
} else {
println("Printing result to stdout. Use –output to specify output path.")
counts.print()
}
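A small detail worth knowing: writeAsText runs with the environment's parallelism, so by default the output path becomes a directory containing one part file per parallel task. If you would rather have a single output file for this small demo, one option (an assumption about what you want, not something the example requires) is to lower the parallelism of the sink:

// write the result as a single file instead of one part file per parallel task
counts.writeAsText(params.get("output")).setParallelism(1)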

Finally, to execute the program, simply call the execute method with a job name:

env.execute(jobName = "WordCount01")


Now, to test this simple program, just run the sbt run command from your terminal and you are ready to go.
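If you want to exercise the --input and --output handling we added above, the arguments can be passed straight through sbt (the paths below are just placeholders):

sbt "run --input /path/to/input.txt --output /path/to/output.txt"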

I hope you now have some idea of what Apache Flink is and how it differs from Apache Spark. We have also walked through a simple WordCount example using Flink.

We have also added the code for this demo as well as the sample data (WordCountData.WORDS) for your reference here.

For more details, you can refer to the official documentation for Apache Flink.

Hope this helps. Stay tuned for more 🙂




 

Written by 

Anmol Sarna is a software consultant having more than 1.5 years of experience. He likes to explore new technologies and trends in the IT world. His hobbies include playing football, watching Hollywood movies and he also loves to travel and explore new places. Anmol is familiar with programming languages such as Java, Scala, C, C++, HTML and he is currently working on reactive technologies like Scala, Cassandra, Lagom, Akka, Spark and Kafka.
