Stateful stream processing with Apache Flink(part 1): An introduction

Reading Time: 4 minutes

Apache Flink, a 4th generation Big Data processing framework provides robust stateful stream processing capabilities. So, in a few parts of the blogs, we will learn what is Stateful stream processing. And how we can use Flink to write a stateful streaming application.

What is stateful stream processing?

In general, stateful stream processing is an application design pattern for processing an unbounded stream of events. Stateful stream processing means a “State” is shared between events(stream entities). And therefore past events can influence the way the current events are processed.

Let’s try to understand it with a real-world scenario. Suppose we have a system that is responsible for generating a report. It comprising the total number of vehicles passed from a toll Plaza per hour/day. To achieve it, we will save the count of the vehicles passed from the toll plaza within one hour. That count will be used to accumulate it with the further next hour’s count to find the total number of vehicles passed from toll Plaza within 24 hours. Here we are saving or storing a count and it is nothing but the “State” of the application.

Might be it seems very simple, but in a distributed system it is very hard to achieve stateful stream processing. Stateful stream processing is much more difficult to scale up because we need different workers to share the state. Flink does provide ease of use, high efficiency, and high reliability for the state management in a distributed environment.

What is state in Flink?

In Flink, State is a Snapshot of an operator at any particular time, which remembers information about past input/events. And using it to influence the processing of future input. A state will know everything what is happened in the application till a particular point of time. A state can be stored and accesses in many different places including program variables, local files, or embedded, or external databases. A typical stateful streaming application in Flink looks like-

Since Flink is a distributed system, the local state needs to be protected against failures i.e., avoid data loss in case of an application or machine failure. Flink guarantees this by periodically writing a consistent checkpoint of the application state to a remote and durable storage.

Where the state can be use in Flink applications?

  • To Search certain event patterns: In an application, we might search for certain event patterns that have happened so far. Since a state store sequence of events encountered so far, we can search for those pattern in a state.
  • To train a machine learning Model:  States are widely used as a building block in machine learning models. In machine learning, we will do several transformation iterations on a single dataset. We have an input dataset then we apply transformation 1 on it, then on a result of it apply transformation 2 and the same for n number of iterations. In this scenario of a machine learning model, the state will hold the current version of the model parameter.
  • To manage historic data: State can be used when historic data need to be managed. In that case, the state allows efficient access to the events that occurred in the past.
  • To achieve fault tolerance with check-pointing: States play an important role in Flink applications to recover from any kind of node failure. In case of any failure, if our system is fault-tolerant and we are saving the state of that application then we can restart processing exactly from the same check-pointing where the system got corrupted. We will discuss Flink’s check-pointing and fault tolerance mechanism in detail in another blog.
  • To re-scale the Jobs: Stateful stream processing is much more difficult to scale up. Because we need the different workers to share the state. But Flink supports scaling of a stateful application by redistributing the state to its worker machines.

Rescaling Stateful Stream Processing Jobs

Flink is a massively parallel distributed system. The same operator will run parallelly on different datasets on different machines of cluster. Now suppose we have to re-scale the jobs i.e., increasing the number of parallel operator instances in the job. Let’s try to understand this with an example- Suppose currently there are 3 instances of an operator running on different nodes. We want to scale it to 5 instances.

In Flink, re-scaling is done by redistributing the “state” to its worker machines. All of the previous 3 operator instances maintaining their state within them and the same is periodically saved in the persistence storage like HDFS or any other state backend(will discuss in another blog).

When we pass a re-scaling request then Flink will take the saved state from HDFS and then redistributing it to five instances.

So, this is all about the introduction part of the Stateful stream processing with Apache Flink. In the next part of this blog, we will cover different categories of state in Flink. We will use its various primitives in our Flink applications that give us access to the various type of state according to our use case.

Thanks for reading and stay tuned for the next part!!


Leave a Reply