Windows operator: Heart of processing infinite streams in Flink

Reading Time: 3 minutes

Apache Flink is an open-source, distributed, Big Data framework for stream and batch data processing. Flink is based on the streaming first principle which means it is a real streaming processing engine and implements batching as a special case. Flink is considered to have a heart and it is the “Windows” operator. It makes Flink capable of processing infinite streams quickly and efficiently. Windows split the infinite stream into “buckets” of finite size, over which we can apply computations.

When we work on a batch application we have a finite amount of data. In contrast, in a stream application, the amount of data is potentially infinite. And the problem is that some functions can only be executed on a finite chunk of data. For example, we can execute map, filter operators on individual elements of an infinite stream but operators like min, max, or any group-level aggregations are not very useful when applied to the infinite stream. To tackle the infinite stream of incoming messages Flink implemented a concept called windows. So the idea is very simple, separate elements of infinite streams into streams of finite groups(windows) and then process these groups independently.

How window is created

Suppose we have an incoming infinite stream of data. Data in which is coming in regular or irregular timings. Windows in this data streams mean we are selecting a subset out of the data and processing it using any transformation like map, fold, reduce e.t.c. A window is created as soon as the first element that should belong to this window arrives. The window ends whenever the condition is met. Then the other window starts and it ends as soon as the condition met.

The condition can be the time in seconds passed or when an event is passed or when the count of an entity in a window is reached to maximum limit etc. These conditions basically allow us to decide when we configure a window operator, how we can assign elements to those windows. For that, we have to use Flink’s window assigners which is responsible for assigning each incoming element to one or more windows.

Types of Windows

Flink has two types of Windows: Keyed and Non keyed window.

Non Keyed window

Non keyed window simply separate elements of infinite streams into the stream of a finite group. It results in the non-parallel processing of a single stream. The syntax for non keyed window is:

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"
Keyed Window

This operator not only groups the finite number of elements into windows but in each window, it also splits elements into logical streams. It groups elements to process in parallel. Let’s take a look at an example.

We have a stream where every element has a shape associated with it. Using a keyed window we can separate elements by shape and get 3 logical stream one for each shape. The original stream is not simply split into three infinite streams. In every logical stream, elements are groups into finite chunks. And these logical streams can then be process parallel, independently from each other. The syntax for the keyed window is:

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

The only difference is the keyBy(...) call for the keyed streams and the window(...) which becomes windowAll(...) for non-keyed streams.

After specifying whether our stream is keyed or not, the next step is to define a window assigner. In the next blogs, we will learn the various pre-defined window assigners provided by Flink with an example of processing keyed and non keyed windows.

Happy blogging!!

Leave a Reply