A few days ago, I had to perform an aggregation on a streaming DataFrame. The moment I applied groupBy for the aggregation, the data got shuffled. So the question arises: how do I maintain order?
Yes, I can use orderBy on a streaming DataFrame in Spark Structured Streaming, but only after an aggregation and only in complete mode. There is no way to order streaming data in append mode or update mode.
I tried different ways to solve this issue. If I stay with Spark Structured Streaming, I can sort the streamed data within each batch, but not across batches.
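To see why sorting within batches is not enough, here is a small library-free Python sketch. It does not use Spark at all; the two hard-coded batches and the `(time, payload)` record shape are assumptions made up for illustration.

```python
# Two micro-batches of (time, payload) records. The second batch contains
# records that are older than some records in the first batch.
batches = [
    [(5, "e"), (1, "a"), (3, "c")],
    [(4, "d"), (2, "b"), (6, "f")],
]

# Sort inside each batch only, then emit the batches in arrival order.
per_batch_sorted = [rec for batch in batches for rec in sorted(batch)]

# The order a sort over the whole stream would have produced.
globally_sorted = sorted(rec for batch in batches for rec in batch)

times = [t for t, _ in per_batch_sorted]
print(times)                                # [1, 3, 5, 2, 4, 6]
print(per_batch_sorted == globally_sorted)  # False
```

Each batch is ordered on its own, yet the time jumps back from 5 to 2 at the batch boundary, which is exactly the problem described above.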
I started looking for solutions in other technologies such as Apache Flink and Apache Storm. What I faced in the end was disappointment. 🙁
A bit of light at the end of the tunnel
Luckily, there is Apache Kafka Streams, which provides access to its StateStore. Kafka Streams also provides the Processor API.
The low-level Processor API provides a client to access stream data, perform our business logic on the incoming data stream, and send the result downstream. This is done by extending the abstract class AbstractProcessor and overriding the init, punctuate, close, and process methods, which contain our logic. The process method is called once for every key-value pair.
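The order in which these callbacks fire can be illustrated with a plain-Python stand-in. This is only an analogy, not the Kafka Streams API: `LoggingProcessor`, `run`, and the `punctuate_every` parameter are hypothetical names, and punctuation is triggered here by record count purely to keep the sketch small, whereas Kafka Streams schedules it by time.

```python
class LoggingProcessor:
    """Plain-Python stand-in mirroring the four callbacks named above."""

    def init(self, context):
        # Called once before any record is processed.
        self.context = context
        self.events = ["init"]

    def process(self, key, value):
        # Called once for every key-value pair.
        self.events.append(f"process {key}={value}")

    def punctuate(self, timestamp):
        # Called periodically; a natural place to emit buffered results.
        self.events.append(f"punctuate {timestamp}")

    def close(self):
        # Called once on shutdown.
        self.events.append("close")


def run(processor, records, punctuate_every=2):
    """Hypothetical driver: feeds records and fires the callbacks in order."""
    processor.init(context={})
    for i, (key, value) in enumerate(records, start=1):
        processor.process(key, value)
        if i % punctuate_every == 0:
            processor.punctuate(i)
    processor.close()
    return processor.events


events = run(LoggingProcessor(), [("k1", "a"), ("k2", "b"), ("k3", "c")])
for event in events:
    print(event)
```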
Whereas the high-level DSL provides ready-to-use methods in a functional style, the low-level Processor API gives you the flexibility to implement processing logic according to your needs. The trade-off is simply the extra lines of code you need to write for specific scenarios. For more information, refer to the references.
So the abstract idea is this: after aggregating the DataFrame, write it to Kafka. Then read it as a KStream, apply the business logic using the low-level Processor API to sort the data, and write it back to Kafka.
The main idea is to keep adding records to a ListBuffer until it reaches a certain size, let's say 20. Once the buffer size reaches 20, we move to the else branch, where we iterate over the ListBuffer and parse every record to extract the specific column that the records will be sorted by. From this we build a ListBuffer of Tuple2, where the first element is the extracted column and the second element is the value consumed from Kafka. We then sort this ListBuffer of Tuple2 by the extracted column and send only the second element of each tuple to Kafka. Finally, we clear the ListBuffer. This process runs continuously. Depending on the requirements, we can also handle late data and system shutdown by saving the ListBuffer in a KeyValueStore.
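The buffering steps above can be sketched in plain Python, without any Kafka dependency. This is one reading of the description, not the actual MyProcessor: the class name `SortingBuffer`, its parameters, and the `forward` callback (standing in for "send to Kafka") are all hypothetical, and the sketch flushes as soon as the buffer holds `max_size` records.

```python
class SortingBuffer:
    """Collects records and emits them in sorted order, one full buffer at a time."""

    def __init__(self, max_size, sort_key, forward):
        self.max_size = max_size  # e.g. 20 in the description above
        self.sort_key = sort_key  # extracts the column to sort by
        self.forward = forward    # stand-in for sending a record downstream
        self.buffer = []

    def process(self, value):
        # Keep adding records until the buffer is full.
        self.buffer.append(value)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self):
        # Pair every record with its extracted sort column: (column, value).
        keyed = [(self.sort_key(v), v) for v in self.buffer]
        # Sort on the extracted column only; the sort is stable, so
        # records with equal keys keep their arrival order.
        keyed.sort(key=lambda pair: pair[0])
        # Send only the second element of each pair downstream.
        for _, value in keyed:
            self.forward(value)
        # Drop all elements and start collecting again.
        self.buffer.clear()


sent = []
sorter = SortingBuffer(max_size=3, sort_key=lambda rec: rec[0], forward=sent.append)
for record in [(30, "c"), (10, "a"), (20, "b"), (5, "late")]:
    sorter.process(record)

print(sent)           # [(10, 'a'), (20, 'b'), (30, 'c')]
print(sorter.buffer)  # [(5, 'late')]
```

The last record shows the limitation of a fixed-size buffer: it arrived after its window had already been emitted, so it can only be sorted against whatever comes next. Calling `flush()` on shutdown, or persisting the buffer in a state store as mentioned above, keeps such leftover records from being lost.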
So, here I have implemented the idea in MyProcessor. In my case, the value has three columns: time, col1, and col2. I extract the time column so that I can sort the records by time. After sorting, each record is sent to a Kafka topic. Now I can consume it as a DataFrame again. 😀
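As a rough illustration of the extraction step, the sketch below assumes each value is a comma-separated string of the form `time,col1,col2` with an integer timestamp. Both the format and the helper name `time_of` are assumptions; the real encoding depends on how the DataFrame was written to Kafka.

```python
def time_of(value: str) -> int:
    """Extract the time column from a 'time,col1,col2' string (assumed format)."""
    time_field, _col1, _col2 = value.split(",")
    return int(time_field)


# Records as they might arrive from the topic, out of order.
records = ["1700000030,b,2", "1700000010,a,1", "1700000020,c,3"]

# Sort on the extracted time column and keep the original values intact.
ordered = sorted(records, key=time_of)
for record in ordered:
    print(record)
```

Converting the timestamp to an integer matters here: sorting the raw strings would compare them character by character, which only agrees with numeric order when every timestamp has the same number of digits.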
Ordering streaming data is always a hard problem. But with Kafka Streams, we can sort the streamed data using its low-level Processor API. The main aim of this blog is not to explain how to use the low-level Processor API, but to familiarize you with the idea of how to sort streamed data.
I hope this blog helps you 🙂