As you all might have known by now that one of the hot topics for quite some time has been streaming of big data. Day after day, we see tons of streaming technologies out there competing with one another. The obvious reason for that, processing big volumes of data is not enough. We need real-time processing of data, especially when we need to handle continuously increasing volumes of data and also need to process it and maintain it.
We ourself have tried our hands on many of the technologies like Apache Flink, Akka streams, Spark Streaming and many more. Just in case you need to know about them, you can go through the earlier blogs I have added:
But after going through these I thought about backtracking the things a little bit. I thought to myself, how these streams were streaming options were different from each other and what makes them similar. While doing this, I came across a term
Reactive Streams which astonished me. So in this blog, I will be taking you on a tour to explore what Reactive Streams are and what makes them so special. But before coming to the Reactive Streams we must know what Streams actually are and we will also try to recall the Reactive Manifesto. So going by the definition,
Streams are continuously flowing data which begins at a Source and ends at a Sink passing through multiple flows in between.
Now that you have some idea about streams, you can go through the below image to get some idea about what the Reactive Manifesto is and what the 4 principles of the Reactive Manifesto are:
To know and read more about the Reactive Manifesto, you can go through its official documentation. Now that we understand the basic terminologies, we will try to keep our focus on Reactive Streams.
Definition : Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.
3 points to note over here are:
We have already discussed what stream processing is. Let’s focus more on the other 2 points.
Asynchronous Processing is required as to :
- ensure the non-blocking flow of data
- enable the parallel use of computing resources
- collaborate network hosts or multiple CPU cores within a single machine.
Back-pressured Processing is required as to:
- communicate workload levels so that data passing never faces bottlenecks on either side.
- manage the rate of the operations of the source and sink.
- make sure the data producer doesn’t overwhelm the data consumer and potentially bring down the entire system.
Let’s try to understand the backpressure part in detail.
Consider a scenario where there is a fast publisher publishing at a speed of 100 ops/sec and a slow subscriber subscribing at the speed of 1 ops/sec.
The subscriber usually has some kind of buffer to manage incoming messages but that also has a limited space which keeps on filling as the publisher is much faster than the subscriber.
As the buffer keeps on filling, at one time the buffer will surely overflow as the publisher is continuously sending data to the subscriber via the flow.
Now how can we handle this scenario?
Option 1: Increase Buffer Size : One option is to increase the buffer size of subscribers. This is possible, but not always realistic. As the memory is limited, we will eventually run out of memory.
Option 2: Bounded Buffer + Drop Messages Another option is to simply drop elements. Working in stream processing gives us some flexibility for dropping what can’t be handled, but it’s not always appropriate. This would highly increase the load on Publisher.
Optimal Solution: What we need is a bi-directional flow of data — elements emitted downstream from publisher to the subscriber, and a signal for demand emitted upstream from subscriber to publisher. If the subscriber is placed in charge of signaling demand, the publisher is free to safely push up to the number of elements demanded. This also prevents wasting resources because demand is signaled asynchronously, a subscriber can send many requests for more work before actually receiving any.
Reactive Streams implementation
Implementations of the Reactive Streams specification are compatible with each other, which is an obvious benefit of choosing a Reactive Streams compliant library. You can even integrate different stream processing systems together that use different libraries as long as they comply with the specification.
Need for Reactive Streams?
The goals of the Reactive Streams specification are :
- to govern the exchange of stream data across an asynchronous boundary – like passing elements on to another thread or thread-pool.
- to create a standard for achieving statically typed, high-performance, low latency, and asynchronous streams of data with built-in non-blocking back pressure
- Handling streams of data—especially “live” data whose volume is not predetermined—requires special care in an asynchronous system.
- The most prominent issue is that resource consumption needs to be controlled such that a fast data source does not overwhelm the stream destination i.e maintaining the backpressure.
In this blog, we tried to cover what reactive streams are and why they are so important. I hope that now you guys have some idea about what reactive Streams are and what makes a stream reactive. So now, when it comes to choosing a streaming approach, be sure to make a choice that suits your use case well. To know more about the Reactive Streams, you can go through the following links :
Hope this helps. Stay tuned for more interesting blogs. 🙂