A while back, our project was using Spark Streaming, and initially it worked like a charm. But just as we got close to completing our use case, the unexpected occurred. Spark has a lot of interesting features, but we had some more custom needs, such as running a ton of varying jobs with different actors/flows. We also needed something with the least possible latency, something close to a real-time approach. That was when I got to explore Akka Streams.
So, to begin with, I'll first go through what's different in Akka Streams as compared to Spark Streaming.
Spark Streaming vs Akka Streams
Spark Streaming is great for processing lots of data while keeping state at the same time, but it is a micro-batch approach: events are collected and then processed periodically in small batches every few seconds.
In Akka Streams, by contrast, we represent data processing in the form of data flowing through an arbitrarily complex graph of processing stages. Stages have zero or more inputs and zero or more outputs.
Broadly speaking, if you're doing something that is best characterized as "big data" or "analytics", you should really look at Spark in depth. Whereas, if you're involved in something that is best characterized as "real-time" or "reactive", you should really look at Akka in depth.
If you’re doing something that involves both, you may find yourself using both Akka and Spark together.
After comparing the two, we found Akka Streams to be the better fit for our use case and started digging a bit deeper into how it works.
Akka Streams Terminologies
The documentation defines Akka Streams as: an intuitive and safe way to do asynchronous, non-blocking, and backpressured stream processing.
The Akka Streams API allows us to easily compose data transformation flows from independent steps.
Moreover, all processing is done in a reactive, non-blocking, and asynchronous way.
To work with Akka Streams, we need to be aware of the following core concepts:
- SOURCE
- A processing stage with exactly one output, emitting data elements whenever downstream processing stages are ready to receive them.
- It is the entry point to processing in the Akka-stream library.
- We can create a Source in two ways: using the single() method, if we want to create a Source from a single element, or by creating a Source from an Iterable of elements.
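For instance (a minimal sketch; the element values are just placeholders):

```scala
import akka.NotUsed
import akka.stream.scaladsl.Source

// A Source emitting exactly one element
val singleSource: Source[String, NotUsed] = Source.single("hello")

// A Source built from an Iterable of elements
val iterableSource: Source[Int, NotUsed] = Source(1 to 5)
```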
- FLOW
- A processing stage with exactly one input and one output, which connects its upstream and downstream by transforming the data elements flowing through it.
- The flow is the main processing building block
- Every Flow instance has one input and one output value
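A minimal Flow might look like this (the doubling transformation is purely illustrative):

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow

// A Flow that consumes Ints and emits each one doubled
val doubler: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
```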
- SINK
- A processing stage with exactly one input, requesting and accepting data elements, possibly slowing down the upstream producer of elements.
- A Flow is not executed until we register a Sink operation on it
- It is a terminal operation that triggers all computations in the entire Flow
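For example (a sketch; Sink.fold and Sink.foreach are standard Akka Streams sinks):

```scala
import akka.Done
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

// A Sink that folds all incoming elements into their sum;
// the final result is exposed as a materialized Future[Int]
val sumSink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)

// A Sink that only performs a side effect for each element
val printSink: Sink[Int, Future[Done]] = Sink.foreach(println)
```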
- MATERIALIZER
- A materializer is optional in Akka Streams
- We can use one if we want our Flow to have some side effects like logging or saving results.
- However, we can pass the NotUsed alias as the materialized value type to denote that our Flow should not have any side effects
- RUNNABLE GRAPH
- A graph is a whole set of outlets, flow shapes, and inlets linked together so that the graph is closed.
- Once a stream is properly terminated by having both a source and a sink, it is represented by the RunnableGraph type, indicating that it is ready to be executed.
Note: We need both an ActorSystem and an ActorMaterializer in scope to materialize the graph
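Putting the pieces together, a minimal runnable graph could look like this (a sketch against the Akka 2.6 API, where an implicit ActorSystem is enough to materialize the stream; older versions additionally need an explicit ActorMaterializer):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object RunnableGraphExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")

  Source(1 to 5)                  // source: emits 1..5
    .via(Flow[Int].map(_ * 2))    // flow: doubles each element
    .to(Sink.foreach(println))    // sink: closes the graph into a RunnableGraph
    .run()                        // materialize and execute the graph
}
```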
Broadcasting in Akka Streams
One of the most straightforward ways to partition a stream is to simply broadcast every upstream element to a set of downstream consumers. Every downstream consumer will receive exactly the same stream of messages and can independently apply filtering, transformations, aggregations, or side effects.
Basically, Broadcast ingests events from one input and emits duplicated events across more than one output stream.
Here is an example from the documentation itself.
In the example below, we prepare a graph that consists of two parallel streams, in which we re-use the same instance of Flow, yet it will properly be materialized as two connections between the corresponding Sources and Sinks:
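Sketched in code, the documentation's example looks roughly like this (Akka 2.6 graph DSL; on older versions the factory method is GraphDSL.create rather than GraphDSL.createGraph):

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}

object BroadcastExample extends App {
  implicit val system: ActorSystem = ActorSystem("broadcast")

  val topHeadSink    = Sink.head[Int]
  val bottomHeadSink = Sink.head[Int]
  val sharedDoubler  = Flow[Int].map(_ * 2) // one Flow instance, wired up twice

  RunnableGraph
    .fromGraph(GraphDSL.createGraph(topHeadSink, bottomHeadSink)((_, _)) {
      implicit builder => (topHS, bottomHS) =>
        import GraphDSL.Implicits._
        // builder.add copies the Broadcast blueprint and returns its ports
        val broadcast = builder.add(Broadcast[Int](2))
        Source.single(1) ~> broadcast.in
        broadcast ~> sharedDoubler ~> topHS.in
        broadcast ~> sharedDoubler ~> bottomHS.in
        ClosedShape
    })
    .run()
}
```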
As you can see above, we require the builder.add(...) method, which makes a copy of the blueprint passed to it and returns the inlets and outlets of the resulting copy so that they can be wired up. Using that, we can use Broadcast to create the parallel streams.
Working with Operators
Akka Streams offers some standard operations to work on streams. The simplest ones to understand are similar to those acting on collections.
Most of the operations that Akka Streams provides are not much different from those we use when working with collections. The most commonly used are map and filter, but there are also drop and many more.
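For instance, these collection-like operators compose directly on a Source (a small sketch):

```scala
import akka.NotUsed
import akka.stream.scaladsl.Source

val processed: Source[Int, NotUsed] =
  Source(1 to 10)
    .map(_ * 10)    // 10, 20, ..., 100
    .filter(_ > 30) // 40, 50, ..., 100
    .drop(2)        // 60, 70, 80, 90, 100
```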
You can refer to the operator index in the Akka documentation to learn about the operators provided by the Akka Streams Source, Flow, and Sink.
Akka Streams also provides a mechanism to build asynchronous boundaries manually into your flows and graphs. We can do that by adding Attributes.asyncBoundary, using the method async on Source, Sink, and Flow, to pieces that shall communicate with the rest of the graph in an asynchronous fashion.
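The documentation illustrates this with a snippet along these lines:

```scala
import akka.stream.scaladsl.{Sink, Source}

// The async call inserts an asynchronous boundary:
// everything before it runs in one actor, everything after it in another
val graph = Source(List(1, 2, 3))
  .map(_ + 1)
  .async
  .map(_ * 2)
  .to(Sink.ignore)
```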
In this example, we create two regions within the flow, each of which will be executed by one actor. Assuming that adding and multiplying integers is an extremely costly operation, this will lead to a performance gain, since two CPUs can work on the tasks in parallel.
The flow is thus bifurcated at the asynchronous boundary: everything before the boundary is executed by one actor, and everything after it by another.
Pros and Cons Of Using Akka Streams
Pros:

- Akka Streams implements the Reactive Streams specification, which is great for achieving really low latency.
- Akka Streams provides a lot of operators for easily writing transformations over streams in a declarative way.
- It is one of the best approaches when working on something in the "real-time" or "reactive" category.
- If you have more custom needs, such as running a ton of varying jobs with different actors/flows, Akka Streams can help you with that.
- Akka streams are quite economical in some cases when it comes to threads.
Cons:

- Using Akka Streams imposes some overhead in getting up and running: it requires learning the DSL and also understanding, to some extent, what's going on under the hood.
- Implementations can be hard to read.
- It can be conceptually complicated.
- Unlike heavier “streaming data processing” frameworks, Akka Streams are not “deployed” nor automatically distributed.
We hope that you now have some idea of what Akka Streams is and how to use it.
To conclude, if you're involved in something that is best characterized as "real-time" or "reactive", you should really look at Akka Streams in depth. It does impose some overhead in getting up and running: you need to learn the DSL and understand, to some extent, what's going on under the hood. But once you have that understanding, Akka Streams can be a solution to most of your streaming problems.
For more information, you can refer to the Akka Streams documentation.
Hope this Helps. Stay tuned for more interesting articles. 🙂