Let’s get to know Data Streaming: A dev’s point of view

Reading Time: 5 minutes

Streaming of data has become the need of the hour. But do we really know how stream processing exactly works? What are its benefits? Where and how to stream data in your big data architecture correctly? How to process streamed data efficiently? What challenges do we face when we move from batch processing to stream processing? What is Stateful stream processing and what is stateless stream processing? Which one to opt and when? Let’s address all these queries!

Stream processing: What, Why, Benefits

By definition “A stream is an unbounded continuous flow of data”. That means when we keep capturing the data from some user’s activity (supervision) or tracking some person’s health (medical) continuously, then we get the latest data in (near) real-time. Now, if we process a record of data as soon as we capture it then the processing happens in a real-time streaming fashion unlike waiting for a few hours to collect the data and processing it in bulk. In essence that is the whole concept of streaming.

If someone asks, why should we do it if we have already a batch processing system already in place working superbly? The question is genuine as we already are processing reliably in batch mode without any issues. But in batch mode, we are waiting for several hours before the processing starts to actually collect the data. Hence the data which came right now will be processed later when it is time. Till then, the data is just sitting idle which is just not right. Stream processing utilizes the time here rather waiting.

Stream Processing processes the data as soon as the data is arrived (real-time streaming) or processes it within some really negligible time (Micro-batching). Hence the data collection in real-time will be utilized.

To list out the benefits of processing it in (near) real-time, some of them are:

  • Less hardware required as less data is to be processed
  • Easy inspection of data from multiple streams simultaneously
  • Providing insights faster for
    • Real-time fraud and anomaly detection.
    • Real-time personalization, marketing, and advertising.
    • and more …

Different Architectures

For an application to be developed, architecture is always needed. The same goes for the stream processing applications. Such architecture must able the application to handle the ingestion, processing, and analysis of data in a streaming manner. The followings are two such architectures:

  1. Lambda Architecture
  2. Kappa Architecture

Lambda Architecture

In this architecture, there are three layers: Batch Layer, Serving Layer, Speed Layer. These layers are grouped into two parts:

  1. Batch Processing: Batch layer and Serving layer are part of this flow. The batch layer stores all of the incoming data in its raw form and performs batch processing on the data. The serving layer helps the API view the processed data.
  2. Stream Processing: The Speed layer analyzes data in real-time. This layer is designed for low latency and to provide the processed view of real-time data.

The above two parts make the Lambda architecture suitable for the organizations that already have a batch processing application working and they want the application to provide real-time insights on the stream data as well. This architecture can be achieved by plugging in a speed layer into the traditional batch processing architecture. The downside of this architecture is that it is a bit more complex.

Kappa Architecture

Kappa architecture is a simpler alternative to Lambda Architecture and supports (near) real-time analytics when the data is read and transformed as soon as it is received. Upon receiving the data (events/messages) in streaming fashion (ex; using Kafka), a Stream Processing Engine reads the data and transform it into an analyzable format, and then stores it into an Analytics Database (Serving DB) for end-users to query.

Types of Stream Processing

Stateless Streaming

Stateless streaming is when the processing of each element of stream happens independently of other data in the stream.

What?

Processing each record in a stream without any intervention from the processing result of previous records.

Where?

Where we need to perform some operation per individual message/event like  filter, select, etc

When?

When result is not dependent upon previous events

Stateful Streaming

This is the opposite of stateless streaming and required to maintain state as the operation on each element in stream gets affected by some common state and that state is also updated in return by each operation.

What?

This stream processing is maintaining the state to perform Aggregations, Deduplication, Joins

Where?

Where we need to perform operations like groupBy, count, etc

When?

When result is dependent upon previous events.

Challenges in Stream Processing

Stateless streaming is just as simple as processing some data in a collection and it doesn’t differ much from batch processing. The challenges are faced while using operations which actually maintains some state in the memory while processing the data stream. These are some of the following problems that we encounter while processing the data in a streaming manner.

  1. Joining data from Two Streams: Both the streams are unbounded and joining keys from both the streams actually differ as we don’t have all the keysets available from both datasets, unlike batch processing.
  2. Aggregations in Streaming: When we want to aggregate some data from a stream (unbounded), we can’t simply do it like we do in batch mode as the data is increasing with time. While in batch processing we have all the data already available at once.
  3. Handling Late data: If we need to group some events but stream return some of the events of the group after a while then managing that grouping becomes complex.
  4. Deduplication in streaming: Deduplication is such an example which is handled differently as the new data might have some duplicate events.
  5. Fast Incoming Data: When the processing engine is not able to process the data according to the incoming speed of streaming data.
  6. Fault Tolerance: What happens when the streaming application faces some error and gets down. How to handle that? Certainly not how we do it in a batch application.

Conclusion

While we have many benefits including the better utilization of data, there are challenges which makes the stream processing more challenging to incorporate in some use cases. In the next blog, we’ll see how we can handle these challenges in our way to make our application achieve the streaming mode.

For more details on this topic, you can refer to the #knolX session uploaded on youtube on the same title

To get notification for more such blog, follow me on:

References:

Footer

Written by 

Anuj Saxena is a software consultant having more than 1.5 years of experience. Anuj has worked on functional programming languages like Scala and functional Java and is also familiar with other programming languages such as Java, C, C++, HTML. He is currently working on reactive technologies like Spark, Kafka, Akka, Lagom, Cassandra and also used DevOps tools like DC/OS and Mesos for Deployments. His hobbies include watching movies, anime and he also loves travelling a lot.