Handling Message Loss in Stream Processing Microservices

Table of contents

Reading Time: 4 minutes

If you have ever designed and worked on a message processing application that involves processing messages either from a stream and producing to a queue which is producing to a regular queue, your application is prone to loss of messages in case of the following scenarios.

If the queueing system is down on which you are publishing the messages and the queue from which you are consuming the messages are not having a ACK mechanism or simply a high velocity streams
If you have consumed the message but cannot process it due to some external dependencies

In this blog, I will walk you through the example problem where a message loss can be a problem to the business if not handled with a proper solution. We will think about the possibility of solving the problem using well-established patterns and anti-patterns.

Loss of messages While Stream Processing Example

Assume you are working a stream processing service that is consuming a stream of messages from one source and after processing, publishing events to a Kafka Topic.

Also, we are not having any dead letter queue where we can back up the messages. In case our Kafka is down for a fraction of a second you are going to lose the message which is consumed from the stream as you can’t go back.

What are the options to avoid messages getting Lost?

Here is what in my opinion would be a handful of options to avoid the loss of messages.

Retry approach in case volume is low
Implement a dead letter queue if possible ( not cost effective)
Use a database to put the records/events and replay them when Kafka is Up

Using Retries

Sometimes, the external dependencies might take longer than usual to respond due to the service might be down for a few reasons like network outage, server down, or other unknown issues. In that case, we can expect a few exceptions or error conditions to be thrown from the service we are calling. For e.g. consider the below pseudo-code example which expects a Response but there is a chance the method might return an exception:

public String methodWhichCallsExternalAPI() throws Exception {
        if (responseFromAPI) {
            return "Response From API";
        } else {
            throw new Exception();
        }
    }

The simplest way to put a retry is by looping over the method and retrying until we get a successful response or exhaust the number of retries.

There are a lot of patterns that handle retries in a standard way. spring-retry Is one of the ways to implement Retries in a standard way. I am borrowing an Example from baeldung blog. We just need a couple of annotations to enable retries on the service and methods we want to behave in a retry way. We need to specify a class of exception we are expecting e.g. RuntimeException

@Service
public interface MyService {
    @Retryable(value = RuntimeException.class)
    void retryService(String sql);
}

Once the annotation is there, The method annotated with @Retryable will retry the method.

Using Dead Letter Queues

As per wiki, here is how a dead letter queue is defined.

In message queueing the dead letter queue is a service implementation to store messages that meet one or more of the following criteria: Message that is sent to a queue that does not exist. Queue length limit exceeded. Message length limit exceeded. Message is rejected by another queue exchange.

In our example, processing of the message is complete but if we can’t publish it to Kafka topic, we can adapt to the dead letter queue pattern in this example only if we redirect our stream to a queue without processing the messages to a queue which supports DLQ. A lot of queues and pub/sub offerings implement the concept of DLQ. e.g. Amazon SQS. Rabbit MQ, Kafka, Active MQ etc.

Unfortunately, we are not using any of these at the consumer end. Otherwise, it would have been a solution for our case.

Persisting Events when not Processing

There is another option that can save us from losing messages in case we are not processing. Here is how:

Consider each incoming stream record as event.
Try processing the record
Retry with a number of attempts
In case of processing stil fails, persist that record to a storage (derby, RDBMS, Mongo), I would rather choose a elastic storage which can be scaled especially when not sure how much time the failure situation is going to be persisted and volume of stream is high.

In the given example scenario at the beginning of the blog, In the failure scenario would persist the records in the database when Kafka is not up and as soon as we started getting the failure message as depicted below.

When any external dependency starts giving the success response, we can replay all the events we have processed and persisted in the database so that all messages that failed to publish can be republished.

Note

The above approach will work properly as long as the ordering of the messages being processed does matter. If we need ordering, we can probably add a persistent queue in between the stream and the stream processing service.

Hope you find the workarounds a fit in case you find a failure scenario. Feel free to suggest/propose new designs.