Fault Tolerance and Resiliency in Apache Kafka


Kafka is known for its performance along with its resiliency and fault tolerance. In this article we'll see which configuration changes help achieve fault tolerance and resilience for a given architectural need. Before starting, basic knowledge of Kafka is helpful; the official Kafka documentation is a good place to begin.

Apache Kafka is a distributed system, and the term fault tolerance is very common in distributed systems. Fault tolerance refers to the ability of a system to continue operating without interruption when one or more of its components fail. Fault-tolerant systems use backup components that automatically take the place of failed components, ensuring no loss of service.

Kafka provides configuration properties in order to handle adverse scenarios.

Broker-level fault tolerance can be managed with the parameters below.

Replication factor

The replication factor is the number of copies of a topic's data kept across multiple brokers.

A topic should have a replication factor greater than 1 (usually 2 or 3), so that when a broker goes down, another broker can serve the topic's data. For instance, assume that we have a topic with 2 partitions and a replication factor of 2, spread across 3 brokers.

Now assume that broker 2 has failed: brokers 1 and 3 can still serve the data for the topic, because every partition has a replica on at least one surviving broker.
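To make this concrete, here is a minimal sketch of creating such a topic with the Java AdminClient. The topic name my-topic and the broker address localhost:9092 are assumptions for illustration, not values from the article:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 2 partitions, replication factor 2: every partition lives on two
            // brokers, so the topic survives the loss of any single broker
            NewTopic topic = new NewTopic("my-topic", 2, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```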

In-sync replicas

An in-sync replica (ISR) is a replica that has fully caught up with the leader within the last 10 seconds. The time period can be configured via replica.lag.time.max.ms (default 10 seconds). If a broker goes down or has network issues, it cannot keep up with the leader, and after 10 seconds it is removed from the ISR set.

The default minimum in-sync replica count (min.insync.replicas) is 1. This means that if all the followers go down, the ISR set consists only of the leader. Even if acks is set to all, the message is then actually committed to only 1 broker (the leader), which makes it vulnerable to loss.
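If the brokers don't already enforce a higher value cluster-wide, min.insync.replicas can also be raised per topic. A sketch with the Java AdminClient, reusing the assumed my-topic from the earlier example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RaiseMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

        try (AdminClient admin = AdminClient.create(props)) {
            // Require at least 2 in-sync replicas before a write with acks=all is committed
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```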

Configurations to make the Kafka producer more fault tolerant and resilient

Kafka Producer Delivery Semantics

At Most Once (default semantics)

With at-most-once delivery semantics, it is acceptable to deliver a message either one time only or not at all; losing a message is preferable to delivering it twice. Failure to deliver a message is typically due to a communication error or other disruption.

At Least Once

With at-least-once delivery semantics, it is acceptable to deliver a message more than once, but no message should be lost. Duplicate events can occur due to the combination of disrupted communication and retried sends.

Exactly Once

With exactly-once delivery semantics, each message is delivered exactly once, with no loss and no duplication.

We can achieve these delivery semantics in Kafka using the acks property of the producer and the min.insync.replicas property of the broker.

Acks (acknowledgments)

acks=0

When acks=0, you get at-most-once delivery semantics. No response is requested from the broker, so if the broker goes offline or an exception occurs, we will not know and will lose data.

acks=1 (default)

When acks=1, you get at-least-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response. If no acknowledgment is received for the message, you can let the producer resend it by configuring retries=n, the maximum number of retries the producer performs when a send fails. The default value is 0.

acks=all

We set acks=all, combined with an idempotent producer (enable.idempotence=true), to achieve exactly-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response, retrying based on the retries config until an acknowledgment is received. The broker sends the acknowledgment only after replication, as governed by the min.insync.replicas property.

The following properties create a safe producer that minimizes data loss; a configuration sketch follows the list.

Producer properties

acks=all (default 1): ensures replication before acknowledgment

retries=MAX_INT (default 0): retries sends in case of transient failures

max.in.flight.requests.per.connection=5 (default): number of parallel in-flight requests per broker connection

Broker properties

min.insync.replicas=2 (at least 2): ensures a minimum number of in-sync replicas (ISR) before a write is acknowledged.
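Putting the list above together, a minimal safe-producer sketch might look as follows. The broker address, topic name, and serializers are assumptions for illustration; min.insync.replicas is set on the broker or topic side, as shown earlier:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=all: wait until the in-sync replicas have the record before acknowledging
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures (effectively) forever instead of dropping records
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Up to 5 unacknowledged requests in flight per broker connection
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```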

Send messages in order

max.in.flight.requests.per.connection (default value 5) is the number of unacknowledged requests the producer can have in flight on a single connection. If retries is greater than 0 and the first request fails while the second succeeds, the first request is resent and the messages end up on the broker in the wrong order.

If you don't enable idempotence but still want to keep messages in order, you should set max.in.flight.requests.per.connection to 1.
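Continuing the props object from the producer sketch above, the two options look like this:

```java
// Option 1: enable idempotence. The broker de-duplicates retried sends,
// so ordering is preserved even with up to 5 in-flight requests.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

// Option 2: without idempotence, allow only one in-flight request,
// trading some throughput for strict ordering.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
```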

Sending messages too fast

When the producer calls send(), messages are not sent immediately; they are added to an internal buffer. The default buffer.memory is 32 MB. If the producer sends messages faster than they can be transmitted to the broker, or there is a network issue, the buffer fills up and subsequent send() calls block for up to max.block.ms (default 1 minute). We can increase these values to mitigate the problem.

linger.ms (default value 0) is the delay before a batch is considered ready to be sent. The default of 0 means batches are sent immediately, even if there is only 1 message in the batch. Increasing linger.ms reduces the number of requests and improves throughput, at the cost of a small added latency.

A complementary configuration is batch.size, the maximum size in bytes of a single batch; a batch is sent as soon as it is full or linger.ms elapses, whichever comes first.
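As a sketch, these buffering and batching knobs could be added to the same producer configuration. The concrete values below are illustrative assumptions, not recommendations from the article:

```java
// Grow the unsent-record buffer from the 32 MB default to 64 MB
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);
// Let send() block for up to 2 minutes (default 1 minute) when the buffer is full
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 120_000L);
// Wait up to 20 ms for more records so batches fill up (default 0)
props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
// Cap each batch at 32 KB (default 16 KB)
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
```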

Consumer Configurations to make Kafka more fault tolerant and resilient

fetch.min.bytes (default 1 byte) defines the minimum amount of data the broker should accumulate before answering a fetch request, while fetch.max.wait.ms (default 500 ms) defines the maximum time the broker waits for that minimum to be reached. This can add up to 500 ms of extra latency when there is not enough data flowing into the Kafka topic. If you want to limit the potential latency (usually due to SLAs controlling the maximum latency of the application), you can set fetch.max.wait.ms to a lower value. If you set fetch.max.wait.ms to 100 ms and fetch.min.bytes to 1 MB, Kafka will receive a fetch request from the consumer and will respond with data either when it has 1 MB of data to return or after 100 ms, whichever happens first.
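A sketch of that exact trade-off in consumer configuration, using the ConsumerConfig constants from org.apache.kafka.clients.consumer (a complete consumer sketch appears later in this article):

```java
// Respond once 1 MB of data has accumulated...
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024);
// ...or after 100 ms, whichever happens first (default 500 ms)
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 100);
```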

session.timeout.ms (default value 10 seconds)

Defines how long a consumer can be out of contact with the broker. When the session times out, the consumer is considered dead and a rebalance is triggered. heartbeat.interval.ms defines how often the consumer sends a heartbeat to the group coordinator.

To avoid frequent rebalances, set heartbeat.interval.ms to about one third of session.timeout.ms (in other words, session.timeout.ms should be at least three times heartbeat.interval.ms). Tuning these values appropriately avoids unwanted rebalancing and the overhead that comes with it.

max.poll.records

max.poll.records controls the maximum number of records that a single call to poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop.

auto.offset.reset

This property controls the behavior of the consumer when it starts reading a partition for which Kafka has no committed offset, for example the first time it reads from a topic: it defines where to start reading. You can set it to earliest or latest; earliest reads all messages from the beginning, while latest reads only new messages produced after the consumer subscribed to the topic. The default value of auto.offset.reset is latest.
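Tying the consumer settings together, here is a minimal sketch. The group id, topic name, and concrete timeout values are illustrative assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ResilientConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // assumed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Declare the consumer dead after 30 s without heartbeats...
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30_000);
        // ...with heartbeats every 10 s (one third of the session timeout)
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10_000);
        // Process at most 100 records per poll() to keep each loop iteration short
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // Start from the beginning of the topic when no committed offset exists
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```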

In addition to the configuration properties presented above, there are a number of other important configurations that any user of Kafka must know about.

  • enable.auto.commit
  • request.timeout.ms
  • partition.assignment.strategy

On its own, the Kafka consumer supports only at-most-once and at-least-once delivery semantics.

Thanks for taking the time to read through this article!


Written by 

Anuradha Kumari is a software consultant at Knoldus Inc. She is a tech enthusiast who likes to explore new technologies and write tech blogs.
