Kafka is known for its performance along with resiliency and fault tolerance. In this article we'll see how to tune Kafka's configuration to achieve the fault tolerance and resilience your architecture needs. Before starting, you should have basic knowledge of Kafka; if not, you can go through the official documentation.
Apache Kafka is a distributed system, and the term fault tolerance is very common in distributed systems. Fault tolerance refers to the ability of a system to continue operating without interruption when one or more of its components fail. Fault-tolerant systems use backup components that automatically take the place of failed components, ensuring no loss of service.
Kafka provides configuration properties in order to handle adverse scenarios.
Broker-level fault tolerance can be managed with the parameters below.
The replication factor is the number of copies of each partition's data across multiple brokers.
A topic should have a replication factor greater than 1 (usually 2 or 3). This way, when a broker is down, another broker can still serve the topic's data. For instance, assume we have a topic with 2 partitions and a replication factor of 2.
Now assume that broker 2 has failed. Brokers 1 and 3 can still serve the data for the topic.
An in-sync replica (ISR) is a replica that has fully caught up with the leader within the last 10 seconds. The time window can be configured via replica.lag.time.max.ms (10 seconds by default in older Kafka versions; the default was raised to 30 seconds in Kafka 2.5). If a broker goes down or has network issues, it cannot keep up with the leader, and after that window the replica is removed from the ISR.
The default minimum in-sync replicas (min.insync.replicas) is 1. This means that if all the followers go down, the ISR consists only of the leader. Even if acks is set to all, the message is actually committed to only 1 broker (the leader), which makes the message vulnerable to loss.
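As a sketch of how these broker-side settings fit together, the dictionary below uses standard Kafka topic-level property names; the helper function around it is illustrative and not part of any Kafka client API:

```python
# Illustrative topic settings for fault tolerance. The property names are
# standard Kafka topic/broker configs; the dict itself is just a sketch.
topic_settings = {
    "replication.factor": 3,           # 3 copies of each partition across brokers
    "min.insync.replicas": 2,          # acks=all needs >= 2 in-sync replicas
    "replica.lag.time.max.ms": 10000,  # follower dropped from ISR after this lag
}

def tolerates_one_broker_failure(cfg):
    """A topic keeps accepting acks=all writes after losing one broker only
    if replication.factor is strictly greater than min.insync.replicas."""
    return cfg["replication.factor"] > cfg["min.insync.replicas"] >= 2
```

With a replication factor of 3 and min.insync.replicas of 2, one broker can fail and producers using acks=all continue to work; with 2 and 2, a single failure already blocks writes.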
Configurations to make Kafka Producer more fault tolerant and resilient
Kafka Producer Delivery Semantics
At Most Once (default semantics)
With at-most-once delivery semantics, a message is delivered either one time only or not at all. In this semantic it is acceptable to lose a message rather than deliver it twice. Failure to deliver a message is typically due to a communication error or other disruption.
At Least Once
With at-least-once delivery semantics, it is acceptable to deliver a message more than once, but no message should be lost. Duplicate events can occur due to the combination of disrupted communication and retried sends.
Exactly Once
With exactly-once delivery semantics, each message is delivered exactly one time, with no loss and no duplicates.
We can achieve these delivery semantics in Kafka using the acks property of the producer and the min.insync.replicas property of the broker.
Acks = 0
When acks=0, you get at-most-once delivery semantics. In this case no response is requested from the broker, so if the broker goes offline or an exception happens, we will not know and will lose data.
Acks = 1 (default)
When acks=1, you can achieve at-least-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response. If no acknowledgment is received for the message sent, the producer can resend it by configuring retries=n. This is the maximum number of retries the producer will attempt if the send fails. The default value is 0 in older clients (newer Java clients default to a very large value).
Acks = all
We set acks=all, together with enable.idempotence=true, to achieve exactly-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response; it will retry sending the message up to n times until an acknowledgment is received. The broker sends the acknowledgment only after the message has been replicated to the number of replicas defined by min.insync.replicas.
Properties to create a safe producer that ensures minimal data loss.
Acks = all (default 1) — Ensures replication before acknowledgement
Retries = MAX_INT (default 0) — Retry in case of exceptions
Max.in.flight.requests.per.connection = 5 (default) — Maximum number of unacknowledged requests per broker connection
Min.insync.replicas = 2 (at least 2) — Ensures minimum In Sync replica (ISR).
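The list above can be sketched as a producer configuration. The property names are Kafka's standard dotted producer configs; the dictionary is an illustrative sketch, not a full client setup:

```python
# Sketch of a "safe producer" configuration using standard Kafka
# producer property names (dotted keys as used by confluent-kafka).
safe_producer_config = {
    "acks": "all",                                # wait for min.insync.replicas copies
    "retries": 2147483647,                        # MAX_INT: effectively retry forever
    "max.in.flight.requests.per.connection": 5,   # up to 5 unacknowledged requests
    "enable.idempotence": True,                   # dedupe retried sends, keep ordering
}
```

Note that min.insync.replicas = 2 is a broker/topic-level setting, so it is configured on the topic rather than in the producer's properties.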
Send messages in order
max.in.flight.requests.per.connection (default value 5) is the number of unacknowledged requests that can be buffered on the producer side. If retries is greater than 0 and the first request fails while the second succeeds, the first request will be resent and the messages will end up in the wrong order.
If you don't enable idempotence but still want to keep messages in order, you should set this property to 1.
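The two ordering strategies described above can be sketched side by side, again as plain dictionaries with standard Kafka property names:

```python
# Two ways to preserve message ordering under retries (sketch).
ordered_without_idempotence = {
    "retries": 5,
    "max.in.flight.requests.per.connection": 1,  # one request at a time: ordered but slower
}

ordered_with_idempotence = {
    "enable.idempotence": True,                  # broker sequence numbers preserve order
    "max.in.flight.requests.per.connection": 5,  # keeps ordering with up to 5 in flight
}
```

The idempotent variant is usually preferred because it keeps throughput while still guaranteeing ordering per partition.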
Sending messages too fast
When the producer calls send(), messages are not sent immediately but are added to an internal buffer. The default buffer.memory is 32 MB. If the producer sends messages faster than they can be transmitted to the broker, or there is a network issue, the buffer fills up and the send() call blocks for up to max.block.ms (default 1 minute). We can increase these values to mitigate the problem.
linger.ms (default value 0) is the delay before batches are sent. With the default of 0, batches are sent immediately even if they contain only 1 message. People often increase linger.ms to reduce the number of requests and improve throughput, at the cost of a small added latency.
A related configuration is batch.size, the maximum size in bytes of a single batch (default 16 KB). A batch is sent when it is full or when linger.ms elapses, whichever comes first.
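The buffering and batching knobs above can be sketched together. The property names follow the Java producer client; defaults are noted in comments, and the chosen values are illustrative assumptions:

```python
# Throughput/latency batching knobs for the producer (sketch).
batching_config = {
    "buffer.memory": 33554432,  # 32 MB internal send buffer (the default)
    "max.block.ms": 60000,      # send() blocks up to 1 minute when the buffer is full
    "linger.ms": 20,            # wait up to 20 ms to fill a batch (default 0)
    "batch.size": 32768,        # 32 KB max batch size (default 16384)
}
```

Raising linger.ms and batch.size together trades a little latency for fewer, larger requests.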
Consumer Configurations to make Kafka more fault tolerant and resilient
fetch.min.bytes (default value 1 byte) defines the minimum amount of data the broker should accumulate before answering a fetch request, and fetch.max.wait.ms (default value 500 ms) defines the maximum time Kafka waits before sending data to the consumer anyway. This results in up to 500 ms of extra latency when there is not enough data flowing into the topic to satisfy the minimum amount to return. If you want to limit the potential latency (usually due to SLAs controlling the maximum latency of the application), you can set fetch.max.wait.ms to a lower value. If you set fetch.max.wait.ms to 100 ms and fetch.min.bytes to 1 MB, Kafka will respond to the consumer's fetch request either when it has 1 MB of data to return or after 100 ms, whichever happens first.
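That latency-bounded fetch example can be written down as a consumer configuration sketch (standard Kafka consumer property names, illustrative values):

```python
# Consumer fetch tuning (sketch): batch up to ~1 MB, but never wait more than 100 ms.
latency_bound_fetch = {
    "fetch.min.bytes": 1048576,  # wait for roughly 1 MB of data...
    "fetch.max.wait.ms": 100,    # ...but respond after 100 ms regardless
}
```

Whichever condition is met first triggers the broker's response, bounding consumer latency at roughly 100 ms.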
session.timeout.ms (default value 10 seconds; raised to 45 seconds in Kafka 3.0)
Defines how long a consumer can go without contacting the broker before it is considered dead. When the session times out, the consumer is removed from the group and a rebalance is triggered, while heartbeat.interval.ms defines how often the consumer's background thread sends a heartbeat to the group coordinator.
To avoid frequent rebalances, it's common to set heartbeat.interval.ms to roughly one-third of session.timeout.ms. Tuning these values helps you avoid unwanted rebalancing and the overhead associated with it.
max.poll.records controls the maximum number of records that a single call to
poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop.
auto.offset.reset controls the behavior of the consumer when it starts reading a partition for which Kafka has no committed offset, defining where to start reading from. You can set it to "earliest" or "latest": "earliest" reads all messages from the beginning of the partition, while "latest" reads only new messages produced after the consumer subscribed to the topic. The default value of auto.offset.reset is "latest."
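Pulling the consumer-side settings together, here is an illustrative configuration sketch. The group id is a hypothetical placeholder, and note that max.poll.records is a Java-client property:

```python
# Sketch of a resilient consumer configuration (standard property names;
# "example-group" is a hypothetical placeholder).
consumer_config = {
    "group.id": "example-group",
    "auto.offset.reset": "earliest",  # replay from the beginning on first start
    "max.poll.records": 200,          # cap work per poll() iteration (default 500)
    "fetch.max.wait.ms": 100,         # bound fetch latency
    "session.timeout.ms": 10000,
    "heartbeat.interval.ms": 3000,    # ~1/3 of the session timeout
}
```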
In addition to the configuration properties presented above, there are a number of other important configurations that any user of Kafka must know about.
The Kafka consumer on its own supports only at-most-once and at-least-once delivery semantics.
Thanks for taking the time to read through this article!