If you understand the CAP Theorem which states we can either move towards Consistency (CP) or High Availability (AP) in a distributed system, we can’t achieve both at the same time. But we can tune the system in such a way that we can have as much as consistency without losing the availability of the system, but we can’t assure both 100%.
So when we talk about Kafka Cluster tuning there are several configurations which are set by default but we need to provide few as per our business/system requirement and to achieve the best possible consistency and availability. Let’s first understand them, where N is the number of nodes in the cluster:
- Cluster Size (N): Number of nodes/brokers in the Kafka cluster, we should have 2x+1, i.e. at least 3 nodes or more in an odd number.
- Partitions: We write/publish data/event into a topic which is divided into partitions (by default 1), but we should have M times N, where M can be any integer number, i.e. M >= 1, to achieve more parallelism and partitioning of data over the cluster.
- Replication Factor: determines the number of copies (including the original/Leader) of each partition in the cluster. All replicas of a partition exist on separate node/broker, and we should never have R.F. > N, but at least 3.
We recommend having 3 RF with 3 or 5 nodes cluster. This helps in having both availabilities as well as consistency.
- In-sync Replica (ISR): Number of minimum replicas (including the leader) synced up, i.e. available for the producer to successfully send messages to the partition.
This inversely impacts the availability i.e. lower the ISR more the availability and lesser the consistency and vice versa. we should always have ISR lower than RF. We recommend having 2 ISR for topics with RF as 3.
Note: Setting ISR to 1 is almost equivalent to having no replication in a system.
- Acknowledgment: message to be written into the number of replicas before it is acknowledged to the producer.
Setting acks to 0 will make the system to send acknowledgment without writing the message which may lose the data,
setting it to 1 means it should be written at least to the leader replica,
and setting it to all means message should be written to all in-sync replica which helps in consistency but drops the availability.
- Unclean Leader Election: in case of failure of all ISR, out-of-sync replica is elected as Leader, setting this to TRUE is not recommended at all, as it will lose the consistency of the system, this should be used only and only if we need the 100% availability irrespective of the consistency.
Note: Setting acks to 0 or 1 can lead to inconsistent partitions, in case of leader failure, the next ISR replica might not be aware of the recent message which will cause inconsistency in order of events in replicas.
For eg: ISR-1 (Leader): 1,2,3,4,5,6 while ISR-2 can be 1,2,3,4,6,5 in case of failure of ISR-1 just after getting the 5th event/message.
Conclusion: So, to achieve or tune your Kafka Cluster in order to get high availability without losing the data or consistency of the data we should always use a cluster of size least 3 nodes/brokers, along with a replication factor of 3 and ISR of 2 with acks to all.
- Configuring High Availability and Consistency for Apache Kafka
- Best practices around Kafka consistency and availability