Many systems use Kafka for distributed, real-time processing of large volumes of messages. Before starting on this topic, I assume that you are familiar with the basic concepts of Kafka such as brokers, partitions, topics, producers and consumers. Here we discuss log compaction.
What is Log Compaction
Kafka log compaction is a hybrid approach that gives you an effective recovery strategy while also keeping your data log from growing without bound. The official Kafka documentation says:
Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
In simple words, Kafka removes an older message, in the background, once a newer message with the same key has been written to the partition log. This ensures that you always have at least the last known value for each key, which is also very useful for restoring state after a system failure or crash.
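To make the "latest value per key" guarantee concrete, here is a minimal Python sketch that simulates the effect of compaction on an in-memory list of records. It is not Kafka's implementation: the compact function, the tuple layout and the sample keys are all hypothetical and exist only for illustration.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order.

    `log` is a list of (offset, key, value) tuples sorted by offset.
    This is an illustrative model, not Kafka's actual algorithm.
    """
    latest_offset = {}
    for offset, key, _ in log:
        latest_offset[key] = offset  # a later record overwrites an earlier one
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]


# Hypothetical sample data: user-1 and user-2 are each updated once.
log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@old.example"),
    (2, "user-1", "alice@new.example"),
    (3, "user-3", "carol@example"),
    (4, "user-2", "bob@new.example"),
]

compacted = compact(log)
for record in compacted:
    print(record)
```

After compaction only offsets 2, 3 and 4 remain, one record per key, and each surviving record still carries the offset it was written with.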
Kafka Log Compaction Structure
A compacted log has a head and a tail. The head is the same as a traditional Kafka log: new records are appended to the end of the head. Compaction works only on the tail of the log. When the tail is rewritten after a compaction cleanup, its records always keep their original offsets.
The Log Cleaner is responsible for handling log compaction. It is a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log. The actual process of log compaction is shown in the image below.
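In code form, the cleaner's recopy step can be approximated as follows. This is a simplified sketch under stated assumptions, not the real Log Cleaner: the clean function and the in-memory (offset, key, value) tuples are hypothetical, and segment files, threading and the active segment are ignored. It only shows two ideas from the text: records whose key reappears later are dropped, and surviving records keep their original offsets.

```python
def clean(tail, head):
    """Recopy the log, dropping records superseded by a newer record in the head.

    `tail` and `head` are lists of (offset, key, value) tuples in offset order.
    Illustrative only; the real cleaner works on segment files.
    """
    # Step 1: scan the head and remember the latest offset seen for each key.
    offset_map = {}
    for offset, key, _ in head:
        offset_map[key] = offset

    # Step 2: recopy, keeping a record unless the map holds a newer offset
    # for the same key. Offsets are copied unchanged.
    def survives(record):
        offset, key, _ = record
        return offset_map.get(key, -1) <= offset

    return [record for record in tail + head if survives(record)]


# Hypothetical data: the tail was compacted before, so offset 1 is already gone.
tail = [(0, "k1", "a"), (2, "k2", "b"), (3, "k3", "c")]
head = [(4, "k1", "d"), (5, "k4", "e"), (6, "k1", "f")]

cleaned = clean(tail, head)
print(cleaned)
```

The result contains offsets 2, 3, 5 and 6. The gaps are expected: compaction removes records but never renumbers the ones that remain.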
Log compaction is controlled by the following topic-level configurations:
- min.compaction.lag.ms :- The minimum time that must pass after a message arrives in the log before it can be compacted. It acts as a lower bound on how long a message remains in the head.
- delete.retention.ms :- How long delete markers (tombstones) are retained in a compacted topic before they are themselves removed. A consumer that lags by more than delete.retention.ms may miss the delete markers.
- min.cleanable.dirty.ratio :- The minimum fraction of the partition log that must be “dirty” (not yet compacted) before Kafka attempts to compact it.
- max.compaction.lag.ms :- The maximum delay between the time a message is written and the time it becomes eligible for compaction. This overrides min.cleanable.dirty.ratio: the log segment is compacted even if the dirty ratio is below the threshold.
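The interplay of these settings can be sketched as a small decision function. This is a rough model, not Kafka's scheduling code: the function names, the byte counts and the ages are hypothetical inputs, and the dirty ratio is assumed here to be dirty bytes divided by total bytes.

```python
def dirty_ratio(clean_bytes, dirty_bytes):
    """Fraction of the log that has not been compacted yet (assumed definition)."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def may_be_compacted(record_age_ms, min_compaction_lag_ms=0):
    """min.compaction.lag.ms: younger records stay in the head untouched."""
    return record_age_ms >= min_compaction_lag_ms


def should_compact(clean_bytes, dirty_bytes, oldest_dirty_age_ms,
                   min_cleanable_dirty_ratio=0.5, max_compaction_lag_ms=None):
    """Decide whether a partition log is due for compaction in this model."""
    # max.compaction.lag.ms overrides the dirty-ratio threshold.
    if max_compaction_lag_ms is not None and oldest_dirty_age_ms > max_compaction_lag_ms:
        return True
    # Otherwise compact only once enough of the log is dirty.
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_dirty_ratio


# 10% dirty: below the 0.5 threshold, so no compaction yet.
print(should_compact(900, 100, oldest_dirty_age_ms=1_000))
# Same log, but the oldest dirty record has exceeded the maximum lag.
print(should_compact(900, 100, oldest_dirty_age_ms=10_000,
                     max_compaction_lag_ms=5_000))
# 60% dirty: above the threshold.
print(should_compact(400, 600, oldest_dirty_age_ms=0))
```

The second call shows why max.compaction.lag.ms matters for low-traffic topics: without it, a log that never becomes dirty enough would never be compacted.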
Creating a log compacted topic
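Using the Kafka console command below, we can easily create a compacted topic. Note that newer Kafka releases no longer accept the --zookeeper option; there, pass --bootstrap-server with a broker address instead.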
./kafka-topics.sh --create --zookeeper localhost:2181 --topic log-compacted-topic --replication-factor 2 --partitions 2 --config "cleanup.policy=compact,delete" --config "delete.retention.ms=1000" --config "segment.ms=1000" --config "min.cleanable.dirty.ratio=0.05"
Log compaction is a special type of retention policy that fits certain use cases well. It is great for caching, where you want to keep the latest value of each key in real time. Suppose you want to build your cache during startup: by reading the log-compacted topic, you can often populate it much faster than by querying a SQL database.
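The cache-rebuild idea can be sketched in a few lines of Python. The example replays an in-memory list instead of a real topic, so the rebuild_cache function and the sample records are hypothetical; a real application would read the compacted topic from the earliest offset with a Kafka consumer. A record with a None value stands in for a delete marker (tombstone).

```python
def rebuild_cache(records):
    """Replay (key, value) records in log order to rebuild the latest state."""
    cache = {}
    for key, value in records:
        if value is None:
            cache.pop(key, None)  # delete marker: forget this key
        else:
            cache[key] = value    # newer value replaces the older one
    return cache


# Hypothetical records as they might be read from a compacted topic.
records = [
    ("user-1", "alice"),
    ("user-2", "bob"),
    ("user-1", "alice-v2"),
    ("user-2", None),       # user-2 was deleted
    ("user-3", "carol"),
]

cache = rebuild_cache(records)
print(cache)
```

Because compaction has already discarded most superseded records, the replay touches far fewer messages than the topic originally received, which is where the startup speed-up comes from.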