How to Delete Records from a Kafka Topic: Tombstones

Hello Reader,
Here we will see how to delete records from a Kafka topic, both compacted and non-compacted.

Problem :

GDPR: The General Data Protection Regulation is a regulation that requires businesses to protect the personal data and privacy of EU citizens for transactions that occur within EU member states.

CCPA: The California Consumer Privacy Act is a state-wide data privacy law that regulates how businesses all over the world are allowed to handle the personal information (PI) of California residents.

Under both regulations, consumers can ask a business to delete their personal data whenever they see fit.

Glimpse of Kafka

Kafka : Event streaming

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.

Applications connect to this system and write records to a topic (think of a topic as a named, append-only log to which records can be added and from which they can be processed and reprocessed). A record has four attributes: key, value, timestamp, and headers. The key and headers are optional, although log compaction only works on records that have a key; the timestamp is set by the producer or, failing that, by the broker. Another application may connect to the system and process or reprocess records from a topic. The retention period defines how long data is kept in the topic.

These are the four main parts of a Kafka system:

  • Broker: Handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
  • Zookeeper: Keeps the state of the cluster (brokers, topics, users).
  • Producer: Sends records to a broker.
  • Consumer: Consumes batches of records from the broker.

What is Log Compaction ?

Log compaction is a strategy that removes older records when a more recent record with the same key exists. It ensures that Kafka always retains at least the last known value for each message key within the log of a single topic partition. A message with a key and a null payload marks that key for deletion: compaction removes the earlier records for the key and, eventually, the marker itself. In simple terms, Apache Kafka keeps the latest version of a record and deletes the older versions with the same key.

Log Compaction : How are logs stored ?

Kafka log compaction allows consumers to restore their state from compacted topics. Compaction never reorders messages; it only deletes those that have been superseded by a newer message with the same key. The partition offset of a message also never changes. Each topic log is divided into two areas based on offsets: the head and the tail. Every new record is appended to the end of the head, while the tail holds the records that have already been compacted.
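To make the "restore state" idea concrete, here is a minimal sketch in plain Python. It does not talk to Kafka; the `Record` tuple and the `rebuild_state` function are illustrative names, not part of any Kafka client. It replays records in offset order and treats a `None` value as a deletion marker.

```python
from typing import Dict, Iterable, Optional, Tuple

# (offset, key, value); a value of None models a record with a null payload.
Record = Tuple[int, str, Optional[str]]


def rebuild_state(records: Iterable[Record]) -> Dict[str, str]:
    """Replay records in offset order and keep the latest value per key."""
    state: Dict[str, str] = {}
    for _offset, key, value in records:
        if value is None:
            state.pop(key, None)  # null payload: forget this key
        else:
            state[key] = value    # newer value replaces the older one
    return state


# A compacted log: offset 0 is gone, but the surviving records keep
# their original offsets and their original order.
compacted_log = [(1, "111", "Yadav"), (2, "123", "Itachi"), (3, "007", "Naruto")]
print(rebuild_state(compacted_log))
# prints {'111': 'Yadav', '123': 'Itachi', '007': 'Naruto'}
```

Because offsets never change, a consumer can resume from a stored offset even after compaction has removed some of the records before it.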

How log compaction works ?

Compaction runs in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput, which avoids impacting producers and consumers. This strategy not only deletes superseded records but also removes keys with null values. Such records are known as tombstone records.

Log compaction : Data cleaning

If you don’t have Kafka running locally, you can use a docker-compose file to run it on your system. The following session creates a compacted topic, produces four records (two of which share the key 123), and then consumes the topic after compaction:

$ kafka-topics.sh --create --zookeeper zookeeper:2181 --topic latest-product-price-3 --replication-factor 1 --partitions 1 --config cleanup.policy=compact --config delete.retention.ms=100  --config segment.ms=1 --config min.cleanable.dirty.ratio=0.00
Created topic latest-product-price-3.
$ kafka-console-producer.sh --broker-list localhost:9092 --topic  latest-product-price-3 --property parse.key=true --property key.separator=: 
>123:mukesh
>111:Yadav
>123:Itachi
>007:Naruto
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic latest-product-price-3  --property  print.key=true --property key.separator=: --from-beginning
111:Yadav
123:Itachi
007:Naruto

Each message in Apache Kafka consists of a value, offset, timestamp, key, message size, compression codec, checksum, and version of the message format. The on-disk format is the same as the format the broker receives from producers and sends to consumers, which lets Kafka transfer data with zero-copy. To compact a log, the log-cleaner thread first builds an offset map that records, for each key in the head, the newest offset at which that key appears. It then recopies the log from the beginning and checks every record against the map: if the map holds a newer offset for the record's key, the record is removed.
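The two passes described above can be sketched in a few lines of Python. This is a simplified model, not Kafka's actual cleaner: it treats the whole log as the head, ignores segments, and the names `build_offset_map` and `clean` are made up for illustration. The input is the same four records produced in the console session.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str, Optional[str]]  # (offset, key, value)


def build_offset_map(head: List[Record]) -> Dict[str, int]:
    """Pass 1: map each key to the newest offset at which it appears."""
    offset_map: Dict[str, int] = {}
    for offset, key, _value in head:
        offset_map[key] = offset  # later records overwrite earlier ones
    return offset_map


def clean(log: List[Record], offset_map: Dict[str, int]) -> List[Record]:
    """Pass 2: recopy the log, skipping records that have a newer version."""
    kept: List[Record] = []
    for offset, key, value in log:
        newest = offset_map.get(key)
        if newest is not None and offset < newest:
            continue  # superseded by a record with the same key
        kept.append((offset, key, value))
    return kept


log = [(0, "123", "mukesh"), (1, "111", "Yadav"),
       (2, "123", "Itachi"), (3, "007", "Naruto")]
print(clean(log, build_offset_map(log)))
# prints [(1, '111', 'Yadav'), (2, '123', 'Itachi'), (3, '007', 'Naruto')]
```

The surviving records match the consumer output shown earlier. Note that this sketch keeps tombstones in place; dropping them after a delay (the `delete.retention.ms` setting used when creating the topic) is left out.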

Second door to data deletion

A Kafka topic stores all consumer data as events.
On a non-compacted topic, Kafka removes this data automatically once the retention period has passed (7 days by default).
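As a rough illustration of time-based retention, the sketch below assumes that Kafka deletes whole log segments once the newest record in a segment is older than the retention period. It is a simplified model: the `Segment` type and `expired_segments` function are hypothetical, and special handling of the segment currently being written is ignored.

```python
from typing import List, NamedTuple

DAY_MS = 24 * 60 * 60 * 1000
SEVEN_DAYS_MS = 7 * DAY_MS  # the default retention period mentioned above


class Segment(NamedTuple):
    base_offset: int       # offset of the first record in the segment
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments: List[Segment], now_ms: int,
                     retention_ms: int = SEVEN_DAYS_MS) -> List[Segment]:
    """Return segments whose newest record is older than the retention period."""
    return [s for s in segments if now_ms - s.max_timestamp_ms > retention_ms]


now = 30 * DAY_MS
segments = [
    Segment(base_offset=0, max_timestamp_ms=now - 10 * DAY_MS),   # 10 days old
    Segment(base_offset=500, max_timestamp_ms=now - 8 * DAY_MS),  # 8 days old
    Segment(base_offset=900, max_timestamp_ms=now - 1 * DAY_MS),  # 1 day old
]
print([s.base_offset for s in expired_segments(segments, now)])
# prints [0, 500]
```

The point for GDPR-style requests is that retention removes data by age, not by owner: one client's records cannot be singled out this way.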

Another option is to delete the Kafka topic itself, which deletes all of its data. Here is the command:

kafka-topics.sh --delete --bootstrap-server localhost:9092 --topic dummy.topic

Note: Today, data is money, so deleting all the data in a topic is rarely acceptable. Usually we need to delete only a particular client's data. On a compacted topic, that is what a tombstone record (the client's key with a null value) is for.
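Deleting a single client's data on a compacted topic comes down to producing a record with that client's key and a null value. The sketch below assumes a producer object that behaves like kafka-python's `KafkaProducer`, whose `send()` accepts `key` and `value` keyword arguments. The `send_tombstone` helper and the `RecordingProducer` stand-in are hypothetical, added only so the example runs without a broker.

```python
from typing import Any, List, Optional, Tuple


def send_tombstone(producer: Any, topic: str, key: bytes) -> None:
    """Publish a tombstone: the record's key with a null (None) value."""
    # Assumption: `producer` has send(topic, key=..., value=...) and flush(),
    # as kafka-python's KafkaProducer does.
    producer.send(topic, key=key, value=None)
    producer.flush()  # wait until buffered records have been sent


class RecordingProducer:
    """Hypothetical stand-in that records what would have been sent."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, Optional[bytes], Optional[bytes]]] = []

    def send(self, topic: str, key: Optional[bytes] = None,
             value: Optional[bytes] = None) -> None:
        self.sent.append((topic, key, value))

    def flush(self) -> None:
        pass


producer = RecordingProducer()
send_tombstone(producer, "latest-product-price-3", b"123")
print(producer.sent)
# prints [('latest-product-price-3', b'123', None)]
```

Against a real cluster, the stand-in would be replaced by something like `KafkaProducer(bootstrap_servers="localhost:9092")` from the kafka-python package. After the next compaction run, consumers reading from the beginning should no longer see a value for key 123, and the tombstone itself is removed once `delete.retention.ms` has elapsed.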

Conclusion

Apache Kafka is an event streaming platform for writing, reading, and storing messages as events. Generally, we don’t delete data from a Kafka topic: in a banking system, for example, a transaction cannot be reverted once it is done.
That’s pretty much it for this article. If you have any feedback or queries, please let me know in the comments. Also, if you liked the article, please give me a thumbs up and I will keep writing blogs like this for you in the future as well. Keep reading and keep learning.
