What is Apache Kafka?
Apache Kafka is a well-known name in the world of Big Data and one of the most widely used distributed streaming platforms. Kafka is not just a messaging queue but a full-fledged event streaming platform: a framework for storing, reading, and analyzing streaming data. It is a publish-subscribe based, durable messaging system that exchanges data between processes, applications, and servers.
Let’s understand the essentials of Kafka in detail:
- Topic:
A topic is a named stream of records: a category/feed name to which records are stored and published. Producer applications write data to topics and consumer applications read from them. Records published to the cluster stay in the cluster until a configurable retention period has elapsed.
“Kafka retains records in the log, making the consumers responsible for tracking the position in the log, known as the ‘offset’. Typically, a consumer advances the offset in a linear manner as messages are read. However, the position is actually controlled by the consumer, which can consume messages in any order. For example, a consumer can reset to an older offset when reprocessing records.”
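The consumer-owned offset described above can be sketched in a few lines of plain Python (this is an illustration of the idea, not the real Kafka client API):

```python
# Minimal sketch: a consumer tracks its own offset into an append-only log
# and can rewind it to reprocess older records.
log = ["rec-0", "rec-1", "rec-2", "rec-3"]  # the broker's retained log

offset = 0                  # the position is owned by the consumer
first_pass = []
while offset < len(log):
    first_pass.append(log[offset])
    offset += 1             # typical linear advance as messages are read

offset = 2                  # reset to an older offset to reprocess
reprocessed = log[offset:]  # records from offset 2 onward are read again
```

Because the broker never tracks the consumer's position, rewinding is just a local variable change on the consumer side.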
- Partition:
Kafka topics are divided into a number of partitions, each containing records in an immutable sequence. Partitions allow a topic to be parallelized by splitting its data across multiple brokers.
Each record in a partition is assigned and identified by its unique offset. Offsets increment sequentially per message and are not reset when older messages expire past the retention period.
The latest offset in a partition therefore does not necessarily equal the number of messages the partition currently holds.
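A small sketch (with hypothetical names) makes this concrete: when retention expires old records, the partition remembers where its retained log now starts instead of renumbering anything, so the next offset keeps growing independently of the message count.

```python
# Sketch: offsets keep incrementing past retention; the partition tracks a
# "log start offset" rather than resetting offsets when records expire.
class Partition:
    def __init__(self):
        self.start_offset = 0   # offset of the oldest retained record
        self.records = []       # retained records only

    def append(self, record):
        self.records.append(record)

    def next_offset(self):      # offset the next appended record will get
        return self.start_offset + len(self.records)

    def expire(self, n):        # retention removes the n oldest records
        self.records = self.records[n:]
        self.start_offset += n

p = Partition()
for i in range(5):
    p.append(f"m{i}")           # records get offsets 0..4
p.expire(3)                     # offsets 0..2 expire
# next_offset is still 5, but only 2 messages remain in the partition
```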
- Producer:
A producer is a Kafka client application that acts as the source of the data stream. It generates messages and publishes them to one or more topics in the Kafka cluster. Kafka's Producer API packs each message as either a plain value or a key-value pair.
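When a message carries a key, the producer uses it to pick a partition, so records with the same key always land in the same partition. The sketch below illustrates the idea with a CRC32 hash; Kafka's default partitioner actually uses a murmur2 hash, so treat this as an illustrative stand-in:

```python
# Sketch of key-based partitioning (illustrative: Kafka's default
# partitioner uses murmur2, not CRC32).
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    return zlib.crc32(key) % NUM_PARTITIONS

# The same key always maps to the same partition, which is what
# preserves per-key ordering in Kafka.
a1 = partition_for(b"user-42")
a2 = partition_for(b"user-42")
```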
- Consumer:
A consumer is an application that reads data from Kafka topics. It subscribes to one or more topics in the Kafka cluster and then consumes the records published to them.
- Consumer group:
You group consumers into a consumer group by use case or function. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group streams those same records to Hadoop. Consumer groups have names to distinguish them from one another. The main purpose of consumer groups is to distribute incoming data among multiple consumers; within a group, no two consumers read the same partition, which prevents redundant consumption of a record.
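The distribution within a group can be sketched as a partition assignment: each partition goes to exactly one consumer in the group. The round-robin scheme below is similar in spirit to Kafka's assignors; the names are hypothetical.

```python
# Sketch: spreading a topic's partitions across a consumer group's members
# round-robin, so each partition is owned by exactly one group member.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assignment = assign(["p0", "p1", "p2", "p3"], ["c0", "c1"])
# No partition appears under two consumers, so no record is delivered
# twice within the group.
```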
- Broker:
A broker is a Kafka server that runs in a Kafka cluster; it is where the data actually resides. A Kafka broker receives the data published by producers and saves it on disk. Multiple brokers form a cluster, and all partitions from all topics are distributed among the brokers in that cluster. The term "broker" is sometimes also used loosely for the logical system, or for Kafka as a whole.
- Zookeeper:
ZooKeeper is an open-source, distributed coordination service that provides configuration management, synchronization, and a naming registry for distributed applications. ZooKeeper keeps track of the status of the Kafka cluster nodes, as well as Kafka topics, partitions, etc.
What makes Kafka fast?
Kafka provides a high-throughput, highly distributed, fault-tolerant platform with low-latency delivery of messages. The following is the approach Kafka takes to achieve these characteristics:
- Use of File System
"Disks are slow." This general perception is what makes people sceptical about Kafka. But disks are both much slower and much faster than people expect, depending on how they are used; a properly designed disk structure can often be as fast as the network.
Rather than maintaining as much as possible in memory and flushing it all out to the filesystem in a panic when space runs out, Kafka inverts that: it relies on sequential I/O. All data is immediately written to a persistent log on the filesystem without necessarily being flushed to disk. In effect this just means it is transferred into the kernel's pagecache.
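The write path can be sketched as a plain sequential append: `write()` hands the bytes to the kernel's pagecache and returns, and the OS decides when they reach the physical disk (file names here are illustrative, not Kafka's actual layout):

```python
# Sketch: sequential appends to a log file. No explicit flush/fsync is
# issued, so the data sits in the kernel's pagecache until the OS (or an
# explicit flush policy) writes it to disk.
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "partition-0.log")
with open(path, "ab") as log:
    for i in range(1000):
        log.write(f"record-{i}\n".encode())  # sequential append, no seeks
# (os.fsync(log.fileno()) would force the flush; Kafka instead relies on
# configurable flush policies and on replication for durability.)
```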
- Use of Queue
The BTree is the most common persistent data structure used by messaging systems. BTree operations are O(log N), which is usually treated as essentially constant time; but for disk operations this assumption breaks down.
Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time, so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data grows with a fixed cache: doubling your data makes things much worse than twice as slow.
So Kafka uses a persistent queue as its underlying data structure. This structure has the advantage that all operations are O(1), and reads do not block writes or each other. This makes performance completely decoupled from the data size: one server can take full advantage of a number of cheap, low-rotational-speed 1+ TB SATA drives.
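The O(1) read can be sketched as follows: remember each record's byte position as it is appended, then seek straight to it, with no tree to traverse. (Kafka actually keeps a *sparse* offset index per log segment; the dense index below keeps the illustration simple, and an in-memory buffer stands in for the log file.)

```python
# Sketch: O(1) reads on an append-only log via a position index
# (simplified: Kafka's real index is sparse and per-segment).
import io

log = io.BytesIO()   # stands in for the log file on disk
positions = []       # positions[offset] -> byte position of that record

def append(record: bytes) -> int:
    positions.append(log.tell())
    # length-prefixed framing so records can be read back individually
    log.write(len(record).to_bytes(4, "big") + record)
    return len(positions) - 1          # the new record's offset

def read(offset: int) -> bytes:
    log.seek(positions[offset])        # direct O(1) lookup, no traversal
    size = int.from_bytes(log.read(4), "big")
    return log.read(size)

for i in range(100):
    append(f"m{i}".encode())
```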
- Batching of data
Kafka also eliminates another systemic inefficiency: too many small I/O operations. This problem occurs both between the client and the server and in the server's own persistent operations.
Kafka deals with this problem through message sets: messages are grouped together and sent over the network as a batch, amortizing the overhead of the network round trip instead of paying it for every single message.
Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
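The amortization can be sketched with a toy producer-side buffer (all names hypothetical): messages accumulate until a batch is full, and one "request" carries the whole batch.

```python
# Sketch: batching messages so 1000 sends cost 10 round trips, not 1000.
class BatchingSender:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.requests = []          # each entry = one network round trip

    def send(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.requests.append(list(self.buffer))  # one request per batch
            self.buffer.clear()

s = BatchingSender(batch_size=100)
for i in range(1000):
    s.send(i)
```

The real producer adds a time bound (linger) so a slow trickle of messages is not delayed indefinitely waiting for a full batch.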
- Compression of Batches
The next hurdle in any data pipeline is network bandwidth, since data often has to travel over a wide-area network. A user can compress the data without any help from Kafka, but that tends to yield poor compression ratios, because much of the redundancy lies in repetition between messages of the same type (e.g. field names in JSON, user agents in web logs, or common string values).
Efficient compression requires compressing multiple messages together rather than compressing each message individually.
Kafka facilitates this by supporting recursive message sets: a batch of messages can be grouped together, compressed, and sent to the server in that form. The batch is written to the log in compressed form, remains compressed in the log, and is only decompressed by the consumer.
Kafka supports the GZIP, Snappy, and LZ4 compression codecs (and, in newer versions, ZStandard).
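The effect of batch-level compression is easy to demonstrate: compressing a whole batch of similar JSON messages exploits the cross-message redundancy (the repeated field names) that per-message compression cannot see.

```python
# Sketch: gzip over a whole batch vs. gzip per message. The batch variant
# sees the repetition across messages and compresses far better.
import gzip, json

messages = [
    json.dumps({"user_agent": "Mozilla/5.0", "path": f"/page/{i}"}).encode()
    for i in range(200)
]

per_message = sum(len(gzip.compress(m)) for m in messages)
whole_batch = len(gzip.compress(b"\n".join(messages)))
# whole_batch comes out much smaller than per_message
```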
- Zero Copy
Another inefficiency: Byte Copying.
The common data path for transfer of data from file to socket:
– The operating system reads data from the disk into pagecache in kernel space
– The application reads the data from kernel space into a user-space buffer
– The application writes the data back into kernel space into a socket buffer
– The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network
There are four copies and two system calls. INEFFICIENCY ALERT!!
If we keep data in the same format in which it will be sent over the network, we can copy it directly from the pagecache to the NIC buffer. This can be done through the OS sendfile system call.
The data is copied into pagecache exactly once and reused on each consumption instead of being stored in memory and copied out to user-space every time it is read. This allows messages to be consumed at a rate that approaches the limit of the network connection.
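Python exposes this call as `os.sendfile`, so the optimized path can be sketched directly (Linux semantics assumed; the socketpair stands in for a real network connection): the file's bytes travel from the pagecache into the socket entirely inside the kernel, never entering a user-space buffer.

```python
# Sketch of zero-copy transfer with sendfile (Linux): the application only
# orchestrates the copy; the bytes themselves stay in kernel space.
import os, socket, tempfile

payload = b"x" * 65536
with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()

    server, client = socket.socketpair()   # stand-in for a TCP connection
    sent = 0
    while sent < len(payload):
        # os.sendfile(out_fd, in_fd, offset, count): kernel-to-kernel copy
        sent += os.sendfile(server.fileno(), f.fileno(), sent,
                            len(payload) - sent)
    server.close()

    received = bytearray()
    while len(received) < len(payload):
        chunk = client.recv(65536)
        if not chunk:
            break
        received.extend(chunk)
    client.close()
```

Contrast this with the four-copy path above: a read into a user buffer followed by a write to the socket would cross the kernel boundary twice per chunk.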
That was just the introduction of Kafka and why Kafka is fast. We will dig deeper further in my next blogs. Stay Tuned! 🙂