In my previous blog post, we looked at what Kafka is and what makes it fast. If you haven’t read it yet, you should give it a read. We also talked briefly about Zookeeper: we know that it keeps track of the status of the Kafka cluster nodes, as well as Kafka topics, partitions, etc. But what else does it do?
In this post, we will learn more about Zookeeper: what it is, and why it’s important to Apache Kafka. Let’s get started.
What is Zookeeper?
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
Zookeeper is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Why do we need it? Simple! The goal is to make distributed systems easier to manage.
ZooKeeper allows developers to focus on the core application logic, and it implements various protocols on the cluster so that the applications need not implement them on their own. These services are used in some form or another by distributed applications.
Zookeeper at the Component Level
Apache ZooKeeper follows a client-server architecture: the servers are the nodes that provide the ZooKeeper service, and the clients are the machines (the nodes of the distributed application) that use it.
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers – called znodes (similar to files and directories).
Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
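This design gives ZooKeeper three key properties:
- High performance: it can be used in large, distributed systems.
- Highly available: the service is replicated, which keeps it from being a single point of failure.
- Strictly ordered access: because updates are strictly ordered, sophisticated synchronization primitives can be implemented at the client.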
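To make the znode namespace described above more concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the real ZooKeeper client API: the class `ZNodeTree` and its methods are hypothetical names made up for illustration. The paths mimic the layout Kafka uses for broker registration, but the host values are invented.

```python
class ZNodeTree:
    """Toy in-memory model of a znode namespace (NOT the real ZooKeeper API)."""

    def __init__(self):
        # Each znode is addressed by an absolute, slash-separated path and
        # holds a small piece of data: it acts like a file and a directory.
        self._nodes = {"/": b""}

    def create(self, path, data=b""):
        # A znode can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent znode {parent} does not exist")
        if path in self._nodes:
            raise KeyError(f"znode {path} already exists")
        self._nodes[path] = data

    def get(self, path):
        return self._nodes[path]

    def children(self, path):
        # Direct children only: exactly one path segment below `path`.
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ZNodeTree()
tree.create("/brokers")
tree.create("/brokers/ids")
tree.create("/brokers/ids/0", b"host-a:9092")  # made-up host values
tree.create("/brokers/ids/1", b"host-b:9092")

print(tree.children("/brokers/ids"))
print(tree.get("/brokers/ids/0"))
```

Listing the children of a path is what lets a client discover, for example, which brokers are currently registered. A real deployment would use a ZooKeeper client library talking to a running ensemble instead of a local dictionary.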
Zookeeper in Kafka
Zookeeper is a top-level, centralized service used to maintain configuration information and naming, and to provide flexible and robust synchronization within distributed systems. For Kafka, Zookeeper keeps track of the status of the cluster nodes, topics, partitions, etc.
Coordination services are notoriously difficult to get right, as they are especially prone to race conditions and deadlocks. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch. The service itself is distributed and highly reliable.
Let’s see how Zookeeper is helping Kafka:
- Kafka Brokers’ state & quotas: Zookeeper determines the state of each broker. That means it considers a Kafka broker alive as long as the broker regularly sends heartbeat requests. Also, because the broker is responsible for handling replication, it must be able to keep up with replication needs. Zookeeper also keeps track of how much data each client is allowed to read and write.
- Configuration Of Topics: Zookeeper stores the configuration of all the topics, including the list of existing topics, the number of partitions for each topic, the location of all the replicas, the list of configuration overrides for all topics, which node is the preferred leader, etc.
- Access Control Lists: Access control lists or ACLs for all the topics are also maintained within Zookeeper.
- Cluster membership: Zookeeper also maintains a list of all the brokers that are functioning at any given moment and are a part of the cluster.
- Controller Election: The controller is one of the most important broker entities in a Kafka ecosystem, and it has the responsibility to maintain the leader-follower relationship across all the partitions. If a node is shutting down for some reason, it’s the controller’s responsibility to tell the affected replicas on other nodes to take over as partition leaders, in order to fulfil the duties of the partition leaders on the node that is about to fail. And if the node hosting the controller itself shuts down, Zookeeper makes it possible to elect a new controller and to make sure that, at any given time, there is only one controller and all the follower nodes have agreed on it.
- Consumer Offsets and Registry: With the older, ZooKeeper-based consumer, ZooKeeper keeps the offsets that record how many messages each Kafka consumer has consumed. Newer Kafka consumers store their offsets in an internal Kafka topic instead.
These older consumers also have their own registry, as Kafka brokers do. The same rules apply to it, i.e. each consumer registers itself automatically as an ephemeral znode, and that znode is destroyed once the consumer goes down.
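The cluster membership and controller election points above both rest on ephemeral znodes: a znode that exists only as long as the session that created it. The following is a minimal sketch of that idea, assuming a toy in-memory coordinator. `ToyCoordinator`, `register_and_elect`, and the session names are hypothetical and are not part of ZooKeeper or Kafka; real brokers use a ZooKeeper client and watches instead of the explicit retry shown here.

```python
class ToyCoordinator:
    """Toy stand-in for ZooKeeper that tracks ephemeral znodes per session."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, path, data, session):
        # Creation fails if the path already exists, so only one creator wins.
        if path in self.nodes:
            return False
        self.nodes[path] = (data, session)
        return True

    def expire_session(self, session):
        # When a session ends (e.g. heartbeats stop), its ephemeral znodes vanish.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session}

    def live_brokers(self):
        prefix = "/brokers/ids/"
        return sorted(int(p[len(prefix):]) for p in self.nodes if p.startswith(prefix))

    def controller(self):
        entry = self.nodes.get("/controller")
        return None if entry is None else entry[0]


def register_and_elect(zk, broker_id):
    """Register a broker, then try to become controller. Returns True if it won."""
    session = f"session-{broker_id}"  # hypothetical session naming
    zk.create_ephemeral(f"/brokers/ids/{broker_id}", f"broker-{broker_id}", session)
    return zk.create_ephemeral("/controller", broker_id, session)


zk = ToyCoordinator()
results = [register_and_elect(zk, b) for b in (0, 1, 2)]
print(results, zk.controller())

# Broker 0 goes down: its membership znode and the controller znode disappear.
zk.expire_session("session-0")
print(zk.live_brokers(), zk.controller())

# The remaining brokers race to recreate /controller; the first one wins.
winner = next(
    b for b in zk.live_brokers()
    if zk.create_ephemeral("/controller", b, f"session-{b}")
)
print(winner, zk.controller())
```

Because creating an existing path fails, at most one broker can hold the controller znode at a time, which is the "only one controller" guarantee described in the list above.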
Even though Zookeeper provides numerous benefits to Kafka, the Kafka project is planning to work without it.
As of the latest version at the time of writing (2.4.1), ZooKeeper is still required for running Kafka, but in the near future the ZooKeeper dependency will be removed from Apache Kafka. See the high-level discussion in KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum.
These efforts will take a few Kafka releases and additional KIPs. Kafka Controllers will take over the tasks currently handled by ZooKeeper. The Controllers will leverage the benefits of the Event Log, which is a core concept of Kafka.
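To illustrate the event log idea mentioned above, here is a rough sketch of metadata kept as an append-only log, where the current state is rebuilt by replaying records in order. The record types and field names are hypothetical and chosen only for illustration. This is not the record format defined by KIP-500.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataState:
    brokers: set = field(default_factory=set)
    topics: dict = field(default_factory=dict)  # topic name -> partition count


def apply_record(state, record):
    """Apply one log record to the state (record types are hypothetical)."""
    kind = record["type"]
    if kind == "register_broker":
        state.brokers.add(record["broker_id"])
    elif kind == "unregister_broker":
        state.brokers.discard(record["broker_id"])
    elif kind == "create_topic":
        state.topics[record["name"]] = record["partitions"]
    elif kind == "delete_topic":
        state.topics.pop(record["name"], None)
    else:
        raise ValueError(f"unknown record type: {kind}")
    return state


def replay(log, upto=None):
    """Rebuild the state from scratch by replaying the first `upto` records."""
    state = MetadataState()
    for record in log[:upto]:
        apply_record(state, record)
    return state


log = [
    {"type": "register_broker", "broker_id": 0},
    {"type": "register_broker", "broker_id": 1},
    {"type": "create_topic", "name": "orders", "partitions": 3},
    {"type": "unregister_broker", "broker_id": 0},
]

current = replay(log)           # state after all four records
earlier = replay(log, upto=2)   # state after only the first two records
print(current)
print(earlier)
```

Any node that has read the log up to the same position arrives at the same state, which is why an ordered log can stand in for a separate coordination store.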
Some benefits of the new Kafka architecture are a simpler architecture, ease of operations, and better scalability (e.g. allow “unlimited partitions”).