Introduction To Apache Kafka

Reading Time: 6 minutes

Introduction

Apache Kafka is a framework implementation of a software bus using stream processing. It is an open-source platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library. Apache Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a “message set” abstraction that naturally groups messages together to reduce the overhead of network round trips. This leads to larger network packets, larger sequential disk operations, and contiguous memory blocks, which allows Kafka to turn a bursty stream of random message writes into linear writes.

Kafka is very fast and is designed to provide zero downtime and zero data loss when replication is configured.

Need For Kafka

Kafka is a unified platform for handling real-time data streams. It supports low-latency message delivery and guarantees fault tolerance in the presence of machine failures, which gives it the ability to serve a large number of diverse consumers. Kafka is very fast and can perform on the order of 2 million writes per second. Kafka persists all data to disk, which essentially means that all writes go to the page cache of the OS (RAM); this makes it very efficient to transfer data from the page cache to a network socket.

Kafka Architecture

The main components of the Kafka architecture are described below.

  • Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper to maintain the cluster state.
  • ZooKeeper: ZooKeeper is used for managing and coordinating the Kafka brokers. The ZooKeeper service mainly notifies producers and consumers about the presence of a new broker in the Kafka system or about the failure of a broker.
  • Producer: Producers push data to brokers. When a new broker is started, all producers look it up and automatically start sending messages to it.
  • Consumer: Since Kafka brokers are stateless, each consumer has to keep track of how many messages it has consumed by using the partition offset. If a consumer acknowledges a particular message offset, it implies that it has consumed all prior messages. The consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to consume.

Kafka stores key-value messages that come from arbitrarily many processes called producers. The data can be partitioned into different “partitions” within different “topics”. Other processes called “consumers” can read messages from partitions. For stream processing, Kafka offers the Streams API, which allows writing Java applications that consume data from Kafka and write results back to Kafka.

Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Partitions are replicated to multiple brokers. This architecture allows Kafka to deliver huge streams of messages in a fault-tolerant fashion and to replace some of the conventional messaging systems such as the Java Message Service (JMS) and the Advanced Message Queuing Protocol (AMQP).

Kafka Configuration Types

Kafka is configured by means of properties. The configuration can be supplied either from a properties file or programmatically; values are taken from a default file unless the application overrides them.

Some configurations have both a default global setting and topic-level overrides. Topic-level properties are given in CSV format (e.g., “xyz.per.topic=topic1:value1,topic2:value2”), which overrides the default value for the listed topics. Kafka configurations fall into three groups:

  • Broker configs
  • Producer configs
  • Consumer configs

1. Broker configs

The important configurations are the following:

  • broker.id: A unique, non-negative integer id for each broker; it serves as the broker’s name within the cluster.
  • log.dirs: A comma-separated list of directories in which the log data is stored. Each new partition is placed in the directory that currently holds the fewest partitions.
  • zookeeper.connect: The ZooKeeper connection string, in the form hostname:port, that the broker uses to register itself and coordinate with the rest of the cluster; an optional chroot path can be appended.
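
As an illustration, a minimal server.properties using these settings could look like the following; the directory paths and the ZooKeeper address are placeholders.

broker.id=0
log.dirs=/tmp/kafka-logs-1,/tmp/kafka-logs-2
zookeeper.connect=localhost:2181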

2. Consumer configs

The essential consumer configurations are the following:

  • group.id: A string that identifies the consumer group a consumer process belongs to. Setting the same group id on several processes indicates that they are all part of the same consumer group.
  • zookeeper.connect: The ZooKeeper connection string, specifying the hosts and ports of the ZooKeeper ensemble. The connection string may also include a chroot path, which keeps the data under a dedicated path; the same chroot path must then be used by both the consumer and the producer. With a chroot path the string looks like hostname1:port1,hostname2:port2/chroot/path.
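
As an illustration, a properties snippet for this (legacy, ZooKeeper-based) consumer could look like the following; the group id, host names, and chroot path are placeholders.

group.id=test-group
zookeeper.connect=hostname1:port1,hostname2:port2/chroot/path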

The producer configuration additionally takes care of compression, synchronous versus asynchronous sending, and batching sizes, while the consumer configuration takes care of fetch sizes.

3. Producer configs

Essential configuration properties for the producer are the following:

  • metadata.broker.list: Defines where the producer can find one or more brokers in order to determine the leader for each topic.
  • request.required.acks: Tells Kafka that the producer requires an acknowledgement from the broker that the message was received.
  • serializer.class: Defines which serializer to use when preparing the message for transmission to the broker.
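
A sample snippet for this legacy producer, assuming a single local broker and string messages, might look like the following.

metadata.broker.list=localhost:9092
request.required.acks=1
serializer.class=kafka.serializer.StringEncoder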

Kafka APIs

1. Producer API

The Producer API allows applications to send streams of data to topics in the Kafka cluster.

In order to use the producer, you can use the following Maven dependency:

<dependency>
	<groupId>org.apache.kafka</groupId>
	<artifactId>kafka-clients</artifactId>
	<version>3.0.0</version>
</dependency>

The central part of the Producer API is the KafkaProducer class. Its constructor takes the configuration used to connect to the Kafka brokers, and the class provides a send method for publishing messages to one or more topics.

There are two types of producers – Sync and Async.

The same API and configuration apply to the sync producer as well. The difference between them is that a sync producer waits for each message to be acknowledged before sending the next, while an async producer sends messages in the background; the async producer is preferred when you want higher throughput.
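
A minimal sketch of a producer built on the kafka-clients dependency above is shown below; the broker address, topic name, and message are placeholders. With this client, send() is asynchronous by default, and blocking on the Future it returns makes the call effectively synchronous.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
	public static void main(String[] args) {
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
		props.put("acks", "all");                            // require acknowledgement from the broker
		props.put("key.serializer", StringSerializer.class.getName());
		props.put("value.serializer", StringSerializer.class.getName());

		try (Producer<String, String> producer = new KafkaProducer<>(props)) {
			// send() is asynchronous; it returns a Future that completes once the broker acknowledges the record
			producer.send(new ProducerRecord<>("test", "key", "This is a message"));
		}
	}
}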

2. Consumer API

The Consumer API allows applications to read streams of data from topics in the Kafka cluster.

In order to use the consumer, you can use the following Maven dependency:

<dependency>
	<groupId>org.apache.kafka</groupId>
	<artifactId>kafka-clients</artifactId>
	<version>3.0.0</version>
</dependency>
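
A minimal consumer sketch using the same dependency is shown below; the broker address, group id, and topic name are placeholders, and the loop simply prints every record it receives.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
	public static void main(String[] args) {
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
		props.put("group.id", "test-group");                 // consumers sharing a group.id split the partitions
		props.put("auto.offset.reset", "earliest");          // read from the beginning if no offset is committed
		props.put("key.deserializer", StringDeserializer.class.getName());
		props.put("value.deserializer", StringDeserializer.class.getName());

		try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
			consumer.subscribe(Collections.singletonList("test"));
			while (true) {
				// Poll the broker for new records and print each one
				ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
				for (ConsumerRecord<String, String> record : records) {
					System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
				}
			}
		}
	}
}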

3. Streams API

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.

In order to use Kafka Streams, you can use the following Maven dependency:

<dependency>
	<groupId>org.apache.kafka</groupId>
	<artifactId>kafka-streams</artifactId>
	<version>3.0.0</version>
</dependency>
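
As a sketch of how the Streams API is typically used, the following hypothetical application reads records from an input topic, upper-cases each value, and writes the result to an output topic; the topic names and application id are placeholders.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStream {
	public static void main(String[] args) {
		Properties props = new Properties();
		props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // placeholder application id
		props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
		props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
		props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

		// Build a simple topology: input-topic -> upper-case each value -> output-topic
		StreamsBuilder builder = new StreamsBuilder();
		KStream<String, String> source = builder.stream("input-topic");
		source.mapValues(value -> value.toUpperCase()).to("output-topic");

		KafkaStreams streams = new KafkaStreams(builder.build(), props);
		streams.start();
		Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
	}
}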

When using Scala you may optionally include the kafka-streams-scala library.

In order to use the Kafka Streams DSL for Scala with Scala 2.13, you can use the following Maven dependency:

<dependency>
	<groupId>org.apache.kafka</groupId>
	<artifactId>kafka-streams-scala_2.13</artifactId>
	<version>3.0.0</version>
</dependency>

4. Connect API

The Connect API allows implementing connectors that continually pull data from some source system into Kafka or push data from Kafka into some sink system.

Many users of Connect will not need to use this API directly; they can use pre-built connectors without needing to write any code.
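
For example, the file source connector that ships with Kafka can be run from a small properties file along the lines of the sketch below; the file name and topic are placeholders.

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test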

5. Admin API

The Admin API supports managing and inspecting topics, brokers, ACLs, and other Kafka objects.

To use the Admin API, add the following Maven dependency:

<dependency>
	<groupId>org.apache.kafka</groupId>
	<artifactId>kafka-clients</artifactId>
	<version>3.0.0</version>
</dependency>
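
A minimal sketch using the AdminClient from this dependency to create and list topics is shown below; the broker address and topic name are placeholders.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.Set;

public class AdminExample {
	public static void main(String[] args) throws Exception {
		Properties props = new Properties();
		props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

		try (AdminClient admin = AdminClient.create(props)) {
			// Create a topic named "test" with one partition and a replication factor of 1
			admin.createTopics(Collections.singletonList(new NewTopic("test", 1, (short) 1))).all().get();

			// List the topics that currently exist in the cluster
			Set<String> topics = admin.listTopics().names().get();
			System.out.println(topics);
		}
	}
}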

Operations on Kafka

Step 1: Download the code

Download the 0.8.1.1 release and un-tar it.

> tar -xzf kafka_2.9.2-0.8.1.1.tgz
> cd kafka_2.9.2-0.8.1.1

Step 2: Start the server

Kafka uses ZooKeeper, so you need to first start a ZooKeeper server if you don’t already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2021-12-16 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties
[2021-12-16 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2021-12-16 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...

Step 3: Create a topic

Let’s create a topic named “test” with a single partition and only one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test

Step 4: Send some messages

Kafka comes with a command line client that takes input from a file or from standard input and sends it out as messages to the Kafka cluster. By default, each line is sent as a separate message.

Now run the producer and then type a few messages into the console to send to the server.

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 
This is a message
This is another message

Step 5: Start a consumer

Kafka also has a command line consumer that will dump out messages to standard output.

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

Benefits of using Kafka

  • Kafka is highly scalable.
  • It is highly durable.
  • It is highly reliable.
  • It offers high performance.

Conclusion

In this blog, we have covered the core concepts of Apache Kafka along with suitable code examples.

You can refer to the official documentation: https://kafka.apache.org/documentation/

Written by 

KRISHNA JAISWAL is a Software Consultant Trainee at Knoldus. He is passionate about Java and MySQL, and also has knowledge of C, C++, and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner and curious to learn new technologies. His hobbies include reading books, listening to music, and playing cricket.