Introduction to Kafka

Apache Kafka is a distributed streaming platform. It is a publish-subscribe messaging system that lets applications, servers, and processes exchange data with one another. Apache Kafka was originally developed at LinkedIn and was later donated to the Apache Software Foundation. It solves the problem of slow, unreliable data communication between senders and receivers.

Architecture of Kafka

The Kafka architecture consists of the following components:

Topics

A topic is a common name, or heading, used to group messages of a similar type. A Kafka cluster can contain multiple topics, and each topic holds a different type of message.

Partitions

Within a topic, the data or messages are divided into small subparts known as partitions. Each partition is an ordered sequence of messages, and every message in a partition is identified by an offset value. Data is written to a partition sequentially, and a topic can have many partitions, each with an ever-growing sequence of offsets. However, unless a message carries a key, there is no guarantee of which partition it will be written to.
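To make the mapping from keys to partitions concrete, here is a minimal sketch in Java. It is an illustration only: Kafka's actual default partitioner hashes the serialized key bytes with murmur2, and keyless records are spread across partitions (sticky batching in recent versions, round-robin in older ones).

public class PartitionSketch {
    // Illustration of "same key, same partition"; Kafka's real default
    // partitioner uses a murmur2 hash of the serialized key bytes.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Every message keyed "user-42" lands in the same partition,
        // so its ordering is preserved for consumers of that partition.
        System.out.println(partitionFor("user-42", 3)); // prints 0, 1, or 2
    }
}

Because ordering is only guaranteed within a partition, choosing a good key (for example, a user or order id) is how applications keep related events in order.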

Replicas

Apache Kafka is a distributed software system in the Big Data world, and such a system needs copies of the data it stores. In Kafka, each broker holds some of the data, but what if a broker or its machine fails? The data on it would be lost. As a precaution, Kafka provides replication to protect against data loss when a broker fails: a replication factor is configured for each topic, which determines how many copies of its partitions are kept across brokers.
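As a sketch of how the replication factor is set in practice, a topic can be created with Kafka's Java AdminClient. The topic name, partition count, and broker address below are placeholder values:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each kept on 3 brokers (replication factor 3),
            // so the topic survives the loss of up to two brokers
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}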

Producers

Producers are applications that write/publish data to the topics within a cluster using the producer APIs. They can write data at the topic level or to specific partitions of a topic.
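A minimal producer sketch in Java (the broker address, topic, and record contents are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Writing at the topic level: the record's key decides the partition.
            producer.send(new ProducerRecord<>("orders", "user-42", "order created"));
        }
    }
}

To target a specific partition instead, the ProducerRecord constructor that takes an explicit partition number can be used.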

Consumers

Consumers are applications that read/consume data from the topics within a cluster using the consumer APIs. They can read data at the topic level or from specific partitions of a topic.
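A matching consumer sketch (broker address, group id, and topic are placeholders). Consumers that share a group.id split the topic's partitions among themselves:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-readers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders")); // topic-level read
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}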

Brokers

Brokers are simple software processes that maintain and manage the published messages. A broker also manages consumer offsets and is responsible for delivering messages to the right consumers.

Zookeeper

ZooKeeper is used to monitor the Kafka cluster and coordinate the brokers. It stores all the metadata of the Kafka cluster as key-value pairs.

How does Kafka work?

Kafka is a distributed system consisting of servers and clients that communicate over a high-performance TCP network protocol. It can be deployed on physical hardware, virtual machines, and containers, both on-premises and in cloud environments.

Servers:

Kafka runs as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer and are called brokers. Other servers run Kafka Connect, which continuously imports and exports data as event streams to integrate Kafka with existing systems.
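As a sketch of what driving Kafka Connect looks like, the following standalone connector configuration uses the FileStreamSource connector that ships with Kafka to stream the lines of a file into a topic (the file name, path, and topic are placeholder values):

# file-source.properties (hypothetical file name)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/app.log
topic=orders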

Clients:

Clients allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures. Clients are available for Java and Scala, including the higher-level Kafka Streams library.

Features of Kafka

1. Scalable

Apache Kafka can scale its data producers, brokers, and consumers independently. Whether a few producers send a large volume of data to many groups of consumers, or the other way around, Kafka has you covered.

2. Extensibility

Kafka’s popularity over the last several years has encouraged many other applications to develop integrations with it. This makes it easy to add new functionality by plugging Kafka into other systems.

3. Fault Tolerance

Kafka Streams builds on fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated, so when stream data is persisted to Kafka it remains available even if the application fails and needs to re-process it. Tasks in Kafka Streams leverage the fault-tolerance capability of the Kafka consumer client to handle failures: if a task runs on a machine that fails, Kafka Streams automatically restarts that task on one of the remaining running instances of the application.
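A minimal Kafka Streams sketch in Java (topic names are placeholders). The application.id is what ties fault tolerance together: all instances started with the same id form one logical application, and if an instance dies, its tasks are restarted on the survivors.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Instances sharing this id divide the input partitions among
        // themselves; a failed instance's tasks move to the remaining ones.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // A trivial stage: transform each value and write to a new topic.
        orders.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}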

4. Reduces the need for multiple integrations

All the data that producers write goes through Kafka. Therefore, we only need to build one integration, with Kafka itself, and we are automatically connected to every producing and consuming system.

5. Distributed System

Apache Kafka has a distributed architecture, which is what makes it scalable. Partitioning and replication are the two key capabilities of this distributed design.

6. Real-Time Handling

Kafka can monitor and process data streams in near real time, which makes it possible to implement access control, detect anomalies, and keep communication secure as events flow through the system. This also allows non-connected legacy systems, such as machines producing sensor data, to be integrated without giving any external system access to those unsecured machines.

Uses of Kafka

1. Messaging

Kafka offers better throughput than most messaging systems, along with built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.

2. Website Activity Tracking

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.

3. Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

4. Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages.

5. Stream Processing

Many users of Kafka process data in pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then combined, enriched, or transformed into new topics.

6. Event Sourcing

Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.
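Because Kafka retains the full log, an application built this way can rebuild its state by replaying a topic from the first offset. A hedged sketch using the consumer's seek API (topic, partition, and record contents are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("account-events", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seekToBeginning(Collections.singleton(partition)); // replay from offset 0
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // apply each state change in order to rebuild the current state
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}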

Apache Kafka Applications

LinkedIn

LinkedIn developed Apache Kafka in 2010. Since Kafka is a publish-subscribe messaging system, various LinkedIn products such as LinkedIn Today and the LinkedIn Newsfeed use it for message consumption.

Uber

Uber uses Kafka as a message bus to connect the different parts of its ecosystem. Kafka helps match passengers and drivers: it collects information from the rider's app as well as the driver's app, then makes that information available to a variety of downstream consumers.

Twitter

Because Kafka fulfills its requirements for data replication and durability, Twitter is one of the best-known users of Apache Kafka. Adopting Kafka brought Twitter significant resource savings, up to 75%, a substantial cost reduction.

Netflix

Netflix uses Kafka in its Keystone pipeline. Keystone is a unified event collection, publishing, and routing infrastructure for both stream and batch processing.

Conclusion

Kafka supports low-latency message delivery and guarantees fault tolerance in the presence of machine failures. It can handle a large number of diverse consumers. Kafka is very fast: it has been benchmarked at around 2 million writes per second.



Written by 

Mohd Uzair is a software intern at Knoldus. He is passionate about Java programming. He is recognized as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner, curious to learn new technologies. His hobbies include watching movies, surfing YouTube, and playing video games.
