Apache Kafka Connect – Basic Introduction

Apache Kafka Connect is a tool for streaming data between Apache Kafka and other systems scalably and reliably. It makes it simple to define connectors that move large collections of data into and out of Kafka.

Kafka Connect can ingest an entire database, or collect metrics from application servers, into Kafka topics. This makes the data available for stream processing with low latency.

Kafka Connect Features

  • A common framework for Kafka connectors – It standardizes the integration of other data systems with Kafka and simplifies connector development, deployment, and management.
  • Distributed and standalone modes – It scales up to a large, centrally managed service supporting an entire organization, or down to development, testing, and small production deployments.
  • REST interface – We can submit and manage connectors on our Kafka Connect cluster through an easy-to-use REST API.
  • Automatic offset management – Kafka Connect can manage the offset commit process automatically with just a little information from connectors, so connector developers do not need to worry about this error-prone part of connector development.
  • Distributed and scalable by default – It builds upon Kafka's existing group management protocol. To scale up a Kafka Connect cluster, we add more workers.
  • Streaming/batch integration – Kafka Connect is well suited to bridging streaming and batch data systems.

Why Kafka Connect?

  • Auto-recovery after failure – A task can resume from the point where it failed.
  • Auto-failover – If one node fails, the work it was doing is redistributed to the other nodes.
  • Simple parallelism – A connector can split data import or export into several tasks that execute in parallel.
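
To make the failover idea concrete, here is a toy sketch in Python. It is not Kafka Connect's actual rebalance algorithm; it simply assumes a round-robin assignment, and the worker and task names are made up for illustration. It shows that when a worker disappears, every task still ends up assigned to a surviving worker.

```python
from collections import defaultdict


def assign_tasks(tasks, workers):
    """Spread tasks across workers round-robin (illustrative only)."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = defaultdict(list)
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return dict(assignment)


# Hypothetical task and worker names.
tasks = ["jdbc-source-0", "jdbc-source-1", "s3-sink-0", "s3-sink-1"]

# Two healthy workers share the tasks.
before = assign_tasks(tasks, ["worker-a", "worker-b"])

# worker-b fails: the same tasks are reassigned to the remaining worker.
after = assign_tasks(tasks, ["worker-a"])
```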

Kafka Connect Concepts

  • A Kafka Connect worker is an operating-system process (Java-based) that executes connectors and their associated tasks in child threads.
  • A connector is an object that defines the parameters for one or more tasks, which do the actual work of importing or exporting data.
  • A source connector generates tasks that read from some arbitrary input and write to Kafka.
  • A sink connector generates tasks that read from Kafka and write to some arbitrary output.
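
The relationship between a connector and its tasks can be sketched as follows. In the Java API a connector returns one configuration map per task from its `taskConfigs(int maxTasks)` method; the Python function below only imitates that idea. The `tables` key and the table names are hypothetical, and the way work is split here is an assumption, not how any particular connector does it.

```python
def task_configs(tables, max_tasks):
    """Split a list of tables into at most max_tasks task configurations."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more tasks than there are units of work.
    num_groups = min(len(tables), max_tasks)
    groups = [tables[i::num_groups] for i in range(num_groups)]
    # Each task gets its own small config describing its share of the work.
    return [{"tables": ",".join(group)} for group in groups]


# Three tables, but the connector is limited to two tasks.
configs = task_configs(["orders", "customers", "payments"], max_tasks=2)
```

Each dictionary in `configs` stands for the configuration one task would receive; the worker then runs those tasks in parallel threads.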

Dependencies of Kafka Connect

Kafka Connect nodes require a connection to a Kafka message-broker cluster, whether run in stand-alone or distributed mode.

For distributed mode, there are no other dependencies. Because even the connector configuration settings are stored in a Kafka topic, Kafka Connect nodes are completely stateless. This makes them well suited to running on container technology.

Standalone mode, however, needs a small amount of local disk storage to hold the "current location" (the offsets) and the connector configuration.

Distributed Mode

We start a Kafka Connect worker instance (i.e. a Java process) by giving it a Kafka broker address, the names of several Kafka topics for "internal use", and a "group-id" parameter.

Through the "internal use" Kafka topics, each worker instance coordinates with the other worker instances belonging to the same group-id. Everything is done via the Kafka message broker; no other external coordination mechanism is needed (no ZooKeeper, etc.).

The workers negotiate among themselves, through the Kafka brokers, how to distribute the set of connectors and tasks across the available set of workers.
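
As an illustration, the settings a distributed worker needs can be written out as a properties file. The sketch below builds one in Python. The property names are standard Kafka Connect worker settings, but every value (broker address, group id, topic names, converter choice) is a placeholder assumption to adapt to your own cluster.

```python
def render_properties(settings):
    """Render a dict as Java-style key=value properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


distributed_worker = {
    # Kafka broker(s) to connect to (placeholder address).
    "bootstrap.servers": "localhost:9092",
    # Workers sharing this group id form one Connect cluster.
    "group.id": "connect-cluster",
    # The "internal use" topics for configs, offsets, and status.
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
    # How record keys and values are serialized (one possible choice).
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}

worker_properties_text = render_properties(distributed_worker)
```

Saved to a file, text like this is what you would pass to the `bin/connect-distributed.sh` script shipped with Kafka. Starting a second worker with the same `group.id` adds it to the same cluster.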

Standalone Mode

Standalone mode is essentially distributed mode reduced to a single worker instance that uses no internal topics on the Kafka message broker. That one process runs all specified connectors and their generated tasks itself (as threads).

Standalone mode stores the current source offsets in a local file, so it does not use the Kafka Connect "internal topics" for storage. The connectors to execute are specified on the command line, as configuration files passed when the worker starts.
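
A minimal sketch of a standalone setup, again generated from Python: one worker file and one connector file, followed by the command line that ties them together. The property names are standard, and the connector settings mirror the file-source example shipped with Kafka. The file paths, the topic name, and the availability of the FileStream example connector on the worker's plugin path are assumptions.

```python
def to_properties(settings):
    """Render a dict as Java-style key=value properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


standalone_worker = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    # Local file that holds the source offsets instead of an internal topic.
    "offset.storage.file.filename": "/tmp/connect.offsets",
}

file_source = {
    "name": "local-file-source",
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "test.txt",        # input file to read (placeholder)
    "topic": "connect-test",   # Kafka topic to write to (placeholder)
}


def standalone_command(worker_file, *connector_files):
    """Build the argument list: worker config first, then connector configs."""
    return ["bin/connect-standalone.sh", worker_file, *connector_files]


command = standalone_command("worker.properties", "file-source.properties")
```

Writing `to_properties(standalone_worker)` and `to_properties(file_source)` to the two files named in `command` gives the inputs that the standalone script expects.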

REST API

Each worker instance starts an embedded web server, through which it exposes a REST API for status queries and configuration.

For workers in distributed mode, configuration uploaded via this REST API is saved in the internal Kafka topics. For workers in standalone mode, the configuration REST APIs matter much less, because the connectors are normally supplied on the command line.
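
Here is a hedged sketch of talking to that REST API from Python's standard library. It only builds the HTTP requests rather than sending them, so it runs without a cluster. The base URL assumes a worker listening on the default port 8083 on localhost, and the connector name and its settings are placeholders that assume the FileStream example sink is installed on the worker.

```python
import json
from urllib.request import Request

# Assumption: a Connect worker reachable on the default REST port.
CONNECT_URL = "http://localhost:8083"


def create_connector_request(name, config):
    """Build a POST /connectors request that submits a new connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return Request(
        f"{CONNECT_URL}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_request(name):
    """Build a GET /connectors/{name}/status request."""
    return Request(f"{CONNECT_URL}/connectors/{name}/status", method="GET")


create_req = create_connector_request(
    "local-file-sink",  # placeholder connector name
    {
        "connector.class": "FileStreamSink",
        "tasks.max": "1",
        "file": "test.sink.txt",
        "topics": "connect-test",
    },
)
status_req = status_request("local-file-sink")

# To send a request against a running worker:
#   from urllib.request import urlopen
#   with urlopen(create_req) as response:
#       print(response.status)
```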

Kafka Connector Types

  • JDBC
  • HDFS
  • S3
  • Elasticsearch

References:

Kafka Connect Official Documentation: https://docs.confluent.io/platform/current/connect/index.html

Written by 

Prakhar is a Software Consultant at Knoldus. He has completed his Masters of Computer Applications from Bharati Vidyapeeth Institute of Computer Applications and Management, Paschim Vihar. He likes problem solving and exploring new technologies.
