Let us discuss Kafka Connect and some of its fundamentals. Before starting, you should have a basic knowledge of Kafka, or you can go through this document.
Apache Kafka is a distributed, resilient, fault-tolerant platform and a well-known name in the world of Big Data. It is one of the most widely used distributed streaming platforms. Kafka is not just a message queue but a full-fledged event streaming platform.
It is a framework for storing, reading, and analyzing streaming data: a durable publish-subscribe messaging system that exchanges data between processes, applications, and servers.
Table of contents
- What is Kafka Connect?
- Architecture of Kafka Connect
- Connectors and Tasks
- Sources and Sinks
- Standalone vs Distributed Mode
What is Kafka Connect?
Apache Kafka is a distributed streaming platform, and Kafka Connect is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems, using so-called connectors. Kafka Connect only copies streaming data into and out of Kafka, so its scope is deliberately narrow. It can run as a standalone process for development and testing, or as a distributed, scalable service supporting an entire organization.
Kafka Connect makes it much easier to connect Kafka to other systems, without having to write all the glue code yourself.
Common Kafka use cases:
|Source -> Kafka|Producer API|Kafka Connect Source|
|Kafka <-> Kafka|Consumer API, Producer API|Kafka Streams|
|Kafka -> Sink|Consumer API|Kafka Connect Sink|
Architecture of Kafka Connect
- Kafka Connect runs as its own cluster, separate from the Kafka brokers.
- Each worker runs one or more connector tasks.
- A Connect cluster can have multiple workers, and workers run only on that cluster.
- If a worker fails, its tasks are automatically rebalanced across the remaining workers, as shown in the picture below.
- Tasks in Kafka Connect act as producers or consumers, depending on the type of connector.
- A Kafka Connect cluster can have multiple connectors loaded.
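The failover behaviour described in this list can be pictured with a toy model. The sketch below is plain Python, not the Kafka Connect API: the worker and task names are made up, and the round-robin reassignment is an assumption chosen for simplicity rather than the assignment strategy Connect actually uses.

```python
from collections import defaultdict


def assign_tasks(workers, tasks):
    """Spread tasks over workers round-robin (illustrative strategy only)."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = defaultdict(list)
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return dict(assignment)


# Hypothetical cluster: three workers sharing four tasks.
workers = ["worker-1", "worker-2", "worker-3"]
tasks = ["jdbc-source-0", "jdbc-source-1", "es-sink-0", "es-sink-1"]

before = assign_tasks(workers, tasks)

# worker-2 crashes: the surviving workers take over all tasks.
survivors = [w for w in workers if w != "worker-2"]
after = assign_tasks(survivors, tasks)

print("before failure:", before)
print("after failure: ", after)
```

The point of the model is only that no task is lost when a worker disappears: every task still has an owner after the reassignment.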
Connectors and Tasks
Connectors are responsible for managing the tasks that will run. They decide how the data is split between tasks and provide each task with the specific configuration it needs to do its job.
Tasks are responsible for getting data into and out of Kafka. They get their context from the worker. Once initialized, they are started with a properties object containing the connector's configuration. Once started, a source task polls an external system and returns a list of records, which the worker then sends to a Kafka broker.
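To make the division of labour concrete, here is a minimal sketch of how a connector might split its work into per-task configurations. It is loosely modelled on the `taskConfigs(maxTasks)` method of the Java connector API, but it is plain Python, and the configuration keys (`connection.url`, `tables`) and the table names are hypothetical, chosen only for illustration.

```python
def task_configs(connector_config, max_tasks):
    """Split a connector's tables into at most max_tasks task configurations."""
    tables = connector_config["tables"]
    num_groups = min(max_tasks, len(tables))
    groups = [[] for _ in range(num_groups)]
    for index, table in enumerate(tables):
        groups[index % num_groups].append(table)
    # Each task receives the shared settings plus its own slice of tables.
    return [
        {
            "connection.url": connector_config["connection.url"],
            "tables": ",".join(group),
        }
        for group in groups
    ]


# Hypothetical connector configuration with three tables to copy.
connector_config = {
    "connection.url": "jdbc:postgresql://db.example/shop",
    "tables": ["orders", "customers", "payments"],
}

configs = task_configs(connector_config, max_tasks=2)
for number, config in enumerate(configs):
    print(f"task {number}: {config}")
```

With three tables and a limit of two tasks, one task gets two tables and the other gets one. Asking for more tasks than there are tables yields only as many tasks as there is work to hand out.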
Sources and Sinks
Kafka Connect focuses on streaming data to and from Kafka. Depending on the direction in which the data moves, a connector is classified as:
Source connector – Ingests entire databases and streams table updates to Kafka topics. A source connector can also collect metrics from all your application servers and store these in Kafka topics, making the data available for stream processing with low latency.
Sink connector – Delivers data from Kafka topics into secondary indexes such as Elasticsearch, or batch systems such as Hadoop for offline analysis.
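The two directions can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not real connector code: the class names are invented, a Python list stands in for a Kafka topic, and only the method names `poll` and `put` mirror the source-task and sink-task methods of the Java API.

```python
class ListSourceTask:
    """Toy source task: poll() returns rows not yet handed to the worker."""

    def __init__(self, rows, batch_size=2):
        self.rows = rows
        self.batch_size = batch_size
        self.position = 0  # stand-in for the source offset

    def poll(self):
        batch = self.rows[self.position:self.position + self.batch_size]
        self.position += len(batch)
        return batch


class DictSinkTask:
    """Toy sink task: put() writes records into an in-memory 'index'."""

    def __init__(self):
        self.index = {}

    def put(self, records):
        for record in records:
            self.index[record["id"]] = record


topic = []  # stand-in for a Kafka topic
source = ListSourceTask([
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
])
sink = DictSinkTask()

# Source side: the worker polls the task until the source is drained.
while True:
    records = source.poll()
    if not records:
        break
    topic.extend(records)

# Sink side: the worker hands the topic's records to the sink task.
sink.put(topic)

print(len(topic), "records in topic;", len(sink.index), "records in sink")
```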
Tasks are executed by Kafka Connect workers:
- A worker is a single Java process
- Workers run Connectors (each connector is class inside a
- A Worker can run in standalone mode or distributed mode
- If a worker crashes, a rebalance occurs (the heartbeat mechanism of the Kafka consumer protocol is used here)
- If a worker joins a Connect cluster, the other workers notice and connectors or tasks are assigned to the new worker in order to balance the cluster. To join a cluster, a worker must have the same
Standalone vs Distributed Mode
Standalone mode
- A single process runs both connectors and tasks.
- Configuration use
- Very easy to get started with; useful for development and testing.
- Not fault tolerant, not scalable, and hard to monitor.
Distributed mode
- Multiple workers run connectors and tasks.
- Configuration is submitted through a REST API.
- Easy to scale and fault tolerant (tasks are rebalanced if a worker dies).
- Useful for production deployments of connectors.
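As a rough illustration of configuring a connector over REST, the sketch below builds the request that creates a connector on a distributed worker. It assumes a worker reachable at `localhost:8083` (8083 is the default REST port) and uses the `FileStreamSourceConnector` example class that ships with Kafka, which may need to be added to the plugin path in recent releases. The connector name, file path, and topic are hypothetical. The request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
from urllib.request import Request

# Assumed worker address; adjust host and port for a real cluster.
CONNECT_URL = "http://localhost:8083/connectors"

payload = {
    "name": "local-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",   # hypothetical input file
        "topic": "connect-demo",    # hypothetical target topic
    },
}

request = Request(
    CONNECT_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left out so the sketch runs without a cluster:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
print(request.get_method(), request.full_url)
```

In standalone mode the same key-value pairs would normally be supplied to the worker at start-up instead of being posted to it.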
Kafka Connect features include:
- Common framework for Kafka connectors – makes connector deployment easy.
- REST interface – connectors can be managed through a REST API.
- Automatic offset management – Kafka Connect handles the offset commit process, which saves us from implementing this error-prone part of connector development by hand.
- Distributed and standalone modes – scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing, and small production deployments.
- Distributed and scalable by default – it builds on Kafka's existing group management protocol, and a Kafka Connect cluster can be scaled up by adding more workers.
- Streaming/batch integration – together with Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems.
- Transformations – these allow simple, lightweight modifications to individual messages.
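The idea behind transformations, a chain of small functions applied to one message at a time, can be sketched as follows. This is a plain-Python illustration and not the Connect transformation API: the record layout, the helper names, and the field names are all assumptions made for the example.

```python
def mask_field(field):
    """Return a transform that blanks out one field of the record value."""
    def transform(record):
        value = dict(record["value"])
        if field in value:
            value[field] = ""
        return {**record, "value": value}
    return transform


def insert_static_field(name, static_value):
    """Return a transform that adds a constant field to the record value."""
    def transform(record):
        return {**record, "value": {**record["value"], name: static_value}}
    return transform


def apply_chain(record, transforms):
    """Apply each transform in order, one message at a time."""
    for transform in transforms:
        record = transform(record)
    return record


# Hypothetical record and transformation chain.
record = {"topic": "customers", "value": {"id": 7, "email": "a@example.com"}}
chain = [mask_field("email"), insert_static_field("source", "crm")]

result = apply_chain(record, chain)
print(result)
```

Each transform returns a new record rather than mutating its input, which keeps the steps independent and easy to reorder.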
If you don't want to use Kafka Connect to integrate Kafka with your other applications and databases, you can write your own code using the Producer and Consumer APIs, or use the Kafka Streams API.
You could also use an integration framework that supports Kafka, such as Apache Camel or Spring Integration.
In conclusion, this blog covered the basics of Kafka Connect: its features, use cases, and architecture. In the next blog, we will see how to set up and launch a Kafka connector.
If you want to know more about Apache Kafka, Streams and Connect, then I recommend these articles: