Kafka Connect is not just a free, open-source component of Apache Kafka; it also works as a centralized data hub for simple data integration between databases, key-value stores, and other systems. Moreover, it is a framework for streaming data into and out of Apache Kafka, and the Confluent Platform comes with many built-in connectors used for streaming data to and from different data sources. The fundamental components include:
- Connectors
- Tasks
- Workers
- Converters
- Transforms
- Dead Letter Queue
Connectors
Connectors coordinate and manage the copying of data between Kafka and other systems. A connector defines where data should be copied to and from. A connector instance is then created, which is responsible for managing the copying of data between Kafka and another system. The classes that a connector implements or uses are defined in a connector plugin.
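As an illustration, a connector instance is typically defined by a small configuration. The sketch below uses the FileStreamSource connector that ships with Apache Kafka; the file path and topic name are placeholder values.

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "connect-file-topic"
  }
}
```

Here `connector.class` names the plugin to load, while `tasks.max` caps how many tasks the connector instance may spawn.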
Tasks
Tasks are responsible for the actual copying of data and are the main actors in the Connect data model. A connector instance coordinates a set of tasks, and those tasks do the copying. With very little configuration, Kafka Connect thereby provides built-in support for parallel, scalable data copying.
Workers
Connectors and tasks are logical units of work, so they must be scheduled to execute in a process. These processes are known as workers, and there are two types:
In standalone mode, a single worker process is responsible for executing all connectors and tasks. Because it is a single process, the required configuration is minimal; for the same reason, however, its functionality and scalability are limited.
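A standalone worker is started with a single properties file. A minimal sketch (the broker address and offset file path are placeholders):

```properties
# connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Standalone mode stores source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets
```

The worker is then launched with `bin/connect-standalone.sh connect-standalone.properties <connector>.properties`, passing the connector configurations on the command line.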
In contrast, distributed mode provides scalability and automatic fault tolerance for Kafka Connect. Nodes can be added or removed as required. You first start the worker processes with the same group.id; they then automatically coordinate to schedule the execution of connectors and tasks across all available workers.
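In distributed mode, the worker configuration additionally names the shared group and the internal topics in which Connect stores connector configs, offsets, and statuses. The topic names below are the conventional defaults; the broker address is a placeholder:

```properties
# connect-distributed.properties
bootstrap.servers=localhost:9092
# All workers sharing this id form one Connect cluster
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```

Every worker started with the same `group.id` joins the same Connect cluster and takes part in rebalancing connectors and tasks when nodes come and go.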
Converters
Converters are used by tasks to change the format of data from bytes to Connect's internal data format and vice versa; in other words, they are responsible for serializing and deserializing the data.
- When Kafka Connect acts as a source, the converter serializes the data received from the connector and then pushes the serialized data into the Kafka cluster.
- When Kafka Connect acts as a sink, the converter deserializes the data read from the Kafka cluster and sends it to the connector.
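Converters are configured by naming converter classes, either per worker or overridden per connector. In the sketch below, the JSON converter ships with Apache Kafka, while the Avro converter is part of the Confluent Platform and assumes a Schema Registry at the placeholder URL:

```properties
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```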
Transforms
The main purpose of a transform is to alter the data to make it simpler and more lightweight. Transforms are convenient for minor data adjustments and event routing. A transform accepts one record as input and outputs a modified record.
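As a sketch, a single message transform (SMT) is configured on a connector by naming it in `transforms` and then setting its properties under that alias. The RegexRouter SMT below ships with Apache Kafka and rewrites the topic name of each record; the prefix is a placeholder:

```properties
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=(.*)
transforms.route.replacement=copied-$1
```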
Dead Letter Queue
An invalid record may occur for multiple reasons. The most common are serialization and deserialization (serde) errors. For example, an error occurs when a record arrives at the sink connector in JSON format, but the sink connector configuration expects another format, like Avro. When a serde error occurs, the connector does not stop; rather, it continues to process records and sends the errors to a Dead Letter Queue. You can use the record headers in a Dead Letter Queue topic record to identify and correct an error when it occurs.
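Dead Letter Queue routing is enabled on a sink connector through its error-handling properties; the topic name below is a placeholder:

```properties
errors.tolerance=all
errors.deadletterqueue.topic.name=my-connector-dlq
errors.deadletterqueue.context.headers.enable=true
```

With `errors.deadletterqueue.context.headers.enable=true`, each record written to the Dead Letter Queue topic carries headers describing the failure, which is what lets you identify and correct the error afterwards.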
The Purpose of Kafka Connect
It consists of two types of connectors:
- Source Connector
- Sink Connector
A Source Connector's purpose is to pull data from a data source and publish it to the Kafka cluster. To achieve this, a source connector internally uses the Kafka Producer API.
A Sink Connector's purpose is to consume data from the Kafka cluster and sync it to the target data source. This is achieved by internally using the Kafka Consumer API.
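To illustrate, a sink connector configuration mirrors the source side. The sketch below uses the FileStreamSink connector bundled with Apache Kafka, with placeholder topic and file names:

```json
{
  "name": "local-file-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "connect-file-topic",
    "file": "/tmp/output.txt"
  }
}
```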
Use Cases of Kafka Connect
- Streaming pipelines from a source system to a target system.
- Writing data from Kafka into the data stores used by applications.
- Moving data from legacy applications into new systems via Kafka.
Architecture of Kafka Connect
- Sources: databases (via JDBC), MongoDB, Redis, Solr, etc., whose data we want to copy to the Kafka cluster.
- Between the source data and the Kafka cluster sits a Kafka Connect cluster, which is made up of multiple Kafka Connect workers on which connectors and tasks run. The tasks pull data from the sources and push it safely to the Kafka cluster.
- We can also send data from the Kafka cluster to any sink: Amazon S3, Cassandra, Redis, MongoDB, HDFS, etc. The tasks pull data from the Kafka cluster and write it to the sinks.
Features of Kafka Connect
- Common framework: Kafka Connect allows other systems to be integrated with Kafka and therefore works as a common framework for connectors, which makes connector development, deployment, and management simple.
- Can work in standalone or distributed mode: It can either scale up to provide a centrally managed service for an entire organization or scale down for development, testing, and small production deployments.
- REST interface: Kafka connectors are submitted to and managed in a Kafka Connect cluster through a REST API.
- Manages offsets automatically: It automatically manages the offset commit process, requiring only a little information from the connectors.
- Distributed as well as scalable: By default, it is distributed and scalable; the number of workers can be increased to scale up a Kafka Connect cluster.
- Streaming or batch integration: Kafka Connect provides the solution to bridge the streaming and batch systems.
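The REST interface mentioned above can be sketched with only the Python standard library. A Connect worker listens on port 8083 by default and accepts connector definitions via `POST /connectors`; the snippet below only builds the request rather than sending it, and the connector name, topic, and file path are placeholder values.

```python
import json
import urllib.request

# Hypothetical worker address; the Kafka Connect REST API listens on
# port 8083 by default.
connect_url = "http://localhost:8083/connectors"

# Connector definition to submit; the FileStreamSink connector class
# ships with Apache Kafka.
payload = {
    "name": "demo-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "connect-file-topic",
        "file": "/tmp/output.txt",
    },
}

# Build the HTTP request; urllib.request.urlopen(req) would submit it
# to a running worker.
req = urllib.request.Request(
    connect_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)  # → POST http://localhost:8083/connectors
```

The same API exposes endpoints such as `GET /connectors` to list connectors and `GET /connectors/<name>/status` to inspect a connector's state.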