Hello! In this article we are going to talk about Kafka Connect: what it is, why we need it, and its main features.
What is Kafka Connect?
Kafka Connect is a pluggable, declarative data integration framework for Kafka. It connects data sources and destinations to Kafka, letting the rest of the ecosystem do what it does best. Because it is declarative, integrating Kafka with other systems takes only a few configuration properties. We can quickly define Kafka connectors that move large collections of data into and out of Kafka. Being part of the Apache Kafka family, it is inherently fault tolerant and scalable.
Kafka Connect can ingest entire databases or collect metrics from all of our application servers into Kafka topics, making the data available for low-latency stream processing. Data from Kafka topics can also be exported to secondary storage and query systems, as well as to batch systems for offline analysis. However, Kafka Connect is not a viable choice for large-scale data transformation. That said, recent versions of Kafka Connect allow a connector’s configuration parameters to define simple data transformations.
This feature assumes that, for “source” connectors, tasks convert their input into Avro or JSON format before publishing the record to a Kafka topic. Similarly, for “sink” connectors, it assumes that the data on the input Kafka topic is already in Avro or JSON format.
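To make "simple data transformations defined in configuration" more concrete, here is a minimal sketch that builds a source connector configuration as a Python dictionary and lists the transformation chain it declares. The connector name, file, topic, and static field values are illustrative placeholders. The `FileStreamSourceConnector`, `HoistField`, and `InsertField` classes ship with Apache Kafka, but the exact properties should be checked against the documentation for the version in use.

```python
import json

# Illustrative connector configuration; names and values are placeholders.
connector_config = {
    "name": "local-file-source",
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "test.txt",
    "topic": "connect-test",
    # Two transformations, applied in the order they are listed.
    "transforms": "MakeMap,InsertSource",
    # Wrap each raw line in a structure with a single field called "line".
    "transforms.MakeMap.type": "org.apache.kafka.connect.transforms.HoistField$Value",
    "transforms.MakeMap.field": "line",
    # Add a static field that records where the data came from.
    "transforms.InsertSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.InsertSource.static.field": "data_source",
    "transforms.InsertSource.static.value": "test-file-source",
}


def transform_chain(config):
    """Return (alias, class name) pairs in the order the config declares them."""
    aliases = [a.strip() for a in config.get("transforms", "").split(",") if a.strip()]
    return [(alias, config[f"transforms.{alias}.type"]) for alias in aliases]


print(json.dumps(connector_config, indent=2))
for alias, class_name in transform_chain(connector_config):
    print(alias, "->", class_name)
```

No transformation code is written here: the whole pipeline step is declared through configuration keys, which is what keeps these transformations simple.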
Why do we need Kafka Connect?
We have Apache Kafka, which is a distributed, scalable, and fault-tolerant streaming platform. It has numerous use cases, including distributed logging, stream processing, data integration, and pub/sub messaging. It provides its own APIs to produce data to and consume data from Kafka.
In most scenarios we have our own systems (databases, etc.) containing data that we want to put into or fetch from Kafka. We can write our own code to integrate these non-Kafka systems using the Producer and Consumer APIs provided by Apache Kafka.
The problem is that if we do this, we need to handle failures, retries, logging, serialization, data formats, and scaling ourselves. After we overcome all these challenges, the client may come to us with another non-Kafka system to integrate. Writing data integration code for each of those systems would have us writing the same boilerplate and unextracted framework code over and over again.
So, to handle these challenges, Kafka Connect comes to the rescue and lets developers focus only on the business logic. Its main features and building blocks are:
- As discussed above, Kafka Connect lets developers focus only on reading data from a source or writing it to a sink. It standardizes the integration of other data systems with Kafka.
- It provides a REST API through which we can submit and manage connectors on our Kafka Connect cluster.
- We can deploy in distributed or standalone mode. We can scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing, and small production deployments.
- Kafka Connect comes with offset commit management out of the box. Even with only a little information from connectors, Kafka Connect can manage the offset commit process automatically. As a result, connector developers don’t have to worry about this error-prone aspect of connector development.
- It provides the option to transform messages as they travel through the data pipeline, which means we can operate on the data while it is being written to or read from Kafka.
- Since it is part of the Kafka family, it builds on Kafka’s existing group management protocol. We can add more workers to a Kafka Connect cluster to scale it up.
- Connectors are responsible for the interaction between Kafka Connect and the external technology being integrated. Connectors fall into two categories: “Source Connectors”, which bring data from an external system into Kafka, and “Sink Connectors”, which deliver data from Kafka to an external system. Ready-to-use connectors are available on Confluent Hub, or we can write our own using the Connector API.
- Converters are responsible for the serialization and deserialization of data within Kafka Connect. The Converter interface provides support for translating between Kafka Connect’s runtime data format and raw bytes. Internally, this likely includes an intermediate step to the format used by the serialization layer.
- Transformations can optionally be applied to the data passing through Kafka Connect to manipulate it as required. Single Message Transforms (SMTs) can modify events before they are stored in Kafka: masking sensitive information, adding identifiers, tagging events, removing unnecessary columns, and more. SMTs can also modify events after they leave Kafka: routing high-priority events to a faster datastore, casting data types to match the destination, and more.
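As a hedged illustration of how the REST API and converters fit together, the sketch below builds (but does not send) a request that would create a sink connector. The base URL assumes a worker listening on the default REST port 8083 on localhost; the connector name, topic, and file path are hypothetical. `FileStreamSinkConnector` and `JsonConverter` ship with Apache Kafka.

```python
import json
import urllib.request


def build_create_request(base_url, name, config):
    """Build (without sending) a POST /connectors request for a Connect worker."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sink connector: writes records from a topic to a local file.
sink_config = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "demo-topic",
    "file": "/tmp/demo-sink.txt",
    # Converter settings: how record values are deserialized from Kafka.
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}

request = build_create_request("http://localhost:8083", "demo-file-sink", sink_config)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To submit it to a running worker, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running cluster.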
Kafka Connect Cluster
A Kafka Connect cluster has one (standalone) or more (distributed) worker processes running on one or multiple servers, with the work distributed among the available processes.
In standalone mode, the Kafka Connect worker uses file storage for its state, and connectors are created from local configuration files. This means that we can neither scale for throughput nor get fault-tolerant behavior. Standalone mode is therefore appropriate when we have a connector that needs to execute with server locality, for example, reading from files on a particular machine or ingesting data sent to a network port at a fixed address.
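A minimal sketch of a standalone worker configuration is shown below, rendered as properties-file text from Python. The broker address and the offsets file path are assumptions chosen for illustration; the property names are standard Kafka Connect worker settings.

```python
# Illustrative standalone worker settings; the values are placeholders.
worker_properties = {
    "bootstrap.servers": "localhost:9092",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    # Standalone mode keeps source offsets in a local file rather than in Kafka.
    "offset.storage.file.filename": "/tmp/connect.offsets",
}


def to_properties(settings):
    """Render a dictionary as Java-style key=value properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


print(to_properties(worker_properties))
```

Saved as a file such as `worker.properties`, this text would be passed, together with one or more connector properties files, to the `connect-standalone.sh` script in the Kafka distribution.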
In distributed mode, Kafka Connect uses Kafka topics to store its state, such as connector configuration and connector status. These topics are configured as compacted so that the information is retained indefinitely. Connectors are created and managed through the Kafka Connect REST API. Because the state lives in Kafka, we can add workers easily: they read everything they need from Kafka. When we add workers to a Kafka Connect cluster, the tasks are rebalanced across the available workers to distribute the workload. If we decide to scale down our cluster, Kafka Connect rebalances again to ensure that all the connector tasks are still executed.
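The effect of rebalancing can be illustrated with a toy model. The sketch below is not Kafka Connect's actual assignment logic, which is more involved; it is a simple round-robin assignment, with hypothetical task and worker names, that shows how the same set of tasks is redistributed when workers are added or removed.

```python
def assign_round_robin(tasks, workers):
    """Spread tasks across workers as evenly as possible (simplified model)."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


# Hypothetical connector with six tasks.
tasks = [f"demo-connector-task-{i}" for i in range(6)]

scenarios = {
    "two workers": ["worker-1", "worker-2"],
    "scaled up to three": ["worker-1", "worker-2", "worker-3"],
    "scaled down to one": ["worker-1"],
}

for label, workers in scenarios.items():
    print(label)
    for worker, assigned in assign_round_robin(tasks, workers).items():
        print("  ", worker, "runs", len(assigned), "task(s)")
```

In every scenario all six tasks remain assigned; only the number of tasks per worker changes.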
We’ve seen how Kafka Connect integrates with other systems while providing built-in fault tolerance and scalability. Next, we’ll move on to the architecture and internals of Kafka Connect. Thank you!