In this blog, we will take a deep dive into Kafka Connect and cover some of its fundamentals.
Apache Kafka is one of the central technologies in enterprise architectures today. More than 60% of Fortune 500 companies are using Kafka. The technology has evolved considerably over the last decade, from a pub/sub system to a complete event streaming platform. The very first requirement for working in an event-driven system is to ingest data/events. For this reason, Kafka Connect was introduced into the Kafka ecosystem in 2015. It enables us to integrate external systems without writing a single line of code (if a connector already exists).
What is Kafka Connect?
Kafka Connect is a framework for connecting Kafka with external systems, including databases. Above all, a Kafka Connect cluster is a separate cluster from the Kafka cluster. In addition, the Kafka Connect cluster supports running and scaling out connectors (components that support reading and/or writing between external systems). It is built as another layer on the core platform of Apache Kafka® to support large-scale streaming data. It works in two modes:
- Import from any external system (called a Source) like MySQL, HDFS, etc., into the Kafka broker cluster
- Export from the Kafka cluster to any external system (called a Sink) like HDFS, S3, etc.
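To make the source side concrete, here is a minimal sketch of a source connector configuration, using the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic name are illustrative assumptions, not values from this blog:

```properties
# file-source.properties — minimal sketch of a source connector config
# (illustrative values; adjust name, file, and topic for your setup)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# source file whose lines are ingested (assumed path)
file=/tmp/input.txt
# Kafka topic the ingested records are written to (assumed name)
topic=connect-test
```

A sink connector is configured the same way, with a sink `connector.class` and a `topics` setting naming the topics to export.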
Why Kafka Connect?
- In an ETL pipeline, it takes care of the E(Extract) and L(Load) parts irrespective of the Processing engine.
- Kafka connectors are ready-to-use components that help us import data from external systems into Kafka topics and export data from Kafka topics into external systems.
- By using it we can reuse a piece of code for different processing engines.
- It reduces the effort of implementing any stream processing framework by taking on the responsibility of data ingestion.
- It ensures Decoupling and Reusability.
Architecture of Kafka Connect
The above diagram shows the architecture of Kafka Connect. Let us discuss it:
- It is a separate Cluster.
- Each Worker contains one or many Connector Tasks.
- A cluster can have multiple workers, and workers run only within the cluster.
- Tasks are automatically load-balanced if there is any failure as shown in the picture below.
- Tasks in Kafka Connect act as producers or consumers, depending on the type of connector.
Modes of operation
Basically, there are two modes:
- Standalone Mode: In standalone mode, a single process executes all connectors and their associated tasks. It is easy to test Kafka Connect in standalone mode. However, there is no automated fault tolerance out of the box when a connector goes offline.
- Distributed Mode: Distributed mode runs Connect workers on one or more nodes. Even when running on multiple nodes, the coordination required to work in parallel does not need an orchestration manager such as YARN; the workers coordinate among themselves through Kafka itself.
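In distributed mode, workers sharing the same `group.id` form one Connect cluster, and Connect stores connector configs, offsets, and statuses in internal Kafka topics. As a rough sketch, a distributed worker configuration might look like the following (the hostnames, group id, and topic names are illustrative assumptions):

```properties
# connect-distributed.properties — sketch of a distributed worker config
bootstrap.servers=localhost:9092
# workers with the same group.id form one Connect cluster
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# internal topics where Connect stores connector configs, offsets, and statuses
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```

Because this state lives in Kafka topics rather than on any one node, a failed worker's tasks can be reassigned to the remaining workers.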
Installing the Connectors
- Download the JAR file (usually from Confluent Hub, or one you have built yourself)
- Place it in a folder on your Kafka Connect worker
- Locate your Kafka Connect worker’s configuration (.properties) file, and open it in an editor
- Search for the plugin.path setting, and amend or create it to include the folder(s) in which your connectors reside
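Following the steps above, the relevant line in the worker's `.properties` file might look like this (the folder paths are assumptions for illustration):

```properties
# comma-separated list of folders that Kafka Connect scans for connector plugins
plugin.path=/usr/share/java,/opt/connectors
```

After amending `plugin.path`, restart the Connect worker so it picks up the newly installed connectors.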
In conclusion, in this blog we have learned about Kafka Connect and its connectors, discussed the architecture, and seen how to install connectors. I will cover more topics in further blogs. Happy learning 🙂
For more, you can refer to the Kafka documentation: https://kafka.apache.org/documentation/
For more technical blogs, you can refer to the Knoldus blog: https://blog.knoldus.com/