Kafka Connect is a framework for connecting Kafka with external systems such as file systems and databases, using Kafka connectors. A Kafka Connect cluster supports running and scaling out connectors.
Kafka connectors are ready-to-use components that help us import data from external systems into Kafka topics and export data from Kafka topics into external systems.
What is Kafka Connect?
Kafka Connect is used to perform streaming integration between Kafka and other systems such as databases, cloud services, search indexes, file systems, and key-value stores. It makes it easy to stream data from various sources into Kafka, and from Kafka out to targets.
Various connectors are available for Kafka Connect, for example:
- RDBMS (Oracle, SQL Server, DB2, Postgres, MySQL)
- Cloud Object stores (Amazon S3, Azure Blob Storage, Google Cloud Storage)
- Message queues (ActiveMQ, IBM MQ, RabbitMQ)
- NoSQL and document stores (Elasticsearch, MongoDB, Cassandra)
- Cloud data warehouses (Snowflake, Google BigQuery, Amazon Redshift)
Features of Kafka Connect
- It simplifies the development, deployment, and management of connectors that link external systems with Kafka.
- It can run as a distributed cluster, which scales to large deployments, or in standalone mode, which is convenient for development and testing.
- Connectors can be managed using a REST API.
- Kafka Connect handles the offset commit process for us.
- Kafka Connect is distributed and scalable by default.
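As a sketch of managing connectors over the REST API, the request below builds the JSON payload for the file source connector discussed later in this post. The payload shape and endpoints follow the standard Connect REST API; the worker address (localhost:8083) is an assumption about your deployment, so the actual HTTP calls are shown commented out:

```shell
# Connector configuration as the JSON payload expected by the Connect REST API.
# Name, file, and topic mirror the file-source example used in this post.
PAYLOAD='{
  "name": "local-file-source",
  "config": {
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "topic": "connect-test",
    "file": "test.txt"
  }
}'

echo "$PAYLOAD"

# Against a running distributed worker (assumed to listen on localhost:8083):
# curl -X POST -H "Content-Type: application/json" \
#      --data "$PAYLOAD" http://localhost:8083/connectors
# curl http://localhost:8083/connectors                          # list connectors
# curl -X DELETE http://localhost:8083/connectors/local-file-source
```

In standalone mode the same configuration is usually supplied as a properties file instead, as shown in the next section.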
Configuration of Kafka Connect
Let us discuss the principles of Kafka Connect using the file source connector and the file sink connector. Conveniently, Confluent Platform ships with both of these connectors, as well as reference configurations.
Source Connector Configuration:
For the source connector, the configuration is available at connect-file-source.properties:
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
topic=connect-test
file=test.txt
name is the user-specified name for the connector.
connector.class specifies the kind of connector.
tasks.max specifies how many instances of our source connector should run in parallel.
topic defines the topic to which the connector sends its output.
file defines the file from which the connector reads its input.
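For experimentation, the same configuration can be written out as a properties file and sanity-checked; the filename matches the reference configuration above, while the working directory is an assumption:

```shell
# Recreate the source-connector configuration shown above.
cat > connect-file-source.properties <<'EOF'
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
topic=connect-test
file=test.txt
EOF

# Sanity-check: count the key=value lines we just wrote.
grep -c '=' connect-file-source.properties   # 5
```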
Sink Connector Configuration
For the sink connector, the configuration is available at connect-file-sink.properties:
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test
Finally, we configure the worker, which integrates our two connectors and does the work of reading from the source connector and writing to the sink connector. For that we will use connect-standalone.properties:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/share/java
- bootstrap.servers contains the addresses of the Kafka brokers
- key.converter and value.converter are converter classes, which serialize and deserialize the data as it flows from the source into Kafka and then from Kafka to the sink
- key.converter.schemas.enable and value.converter.schemas.enable are converter-specific settings
- offset.storage.file.filename defines where Connect should store its offset data
- offset.flush.interval.ms defines the interval at which the worker tries to commit offsets for tasks
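Putting the three files together, a standalone run looks roughly like the sketch below. Only the first step is self-contained; the worker command assumes a running Kafka broker and a Kafka/Confluent installation on the PATH (the script name varies by distribution), so it is shown commented out:

```shell
# Seed the source file that the FileStreamSource connector will tail.
echo "hello kafka connect" > test.txt

# With Kafka running and the three properties files in place, start a
# standalone worker that runs both connectors:
# connect-standalone connect-standalone.properties \
#     connect-file-source.properties connect-file-sink.properties

# After a few seconds, the sink connector should have copied the line
# through the connect-test topic into the sink file:
# cat test.sink.txt
```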
How Kafka Connect Works
Kafka Connect runs in its own process, separate from the Kafka brokers. It is scalable and fault tolerant, and can run either distributed or standalone. It does not require programming: it is driven by configuration (properties files, or JSON via the REST API), which makes it accessible to a wide range of users. Kafka Connect can also perform lightweight transformations on the data as it passes through.
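These lightweight transformations are configured as Single Message Transforms (SMTs). As a sketch, the built-in InsertField transform can stamp every record with a static field by adding a few lines to a connector's configuration; the transform alias and field name below are illustrative choices:

```properties
# Append a static field to each record value as it passes through Connect.
transforms=addSource
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=file-connector
```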
Kafka Connect Use Cases
Kafka Connect should be used whenever we want to stream data into Kafka from another system, or stream data from Kafka to elsewhere. A few ways to use Kafka Connect are:
Streaming Data Pipelines
Kafka Connect can be used to ingest real-time streams of events from a source such as a database and stream them to a target system for analytics. Because Kafka retains data for a configurable period per topic, the same original data can be streamed to multiple targets. This could be to use different technologies for different business requirements, or to make the same data available to different areas of a business that each have their own systems in which to hold it.
Writing to Datastores from an Application
In your application, you may create data that you want to write to a target system. This could be a series of logging events to write to a document store, or data to persist to a relational database. By writing the data to Kafka and letting Kafka Connect take responsibility for delivering it to the target, you decouple the application from its target systems and simplify its footprint.
In this blog, we touched on Kafka Connect, its features, its configuration, and its use cases. We will learn more in further blogs. Thank you for reading.
For more, you can refer to: https://kafka.apache.org/documentation/
Also, for more technical blogs, you can refer to the Knoldus blog: https://blog.knoldus.com/