Kafka Streams is a client library for building applications whose input and output data are stored in an Apache Kafka cluster. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology.
When working with Kafka Streams, there are times when the stream processing application needs to integrate with data that lives outside the stream. For instance, an application that counts clicks per user may also need profile information about the users who clicked.
Approach #1: External lookup
An obvious solution is to look up the user in the external database for every click event in the stream, and then write a new event to another topic that contains both the original click and the user information.
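The per-event lookup can be sketched in plain Java as follows. This is only an illustration, not Kafka Streams code: the `Click`, `User`, and `EnrichedClick` types and the `lookup` function are hypothetical names introduced here, and the in-memory map in `main` stands in for the external database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class ExternalLookupSketch {
    // Hypothetical event and table shapes, for illustration only.
    record Click(String userId, String page) {}
    record User(String userId, String name) {}
    record EnrichedClick(Click click, User user) {}

    // Enrich every click with exactly one call to the external store.
    static List<EnrichedClick> enrich(List<Click> clicks, Function<String, User> lookup) {
        List<EnrichedClick> out = new ArrayList<>();
        for (Click click : clicks) {
            User user = lookup.apply(click.userId()); // one round trip per event
            out.add(new EnrichedClick(click, user));
        }
        return out;
    }

    public static void main(String[] args) {
        // An in-memory map stands in for the external database.
        Map<String, User> fakeDb = Map.of("u1", new User("u1", "Alice"));
        List<Click> clicks = List.of(new Click("u1", "/home"), new Click("u1", "/cart"));
        enrich(clicks, fakeDb::get).forEach(System.out::println);
    }
}
```

Note that two clicks by the same user still cost two lookups; nothing is reused between events.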
Problem with this approach
The external lookup will add a significant amount of latency to the processing of every input click event.
The additional load on the external database may also be unacceptable: a stream-processing system can often handle 100K-500K events per second, whereas a database may sustain only around 10K lookups per second at reasonable performance. This solution may therefore not be feasible.
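To make the mismatch concrete, here is a small back-of-the-envelope calculation in Java. The rates are the rough figures quoted above, taken as assumptions rather than measurements, and the class and method names are invented for this sketch.

```java
final class LookupLoadEstimate {
    // Fraction of incoming events the database can serve, capped at 1.0.
    static double servedFraction(long eventsPerSec, long dbLookupsPerSec) {
        return Math.min(1.0, (double) dbLookupsPerSec / eventsPerSec);
    }

    // Events per second that cannot be looked up and so accumulate as lag.
    static long backlogGrowthPerSec(long eventsPerSec, long dbLookupsPerSec) {
        return Math.max(0, eventsPerSec - dbLookupsPerSec);
    }

    public static void main(String[] args) {
        long dbLookupsPerSec = 10_000; // assumed database capacity
        for (long eventsPerSec : new long[] {100_000, 500_000}) {
            System.out.printf("%d events/s: %.0f%% served, backlog grows by %d events/s%n",
                    eventsPerSec,
                    100 * servedFraction(eventsPerSec, dbLookupsPerSec),
                    backlogGrowthPerSec(eventsPerSec, dbLookupsPerSec));
        }
    }
}
```

Under these assumptions the database can keep up with at most a tenth of the stream, so the application either falls behind or has to be throttled to the database's pace.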
Approach #2: Cache in the Application
Another approach is to cache the information from the database inside our stream processing application. The challenge with this approach is keeping the cached data up to date.
Problem with this approach
The information in the cache can become stale.
If we refresh the cache too often, we are still hammering the database and the cache does not help much. If we wait too long between refreshes, we may end up doing stream processing with stale information.
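The trade-off can be seen in a minimal time-to-live (TTL) cache, sketched below in plain Java. The `TtlUserCache` class, its `loader` function (standing in for the database query), and the injected clock are all assumptions made for this illustration; a real application would more likely use an existing caching library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class TtlUserCache {
    private record Entry(String value, long loadedAtMs) {}

    private final Map<String, Entry> entries = new HashMap<>();
    private final Function<String, String> loader; // stands in for the database query
    private final LongSupplier clock;              // injected so time can be controlled
    private final long ttlMs;

    TtlUserCache(Function<String, String> loader, LongSupplier clock, long ttlMs) {
        this.loader = loader;
        this.clock = clock;
        this.ttlMs = ttlMs;
    }

    // Return the cached value, reloading it only when missing or older than the TTL.
    String get(String userId) {
        long now = clock.getAsLong();
        Entry entry = entries.get(userId);
        if (entry == null || now - entry.loadedAtMs() >= ttlMs) {
            entry = new Entry(loader.apply(userId), now); // miss or expired: hit the database
            entries.put(userId, entry);
        }
        return entry.value();
    }

    public static void main(String[] args) {
        Map<String, String> fakeDb = new HashMap<>(Map.of("u1", "Alice"));
        long[] nowMs = {0};
        TtlUserCache cache = new TtlUserCache(fakeDb::get, () -> nowMs[0], 1_000);
        System.out.println(cache.get("u1")); // loaded from the database
        fakeDb.put("u1", "Alicia");          // the row changes in the database
        nowMs[0] = 500;
        System.out.println(cache.get("u1")); // still within the TTL: old value is served
        nowMs[0] = 1_000;
        System.out.println(cache.get("u1")); // TTL elapsed: value is reloaded
    }
}
```

A small `ttlMs` pushes the behaviour back toward a lookup per event, while a large one widens the window in which stale values are served.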
Approach #3: Using Change Data Capture
We can capture all changes to the database table as a stream of events, using a change data capture (CDC) job or a connector running on Kafka Connect. Our stream-processing application can then listen to this stream and update its cache based on the database change events.
This lets us keep our own private copy of the table and be notified whenever a database change occurs, so that we can update our copy accordingly.
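The idea of a private copy maintained from change events can be sketched in plain Java as below. The `CdcUserTable` class and the `UserChange` event shape (with a null name meaning "row deleted") are hypothetical and chosen only for this illustration. In Kafka Streams itself, this roughly corresponds to reading the change topic as a `KTable` and joining the click `KStream` against it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class CdcUserTable {
    // Hypothetical change event; a null name is treated as a delete.
    record UserChange(String userId, String name) {}

    private final Map<String, String> users = new HashMap<>();

    // Apply one change event from the CDC stream to the local copy.
    void apply(UserChange change) {
        if (change.name() == null) {
            users.remove(change.userId());
        } else {
            users.put(change.userId(), change.name());
        }
    }

    // Enrich a click using only local state; the database is never queried.
    Optional<String> enrich(String userId, String page) {
        String name = users.get(userId);
        return name == null ? Optional.empty() : Optional.of(name + " clicked " + page);
    }

    public static void main(String[] args) {
        CdcUserTable table = new CdcUserTable();
        table.apply(new UserChange("u1", "Alice"));
        System.out.println(table.enrich("u1", "/home"));
        table.apply(new UserChange("u1", "Alicia")); // update arrives as an event
        System.out.println(table.enrich("u1", "/home"));
    }
}
```

The local copy is updated only when a change actually happens, so there is no refresh interval to tune and no read load on the database from the click stream.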