Kafka Streams: Data Enrichment with External Lookup


Kafka Streams is a client library for building applications whose input and output data are stored in an Apache Kafka cluster. It combines the simplicity of writing and deploying Java and Scala processing applications on the client side with the benefits of Kafka’s server-side cluster technology.

When working with Kafka Streams, there are times when the stream processing application requires data external to the stream. For instance, an application that counts clicks per user may need details about the users who clicked, which live in an external database.

Approach #1: External Lookup

An obvious approach to this problem is, for every click event produced in the stream, to look up the user in the external database and write an event containing both the original click and the user information to another topic.
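A minimal sketch of this per-event lookup, with hypothetical names: the external user database is simulated by a `Map`, and `enrich` plays the role of the per-record mapping (what a Kafka Streams topology would do in `mapValues`), issuing one lookup for every click event.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: enrich each click event with user data fetched
// per event from an external store (simulated here by a Map).
class ClickEnricher {
    // Simulated external user database; a real application would make a
    // network call here -- the source of the per-event latency.
    static final Map<String, String> USER_DB = Map.of(
            "u1", "Alice", "u2", "Bob");

    // One external lookup per click event, mirroring a mapValues step
    // in a Kafka Streams topology.
    static String enrich(String userId) {
        String name = USER_DB.getOrDefault(userId, "unknown");
        return userId + " -> " + name;
    }

    public static void main(String[] args) {
        List<String> clicks = List.of("u1", "u2", "u1");
        List<String> enriched = clicks.stream()
                .map(ClickEnricher::enrich)
                .collect(Collectors.toList());
        System.out.println(enriched);
    }
}
```

Note that every element of the input stream triggers a lookup, which is exactly where the latency and database load discussed below come from.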

Problem with this approach

The external lookup will add a significant amount of latency to the processing of every input click event.

Also, the additional load it places on the external database may not be acceptable: a typical stream-processing system can handle 100K-500K events per second, while a database may only handle around 10K requests per second. So this solution may not be feasible.

Approach #2: Cache in the Application

Another approach is to cache the information from the database in our stream processing application. The challenge with this approach is keeping the cached data up to date.

Problem with this approach

This approach can leave stale information in the cache.

If we refresh the cache too often, we are still hammering the database and the cache does not help much; if we wait too long between refreshes, we end up doing stream processing with stale information.
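The trade-off above can be sketched with a simple time-to-live (TTL) cache; all names here are hypothetical and the database is again simulated by a `Map`. Entries older than the TTL are re-fetched, so a short TTL means more database load and a long TTL means more staleness.

```java
import java.util.*;

// Hypothetical sketch of an in-application cache with a time-to-live:
// entries older than TTL_MS are re-fetched from the (simulated) database,
// illustrating the freshness-vs-load trade-off.
class UserCache {
    // Simulated external database; mutable so we can model updates.
    static final Map<String, String> USER_DB =
            new HashMap<>(Map.of("u1", "Alice"));

    static final long TTL_MS = 60_000;

    static class Entry {
        final String value;
        final long loadedAt;
        Entry(String value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    final Map<String, Entry> cache = new HashMap<>();
    int dbLookups = 0;   // counts how often we actually hit the database

    String get(String userId, long nowMs) {
        Entry e = cache.get(userId);
        if (e == null || nowMs - e.loadedAt > TTL_MS) {
            dbLookups++;                          // cache miss or expired: hit the DB
            e = new Entry(USER_DB.get(userId), nowMs);
            cache.put(userId, e);
        }
        return e.value;                           // may be stale for up to TTL_MS
    }
}
```

Whatever TTL we choose, a database update made just after a refresh stays invisible to the application until the entry expires.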

Approach #3: Using Change Data Capture

We can capture every change made to the database table as a stream of events, using a change data capture (CDC) job or one of the connectors available for Kafka Connect. Our stream-processing application can then listen to this stream and update its local copy based on the database change events.

This allows us to keep our own private copy of the table and to be notified whenever there is a database change event, so that we can update our copy accordingly.
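The CDC approach can be sketched as follows, with hypothetical names: change events from the database are applied to a local copy of the user table (much like materializing a CDC topic as a KTable in Kafka Streams), and click events are then enriched from that local copy with no per-event external lookup.

```java
import java.util.*;

// Hypothetical sketch of the CDC approach: database change events arrive
// on a stream, and the application maintains its own copy of the user
// table -- conceptually, a KTable built from the CDC topic that click
// events are joined against.
class CdcMirror {
    // Our private, always-up-to-date copy of the user table.
    final Map<String, String> localUsers = new HashMap<>();

    // Apply one change event from the CDC stream; a null value models a
    // delete (a tombstone record in Kafka terms).
    void apply(String userId, String name) {
        if (name == null) localUsers.remove(userId);
        else localUsers.put(userId, name);
    }

    // Enrich a click from the local copy -- no external lookup per event.
    String enrich(String userId) {
        return userId + " -> " + localUsers.getOrDefault(userId, "unknown");
    }
}
```

Because updates are pushed to us as they happen, the local copy stays current without polling the database, avoiding both the latency of Approach #1 and the staleness of Approach #2.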


Written by 

Himani is a Software Consultant with more than 2.5 years of experience. She is very dedicated, hardworking and focused. She is familiar with C#, C++, C, PHP, Scala and Java and has an interest in functional programming. She is very helpful and loves to share her knowledge. Her hobbies include reading books and cooking.