Apache Kafka for beginners

Reading Time: 4 minutes

Introduction

One of the biggest challenges associated with big data is, analyzing the data. But before we get to that part, the data has to be first collected, and also for a system to process impeccably it should be able to grasp and make the data available to users. This is where Apache Kafka comes in handy.

Let’s briefly understand how Kafka came into existence? It’s developed by a team from LinkedIn in 2011, to solve the low-latency ingestion of large amounts of event data from their website and to handle real-time event processing systems. Later, it was donated to the Apache Software Foundation.

Apache Kafka has the ability to handle trillions of events occurring in a day. Kafka was initially developed for a messaging queue. A message queuing system helps in transferring data between applications so that the applications can just concentrate on the data rather than on how the data can be transferred and shared, You can also check here

Architecture of Apache Kafka

Kafka is usually integrated with Apache Storm, Apache HBase, and Apache Spark in order to process real-time streaming data. It is capable of delivering massive message streams to the Hadoop cluster regardless of the industry or use case. Its process flow can be better understood if we take a close look into its ecosystem.

It’s an deployed as a cluster implemented on one or more servers. The cluster is capable of storing topics that consist of streams of ‘records’ or ‘messages’. Every message holds details like a key and a value. Brokers are abstractions used to manage the persistence and replication of the message.

Basically, it has four core APIs:

  • Producer API: This API permits applications to publish a stream of records to one or more topics.
  • Consumer API: Consumer API lets applications to subscribe to one or more topics and process the produced stream of records.
  • Streams API: This API takes the input from one or more topics and produce the output to one or more topics by converting the input streams to the output ones.
  • Connector API: This API is responsible for producing and executing reusable producers and consumers who are able to link topics to the existing applications.

Things We Can Test in a Kafka Applications

Put simply, we can assert whether we have successfully produced (written) a particular record or stream of records to a topic. And we can also assert the consumed (fetched) record or stream of records from one or more topic(s).

Also, we can dive in further and assert at granular levels, for instance :

  1. Whether we have produced the record to a particular partition.
  2. The type of record we are able to produce or consume.
  3. The number of records written to a topic or fetched from a topic.
  4. The offset of the record.
  5. Sending and receiving AVRO or JSON records and asserting the outcome.
  6. Assert the DLQs (Dead Letter Queue) record(s) and the record-metadata.
  7. Schema Registry for AVRO and validate records.
  8. And KSQL of querying streaming data in a SQL fashion and validate the result.

Kafka Testing Challenges

The difficult part is some part of the application logic or a DB procedure keeps producing records to a topic and another part of the application keeps consuming the records and continuously processes them based on business rules.

The records, partitions, offsets, exception scenarios, etc. keep on changing, making it difficult to think in terms of what to test, when to test, and how to test.

Testing Solution Approach

We can go for an end-to-end testing approach which will validate both producing, consuming, and DLQ records as well as the application processing logic. This will give us good confidence in releasing our application to higher environments.

We can do this by bringing up Kafka in dockerized containers or by pointing our tests to any integrated test environment somewhere in our Kubernetes-Kafka cluster or any other micro-services infrastructure.

Here we pick a functionality, produce the desired record and validate, consume the intended record and validate, alongside the HTTP REST or SOAP API validation which helps in keeping our tests much cleaner and less noisy.

To keep the tutorial concise, we will demonstrate only the below aspects.

1.Producer Testing

2.Consumer Testing

Producer Testing

When we produce a record to a topic we can verify the acknowledgment from a Kafka broker. This verification is in the format of ‘ recordMetadata ‘ .

For example, visualizing the “recordMetaData” as JSON would look like:

Consumer Testing

When we read or consume from a topic we can verify the record(s) fetched from the topics. Here we can validate/assert some of the metadata too, but most of the time you might need to deal with the records only (not the metadata).

There may be times, for instance, that we validate only the number of records, i.e. the size only, not the actual records

For example, visualizing the fetched “records” as JSON would look like:

The full record(s) with the meta-data information looks like what we’ve got below, which we can also validate/assert if we have a test requirement to do so.

Conclusion

In this short tutorial, we learned some fundamental concepts of Kafka and the minimum things we need to know before proceeding with Kafka Testing. Also, we learned what all things we can cover in our testing.

References

https://kafka.apache.org/28/documentation/streams/architecture

https://kafka.apache.org/intro

https://kafka.apache.org/documentation/#api

message passing mpsc