By now you must be familiar with KSQL and how to get started with it. If not, check out Part 1 of this series, KSQL: Getting started with Streaming SQL for Apache Kafka.
In this blog, we’ll move one step forward and get an understanding of the dual streaming model, to see what abstractions KSQL uses to process data.
All the data that we work on with KSQL is produced to Kafka topics by some client. This client can be any application, a Kafka connector, etc., that produces a continuous, never-ending flow of data to the topics.
KSQL does not interact with these topics directly; rather, it introduces a couple of abstractions in between to process the data, known as Streams and Tables.
What are Streams and Tables?
Both Streams and Tables are wrappers on top of Kafka topics, which hold continuous, never-ending data. Conceptually, however, these abstractions are different:
A stream represents data in motion, capturing events happening in the world, and has the following features:
It stores a never-ending, continuous flow of data and is therefore unbounded.
Any new data that comes in is appended to the stream and never modifies existing records, making the stream completely immutable.
A table, on the other hand, represents data at rest: a materialized view of that stream of events with the latest value for each key. It has the following features:
It represents a snapshot of the stream at a point in time and is therefore bounded.
Any new data (a <Key, Value> pair) that comes in is added to the table if no entry with the same key exists; otherwise, the existing record is mutated to hold the latest value for that key.
Note: Every record in a Kafka topic is represented as a key-value pair, so a table will always show the latest value for a given key.
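The append-versus-upsert distinction can be sketched in a few lines of plain Python (not KSQL; the records and variable names here are purely illustrative):

```python
# The same sequence of key-value records, consumed two ways.
records = [("alice", 1), ("bob", 2), ("alice", 3)]

# Stream semantics: every record is appended; existing records are never modified.
stream = []
for key, value in records:
    stream.append((key, value))

# Table semantics: records upsert by key; only the latest value per key survives.
table = {}
for key, value in records:
    table[key] = value

print(stream)  # [('alice', 1), ('bob', 2), ('alice', 3)]
print(table)   # {'alice': 3, 'bob': 2}
```

Note how the stream keeps both records for the key `alice`, while the table keeps only the latest one.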
Let’s understand this with a real-world scenario.
Transactions and Account Balances
Consider a banking system that wants to keep track of the transfer of money using transactions and account balances. Let’s see how streams and tables deal with this differently.
Suppose the initial balances for two users, Alice and Bob, are $200 and $100 respectively, and the following transactions happen between these two users:
Transaction 1: Alice gives $100 to Bob
Transaction 2: Bob gives $50 to Alice
Transaction 3: Bob gives $100 to Alice
Here, a stream represents an immutable sequence of transaction events recording money transferred from one account to another; it closely mirrors the underlying topic data that records those events.
A table, on the other hand, reflects the latest state of each account, evolving after every transaction for that user. The table stores the current balance, for instance.
Thus, the table holds the current state of the accounts, while the stream captures the record of the transactions.
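The scenario above can be sketched in plain Python (not KSQL; the data structures are illustrative only): replaying the stream of transaction events yields the table of current balances.

```python
# The table: current balance per account, starting from the initial state.
balances = {"alice": 200, "bob": 100}

# The stream: an immutable sequence of (sender, receiver, amount) transfer events.
transactions = [
    ("alice", "bob", 100),  # Transaction 1: Alice gives $100 to Bob
    ("bob", "alice", 50),   # Transaction 2: Bob gives $50 to Alice
    ("bob", "alice", 100),  # Transaction 3: Bob gives $100 to Alice
]

# Each event in the stream mutates the table to its latest state.
for sender, receiver, amount in transactions:
    balances[sender] -= amount
    balances[receiver] += amount

print(balances)  # {'alice': 250, 'bob': 50}
```

After all three transactions, the table shows Alice at $250 and Bob at $50, while the stream still records every individual transfer.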
The duality of Streams and Tables
The stream-table duality describes the close relationship between streams and tables.
Stream as Table
A stream can be considered a changelog of a table, as the aggregation of a stream of updates over time yields a table.
Table as Stream
A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream’s data records are key-value pairs). The observation of changes to a table over time yields a stream.
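This duality can be sketched in plain Python (again illustrative, not KSQL): aggregating a changelog stream reproduces the table, and recording each mutation of a table reproduces the stream.

```python
# A changelog stream of (key, latest value) updates.
changelog = [("alice", 100), ("bob", 50), ("alice", 250)]

def to_table(stream):
    """Stream -> Table: replay the changelog, keeping the latest value per key."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

def observe(table, updates):
    """Table -> Stream: apply updates to the table, emitting each change as an event."""
    stream = []
    for key, value in updates:
        table[key] = value
        stream.append((key, value))
    return stream

assert to_table(changelog) == {"alice": 250, "bob": 50}
assert observe({}, changelog) == changelog  # observing the changes reproduces the stream
```

The round trip is the duality in miniature: the table is the aggregation of the stream, and the stream is the changelog of the table.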
In this blog, we discussed the two abstractions used by KSQL, Streams and Tables, and the distinction between them, and then related the two through the duality of streams and tables. We concluded that a stream represents the complete history of the data as an immutable sequence of events, while a table represents its latest state as a snapshot.
In our next blog, we’ll see how to create streams and tables in KSQL and get hands-on processing some real-time data.
You can now learn more about Apache Kafka with the help of our book, Putting Apache Kafka to Use.