Data Streaming with AWS Kinesis

Data is an essential asset for modern businesses because it lets them monitor every aspect of their operations. Every second, large volumes of data are processed, analysed, and transformed, so handling continuously generated data well is important. As the number, variety, and velocity of data sources grow, new architectures and technologies are needed, and this is where data streaming services come in. One of these services is Amazon Kinesis Data Streams.

What is Amazon Kinesis Data Streams?

Kinesis Data Streams is a fully managed streaming data service from AWS. It makes it easy to collect and process large streams of data records in real time. It’s a highly scalable service that can stream gigabytes of data per second.

You can continuously add various types of data such as clickstreams, application logs, and social media to a Kinesis stream from hundreds of thousands of sources. Within seconds, the data will be available for your Kinesis Applications to read and process from the stream.

Kinesis Data Streams is part of the AWS Kinesis streaming data platform, along with Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics.

Kinesis Data Streams Architecture

Producer applications collect streaming data from various data sources and continually push it to a Kinesis data stream. Consumer applications, in turn, read the data from the stream and process it in real time, as shown in the diagram below:

Diagram illustrating the high-level architecture of Kinesis Data Streams

A consumer can be a custom application running on Amazon EC2 instances or EMR clusters, a Lambda function, or a Kinesis Data Firehose delivery stream. Once the consumer has finished processing, the useful data is moved to other AWS services such as DynamoDB, S3, EMR, or Redshift.

Key Concepts and Terminology

Data Producer

A Data Producer is an application that puts the data records into Amazon Kinesis Data Streams. For example, a web server sending log data to a stream is a producer.

Data Consumer

A Data Consumer is a distributed Kinesis application or AWS service that retrieves data records from Amazon Kinesis Data Streams and processes them.

Kinesis Data Stream

A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams.

Shard

  • A shard is a uniquely identified sequence of data records in a stream.
  • A stream is composed of one or more shards, each of which provides a fixed unit of capacity.
  • For reads, each shard supports up to 5 transactions per second, with a maximum total read rate of 2 MB per second. For writes, each shard supports up to 1,000 records per second, with a maximum total write rate of 1 MB per second.
  • The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.
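Since stream capacity is just the sum of the shard capacities, the per-shard limits listed above can be turned into a rough sizing calculation. The snippet below is a minimal sketch, not an official sizing tool: the function name `shards_needed` and the sample workload figures are made up for illustration, and the 5 read transactions per second limit is left out for simplicity.

```python
import math

# Per-shard limits as quoted in the list above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that covers all three demands.

    Hypothetical helper: each demand is divided by the matching per-shard
    limit and rounded up; the largest result decides the shard count.
    """
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Illustrative workload: 3.5 MB/s written, 2,500 records/s written, 9 MB/s read.
print(shards_needed(3.5, 2500, 9.0))  # prints 5 (the read rate is the bottleneck)
```

In this example the write side alone would need 4 shards, but the read rate of 9 MB per second pushes the requirement up to 5.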

Data Record

A data record is the unit of data stored in a Kinesis data stream. Data records are composed of a sequence number, a partition key, and a data blob, which is an immutable sequence of bytes. A data blob can be up to 1 MB.
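To make the three parts of a record concrete, here is a small sketch of how a record could be modelled on the producer side. This is not the AWS SDK's own type: the class name `DataRecord` is hypothetical, and reading "1 MB" as 1,048,576 bytes is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the "1 MB" blob limit is interpreted as 1024 * 1024 bytes.
MAX_BLOB_BYTES = 1024 * 1024


@dataclass(frozen=True)  # frozen mirrors the immutability of the data blob
class DataRecord:
    """Hypothetical local model of a Kinesis data record."""

    partition_key: str
    data: bytes
    # The service assigns the sequence number, so it is unknown before the put.
    sequence_number: Optional[str] = None

    def __post_init__(self):
        if not self.partition_key:
            raise ValueError("partition key must not be empty")
        if len(self.data) > MAX_BLOB_BYTES:
            raise ValueError(
                f"data blob is {len(self.data)} bytes, limit is {MAX_BLOB_BYTES}"
            )


record = DataRecord(partition_key="user-42", data=b'{"event": "click"}')
print(record.partition_key, len(record.data))
```

Validating the size before sending lets a producer reject an oversized payload locally instead of waiting for the service to refuse it.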

Sequence Number

Each data record has a sequence number which is a unique identifier for that record. The sequence number is assigned by the Kinesis Data Streams service when a producer application calls the putRecord() or putRecords() operation to add data to a Kinesis Data Stream.

Partition Key

A partition key is typically a meaningful identifier, such as a user ID or timestamp. It is specified by the data producer while putting data into a Kinesis data stream and is used to determine which shard a given data record belongs to. Consumers can use the partition key to replay or build a history associated with the partition key.
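Under the hood, Kinesis Data Streams runs the partition key through an MD5 hash to get a 128-bit integer, and each shard owns a range of those hash values. The sketch below imitates that mapping locally. It assumes the hash space is split evenly across the shards, which may not hold for a stream that has been resharded, and the function names are made up for this example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: shard i owns an equal-width slice of the hash space.
    """
    if shard_count < 1:
        raise ValueError("a stream needs at least one shard")
    width = HASH_SPACE // shard_count
    # min() keeps the few values at the very top of the range in the last shard.
    return min(hash_key(partition_key) // width, shard_count - 1)


for key in ("user-1", "user-2", "user-3"):
    print(key, "-> shard", shard_for(key, 4))
```

Because the mapping depends only on the key, all records with the same partition key land in the same shard. This is also why a key with few distinct values can overload a single shard.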

Capacity Mode

  • A data stream's capacity mode determines how capacity is managed and how you are charged for using the stream. Currently, Kinesis Data Streams lets us choose between an on-demand mode and a provisioned mode.
  • With the on-demand mode, Kinesis Data Streams automatically manages the shards to provide the necessary throughput. We are charged only for the throughput we actually use, and Kinesis Data Streams accommodates our workloads’ throughput needs as they ramp up or down.
  • With the provisioned mode, we must specify the number of shards for the data stream. The total capacity of a data stream is the sum of the capacities of its shards. We can increase or decrease the number of shards as needed, and we are charged for the number of shards at an hourly rate.

Retention Period

The retention period is the length of time that data records remain accessible after they are added to the stream. A stream’s retention period defaults to 24 hours when the stream is created, and it can be extended up to 365 days at additional cost.
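As a quick illustration of what the retention period means for a consumer, the sketch below checks whether a record that arrived at a given time would still be readable. The helper `is_readable` is hypothetical and only does local date arithmetic. It also assumes that 24 hours is the lowest retention value allowed as well as the default.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 24      # default stated above
MAX_RETENTION_HOURS = 365 * 24    # 365 days, the stated maximum


def is_readable(arrival, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Return True if a record added at `arrival` is still within retention."""
    # Assumption: retention cannot be set below the 24-hour default.
    if not DEFAULT_RETENTION_HOURS <= retention_hours <= MAX_RETENTION_HOURS:
        raise ValueError("retention must be between 24 and 8760 hours")
    return now - arrival < timedelta(hours=retention_hours)


arrival = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
two_days_later = arrival + timedelta(days=2)

print(is_readable(arrival, two_days_later))                      # default 24 hours
print(is_readable(arrival, two_days_later, retention_hours=72))  # extended to 3 days
```

With the default retention the two-day-old record has already expired, while extending retention to 72 hours keeps it available for replay.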

That’s it for this blog. I hope it gave you a basic understanding of what Kinesis Data Streams is, how it works, and its key concepts.

For more information about Kinesis Data Streams, check out the official documentation here.

References

AWS Kinesis Documentation

Written by 

Prateek Gupta is a Software Consultant at Knoldus Inc. He likes to explore new technologies and believes in writing clean code. He has a good understanding of programming languages like Java and Scala. He also has good time management skills. In his leisure time, he likes to sing and watch sci-fi movies.