Knoldus Blogs

Stateful stream processing with Apache Flink (part 1): An introduction

Reading Time: 4 minutes

Apache Flink, a fourth-generation Big Data processing framework, provides robust stateful stream processing capabilities. Over the next few posts, we will learn what stateful stream processing is and how we can use Flink to write a stateful streaming application. What is stateful stream processing? In general, stateful stream processing is an application design pattern for processing an unbounded stream of events.

A Quick Demo: Kafka to Flink to Cassandra

December 19, 2020 | December 21, 2020 | Apache Flink, Apache Kafka, Big Data and Fast Data, Cassandra, Database, Flink, Functional Programming, NoSql, Studio-Scala | #apache flink, Apache Kafka, Cassandra, data analysis, DataStream API, Flink, Flink Streaming, pipeline, Streaming, streaming analytics, streaming data

Reading Time: 3 minutes

Hi folks! In this blog, we will learn how to integrate Flink with Kafka and Cassandra to build a simple streaming data pipeline. Apache Flink is a framework and distributed processing engine used for stateful computations over unbounded and bounded data streams. Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data like a messaging system. Cassandra: a distributed and wide-column …

Flink: Join two Data Streams

December 3, 2020 | Apache Flink, Big Data and Fast Data, Flink, Java | #apache flink, Big, Big Data Analytics, fast data analytics, Flink, Flink Streaming, joins, Streaming, streaming analytics

Reading Time: 3 minutes

Apache Flink offers a rich set of APIs and operators that make application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations such as Union and Join. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window. Let's say we have one stream that contains salary information of all …
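To picture what "joining on a given key and a common window" means, here is a minimal plain-Java sketch. It does not use the Flink API: the `Event` class, the `tumblingWindowJoin` helper, and the fixed-length tumbling window are assumptions made for illustration, and the sketch works on finite lists rather than on live streams.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowJoinSketch {

    /** A keyed, timestamped event. Field names are illustrative, not from the post. */
    public static final class Event {
        final String key;
        final String value;
        final long timestampMs;

        public Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Pairs every left event with every right event that has the same key and
     * falls into the same tumbling window of length windowMs.
     * Assumes non-negative timestamps, so integer division yields the window index.
     */
    public static List<String> tumblingWindowJoin(List<Event> left, List<Event> right, long windowMs) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = (l.timestampMs / windowMs) == (r.timestampMs / windowMs);
                if (sameKey && sameWindow) {
                    joined.add(l.key + ":" + l.value + "," + r.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> left = List.of(new Event("k1", "L1", 1_000), new Event("k2", "L2", 2_000));
        List<Event> right = List.of(
                new Event("k1", "R1", 4_000),   // same key and window as L1
                new Event("k1", "R2", 6_000),   // same key, but the next window
                new Event("k2", "R3", 2_500));  // same key and window as L2
        System.out.println(tumblingWindowJoin(left, right, 5_000));
        // prints [k1:L1,R1, k2:L2,R3]
    }
}
```

With a 5-second window, `R2` is not joined with `L1` even though the keys match, because the two events fall into different windows.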

Flink: Union operator on Multiple Streams

September 15, 2020 | September 16, 2020 | Apache Flink, Big Data and Fast Data, Flink, Java, Streaming, Streaming Solutions | #apache flink, Big, Big Data Analytics, fast data analytics, Flink, Flink Streaming, Streaming, streaming analytics

Reading Time: 3 minutes

Apache Flink offers a rich set of APIs and operators that make application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations such as Union and Join. In this blog, we will explore the Union operator in Flink, which can combine two or more data streams. We know that, in real time, we can have multiple data streams from different sources …

Flink: Implementing the Session window.

September 14, 2020 | Apache Flink, Big Data and Fast Data, Flink, Java, Streaming, Streaming Solutions | #apache flink, Apache Flink Cluster, Big Data, Big Data Analytics, Flink, Flink Streaming, Streaming

Reading Time: 3 minutes

In the previous blogs, we learned about Tumbling, Sliding, and Count windows in Flink. Flink offers another useful way to window data: the Session window. In this blog, we will explore the Session window in detail with an example. In the real world, all the work that we do online (visiting a website, clicking around the website, doing online …

Flink: Implementing the Count Window

September 10, 2020 | Apache Flink, Big Data and Fast Data, Flink, Functional Programming, Java | #apache flink, Big Data, Big Data Analytics, DataStream API, Flink Streaming, Window Functions

Reading Time: 3 minutes

In the previous blog, we learned about Tumbling and Sliding windows, which are based on time. In this blog, we will learn to define Flink’s windows on another property, the record count, using the Count window. As the name suggests, a count window is evaluated when the number of records received hits the threshold. A count window sets the window size based on how many entities exist within that window. For example, if we fix the count …
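The "evaluate when the record count hits the threshold" behaviour can be sketched in a few lines of plain Java. This is not Flink code: the `sumPerCountWindow` helper, the per-key buffering, and the sum aggregation are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountWindowSketch {

    /**
     * Buffers values per key and emits "key=sum" each time a key's buffer
     * reaches `size` records. Records in a window that never fills up are
     * not emitted, because the count threshold is never reached.
     */
    public static List<String> sumPerCountWindow(List<Map.Entry<String, Integer>> records, int size) {
        Map<String, List<Integer>> buffers = new LinkedHashMap<>();
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> record : records) {
            List<Integer> buffer = buffers.computeIfAbsent(record.getKey(), k -> new ArrayList<>());
            buffer.add(record.getValue());
            if (buffer.size() == size) {      // threshold hit: evaluate the window
                int sum = 0;
                for (int v : buffer) {
                    sum += v;
                }
                emitted.add(record.getKey() + "=" + sum);
                buffer.clear();               // start the next window for this key
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 1), Map.entry("b", 10), Map.entry("a", 2),
                Map.entry("a", 3), Map.entry("b", 20), Map.entry("a", 4));
        System.out.println(sumPerCountWindow(records, 2));
        // prints [a=3, b=30, a=7]
    }
}
```

Each key fills its own window independently, so the result for `b` is emitted only once its second record arrives, regardless of how many `a` records were seen in between.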

Flink: Time Windows based on Processing Time

September 9, 2020 | October 22, 2020 | Apache Flink, Big Data and Fast Data, Flink, Java, ML, AI and Data Engineering | #apache flink, #apache-flink, #flink, DataStream API, Flink, Flink Streaming, real time streaming data, Stream Processing, Window Functions

Reading Time: 4 minutes

In the previous blog, we talked about Flink’s Windows operator, the heart of processing infinite streams. Generally in Flink, after specifying whether the stream is keyed or non-keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners such as Tumbling windows, Sliding windows, Session windows, Count windows, and …

Basic Anatomy of a Flink Program

September 2, 2020 | October 22, 2020 | Apache Flink, Big Data and Fast Data, Flink, Java | #apache flink, #flink, Big Data, Big Data Analytics, fast data, Flink, Stream Processing, Streaming

Reading Time: 3 minutes

Hi folks! I hope you are all safe during the COVID-19 pandemic and learning new tools and tech while staying at home. I have also just started learning Flink, a very prominent Big Data framework for stream processing. Flink is a distributed framework based on the streaming-first principle, which means it is a true stream processing engine and implements batch processing as a special case. In …

Windows operator: Heart of processing infinite streams in Flink

August 24, 2020 | Apache Flink, Big Data and Fast Data, Flink, Streaming, Streaming Solutions, Studio-Scala | #apache flink, Big Data, Big Data Analytics, Flink, Flink Streaming, Stream Processing, Streaming

Reading Time: 3 minutes

Apache Flink is an open-source, distributed Big Data framework for stream and batch data processing. Flink is based on the streaming-first principle, which means it is a true stream processing engine and implements batching as a special case. The “Windows” operator is considered the heart of Flink. It makes Flink capable of processing infinite streams quickly and efficiently. Windows split …

Reading Avro files using Apache Flink

June 24, 2020 | Apache Flink, Flink, Streaming, Streaming Solutions, Studio-Scala | #apache-flink, #avro files, apache, avro, Flink, programming, scala

Reading Time: 2 minutes

In this blog, we will see how to read Avro files using Flink. Before reading the files, let’s get an overview of Flink. There are two types of processing: batch and real-time. Batch processing works on data collected over time. Real-time processing works on immediate data to produce an instant result. Real-time processing is in demand and Apache Flink is the …

Using Apache Flink for Kinesis to Kafka Connect

June 24, 2020 | Apache Flink, Flink, Studio-Scala | #apache-flink, #kinesis, apache, Flink Streaming, kafka, scala

Reading Time: 3 minutes

In this blog, we are going to use Kinesis as a source and Kafka as a sink. Let’s get started.
Step 1: Apache Flink provides the Kinesis and Kafka connector dependencies. Let’s add them to our build.sbt:
Step 2: The next step is to create a pointer to the environment on which this program runs.
Step 3: Setting parallelism of x here will cause all …

Comparison between different streaming engines

June 24, 2020 | October 12, 2020 | akka-streams, Apache Flink, Apache Kafka, Apache Spark, Big Data and Fast Data, Spark, Studio-Scala | Akka, akka-streams, Flink, Flink Streaming, kafka, Kafka Streams, Spark, Spark Streaming

Reading Time: 5 minutes

Distributed stream processing engines have been on the rise in the last few years. First Hadoop became popular as a batch processing engine, then the focus shifted towards stream processing engines. Stream processing engines can make the job of processing data that arrives via a stream easier than ever before and, by using clustering, can process larger data sets in a timely manner.

Flink on Kubernetes

April 29, 2020 | Apache Flink, Flink, Studio-DevOps

Reading Time: 3 minutes

Introduction

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. There are two kinds of Flink clusters: the Flink session cluster and the Flink job cluster. A job cluster is a dedicated cluster that runs a single job. The job is part of …