Author: kundankumarr

Flink: Union operator on Multiple Streams

Reading Time: 3 minutes Apache Flink offers a rich set of APIs and operators that make Flink application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations like Union, Join, and so on. In this blog, we will explore the Union operator in Flink, which can combine two or more data streams together. We know that in real time we can have multiple data streams from different sources Continue Reading
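
As a quick illustration of the idea, here is a minimal sketch of what a union might look like in Flink's Scala DataStream API; the element values and the job name are invented for the example, and all unioned streams must share the same element type.

```scala
import org.apache.flink.streaming.api.scala._

object UnionExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Three streams of the same type; union requires identical element types
    val stream1: DataStream[String] = env.fromElements("a", "b")
    val stream2: DataStream[String] = env.fromElements("c", "d")
    val stream3: DataStream[String] = env.fromElements("e", "f")

    // union combines two or more streams into a single stream
    val merged: DataStream[String] = stream1.union(stream2, stream3)

    merged.print()
    env.execute("union-example")
  }
}
```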

Flink: Implementing the Session window

Reading Time: 3 minutes In the previous blogs, we learned about Tumbling, Sliding, and Count windows in Flink. There is another useful way to window the data that Flink offers, i.e., the Session window. So in this blog, we will explore the Session window in detail with an example. In the real world, all the work that we do online, such as visiting a website, clicking around the website, doing online Continue Reading
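
A minimal sketch of a session window in the Scala DataStream API, assuming click events arrive as "user,count" lines on a local socket; the host, port, and the 30-second session gap are made up for illustration:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

object SessionWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // (userId, clicks) events arriving on a socket, e.g. "user1,1"
    val clicks = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val Array(user, count) = line.split(",")
        (user, count.toInt)
      }

    // A new session closes when no event arrives for 30 seconds for a key
    clicks
      .keyBy(_._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
      .sum(1)
      .print()

    env.execute("session-window-example")
  }
}
```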

Flink: Implementing the Count Window

Reading Time: 3 minutes In the previous blog, we learned about Tumbling and Sliding windows, which are based on time. In this blog, we are going to learn to define Flink’s windows on another property, i.e., the Count window. As the name suggests, a count window is evaluated when the number of records received hits the threshold. A count window sets the window size based on how many entities exist within that window. For example, if we fix the count Continue Reading
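
A small sketch of a count window in the Scala DataStream API; the sensor readings and the threshold of three records per key are invented for the example:

```scala
import org.apache.flink.streaming.api.scala._

object CountWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // (sensorId, reading) pairs
    val readings = env.fromElements(
      ("sensor-1", 10), ("sensor-1", 20), ("sensor-1", 30),
      ("sensor-2", 5),  ("sensor-2", 15), ("sensor-2", 25)
    )

    // The window fires once 3 records have arrived for a key
    readings
      .keyBy(_._1)
      .countWindow(3)
      .sum(1)
      .print()

    env.execute("count-window-example")
  }
}
```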

Flink: Time Windows based on Processing Time

Reading Time: 4 minutes In the previous blog, we talked about Flink’s windows operator, the heart of processing infinite streams. Generally in Flink, after specifying whether the stream is keyed or non-keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners like Tumbling windows, Sliding windows, Session windows, Count windows, and Continue Reading
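
For illustration, here is a rough sketch of the tumbling and sliding assigners in the Scala DataStream API; the socket source, the ten-second window size, and the five-second slide are arbitrary choices for the example:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{SlidingProcessingTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

object TimeWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val words = env
      .socketTextStream("localhost", 9999)
      .map(word => (word, 1))

    // Tumbling window: fixed, non-overlapping 10-second buckets
    words
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .print()

    // Sliding window: 10-second windows evaluated every 5 seconds
    words
      .keyBy(_._1)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
      .sum(1)
      .print()

    env.execute("time-window-example")
  }
}
```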

Spark SQL in Delta Lake 0.7.0

Reading Time: 3 minutes Nowadays, Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it resolves many of the issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. Continue Reading
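
As a hedged sketch of what this looks like in code, the snippet below assumes Delta Lake 0.7.0 running on Spark 3.0 with the delta-core dependency on the classpath; the table name and the inserted values are made up:

```scala
import org.apache.spark.sql.SparkSession

object DeltaSqlExample {
  def main(args: Array[String]): Unit = {
    // Delta Lake 0.7.0 plugs into Spark 3.0's SQL engine
    // through these two session configurations
    val spark = SparkSession.builder()
      .appName("delta-sql-example")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Plain Spark SQL over a Delta table
    spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, data STRING) USING delta")
    spark.sql("INSERT INTO events VALUES (1, 'first'), (2, 'second')")
    spark.sql("SELECT * FROM events").show()
  }
}
```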

Basic Anatomy of a Flink Program

Reading Time: 3 minutes Hi Folks! I hope you are all safe during the COVID-19 pandemic and learning new tools and tech while staying at home. I have also just started learning a very prominent Big Data framework for stream processing, which is Flink. Flink is a distributed framework based on the streaming-first principle, which means it is a true stream processing engine and implements batch processing as a special case. In Continue Reading
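
By way of illustration, a minimal word-count skeleton showing the typical parts of a Flink program (environment, source, transformations, sink, execute); the socket source and the job name are placeholders:

```scala
import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment   // 1. obtain the environment

    env.socketTextStream("localhost", 9999)                        // 2. source
      .flatMap(_.toLowerCase.split("\\s+"))                        // 3. transformations
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)
      .print()                                                     // 4. sink

    env.execute("word-count")                                      // 5. trigger execution
  }
}
```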

Windows operator: Heart of processing infinite streams in Flink

Reading Time: 3 minutes Apache Flink is an open-source, distributed Big Data framework for stream and batch data processing. Flink is based on the streaming-first principle, which means it is a true stream processing engine and implements batching as a special case. Flink is considered to have a heart, and it is the “Windows” operator. It makes Flink capable of processing infinite streams quickly and efficiently. Windows split Continue Reading
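
A rough sketch of how the windows operator slots into a program, contrasting a keyed window with a non-keyed windowAll; the socket source and the five-second window size are arbitrary for the example:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowsOperatorShape {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val numbers = env.socketTextStream("localhost", 9999).map(n => (n, 1))

    // Keyed stream: windows are evaluated per key, in parallel
    numbers
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)
      .print()

    // Non-keyed stream: windowAll gathers all elements in a single task
    numbers
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)
      .print()

    env.execute("windows-operator-shape")
  }
}
```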

Creating Data Pipeline with Spark Streaming, Kafka and Cassandra

Reading Time: 3 minutes Hi Folks!! In this blog, we are going to learn how we can integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline. Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams. Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data Continue Reading
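
A hedged sketch of such a pipeline, assuming the spark-cassandra-connector is on the classpath; the Kafka broker address, topic, keyspace, table, and checkpoint path are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-cassandra")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Read a stream of records from a Kafka topic
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    // Write each micro-batch to Cassandra via the spark-cassandra-connector
    val query = kafkaStream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "demo", "table" -> "events"))
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/tmp/kafka-to-cassandra-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```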

Spark: Streaming Datasets

Reading Time: 3 minutes Spark provides us a high-level API, Dataset, which makes it easy to get type safety and safely perform manipulations in a distributed and a local environment without code changes. Also, Spark Structured Streaming, a high-level API for stream processing, allows us to stream a particular Dataset, which is nothing but a type-safe structured stream. In this blog, we will see how we can create Continue Reading
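
As a small illustration, the sketch below turns "name,age" lines from a socket into a typed Dataset of a hypothetical Person case class; the host and port are placeholders:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object StreamingDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Lines such as "alice,29" arriving on a socket become a typed stream
    val people = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]                       // Dataset[String]
      .map { line =>
        val Array(name, age) = line.split(",")
        Person(name, age.toInt)
      }                                 // Dataset[Person]

    people.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```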

Delta Lake: Schema Enforcement & Evolution

Reading Time: 4 minutes Nowadays, data is constantly evolving and changing. As business problems and requirements evolve, the shape or structure of the data also changes. When that happens, we want to be in control of how the data or schema changes. But how can we achieve this? Delta Lake has good ways to control how the schema changes. With Delta Lake, users have Continue Reading
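
A brief sketch of both behaviours with the Delta Lake write APIs (delta-core assumed on the classpath); the path, column names, and sample rows are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object DeltaSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-schema-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/delta/customers"

    // Initial write fixes the table schema: (id, name)
    Seq((1L, "alice"), (2L, "bob"))
      .toDF("id", "name")
      .write.format("delta").mode("append").save(path)

    // Schema enforcement: appending a frame with an extra column fails
    // with an AnalysisException unless we opt in to schema evolution
    val withCountry = Seq((3L, "carol", "IN")).toDF("id", "name", "country")

    withCountry.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")   // schema evolution: add the new column
      .save(path)
  }
}
```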

Fetching data from different sources using Spark 2.1

Spark: createDataFrame() vs toDF()

Reading Time: 2 minutes There are two different ways to create a DataFrame in Spark: first, using toDF(), and second, using createDataFrame(). In this blog, we will see how we can create a DataFrame using these two methods and what the exact difference between them is. The toDF() method provides a very concise way to create a DataFrame. This method can be applied to a sequence of objects. To access Continue Reading
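
A short sketch contrasting the two approaches; the sample data and column names are made up for illustration:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object CreateDataFrameVsToDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("createDataFrame-vs-toDF")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(("alice", 29), ("bob", 31))

    // toDF(): concise, column names passed inline, types are inferred
    val df1 = data.toDF("name", "age")

    // createDataFrame(): more verbose, but the schema is fully explicit
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))
    val rows = spark.sparkContext.parallelize(data.map { case (n, a) => Row(n, a) })
    val df2 = spark.createDataFrame(rows, schema)

    df1.printSchema()
    df2.printSchema()
  }
}
```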

Cluster vs Client: Execution modes for a Spark application

Reading Time: 3 minutes Whenever we submit a Spark application to the cluster, the Driver, or the Spark App Master, should get started, and the Driver will start N workers. The Spark driver manages the SparkContext object to share data and coordinate with the workers and the cluster manager across the cluster. The Cluster Manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers will Continue Reading
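
As a small, hedged illustration, the snippet below only reads back the deploy mode that was chosen at submit time; the spark-submit commands in the comments show where cluster and client mode are selected, and the application jar name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object DeployModeCheck {
  def main(args: Array[String]): Unit = {
    // The deploy mode is normally chosen when submitting, e.g.
    //   spark-submit --master yarn --deploy-mode cluster --class DeployModeCheck app.jar
    //   spark-submit --master yarn --deploy-mode client  --class DeployModeCheck app.jar
    val spark = SparkSession.builder().appName("deploy-mode-check").getOrCreate()
    val sc = spark.sparkContext

    // In client mode the driver runs on the submitting machine;
    // in cluster mode it runs inside the cluster next to the executors.
    println(s"master      = ${sc.master}")
    println(s"deploy mode = ${sc.getConf.get("spark.submit.deployMode", "client")}")

    spark.stop()
  }
}
```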

Spark: Type Safety in Dataset vs DataFrame

Reading Time: 4 minutes With type safety, programming languages prevent type errors; in other words, type safety means the compiler will validate types while compiling and throw an error when we try to assign a wrong type to a variable. Spark, a unified analytics engine for big data processing, provides two very useful APIs, DataFrame and Dataset, that are easy to use, intuitive, and expressive, which makes Continue Reading
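
A compact sketch of the difference, using a hypothetical Employee case class; the commented-out lines mark where the Dataset version fails at compile time while the DataFrame version fails only when the query runs:

```scala
import org.apache.spark.sql.SparkSession

case class Employee(name: String, age: Int)

object TypeSafetyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("type-safety-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Employee("alice", 29), Employee("bob", 31)).toDS()
    val df = ds.toDF()

    // Dataset: fields are checked at compile time
    ds.map(e => e.age + 1).show()
    // ds.map(e => e.salary)            // does not compile: no field 'salary'

    // DataFrame: column names are plain strings, so a typo
    // compiles fine and only fails at runtime with an AnalysisException
    df.select("age").show()
    // df.select("salary").show()       // compiles, fails when the query runs
  }
}
```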