Big Data Analytics

Apache Beam: Ways to join PCollections

Reading Time: 4 minutes Joining multiple sets of data into a singular entity is very often when working with data pipelines. In this blog, We will cover how we can perform Join operations between datasets in Apache Beam. There are different ways to Join PCollections in Apache beam – Extension-based joins Group-by-key-based joins Join using side input Let’s understand the above different way’s to perform Join with examples. We Continue Reading

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines. It has rich sources of APIs and mechanisms to solve complex use cases. In some use cases, while we define our data pipelines the requirement is, the pipeline should use some additional inputs. For example, In streaming analytics applications, it Continue Reading

Google BigQuery: An Introduction to Big Data Analytics Platform.

Reading Time: 6 minutes Hey Folks, Today we going to discuss Google BigQuery, an enterprise data warehouse with built-in machine learning capabilities. Before going to BigQuery, let’s understand what is Google Cloud Platform?Google Cloud Platform is a suite of public cloud computing services offered by Google. The platform includes a range of hosted services for compute, storage and application development that run on Google hardware. Google Cloud protects your data, applications, Continue Reading

Big Data Analytics: An Introduction

Reading Time: 5 minutes DATA ANALYTICS Data can help businesses better understand their customers and improve their advertising campaigns. It can also help personalise their content, and improve their bottom lines. The advantages of data are many, but you can’t access these benefits without the proper data analytics tools and processes. While raw data has a lot of potentials, you need data analytics to unlock the power to grow Continue Reading

Stateful stream processing with Apache Flink(part 1): An introduction

Reading Time: 4 minutes Apache Flink, a 4th generation Big Data processing framework provides robust stateful stream processing capabilities. So, in a few parts of the blogs, we will learn what is Stateful stream processing. And how we can use Flink to write a stateful streaming application. What is stateful stream processing? In general, stateful stream processing is an application design pattern for processing an unbounded stream of events. Continue Reading

Flink: Join two Data Streams

Reading Time: 3 minutes Apache Flink offers rich sources of API and operators which makes Flink application developers productive in terms of dealing with the multiple data streams. Flink provides many multi streams operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window. Let say we have one stream which contains salary information of all Continue Reading

Flink: Union operator on Multiple Streams

Reading Time: 3 minutes Apache Flink offers rich sources of API and operators which makes Flink application developers productive in terms of dealing with the multiple data streams. Flink provides many multi streams operations like Union, Join, and so on. In this blog, we will explore the Union operator in Flink that can combine two or more data streams together. We know in real-time we can have multiple data streams from different sources Continue Reading

Flink: Implementing the Session window.

Reading Time: 3 minutes In the previous blogs, we learned about Tumbling, Sliding, and Count windows in Flink. There is one another useful way to window the data which Flink offers i.e, Session window. So in this blog, we will explore the Session window in detail with an example. In the real world, all the work that we do online- Visiting a website, Clicking around the website, do online Continue Reading

Flink: Implementing the Count Window

Reading Time: 3 minutes In the blog, we learned about Tumbling and Sliding windows which is based on time. In this blog, we are going to learn to define Flink’s windows on other properties i.e Count window. As the name suggests, count window is evaluated when the number of records received, hits the threshold. Count window set the window size based on how many entities exist within that window. For example, if we fixed the count Continue Reading

Basic Anatomy of a Flink Program

Reading Time: 3 minutes Hi Folks! Hope you all are safe in the COVID-19 pandemic and learning new tools and tech while staying at home. I also have just started learning a very prominent Big Data framework for stream processing which is  Flink. Flink is a distributed framework and based on the streaming first principle, means it is a real streaming processing engine and implements batch processing as a special case. In Continue Reading

Windows operator: Heart of processing infinite streams in Flink

Reading Time: 3 minutes Apache Flink is an open-source, distributed, Big Data framework for stream and batch data processing. Flink is based on the streaming first principle which means it is a real streaming processing engine and implements batching as a special case. Flink is considered to have a heart and it is the “Windows” operator. It makes Flink capable of processing infinite streams quickly and efficiently. Windows split Continue Reading

Spark: Streaming Datasets

Reading Time: 3 minutes Spark providing us a high-level API – Dataset, which makes it easy to get type safety and securely perform manipulation in a distributed and a local environment without code changes. Also, spark structured streaming, a high-level API for stream processing allows us to stream a particular Dataset which is nothing but a type-safe structured streams. In this blog, we will see how we can create Continue Reading

Delta Lake: Schema Enforcement & Evolution

Reading Time: 4 minutes Nowadays data is constantly evolving and changing. As well as the business problems and requirements are evolving, the shape or the structure of the data is also changing. When that happens, we want to be in control of how the data or schema changes. But how we can achieve this? Delta Lake has good ways to control how schema changes. With Delta Lake, users have Continue Reading