Author: Chitra

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes. Introduction: Adaptive Query Execution (AQE) is one of the headline features of Spark 3.0. It reoptimizes and adjusts query plans based on runtime statistics collected during query execution. Need for AQE: Each major release of Spark has introduced new optimization features to execute queries more efficiently and achieve better performance. Before Spark 3.0, cost-based optimization used table statistics to determine the … Continue Reading

What is ETL (Extract, Transform, Load) in Big Data?

Reading Time: 3 minutes. Overview: In the current market, processing structured and unstructured data has become a crucial part of running an effective business. Businesses are investing more in data processing because the volume and variety of raw data are increasing rapidly. Over the last decade, the ETL (Extract, Transform, Load) process has proven valuable for keeping business processes running smoothly. Data mining and … Continue Reading

Spark Broadcast Variables Simplified With Example

Reading Time: 3 minutes. Welcome back, everyone. Today we will learn about a new yet important concept in Apache Spark: broadcast variables. For new learners, I recommend starting with a Spark introduction blog. What is a Broadcast Variable? Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Imagine you want to make some information, … Continue Reading

Introduction to Apache Kafka Terminology

Reading Time: 4 minutes. What is Apache Kafka? Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message-based topics. Kafka provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions. Why would you use Kafka? Apache Kafka is capable of handling millions of messages per second. Kafka … Continue Reading

File Handling and Operations in Scala

Reading Time: 3 minutes. Today we will look into Scala file I/O operations, starting with the basic definition. File handling is a way to store fetched information in a file. File operations mainly include reading data from files and writing data to files. Here we will look at programs that read and write files in Scala. Creating and Writing to a File: Scala doesn’t provide file writing … Continue Reading

Advanced Spark SQL Joins: an Optimization Technique

Reading Time: 4 minutes. Welcome back to another important topic in Apache Spark. Today we will learn about joins, one of the optimization techniques used in Spark. Apache Spark supports many types of joins: a few are regular join types, and others are advanced join types. For details about the regular ones, please refer to the link. Let’s start with what optimization is in Spark, and all the … Continue Reading

Welcome to the world of Apache Spark

Reading Time: 5 minutes. Welcome to another very important and interesting big data topic: Apache Spark. What is Apache Spark? Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources. Why would you want to use Spark? Spark has some … Continue Reading

Dynamic Partitioning in Apache Hive

Reading Time: 3 minutes. Introduction: We are back with another important big data concept: dynamic partitioning in Hive. Before moving to dynamic partitioning, we should know about static partitioning, which I explained in the blog on static partitioning. Now it’s time to dive deep into the dynamic kind. How Dynamic Differs from Static Partitioning: In this kind of partitioning, column values are known only at execution time. User is … Continue Reading

Overview of Static Partitioning in Apache Hive

Reading Time: 4 minutes. What is Partitioning? In simple words, partitioning is the process of dividing something into sections or parts to make it easier to understand and manage. Apache Hive allows us to organize a table into multiple partitions, where we can group the same kind of data together. It is used for distributing the load horizontally, which also helps to increase query … Continue Reading

Best Way of Optimization: Bucketing in Hive

Reading Time: 4 minutes. Apache Hive is an open-source data warehouse system used to query and analyze large datasets. Data in Apache Hive can be categorized into the following three parts: tables, partitions, and buckets. What is Bucketing in Hive? Bucketing in Hive is the concept of breaking data down into ranges, which are known as buckets, to give extra structure to the data so it may be … Continue Reading


Apache Nifi – The Ingestion tool

Reading Time: 3 minutes. What is Apache NiFi? Apache NiFi is open-source software for automating and managing the flow of data between systems, leveraging the concept of Extract, Transform, and Load. Apache NiFi is a powerful and reliable system for processing and distributing data. Additionally, Apache NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows. History of Apache NiFi: Based on … Continue Reading