Spark

Apache Spark’s Developer-Friendly Structured APIs: DataFrames and Datasets

Reading Time: 3 minutes This is the second part of the blog series on Spark’s structured APIs, DataFrames and Datasets. In the first part we covered the DataFrame API, and I recommend reading that blog first if you are new to Spark. In this blog we’ll cover the Spark Datasets API, so let’s get started. The Datasets API Datasets also combine two characteristics: typed and untyped Continue Reading
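The typed side of the Datasets API is worth a quick look before diving into the full post. Below is a minimal sketch, with a hypothetical case class and sample data, of a typed Dataset in Scala, where each row is checked against a case class at compile time:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical domain type: each row is a Person, checked at compile time
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetSketch") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings encoders and toDS() into scope

    val people = Seq(Person("alice", 34), Person("bob", 19)).toDS()

    // Typed transformation: the lambda sees a Person, not an untyped Row
    people.filter(_.age >= 21).show()

    spark.stop()
  }
}
```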

DataFrames and Datasets: Apache Spark’s Developer-Friendly Structured APIs

Reading Time: 4 minutes This is a two-part blog series in which we’ll first cover the DataFrame API and, in the second part, Datasets. Spark 2.x introduced structure to Spark through two concepts: the first lets you express computation using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping; the second lets you order and structure your data in a Continue Reading
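Those patterns map directly onto the DataFrame API. Here is a hedged sketch, with made-up column names and data, of filtering, selecting, grouping, and aggregating in Scala:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameSketch") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: (name, team, age)
    val df = Seq(("alice", "dev", 34), ("bob", "ops", 19), ("carol", "dev", 42))
      .toDF("name", "team", "age")

    // Filtering, selecting, grouping, and aggregating in one pipeline
    df.filter($"age" > 20)
      .select($"name", $"team", $"age")
      .groupBy($"team")
      .agg(count("*").as("headcount"), avg($"age").as("avg_age"))
      .show()

    spark.stop()
  }
}
```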

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes Introduction Adaptive Query Execution (AQE) is one of the greatest features of Spark 3.0: it re-optimizes and adjusts query plans based on runtime statistics collected during the execution of the query. Need of AQE With each major release, Spark has introduced new optimization features to execute queries with greater performance. Before Spark 3.0, cost-based optimization used table statistics to determine the Continue Reading
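AQE is disabled by default in Spark 3.0, so you opt in per session. A minimal sketch, with a hypothetical app name, using the Spark 3.0 configuration keys:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AqeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AqeSketch") // hypothetical app name
      .master("local[*]")
      // Turn on adaptive query execution (off by default in Spark 3.0)
      .config("spark.sql.adaptive.enabled", "true")
      // Coalesce small shuffle partitions at runtime
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      // Split skewed partitions in sort-merge joins
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    // Any shuffle-heavy query from here on can be re-planned at runtime
    spark.range(1000000).groupBy((col("id") % 10).as("bucket")).count().show()

    spark.stop()
  }
}
```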

RDD to DataFrame Conversion in Spark

Reading Time: 2 minutes Overview In this tutorial, we’ll learn how to convert an RDD to a DataFrame in Spark. We’ll look into the details by calling each method with different parameters. Along the way, we’ll see some interesting examples that’ll help us understand the concepts better. RDD and DataFrame in Spark RDD and DataFrame are two major APIs in Spark for holding and processing data. RDD provides low-level APIs for processing distributed data. Continue Reading
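The two usual conversion routes are toDF() with inferred types and createDataFrame() with an explicit schema. A hedged sketch, with made-up column names and sample rows:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDfSketch") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDF() into scope

    val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 19)))

    // Option 1: toDF() names the columns; types are inferred from the tuples
    val df1 = rdd.toDF("name", "age")

    // Option 2: createDataFrame() pairs an RDD[Row] with an explicit schema
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    val df2 = spark.createDataFrame(rdd.map { case (n, a) => Row(n, a) }, schema)

    df1.show()
    df2.printSchema()

    spark.stop()
  }
}
```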

Welcome to the world of Apache Spark

Reading Time: 5 minutes Welcome to another very important and interesting topic in big data: Apache Spark. What is Apache Spark? Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process big data sets faster by splitting the work into chunks and assigning those chunks across computational resources. Why would you want to use Spark? Spark has some Continue Reading
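That “splitting the work into chunks” is visible even in a tiny local job: a collection is cut into partitions, and each partition is processed in parallel. A minimal sketch, with an arbitrary partition count and data:

```scala
import org.apache.spark.sql.SparkSession

object ChunksSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ChunksSketch") // hypothetical app name
      .master("local[4]")      // 4 local worker threads
      .getOrCreate()

    // Split 1..1,000,000 into 4 partitions; each is summed in parallel
    val nums = spark.sparkContext.parallelize(1 to 1000000, numSlices = 4)
    println(s"partitions = ${nums.getNumPartitions}, sum = ${nums.map(_.toLong).sum()}")

    spark.stop()
  }
}
```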

Abstraction in Java

Reading Time: 2 minutes What is Abstraction? Abstraction is one of the most important features of OOP (Object-Oriented Programming). There are four essential features in OOP: encapsulation, inheritance, abstraction, and polymorphism. Here we will discuss one of them: abstraction. It is the process of showing the necessary details to the user while hiding all the details of the implementation. In other words, we can say Continue Reading
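The post works through this in Java; as a rough sketch of the same idea in Scala, with made-up types and members, an abstract class exposes what an object can do while subclasses hide how it is done:

```scala
// Hypothetical example: the abstraction (Shape) shows *what* can be done;
// the subclass hides *how* it is done.
abstract class Shape {
  def area: Double // necessary detail exposed to the user
}

class Circle(radius: Double) extends Shape {
  // implementation detail hidden behind the Shape abstraction
  override def area: Double = math.Pi * radius * radius
}

object AbstractionSketch {
  def main(args: Array[String]): Unit = {
    val shape: Shape = new Circle(2.0) // the caller programs against Shape
    println(shape.area)
  }
}
```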

Getting Started with Spark 3

Reading Time: 4 minutes Introduction to Apache Spark Big data processing frameworks like Apache Spark provide an interface for programming clusters with data parallelism and fault tolerance. Apache Spark is broadly used for the speedy processing of large datasets. Apache Spark is an open-source platform, built by a broad group of software developers from more than 200 companies. Over 1,000 developers have contributed to Apache Spark since 2009. Continue Reading

Comparing Data Streaming Frameworks | Scala

Reading Time: 4 minutes We live in an era of technology where the amount of data is growing exponentially and every bit of data holds value. According to some reports, the number of bytes generated and stored in the world so far has already exceeded the count of stars in the sky. Since every bit is useful, it is very important to store data without losing any of it. When Continue Reading

Scala vs Python for Apache Spark: An In-depth Comparison

Reading Time: 5 minutes Imagine the first day of a new Apache Spark project. The project manager looks at the team and asks: which one do we choose, Scala or Python? So let’s start with “Scala vs Python for Spark”. You may wonder if this is a tricky question. What does enterprise demand say? Is this like asking iOS or Android? Is there a right or wrong answer? So Continue Reading

Install/Configure a Hadoop HDFS/YARN Cluster and Integrate Spark with It

Reading Time: 5 minutes In our current scenario, we have a 4-node cluster in which one node is the master (HDFS NameNode and YARN ResourceManager) and the other three are slave nodes (HDFS DataNodes and YARN NodeManagers). In this cluster, we have implemented Kerberos, which makes the cluster more secure. The Kerberos services are already running on a different server, which is treated as the KDC server. In all Continue Reading
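Once the cluster is up, the integration can be smoke-tested from Spark itself. A hedged sketch: the HDFS path is hypothetical, and it assumes HADOOP_CONF_DIR points at the cluster’s config files and a Kerberos ticket has already been obtained with kinit:

```scala
import org.apache.spark.sql.SparkSession

object YarnSmokeTest {
  def main(args: Array[String]): Unit = {
    // Assumes HADOOP_CONF_DIR / YARN_CONF_DIR are exported so Spark can
    // find the NameNode and ResourceManager, and that `kinit` has been run.
    val spark = SparkSession.builder()
      .appName("YarnSmokeTest") // hypothetical app name
      .master("yarn")
      .getOrCreate()

    // Read a file from HDFS to verify HDFS + YARN + Kerberos end to end
    val lines = spark.read.textFile("hdfs:///tmp/sample.txt") // hypothetical path
    println(s"line count = ${lines.count()}")

    spark.stop()
  }
}
```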

Spark SQL in Delta Lake 0.7.0

Reading Time: 3 minutes Nowadays, Delta Lake is a buzzword in the big data world, especially among Spark developers, because it resolves many of the issues found in the big data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. Continue Reading
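Delta Lake 0.7.0 pairs with Spark 3.0 and enables Spark SQL DDL/DML on Delta tables once two session configs are set. A minimal sketch, with a made-up table name and columns, assuming the delta-core 0.7.0 artifact is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object DeltaSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeltaSqlSketch") // hypothetical app name
      .master("local[*]")
      // Wire the Delta SQL parser and catalog into the session
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Plain Spark SQL against a Delta table (hypothetical schema)
    spark.sql("CREATE TABLE IF NOT EXISTS events (id LONG, name STRING) USING delta")
    spark.sql("INSERT INTO events VALUES (1, 'login'), (2, 'logout')")
    spark.sql("SELECT * FROM events").show()

    spark.stop()
  }
}
```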

Stateful Streaming in Spark

Reading Time: 4 minutes Apache Spark is a fast and general-purpose cluster computing system. In Spark, we can do batch processing as well as stream processing. It does near real-time processing, meaning it processes data in micro-batches. I discussed Spark Streaming in more detail in my previous blog. Now, in this blog, I’ll discuss Stateful Streaming in Spark. So let’s start!! What is Stateful Streaming? Continue Reading
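The classic entry point to stateful streaming is a running aggregate carried across micro-batches. A hedged sketch using the DStream API’s updateStateByKey, where the socket source on localhost:9999 and the checkpoint path are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
    ssc.checkpoint("/tmp/stateful-checkpoint") // state requires a checkpoint dir

    // Merge this batch's counts into the running total kept per key
    def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    ssc.socketTextStream("localhost", 9999) // hypothetical text source
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey(updateCount)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```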