Reading Time: 4 minutes This blog pertains to Apache Spark, where we will understand how Spark’s Driver and Executors communicate with each other to process a given job. So let’s get started. First, let’s see what Apache Spark is. The official definition of Apache Spark says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation processing engine where the data is Continue Reading
Reading Time: 4 minutes In this blog, we’ll be learning about Spark, its architecture and its components, the working of the Spark architecture, etc. What is Spark? Apache Spark is an open-source framework that processes large amounts of unstructured, semi-structured, and structured data for analytics. Its architecture is regarded as an alternative to the Hadoop and MapReduce architectures for big data processing. The Continue Reading
Reading Time: 2 minutes We all know that Apache Spark is a super-fast cluster computing technology that is designed for fast computation on large volumes of big data. Spark’s ability to do this is dependent on its design philosophy, which centers around four key characteristics: speed, ease of use, modularity, and extensibility. So let us dive into this design philosophy and see how Apache Spark operates. Spark’s Components & Architecture Continue Reading
Reading Time: 6 minutes This blog pertains to Apache Spark and YARN (Yet Another Resource Negotiator), where we will understand how Spark runs on YARN with HDFS. So let’s get started. First, let’s see what Apache Spark is. The official definition of Apache Spark says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation processing engine where the data is kept Continue Reading
Reading Time: 3 minutes Once you have downloaded Spark, you are ready with the Spark shell and have executed some short code examples. After that, to understand what’s happening behind your sample code, you should be familiar with some of the critical concepts of a Spark application. Some important terms used are: Application: A user program built on Spark using its APIs. It consists of a driver program and executors on the Continue Reading
Reading Time: 3 minutes Welcome back everyone. Today we will learn about a new yet important concept of Apache Spark called broadcast variables. For new learners, I recommend starting with a Spark introduction blog. What is a Broadcast Variable? Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Imagine you want to make some information, Continue Reading
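The idea in that teaser can be sketched without Spark at all. Below is a minimal plain-Python model of the broadcast pattern: the lookup table is shipped to a worker once and cached there, so every task on that worker reads the same copy instead of receiving its own serialized copy. The `worker_cache`, `broadcast`, and `run_task` names are illustrative, not Spark APIs.

```python
# Conceptual sketch (plain Python, no Spark required): a broadcast-style
# read-only value is cached once per worker and reused by every task,
# instead of being serialized and shipped with each individual task.
worker_cache = {}  # one read-only copy cached per worker

def broadcast(worker_id, value):
    """Ship the value to a worker once; later tasks reuse the cached copy."""
    worker_cache.setdefault(worker_id, value)

def run_task(worker_id, record):
    lookup = worker_cache[worker_id]  # tasks read the worker-local cached copy
    return lookup.get(record, "unknown")

country_codes = {"IN": "India", "US": "United States"}
broadcast("worker-1", country_codes)

results = [run_task("worker-1", code) for code in ["IN", "US", "FR"]]
print(results)  # ['India', 'United States', 'unknown']
```

The saving is proportional to task count: three tasks above share one copy of `country_codes`, where naive shipping would have sent three.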
Reading Time: 2 minutes Introduction A Spark Streaming application needs to run 24/7. Hence, it must be resilient to failures unrelated to the application logic, such as system failures, JVM crashes, etc. Recovery should also be speedy in case of any loss of data. Spark Streaming achieves this with the help of checkpointing. With its help, input DStreams can be restored before failure Continue Reading
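The recovery mechanism described above can be sketched in plain Python: state is persisted to disk as batches are processed, so a restarted process resumes from the last checkpoint rather than losing everything. The file layout and function names here are illustrative only; Spark checkpoints on a configurable interval to a reliable store such as HDFS.

```python
# Conceptual sketch (plain Python): checkpoint streaming state to disk so a
# restarted process can resume from the last saved state after a crash.
import json
import os
import tempfile

checkpoint_dir = tempfile.mkdtemp()
checkpoint_file = os.path.join(checkpoint_dir, "state.json")

def process_batch(state, batch):
    state["count"] += len(batch)
    # Checkpoint after the batch; Spark does this on a configurable interval.
    with open(checkpoint_file, "w") as f:
        json.dump(state, f)
    return state

state = {"count": 0}
state = process_batch(state, ["a", "b"])
state = process_batch(state, ["c"])

# Simulate a JVM crash: the in-memory state is lost...
del state
# ...but a restarted application recovers from the checkpoint.
with open(checkpoint_file) as f:
    recovered = json.load(f)
print(recovered)  # {'count': 3}
```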
Reading Time: 5 minutes Welcome to another very important and interesting topic in big data: Apache Spark. What is Apache Spark? Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources. Why would you want to use Spark? Spark has some Continue Reading
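The "splitting the work up into chunks" idea can be sketched in plain Python, with a thread pool standing in for a cluster of executors: the data is partitioned, each partition is processed in parallel, and the driver combines the partial results. The helper names are illustrative, not Spark APIs.

```python
# Conceptual sketch (plain Python): split work into chunks and assign the
# chunks across computational resources, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def chunked(data, size):
    return [data[i:i + size] for i in range(0, len(data), size)]

def sum_chunk(chunk):
    return sum(chunk)

data = list(range(1, 101))
partitions = chunked(data, 25)          # 4 partitions of 25 numbers each

with ThreadPoolExecutor() as pool:      # workers process partitions in parallel
    partial_sums = list(pool.map(sum_chunk, partitions))

total = sum(partial_sums)               # the "driver" combines partial results
print(total)  # 5050
```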
Reading Time: 4 minutes What is RDD in Spark? RDD stands for “Resilient Distributed Dataset”. An RDD in Apache Spark is a data structure: an immutable collection of objects computed on the different nodes of the cluster. Resilient, i.e. fault-tolerant: if any node fails, the lost data can be recomputed on another executor node. Distributed, since data Continue Reading
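The "resilient" property can be sketched in plain Python: model an RDD as an immutable, partitioned collection that remembers how it was derived (its lineage), so a lost partition can be recomputed deterministically instead of being restored from a backup. The `TinyRDD` class is a toy model, not the Spark API.

```python
# Conceptual sketch (plain Python): an RDD as an immutable, partitioned
# collection that remembers its lineage (source data + transformation),
# so any lost partition can be recomputed on another node.
class TinyRDD:
    def __init__(self, source, transform, num_partitions=3):
        self.source = list(source)     # lineage: the parent data
        self.transform = transform     # lineage: how partitions are derived
        self.num_partitions = num_partitions

    def compute_partition(self, idx):
        # Deterministic recomputation: partition idx can always be rebuilt.
        part = self.source[idx::self.num_partitions]
        return [self.transform(x) for x in part]

rdd = TinyRDD(range(10), lambda x: x * x)
partitions = [rdd.compute_partition(i) for i in range(3)]

partitions[1] = None                      # simulate losing a partition on a failed node
partitions[1] = rdd.compute_partition(1)  # rebuild it from lineage
print(partitions[1])  # [1, 16, 49]
```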
Reading Time: 4 minutes Welcome back to another important topic of Apache Spark. Today we will learn about one of the optimization techniques used in Spark, called joins. Apache Spark supports many types of joins; a few come under the regular join types and others are advanced join types. To know the details about the regular ones, please refer to the link. Let’s start with what optimization in Spark is, and all the Continue Reading
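One of the join optimizations that post is likely to cover is the broadcast hash join, which can be sketched in plain Python: the small table is broadcast to every executor and turned into a hash map, so the large table is joined in a single pass with no shuffle. The table contents here are illustrative.

```python
# Conceptual sketch (plain Python) of a broadcast hash join: build a hash
# map from the small (broadcast) table once, then stream the large table
# through it in a single pass, with no shuffle of the large table.
small = [(1, "electronics"), (2, "books")]        # small dimension table
large = [("tv", 1), ("novel", 2), ("radio", 1)]   # large fact table

lookup = dict(small)                               # hash map, built once per executor

joined = [(item, key, lookup[key]) for item, key in large if key in lookup]
print(joined)
# [('tv', 1, 'electronics'), ('novel', 2, 'books'), ('radio', 1, 'electronics')]
```

This is why broadcasting pays off only when one side is small: every executor must hold the whole hash map in memory.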
Reading Time: 3 minutes The goal of this blog is to define the processes to make the log4j configuration file configurable for debugging purposes.
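As a rough sketch of what such a configuration looks like, here is a minimal `log4j.properties` in the style Spark 2.x ships under `conf/` (newer Spark releases, 3.3+, use a `log4j2.properties` instead). The package name and levels are illustrative:

```properties
# Minimal sketch of a Spark conf/log4j.properties (log4j 1.x era).
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Turn on DEBUG for your own packages only, keeping Spark internals quieter.
log4j.logger.com.example.myapp=DEBUG
```

To make the file configurable per run, a custom path can be passed at submit time, e.g. via `--driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties"`.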
Reading Time: 4 minutes Spark provides two types of shared variables for distributed computing, accessible to all the nodes in a Spark cluster: broadcast variables and accumulators.
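The accumulator half of that pair can be sketched in plain Python: tasks may only *add* to the shared variable, and the merged total is read back on the driver side, typically for counters and sums such as tallying malformed records. The `Accumulator` class and `parse` helper here are a toy model, not the Spark API.

```python
# Conceptual sketch (plain Python): an accumulator is a write-only-for-workers
# shared variable; tasks add to it, and the driver reads the merged total.
class Accumulator:
    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):        # workers may only add
        self.value += amount

bad_records = Accumulator(0)

def parse(line, acc):
    try:
        return int(line)
    except ValueError:
        acc.add(1)                # count malformed lines without collecting them
        return None

lines = ["10", "oops", "7", "??"]
parsed = [parse(line, bad_records) for line in lines]
print(bad_records.value)  # 2  (read on the driver side)
```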
Reading Time: 3 minutes Whenever we submit a Spark application to the cluster, the Driver (the Spark application master) gets started, and the Driver starts N workers. The Spark driver manages the SparkContext object to share data, and coordinates with the workers and the cluster manager across the cluster. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers will Continue Reading