spark

Deploy modes in Apache Spark

Reading Time: 2 minutes. Spark is an open-source engine for big data processing and analysis that is fast and easy to use. It has built-in modules for graph processing, machine learning, streaming, SQL, and more. The Spark execution engine supports in-memory computation and cyclic data flow, which makes it faster. It can run in either cluster mode or standalone mode and can also access diverse ...

Different Types of JOIN in Spark SQL

Reading Time: 3 minutes. A join in Spark SQL combines two or more datasets, much like a table join in SQL-based databases. Spark works with data in tabular form, as Datasets and DataFrames. Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Joins scenarios ...
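To make a few of the join types listed above concrete, here is a minimal plain-Python sketch of their semantics. It is not Spark code and does not use the Spark API: the helper functions and the `employees` / `departments` sample rows are hypothetical, invented purely for illustration, and rows are modelled as dictionaries.

```python
# Plain-Python illustration of join semantics (NOT the Spark API).
# All names and sample rows below are hypothetical.

employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cid", "dept_id": 9},  # no matching department
]
departments = [
    {"dept_id": 1, "dept": "Data"},
    {"dept_id": 2, "dept": "Web"},
    {"dept_id": 3, "dept": "Ops"},  # no matching employee
]


def inner_join(left, right, key):
    """Keep only pairs of rows whose key values match."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_outer_join(left, right, key):
    """Keep every left row; fill right-side columns with None when unmatched."""
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out


def left_semi_join(left, right, key):
    """Keep left rows that have a match; return left columns only."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]


def left_anti_join(left, right, key):
    """Keep left rows that have no match on the right."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] not in keys]
```

With this sample data, the inner and left semi joins keep only Ann and Bob, the left outer join also keeps Cid with `dept` set to `None`, and the left anti join returns only Cid. A semi join differs from an inner join in that it never adds columns from the right side.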

The ecosystem of Apache Spark

Reading Time: 4 minutes. Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features such as machine learning, real-time stream processing, and graph computation. It is an open-source distributed cluster-computing framework designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of ...

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes. Introduction: Adaptive Query Execution (AQE) is one of the most significant features of Spark 3.0. It reoptimizes and adjusts query plans based on runtime statistics collected during query execution. Need for AQE: Each major release of Spark has introduced new optimization features to execute queries with better performance. Before Spark 3.0, cost-based optimization used table statistics to determine the ...

Spark Broadcast Variables Simplified With Example

Reading Time: 3 minutes. Welcome back, everyone. Today we will learn about a new yet important concept in Apache Spark: broadcast variables. For new learners, I recommend starting with a Spark introduction blog. What is a Broadcast Variable? Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Imagine you want to make some information, ...

Apache Spark Streaming Checkpointing

Reading Time: 2 minutes. Introduction: A Spark Streaming application needs to run 24/7. Hence, it must be resilient to failures unrelated to application logic, such as system failures and JVM crashes. Recovery should also be speedy in case of any data loss. Spark Streaming achieves this with the help of checkpointing. With checkpointing, input DStreams can restore before failure ...