spark

Concept of UDF in Spark: User-Defined Function

Reading Time: 3 minutes

Spark ships with a wide variety of built-in functions for transforming a DataFrame into the output you want, but sometimes you may find that none of them does what you need. Then what? In that case, you can define your own function, known as a UDF (User-Defined Function), which makes it possible to write your own …

ZIO Effects

Reading Time: 5 minutes

What are functional effects? Functional effects describe changes to the state of a program or system that result from executing a function. These effects can include updating variables, creating or destroying objects, modifying data structures, or triggering external events such as sending messages or making HTTP requests. They are an important aspect of functional programming, as they allow developers …

Build Enterprise Data Lake with AWS Cloud

Reading Time: 4 minutes

Data Lake: A data lake stores enterprise data in one common place, where data wranglers with analytical needs can access it. However, a data lake differs from a normal database: a data lake can store current and historical data from different systems in raw form for analysis, whereas a database stores current, updated data for …

Apache Spark Best Practices and Performance Tuning

Reading Time: 2 minutes

Apache Spark is a big data processing engine built on the model of in-memory computation. When dealing with large volumes of data, reducing memory use by even 1 MB per minute can add up to savings of thousands of dollars per month. Hence it becomes essential to learn the Spark best practices and optimization …

Deploy modes in Apache Spark

Reading Time: 2 minutes

Spark is an open-source engine for big data processing and analysis that is fast and easy to use. It has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow, which make it faster. It can run in either cluster mode or standalone mode and can also access diverse …

Different Types of JOIN in Spark SQL

Reading Time: 3 minutes

A join in Spark SQL combines two or more datasets, much like a table join in SQL-based databases. Spark works with data in the tabular form of Datasets and DataFrames. Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Joins scenarios …
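To make the difference between some of these join types concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API. The function names and sample rows are hypothetical, and the sketch ignores SQL null-matching rules. It only illustrates which rows each join type keeps.

```python
# Plain-Python sketch of join semantics; this is NOT Spark code.
# All function names and sample rows are hypothetical.

def inner_join(left, right, key):
    """Pair each left row with every right row that has the same key."""
    return [{**lrow, **rrow} for lrow in left for rrow in right
            if lrow[key] == rrow[key]]

def left_outer_join(left, right, key):
    """Like inner join, but keep unmatched left rows with None for right columns."""
    right_cols = {col for rrow in right for col in rrow if col != key}
    result = []
    for lrow in left:
        matches = [rrow for rrow in right if rrow[key] == lrow[key]]
        if matches:
            result.extend({**lrow, **rrow} for rrow in matches)
        else:
            result.append({**lrow, **{col: None for col in right_cols}})
    return result

def left_semi_join(left, right, key):
    """Keep left rows that have a match; only left columns are returned."""
    right_keys = {rrow[key] for rrow in right}
    return [lrow for lrow in left if lrow[key] in right_keys]

def left_anti_join(left, right, key):
    """Keep left rows that have no match on the right."""
    right_keys = {rrow[key] for rrow in right}
    return [lrow for lrow in left if lrow[key] not in right_keys]

employees = [
    {"dept_id": 1, "name": "Asha"},
    {"dept_id": 2, "name": "Ben"},
    {"dept_id": 3, "name": "Chen"},
]
departments = [
    {"dept_id": 1, "dept": "Data"},
    {"dept_id": 2, "dept": "Platform"},
]

print(inner_join(employees, departments, "dept_id"))
print(left_outer_join(employees, departments, "dept_id"))
print(left_semi_join(employees, departments, "dept_id"))
print(left_anti_join(employees, departments, "dept_id"))
```

In this sample, the employee with `dept_id` 3 has no matching department, so it is dropped by the inner and left semi joins, kept with a `None` department by the left outer join, and is the only row returned by the left anti join.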

The ecosystem of Apache Spark

Reading Time: 4 minutes

Apache Spark is a powerful alternative to Hadoop MapReduce, with several rich features such as machine learning, real-time stream processing, and graph computation. It is an open-source distributed cluster-computing framework designed to cover a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and streaming. Besides supporting all these workloads in one system, it reduces the management burden of …

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes

Introduction: Adaptive Query Execution (AQE) is one of the greatest features of Spark 3.0. It reoptimizes and adjusts query plans based on runtime statistics collected during the execution of the query.

Need for AQE: With each major release, Spark has introduced new optimization features to execute queries better and achieve greater performance. Before Spark 3.0, cost-based optimization used table statistics to determine the …

Spark Broadcast Variables Simplified With Example

Reading Time: 3 minutes

Welcome back, everyone. Today we will learn about a new yet important concept of Apache Spark called broadcast variables. For new learners, I recommend starting with a Spark introduction blog.

What is a broadcast variable? Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Imagine you want to make some information, …

Apache Spark Streaming Checkpointing

Reading Time: 2 minutes

Introduction: A Spark Streaming application needs to run 24/7. Hence, it must be resilient to failures unrelated to application logic, such as system failures and JVM crashes. Recovery should also be quick in case of any data loss. Spark Streaming achieves this with the help of checkpointing. With checkpointing, input DStreams can restore before failure …