Search Results for: spark

MapReduce vs Spark

Reading Time: 3 minutes What is MapReduce in big data: MapReduce is a programming model for processing large data sets in parallel across a cluster of computers. It is a key technology for handling big data. The model consists of two key functions: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value Continue Reading

Introduction to Spark Architecture

Reading Time: 4 minutes In this blog, we’ll learn about Spark, its architecture and components, and how the architecture works. What is Spark? Apache Spark is an open-source framework that processes large amounts of unstructured, semi-structured, and structured data for analytics. Its architecture is regarded as an alternative to Hadoop and MapReduce architectures for big data processing. The Continue Reading

Joins in Spark SQL with examples

Reading Time: 4 minutes Spark SQL Spark SQL is a module in Apache Spark. It allows users to process structured data using a SQL-like syntax. It integrates seamlessly with the Spark ecosystem, including Spark Streaming and MLlib. One of the main benefits of using Spark SQL is that it lets users integrate SQL queries with the programming language of their choice, such as Scala, Python, or Java. Continue Reading

Transformation with Examples: Spark RDDs

Reading Time: 3 minutes Transformation is one of the RDD operations in Spark. Before moving on to it, let’s first discuss what Spark and RDDs actually are. What is Spark? Apache Spark is an open-source cluster computing framework. Its main objective is to manage data created in real time. Hadoop MapReduce was the foundation upon which Spark was developed. Unlike competing methods like Hadoop’s MapReduce, which writes and reads data Continue Reading

Concept of UDF in Spark: User-Defined Function

Reading Time: 3 minutes As we all know, Spark contains a wide variety of built-in functions with which you can apply almost any transformation to your DataFrame and achieve your desired output, but sometimes you may find that they don’t meet your requirements. Then what? In that case, you can define your own function, known as a UDF (User-Defined Function), which makes it possible to write your own Continue Reading

Deploy modes in Apache Spark

Reading Time: 2 minutes Spark is an open-source engine known for its speed and ease of use in big data processing and analysis. Spark has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow, which make it faster. It can run in either cluster mode or standalone mode and can also access diverse Continue Reading

Apache Spark Best Practices and Performance Tuning

Reading Time: 2 minutes We all know that Apache Spark is a big data processing engine that works on the model of in-memory computation. When we are dealing with extensive data, reducing memory use by even 1 MB per minute can add up to savings of thousands of dollars per month. Hence it becomes essential to learn the Spark best practices and optimization Continue Reading

Introduction to Apache Spark

Reading Time: 3 minutes Hey guys, welcome to this fresh blog on Apache Spark. In this blog, we’ll learn what Apache Spark is, its importance in the industry, how it compares with Hadoop, Spark’s evolution, its features, and much more. What is Apache Spark? Apache Spark is a data processing framework that can quickly perform processing tasks on very large datasets. It can also distribute data processing tasks across multiple Continue Reading

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes Introduction Adaptive Query Execution (AQE) is one of the greatest features of Spark 3.0. It reoptimizes and adjusts query plans based on runtime statistics collected during the execution of the query. Need for AQE With each major release, Spark has introduced new optimization features to execute queries better and achieve greater performance. Before Spark 3.0, cost-based optimization used table statistics to determine the Continue Reading

Receivers in Apache Spark Streaming

Reading Time: 2 minutes Receivers are special objects in Spark Streaming. A receiver’s goal is to consume data from data sources and move it into Spark. We create receivers through the streaming context as long-running tasks on different executors. We can build receivers by extending the abstract class Receiver. There are two methods to start and stop a receiver: onStart() This method contains all the important things like opening connections, creating threads, Continue Reading

The ecosystem of Apache Spark

Reading Time: 4 minutes Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features such as machine learning, real-time stream processing, and graph computations. It is an open-source distributed cluster-computing framework. It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of Continue Reading

Understanding Spark Application Concepts

Reading Time: 3 minutes Once you have downloaded Spark, set up the Spark shell, and run some short code examples, you should become familiar with some critical concepts of a Spark application to understand what’s happening behind your sample code. Some important terms are: Application: A user program built on Spark using its APIs. It consists of a driver program and executors on the Continue Reading