What is Apache Spark?
- Apache Spark is an open-source data processing engine that processes large volumes of data, in batches or in near real time, across clusters of computers using simple programming constructs.
- Spark has consistent and composable APIs and supports multiple languages, including Python, Java, Scala, and R.
- Developers and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data at large scale.
Hadoop vs Apache Spark
| Hadoop (MapReduce) | Apache Spark |
|---|---|
| 1. Processing data with MapReduce in Hadoop is comparatively slow. | 1. Spark can process data up to 100 times faster than MapReduce because the work is done in memory. |
| 2. Performs batch processing of data only. | 2. Performs both batch processing and real-time processing of data. |
| 3. With MapReduce, developers have to hand-code every operation, which makes it difficult to work with. | 3. Spark is easy to program because it offers many high-level operators on RDDs (Resilient Distributed Datasets). |
| 4. Hadoop is the cheaper option in terms of cost. | 4. Spark needs a lot of RAM to run in memory, which increases the size of the cluster and therefore the cost. |
Features of Apache Spark
- Fast processing: Spark uses Resilient Distributed Datasets (RDDs), which cut the time spent on read and write operations, so it can run roughly ten to a hundred times faster than Hadoop MapReduce.
- In-memory computing: Spark keeps data in RAM, so it can access the data quickly, which speeds up analytics.
- Flexible: Spark supports multiple languages and lets developers write applications in Java, Scala, R, or Python.
- Fault tolerance: RDDs are designed to handle the failure of any worker node in the cluster. Lost data can be rebuilt, so a node failure does not have to result in data loss.
- Better analytics: Spark offers a rich set of SQL queries, machine learning algorithms, and complex analytics. With all of these in one engine, analytics work becomes easier to carry out.
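The in-memory and fault-tolerance points above can be pictured with a small sketch. This is plain Python, not Spark's API: `source`, `lineage`, `cache`, `build_partition`, and `get_partition` are hypothetical names invented for illustration, and the sketch assumes the original input is still available so a lost result can be recomputed from it.

```python
# Toy model (not Spark code): results live in memory, and a lost
# partition is rebuilt by replaying the recorded steps on its input.

def build_partition(source_rows, lineage):
    """Apply the recorded steps, in order, to one partition's input rows."""
    rows = list(source_rows)
    for step in lineage:
        rows = [step(row) for row in rows]
    return rows

# Assumed input data, split into three partitions (partition id -> rows).
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

# The recorded transformations: multiply by 10, then add 1.
lineage = [lambda x: x * 10, lambda x: x + 1]

# Computed results kept in memory, as workers would hold them in RAM.
cache = {pid: build_partition(rows, lineage) for pid, rows in source.items()}

# Simulate a worker failure: the in-memory copy of partition 1 is gone.
del cache[1]

def get_partition(pid):
    """Return a partition, recomputing it from its input if it was lost."""
    if pid not in cache:
        cache[pid] = build_partition(source[pid], lineage)
    return cache[pid]

print(get_partition(1))  # [31, 41]
```

Reads are served from memory when the partition is present, and only the missing partition is recomputed after the simulated failure.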
Components of Spark
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
- Memory management
- Fault recovery
- Scheduling, distributing, and monitoring jobs on a cluster
- Interacting with storage systems
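RDD (Resilient Distributed Dataset)
Spark Core is built around RDDs: immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs support two kinds of operations: transformations, which define a new RDD from an existing one, and actions, which compute a result and return it.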
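To make the two kinds of RDD operations concrete, here is a minimal single-machine sketch in plain Python. `ToyRDD` is a hypothetical class written for this post, not Spark's RDD class. It only mimics the idea that transformations such as `map` and `filter` record work and return a new object, while actions such as `collect` and `reduce` actually run it.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD; an illustration, not Spark's implementation."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions  # input data, already split into partitions
        self._steps = steps            # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._partitions, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._steps + (("filter", predicate),))

    def _compute(self, partition):
        """Run every recorded step over one partition."""
        items = iter(partition)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation and return a plain Python value.
    def collect(self):
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(squares_of_evens.collect())                   # [4, 16, 36]
print(squares_of_evens.reduce(lambda a, b: a + b))  # 56
print(numbers.collect())                            # [1, 2, 3, 4, 5, 6]
```

The last line shows the immutability mentioned above: building `squares_of_evens` leaves `numbers` unchanged.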
Spark SQL is the component used for structured and semi-structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Spark Streaming is a lightweight API that allows developers to perform batch processing and real-time streaming of data with ease. It provides secure, reliable, and fast processing of live data streams.
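One way to picture how batch-style code can serve a live stream is to cut the stream into small batches and run the same batch function on each one. The sketch below is plain Python, not the Spark Streaming API. `micro_batches`, `count_words`, and `live_lines` are hypothetical names, and batches are cut by item count here purely to keep the example short.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size items from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(lines):
    """An ordinary batch job: count the words in a list of lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# Assumed sample input standing in for a live data source.
live_lines = ["spark streaming", "spark sql", "graphx", "spark core", "mllib"]

# The same batch function is applied to each small batch as it arrives.
results = [count_words(batch) for batch in micro_batches(live_lines, 2)]
print(results)
```

Because `count_words` knows nothing about streaming, the same function could be run once over a complete dataset as a regular batch job.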
MLlib is Spark’s low-level machine learning library. It is simple to use, scalable, and compatible with various programming languages, and it eases the development and deployment of scalable machine learning algorithms.
GraphX is Spark’s own graph computation engine, used to build and process graph-structured data.
In this blog, we covered the basics of Apache Spark. We looked at Spark architecture, Spark features, the components of Spark, and the differences between the Hadoop framework and Apache Spark.