Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab, open sourced in 2010, and later donated to the Apache Software Foundation.
Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm.
Apache Spark is an improvement on the original Hadoop MapReduce component of the Hadoop big data ecosystem.
Features Of Spark
- Supports more than just Map and Reduce functions.
- Lazy evaluation of big data queries, which helps optimize the overall data processing workflow (see the sketch after this list).
- Provides concise and consistent APIs in Scala, Java and Python.
- Offers an interactive shell for Scala and Python. This is not available for Java yet.
- Spark itself is written in the Scala programming language and runs on the Java Virtual Machine (JVM).
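To see lazy evaluation in action, here is a minimal sketch in Scala (it assumes an existing SparkContext named sc, as the Spark shell provides; the HDFS path is hypothetical):

// Transformations such as map and filter only build a lineage graph;
// nothing is read or computed yet.
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
val lengths = lines.map(_.length)
val longLines = lengths.filter(_ > 80)

// Only an action such as count() triggers execution, which lets Spark
// optimize the whole chain of transformations as a single job.
println(longLines.count())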
Spark Ecosystem
Beyond the Spark Core API, there are additional libraries in the Spark ecosystem that provide extra capabilities in the big data analytics and machine learning areas.
These libraries include:
- Spark Streaming: Spark Streaming can be used for processing real-time streaming data. It uses the DStream, which is essentially a series of RDDs, to process the real-time data.
- Spark SQL: Spark SQL provides the capability to expose Spark datasets over the JDBC API and to run SQL-like queries on Spark data using traditional BI and visualization tools (see the sketch below).
- MLlib: Spark MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, and collaborative filtering.
- GraphX: Spark GraphX provides an API for graphs and graph-parallel computation. GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
Outside of these libraries, there are others like BlinkDB and Tachyon.
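As a quick taste of Spark SQL, here is a minimal sketch in Scala (the SparkSession setup, the view name, and the sample data are all illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSqlSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Register a tiny in-memory dataset as a temporary view ...
val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// ... and query it with ordinary SQL.
spark.sql("SELECT name FROM people WHERE age > 30").show()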
Spark Architecture
- Data Storage: Spark uses HDFS for storing data by default, and works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.
- API: The API lets application developers create Spark-based applications using a standard interface. Spark provides APIs for the Scala, Java, and Python programming languages.
- Resource Management: Spark can be deployed as a standalone server or on a distributed computing framework such as YARN or Mesos (see the sketch below).
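The choice of resource manager shows up directly in how an application is configured: the master URL selects the deployment mode. A minimal sketch (application name and master values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL selects the deployment mode:
//   "local[*]"          -> run in-process on all cores (development)
//   "spark://host:7077" -> a standalone Spark cluster
//   "yarn"              -> a Hadoop YARN cluster
val conf = new SparkConf()
  .setAppName("DeploymentSketch")
  .setMaster("local[*]")
val sc = new SparkContext(conf)

// The application code stays the same regardless of the deployment mode.
val data = sc.parallelize(1 to 1000)
println(data.sum())
sc.stop()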
Steps to install Apache Spark on Linux
- Make sure that Java is installed on your system.
- Download Apache Spark from the official downloads page.
- Extract the downloaded Spark archive to a local directory.
- Set your Spark path in the .bashrc file by adding the following lines:
export SPARK_HOME="/PATH/TO/SPARK/spark-2.1.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
- Run source ~/.bashrc in the terminal to apply the changes to the .bashrc file.
- To verify the Spark installation, open a terminal, type spark-shell, and hit Enter.
- If Spark was installed correctly, you should see the following messages in the console output.
Spark context Web UI available at http://193.163.*.**:4040
Spark context available as 'sc' (master = local[*], app id = local-1492064000358).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
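Once the shell is up, a quick expression confirms that the context works. A minimal sketch typed at the scala> prompt (the echoed output lines are illustrative):

scala> val nums = sc.parallelize(1 to 100)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> nums.filter(_ % 2 == 0).count()
res0: Long = 50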
Many sources claim that Spark is 100x faster than MapReduce. Treat that figure with caution: the actual speedup also depends on the cluster itself, since it takes a cluster with ample RAM for Spark to keep data in memory and reach that kind of speed. Even so, Spark's other strengths definitely make it better than MapReduce.
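The in-memory part is the key: caching an RDD lets repeated computations skip the disk entirely. A minimal sketch (assuming an existing SparkContext sc; the log path is hypothetical):

val logs = sc.textFile("hdfs://namenode:9000/logs/app.log")
logs.cache()  // mark the RDD for in-memory storage on first use

// The first action reads from disk and populates the cache ...
val errors = logs.filter(_.contains("ERROR")).count()
// ... later actions over the same RDD are served from RAM.
val warnings = logs.filter(_.contains("WARN")).count()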