Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Here are the Spark core components.
All the functionality provided by Apache Spark is built on top of Spark Core. Its most important feature is in-memory computation, which avoids the repeated disk reads and writes between stages that slow down MapReduce jobs.
RDD in Apache Spark
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is the fundamental data structure of Apache Spark. An RDD is an immutable collection of objects that is computed on different nodes of the cluster.
The second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Spark supports two types of shared variables: broadcast variables, which cache a read-only value on every node, and accumulators, which tasks can only add to (for example, counters and sums).
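To make the two kinds of shared variable concrete, here is a minimal sketch in Scala. It assumes a SparkSession is already available; the object name, the lookup table, and the accumulator name are made up for illustration and are not part of any Spark API.

```scala
import org.apache.spark.sql.SparkSession

object SharedVariablesExample {
  // Returns the resolved names and the number of codes missing from the lookup table.
  def resolve(spark: SparkSession, codes: Seq[String]): (Seq[String], Long) = {
    val sc = spark.sparkContext

    // Broadcast variable: a read-only value sent to each executor once,
    // instead of being shipped with every task.
    val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: tasks can only add to it; the driver reads the total.
    val missing = sc.longAccumulator("missingCodes")

    val rdd = sc.parallelize(codes)

    // Use the broadcast value inside a transformation.
    val names = rdd.map(code => lookup.value.getOrElse(code, "unknown")).collect().toSeq

    // Update the accumulator inside an action, where each task's update
    // is applied exactly once.
    rdd.foreach(code => if (!lookup.value.contains(code)) missing.add(1))

    (names, missing.value.longValue())
  }
}
```

The accumulator is updated in `foreach` rather than in `map` because updates made inside transformations may be applied more than once if a task is re-executed.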
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections in Apache Spark
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, and HBase.
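As a small illustration, the sketch below parallelizes a local Scala collection and sums it on the cluster. The object and method names are hypothetical, and the local master setting is only an assumption for trying it out on one machine.

```scala
import org.apache.spark.sql.SparkSession

object ParallelizeExample {
  // Distributes `data` as an RDD and adds up its elements in parallel.
  def sumOf(spark: SparkSession, data: Seq[Int]): Int = {
    val distData = spark.sparkContext.parallelize(data)
    distData.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process using all available cores.
    val spark = SparkSession.builder()
      .appName("ParallelizeExample")
      .master("local[*]")
      .getOrCreate()
    println(sumOf(spark, Seq(1, 2, 3, 4, 5)))
    spark.stop()
  }
}
```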
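For a text file, this could look like the sketch below, which uses SparkContext's textFile method. The object and method names are illustrative, and the path is whatever URI your storage uses (a local path, hdfs://, and so on).

```scala
import org.apache.spark.sql.SparkSession

object TextFileExample {
  // Sums the lengths of all lines in the file at `path`.
  // Assumes the file has at least one line; reduce fails on an empty RDD.
  def totalLineLength(spark: SparkSession, path: String): Int = {
    // Each element of the resulting RDD is one line of the file.
    val lines = spark.sparkContext.textFile(path)
    lines.map(_.length).reduce(_ + _)
  }
}
```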
RDDs support two major types of operations:
1) Transformations, which create a new dataset from an existing one.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. All transformations in Spark are lazy, in that they do not compute their results right away; they are computed only when an action requires a result. Other common transformations are filter, flatMap, distinct, union, intersection, groupByKey, reduceByKey, and join.
2) Actions, which return a value to the driver program after running a computation on the dataset. For example, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. Other common actions are collect, count, first, take, saveAsTextFile, and foreach.
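The difference between the two kinds of operation can be sketched with a word count. This is an illustrative example, not code from the original post; the object and method names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TransformationsAndActions {
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    val rdd = spark.sparkContext.parallelize(lines)

    // Transformations: nothing is computed here, Spark only records
    // how `counts` is derived from `rdd`.
    val counts = rdd
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: triggers the computation and returns the result to the driver.
    counts.collect().toMap
  }
}
```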
Apache Spark SQL, DataFrames and Datasets
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Spark SQL uses this extra information to perform additional optimizations.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.
While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
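A minimal sketch of a Dataset, a DataFrame, and a SQL query working together is shown below. The Person case class, the view name, and the method name are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; defined at top level so Spark can derive an Encoder for it.
case class Person(name: String, age: Long)

object DatasetExample {
  def adultNames(spark: SparkSession, people: Seq[Person]): Seq[String] = {
    // Brings the implicit Encoders (for Person and String) into scope.
    import spark.implicits._

    // A typed Dataset built from a local collection.
    val ds: Dataset[Person] = people.toDS()

    // The same data viewed as a DataFrame with columns "name" and "age".
    val df = ds.toDF()
    df.createOrReplaceTempView("people")

    // Query the view with SQL and read the single column back as strings.
    spark.sql("SELECT name FROM people WHERE age >= 18 ORDER BY name")
      .as[String]
      .collect()
      .toSeq
  }
}
```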
Spark SQL supports two different methods for converting existing RDDs into Datasets.
- Inferring the Schema Using Reflection
- Programmatically Specifying the Schema
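The two methods listed above can be sketched side by side. This assumes input lines of the form "name, age"; the Employee case class, the field layout, and the method names are assumptions made for the example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type used by the reflection-based method.
case class Employee(name: String, age: Int)

object RddToDatasetExample {
  // Method 1: the schema is inferred by reflection from the case class fields.
  def viaReflection(spark: SparkSession, raw: Seq[String]): Seq[String] = {
    import spark.implicits._
    val ds = spark.sparkContext.parallelize(raw)
      .map(_.split(","))
      .map(f => Employee(f(0).trim, f(1).trim.toInt))
      .toDS()
    ds.filter(_.age > 30).map(_.name).collect().toSeq
  }

  // Method 2: the schema is built programmatically and applied to an RDD of Rows.
  // Useful when the columns are not known until runtime.
  def viaSchema(spark: SparkSession, raw: Seq[String]): Seq[String] = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    val rows = spark.sparkContext.parallelize(raw)
      .map(_.split(","))
      .map(f => Row(f(0).trim, f(1).trim.toInt))
    val df = spark.createDataFrame(rows, schema)
    df.filter(df("age") > 30).collect().map(_.getString(0)).toSeq
  }
}
```

The reflection-based method is more concise when the record type is known at compile time; the programmatic method is more verbose but works when the structure is only known at runtime.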
How to Submit a Job
1) Create an sbt project with the following dependency:
2) Create the file SimpleApp.scala
3) Create a package, which produces a jar in the target/ folder
--class "SimpleApp" \
--master local \
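The contents of SimpleApp.scala are not shown above, so the following is only a sketch of what such a file might contain. It assumes the sbt project depends on the spark-sql artifact at a version matching your Spark installation, and that the input file path is passed as the first program argument; both are assumptions of this example.

```scala
import org.apache.spark.sql.SparkSession

object SimpleApp {
  // Counts the lines in the file at `path` that contain `needle`.
  def countLinesContaining(spark: SparkSession, path: String, needle: String): Long =
    spark.read.textFile(path).filter(line => line.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    // Assumption: the input file is given as the first argument.
    val path = args(0)
    // The master URL is not set here; it is supplied by spark-submit via --master.
    val spark = SparkSession.builder().appName("Simple Application").getOrCreate()
    println(s"Lines with a: ${countLinesContaining(spark, path, "a")}")
    println(s"Lines with b: ${countLinesContaining(spark, path, "b")}")
    spark.stop()
  }
}
```

After packaging, the job would be submitted with spark-submit using the --class and --master options, followed by the path to the generated jar and the input file. The jar's exact name under target/ depends on your project name and Scala version.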
Also published on Medium.