Brief Introduction to Apache Spark


What is Apache Spark:

Apache Spark is an open-source, distributed data processing engine that can process large volumes of data, in both batch and real time, across clusters of computers using simple programming constructs. It provides high-level APIs in Scala, Python, Java, and R.
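To make this concrete, here is a minimal sketch of a Spark application in Scala. It assumes Spark is on the classpath; the application name is made up, and `local[*]` simply runs the job on all local cores:

```scala
import org.apache.spark.sql.SparkSession

object SparkIntro {
  // Sums 1..n by distributing the range across the cluster
  def distributedSum(n: Int): Long = {
    val spark = SparkSession.builder()
      .appName("SparkIntro") // hypothetical application name
      .master("local[*]")    // run locally on all cores for this sketch
      .getOrCreate()
    try spark.sparkContext.parallelize(1L to n.toLong).reduce(_ + _)
    finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"Sum of 1..100 = ${distributedSum(100)}")
}
```

The same program runs unchanged on a real cluster once the master URL points at a cluster manager instead of `local[*]`.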

Spark Architecture:

Spark follows a master/worker architecture. A driver program creates a SparkContext, which connects to a cluster manager to acquire resources. The cluster manager launches executors on the worker nodes, and the driver then sends tasks to those executors for processing.

Uses of Apache Spark:

It is used for:

  • Data processing applications
  • Batch processing
  • Processing structured data
  • Machine learning
  • Processing graph data
  • Processing streaming data

Features of Apache Spark:

  • In-memory processing: loads data into memory and computes results there, avoiding repeated disk I/O
  • Tight integration of components
  • Easy and inexpensive
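The in-memory feature can be sketched as follows: caching an RDD keeps it in executor memory, so the second action reuses the computed data instead of recomputing it from scratch (the names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  // Computes the sum and count of the first n squares, reusing a cached RDD
  def sumAndCount(n: Int): (Long, Long) = {
    val spark = SparkSession.builder()
      .appName("CacheDemo")
      .master("local[*]")
      .getOrCreate()
    try {
      val squares = spark.sparkContext.parallelize(1 to n).map(i => i.toLong * i)
      squares.cache()                    // ask Spark to keep this RDD in memory
      val total = squares.reduce(_ + _)  // first action computes and caches it
      val count = squares.count()        // second action reads from the cache
      (total, count)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(sumAndCount(1000))
}
```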

Cluster Managers:

A cluster manager is the platform (cluster mode) on which Spark runs. Simply put, the cluster manager allocates resources to the worker nodes as needed and coordinates them. A cluster consists of one master node and several worker nodes.

  • Hadoop YARN: the resource manager of a Hadoop cluster
  • Apache Mesos: a general cluster computing platform
  • Standalone scheduler: Spark's built-in manager, useful for installing Spark on an empty set of machines
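The cluster manager is selected through the master URL passed to Spark; the host names and ports below are placeholders:

```scala
import org.apache.spark.SparkConf

object MasterUrls {
  // Builds a SparkConf for a given cluster manager; the app name is made up
  def confFor(master: String): SparkConf =
    new SparkConf().setAppName("MyApp").setMaster(master)

  def main(args: Array[String]): Unit = {
    val standalone = confFor("spark://master-host:7077") // standalone scheduler
    val yarn       = confFor("yarn")                     // Hadoop YARN
    val mesos      = confFor("mesos://mesos-host:5050")  // Apache Mesos
    val local      = confFor("local[*]")                 // local mode, for development
    Seq(standalone, yarn, mesos, local).foreach(c => println(c.get("spark.master")))
  }
}
```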

Storage layers for Spark:

This layer tells us where Spark can read data from and write processed results to.

  • HDFS and other storage systems supported by the Hadoop APIs (local file system, Amazon S3, Cassandra, etc.)
  • Supports text files, Avro, Parquet, and many other Hadoop input formats
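Reading looks much the same regardless of the backing store: the scheme in the path selects the storage system. A small sketch, using a temporary local file (the HDFS and S3 paths in the comments are placeholders):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object StorageLayers {
  // Writes a few lines to the local file system, then reads them back with Spark.
  // The same read calls work against HDFS or S3 by changing the path scheme, e.g.
  //   spark.read.parquet("hdfs://namenode:8020/data/events.parquet") // placeholder
  //   spark.read.json("s3a://my-bucket/logs/")                       // placeholder
  def localRoundTrip(lines: Seq[String]): Long = {
    val path = Files.createTempFile("spark-demo", ".txt")
    Files.write(path, lines.mkString("\n").getBytes)
    val spark = SparkSession.builder()
      .appName("StorageLayers")
      .master("local[*]")
      .getOrCreate()
    try spark.read.textFile(path.toString).count()
    finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(localRoundTrip(Seq("a", "b", "c"))) // 3 lines in, 3 rows out
}
```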

Spark Core:

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across the machines of a cluster.

  • Task scheduling
  • Memory management
  • Fault recovery
  • Interacting with storage systems

Spark Core is also home to the API that defines RDDs.
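A small sketch of that RDD API, counting words over an in-memory collection (the input lines are made up):

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  // Counts words in the given lines using the classic RDD transformations
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    val spark = SparkSession.builder()
      .appName("RddBasics")
      .master("local[*]")
      .getOrCreate()
    try {
      spark.sparkContext
        .parallelize(lines)       // partition the data across the cluster
        .flatMap(_.split("\\s+")) // one record per word
        .map(word => (word, 1))   // pair each word with a count of 1
        .reduceByKey(_ + _)       // sum the counts per word
        .collect()                // bring the small result back to the driver
        .toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("spark is fast", "spark is fun")))
}
```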

Spark SQL:

  • It is used to process structured data.
  • Supports many data sources, including Hive tables, Parquet, and JSON.
  • Allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Scala, and Java.
  • Shark was the older SQL-on-Spark project.
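Intermixing SQL with programmatic manipulation might look like this (the table, column names, and data are made up):

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  // Registers an in-memory dataset as a SQL view and queries it
  def adultCount(): Long = {
    val spark = SparkSession.builder()
      .appName("SqlDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical data; in practice this could come from Hive, Parquet, or JSON
      val people = Seq(("Alice", 34), ("Bob", 17), ("Carol", 29)).toDF("name", "age")
      people.createOrReplaceTempView("people")

      // SQL and DataFrame operations can be freely intermixed
      spark.sql("SELECT name FROM people WHERE age >= 18").count()
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"adults: ${adultCount()}")
}
```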

Spark Streaming:

Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
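A streaming word count can be sketched as below. The socket source, host, and port are placeholders for illustration; Kafka, Flume, or Kinesis plug in the same way through their own input DStreams:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  // Wires a word-count pipeline over a DStream; the source is plugged in by the caller
  def wordCounts(lines: DStream[String]): DStream[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // One micro-batch per second; "local[2]" leaves a core free for the receiver
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: a plain text socket on a placeholder host and port
    val counts = wordCounts(ssc.socketTextStream("localhost", 9999))
    counts.print()         // push each batch's result to stdout (or a DB/dashboard)

    ssc.start()            // begin receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```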

Conclusion:

This blog is only a theoretical introduction for learning purposes, and these basics will come in handy in plenty of places as we learn Spark. So I would say, keep exploring Spark and keep learning, as it has very good scope in the field of IT. That's all I have in mind on the topic for now.

If you want to add anything, or you do not agree with my view on any point, drop me a comment. I will be happy to discuss.


Written by Rituraj Khare

Rituraj Khare is a Software Consultant at Knoldus Inc. in Noida. He did his B.Tech. from Dr. A.P.J. Abdul Kalam Technical University. He is familiar with Scala, Python, unit testing, Git, Kafka, Docker, and Jenkins. He is currently working in the Scala practice area. He loves to dig deep into coding and loves to play indoor games, especially chess.