Introduction to Apache Spark
Big Data processing frameworks like Apache Spark provide an interface for programming entire clusters with implicit data parallelism and fault tolerance. Apache Spark is widely used for the fast processing of large datasets.
Apache Spark is an open-source platform built by a broad community of software developers from more than 200 companies. More than 1,000 developers have contributed to Apache Spark since 2009.
Apache Spark provides superior capabilities for Big Data applications compared with other Big Data technologies such as Hadoop MapReduce.
The Apache Spark features are as follows:
- Holistic framework
Spark delivers a holistic, integrated framework for Big Data processing and supports a varied range of data sets, including batch data, text data, real-time streaming data, and graph data.
- Speed
Spark can run programs up to 100 times faster than Hadoop clusters when the data fits in memory, and over ten times faster when running on disk. Spark has an advanced DAG (Directed Acyclic Graph) execution engine that supports cyclic data flow and in-memory data sharing across DAGs, so different jobs can work with the same data (a small sketch appears after this list).
- Easy to use
Spark lets programmers write applications quickly in Java, Scala, or Python, using a built-in set of over 80 high-level operators.
- Enhanced support
Spark provides support for streaming data, SQL queries, graph data processing, and machine learning, in addition to Map and Reduce operations.
- Inter-platform operability
Apache Spark applications can run in the cloud or in standalone cluster mode. Spark provides access to varied data sources including HBase, Tachyon, HDFS, Cassandra, Hive, and any Hadoop data source. Spark can be deployed on a distributed framework such as YARN or Mesos, or as a standalone server.
- Flexibility
In addition to the Scala programming language, programmers can use Clojure, Java, and Python to build applications on Spark.
- Holistic library support
As a Spark programmer, one can integrate additional libraries within the same application to add Big Data analytics and machine learning capabilities. The supported libraries range from Spark Streaming and Spark GraphX to Spark SQL.
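As a minimal sketch of several of these features taken together (assuming Spark 2.x on a local master, with a hypothetical `data.txt` input file), the snippet below builds a SparkSession, applies a few of the high-level operators, and caches an intermediate dataset in memory so that two separate jobs can reuse it without re-reading from disk:

```scala
import org.apache.spark.sql.SparkSession

object FeatureTour {
  def main(args: Array[String]): Unit = {
    // One entry point for the SQL, streaming, and RDD APIs (the "holistic framework")
    val spark = SparkSession.builder()
      .appName("FeatureTour")
      .master("local[*]")      // standalone local mode; YARN or Mesos would also work
      .getOrCreate()

    // High-level operators on an RDD of lines from a hypothetical log file
    val lines  = spark.sparkContext.textFile("data.txt")
    val errors = lines.filter(_.contains("ERROR")).cache()  // keep in memory for reuse

    // Two different jobs share the cached data instead of re-reading it from disk
    val totalErrors = errors.count()
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()

    println(s"errors=$totalErrors, mysql-related=$mysqlErrors")
    spark.stop()
  }
}
```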
Context of Big Data
Computing vs data
- CPUs are getting only incrementally faster.
- Data storage keeps getting better and cheaper.
- Gathering data keeps getting easier and cheaper.
- Standard single-CPU software cannot scale up.
- Data needs to be distributed and processed in parallel.
Motivation For Spark
A 2009 UC Berkeley project by Matei Zaharia et al.
- MapReduce was the king of large distributed computation.
- It was inefficient for large applications and machine learning workloads.
- Each step required another pass over the data and had to be written as a separate application.
Spark Phase-1
- A simple functional programming API.
- Optimizes multi-step applications.
- In-memory computation and data sharing across nodes (see the sketch after this list).
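As a rough sketch of what Phase 1 enabled, assuming a local SparkSession and made-up data, several dependent steps can live in one application and share an in-memory dataset, instead of each step being a separate MapReduce program with disk writes in between:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MultiStepSketch").master("local[*]").getOrCreate()
val sc    = spark.sparkContext

// Load (made-up) data once and keep it in memory across steps
val nums = sc.parallelize(1 to 1000000).cache()

// A small iterative loop, as in machine-learning workloads: every pass
// reuses the cached RDD rather than re-reading the input
for (i <- 1 to 3) {
  val multiples = nums.filter(_ % (i + 1) == 0).count()
  println(s"step $i: $multiples numbers divisible by ${i + 1}")
}
spark.stop()
```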
Spark Phase-2
- Interactive data science and ad-hoc computation.
- Spark Shell and Spark SQL (see the sketch after this list).
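A small sketch of the interactive style: the lines below could be typed straight into the Spark shell (started with `spark-shell`, which already provides a `spark` session and the needed implicits); the table and its values are invented for the example:

```scala
// Inside spark-shell, `spark` (a SparkSession) and its implicits are already in scope
val people = Seq(("Ann", 34), ("Bob", 45), ("Cal", 29)).toDF("name", "age")  // invented data
people.createOrReplaceTempView("people")

// Ad-hoc exploration with Spark SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()
```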
Spark Phase-3
- Same engine, new libraries.
- MLlib, Spark Streaming, and GraphX (see the sketch after this list).
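As a brief sketch of "same engine, new libraries", the MLlib snippet below fits a logistic regression on a tiny invented DataFrame, reusing the same SparkSession that drives everything else (the Spark 2.x `spark.ml` API is assumed):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MlSketch").master("local[*]").getOrCreate()

// Tiny invented training set: (label, features)
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr    = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients}")
spark.stop()
```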
Why should you learn Scala for Apache Spark?

- Apache Spark is written in Scala, and because of Scala's scalability on the JVM, it is the most prominently used programming language among big data developers working on Spark projects. Developers report that using Scala helps them dig deep into Spark's source code, so they can easily access and implement the newest Spark features. Scala's interoperability with Java is its biggest attraction, since Java developers can get on the learning path quickly by grasping the object-oriented concepts.
- Scala programming strikes a good balance between productivity and performance. Many big data developers come from a Python or R background, and Scala's syntax is less intimidating than Java or C++. A new Spark developer with no prior experience only needs to know the basic syntax, collections, and lambdas to become productive in big data processing with Apache Spark. The performance achieved with Scala is also better than that of many traditional data analysis tools such as R or Python. Over time, as a developer's skills grow, it becomes easy to transition from imperative code to more elegant functional programming code to improve performance.
- Organizations want the expressive power of a dynamic programming language without losing type safety. Scala offers this combination, which can be judged from its increasing adoption rates in the enterprise.
- Scala fits well with the MapReduce big data model because of its functional paradigm. Many Scala data frameworks follow abstract data types that are consistent with Scala's collection APIs. Developers just need to learn the standard collections, and it becomes easy to work with other libraries.
- The Scala programming language provides the best path for building scalable big data applications in terms of data size and program complexity. With support for immutable data structures, for-comprehensions, and immutable named values, Scala provides remarkable support for functional programming.
- Scala programming is comparatively less complex than Java. A single line of Scala can replace 20 to 25 lines of complex Java code, making it a preferable choice for big data processing on Apache Spark (see the word-count sketch after this list).
- Scala has well-designed libraries for scientific computing, linear algebra, and random number generation. The standard scientific library Breeze contains non-uniform random generation, numerical algebra, and other special functions (a small Breeze example follows this list). Saddle, a data library for Scala, provides a solid foundation for data manipulation through 2D data structures, robustness to missing values, array-backed support, and automatic data alignment.
- Efficiency and speed play a vital role regardless of increasing processor speeds. Scala is fast and efficient making it an ideal choice of language for computationally intensive algorithms. Compute cycle and memory efficiency are also well-tuned when using Scala for Spark programming.
- Other programming languages like Python and Java lag behind in Spark API coverage. Scala has bridged this gap and continues to gain traction in the Spark community. The rule of thumb is that Scala or Python lets developers write the most concise code, while Java or Scala gives the best runtime performance. The best trade-off is to use Scala for Spark, since it covers all the mainstream features without requiring developers to master advanced constructs.
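To make the conciseness argument concrete, a complete word count, the canonical MapReduce example, fits in a few lines of Scala on Spark (the input path here is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()

// A full word count: lambdas plus the familiar collection-style operators
val counts = spark.sparkContext.textFile("input.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
spark.stop()
```

And as a quick example of the scientific-library point, Breeze's dense linear-algebra types look like this (library setup assumed):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

val v = DenseVector(1.0, 2.0, 3.0)
val m = DenseMatrix((1.0, 0.0, 0.0),
                    (0.0, 2.0, 0.0),
                    (0.0, 0.0, 3.0))
println(m * v)   // matrix-vector product: DenseVector(1.0, 4.0, 9.0)
```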
Conclusion
In this blog, we discussed some basics of Apache Spark with Scala: its features, the context of big data, the motivation for Spark, and why you should learn Scala for Apache Spark.
We can continue this blog with further topics such as DataFrames, Datasets, columns, and expressions.
Stay tuned for more blogs.