Continuing its goal of making Spark faster, easier, and smarter, the Apache Spark project recently shipped Spark 2.4, the fifth release in the 2.x line.
We were lucky enough to try it early in one of our projects. Today we will highlight the major changes in this version that we explored and experienced there.
In our earlier blogs on Spark, we covered RDDs and looked at how Spark Streaming can transform and transport data between Kafka topics.
In this blog, we will briefly summarize the features and improvements that Spark 2.4 has in store for you.
Spark 2.4 was released last month and has been generating buzz ever since. The release resolves approximately 1135 JIRAs (new features and bug fixes) from nearly 200 contributors worldwide.
We have gathered the most notable of those enhancements and listed them below.
Barrier Execution Mode
Among the improvements and new features in this version, the one we saw mentioned everywhere was Barrier Execution Mode.
Apache Spark 2.4 adds support for Barrier Execution Mode in the scheduler to better integrate with deep learning frameworks.
Barrier Execution Mode is part of Project Hydrogen, an Apache Spark initiative to bring state-of-the-art big data and AI together. It lets developers embed distributed training jobs from AI frameworks as Spark jobs, so that distributed deep learning training runs as an ordinary Apache Spark workload.
Using this new execution mode, Spark launches all training tasks (e.g. Message Passing Interface tasks) together and restarts all of them if any task fails. This is a new fault-tolerance mechanism for barrier tasks: when any barrier task fails midway, Spark aborts all the tasks and restarts the stage.
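To make the "launch together, restart together" behaviour concrete, here is a minimal plain-Python simulation using threads. It is not Spark's API: run_barrier_stage and flaky_training_task are hypothetical names we made up for illustration, and the sketch only mirrors the scheduling semantics described above.

```python
import threading


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Simulate a barrier stage: start all tasks together and rerun the
    whole stage if any single task fails. Returns (attempt, results)."""
    for attempt in range(1, max_attempts + 1):
        start_gate = threading.Barrier(num_tasks)
        results = [None] * num_tasks
        failures = []

        def worker(task_id):
            try:
                # No task proceeds until every task has been launched.
                start_gate.wait(timeout=10)
                results[task_id] = task_fn(task_id, attempt)
            except Exception as exc:  # also covers a broken barrier
                failures.append((task_id, exc))

        threads = [
            threading.Thread(target=worker, args=(task_id,))
            for task_id in range(num_tasks)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

        if not failures:
            return attempt, results
        # A single failure discards the results of every task in this attempt.
    raise RuntimeError("stage failed after {} attempts".format(max_attempts))


def flaky_training_task(task_id, attempt):
    """Hypothetical training task: worker 2 fails on the first attempt only."""
    if task_id == 2 and attempt == 1:
        raise ValueError("simulated worker failure")
    return task_id * 10


attempt, results = run_barrier_stage(flaky_training_task, num_tasks=4)
print(attempt, results)  # 2 [0, 10, 20, 30]
```

In PySpark 2.4 itself, the corresponding entry points are RDD.barrier() and BarrierTaskContext; the sketch above only imitates the retry behaviour so that it can run without a cluster.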
New Built-in Higher-order Functions
The next major enhancement was the addition of many new built-in functions, including higher-order functions, that make it easier to work with complex data types.
Spark 2.4 introduced 24 new built-in functions, such as array_max/min, and 5 higher-order functions, such as transform and filter.
The entire list can be found here.
Previously, there were two typical ways to manipulate complex types (e.g. arrays) directly:
1) exploding the nested structure into individual rows, applying some functions, and then rebuilding the structure.
2) building a User Defined Function (UDF).
In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function similar to UDFs but with much better performance.
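As a rough sketch of what this looks like in practice, the snippet below builds a Spark SQL query that applies four of the higher-order functions to an array column. The column name nums, the view name nested, and the helper names are our own assumptions; run_examples expects an existing SparkSession on Spark 2.4 or later.

```python
# Spark SQL expressions that use higher-order functions added in Spark 2.4.
# The column name `nums` and the aliases are hypothetical.
HIGHER_ORDER_EXPRESSIONS = [
    ("plus_one", "transform(nums, x -> x + 1)"),
    ("evens", "filter(nums, x -> x % 2 = 0)"),
    ("has_large", "exists(nums, x -> x > 2)"),
    ("total", "aggregate(nums, 0, (acc, x) -> acc + x)"),
]


def build_query(view="nested"):
    """Build one SELECT that applies every expression to the array column."""
    columns = ", ".join(
        "{} AS {}".format(expr, alias) for alias, expr in HIGHER_ORDER_EXPRESSIONS
    )
    return "SELECT nums, {} FROM {}".format(columns, view)


def run_examples(spark, view="nested"):
    """Register a one-row view and run the query on it.

    `spark` is an existing SparkSession (Spark 2.4+).
    """
    # array(1, 2, 3) is array<int>, which matches the integer literal 0
    # used as the start value of aggregate().
    spark.sql("SELECT array(1, 2, 3) AS nums").createOrReplaceTempView(view)
    return spark.sql(build_query(view))


print(build_query())
```

Calling run_examples(spark).show() on the single row [1, 2, 3] should give [2, 3, 4], [2], true and 6 for the four new columns, with no explode step and no UDF involved.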
Experimental Scala 2.12 Support
Another important step in this version is experimental support for Scala 2.12: Spark is now built and tested against it. The release is cross-built for both Scala 2.11 and 2.12, and both builds are available in the Maven repository and on the download page.
Users can now write Spark applications in Scala 2.12 by picking the Scala 2.12 Spark dependency.
Scala 2.12 also brings better interoperability with Java 8, including improved serialization of lambda functions, along with new features and bug fixes that users have been asking for.
Improved Kubernetes Integration
Additionally, Apache Spark 2.4 comes with improved Kubernetes integration in three specific ways:
- Support for running containerized PySpark and SparkR applications on Kubernetes.
- Client mode is now provided, so users can run interactive tools (e.g. a shell or notebook) in a pod on a Kubernetes cluster.
- Support for mounting more Kubernetes volume types: emptyDir, hostPath, and persistentVolumeClaim.
The community is also working on, or planning, features that further enhance the Kubernetes scheduler backend, such as dynamic resource allocation and an external shuffle service.
Apache Avro as a Built-in Data Source
This release also has built-in support for Apache Avro, the popular data serialization format, so developers can now read and write Avro data directly in Apache Spark. The module started life as a Databricks project and now adds a few new functions and logical type support.
Specifically, it provides:
- New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
- Support for Avro logical types, including Decimal, Timestamp, and Date. See the related schema conversions for details.
- 2X read throughput improvement and 10% write throughput improvement.
The new built-in spark-avro module provides better user experience and IO performance in Spark SQL and Structured Streaming. The original spark-avro will be deprecated in favor of the new built-in support for Avro in Spark itself.
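A minimal sketch of reading and writing Avro files from PySpark could look like the following. Note that although the module ships with the Spark project, it is a separate artifact that has to be added when launching Spark (for example via --packages). The helper names, sample data, and path argument are hypothetical, and as far as we know the from_avro() and to_avro() functions are exposed only to Scala and Java in 2.4, so this sketch sticks to the file-based API.

```python
def spark_avro_package(scala_version="2.11", spark_version="2.4.0"):
    """Maven coordinate of the spark-avro module, for use with --packages."""
    return "org.apache.spark:spark-avro_{}:{}".format(scala_version, spark_version)


def avro_round_trip(spark, path):
    """Write a small DataFrame as Avro and read it back.

    `spark` is an existing SparkSession started with the spark-avro package;
    `path` is a hypothetical output directory.
    """
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
    df.write.format("avro").mode("overwrite").save(path)
    return spark.read.format("avro").load(path)


print(spark_avro_package())  # org.apache.spark:spark-avro_2.11:2.4.0
```

The printed coordinate is what you would pass to spark-submit or pyspark with --packages; pick the Scala suffix that matches your Spark build.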
Built-in Image Data Source
With recent advances in deep learning frameworks for image classification and object detection, the demand for standard image processing in Apache Spark has never been greater.
Apache Spark 2.3 provided the ImageSchema.readImages API, which was originally developed in the Azure MMLSpark library.
In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column. It also provides a standard representation to code against and abstracts away the details of any particular image format.
Now one can easily load images as:
df = spark.read.format("image").load("...")
This enhancement addresses many of the challenges specific to image handling and preprocessing: for example, images come in different formats (e.g. jpeg, png), sizes, and color schemes, and there is no easy way to test for correctness.
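Once the images are loaded, the single image column is a struct whose fields can be selected like any other nested column. The sketch below pulls out the metadata fields and leaves the raw bytes (image.data) alone; the helper name and the image_dir argument are hypothetical, and spark is assumed to be an existing SparkSession on Spark 2.4 or later.

```python
# Metadata fields of the `image` struct column produced by the image data source.
IMAGE_METADATA_COLUMNS = [
    "image.origin",     # path the image was loaded from
    "image.height",
    "image.width",
    "image.nChannels",
    "image.mode",       # OpenCV-compatible type code
]


def load_image_metadata(spark, image_dir):
    """Load a directory of images and keep only the metadata fields."""
    df = spark.read.format("image").load(image_dir)
    return df.select(*IMAGE_METADATA_COLUMNS)
```

Something like load_image_metadata(spark, image_dir).show(truncate=False) is a quick way to check which files were picked up and what sizes they have before feeding them to a model.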
These were some of the interesting features that we came across while exploring the new version of Spark.
Apart from these, there are some more features as well, including:
- flexible streaming sinks
- better support for compacted Kafka topics
- improvement in off
- elimination of the 2-GB block size limitation during transfer
- Pandas UDF improvements, some Spark SQL enhancements
and many more…
We will try to cover them as well in future blogs once we have explored them a bit more.
Another thing to note is that Mesosphere has not yet officially provided direct support for Spark 2.4. The latest version of Spark used in their image is still 2.2.1, so you might need a custom image to run Spark jobs on Spark 2.4. But that’s a completely different story altogether.
To sum up, in this blog we have tried to cover most of the enhancements and features that Spark 2.4 brings.
If you want to go into a bit more technical detail, you can refer to the release notes for Spark 2.4.
Hope this helps. Stay tuned for more interesting articles. 🙂