What’s new in Apache Spark 2.2


Apache recently released a newer version of Spark i.e Apache Spark2.2. The new version comes with new improvements as well as the addition of new functionalities.

The major addition to this release is Structured Streaming. It has been marked as production ready and its experimental tag has been removed.

Some of the high-level changes and improvements :

  • Production ready Structured Streaming
  • Expanding SQL functionalities
  • New distributed machine learning algorithms in R
  • Additional Algorithms in MLlib and GraphX

Spark 2.2 declares Structured Streaming as production ready with additional high-level changes:

  • Kafka Source and Sink: In the previous spark version Kafka was supported only as a source but in the current release we can use Kafka both as a Source and a Sink
  • Kafka Improvements: Now a cached instance of Kafka producer will be used for writing to KafkaSink thereby reducing latency
  • Additional Stateful APIs: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
  • Run Once Triggers: Allows to trigger only one-time execution, hence lowering the cost of clusters

Spark2.2 adds a number of SQL functionality :

  • API updates: Added support for creating hive table with DataFrameWriter and Catalog, LATERAL VIEW OUTER explode(), unify CREATE TABLE syntax for data source and hive serde tables. Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL Queries, support session local timezone when machines or users are in different time zones. It also adds support for ADD COLUMNS with ALTER TABLE command.
  • Overall Performance and stability:
    • Cost-based optimizer: Cardinality estimation for filter, join, aggregate, project and limit/sample operators. Deciding the join order of a multi-way join query based on the cost function. TPC-DS performance improvements using star-schema heuristics.
    • File listing/IO improvements for CSV and JSON
    • Partial aggregation support of HiveUDAFFunction
    • Introduce a JVM object based aggregate operator
    • Limiting the max number of records written per file
  • Other notable changes:
    • Support for parsing multi-line JSON and CSV files
    • Analyze Table Command on partitioned tables
    • Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
    • More robust view canonicalization without full SQL expansion
    • Support reading data from Hive metastore 2.0/2.1
    • Removed support for Hadoop 2.5 and earlier
    • Removed Java 7 support


A major set of changes in Spark 2.2 focuses on advanced analytics and Python.  PySpark from PyPI can be installed using pip install

Few new algorithms were also added to MLlib and GraphX:

  • Locality Sensitive Hashing
  • Multiclass Logistic Regression
  • Personalized PageRank

Spark 2.2 also adds support for the following distributed algorithms in SparkR:

  • ALS
  • Isotonic Regression
  • Multilayer Perceptron Classifier
  • Random Forest
  • Gaussian Mixture Model
  • LDA
  • Multiclass Logistic Regression
  • Gradient Boosted Trees

The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:

  • Structured Streaming API for R
  • Support complete Catalog API in R
  • column functions to_json, from_json
  • Coalesce on DataFrame and coalesce on column
  • Support DataFrame checkpointing
  • Multi-column approxQuantile in R

Some of the features like support for Python2.6 has been dropped and createExternalTable have been deprecated.

Happy Learning !!


knoldus-advt-sticker


 

Advertisements
This entry was posted in apache spark, big data, Scala, Spark, Streaming and tagged , , , , , , , , , , . Bookmark the permalink.

2 Responses to What’s new in Apache Spark 2.2

  1. Pingback: What’s New In Spark 2.2? – Curated SQL

  2. Pingback: New Features on Spark 2.2 – sendilsadasivam

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s