Spark Structured Streaming with Elasticsearch

Table of contents

Reading Time: 3 minutes

There’s been a lot of time we have been working on streaming data. Using Apache Spark for that can be much convenient. Spark provides two APIs for streaming data one is Spark Streaming which is a separate library provided by Spark. Another one is Structured Streaming which is built upon the Spark-SQL library. We will discuss the trade-offs and differences between these two libraries in another blog. But today we’ll focus on saving streaming data to Elasticseach using Spark Structured Streaming. Elasticsearch added support for Spark Structured Streaming 2.2.0 onwards in version 6.0.0 version of “Elasticsearch For Apache Hadoop” dependency. We will be using these versions or higher to build our sbt-scala project.

Pre-requisites

First, you need to add “Spark SQL” dependency:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"

and “Elasticsearch For Apache Hadoop” dependency to your build.sbt

libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "6.4.1"

or you can go to maven repository for Elasticsearch For Apache Hadoop and Spark SQL and get a suitable version. Note that version should be at least 6.0.0 for "Elasticsearch For Apache Hadoop" and 2.2.0 or higher for "Spark-SQL".

Let’s get started with the code

We will be reading a JSON file and saving its data to elasticsearch in this code. But first let’s go through the basic code to read a JSON file as a DataFrame and for that, we need to create a schema for our JSON.

val jsonSchema = StructType(
  Seq(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("nationality", StringType, true),
    StructField("skills", ArrayType(StringType, true), true)
  )
)

Now we can use this schema to read the JSON file in streaming mode with the following code:

val sparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("sample-structured-streaming")
  .getOrCreate()

val streamingDF: DataFrame = sparkSession
  .readStream
  .schema(jsonSchema)
  .json("src/main/resources/json-resources/")

Now with this “streamingDF”, we will send data to the destination. Let’s use the console for verifying the data.

streamingDF
  .writeStream
  .outputMode("append")
  .format("console")
  .start().awaitTermination()

Here .format(“console”) tells the result to print on console. This code prints the JSON data from resource file on the console:

+----+---+-----------+--------------------+
|name|age|nationality|              skills|
+----+---+-----------+--------------------+
|Anuj| 24|     Indian|[Scala, Spark, Akka]|
+----+---+-----------+--------------------+

That was just a simple structured streaming code where a JSON file was the source and console was the destination.

Elasticsearch as the destination in streaming

Now to add the elasticsearch as the destination for Spark Structured Streaming, we need to add the configuration of elasticsearch in the object of SparkSession:

val sparkSession = SparkSession.builder()
  .config(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, "username")
  .config(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, "password")
  .config(ConfigurationOptions.ES_NODES, "127.0.0.1")
  .config(ConfigurationOptions.ES_PORT, "9200")
  .master("local[*]")
  .appName("sample-structured-streaming")
  .getOrCreate()

We are adding the authentication credentials here with the elasticsearch nodes and port. Currently, the address is for the local machine for testing.

Now with the previous code, we need to add the destination for elasticsearch for DataStreamWriter:

streamingDF.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", "path-to-checkpointing")
  .start("index-name/doc-type).awaitTermination()

Here we changed the format from console to org.elasticsearch.spark.sql which tells the destination for streaming is now elasticsearch.

Now the code is complete. After running it, this will save the JSON data to elasticsearch. To make sure use curl:

curl http://localhost:9200/index-name/_search

This will show all the documents saved to this index, index-name. The following got saved to Elasticsearch:

You can find the whole code here. In case of any queries please comment below.

I hope that helped.
Thanks 🙂

References:

https://discuss.elastic.co/t/spark-structured-streaming-sink-in-append-mode/105664

High performance systems

Data Engineering, Strategy and Analytics

Intelligence Driven Decisioning - AI/ML

Cloud Engineering

Architecture Strategy, Audit & Academy

Platforms

KDP

KDSP

Products

Premon

Studio9

Tech Hub

Akka

Scala

Rust

Spark

Functional Java

Kafka

Flink

ML/AI

DevOps

Data Warehouse

Travel

Retail

Finance

Healthcare

Media and Publishing

Consumer Internet

Hi-tech & IoT

Case Studies

Blogs

Books

Community

Resources

OS contributions

Webinars

Knolx

Check out our open positions

Services

Go to Overview

Accelerators

Go to Overview

Platforms

Products

TechHub

Industries

Go to Overview

Travel

Insights

Go to Overview

Spark Structured Streaming with Elasticsearch

Pre-requisites

Let’s get started with the code

Elasticsearch as the destination in streaming

Share the Knol:

Related

Written by Anuj Saxena

1 thought on “Spark Structured Streaming with Elasticsearch3 min read”

COMPANY

Sign up to our newsletter

Certificates

Partners

© 2023 Knoldus, Inc. All Rights Reserved.

Part of NashTech

Privacy Policy | Sitemap

Discover more from Knoldus Blogs

Check out our
open positions