Elasticsearch: How to Index Data in Bulk with Scala Using the Java Bulk API

Elasticsearch is an open-source, RESTful, distributed search engine built on top of Apache Lucene. In this post, we will learn to use the Elasticsearch Java API in Scala to index data using BulkRequest.

We will begin by adding the Elasticsearch dependency to the project. At the time of writing, 0.19.8 was the latest version. The artifact is available on the Typesafe repository. Here is the snippet for the build.sbt file.

name := "ElasticSearchBulkproject"

version := "0.1.0"

scalaVersion := "2.9.2"

resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/"

libraryDependencies += "org.elasticsearch" % "elasticsearch" % "0.19.8"

Elasticsearch is schemaless, so we can index any JSON document into it. We have a bulk JSON file in which each line is one JSON document. In our implementation, the application reads the file line by line and adds each JSON document to a bulkRequest.

Here is the bulk JSON that we need to index. Every line represents one JSON document.
{ "id": 1, "source": "wordpress", "data": "document 1" }
{ "id": 2, "source": "wordpress", "data": "document 2" }
{ "id": 3, "source": "wordpress", "data": "document 3" }
{ "id": 4, "source": "wordpress", "data": "document 4" }
{ "id": 5, "source": "wordpress", "data": "document 5" }
{ "id": 6, "source": "wordpress", "data": "document 6" }
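
Reading the file line by line could look roughly like the sketch below. The file name bulk.json and the helper readDocuments are hypothetical names chosen for illustration, and blank lines are skipped on the assumption that they carry no document.

```scala
import scala.io.Source

object ReadBulkFile {
  // Returns the non-empty lines of the file; each line is expected to hold one JSON document.
  def readDocuments(path: String): List[String] = {
    val source = Source.fromFile(path)
    try {
      source.getLines().map(_.trim).filter(_.nonEmpty).toList
    } finally {
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // "bulk.json" is a hypothetical file name; replace it with the real path.
    readDocuments("bulk.json").foreach(println)
  }
}
```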

First, we need to create a node in Scala using the Java API, which will interact with the Elasticsearch server running on our machine.

We then create a client from that node.

After this, we create a request for an index named "wordpress" using createIndexRequest.
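
A minimal sketch of creating the node and the client with the 0.19.x Java API. Calling client(true) is an assumption here: it makes the node join the cluster without holding any data, which suits an application that only talks to an existing server. The object name is hypothetical.

```scala
import org.elasticsearch.client.Client
import org.elasticsearch.node.Node
import org.elasticsearch.node.NodeBuilder.nodeBuilder

object NodeClientSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a client-only node that joins the local cluster and stores no data.
    val node: Node = nodeBuilder().client(true).node()
    // The client is the entry point for index, bulk and admin operations.
    val client: Client = node.client()
    try {
      println("Client created: " + client)
    } finally {
      // Closing the node also releases the client's resources.
      node.close()
    }
  }
}
```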
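
As a sketch, the index can be created by passing the request to the indices admin client. The method name createWordpressIndex is hypothetical, and the client is assumed to come from a node that is already running.

```scala
import org.elasticsearch.client.Client
import org.elasticsearch.client.Requests

object CreateIndexSketch {
  // Creates the "wordpress" index with default settings and waits for the response.
  // The call throws an exception if the index already exists.
  def createWordpressIndex(client: Client): Unit = {
    val request = Requests.createIndexRequest("wordpress")
    client.admin().indices().create(request).actionGet()
  }
}
```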

If Elasticsearch servers are running on other nodes, we can also specify the number of shards and replicas when creating the index, using ImmutableSettings.settingsBuilder.
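
One way to create the index with explicit settings is sketched below. The shard and replica counts passed in are illustrative values, not ones taken from the original project, and the helper names are hypothetical.

```scala
import org.elasticsearch.client.Client
import org.elasticsearch.client.Requests
import org.elasticsearch.common.settings.ImmutableSettings
import org.elasticsearch.common.settings.Settings

object IndexSettingsSketch {
  // Builds index-level settings for the given shard and replica counts.
  def indexSettings(shards: Int, replicas: Int): Settings =
    ImmutableSettings.settingsBuilder()
      .put("index.number_of_shards", shards)
      .put("index.number_of_replicas", replicas)
      .build()

  // Creates the "wordpress" index with those settings and waits for the response.
  def createIndex(client: Client, shards: Int, replicas: Int): Unit = {
    val request = Requests.createIndexRequest("wordpress").settings(indexSettings(shards, replicas))
    client.admin().indices().create(request).actionGet()
  }
}
```

For example, createIndex(client, 3, 1) would ask for three primary shards with one replica each.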

Now we create a bulkRequest.

To the add method of bulkRequest we pass a prepareIndex call, with the JSON document set as its source through setSource.

The prepareIndex call on the client takes three parameters:

1) name of the index: "wordpress" in our case

2) document type: "wordpressStream" in our case

3) unique id: a unique id for each document to be indexed. Using this id, we can later retrieve a particular document from a particular index.
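
The calls described above can be sketched as follows. The helper buildBulk and its (id, json) pairs are hypothetical; only prepareBulk, add, prepareIndex and setSource come from the Elasticsearch Java API.

```scala
import org.elasticsearch.action.bulk.BulkRequestBuilder
import org.elasticsearch.client.Client

object BulkAddSketch {
  // Adds one index action per (id, json) pair and returns the populated bulk request.
  def buildBulk(client: Client, docs: Seq[(String, String)]): BulkRequestBuilder = {
    val bulkRequest = client.prepareBulk()
    docs.foreach { case (id, json) =>
      // index name, document type and unique id, followed by the JSON source
      bulkRequest.add(client.prepareIndex("wordpress", "wordpressStream", id).setSource(json))
    }
    bulkRequest
  }
}
```

Nothing is sent to the server until execute() is called on the returned builder.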

Now start Elasticsearch by executing the following command.

elasticsearch -f

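Combining all the pieces together, the application creates the node and client, creates the index, reads the bulk file line by line, adds each document to the bulkRequest, and executes it.

Now we can check the data using a curl request.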
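
A minimal end-to-end sketch under stated assumptions: the file name bulk.json is hypothetical, the node is a client-only node, and each document's id is taken to be its line number, which matches the sample data above. It is an illustration of the steps, not the original program.

```scala
import scala.io.Source
import org.elasticsearch.client.Requests
import org.elasticsearch.node.NodeBuilder.nodeBuilder

object BulkIndexer {
  def main(args: Array[String]): Unit = {
    // Assumption: a client-only node that joins the locally running cluster.
    val node = nodeBuilder().client(true).node()
    val client = node.client()
    // "bulk.json" is a hypothetical file name.
    val source = Source.fromFile("bulk.json")
    try {
      // Throws if the index already exists, so skip this call on repeated runs.
      client.admin().indices().create(Requests.createIndexRequest("wordpress")).actionGet()

      val bulkRequest = client.prepareBulk()
      val lines = source.getLines().map(_.trim).filter(_.nonEmpty)
      lines.zipWithIndex.foreach { case (json, index) =>
        // Assumption: the document id equals the 1-based line number.
        val id = (index + 1).toString
        bulkRequest.add(client.prepareIndex("wordpress", "wordpressStream", id).setSource(json))
      }

      // Send all index actions to the server in a single request.
      val response = bulkRequest.execute().actionGet()
      if (response.hasFailures()) println(response.buildFailureMessage())
      else println("Bulk indexing finished without failures")
    } finally {
      source.close()
      node.close()
    }
  }
}
```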

curl -XGET http://localhost:9200/wordpress/wordpressStream/1

Here, "http://localhost:9200" is the host and port of Elasticsearch, "wordpress" is the name of the index, "wordpressStream" is the document type, and the last segment is the id.

And here is the output of the curl request.

{"_index":"wordpress","_type":"wordpressStream","_id":"1","_version":3,"exists":true, "_source" : { "id": 1, "source": "wordpress", "data": "document 1" }}
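
The same lookup can probably be done from Scala with the client's get API. This is a sketch assuming the 0.19.x method names exists() and sourceAsString() on the response; later releases use different names. The helper fetchSource is hypothetical.

```scala
import org.elasticsearch.client.Client

object GetDocumentSketch {
  // Returns the stored JSON source for the given id, or None if no such document exists.
  def fetchSource(client: Client, id: String): Option[String] = {
    val response = client.prepareGet("wordpress", "wordpressStream", id).execute().actionGet()
    // Assumption: exists() and sourceAsString() are the 0.19.x accessors.
    if (response.exists()) Some(response.sourceAsString()) else None
  }
}
```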