Neo4j With Scala: Awesome Experience with Spark


Let's continue our journey in this series. In the last blog, we discussed migrating data from other databases to Neo4j. Now we will discuss how we can combine Neo4j with Spark.

Before starting, here is a recap of the series so far:

  1. Getting Started Neo4j with Scala : An Introduction
  2. Neo4j with Scala: Defining User Defined Procedures and APOC
  3. Neo4j with Scala: Migrate Data From Other Database to Neo4j

Now we start our journey. Apache Spark is a generalized framework for distributed data processing that provides a functional API for manipulating data at large scale, along with in-memory data caching and reuse across computations. When we use Neo4j as a graph database and need to perform heavy data processing on that data, Spark is a natural fit. We have to follow some basic steps before we start playing with Spark:

  1. Download Apache Spark 2.0.0 (Download Here). Note that we must use Spark 2.0.0 here, because the connector we are going to use is built against Spark 2.0.0.
  2. Set the SPARK_HOME path in the .bashrc file (for Linux users); a sample entry is shown after this list.
  3. Now we can use the Neo4j-Spark-Connector, configured as described below.
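For step 2, the entry in ~/.bashrc looks roughly like this (the install path below is an assumption; use wherever you extracted Spark):

# example ~/.bashrc entry; the path is an assumption, adjust to your install location
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin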

Configuration with Neo4j:

  1. When running Neo4j with the default host and port, we only have to configure our password in spark-defaults.conf:
     spark.neo4j.bolt.password=your_neo4j_password
  2. If we are not running Neo4j with the default host and port, we have to configure these in spark-defaults.conf:
      1. spark.neo4j.bolt.url=bolt://$host_name:$port_number
      2. spark.neo4j.bolt.user=user_name
      3. spark.neo4j.bolt.password=your_neo4j_password
  3. We can also provide the user and password as part of the URL:
     spark.neo4j.bolt.url=bolt://neo4j:<password>@localhost

Integration of the Neo4j-Spark-Connector:

  1. Download the Neo4j-Spark-Connector (Download Here) and build it. After building, we can provide the jar path when we start the spark-shell:
    $ $SPARK_HOME/bin/spark-shell --jars neo4j-spark-connector_2.11-full-2.0.0-M2.jar
  2. We can also provide the Neo4j-Spark-Connector as a package when we start the spark-shell:
    $ $SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2
  3. We can integrate the Neo4j-Spark-Connector directly in an SBT project:
    resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
    libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
  4. We can integrate the Neo4j-Spark-Connector directly with a POM (See Here).

Now we are ready to use Neo4j with Spark. We start the spark-shell from the command prompt:

$ $SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2


With the shell up, we are ready to start our new journey with Spark and Neo4j.

I assume that you have already created some data in Neo4j to run the basic commands against. If you have not, here is a Cypher query for creating 1K records in your Neo4j database (a minimal sketch; the Person label and its properties are placeholders of my choosing):
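// hypothetical sample data: 1000 Person nodes with an id, a name, and an age
UNWIND range(1, 1000) AS id
CREATE (:Person {id: id, name: "Person " + id, age: id % 100})

Running this once in the Neo4j browser gives us enough nodes to experiment with below.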

RDD Operations:

We start with some RDD operations on the data stored in Neo4j; a sketch is shown below.

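A minimal sketch, assuming the 1K Person nodes created above and the Neo4jTupleRDD and Neo4jRowRDD helpers that the 2.0.0-M2 connector exposes (the queries and the maxId parameter are illustrative):

import org.neo4j.spark._

// count all Person nodes by returning their internal ids as tuples
Neo4jTupleRDD(sc, "MATCH (n:Person) RETURN id(n)", Seq.empty).count
// expected: 1000, given the sample data above

// pass a query parameter and get the results back as rows
Neo4jRowRDD(sc, "MATCH (n:Person) WHERE n.id <= {maxId} RETURN n.id, n.name", Seq("maxId" -> 100)).count
// expected: 100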

DataFrame Operations:

Now we will perform some operations with DataFrames; a sketch is shown below.

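A minimal sketch, assuming the Neo4jDataFrame helper of the 2.0.0-M2 connector; the result schema is declared either with Spark SQL types or with string type hints (the queries are again illustrative):

import org.neo4j.spark._
import org.apache.spark.sql.types.LongType

// spark-shell 2.0 exposes the SparkSession as 'spark'
val sqlContext = spark.sqlContext

// build a DataFrame with an explicitly typed column
val idDf = Neo4jDataFrame.withDataType(sqlContext, "MATCH (n:Person) RETURN id(n) AS id", Seq.empty, "id" -> LongType)
idDf.count
// expected: 1000

// build a DataFrame with a string type hint and a query parameter
val nameDf = Neo4jDataFrame(sqlContext, "MATCH (n:Person) WHERE n.id <= {maxId} RETURN n.name AS name", Seq("maxId" -> 100), "name" -> "string")
nameDf.count
// expected: 100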

GraphX Graph Operations:

Now we will perform some operations on a GraphX graph; a sketch is shown below.
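A minimal sketch, assuming the Neo4jGraph helper of the 2.0.0-M2 connector and a KNOWS relationship between Person nodes (the KNOWS type is an assumption; our 1K-node sample above has no relationships, so substitute the label and relationship types your data actually contains):

import org.neo4j.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// load a GraphX graph: (source label, relationship types, target label)
val graph = Neo4jGraph.loadGraph(sc, "Person", Seq("KNOWS"), "Person")
graph.vertices.count
graph.edges.count

// run five iterations of PageRank over the loaded graph
val ranked = PageRank.run(graph, 5)
ranked.vertices.take(5)

// write the computed rank back to Neo4j as a 'rank' property on each node
Neo4jGraph.saveGraph(sc, ranked, "rank")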

These were basic examples of using Spark with Neo4j.

I hope this helps you start working with Spark and Neo4j.

Thanks.

Reference:

  1. Neo4j Spark Connector


Written by 

Anurag is a Sr. Software Consultant at Knoldus Software LLP. In his 3 years of experience, he has become a developer with proven experience in architecting and developing web applications.
