Aggregating Neighboring vertices with Apache Spark GraphX Library

Reading Time: 2 minutes

To get the problems addressed by “Neighborhood Aggregation”, we can think of the queries like: “Who has the maximum number of followers under 20 on twitter?”

In this blog, we will learn how to aggregate properties of neighboring vertices on a graph with Apache Spark’s GraphX Library. The spark shell will be enough to understand the code example.

So, let us get back on the problem statement. Let us assume following graph as example dataset of Twitter users and followers. The property age is stored as vertex attribute and the arrow from any vertex x to y says: x follows y.

Neighborhood Aggregation
Twitter Graph With (User,Age)

Defining Graph Components

Let us creates the list of vertices for the above graph. The following line create a list of vertices as Twitter users with property VertexId, and Age. The vertices in the graph are labelled with letters and we are using a sequence of letters as vertex ID.

The Edge is a case class that contains IDs of source and destination vertices followed by an attribute “relationship” as String.

with vertices and edges defined, following line creates a graph with these details.

Aggregating Attributes

The following code aggregates values from the neighboring edges and vertices to compute total followers of each vertex under age twenty:

Let us discuss how we got so far with aggregating values around each vertex. GraphX uses operation aggregateMessages as core aggregation operation.
The value tripletFields used in the operation aggregateMessages yields an EdgeContext which contains everything about an Edge i.e. IDs of the source and destination vertices, attributes of the source and destination vertices and attributes of the edge. Fragment tripletFields.srcAttr yields age of each source vertex which is used to compare the age if it is less than 20, and if so, tripletFields.sendToDst(1) sends a message of type Int to destination vertex with follower count 1. The method sendToDst() is just like map function of RDDs which returns Unit.

The function (a,b) => (a+ b) which has been passed in aggregateMessages operation is a reducer which takes two messages of type Int and sums them up whenever a follower is encountered with age under 20.

Result

When collecting the vertex RDD followersUnderTwenty we get the result:

The following line yields result (3,3) i.e. User “C” which has 3 followers under 20.

This was a very simple example of aggregation operation with very tiny but clear graph Dataset. I hope it was helpful 🙂

References:

http://spark.apache.org/docs/latest/graphx-programming-guide.html

Written by 

Manish Mishra is Lead Software Consultant, with experience of more than 7 years. His primary development technology was Java. He fell for Scala language and found it innovative and interesting language and fun to code with. He has also co-authored a journal paper titled: Economy Driven Real Time Deadline Based Scheduling. His interests include: learning cloud computing products and technologies, algorithm designing. He finds Books and literature as favorite companions in solitude. He likes stories, Spiritual Fictions and Time Traveling fictions as his favorites.

1 thought on “Aggregating Neighboring vertices with Apache Spark GraphX Library3 min read

Comments are closed.