Most of us are familiar with Gnip, Inc., which provides data from dozens of social media websites through a single API. It is also known as the Grand Central Station of the social media web. One of its popular APIs is PowerTrack, which delivers Tweets from Twitter in real time and lets customers filter Twitter’s full firehose so that they receive only the Tweets they are interested in. This filtering ability has many applications and combines well with other tools, such as Spark Streaming.
Yes, we can integrate Gnip’s PowerTrack with Apache Spark’s Streaming library to build a utility that delivers Tweets from Twitter in real time. We can then apply the features of Apache Spark to the PowerTrack data to do real-time analysis.
In this blog, we will look at a utility that pulls Tweets from Gnip using Spark Streaming and makes Gnip’s PowerTrack data easier to handle.
Following are the steps for implementation :-
1. Set up the build.sbt file :-
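A minimal sketch of what such a build.sbt might contain is given here. The project name, the Scala version and the Spark version are assumptions chosen for illustration; pick versions that match your own cluster. Only Spark Streaming is declared, on the assumption that the connection to Gnip is made with the JDK's built-in HTTP classes.

```scala
// build.sbt (illustrative: the name and all version numbers are assumptions)
name := "spark-gnip"

version := "0.1.0"

scalaVersion := "2.12.18"

// Spark Streaming pulls in spark-core transitively; it is listed for clarity.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "3.5.1",
  "org.apache.spark" %% "spark-streaming" % "3.5.1"
)
```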
2. Create a ReceiverInputDStream for Gnip :-
Now we need to create a ReceiverInputDStream for Gnip that connects to Gnip using the supplied authentication credentials and returns the set of all Tweets received during each batch interval.
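One way this step could look is sketched here; it is not necessarily the code shipped in the utility package. The class names `GnipInputDStream` and `GnipReceiver` and the helper `basicAuthHeader` are hypothetical. The sketch assumes that the PowerTrack endpoint accepts HTTP Basic authentication, streams one JSON activity per line, and may gzip the response. Each line is stored as a raw string, and Spark groups the stored lines into one batch per interval.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import java.util.zip.GZIPInputStream

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

// Input stream whose receiver holds one long-lived HTTP connection to Gnip.
class GnipInputDStream(
    streamingContext: StreamingContext,
    url: String,
    username: String,
    password: String,
    storageLevel: StorageLevel
) extends ReceiverInputDStream[String](streamingContext) {

  override def getReceiver(): Receiver[String] =
    new GnipReceiver(url, username, password, storageLevel)
}

object GnipReceiver {
  // Builds the value of an HTTP Basic `Authorization` header.
  def basicAuthHeader(username: String, password: String): String = {
    val raw = s"$username:$password".getBytes(StandardCharsets.UTF_8)
    "Basic " + Base64.getEncoder.encodeToString(raw)
  }
}

class GnipReceiver(url: String, username: String, password: String, storageLevel: StorageLevel)
    extends Receiver[String](storageLevel) {

  override def onStart(): Unit = {
    // onStart must not block, so the read loop runs on its own thread.
    val thread = new Thread("Gnip Receiver") {
      override def run(): Unit = receive()
    }
    thread.setDaemon(true)
    thread.start()
  }

  // Nothing to release here: the read loop checks isStopped() and exits by itself.
  override def onStop(): Unit = ()

  private def receive(): Unit = {
    var connection: HttpURLConnection = null
    try {
      connection = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      connection.setRequestProperty("Authorization", GnipReceiver.basicAuthHeader(username, password))
      connection.setRequestProperty("Accept-Encoding", "gzip")
      val raw = connection.getInputStream
      // Decompress only when the server says the body is gzipped (assumed behaviour).
      val body = if ("gzip".equalsIgnoreCase(connection.getContentEncoding)) new GZIPInputStream(raw) else raw
      val reader = new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        if (line.trim.nonEmpty) store(line) // skip blank keep-alive lines
        line = reader.readLine()
      }
      reader.close()
      if (!isStopped()) restart("Gnip stream closed, reconnecting")
    } catch {
      case e: Exception => restart("Error receiving data from Gnip", e)
    } finally {
      if (connection != null) connection.disconnect()
    }
  }
}
```

Keeping the header construction in a small pure function makes it easy to check without a network connection; for example, `GnipReceiver.basicAuthHeader("user", "pass")` yields `Basic dXNlcjpwYXNz`.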
3. Utility for Gnip Input Stream :-
Finally, we need a utility for the Gnip input stream that creates an input stream returning the Tweets received from Gnip.
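A sketch of such a utility follows. The object name `GnipUtils`, the method name `createStream` and the default storage level are assumptions. It also assumes that the input stream class from step 2 is called `GnipInputDStream` and takes its arguments in the order shown, so adjust the call if your class differs.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object GnipUtils {

  // Creates an input stream of raw Tweet payloads, one string per received line.
  def createStream(
      ssc: StreamingContext,
      url: String,
      username: String,
      password: String,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] =
    // Assumption: GnipInputDStream is the step 2 class with this constructor order.
    new GnipInputDStream(ssc, url, username, password, storageLevel)
}
```

Hiding the constructor behind a single method keeps caller code short and leaves room to change the receiver later without touching the applications that use it.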
4. Example :-
Now we are ready to use our Gnip input stream receiver with the Apache Spark Streaming API. Here is a short example of how to use the above utility:
As we can see in the example above, this utility makes it easy to integrate Spark Streaming with Gnip’s PowerTrack: we only need to provide the Gnip authentication credentials and the resource (PowerTrack) URL.
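The following sketch assumes a utility object named `GnipUtils` with a `createStream(ssc, url, username, password)` method as described in step 3. The environment variable names `GNIP_URL`, `GNIP_USERNAME` and `GNIP_PASSWORD`, the `required` helper and the 10-second batch interval are all hypothetical choices made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GnipExample {

  // Reads a required setting and fails early with a clear message if it is missing.
  def required(env: Map[String, String], key: String): String =
    env.getOrElse(key, sys.error(s"Missing environment variable: $key"))

  def main(args: Array[String]): Unit = {
    // Hypothetical variable names; keep credentials out of the source code.
    val url      = required(sys.env, "GNIP_URL")
    val username = required(sys.env, "GNIP_USERNAME")
    val password = required(sys.env, "GNIP_PASSWORD")

    // local[2]: one thread runs the receiver, the other processes the batches.
    val conf = new SparkConf().setMaster("local[2]").setAppName("GnipExample")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Assumption: GnipUtils.createStream is the step 3 utility.
    val tweets = GnipUtils.createStream(ssc, url, username, password)

    // Print the first few Tweets of every batch to the console.
    tweets.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

When running locally, the master needs at least two threads, because the receiver permanently occupies one of them.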
The full source code for the utility package can be downloaded from here.