Intercepting Nutch Crawl Flow with a Scala Plugin

Apache Nutch is an open source web search project, and one of the interesting things it can be used for is crawling. What makes Nutch particularly interesting is that it provides several extension points through which we can plug in our custom functionality; a list of the existing extension points is available in the Nutch documentation. Its plugin system is similar to the one used in Eclipse.

For one of our subprojects, which required crawling a few websites, we decided to inject our own Scala plugin into the flow to run our specific logic. To set the stage, here is an overview of the Nutch crawl mechanism:

As you would notice, in order to begin the crawl we need to populate the crawldb, after which the fetchlist is generated. On the basis of this fetchlist, a segment for the crawl is prepared and newly discovered links are injected back into the crawldb. This process continues until the desired depth is reached or the crawldb runs out of links.

For more information on getting started with Nutch, refer to the official Nutch tutorial.

What we needed to do, as a part of the crawl cycle and specifically between steps 4 and 5 in the diagram above, was to update our database (MongoDB) with some information derived from the crawl. For that we decided to inject a Scala plugin into the crawl cycle.

To compile the Scala plugin project, we need to pull a few dependencies into SBT. Our build.sbt looks like this:
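A minimal sketch of what such a build.sbt could look like; the Nutch, Hadoop, Scala actors, and MongoDB driver versions below are assumptions and should be matched to your own setup:

```scala
name := "nutch-mongo-plugin"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Nutch and Hadoop classes are on the classpath when Nutch loads the plugin,
  // so they are only needed at compile time
  "org.apache.nutch"  % "nutch"             % "1.7"   % "provided",
  "org.apache.hadoop" % "hadoop-core"       % "1.2.1" % "provided",
  // classic Scala actors library (a separate module from Scala 2.10 onwards)
  "org.scala-lang"    % "scala-actors"      % "2.10.4",
  // MongoDB Java driver used by the actor to persist parsed pages
  "org.mongodb"       % "mongo-java-driver" % "2.11.3"
)
```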

The code for our parse filter looks like this:
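What follows is a rough sketch against the Nutch 1.x HtmlParseFilter extension point; the package, the class name MongoParseFilter, and the ParsedPage message are illustrative placeholders:

```scala
package com.knoldus.nutch

import org.apache.hadoop.conf.Configuration
import org.apache.nutch.parse.{ HTMLMetaTags, HtmlParseFilter, ParseResult }
import org.apache.nutch.protocol.Content
import org.w3c.dom.DocumentFragment

// message handed over to the actor for asynchronous processing
case class ParsedPage(url: String, text: String)

class MongoParseFilter extends HtmlParseFilter {

  private var conf: Configuration = _

  // called by Nutch for every page parsed by the HTML parser
  def filter(content: Content, parseResult: ParseResult,
             metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
    val url   = content.getUrl
    val parse = parseResult.get(url)   // the Parse produced for this URL
    // hand the parsed text off to the actor so the crawl cycle is never blocked
    MongoUpdater ! ParsedPage(url, parse.getText)
    parseResult                        // pass the result through unchanged
  }

  def setConf(configuration: Configuration): Unit = { conf = configuration }
  def getConf: Configuration = conf
}
```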

As you would notice, we implement the HtmlParseFilter extension point and inject our logic there. As soon as we get the parsed text, we send a message to a Scala actor, which takes care of the further processing. This way we do not hold the crawl cycle hostage to our processing and can delegate it asynchronously to the actor.

Let us look at the actor implementation:
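Again a sketch, this time using the classic scala.actors API and the Mongo Java driver; the host, database, and collection names are placeholders:

```scala
package com.knoldus.nutch

import scala.actors.Actor
import com.mongodb.{ BasicDBObject, MongoClient }

// a single long-lived actor that writes parsed pages into MongoDB,
// keeping the blocking database I/O off the crawl path
object MongoUpdater extends Actor {

  // lazy so the connection is only opened when the first message is handled
  private lazy val pages =
    new MongoClient("localhost", 27017).getDB("crawl").getCollection("pages")

  def act(): Unit = {
    loop {
      react {
        case ParsedPage(url, text) =>
          pages.insert(new BasicDBObject("url", url).append("text", text))
        case _ => // ignore anything we do not understand
      }
    }
  }

  start() // start the actor as soon as the object is first referenced
}
```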

Our Scala actor picks up the processing part and puts the desired parsed information into MongoDB. Once the code compiles, it is time to package it. For that we used sbt-assembly, which gave us a single jar of the project that can be deployed directly into Nutch.
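The sbt-assembly wiring typically lives in project/plugins.sbt; the version shown below is an assumption, and older sbt-assembly releases also need their settings added to build.sbt:

```scala
// project/plugins.sbt -- pulls the sbt-assembly plugin into the build
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```

Running sbt assembly then produces a single fat jar under the target directory, which is what we drop into the plugin's folder under Nutch.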

Nutch requires a corresponding plugin.xml file in order to register the plugin. Our plugin.xml looks like this:
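A sketch of such a plugin.xml, with the plugin id, jar name, and implementation class matching the hypothetical filter above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- plugin.xml: registers our filter against the HtmlParseFilter extension point -->
<plugin id="parse-mongo" name="Mongo Parse Filter"
        version="1.0.0" provider-name="com.knoldus">

   <runtime>
      <!-- the assembly jar produced by sbt, copied into the plugin's folder -->
      <library name="nutch-mongo-plugin-assembly-1.0.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.knoldus.nutch.mongoparsefilter"
              name="Mongo HTML Parse Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="MongoParseFilter"
                      class="com.knoldus.nutch.MongoParseFilter"/>
   </extension>
</plugin>
```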

Here we are telling Nutch which extension point we are extending and where the implementation lives.

The next step is to make Nutch aware of the plugin. For this, change the plugin.includes setting in ../runtime/local/conf/nutch-site.xml so that it picks up our plugin as well:
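Roughly like this, with our plugin id appended to the default value (the default list of plugins varies between Nutch versions):

```xml
<!-- ../runtime/local/conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-mongo</value>
  <description>Regular expression of plugin ids to load; parse-mongo is our custom plugin.</description>
</property>
```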

Once these steps are done, Nutch knows that it has to call this plugin as part of its crawl cycle. When we execute Nutch with something like

```
bin/nutch crawl urls -dir vikas -depth 2 -topN 3
```

you will be able to see the records appear in MongoDB as part of the crawl process. Thus, it is easy to intercept and extend the Nutch crawl cycle with your custom plugins. You just need to get lucky finding your way through some sketchy documentation 😉
