Apache Nutch is an open source web search project. One of the interesting things it can be used for is web crawling. What makes Nutch particularly interesting is that it provides several extension points through which we can plug in our custom functionality. Some of the existing extension points can be found here. It supports a plugin system modeled on the one used in Eclipse.
For one of our subprojects, which required us to crawl a few websites, we decided to inject a Scala plugin into the flow to carry out our specific logic. For starters, here is an overview of the Nutch crawl mechanism:
As you will notice, in order to begin the crawl we need to populate the crawldb, from which the fetchlist is generated. On the basis of this fetchlist, a segment is prepared for the crawl, and the new links discovered there are injected back into the crawldb. This process continues until the desired depth is reached or the crawldb runs out of links.
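For reference, one iteration of that loop can also be driven manually with the individual Nutch 1.x commands (a rough sketch; the directory names here are only examples):

```
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 3
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```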
For more information on how to get started with Nutch, refer to the tutorial.
What we were required to do, as a part of the crawl cycle, specifically between steps 4 and 5 in the diagram above, was to update our database (MongoDB) with some information gathered from the crawl. For that we decided to inject a Scala plugin into the crawl cycle.
In order to compile the Scala plugin project, we need to pull in some dependencies through SBT. Our build.sbt looks like this:
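The original file is not reproduced here, so the following is a minimal sketch of what such a build.sbt could look like; the project name, the Scala version, and the Nutch and MongoDB driver versions are assumptions to be adapted to your setup:

```scala
name := "nutch-mongo-plugin"

version := "1.0"

scalaVersion := "2.9.2"

// Assumed coordinates and versions; match them to your Nutch install.
libraryDependencies ++= Seq(
  "org.apache.nutch" % "nutch"             % "1.6"    % "provided",
  "org.mongodb"      % "mongo-java-driver" % "2.10.1"
)
```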
The code for our parse filter looks like this:
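The original listing is not reproduced here; below is a sketch of how such a filter might look against the Nutch 1.x HtmlParseFilter API. The class name com.example.nutch.MongoParseFilter and the MongoWriterActor / PageData names are hypothetical (the actor is sketched further below):

```scala
package com.example.nutch

import org.apache.hadoop.conf.Configuration
import org.apache.nutch.parse.{HTMLMetaTags, HtmlParseFilter, ParseResult}
import org.apache.nutch.protocol.Content
import org.w3c.dom.DocumentFragment

class MongoParseFilter extends HtmlParseFilter {

  private var conf: Configuration = _

  def filter(content: Content, parseResult: ParseResult,
             metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
    val url = content.getUrl
    val parse = parseResult.get(url)
    if (parse != null) {
      // Fire-and-forget: hand the parsed text to the actor and return
      // immediately so the crawl cycle is not blocked by our processing.
      MongoWriterActor ! PageData(url, parse.getText)
    }
    parseResult
  }

  def setConf(conf: Configuration): Unit = { this.conf = conf }
  def getConf: Configuration = conf
}
```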
As you will notice, we implement the HtmlParseFilter extension point and inject our logic there. As soon as we get the parsed text, we send a message to a Scala actor, which does the further processing. This way we do not hold the crawl cycle hostage to our processing; it is delegated asynchronously to the actor.
Let us look at the actor implementation:
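Again, the original code is not shown here; this is a minimal sketch using the era-appropriate scala.actors library and the plain MongoDB Java driver. The message type, the actor name, and the host/database/collection names are assumptions:

```scala
package com.example.nutch

import scala.actors.Actor
import com.mongodb.{BasicDBObject, Mongo}

// Hypothetical message carrying the data extracted during parsing.
case class PageData(url: String, text: String)

// A simple actor that persists parsed pages into MongoDB,
// keeping the database work off the crawl's critical path.
object MongoWriterActor extends Actor {

  // Assumed connection details; adjust to your environment.
  private val collection =
    new Mongo("localhost", 27017).getDB("nutch").getCollection("pages")

  def act(): Unit = {
    loop {
      react {
        case PageData(url, text) =>
          collection.insert(new BasicDBObject("url", url).append("text", text))
      }
    }
  }

  // Start the actor as soon as the object is initialized.
  start()
}
```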
Our Scala actor picks up the processing and writes the desired parsed information into MongoDB. Once the code compiles, it is time to package it. For that we used sbt-assembly, which gave us a single jar of the project that can be deployed directly into Nutch.
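Roughly, the packaging and deployment steps look like this (a sketch, assuming sbt-assembly is enabled in project/plugins.sbt and that the plugin directory name matches the plugin id used in plugin.xml below):

```
sbt assembly

# Each Nutch plugin lives in its own directory containing the jar and plugin.xml;
# the directory name must match the plugin id declared in plugin.xml.
# The target path depends on your sbt/Scala version.
mkdir -p $NUTCH_HOME/runtime/local/plugins/mongo-parse-filter
cp target/scala-2.9.2/nutch-mongo-plugin-assembly-1.0.jar plugin.xml \
   $NUTCH_HOME/runtime/local/plugins/mongo-parse-filter/
```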
Nutch requires a corresponding plugin.xml file in order to register the plugin. Our plugin.xml looks like this:
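The original plugin.xml is not reproduced here, but for a HtmlParseFilter it would be structured along these lines; the plugin id, jar name, and class name follow the hypothetical names used in the sketches above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="mongo-parse-filter"
        name="MongoDB Parse Filter"
        version="1.0.0"
        provider-name="example.com">

  <runtime>
    <!-- The assembled jar deployed alongside this plugin.xml -->
    <library name="nutch-mongo-plugin-assembly-1.0.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Hook our implementation into the HtmlParseFilter extension point -->
  <extension id="com.example.nutch.MongoParseFilter"
             name="Mongo Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="MongoParseFilter"
                    class="com.example.nutch.MongoParseFilter"/>
  </extension>
</plugin>
```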
Here we are telling Nutch which extension point we are extending and where the implementation lives.
The next step is to make Nutch aware of the plugin. For this, change the plugin.includes setting in ../runtime/local/conf/nutch-site.xml so that it covers our plugin as well.
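Concretely, the plugin id is appended to the plugin.includes regular expression (the value below is based on the usual defaults; keep whatever plugins your installation already lists):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|mongo-parse-filter</value>
  <description>Regular expression naming plugin directory names to include;
  our mongo-parse-filter plugin is appended at the end.</description>
</property>
```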
Now once these steps are done, Nutch knows that it has to call this plugin as well as a part of its crawl cycle. When we execute Nutch with something like
bin/nutch crawl urls -dir vikas -depth 2 -topN 3
you will be able to see the records appear in MongoDB as part of the crawl process. Thus, it is easy to intercept and extend the Nutch crawl cycle with your custom plugins. You just need to get lucky finding your way through some sketchy documentation 😉
It is really helpful. Thanks
Hello Mr. Vikas,
I have a problem while using Nutch. When I try to launch the Nutch crawler using any command, e.g. (bin/nutch crawl urls -dir crawl -depth 3 -topN 5), it throws an error: Exception in thread “main” java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Astar\mapred\staging\Astar-2124359786\.staging to 0700. Would you please help me?
I am using Windows 7.
Thanks