Nutch. It’s sweet.

We (The University of Montana) have had a Google Mini for the last five or so years. It was really expensive to purchase, and after the short two-year support contract ran out we were left to fend for ourselves or shell out another $6k for a new one. So we’ve been limping along for the last three years with a Google Mini that is difficult to manage. One of our biggest issues is that the Google Mini will only index 200,000 pages, and then it quits. As a result our search is terrible, nearly unusable, and we (me) hear about it all the time. A big thanks to Tom, my boss, for telling me about Nutch!

Enter Nutch. As described on the project’s site: “Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.”

The best tutorial I found was on the Apache Wiki: http://wiki.apache.org/nutch/NutchTutorial

There are two things I did differently to make my installation and testing process go a bit smoother:

  1. Use the command “bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log” to initiate the crawl. The “>&” sends both normal output and errors to a crawl.log file that is handy and interesting to look through.
  2. When setting up the Tomcat WAR, edit /tomcat6/ROOT/WEB-INF/classes/nutch-site.xml and make it look like this:
    <configuration>
    <property>
    <name>searcher.dir</name>
    <value>/path/to/nutch/crawl</value>
    </property>
    </configuration>
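
If you want more than a scroll through crawl.log, here is a rough sketch of a summary script. It assumes each fetch shows up in the log as a line starting with "fetching " followed by the URL; that format is my assumption, so check your own log first. The function name summarize_crawl_log is made up for this example.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_crawl_log(lines):
    """Count fetched URLs per host.

    Assumption: a fetch is logged as a line that begins with
    "fetching " followed by the URL. Other lines are ignored.
    """
    hosts = Counter()
    prefix = "fetching "
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            url = line[len(prefix):].strip()
            hosts[urlparse(url).netloc] += 1
    return hosts


# Small inline sample so the sketch runs without a real log file.
sample_lines = [
    "fetching http://www.umt.edu/",
    "fetching http://www.umt.edu/admissions/",
    "some other log line",
    "fetching http://nickshontz.com/",
]
for host, count in summarize_crawl_log(sample_lines).most_common():
    print(f"{count:6d}  {host}")
```

To run it against a real crawl, pass an open file instead of the sample list, e.g. summarize_crawl_log(open("crawl.log")).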

Nutch Results
I set it up on my local machine running 64-bit Ubuntu and pointed it first at http://nickshontz.com and then did a very shallow sweep of http://www.umt.edu. The results are very promising. To the right are the results from tinkering with it for a few hours today.

The tutorial shows you how to set up Nutch, get it to crawl and index whatever content you specify, and then start performing searches on it. What I’ve got on the right is the stock java.war file unpacked into the root of Tomcat (this is also in the tutorial). I did update the logo with UM’s to give it a bit of branding for the demo.

Our intention is to build some simple RESTful web services that our production web box can call, so the results can be displayed however we like, using whatever language we want. You can see in the screenshot that Nutch also provides cached copies of pages as well as some other options for each link.
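
To give a feel for the web service idea, here is a minimal sketch of the translation step such a service might do. It assumes the search backend can hand back results as an RSS 2.0 feed; that is an assumption on my part, not something we have built yet, and the function name rss_to_results and the sample feed are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET


def rss_to_results(rss_xml):
    """Convert an RSS 2.0 document into a list of plain dicts.

    Assumption: each search hit is an <item> with <title>, <link>
    and <description> children. Missing children become "".
    """
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        results.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return results


# Hypothetical feed standing in for a real search response.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Search</title>
<item><title>UM Home</title><link>http://www.umt.edu/</link>
<description>Welcome to UM</description></item>
</channel></rss>"""

# JSON is easy to consume from whatever language the front end uses.
print(json.dumps(rss_to_results(SAMPLE_FEED), indent=2))
```

The point of returning plain JSON is that the production web box never needs to know anything about Nutch; it just asks for results and renders them.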