Archives for the month of: May, 2010

Nutch. It’s sweet.

We (The University of Montana) have had a Google Mini for the last 5 or so years. It was really expensive to purchase, and after the short two-year support contract ran out we were left to fend for ourselves or shell out another $6k for another one. So we’ve been limping along for the last 3 years with a Google Mini that is difficult to manage. One of our biggest issues is that the Google Mini will only index 200,000 pages, and then it quits. As a result our search is terrible, nearly unusable, and we (I) hear about it all the time. A big thanks to Tom, my boss, for telling me about Nutch!

Enter Nutch. As described on its site: “Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.”

The best tutorial I found was on the Apache Wiki:

There are two things I did differently to make my installation / testing process go a bit smoother:

  1. Use the command “bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log” to initiate the crawl. It generates a crawl.log file that is handy and interesting to look through.
  2. When setting up the Tomcat war, edit /tomcat6/ROOT/WEB-INF/classes/nutch-site.xml and make it look like this:

Nutch Results
I set it up on my local machine running 64-bit Ubuntu, pointed it first at one site, and then did a very shallow sweep of a second. The results are very promising. To the right are the results from tinkering with it for a few hours today.

The tutorial shows you how to set up Nutch, get it to crawl and index whatever content you specify, and then start performing searches on it. What I’ve got on the right is the stock java.war file unpacked into the root of Tomcat (this is also in the tutorial). I did update the logo with UM’s to give it a bit of branding for the demo.
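The crawl.log written by the crawl command earlier is easy to summarize from the shell. What follows is only a rough sketch: it assumes the log holds one line beginning with "fetching " per fetched URL, which may not hold for every Nutch version, and the function names are made up for this post. It also shows a POSIX-friendly spelling of the redirect, since ">&" is a bash/csh form.

```shell
#!/bin/sh
# Portable form of the crawl command's redirect (stdout and stderr to one file):
#   bin/nutch crawl urls -dir crawl -depth 3 > crawl.log 2>&1

# Count the lines that look like fetch attempts.
# Assumption: each fetch is logged as a line starting with "fetching ".
count_fetched() {
    grep -c '^fetching ' "$1"
}

# Tally fetches per host, busiest host first.
# Field 2 of the log line is the URL; field 3 of the URL split on "/" is the host.
fetched_hosts() {
    grep '^fetching ' "$1" \
        | awk '{print $2}' \
        | awk -F/ '{print $3}' \
        | sort | uniq -c | sort -rn
}

# Usage, once the crawl has finished:
#   count_fetched crawl.log
#   fetched_hosts crawl.log
```

If the counts look too low for the depth you asked for, the URL filter settings covered in the tutorial are a reasonable first place to look.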

Our intention is to build some simple RESTful web services that our production web box can call, so the results can be displayed however we like, in whatever language we want. You can see in the screenshot that Nutch also provides cached copies of pages as well as some other options for each link.

We hiked up Blodgett Canyon Overlook, a nice 1.5 mile walk to an overlook of the Blodgett Canyon trailhead. More photos can be found on Flickr.

Blodgett Canyon

The first bit is getting the PHP-Oracle connection set up. This has been tested on Ubuntu 10.04 (Desktop) and 8.10 (Server), both running the 64-bit OS.

  1. Install PEAR and PECL
    sudo apt-get update
    sudo apt-get install php-pear php5-dev libaio1 build-essential
  2. Download Oracle Instant Client. You need the Basic and SDK.
  3. Move and unzip the files
    sudo mkdir -p /opt/oracle
    cd /opt/oracle
    sudo unzip
    sudo unzip
    sudo mv /opt/oracle/instantclient_11_2 /opt/oracle/instantclient
  4. Create sym links.
    cd /opt/oracle/instantclient
    sudo ln -s
    sudo ln -s
  5. Install oci8.
    sudo pecl install oci8
  6. At the prompt enter: instantclient,/opt/oracle/instantclient
  7. Enable it by adding this line to your php.ini
  8. Restart Apache.
    sudo /etc/init.d/apache2 restart
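
Once Apache is back up, it is worth confirming that PHP actually sees the extension before moving on to CodeIgniter. A minimal sketch, assuming the php command-line binary is installed; has_module is a made-up helper name, not part of PHP. Keep in mind that on Ubuntu the CLI usually reads its own php.ini, separate from the one Apache uses, so a passing check here only covers the CLI.

```shell
#!/bin/sh
# Succeeds if the module name given as $1 appears as a whole line on stdin.
# Meant for the output of `php -m`, which prints one loaded module per line.
#   -q  print nothing, report only through the exit status
#   -i  ignore case
#   -x  match the whole line, so "oci8" does not match "pdo_oci8x"
has_module() {
    grep -qix -- "$1"
}

# Usage:
#   php -m | has_module oci8 && echo "oci8 is loaded"
```

For the Apache side, a throwaway page that calls phpinfo() shows the same information for the web server's configuration.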

The next bit is getting CodeIgniter and Oracle set up.
Notice that database is blank, and that hostname carries the whole connection target: host, port, and service name.
$db['default']['hostname'] = "";
$db['default']['username'] = "dbusername";
$db['default']['password'] = "dbpassword";
$db['default']['database'] = "";
$db['default']['dbdriver'] = "oci8";
$db['default']['dbprefix'] = "";
$db['default']['pconnect'] = TRUE;
$db['default']['db_debug'] = TRUE;
$db['default']['cache_on'] = FALSE;
$db['default']['cachedir'] = "";
$db['default']['char_set'] = "utf8";
$db['default']['dbcollat'] = "utf8_general_ci";

One last thing I’ve learned working with Oracle and PHP: when getting CLOBs and the like out of the database, the field comes back as a LOB object rather than a plain string, so to access the data use this syntax: $model->fldName->load().