Monday, February 23, 2009

Nutch in Eclipse

Draft blog (work in progress)

  1. Get the SVN Subclipse plugin for Eclipse as documented at http://subclipse.tigris.org/ and install it through Eclipse's Update mechanism. One thing I noticed: the Mylyn dependency does not work well with the Europa build on Windows Vista. It crashes Eclipse during random copy-paste operations, so I deselected it during the plugin installation.
  2. Once the plugin is installed and Eclipse has restarted, define the SVN repository to fetch Nutch from: Window > Open Perspective > Other... > SVN Repository Exploring > OK. In the new view, right-click > New and enter this URL: http://svn.apache.org/repos/asf
  3. Look in the Lucene folder in the SVN tree: Nutch > Trunk > right-click > Checkout... > Check out as a project in the workspace, and give it a name. I named mine 'NutchSource'.
  4. Open a terminal (command prompt) and, in the NutchSource directory, run ant. This builds all the files required to run Nutch as a standalone application. The 'HowToContribute' wiki explains this much better. After that, run ant test (this needs Ant 1.7.1 or higher, since it integrates well with JUnit; otherwise there is a workaround listed on the JUnit Ant Task page), and then ant war.
  5. Note: I checked out the project using 'Checkout As' from the SVN perspective, so it is not a Java project. This turned out to be a problem, since I could not use IDE features like F3 (Open Declaration) and code completion. To convert it to a Java project, use either of the two approaches below:
    -Delete the project but do not delete the contents. Create a new Java project from existing source and specify this project's directory as the source. It will mess up the default build folder; however, that is OK since we use the Ant build to compile. (../../java/ant171/bin/ant)
    -Close the project, edit the .project file and add the Java nature inside the <natures> element:
    <nature>org.eclipse.jdt.core.javanature</nature>
  6. Now F3 works; however, for most of the code that refers to Hadoop I am left in the dark. Alright, so I went to SVN and got the Hadoop trunk. This time I did a checkout as a project configured using the New Project Wizard, and Eclipse asked me whether it is a Java project. From there on it was a quick project setup. Then, back in Nutch, I hit go to source for JobConf and it opened a .class file. That view has an option to attach source; I selected the newly configured Hadoop directory, and now I can drill down from Nutch into the Hadoop source.
  7. The next thing is to crawl. We can modify nutch-default.xml in the conf directory for properties like http.agent.name, or modify nutch-site.xml as described in the Nutch Tutorial. If a required property such as http.agent.name is left unset, running the crawl will keep throwing a RuntimeException.
    -Modify the crawl-urlfilter.txt file with a line like the following, replacing MY.DOMAIN.NAME with the site you would like to crawl.
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    -Then the last thing is to create a folder named 'urls' in the bin directory. It holds a .txt file listing the URLs to crawl.
    -Once you have named your crawler agent, it's time to run the crawl.
    ./nutch crawl urls -dir ./crawlgle -depth 10 -topN 25
    In this case my crawl directory is crawlgle. (I didn't know that using a non-default crawl dir name would have implications later on.)
    After some time and a long list of sysout output, the crawl dir is created.
  8. How to search via command line?
    nutch org.apache.nutch.searcher.NutchBean google
    Searching a renamed crawl dir (crawlgle in my case, instead of the default crawl) this way will not yield anything. Since we have the source, we can overcome this by modifying the searcher.dir property in nutch-default.xml. Set it to your crawl dir name, recompile, and run the above command again. Watch the results.
  9. To see the results through the Nutch web app, first build the war file using ant war.
    -Then copy this war file to the Tomcat webapps directory.
    cp build/nutch-0.9.war /Users/xyz/Work/apache-tomcat-6.0.16/webapps/
    -Then start Tomcat from the NutchSource/bin directory (or whichever directory is the parent of the crawl dir), since the war resolves searcher.dir relative to the path from which Tomcat is started.
    -After this, go to http://localhost:8080/nutch-0.9/ and enter a search term to get results in a Google-like manner.
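
The preparation for steps 7 and 8 can be sketched as a small shell script. This is only a sketch under assumptions, not the official setup: it works in a scratch directory instead of the real NutchSource checkout, the agent name my-test-crawler, the seed file name seed.txt and the site example.com are made-up placeholders, and it puts the overrides in nutch-site.xml rather than editing nutch-default.xml. The grep -E check only approximates how Nutch applies the crawl-urlfilter.txt pattern, since Nutch uses Java regular expressions.

```shell
#!/bin/sh
# Sketch only: prepares the inputs for steps 7 and 8 in a scratch directory.
# In a real setup WORK would point at the NutchSource checkout.
WORK="${WORK:-$(mktemp -d)}"
CRAWL_DIR="crawlgle"                 # crawl dir name used in this post
AGENT_NAME="my-test-crawler"         # hypothetical agent name
SEED_URL="http://www.example.com/"   # hypothetical site to crawl

mkdir -p "$WORK/bin/urls" "$WORK/conf"

# Seed list: a text file inside bin/urls with one URL per line.
# The file name seed.txt is an arbitrary choice.
printf '%s\n' "$SEED_URL" > "$WORK/bin/urls/seed.txt"

# Properties set in nutch-site.xml override those in nutch-default.xml.
cat > "$WORK/conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$AGENT_NAME</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>$CRAWL_DIR</value>
  </property>
</configuration>
EOF

# Same pattern as the crawl-urlfilter.txt line, without the leading '+',
# and with MY.DOMAIN.NAME replaced by example.com (dots escaped).
FILTER_REGEX='^http://([a-z0-9]*\.)*example\.com/'

# Succeeds if the URL matches the include pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$FILTER_REGEX"
}

if url_allowed "$SEED_URL"; then
  echo "seed URL passes the filter"
else
  echo "seed URL is rejected by the filter"
fi

# The crawl itself is not run here; this just prints the command from step 7.
echo "Next, from $WORK/bin: ./nutch crawl urls -dir ./$CRAWL_DIR -depth 10 -topN 25"
```

Checking the seed URL against the pattern before crawling is a cheap way to catch a filter line that still contains MY.DOMAIN.NAME or an unescaped typo. Setting searcher.dir alongside http.agent.name keeps the renamed crawl dir from being forgotten when it is time to search.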

2 comments :

thodoris said...

you wrote "Then back to Nutch, hit go to source for JobConfig and it opened a .class file, here there's an option to attach source, I selected the newly configured Hadoop directory and now, I can drill down from Nutch into Hadoop source."

Can you be more specific, please? I can't figure it out. Thanks.

Tejas Patil said...

This is deprecated. The updated steps are here: https://wiki.apache.org/nutch/RunNutchInEclipse
