Nutch is an open source web crawler written in Java. In a previous post I talked about Solr and Nutch integration, which mainly covered how to set up Nutch in local mode (without Hadoop) and integrate it with Apache Solr for search. Today I am going to cover how to run Nutch 1.6 on top of Hadoop.
1. Setup Apache Nutch 1.6
1.1a Setup Nutch from binary distribution:
This is not an option here, because the 1.6 binary distribution is configured to run in local mode by default.
1.1b Setup Nutch from source distribution:
- Download Nutch 1.6 source from http://www.apache.org/dyn/closer.cgi/nutch/.
- Unzip it so that the directory is $HOME/apache-nutch-1.6
- cd $HOME/apache-nutch-1.6
- Add your spider name as http.agent.name in conf/nutch-site.xml (the conventional place for overriding conf/nutch-default.xml), for example:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
- Run the “ant” command.
- It should generate a directory called $HOME/apache-nutch-1.6/runtime.
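Putting the configuration step together, a minimal override file might look like the sketch below. Nutch's convention is to keep local settings in conf/nutch-site.xml rather than editing conf/nutch-default.xml directly; the agent name value is just this tutorial's example and can be any descriptive name:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identifies your crawler to the sites it fetches; Nutch refuses to
       crawl when this is empty. -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
```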
1.2 Verify Nutch installation
Run the following commands:
cd ${NUTCH_RUNTIME_HOME}/deploy
bin/nutch
You are good to go if you see output like the following:
Usage: nutch [-core] COMMAND
....
Troubleshooting tips:
1. Run the following command if you see "Permission denied":
chmod +x bin/nutch
2. Set JAVA_HOME if you see a "JAVA_HOME not set" error. On Mac, you can run the following command or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
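To make that second troubleshooting tip scriptable, here is a small POSIX-shell sketch that reports a usable JAVA_HOME. The /usr/libexec/java_home helper path is a macOS assumption; on Linux you would export the JDK path directly:

```shell
# find_java_home: print $JAVA_HOME if already exported, otherwise fall back
# to macOS's java_home helper (assumed to live at /usr/libexec/java_home).
find_java_home() {
  if [ -n "${JAVA_HOME:-}" ]; then
    echo "$JAVA_HOME"
  elif [ -x /usr/libexec/java_home ]; then
    /usr/libexec/java_home
  else
    echo "JAVA_HOME is not set; export it in ~/.bashrc" >&2
    return 1
  fi
}
```

Usage: run export JAVA_HOME="$(find_java_home)" before invoking bin/nutch.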
2. Setup Solr 3.6 or 4.1 for search
2.1a Setup Solr 3.6 from source distribution
You can set up Solr from the source distribution with Maven. The link below shows how to do that: http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html.
2.1b Setup Solr 3.6 from binary distribution
1. Download the binary file from http://www.apache.org/dyn/closer.cgi/lucene/solr/.
2. unzip apache-solr-3.6.2.zip
3. cd apache-solr-3.6.2/example
4. java -jar start.jar
2.2 Verify Solr installation
After you have started Solr, you should be able to access the following links:
http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
3. Setup Hadoop
You can skip this step if you have already set up Hadoop; otherwise, follow the instructions here.
4. Integrate Solr with Nutch
We now have Nutch, Solr, and Hadoop installed and set up. Below are the steps to make the crawled pages searchable:
- For Solr 3.*: run the command:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
- For Solr 4.*: run the command below (note the file is copied in as schema.xml):
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
and add a "_version_" field to schema.xml for concurrency control, as below:
<field name="_version_" type="long" indexed="true" stored="true"/>
- Restart Solr by running “java -jar start.jar” under ${APACHE_SOLR_HOME}/example.
- Now we are ready to access http://localhost:8983/solr/admin/.
1. cd $HOME/apache-nutch-1.6/runtime/deploy
2. mkdir -p firstSite/urls
3. Create a file named nutch under firstSite/urls with the following content:
http://tutorial.waycoolsearch.com/
or any site you want Nutch to crawl.
4. Put the firstSite directory into HDFS:
hadoop fs -put firstSite firstSite
5.1 Run one of the following commands if you don't want to send results to Solr yet:
hadoop jar apache-nutch-1.6.job org.apache.nutch.crawl.Crawl firstSite/urls -dir urls -depth 1 -topN 5
OR:
bin/nutch crawl firstSite/urls -dir urls -depth 1 -topN 5
5.2 Run the following command, which will crawl the sites and send results to Solr for searching:
bin/nutch crawl firstSite/urls -dir urls -depth 1 -topN 5 -solr http://localhost:8983/solr/
6. Now we are ready to search with http://localhost:8983/solr/admin/.
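Steps 1-5 above can be sketched as a couple of shell helpers. The seed URL, directory names, and Solr URL are just this tutorial's examples, and crawl_cmd only prints the command line so you can inspect it before running it from ${NUTCH_RUNTIME_HOME}/deploy:

```shell
# seed_dir: create <dir>/urls/nutch with one seed URL per line (steps 2-3).
seed_dir() {
  dir=$1; shift
  mkdir -p "$dir/urls"
  printf '%s\n' "$@" > "$dir/urls/nutch"
}

# crawl_cmd: print the crawl command for a seed dir and Solr URL (step 5.2).
# It is echoed rather than executed so you can review it first.
crawl_cmd() {
  echo "bin/nutch crawl $1/urls -dir urls -depth 1 -topN 5 -solr $2"
}
```

Usage, once Hadoop is running:
seed_dir firstSite http://tutorial.waycoolsearch.com/
hadoop fs -put firstSite firstSite
crawl_cmd firstSite http://localhost:8983/solr/ | sh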
Note: You probably missed the http.agent.name step in 1.1b if you are seeing the following error:
ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.