Latest Step-by-Step Installation Guide for Dummies: Nutch 0.9

By Peter P. Wang, Zillionics LLC

Try the search engine I developed for The Christian Life: Malachi Search

peterwang@zillionics.com

 

  1. Download software
    1. Nutch 0.9: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
    2. JAVA JDK 6 update 3: http://java.sun.com/javase/downloads/index.jsp
    3. Apache Tomcat 6: http://tomcat.apache.org/download-60.cgi
    4. Cygwin: http://www.cygwin.com/

 

  2. Install the software one by one
    1. First, install Cygwin: run cygwinSetup.exe. You should see this when you run it.

A01.png

 

 

    2. Second, install Java: run jdk-6u3-windows-i586-p.exe

a02.png

 

 

    3. Third, install Apache Tomcat: run apache-tomcat-6.0.14.exe.

After installation, open it by clicking the Configure Tomcat icon shown below.

a03.png

 

 

Click the Start button shown below to start the Apache Tomcat service.

a04.png

 

 

You should then see the following screen in the browser when you go to http://localhost:8080:

a05.png

 

 

    4. Fourth, extract nutch-0.9.tar.gz to any directory you like, e.g. c:\nutch.

 

a06.png
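
For readers who prefer the Cygwin command line to a graphical archive tool, the extraction could be sketched as follows. The destination /cygdrive/c/nutch is Cygwin's spelling of c:\nutch, and the sketch assumes the archive sits in the current directory; both are illustrative choices, not requirements.

```shell
# Assumed names: adjust ARCHIVE and DEST to match your download and target folder.
ARCHIVE=nutch-0.9.tar.gz
DEST=/cygdrive/c/nutch

if [ -f "$ARCHIVE" ]; then
  # Create the target folder and unpack the gzip-compressed tar archive into it.
  mkdir -p "$DEST"
  tar -xzf "$ARCHIVE" -C "$DEST"
  ls "$DEST"
else
  echo "$ARCHIVE not found in $(pwd); download it first"
fi
```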

 

 

  3. Set up the crawler
    1. In the Cygwin window, go to your Nutch directory and set JAVA_HOME as follows:

a07.png
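
In text form, this step might look like the sketch below. Both paths are assumptions: NUTCH_HOME is a hypothetical variable pointing at wherever you extracted Nutch, and the JAVA_HOME value is only a guess at a typical JDK 6 update 3 location, so check where the installer actually put the JDK on your machine.

```shell
# Hypothetical locations; adjust both to match your machine.
NUTCH_HOME="/cygdrive/c/nutch/nutch-0.9"
JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_03"
export JAVA_HOME

# Move into the Nutch directory if it exists at the assumed path.
if [ -d "$NUTCH_HOME" ]; then
  cd "$NUTCH_HOME"
fi

echo "JAVA_HOME=$JAVA_HOME"
```

The export only lasts for the current Cygwin session, so it has to be repeated (or added to a shell startup file) each time a new window is opened.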

 

 

    2. Create a directory called urls to hold a text file with URLs inside it.

a08.png

 

 

    3. In this directory, create a text file with any name you like and put your URLs in it, one per line. This is the crawler’s “shopping list”.

a09.png
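
The two steps above can also be done from the Cygwin prompt. This is a minimal sketch: the file name seed.txt and the two example URLs are arbitrary choices for illustration, not anything Nutch requires.

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls

# seed.txt is an arbitrary name; write one URL per line.
printf '%s\n' \
  'http://lucene.apache.org/' \
  'http://www.apache.org/' > urls/seed.txt

# Show the resulting "shopping list".
cat urls/seed.txt
```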

 

 

    4. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wish to limit the crawl to the apache.org domain, the line should read:
                      +^http://([a-z0-9]*\.)*apache.org/

 

a10.png
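
To get a feel for what that filter line accepts, the regular expression can be tried against a few sample URLs with grep. This is only an approximation, assuming grep's extended syntax behaves like Nutch's regex matching for this simple pattern. The leading + in the filter file means "include" and is not part of the expression, so it is left out here. The sample URLs are made up for the test.

```shell
# The expression from crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Feed in sample URLs and keep only the ones the pattern accepts.
matched=$(printf '%s\n' \
  'http://lucene.apache.org/nutch/' \
  'http://www.apache.org/' \
  'http://www.example.com/' | grep -E "$pattern")

printf '%s\n' "$matched"
```

The two apache.org URLs pass and the example.com URL is dropped, which is the behaviour the crawl should show if the filter is set up as intended.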

 

 

    5. Edit the file conf/nutch-site.xml. Insert at least the following properties and fill in proper values for them:

 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 
<!-- Put site-specific property overrides in this file. -->

 
<configuration>

 
<property>
  <name>http.agent.name</name>
  <value>Peter Wang</value>
  <description>Peter Pu Wang
  </description>
</property>

 
<property>
  <name>http.agent.description</name>
  <value>Nutch spiderman</value>
  <description> Nutch spiderman
  </description>
</property>

 
<property>
  <name>http.agent.url</name>
  <value>http://peterpuwang.googlepages.com</value>
  <description>http://peterpuwang.googlepages.com
  </description>
</property>

 
<property>
  <name>http.agent.email</name>
  <value>MyEmail</value>
  <description>peterpuwang@yahoo.com
  </description>
</property>

 
</configuration>

 
  4. Run the crawler

Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  • -dir dir names the directory to put the crawl in.
  • -threads threads determines the number of threads that will fetch in parallel.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.

a11.png
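
Following the advice above, a first test run might use smaller numbers than the typical call. The sketch below only assembles such a command from variables and runs it if bin/nutch is present in the current directory; the directory name crawl-test and the values 2 and 10 are arbitrary starting points, not recommendations from Nutch.

```shell
# Assumed test values: a shallow crawl with few pages per level.
CRAWL_DIR=crawl-test
DEPTH=2
TOPN=10

cmd="bin/nutch crawl urls -dir $CRAWL_DIR -depth $DEPTH -topN $TOPN"
echo "$cmd"

# Run only when started from the Nutch directory.
# $cmd is left unquoted on purpose so the shell splits it into words.
if [ -x bin/nutch ]; then
  $cmd
fi
```

Once the output shows the desired pages being fetched, DEPTH and TOPN can be raised step by step toward the full-crawl values described above.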

 

  5. Web searching based on the crawl results above
    1. Go to http://localhost:8080/manager/html in the browser. In the “WAR file to deploy” section, select the nutch-0.9.war file to upload. It is in your Nutch directory. After deploying, you will see /nutch-0.9 in the application list.

a12.png

 

 

    2. Go to the webapps folder of your Apache Tomcat directory, e.g. C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps, and you will see that nutch-0.9.war has been copied there.

a13.png

 

 

    3. On the page http://localhost:8080/manager/html, click the “Start” link in the /nutch-0.9 row. A folder called “nutch-0.9” will then be created in the webapps folder shown above.

    4. Set your searcher directory

Next, navigate to your Nutch webapp folder, then to WEB-INF/classes. Edit the nutch-site.xml file and add the following to it (make sure you don't have two sets of <configuration></configuration> tags!):

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>your_crawl_folder_here</value>
  </property>
</configuration>

For example, if your Nutch directory resides at C:\nutch-0.9.0 and you specified crawl as the directory after the -dir option, then enter C:\nutch-0.9.0\crawl\ instead of your_crawl_folder_here.

a14.png
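
Before pointing searcher.dir at a folder, it can help to confirm that the crawl actually produced output there. The helper below is a hypothetical sketch, not part of Nutch. The subfolder names it looks for (crawldb, linkdb, segments, index) are an assumption about what a finished Nutch 0.9 crawl leaves behind, so compare them with what you see in your own crawl folder.

```shell
# Hypothetical helper: report which expected subfolders are missing.
check_crawl_dir() {
  dir=$1
  missing=""
  for sub in crawldb linkdb segments index; do
    [ -d "$dir/$sub" ] || missing="$missing $sub"
  done
  if [ -n "$missing" ]; then
    echo "missing in $dir:$missing"
    return 1
  fi
  echo "$dir looks like a complete crawl folder"
}

# Check the folder named after -dir; "|| true" keeps the script going on failure.
check_crawl_dir crawl || true
```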

 

    5. Reload

Reload the application. Use the Tomcat Manager and simply click the "Reload" command for nutch-0.9, or restart Tomcat using the Windows Services tool.

Open a browser and enter the URL http://localhost:8080/nutch-0.9 (the path under which the application was deployed). The Nutch search page should appear. As long as you've defined the correct location of your Nutch crawl directory (as shown above), clicking search should yield results.

a15.png

 

 

Congratulations! It rocks!

 

Peter P. Wang

peterpuwang@zillionics.com

 

 

 

[Source] http://zillionics.com/resources/Articles/NutchGuideForDummies.htm
