Nutch is an open source web crawler written in Java. In a previous post I talked about Solr and Nutch integration, which mainly covered how to set up Nutch in local mode (without Hadoop) and integrate it with Apache Solr for search. Today I am going to cover how to run Nutch 1.6 on top of Hadoop.
1. Setup Apache Nutch 1.6
1.1a Setup Nutch from binary distribution:
This is not an option here, because the 1.6 binary distribution is configured to run in local mode by default.
1.1b Setup Nutch from source distribution:
- Download Nutch 1.6 source from http://www.apache.org/dyn/closer.cgi/nutch/.
- Unzip it so that the directory is $HOME/apache-nutch-1.6
- cd $HOME/apache-nutch-1.6
- Add your spider name as http.agent.name in conf/nutch-site.xml (the conventional place for overriding conf/nutch-default.xml), for example:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
- Run the “ant” command.
- It should generate a directory called $HOME/apache-nutch-1.6/runtime.
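Putting the configuration step together, a minimal override file might look like the sketch below. Nutch's convention is to keep local settings in conf/nutch-site.xml rather than editing conf/nutch-default.xml directly; the agent name value is just this tutorial's example and can be any descriptive name:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identifies your crawler to the sites it fetches; Nutch refuses to
       crawl when this is empty. -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
```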
1.2 Verify Nutch installation
Run the following commands:
cd ${NUTCH_RUNTIME_HOME}/deploy
bin/nutch
You are good to go if you see output like the following:
Usage: nutch [-core] COMMAND
....
Troubleshooting tips:
1. Run the following command if you see "Permission denied":
chmod +x bin/nutch
2. Set JAVA_HOME if you see a "JAVA_HOME not set" error. On Mac, you can run the following command or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
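To make that second troubleshooting tip scriptable, here is a small POSIX-shell sketch that reports a usable JAVA_HOME. The /usr/libexec/java_home helper path is a macOS assumption; on Linux you would export the JDK path directly:

```shell
# find_java_home: print $JAVA_HOME if already exported, otherwise fall back
# to macOS's java_home helper (assumed to live at /usr/libexec/java_home).
find_java_home() {
  if [ -n "${JAVA_HOME:-}" ]; then
    echo "$JAVA_HOME"
  elif [ -x /usr/libexec/java_home ]; then
    /usr/libexec/java_home
  else
    echo "JAVA_HOME is not set; export it in ~/.bashrc" >&2
    return 1
  fi
}
```

Usage: run export JAVA_HOME="$(find_java_home)" before invoking bin/nutch.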
2. Setup Solr 3.6 or 4.1 for search
2.1a Setup Solr 3.6 from source distribution
You can set up Solr from the source distribution with Maven. The link below shows how to do that: http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html.
2.1b Setup Solr 3.6 from binary distribution
1. Download the binary file from http://www.apache.org/dyn/closer.cgi/lucene/solr/.
2. unzip apache-solr-3.6.2.zip
3. cd apache-solr-3.6.2/example
4. java -jar start.jar
2.2 Verify Solr installation
After you have started Solr, you should be able to access the following links:
http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
3. Setup Hadoop
You can skip this step if you have already set up Hadoop; otherwise, follow the instructions here.
4. Integrate Solr with Nutch
We now have Nutch, Solr, and Hadoop installed and set up. Below are the steps to make the crawled pages searchable:
- For Solr 3.*: run the command:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
- For Solr 4.*: run the command below (note the file is copied in as schema.xml):
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
and add a "_version_" field to schema.xml for concurrency control, as below:
<field name="_version_" type="long" indexed="true" stored="true"/>
- Restart Solr by running “java -jar start.jar” under ${APACHE_SOLR_HOME}/example.
- Now we are ready to access http://localhost:8983/solr/admin/.
1. cd $HOME/apache-nutch-1.6/runtime/deploy
2. mkdir -p firstSite/urls
3. Create a file named nutch under firstSite/urls with the following content:
http://tutorial.waycoolsearch.com/
or any site you want Nutch to crawl.
4. Put the firstSite directory into HDFS:
hadoop fs -put firstSite firstSite
5.1 Run one of the following commands if you don't want to send results to Solr yet:
hadoop jar apache-nutch-1.6.job org.apache.nutch.crawl.Crawl firstSite/urls -dir urls -depth 1 -topN 5
OR:
bin/nutch crawl firstSite/urls -dir urls -depth 1 -topN 5
5.2 Run the following command, which will crawl the sites and send results to Solr for searching:
bin/nutch crawl firstSite/urls -dir urls -depth 1 -topN 5 -solr http://localhost:8983/solr/
6. Now we are ready to search with http://localhost:8983/solr/admin/.
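Steps 1-5 above can be sketched as a couple of shell helpers. The seed URL, directory names, and Solr URL are just this tutorial's examples, and crawl_cmd only prints the command line so you can inspect it before running it from ${NUTCH_RUNTIME_HOME}/deploy:

```shell
# seed_dir: create <dir>/urls/nutch with one seed URL per line (steps 2-3).
seed_dir() {
  dir=$1; shift
  mkdir -p "$dir/urls"
  printf '%s\n' "$@" > "$dir/urls/nutch"
}

# crawl_cmd: print the crawl command for a seed dir and Solr URL (step 5.2).
# It is echoed rather than executed so you can review it first.
crawl_cmd() {
  echo "bin/nutch crawl $1/urls -dir urls -depth 1 -topN 5 -solr $2"
}
```

Usage, once Hadoop is running:
seed_dir firstSite http://tutorial.waycoolsearch.com/
hadoop fs -put firstSite firstSite
crawl_cmd firstSite http://localhost:8983/solr/ | sh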
Note: You probably missed the http.agent.name step in 1.1b if you are seeing the following error:
ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.