Building a Search Engine With Nutch, Solr and Hadoop

 
 
 
The World Wide Web (WWW) is an ocean of information; in theory a kid could stay at home from birth and learn everything through the WWW without ever going to school (although that would hardly make for a well-rounded person).
 
Swimming through this ocean and finding what is relevant to you is the duty of a web search engine. It has to crawl the web, learn which site contains which information, and keep a record of it so that it can serve relevant URLs and data. Google has long been the leading web search engine, but there are now others such as Microsoft's Bing, which some say presents more focused results than Google.

The following steps describe building a search engine using Nutch and Solr; I did this for my organization, IMS Health, as a search engine for life sciences. Nutch and Solr are Apache Software Foundation projects and hence open source. Nutch is a web crawler, which means it takes a list of URLs and fetches the data contained on those web pages. Solr is an open-source search platform built on the Apache Lucene engine. Web crawling is a tremendous task that requires extensive processing power and network bandwidth, so I run the Nutch crawler on top of Hadoop, another Apache product. Hadoop is a distributed processing and storage system that can be deployed on commodity hardware without the need for high-end servers.
 
In short, Nutch does the crawling on top of Hadoop and hands the results to Solr; Solr builds an index (an inverted index, as Solr describes it) to give amazingly fast search responses to the user, and returns results as XML or JSON.
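As a quick illustration (assuming Solr is running on its default port 8983 with the default collection1 core, as set up later in this guide), a search can be issued over plain HTTP and the response format switched between JSON and XML with the wt parameter:

            # return matching documents as JSON
            curl "http://localhost:8983/solr/collection1/select?q=diabetes&wt=json"

            # the same query, returned as XML (the default)
            curl "http://localhost:8983/solr/collection1/select?q=diabetes&wt=xml"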
 
 
Software/Applications Used:
 
  • Apache Nutch 1.7 as the crawler
  • Apache Solr 4.6.1 as the search indexing engine
  • Apache Hadoop 1.2.1 as the distributed processing environment
  • Apache Ant as the build tool
  • Java version 1.7.0
  • CentOS 5.9 on the server
  • SSH configured on the server
 
Setup steps at a glance:
 
  • Create a dedicated user named hadoop with the password hadoop for this operation
  • Switch to the hadoop user
  • Enable SSH access to the local machine with a new key
  • Test the SSH connection to the local machine
  • Download Apache Solr (the latest version at the time of writing is 4.6.1)
  • Extract the download
  • Rename the directory to solr
  • Navigate to the example directory inside solr and start Solr using start.jar
  • This starts Solr on Jetty at the default port 8983; you can check the Solr admin panel at http://<your domain>:8983/solr/
  • Download the Apache Ant binary
  • Extract the downloaded archive
  • Set the ANT_HOME variable in /etc/profile by adding the export lines at the end of the file
  • Reload the profile so the variable takes effect
 
Note: The URLs mentioned in the wget commands may change over time; searching through the Apache archives will give you the correct URLs.
 
Dedicated User creation:
 
 
useradd hadoop
passwd hadoop    
> hadoop
 
 
Switch to the hadoop user:
 
su - hadoop
 
Configure SSH:
 
   ssh-keygen -t rsa -P ""
 
 
     cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
 
 
 
     ssh localhost
 
Apache Solr Installation:
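Download the release, assuming it is fetched from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/lucene/solr/4.6.1/solr-4.6.1.tgz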
 
 
 
 
 
tar zxf solr-4.6.1.tgz
 
 
            sudo mv solr-4.6.1 solr
 
 
            cd solr/example
            java -jar start.jar
 
 
 
Apache Ant Installation
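Download the Ant binary, assuming it comes from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.2-bin.tar.gz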
 
 
 
 
 
 
             tar -zxf apache-ant-1.9.2-bin.tar.gz
 
 
 
             sudo vi /etc/profile
 
 
 
export ANT_HOME=/apache-ant-1.9.2
export PATH=$PATH:$ANT_HOME/bin
 
 
 
               source /etc/profile
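A quick sanity check that Ant is now on the PATH (assuming the paths above):

               ant -version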
 
Apache Hadoop Installation
 
Download Hadoop 1.2.1
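Assuming the tarball is fetched from the Apache archive (the exact URL may change; see the note above):

                  wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz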
 
 
Extract and rename         
 
                  sudo tar xzf hadoop-1.2.1.tar.gz
 
                  sudo mv hadoop-1.2.1 hadoop
 
                   sudo chown -R hadoop:hadoop hadoop
 
Update /etc/profile
 
 
                    sudo vi  /etc/profile
 
Add the following lines at the end of the profile
 
# Set Hadoop-related environment variables
export HADOOP_HOME=~/hadoop
 
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
 
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
 
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
 
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
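After saving, reload the profile and confirm that the hadoop command resolves (a quick sanity check, assuming the paths above):

source /etc/profile
hadoop version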
 
Edit hadoop-env.sh and add JAVA_HOME
 
cd ~/hadoop
 
                  sudo vi conf/hadoop-env.sh
 
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
 
Create the hadoop tmp folder and give ownership to the hadoop user
 
            sudo mkdir -p /app/hadoop/tmp
 
            sudo chown hadoop:hadoop /app/hadoop/tmp
 
Edit conf/core-site.xml
 
             sudo vi conf/core-site.xml
 
Add the following properties between the <configuration> tags
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
Edit conf/mapred-site.xml
 
                  sudo vi conf/mapred-site.xml
 
Add the following property between the <configuration> tags
 
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
Edit conf/hdfs-site.xml
 
sudo vi conf/hdfs-site.xml
 
Add the following property between the <configuration> tags
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
Format HDFS via the single NameNode
 
                  bin/hadoop namenode -format
 
Start the single node cluster
 
                  bin/start-all.sh
 
Use jps to check if all the tasks are running 
 
                  jps
     
 
The output should display NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
 
Hadoop Debugging tips
 
     In order to restart hadoop, perform the following steps:
 
cd  ~/hadoop
bin/stop-all.sh
 
Delete the hadoop temp folder and recreate it with necessary access rights
 
sudo rm -r /app/hadoop/tmp
sudo mkdir /app/hadoop/tmp
sudo chmod -R 777 /app/hadoop/tmp (777 is used for this example only)
bin/hadoop namenode -format
bin/start-all.sh
 
 
If any of the components do not start up, check the Hadoop logs for exceptions. The logs are located in the logs/ subdirectory under the Hadoop root; a separate log file is created for each of the five Hadoop daemons.
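For example (assuming the default Hadoop 1.x log naming, hadoop-<user>-<daemon>-<hostname>.log), the NameNode log can be inspected with:

tail -n 100 ~/hadoop/logs/hadoop-hadoop-namenode-*.log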
 
Apache Nutch Installation
 
Download Nutch version 1.7 (download the source package, since it needs to be built to run on Hadoop)
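Assuming the source tarball is fetched from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz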
 
 
Extract the download and rename the folder
 
sudo tar zxf apache-nutch-1.7-src.tar.gz
            sudo mv apache-nutch-1.7-src nutch
 
Back up the existing conf/nutch-site.xml and copy conf/nutch-default.xml to conf/nutch-site.xml
 
            cd nutch/conf
            sudo mv nutch-site.xml nutch-site_bak.xml
            sudo cp nutch-default.xml nutch-site.xml
 
Edit nutch-site.xml and set the name of your crawler for the http.agent.name property
 
          sudo vi nutch-site.xml
 
          edit:
           <name>http.agent.name</name>
           <value>Your Name</value>
 
Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from ~/hadoop/conf to ~/nutch/conf
 
       sudo cp ~/hadoop/conf/hadoop-env.sh ~/nutch/conf
       sudo cp ~/hadoop/conf/hdfs-site.xml ~/nutch/conf
       sudo cp ~/hadoop/conf/mapred-site.xml ~/nutch/conf
       sudo cp ~/hadoop/conf/core-site.xml ~/nutch/conf
 
Edit default.properties in the Nutch root directory
 
       cd ~/nutch
      sudo vi default.properties      
 
Change the name property from apache-nutch to nutch in default.properties (so the build produces nutch-1.7.job, which is used in the crawl command below)
 
Build nutch using ant
 
            cd ~/nutch
            sudo ant runtime
 
Set the classpath to the Nutch runtime libraries
 
 
            export CLASSPATH=~/nutch/runtime/local/lib 
 
 Making Solr compatible with Nutch
 
Rename schema.xml in solr/example/solr/collection1/conf (here ${APACHE_SOLR_HOME} refers to the Solr directory created earlier, e.g. ~/solr, and ${NUTCH_RUNTIME_HOME} to the built Nutch runtime, e.g. ~/nutch/runtime/local)
 
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
 
Copy the Nutch schema-solr4.xml into solr/example/solr/collection1/conf and rename it to schema.xml
 
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
 
Edit the new schema.xml
 
vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
 
Add the following line exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
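Alternatively (assuming the default collection1 core from the Solr example setup), the core can be reloaded without restarting Jetty through Solr's CoreAdmin API:

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"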
 
Running the crawl
 
Create the URLs seed file in Nutch
 
cd ~/nutch
sudo mkdir urls
cd urls
sudo nano seed.txt
 
Add the necessary URLs to this file, one per line (example URLs below)
 
http://diabetesdailypost.com/
http://www.allfortheboys.com/
 
Put the urls folder into HDFS
 
cd ~/hadoop
bin/hadoop dfs -moveFromLocal ~/nutch/urls urls
 
Check whether the folder exists; the urls folder will be created under /user/hadoop
 
bin/hadoop dfs -ls
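The seed file itself can also be checked in HDFS (assuming the paths above):

bin/hadoop dfs -cat urls/seed.txt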
 
Finally, run the Nutch crawl on Hadoop and add the data to the Solr index
 
 
bin/hadoop jar /home/hadoop/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.Crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 5
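Once the job finishes, the crawled pages should appear in the Solr index; assuming the default collection1 core, a quick check is:

# return the first five indexed documents as JSON
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"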
 
Using the Hadoop web interfaces
 
You can check the status of the Hadoop jobs that are currently running through the JobTracker and HDFS web interfaces
 
 http://<your domain>:50030 – Job tracker interface
 
 http://<your domain>:50070 – HDFS interface
 
References:
 
http://www.cyberciti.biz/faq/howto-add-new-linux-user-account/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://wiki.apache.org/nutch/NutchTutorial
http://www.sysadminhub.in/2013/07/installing-apache-ant-on-linux-centos.html
 
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
 
 
 

 
