Building a Search Engine With Nutch Solr And Hadoop
2016.04.21 21:45
The World Wide Web (WWW) is an ocean of information; in fact, a kid could stay at home from birth and learn everything through the WWW without going to school a single day (despite the fact that he would become a complete idiot at the end).
Swimming through this ocean and finding what is relevant to you is the duty of a web search engine. It has to crawl the web, learn which site contains which information, and keep a record of it so that it can return relevant URLs and data. Google has long been the leading web search engine, but there are now others, such as Microsoft's Bing, which some say presents more focused results than Google.
The following steps describe building a search engine using Nutch and Solr; I did this for my organization, IMS Health, as a search engine for life sciences. Nutch and Solr are Apache projects and hence open source. Nutch is a web crawler, which means it takes a list of URLs and brings in the data contained on those web pages. Solr is an open source search platform based on the Apache Lucene engine. Web crawling is a tremendous task that requires extensive processing power and network bandwidth, so I run the Nutch crawler on top of Hadoop, another Apache product. Hadoop is a distributed processing and storage system that can be implemented on commodity hardware without the need for high-end servers.
Software and Applications Used:
- Apache Nutch 1.7 as the crawler
- Apache Solr 4.6.1 as the search indexing engine
- Apache Hadoop 1.2.1 as the distributed processing environment
- Apache Ant builder tool
- Java Version 1.7.0
- CentOS 5.9 on the server
- SSH configured on the server
- Create a dedicated user named hadoop with password hadoop for this operation
- Switch to the hadoop user
- Enable SSH access to the local machine with a new key
- Test the SSH connection to the local machine
- Download Apache Solr (the latest version at the time of writing is 4.6.1)
- Extract the download
- Rename the directory as solr
- Navigate to the example directory inside solr and start solr using start.jar
- This will start Solr on Jetty with the default port of 8983; you can check the Solr admin panel at http://<your domain>:8983/solr/
- Download apache ant binary
- Extract downloaded source
- Set the ANT_HOME variable in /etc/profile
- Add the following lines at the end of the profile
- Reload the profile to apply the variables
Note: the URLs used in the wget downloads change over time; searching through the Apache archives will give you the accurate URLs (example wget commands are shown below).
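For reference, the downloads can be pulled from the Apache archives with wget. The exact URLs below are only a sketch and may need to be adjusted to whatever the archive currently hosts (check archive.apache.org if any of them fail):
wget https://archive.apache.org/dist/lucene/solr/4.6.1/solr-4.6.1.tgz
wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.2-bin.tar.gz
wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
wget https://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz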
Dedicated User creation:
useradd hadoop
passwd hadoop
> hadoop
su - hadoop
Configure SSH:
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
Apache Solr Installation:
tar zxf solr-4.6.1.tgz
sudo mv solr-4.6.1 solr
cd solr/example
java -jar start.jar
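To confirm Solr came up without opening a browser, you can hit the core admin status API with curl; this assumes the default example setup with its collection1 core:
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"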
Apache Ant Installation
tar -zxf apache-ant-1.9.2-bin.tar.gz
sudo vi /etc/profile
export ANT_HOME=/apache-ant-1.9.2
export PATH=$PATH:$ANT_HOME/bin
source /etc/profile
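As a quick check that Ant is now on the PATH after reloading the profile:
ant -version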
Apache Hadoop Installation
Download Hadoop 1.2.1
Extract and rename
sudo tar xzf hadoop-1.2.1.tar.gz
sudo mv hadoop-1.2.1 hadoop
sudo chown -R hadoop:hadoop hadoop
Update /etc/profile
sudo vi /etc/profile
Add the following lines at the end of the profile
# Set Hadoop-related environment variables
export HADOOP_HOME=~/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
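After saving the profile, reload it and confirm that the hadoop command resolves (a quick sanity check, not part of the original steps):
source /etc/profile
hadoop version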
Edit hadoop-env.sh and add java home
cd ~/hadoop
sudo vi conf/hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
Create the Hadoop tmp folder and give ownership to the hadoop user
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop:hadoop /app/hadoop/tmp
Edit conf/core-site.xml
sudo vi conf/core-site.xml
Add the following lines
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
Edit conf/mapred-site.xml
sudo vi conf/mapred-site.xml
Add the following lines
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
Edit conf/hdfs-site.xml
sudo vi conf/hdfs-site.xml
Add the following lines
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Format the HDFS via the single name node
bin/hadoop namenode -format
Start the single node cluster
bin/start-all.sh
Use jps to check if all the tasks are running
jps
The output should display NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (along with the Jps process itself)
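On a healthy single-node setup, jps lists one JVM per daemon; the process IDs below are only illustrative:
12305 NameNode
12412 DataNode
12521 SecondaryNameNode
12610 JobTracker
12718 TaskTracker
12801 Jps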
Hadoop Debugging tips
In order to restart hadoop, perform the following steps:
cd ~/hadoop
bin/stop-all.sh
Delete the hadoop temp folder and recreate it with necessary access rights
sudo rm -r /app/hadoop/tmp
sudo mkdir -p /app/hadoop/tmp
sudo chmod -R 777 /app/hadoop/tmp (777 is given for this example purpose only)
bin/hadoop namenode -format
bin/start-all.sh
If any of the components do not start up, check the Hadoop logs for exceptions. The logs are located in the logs/ subdirectory under the Hadoop root, and a separate log is created for each of the five Hadoop daemons.
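For example, if the NameNode fails to start, its log usually points to the cause; in Hadoop 1.x each daemon log is named hadoop-<user>-<daemon>-<hostname>.log:
cd ~/hadoop
tail -n 100 logs/hadoop-hadoop-namenode-*.log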
Apache Nutch Installation
Download nutch version 1.7 (download the src in order to run with hadoop)
Extract the download and rename the folder
sudo tar zxf apache-nutch-1.7-src.tar.gz
sudo mv apache-nutch-1.7-src nutch
Copy conf/nutch-default.xml to conf/nutch-site.xml
cd nutch/conf
sudo mv nutch-site.xml nutch-site_bak.xml
sudo cp nutch-default.xml nutch-site.xml
Edit nutch-site.xml and set the name of your crawler for the http.agent.name property
sudo vi nutch-site.xml
edit:
<name>http.agent.name</name>
<value>Your Name</value>
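Inside nutch-site.xml the setting lives in a standard property element; a minimal sketch of the complete block (the agent name is just an example value):
<property>
<name>http.agent.name</name>
<value>MyNutchCrawler</value>
<description>HTTP User-Agent header sent by the crawler.</description>
</property>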
Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from ~/hadoop/conf to ~/nutch/conf
sudo cp ~/hadoop/conf/hadoop-env.sh ~/nutch/conf
sudo cp ~/hadoop/conf/hdfs-site.xml ~/nutch/conf
sudo cp ~/hadoop/conf/mapred-site.xml ~/nutch/conf
sudo cp ~/hadoop/conf/core-site.xml ~/nutch/conf
Edit default.properties in the nutch root directory
cd ~/nutch
sudo vi default.properties
Change name from apache-nutch to nutch in default.properties
Build nutch using ant
cd ~/nutch
sudo ant runtime
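The ant build produces both a local runtime and a Hadoop job file; the job jar used later for the crawl should now exist under runtime/deploy:
ls ~/nutch/runtime/deploy/
The listing should include nutch-1.7.job (the name reflects the change made in default.properties above).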
Create classpath
export CLASSPATH=~/nutch/runtime/local/lib
Making Solr compatible with Nutch
Rename schema.xml in solr/example/solr/collection1/conf
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
Move the nutch schema-solr4.xml into solr/example/solr/collection1/conf and rename to schema.xml
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
Edit the new schema.xml
vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
Copy the following field definition in exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
Running the crawl
Create the URLs seed file in nutch
cd ~/nutch
sudo mkdir urls
cd urls
sudo nano seed.txt
Add the necessary URLs to this file, one per line (two example URLs are shown below)
http://diabetesdailypost.com/
http://www.allfortheboys.com/
Put the urls folder to HDFS
cd ~/hadoop
bin/hadoop dfs -moveFromLocal ~/nutch/urls urls
Check whether the folder exists; the urls directory will be created under /user/hadoop
bin/hadoop dfs -ls
Finally run the nutch crawl on hadoop and add the data to solr index
bin/hadoop jar /home/hadoop/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.Crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 5
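Once the crawl and indexing finish, you can check that documents actually landed in the Solr index by querying it directly; the query term below is only an example:
curl "http://localhost:8983/solr/select?q=diabetes&wt=json&rows=5"
A JSON response with numFound greater than zero means the crawl data has been indexed.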
Using the Hadoop web interfaces
You can check the status of the hadoop jobs that are currently running with the job tracker and HDFS web interfaces
http://<your domain>:50030 – JobTracker interface
http://<your domain>:50070 – NameNode (HDFS) interface
References:
http://www.cyberciti.biz/faq/howto-add-new-linux-user-account/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://wiki.apache.org/nutch/NutchTutorial
http://www.sysadminhub.in/2013/07/installing-apache-ant-on-linux-centos.html
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/