Building a Search Engine With Nutch, Solr and Hadoop

 
 
 
The World Wide Web (WWW) is an ocean of information; in theory a kid could stay at home from birth and learn everything through the WWW without ever going to school (although that would hardly make for a well-rounded person).
 
Swimming through this ocean and finding what is relevant to you is the duty of a web search engine. It has to crawl the web, learn which site contains which information, and keep a record of it so that it can serve relevant URLs and data. Google has long been the leading web search engine, but there are now others such as Microsoft's Bing, which some say presents more focused results than Google.

The following steps describe building a search engine using Nutch and Solr; I did this for my organization, IMS Health, as a search engine for life sciences. Nutch and Solr are Apache Software Foundation projects and hence open source. Nutch is a web crawler, which means it takes a list of URLs and fetches the data contained on those web pages. Solr is an open-source search platform built on the Apache Lucene engine. Web crawling is a tremendous task that requires extensive processing power and network bandwidth, so I run the Nutch crawler on top of Hadoop, another Apache product. Hadoop is a distributed processing and storage system that can be deployed on commodity hardware without the need for high-end servers.
 
In short, Nutch does the crawling on top of Hadoop and hands the results to Solr; Solr builds an index (an inverted index, as Solr describes it) to give amazingly fast search responses to the user, and returns results as XML or JSON.
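As a quick illustration (assuming Solr is running on its default port 8983 with the default collection1 core, as set up later in this guide), a search can be issued over plain HTTP and the response format switched between JSON and XML with the wt parameter:

            # return matching documents as JSON
            curl "http://localhost:8983/solr/collection1/select?q=diabetes&wt=json"

            # the same query, returned as XML (the default)
            curl "http://localhost:8983/solr/collection1/select?q=diabetes&wt=xml"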
 
 
Software/Applications Used:
 
  • Apache Nutch 1.7 as the crawler
  • Apache Solr 4.6.1 as the search indexing engine
  • Apache Hadoop 1.2.1 as the distributed processing environment
  • Apache Ant as the build tool
  • Java version 1.7.0
  • CentOS 5.9 on the server
  • SSH configured on the server
 
Setup steps at a glance:
 
  • Create a dedicated user named hadoop with the password hadoop for this operation
  • Switch to the hadoop user
  • Enable SSH access to the local machine with a new key
  • Test the SSH connection to the local machine
  • Download Apache Solr (the latest version at the time of writing is 4.6.1)
  • Extract the download
  • Rename the directory to solr
  • Navigate to the example directory inside solr and start Solr using start.jar
  • This starts Solr on Jetty at the default port 8983; you can check the Solr admin panel at http://<your domain>:8983/solr/
  • Download the Apache Ant binary
  • Extract the downloaded archive
  • Set the ANT_HOME variable in /etc/profile by adding the export lines at the end of the file
  • Reload the profile so the variable takes effect
 
Note: The URLs mentioned in the wget commands may change over time; searching through the Apache archives will give you the correct URLs.
 
Dedicated User creation:
 
 
useradd hadoop
passwd hadoop    
> hadoop
 
 
Switch to the hadoop user:
 
su - hadoop
 
Configure SSH:
 
   ssh-keygen -t rsa -P ""
 
 
     cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
 
 
 
     ssh localhost
 
Apache Solr Installation:
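Download the release, assuming it is fetched from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/lucene/solr/4.6.1/solr-4.6.1.tgz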
 
 
 
 
 
tar zxf solr-4.6.1.tgz
 
 
            sudo mv solr-4.6.1 solr
 
 
            cd solr/example
            java -jar start.jar
 
 
 
Apache Ant Installation
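Download the Ant binary, assuming it comes from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.2-bin.tar.gz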
 
 
 
 
 
 
             tar -zxf apache-ant-1.9.2-bin.tar.gz
 
 
 
             sudo vi /etc/profile
 
 
 
export ANT_HOME=/apache-ant-1.9.2
export PATH=$PATH:$ANT_HOME/bin
 
 
 
               source /etc/profile
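A quick sanity check that Ant is now on the PATH (assuming the paths above):

               ant -version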
 
Apache Hadoop Installation
 
Download Hadoop 1.2.1
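Assuming the tarball is fetched from the Apache archive (the exact URL may change; see the note above):

                  wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz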
 
 
Extract and rename         
 
                  sudo tar xzf hadoop-1.2.1.tar.gz
 
                  sudo mv hadoop-1.2.1 hadoop
 
                   sudo chown -R hadoop:hadoop hadoop
 
Update /etc/profile
 
 
                    sudo vi  /etc/profile
 
Add the following lines at the end of the profile
 
# Set Hadoop-related environment variables
export HADOOP_HOME=~/hadoop
 
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
 
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
 
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
 
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
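After saving, reload the profile and confirm that the hadoop command resolves (a quick sanity check, assuming the paths above):

source /etc/profile
hadoop version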
 
Edit hadoop-env.sh and add JAVA_HOME
 
cd ~/hadoop
 
                  sudo vi conf/hadoop-env.sh
 
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
 
Create the hadoop tmp folder and give ownership to the hadoop user
 
            sudo mkdir -p /app/hadoop/tmp
 
            sudo chown hadoop:hadoop /app/hadoop/tmp
 
Edit conf/core-site.xml
 
             sudo vi conf/core-site.xml
 
Add the following properties between the <configuration> tags
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
Edit conf/mapred-site.xml
 
                  sudo vi conf/mapred-site.xml
 
Add the following property between the <configuration> tags
 
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
Edit conf/hdfs-site.xml
 
sudo vi conf/hdfs-site.xml
 
Add the following property between the <configuration> tags
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
Format HDFS via the single NameNode
 
                  bin/hadoop namenode -format
 
Start the single node cluster
 
                  bin/start-all.sh
 
Use jps to check if all the tasks are running 
 
                  jps
     
 
The output should display NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
 
Hadoop Debugging tips
 
     In order to restart hadoop, perform the following steps:
 
cd  ~/hadoop
bin/stop-all.sh
 
Delete the hadoop temp folder and recreate it with necessary access rights
 
sudo rm -r /app/hadoop/tmp
sudo mkdir /app/hadoop/tmp
sudo chmod -R 777 /app/hadoop/tmp (777 is used for this example only)
bin/hadoop namenode -format
bin/start-all.sh
 
 
If any of the components do not start up, check the Hadoop logs for exceptions. The logs are located in the logs/ subdirectory under the Hadoop root; a separate log file is created for each of the five Hadoop daemons.
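For example (assuming the default Hadoop 1.x log naming, hadoop-<user>-<daemon>-<hostname>.log), the NameNode log can be inspected with:

tail -n 100 ~/hadoop/logs/hadoop-hadoop-namenode-*.log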
 
Apache Nutch Installation
 
Download Nutch version 1.7 (download the source package, since it needs to be built to run on Hadoop)
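Assuming the source tarball is fetched from the Apache archive (the exact URL may change; see the note above):

            wget https://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz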
 
 
Extract the download and rename the folder
 
sudo tar zxf apache-nutch-1.7-src.tar.gz
            sudo mv apache-nutch-1.7-src nutch
 
Back up the existing conf/nutch-site.xml and copy conf/nutch-default.xml to conf/nutch-site.xml
 
            cd nutch/conf
            sudo mv nutch-site.xml nutch-site_bak.xml
            sudo cp nutch-default.xml nutch-site.xml
 
Edit nutch-site.xml and set the name of your crawler for the http.agent.name property
 
          sudo vi nutch-site.xml
 
          edit:
           <name>http.agent.name</name>
           <value>Your Name</value>
 
Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from ~/hadoop/conf to ~/nutch/conf
 
       sudo cp ~/hadoop/conf/hadoop-env.sh ~/nutch/conf
       sudo cp ~/hadoop/conf/hdfs-site.xml ~/nutch/conf
       sudo cp ~/hadoop/conf/mapred-site.xml ~/nutch/conf
       sudo cp ~/hadoop/conf/core-site.xml ~/nutch/conf
 
Edit default.properties in the Nutch root directory
 
       cd ~/nutch
      sudo vi default.properties      
 
Change the name property from apache-nutch to nutch in default.properties (so the build produces nutch-1.7.job, which is used in the crawl command below)
 
Build nutch using ant
 
            cd ~/nutch
            sudo ant runtime
 
Set the classpath to the Nutch runtime libraries
 
 
            export CLASSPATH=~/nutch/runtime/local/lib 
 
 Making Solr compatible with Nutch
 
Rename schema.xml in solr/example/solr/collection1/conf (here ${APACHE_SOLR_HOME} refers to the Solr directory created earlier, e.g. ~/solr, and ${NUTCH_RUNTIME_HOME} to the built Nutch runtime, e.g. ~/nutch/runtime/local)
 
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
 
Copy the Nutch schema-solr4.xml into solr/example/solr/collection1/conf and rename it to schema.xml
 
cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
 
Edit the new schema.xml
 
vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
 
Add the following line exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
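Alternatively (assuming the default collection1 core from the Solr example setup), the core can be reloaded without restarting Jetty through Solr's CoreAdmin API:

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"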
 
Running the crawl
 
Create the URLs seed file in Nutch
 
cd ~/nutch
sudo mkdir urls
cd urls
sudo nano seed.txt
 
Add the necessary URLs to this file, one per line (example URLs below)
 
http://diabetesdailypost.com/
http://www.allfortheboys.com/
 
Put the urls folder into HDFS
 
cd ~/hadoop
bin/hadoop dfs -moveFromLocal ~/nutch/urls urls
 
Check whether the folder exists; the urls folder will be created under /user/hadoop
 
bin/hadoop dfs -ls
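The seed file itself can also be checked in HDFS (assuming the paths above):

bin/hadoop dfs -cat urls/seed.txt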
 
Finally, run the Nutch crawl on Hadoop and add the data to the Solr index
 
 
bin/hadoop jar /home/hadoop/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.Crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 5
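Once the job finishes, the crawled pages should appear in the Solr index; assuming the default collection1 core, a quick check is:

# return the first five indexed documents as JSON
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"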
 
Using the Hadoop web interfaces
 
You can check the status of the Hadoop jobs that are currently running through the JobTracker and HDFS web interfaces
 
 http://<your domain>:50030 – Job tracker interface
 
 http://<your domain>:50070 – HDFS interface
 
References:
 
http://www.cyberciti.biz/faq/howto-add-new-linux-user-account/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://wiki.apache.org/nutch/NutchTutorial
http://www.sysadminhub.in/2013/07/installing-apache-ant-on-linux-centos.html
 
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
 
 
 

 
