Crawling with OpenWebSpider

 

Spiderman, Where are You Coming From Spiderman?

Sometimes you just need the right tools to help you do what you want. This is especially true when the tool you are limited to has a bunch of restrictions. I ran into this particular problem the other day when I wanted to "spider" a website to find out all its possible URLs being served, including the URLs that are ignored by the usual "robots.txt" file sitting on the web server.

I actually had to find a way around the "robots.txt" file for a web site that was blocking web crawlers from finding most of the URLs that were available. Imagine a spider that is crawing through a tunnel, looking for all the tasty bugs - and there is a giant boulder in the way providing the biggest obstacle possible to getting those bugs it wants to feed on.

My requirements for a spider alternative (if possible) were:
  • free
  • no lengthy configuration
  • a GUI
  • a way of ignoring the "robots.txt" file's instructions
  • the search results in a spreadsheet file
  • able to run on Windows (and Linux too, if possible)
After some googling, I experienced true frustration at only being able to find command-line libraries or gigantic, enterprise-level Java applications. Then I came across an open source project named OpenWebSpider. The project's website is a bit confusing but it's worth persevering through.

OpenWebSpider is a neat application written in node.js, which crawls a website and saves its results to a database that you nominate. The project seems to be reasonably active and at the time of writing is hosted on SourceForge. To get the application going in Windows, first you'll need to do a few things.
 

Install WAMP Server

OK, so the database I wanted to store my results in was MySQL. One of the best ways to get this (and other nice bits) in Windows is to install the handy WAMP server package. (However, the providers do warn you that you might first need a Visual Studio C++ 2012 redistributable package installed.)

One you do get WAMP Server installed, you'll have a nice new "W" icon in your system tray. Click on it, then click Start All Services.
 


You can view the MySQL database in the phpMyAdmin interface, just to be nice and friendly. In your browser then, go to:
http://localhost/phpmyadmin

Note: if you have any troubles with phpMyAdmin, you might have to edit its config file, usually in this location:
C:\wamp\apps\phpmyadmin4.x\config.inc.php

In phpMyAdmin, create a database to store your spidering results (I've named my database "ows"). Also create a user for the database.
 

Install node.js

Node.js is one of the most wonderful Javascript frameworks out there, and it's what OpenWebSpider is written with. So go to the node.js project site, download the installer and run it. Check that node.js is working OK by opening a command shell window and typing the word "node".
 
 
 
 
 
 
 
 

 

Get OpenWebSpider and Create the Database's Schema

Download the OWS zip file from Sourceforge. Unzip the file. Inside the project folder, you'll see a number of files. Double-click the "openwebspider.bat" file to launch the application in a little shell window.
 












Then, as the readme.txt instruction file tells you:
  • Open a web-browser at http://127.0.0.1:9999/
  • Go in the third tab (Database) and configure your settings
  • Verify that openwebspider correctly connects to your server by clicking the "Verify" button
  • "Save" your configuration
  • Click "Create DB"; this will create all tables needed by OpenWebSpider

Now, check your database's new tables in phpMyAdmin:
 
 

 Start Spidering!

Go to the OWS browser view, ie. at http://127.0.0.1:9999/. Click on the Worker tab and alter any settings you might thing useful. Enter the URL of the site you want to crawl in the URL box (and make sure "http://www" is in front if needed). Then hit the Go! button.
 
 
You should then be bounced to the Workers tab. Here you'll get see real-time progress of the site crawling as it happens. You can click on the second-tier History tab if you miss the crawler finishing its run.

View Your Results

In phpMyAdmin, click on the pages table and you should automatically see a view of the crawling results. You might not need all the columns you'll see, in fact my usual SQL query to run is just something like:
select hostname, page, title from pages where hostname = 'www.website.com';
 
 


And of course, anyone with half a brain knows you can export data from phpMyAdmin as a CSV file. Then you can view your data by importing the CSV file into a spreadsheet application.
 
 
 

Ignore Those Pesky Robots

Please note, the next instruction is for the version of OpenWebSpider from October 2015. It probably is outdated for the latest version now.

One of the requirements was that we could bypass the "robots.txt" file, which good web crawlers by default must follow. What you'll need to do is close OpenWebSpider, and just edit one source code file. Find this file and open it up in a text editor: openwebspider\src\_worker\_indexerMixin.js.

It's a Javascript file, so all you need to do is comment out these lines (with // symbols):
 // if (!canFetchPage || that.stopSignal === true)
        // {
            // if (!canFetchPage)
            // {
                // msg += "\n\t\t blocked by robots.txt!";
            // }
            // else
            // {
                // msg += "\n\t\t stop signal!";
            // }

            // logger.log("index::url", msg);

            // callback();
            // return;
        // }

Re-save the file. Then launch OWS again, and try another search. You'll usually find the number of found URL results has gone up, since the crawler is now ignoring the instructions in the "robots.txt" file.
 

Everyday Use

It's a good idea to make a desktop shortcut for the "openwebspider.bat" file. So, if you want to use OpenWebSpider regularly, what you'll need to remember to do each time you need it is:
  • Start up the WAMP server in the system tray
  • On the desktop, double-click the "openwebspider.bat" shortcut icon
  • In a browser, go to localhost:9999 (Openwebspider) to run the worker
  • In a browser, go to localhost/phpmyadmin to see your results


Happy spidering!

경축! 아무것도 안하여 에스천사게임즈가 새로운 모습으로 재오픈 하였습니다.
어린이용이며, 설치가 필요없는 브라우저 게임입니다.
https://s1004games.com

 

[출처] http://scriptsonscripts.blogspot.kr/2015/10/crawling-with-openwebspider.html

본 웹사이트는 광고를 포함하고 있습니다.
광고 클릭에서 발생하는 수익금은 모두 웹사이트 서버의 유지 및 관리, 그리고 기술 콘텐츠 향상을 위해 쓰여집니다.
번호 제목 글쓴이 날짜 조회 수
1195 [ 一日30分 인생승리의 학습법] VBA Web Scraping: How Can VBA Be Used To Scrape Website Data? file 졸리운_곰 2024.04.13 3
1194 [ 一日30分 인생승리의 학습법] 윈도우 실행파일 구조(PE파일) file 졸리운_곰 2024.03.31 3
1193 [ 一日30分 인생승리의 학습법] [Analysis] PE(Portable Executable) 파일 포맷 공부 file 졸리운_곰 2024.03.31 3
1192 [ 一日30分 인생승리의 학습법] 성공하는 메타버스의 3가지 조건 file 졸리운_곰 2024.03.30 7
1191 [ 一日30分 인생승리의 학습법] REST, REST API, RESTful 과 HATEOAS file 졸리운_곰 2024.03.10 9
1190 [ 一日30分 인생승리의 학습법] 렌더링 삼형제 CSR, SSR, SSG 이해하기 file 졸리운_곰 2024.03.10 2
1189 [ 一日30分 인생승리의 학습법] 엑셀 VBA에서 셀레니움 사용을 위한 Selenium Basic 설치 file 졸리운_곰 2024.02.23 11
1188 [ 一日30分 인생승리의 학습법]500 Lines or Less Blockcode: A Visual Programming Toolkit : 500줄 이하의 블록코드: 시각적 프로그래밍 툴킷 졸리운_곰 2024.02.12 4
1187 [ 一日30分 인생승리의 학습법] 구글 클라이언트(앱) 아이디를 발급받으려면 어떻게 해야 하나요? 졸리운_곰 2024.01.28 3
1186 [ 一日30分 인생승리의 학습법] 빅뱅 프로젝트를 성공적으로 오픈하기 위한 팁 졸리운_곰 2023.12.27 16
1185 [ 一日30分 인생승리의 학습법]“빅뱅 전환보다 단계적 전환 방식이 이상적 애자일팀과 협업 쉽게 체질 개선을” file 졸리운_곰 2023.12.27 12
1184 [ 一日30分 인생승리의 학습법] Big-bang / phased 접근 file 졸리운_곰 2023.12.27 3
1183 [ 一日30分 인생승리의 학습법] CodeDragon 메뉴 데이터 전환의 개념 이해 - 데이터 전환의 개념, 데이터 전환방식, 데이터 전환방식 및 장단점 비교, 데이터전환 이후 검토해야 할 사항 졸리운_곰 2023.12.27 5
1182 [ 一日30分 인생승리의 학습법] 블록체인과 IPFS를 이용한 안전한 데이터 공유 플랫폼 - 분쟁 해결 시스템 file 졸리운_곰 2023.12.27 6
1181 [ 一日30分 인생승리의 학습법] 블록체인과 IPFS를 이용한 안전한 데이터 공유 플랫폼 - 개념과 리뷰 시스템 file 졸리운_곰 2023.12.27 4
1180 [ 一日30分 인생승리의 학습법] 소켓 CLOSE_WAIT 발생 현상 및 처리 방안 file 졸리운_곰 2023.12.03 7
1179 [ 一日30分 인생승리의 학습법] robots 설정하기 졸리운_곰 2023.12.03 3
1178 [ 一日30分 인생승리의 학습법] A Tutorial and Elementary Trajectory Model for the Differential Steering System of Robot Wheel Actuators : 로봇 휠 액츄에이터의 차동 조향 시스템에 대한 튜토리얼 및 기본 궤적 모델 file 졸리운_곰 2023.11.29 6
1177 [ 一日30分 인생승리의 학습법] Streamline Your MLOps Journey with CodeProject.AI Server : CodeProject.AI 서버로 MLOps 여정을 간소화하세요 file 졸리운_곰 2023.11.25 2
1176 [ 一日30分 인생승리의 학습법] Comparing Self-Hosted AI Servers: A Guide for Developers / : 자체 호스팅 AI 서버 비교: 개발자를 위한 가이드 file 졸리운_곰 2023.11.25 10
대표 김성준 주소 : 경기 용인 분당수지 U타워 등록번호 : 142-07-27414
통신판매업 신고 : 제2012-용인수지-0185호 출판업 신고 : 수지구청 제 123호 개인정보보호최고책임자 : 김성준 sjkim70@stechstar.com
대표전화 : 010-4589-2193 [fax] 02-6280-1294 COPYRIGHT(C) stechstar.com ALL RIGHTS RESERVED