Crawling with OpenWebSpider

 

Spiderman, Where Are You Coming From, Spiderman?

Sometimes you just need the right tools to do what you want. This is especially true when the tool you're limited to comes with a bunch of restrictions. I ran into this particular problem the other day when I wanted to "spider" a website to find all of the URLs it serves, including the ones disallowed by the usual "robots.txt" file sitting on the web server.

I needed a way around the "robots.txt" file for a website that was blocking web crawlers from finding most of the URLs it actually served. Imagine a spider crawling through a tunnel, looking for all the tasty bugs, with a giant boulder in the way standing between it and the bugs it wants to feed on.

My requirements for a spider alternative (if possible) were:
  • free
  • no lengthy configuration
  • a GUI
  • a way of ignoring the "robots.txt" file's instructions
  • a way to get the results into a spreadsheet file
  • able to run on Windows (and Linux too, if possible)
After some googling, I experienced true frustration at only being able to find command-line libraries or gigantic, enterprise-level Java applications. Then I came across an open source project named OpenWebSpider. The project's website is a bit confusing but it's worth persevering through.

OpenWebSpider is a neat application, written in node.js, that crawls a website and saves its results to a database you nominate. The project seems to be reasonably active and at the time of writing is hosted on SourceForge. To get the application going in Windows, you'll first need to do a few things.
 

Install WAMP Server

OK, so the database I wanted to store my results in was MySQL. One of the best ways to get it (and other nice bits) on Windows is to install the handy WAMP server package. (The providers do warn you that you might first need the Visual C++ 2012 redistributable package installed.)

Once you get WAMP Server installed, you'll have a nice new "W" icon in your system tray. Click on it, then click Start All Services.
 


You can view the MySQL database in the nice and friendly phpMyAdmin interface. In your browser, go to:
http://localhost/phpmyadmin

Note: if you have any trouble with phpMyAdmin, you might have to edit its config file, usually found at:
C:\wamp\apps\phpmyadmin4.x\config.inc.php

In phpMyAdmin, create a database to store your spidering results (I've named my database "ows"). Also create a user for the database.
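
If you'd rather type than click, the same step from phpMyAdmin's SQL tab (or the mysql command line) looks something like the sketch below; the database name "ows" matches the above, while the user name and password are just examples to swap for your own:

CREATE DATABASE ows CHARACTER SET utf8;

-- an example user for OpenWebSpider to connect as
CREATE USER 'ows'@'localhost' IDENTIFIED BY 'choose-a-password';
GRANT ALL PRIVILEGES ON ows.* TO 'ows'@'localhost';
FLUSH PRIVILEGES;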
 

Install node.js

Node.js is one of the most wonderful JavaScript runtimes out there, and it's what OpenWebSpider is written in. So go to the node.js project site, download the installer and run it. Check that node.js is working OK by opening a command shell window and typing the word "node".
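
For example, here's a quick sanity check from the shell ("node -v" prints the installed version, and typing "node" alone opens a little interactive prompt that you can leave with ".exit"):

C:\> node -v
C:\> node
> 1 + 1
2
> .exit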

Get OpenWebSpider and Create the Database's Schema

Download the OWS zip file from SourceForge. Unzip the file. Inside the project folder, you'll see a number of files. Double-click the "openwebspider.bat" file to launch the application in a little shell window.

Then, as the readme.txt instruction file tells you:
  • Open a web-browser at http://127.0.0.1:9999/
  • Go to the third tab (Database) and configure your settings
  • Verify that openwebspider correctly connects to your server by clicking the "Verify" button
  • "Save" your configuration
  • Click "Create DB"; this will create all tables needed by OpenWebSpider

Now, check your database's new tables in phpMyAdmin:
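
Or do the same from any SQL prompt; the exact table list depends on the OpenWebSpider version, but "pages" is the one we'll query later:

USE ows;
SHOW TABLES;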

Start Spidering!

Go to the OWS browser view, i.e. at http://127.0.0.1:9999/. Click on the Worker tab and alter any settings you might think useful. Enter the URL of the site you want to crawl in the URL box (and make sure "http://www" is in front if needed). Then hit the Go! button.
 
 
You should then be bounced to the Workers tab. Here you'll see real-time progress of the site crawling as it happens. You can click on the second-tier History tab if you miss the crawler finishing its run.

View Your Results

In phpMyAdmin, click on the pages table and you should automatically see a view of the crawling results. You might not need all the columns you'll see; in fact, my usual SQL query is just something like:
select hostname, page, title from pages where hostname = 'www.website.com';

And of course, anyone with half a brain knows you can export data from phpMyAdmin as a CSV file. Then you can view your data by importing the CSV file into a spreadsheet application.
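
If you want to skip the GUI for that step too, MySQL can write the CSV itself, provided your database user has the FILE privilege and the path is allowed by the server's secure_file_priv setting; a sketch, with the output path just an example:

SELECT hostname, page, title
INTO OUTFILE 'C:/temp/ows-results.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM pages
WHERE hostname = 'www.website.com';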

Ignore Those Pesky Robots

Please note: the next instruction is for the version of OpenWebSpider from October 2015, so it may well be outdated for later versions.

One of the requirements was being able to bypass the "robots.txt" file, which well-behaved web crawlers follow by default. What you'll need to do is close OpenWebSpider and edit just one source code file. Find this file and open it up in a text editor: openwebspider\src\_worker\_indexerMixin.js.

It's a JavaScript file, so all you need to do is comment out these lines (with // symbols):
// if (!canFetchPage || that.stopSignal === true)
// {
//     if (!canFetchPage)
//     {
//         msg += "\n\t\t blocked by robots.txt!";
//     }
//     else
//     {
//         msg += "\n\t\t stop signal!";
//     }

//     logger.log("index::url", msg);

//     callback();
//     return;
// }
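
Note that commenting out the whole block also disables the "stop signal" handling that shares the same if statement. If you'd rather keep that behaviour, a lighter-touch variation (my own sketch, derived from the block above rather than anything official) is to drop only the robots.txt half of the condition:

// robots.txt check removed; the stop signal still works
if (that.stopSignal === true)
{
    msg += "\n\t\t stop signal!";
    logger.log("index::url", msg);
    callback();
    return;
}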

Re-save the file. Then launch OWS again and try another crawl. You'll usually find that the number of URLs found has gone up, since the crawler is now ignoring the instructions in the "robots.txt" file.
 

Everyday Use

It's a good idea to make a desktop shortcut for the "openwebspider.bat" file. So, if you want to use OpenWebSpider regularly, here's what you'll need to do each time (there's a batch-file sketch after this list that wraps it all up):
  • Start up the WAMP server in the system tray
  • On the desktop, double-click the "openwebspider.bat" shortcut icon
  • In a browser, go to localhost:9999 (OpenWebSpider) to run the worker
  • In a browser, go to localhost/phpmyadmin to see your results
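
A hypothetical helper batch file covering those steps, assuming the default WAMP path; the OpenWebSpider folder location is a guess you should correct to wherever you unzipped it:

@echo off
rem start-ows.bat - hypothetical helper; adjust the paths to your own installs
start "" "C:\wamp\wampmanager.exe"
rem give WAMP/MySQL a moment to come up (you may still need to click Start All Services in the tray)
timeout /t 10
start "" "C:\openwebspider\openwebspider.bat"
rem open the worker view and phpMyAdmin in the default browser
start "" "http://127.0.0.1:9999/"
start "" "http://localhost/phpmyadmin"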


Happy spidering!

Source: http://scriptsonscripts.blogspot.kr/2015/10/crawling-with-openwebspider.html
