[github] Awesome-crawler 멋진 웹 크롤러 프로젝트

Awesome-crawler 

A collection of awesome web crawler,spider and resources in different languages.

Contents

Python

  • Scrapy - A fast high-level screen scraping and web crawling framework.
  • pyspider - A powerful spider system.
  • CoCrawler - A versatile web crawler built using modern tools and concurrency.
  • cola - A distributed crawling framework.
  • Demiurge - PyQuery-based scraping micro-framework.
  • Scrapely - A pure-python HTML screen-scraping library.
  • feedparser - Universal feed parser.
  • you-get - Dumb downloader that scrapes the web.
  • Grab - Site scraping framework.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - Visual scraping for Scrapy.
  • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A simple ,easy spider using gevent and js render.
  • brownant - A lightweight web data extracting framework.
  • PSpider - A simple spider frame in Python3.
  • Gain - Web crawling framework based on asyncio for everyone.
  • sukhoi - Minimalist and powerful Web Crawler.
  • spidy - The simple, easy to use command line web crawler.
  • newspaper - News, full-text, and article metadata extraction in Python 3
  • aspider - An async web scraping micro-framework based on asyncio.

Java

  • ACHE Crawler - An easy to use web crawler for domain-specific search.
  • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML information extraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - A easy to use lightweight web crawler
  • WebCollector - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Spiderman - A scalable ,extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework,support js render.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.
  • StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
  • Spark-Crawler - Evolving Apache Nutch to run on Spark.
  • webBee - A DFS web spider.
  • spider-flow - A visual spider framework, it's so good that you don't need to write any code to crawl the website.

C#

  • ccrawler - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can saparate between the web page depending on their content.
  • SimpleCrawler - Simple spider base on mutithreading, regluar expression.
  • DotnetSpider - This is a cross platfrom, ligth spider develop by C#.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
  • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
  • Infinity Crawler - A simple but powerful web crawler library in C#.

JavaScript

  • scraperjs - A complete and versatile web scraper.
  • scrape-it - A Node.js scraper for humans.
  • simplecrawler - Event driven web crawler.
  • node-crawler - Node-crawler has clean,simple api.
  • js-crawler - Web crawler for Node.JS, both HTTP and HTTPS are supported.
  • webster - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
  • x-ray - Web scraper with pagination and crawler support.
  • node-osmosis - HTML/XML parser and web scraper for Node.js.
  • web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension.
  • supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
  • headless-chrome-crawler - Headless Chrome crawls with jQuery support
  • Squidwarc - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

PHP

  • Goutte - A screen scraping and web crawling library for PHP.
  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • QueryList - The progressive PHP crawler framework.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.
  • spatie/crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
  • crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.

C++

C

  • httrack - Copy websites to your computer.

Ruby

  • Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
  • upton - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
  • Spidr - Spider a site ,multiple domains, certain links or infinitely.
  • Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
  • mechanize - Automated web interaction & crawling.

R

  • rvest - Simple web scraping for R.

Erlang

  • ebot - A scalable, distribuited and highly configurable web cawler.

Perl

  • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

  • pholcus - A distributed, high concurrency and powerful web crawler.
  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • go_spider - An awesome Go concurrent Crawler(spider) framework.
  • dht - BitTorrent DHT Protocol && DHT Spider.
  • ants-go - A open source, distributed, restful crawler engine in golang.
  • scrape - A simple, higher level interface for Go web scraping.
  • creeper - The Next Generation Crawler Framework (Go).
  • colly - Fast and Elegant Scraping Framework for Gophers.
  • ferret - Declarative web scraping.
  • Dataflow kit - Extract structured data from web pages. Web sites scraping.
  • Hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler(spider) framework, inspired by scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

 

[출처] https://github.com/BruceDone/awesome-crawler#c-1

 

 

 

본 웹사이트는 광고를 포함하고 있습니다.
광고 클릭에서 발생하는 수익금은 모두 웹사이트 서버의 유지 및 관리, 그리고 기술 콘텐츠 향상을 위해 쓰여집니다.
번호 제목 글쓴이 날짜 조회 수
1060 [ 一日30分 인생승리의 학습법] 셀레니움 헤드리스 테스트를 위한 HTMLUnitDriver 및 PhantomJS file 졸리운_곰 2021.11.26 0
1059 [ 一日30分 인생승리의 학습법] 메타버스로 날개 단 오픈소스 프로젝트 file 졸리운_곰 2021.11.23 5
1058 [ 一日30分 인생승리의 학습법] Best JavaScript machine learning libraries in 2021 file 졸리운_곰 2021.11.20 1
1057 [ 一日30分 인생승리의 학습법] 프로그래밍 언어별 딥러닝 라이브러리 정리 file 졸리운_곰 2021.11.19 2
1056 [오픈소스 라이센스 상용화 라이센스 검토] [Software] 공개 SW 라이센스(GPL, LGPL, BSD) 졸리운_곰 2021.11.18 3
1055 [프론트앤드 프레임워크] 프론트엔드 프레임워크 트렌드(Angular / React / Vue.js) file 졸리운_곰 2021.11.17 1
1054 [프론트앤드 프레임워크] [FE] 프론트엔드 3대장 비교와 주관적인 최신 웹 동향에 대해 (feat. React를 기반으로) file 졸리운_곰 2021.11.17 5
1053 [분석 및 설계] DDD(Domain Driven Design) Domain Driven Design에 대해 알아보자 file 졸리운_곰 2021.11.17 1
1052 [분석 및 설계] DDD(Domain Driven Design) - 도메인 주도 설계란? 마이크로서비스의 관점에서 file 졸리운_곰 2021.11.17 1
1051 [분석 및 설계] DDD 핵심만 빠르게 이해하기 file 졸리운_곰 2021.11.17 3
» [github] Awesome-crawler 멋진 웹 크롤러 프로젝트 졸리운_곰 2021.11.09 2
1049 [인공지능] 지식표현과 추론 file 졸리운_곰 2021.11.07 3
1048 File Encodeing Converter (파일 인코딩 일괄 변환기) file 졸리운_곰 2021.10.02 3
1047 CLIPS 6.0 User's Guide May 28, 1993 by Joseph C. Giarratano, Ph.D. file 졸리운_곰 2021.09.28 4
1046 [一日30分 인생승리의 학습법] [알쓸IT잡] 가상현실(VR), 증강현실(AR), 혼합현실(MR)을 아우르는 확장현실(XR, eXtended Reality) file 졸리운_곰 2021.09.23 7
1045 스프레드시트를 생산성 도구로...'구글 테이블' A to Z file 졸리운_곰 2021.09.17 7
1044 칼럼 | 하둡의 실패 넘어선다··· 오픈 데이터 분야를 견인하는 4가지 기술 동향 file 졸리운_곰 2021.09.17 2
1043 [一日30分 인생승리의 학습법] REST API 제대로 알고 사용하기 file 졸리운_곰 2021.09.06 7
1042 데이터를 가치있는 ‘자산’으로 만들기··· '5가지 지침' Martha Heller | CIO file 졸리운_곰 2021.09.02 9
1041 [CIOKorea] 데이터를 가치있는 '자산'으로 만들기 - '5가지 지침' 졸리운_곰 2021.09.02 3
대표 김성준 주소 : 경기 용인 분당수지 U타워 등록번호 : 142-07-27414
통신판매업 신고 : 제2012-용인수지-0185호 출판업 신고 : 수지구청 제 123호 개인정보보호최고책임자 : 김성준 sjkim70@stechstar.com
대표전화 : 010-4589-2193 [fax] 02-6280-1294 COPYRIGHT(C) stechstar.com ALL RIGHTS RESERVED