[Python] Crawling a website with Scrapy

[Source] http://isbullsh.it/2012/04/Web-crawling-with-scrapy/

Crawl a website with scrapy

Written by Balthazar

Introduction

In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with Scrapy, a very powerful, and yet simple, scraping and web-crawling framework.

For example, you might be interested in scraping information about each article of a blog and storing that information in a database. To achieve such a thing, we will see how to implement a simple spider using Scrapy, which will crawl the blog and store the extracted data in a MongoDB database.

We will assume that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.
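For reference, the installation typically boils down to something like:

$ pip install scrapy pymongo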

If you have never toyed around with Scrapy, you should first read this short tutorial.

First step, identify the URL pattern(s)

In this example, we’ll see how to extract the following information from each isbullsh.it blogpost:

  • title
  • author
  • tag
  • release date
  • url

We’re in luck: all posts share the same URL pattern, http://isbullsh.it/YYYY/MM/title. These links can be found on the different pages of the site homepage.

What we need is a spider which will follow all links following this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.

Building the spider

We create a Scrapy project, following the instructions from their tutorial. We obtain the following project structure:

isbullshit_scraping/
├── isbullshit
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── isbullshit_spiders.py
└── scrapy.cfg
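For reference, the inner isbullshit package and scrapy.cfg are what Scrapy's project generator produces, with a command along these lines:

$ scrapy startproject isbullshit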

We begin by defining, in items.py, the item structure which will contain the extracted information:

from scrapy.item import Item, Field

class IsBullshitItem(Item):
    title = Field()
    author = Field()
    tag = Field()
    date = Field()
    link = Field()

Now, let’s implement our spider, in isbullshit_spiders.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://isbullsh.it']  # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://isbullsh.it/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://isbullsh.it/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost')
    ]

    def parse_blogpost(self, response):
        ...

Our spider inherits from CrawlSpider, which “provides a convenient mechanism for following links by defining a set of rules”. More information is available in the CrawlSpider documentation.

We then define two simple rules:

  • Follow links pointing to http://isbullsh.it/page/X
  • Extract information from pages defined by a URL of pattern http://isbullsh.it/YYYY/MM/title, using the callback method parse_blogpost.
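To convince ourselves that these regular expressions match the two kinds of URLs, here is a quick check with Python's re module (the sample URLs are just for illustration):

import re

# the two patterns used in the spider's rules
page_pattern = re.compile(r'page/\d+')
post_pattern = re.compile(r'\d{4}/\d{2}/\w+')

assert page_pattern.search('http://isbullsh.it/page/3')
assert post_pattern.search('http://isbullsh.it/2012/04/Web-crawling-with-scrapy/')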

Extracting the data

To extract the title, author, etc., from the HTML code, we’ll use the scrapy.selector.HtmlXPathSelector object, which uses the libxml2 HTML parser. If you’re not familiar with this object, you should read the XPathSelector documentation.


We’ll now define the extraction logic in the parse_blogpost method (I’ll only show it for the title and tag(s); it’s pretty much always the same logic, and a sketch for the remaining fields follows the code):

def parse_blogpost(self, response):
    hxs = HtmlXPathSelector(response)
    item = IsBullshitItem()
    # Extract title
    item['title'] = hxs.select('//header/h1/text()').extract() # XPath selector for title
    # Extract tag(s)
    item['tag'] = hxs.select("//header/div[@class='post-data']/p/a/text()").extract() # XPath selector for tag(s)
    ...
    return item
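The remaining fields follow the same logic. Here is a possible completion; the XPath expressions for author and date are assumptions about the page markup, not taken from the original article, so adapt them after inspecting the actual HTML:

    # Extract author (XPath assumed, adjust to the real markup)
    item['author'] = hxs.select("//header/div[@class='post-data']/p/strong/text()").extract()
    # Extract release date (XPath assumed, adjust to the real markup)
    item['date'] = hxs.select("//header/div[@class='post-data']/p/time/text()").extract()
    # The post URL is simply the URL of the response being parsed
    item['link'] = response.url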

Note: To be sure of the XPath selectors you define, I’d advise you to use Firebug, the Firefox inspector, or an equivalent tool to inspect the HTML code of a page, and then test the selector in a Scrapy shell. That only works if the data layout is consistent across all the pages you crawl.
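For example, a quick interactive check might look like this (in the Scrapy versions of that era the shell exposed an hxs selector; the exact variable name depends on your version):

$ scrapy shell http://isbullsh.it/2012/04/Web-crawling-with-scrapy/
>>> hxs.select('//header/h1/text()').extract()   # should return the post title as a list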

Store the results in MongoDB

Each time the parse_blogpost method returns an item, we want it to be sent to a pipeline which will validate the data and store everything in our Mongo collection.

First, we need to add a couple of things to settings.py:

ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBPipeline',]

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "isbullshit"
MONGODB_COLLECTION = "blogposts"
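Note that in more recent Scrapy versions ITEM_PIPELINES is expected to be a dict mapping each pipeline to an order value, so with a newer install the first line would look more like:

ITEM_PIPELINES = {'isbullshit.pipelines.MongoDBPipeline': 300}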

Now that we’ve defined our pipeline, our MongoDB database and collection, we’re just left with the pipeline implementation. We just want to be sure that we do not have any missing data (e.g. a blogpost without a title, author, etc.).

Here is our pipelines.py file:

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # here we only check that no field is empty,
            # but we could do any crazy validation we want
            if not item[data]:
                valid = False
                raise DropItem("Missing %s of blogpost from %s" % (data, item['link']))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item written to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
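One caveat: pymongo.Connection has long been deprecated in favour of MongoClient, so with a recent pymongo the constructor would look roughly like this:

    def __init__(self):
        client = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = client[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]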

Release the spider!

Now, all we have to do is change directory to the root of our project and execute

$ scrapy crawl isbullshit

The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc, validate the extracted data, and store all that in a MongoDB collection if validation went well.
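If you just want to eyeball the scraped items without touching MongoDB, Scrapy's feed exports can dump them to a file instead, for example:

$ scrapy crawl isbullshit -o items.json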

Pretty neat, hm?

Conclusion

This case is pretty simplistic: all URLs have a similar pattern and all links are hard-coded in the HTML: there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out Selenium. You could make the spider more complex by adding new rules or more complicated regular expressions, but I just wanted to demo how Scrapy works, not to get into crazy regex explanations.

Also, be aware that sometimes there’s a thin line between playing with web-scraping and getting into trouble.

Finally, when toying with web-crawling, keep in mind that you might just flood the server with requests, which can sometimes get you IP-blocked :)
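Scrapy lets you throttle the crawl from settings.py; adding something along these lines keeps the request rate polite:

DOWNLOAD_DELAY = 0.5                  # wait (in seconds) between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap the number of parallel requests per domain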

Please, don’t be a d*ck.

See code on Github

The entire code of this project is hosted on Github. Help yourselves!
