[scrapy] https://pypi.org/project/scrapy-save-as-pdf/

2021.07.03 12:48

Project description

A pipeline that downloads a PDF, or saves a page as a PDF, for a Scrapy item.

Installation

Install scrapy-save-as-pdf using pip:

pip install scrapy-save-as-pdf

Configuration

(Optional) If you want to use WEBDRIVER_HUB_URL, you can use Docker to set up a Selenium server like this:

docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119

The WEBDRIVER_HUB_URL value is then http://docker_host_ip:4444/wd/hub. When debugging on the local host, use http://127.0.0.1:4444/wd/hub.
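
Before pointing the pipeline at the hub, it can help to confirm that the container is actually answering. The following is a minimal sketch, not part of scrapy-save-as-pdf: the helper names hub_url and hub_is_ready are hypothetical, and it assumes the server exposes the standard WebDriver status endpoint under the /wd/hub prefix.

```python
import json
from urllib.request import urlopen


def hub_url(host, port=4444):
    """Build the hub URL in the form used by WEBDRIVER_HUB_URL."""
    return f"http://{host}:{port}/wd/hub"


def hub_is_ready(base_url, timeout=5):
    """Return True if the server reports ready (assumes a /status endpoint)."""
    try:
        with urlopen(f"{base_url}/status", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return bool(payload.get("value", {}).get("ready"))


if __name__ == "__main__":
    url = hub_url("127.0.0.1")
    print(url, "ready" if hub_is_ready(url) else "not reachable")
```

If the check reports "not reachable", verify that the container is running and that port 4444 is published before debugging the spider itself.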

Add settings like these to the settings.py of your Scrapy project:

PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
PDF_SAVE_AS_PDF = False
PDF_DOWNLOAD_TIMEOUT = 60
PDF_PRINT_OPTIONS = {
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
WEBDRIVER_HUB_URL = 'http://127.0.0.1:4444/wd/hub'

If both WEBDRIVER_HUB_URL and CHROME_DRIVER_PATH are set, WEBDRIVER_HUB_URL takes precedence.
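
The precedence rule can be sketched as a small function. This is an illustration of the documented behavior only, not the plugin's actual code: the function name choose_driver_backend is hypothetical, and returning None when neither setting is present is an assumption.

```python
def choose_driver_backend(settings):
    """Pick a WebDriver backend from a settings mapping.

    Returns ("remote", url) when WEBDRIVER_HUB_URL is set,
    ("local", path) when only CHROME_DRIVER_PATH is set,
    and None when neither is set (assumed behavior, not documented).
    """
    hub_url = settings.get("WEBDRIVER_HUB_URL")
    driver_path = settings.get("CHROME_DRIVER_PATH")
    if hub_url:
        # The hub wins even if a local driver path is also configured.
        return ("remote", hub_url)
    if driver_path:
        return ("local", driver_path)
    return None


if __name__ == "__main__":
    both = {
        "WEBDRIVER_HUB_URL": "http://127.0.0.1:4444/wd/hub",
        "CHROME_DRIVER_PATH": "/snap/bin/chromium.chromedriver",
    }
    print(choose_driver_backend(both))
```

To force the local driver, leave WEBDRIVER_HUB_URL unset or empty.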

Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file and setting its priority:

ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}

Order it before your persistence pipelines (such as saving to a database) and after your preprocessing pipelines.

In the demo Scrapy project, I put the SaveToQiniuPipeline after this plugin to persist the PDF to the cloud.
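
The ordering advice above can be sketched as follows. Scrapy runs item pipelines from the lowest priority value to the highest, so the plugin needs a value between the preprocessing and persistence stages. The two myproject pipeline paths and all three numbers are placeholders chosen for illustration, not values from the package.

```python
# Hypothetical settings.py fragment: only the SaveAsPdfPipeline path is real.
ITEM_PIPELINES = {
    "myproject.pipelines.CleanItemPipeline": 100,           # preprocess (hypothetical)
    "scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline": 300,  # download or render the PDF
    "myproject.pipelines.SaveToDatabasePipeline": 800,      # persist (hypothetical)
}

# Scrapy processes items in ascending priority order; this reproduces that order.
run_order = [
    path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

if __name__ == "__main__":
    for position, path in enumerate(run_order, start=1):
        print(position, path)
```

With this layout, the persistence pipeline sees pdf_url already populated by the plugin.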

Usage

Set the pdf_url and/or url field in your yielded item:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # Scrapy requires a spider name; this one is a placeholder
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }

After the pipeline runs, the pdf_url field holds the location of the downloaded PDF file. If pdf_url already had a value, that value is moved to the origin_pdf_url field. You can handle both fields in your next pipeline.
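
A next pipeline that reads both fields might look like the sketch below. The class name ReportPdfPipeline and the helper describe_pdf are hypothetical, and the code assumes SaveAsPdfPipeline has already run successfully, so that pdf_url is a local file location and origin_pdf_url, if present, is the original remote URL.

```python
import logging

logger = logging.getLogger(__name__)


def describe_pdf(item):
    """Summarize where an item's PDF is stored and where it came from."""
    local_path = item.get("pdf_url")          # set by SaveAsPdfPipeline
    original_url = item.get("origin_pdf_url")  # previous pdf_url value, if any
    if not local_path:
        return "no PDF recorded"
    if original_url:
        return f"{local_path} (downloaded from {original_url})"
    return f"{local_path} (no original pdf_url)"


class ReportPdfPipeline:
    """Hypothetical pipeline placed after SaveAsPdfPipeline in ITEM_PIPELINES."""

    def process_item(self, item, spider):
        logger.info("PDF for %s: %s", item.get("url"), describe_pdf(item))
        return item  # pass the item on to later pipelines unchanged
```

A persistence pipeline could use the same two fields to store the local path alongside the original URL.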

Getting help

Please open a GitHub issue.

Contributing

PRs are always welcome.

Changes

0.1.0 (2020-12-25)

Initial release
