Project description

A Scrapy item pipeline that downloads a PDF, or saves a page as a PDF, for each scraped item.


Install scrapy-save-as-pdf using pip:

pip install scrapy-save-as-pdf


  1. (Optional) If you want to use WEBDRIVER_HUB_URL, you can set up a hub with Docker like this:
docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119

The WEBDRIVER_HUB_URL value is then http://docker_host_ip:4444/wd/hub. We often debug on the local host, in which case docker_host_ip is the local machine's own address.
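
As a rough illustration of the URL format above, the hub address can be assembled from a host and the published port. The helper names `hub_url` and `open_remote_chrome` are hypothetical and not part of scrapy-save-as-pdf, and `localhost` is only an assumed host for a container running on the same machine.

```python
def hub_url(host: str, port: int = 4444) -> str:
    """Build the WebDriver hub URL for a Selenium container published on `port`."""
    return f"http://{host}:{port}/wd/hub"


def open_remote_chrome(url: str):
    """Open a Chrome session on the hub (needs the selenium package and a running hub)."""
    # Imported lazily so that hub_url() works without selenium installed.
    from selenium import webdriver

    return webdriver.Remote(command_executor=url, options=webdriver.ChromeOptions())


if __name__ == "__main__":
    # "localhost" is an assumption: the container runs on this machine.
    print(hub_url("localhost"))
```

Calling `open_remote_chrome(hub_url("localhost"))` is a quick way to check that the container accepts sessions before pointing the pipeline at it.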

  1. Add the following to the settings of your Scrapy project:
PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
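
The four print options listed above share their names with parameters of the Chrome DevTools Protocol command `Page.printToPDF`. The sketch below shows one way such options could be handed to a local Chrome driver through Selenium's `execute_cdp_cmd`. It is an illustration under that assumption, not necessarily how the pipeline works internally, and the names `PRINT_OPTIONS`, `pdf_bytes_from_result` and `print_page_to_pdf` are hypothetical.

```python
import base64

# Same keys as the print options above; values are passed to Chrome unchanged.
PRINT_OPTIONS = {
    "landscape": False,
    "displayHeaderFooter": False,
    "printBackground": True,
    "preferCSSPageSize": True,
}


def pdf_bytes_from_result(result: dict) -> bytes:
    """Decode the base64 "data" field returned by Page.printToPDF."""
    return base64.b64decode(result["data"])


def print_page_to_pdf(driver, url: str, out_path: str) -> None:
    """Load `url` in a Chrome driver and write the printed PDF to `out_path`."""
    driver.get(url)
    result = driver.execute_cdp_cmd("Page.printToPDF", PRINT_OPTIONS)
    with open(out_path, "wb") as fh:
        fh.write(pdf_bytes_from_result(result))
```

Here `driver` is expected to be a `selenium.webdriver.Chrome` instance, since `execute_cdp_cmd` is specific to Chromium-based drivers.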


  1. Enable the pipeline by adding it to ITEM_PIPELINES in your settings and adjusting its priority:
ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}

This pipeline should run before your persistence pipelines (for example, one that saves items to a database) and after your preprocessing pipelines.

In the demo Scrapy project, I put SaveToQiniuPipeline after this plugin to persist the PDFs to the cloud.
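
The ordering rule can be sketched with priorities, since Scrapy runs item pipelines from the lowest value to the highest. In this example the two `myproject` pipelines and their priority values are hypothetical placeholders; only the `SaveAsPdfPipeline` entry comes from this plugin.

```python
ITEM_PIPELINES = {
    # Hypothetical preprocessing step: runs first (lowest value).
    "myproject.pipelines.CleanItemPipeline": -10,
    # This plugin: runs after preprocessing, before persistence.
    "scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline": -1,
    # Hypothetical persistence step: runs last (highest value).
    "myproject.pipelines.SaveToDatabasePipeline": 300,
}

# Pipelines sorted the way Scrapy would run them, lowest value first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Any values work as long as the relative order is preserved.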


Set the pdf_url and/or url field in the items you yield:

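import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"  # placeholder name; Scrapy requires every spider to have one
    start_urls = [
        # URLs to crawl go here
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Set "url" and/or "pdf_url" on each yielded item
        yield {
            "url": "",
            "pdf_url": "",
        }
        yield {
            "url": "",
            "pdf_url": "",
        }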

The pdf_url field will be populated with the location of the downloaded PDF file. If pdf_url already holds a value, that value is moved to the origin_pdf_url field. You can handle both fields in your next pipeline.
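
A minimal sketch of such a follow-up pipeline is shown below, assuming dict-like items and that it is registered with a higher priority value than SaveAsPdfPipeline. The class name `PdfLocationPipeline` is hypothetical.

```python
class PdfLocationPipeline:
    """Hypothetical follow-up pipeline; place it after SaveAsPdfPipeline."""

    def process_item(self, item, spider):
        # Location of the downloaded PDF file, as filled in by the plugin.
        saved_path = item.get("pdf_url")
        # Previous pdf_url value, present only if the item had one.
        original_url = item.get("origin_pdf_url")
        if saved_path:
            spider.logger.info("PDF saved at %s (source: %s)", saved_path, original_url)
        # Always return the item so later pipelines can process it.
        return item
```

A persistence pipeline would do its upload or database write at the point where this sketch only logs.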

Getting help

Please use GitHub issues.


PRs are always welcome.


0.1.0 (2020-12-25)

Initial release





