Question

Our scraper currently downloads images as well as text. The scraper itself works fine, but we have big problems with the quality of the downloaded images. After looking at the standard ImagesPipeline, we implemented a custom pipeline that tells Pillow to save at the highest quality. It is configured in settings.py and looks like this:

from scrapy.contrib.pipeline.images import ImagesPipeline
from cStringIO import StringIO

class CustomImagesPipeline(ImagesPipeline):

    def convert_image(self, image, size=None):
        # Re-encode every image as JPEG at Pillow's maximum quality setting.
        # Note: the size argument is ignored, so no thumbnails are produced.
        buf = StringIO()
        image.save(buf, 'JPEG', quality=100)
        return image, buf

We also tried several other presets taken from this file: https://github.com/python-imaging/Pillow/blob/master/PIL/JpegPresets.py

However, we did not see any improvement. Has anyone here tackled this problem before, or does anyone have an idea what's wrong with the code?

Thanks :)

Solution

I fixed this particular problem with a different approach, made possible by a recent pull request that hasn't been documented yet.

The pull request introduced a new pipeline, called FilesPipeline: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/pipeline/files.py

I had to make the following changes to get this working:

  • rename the image_urls field, which the image pipeline uses, to file_urls in your items.py
  • activate the pipeline in your settings.py and define a storage location:
    • ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}
    • FILES_STORE = '/Users/chris/Scrapy/project/images'
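
Put together, the settings.py changes above might look like the following minimal sketch. The pipeline path and store directory are copied from the list; the two field-name settings are an assumption based on later Scrapy documentation and only restate what I understand to be the defaults, so they can be left out.

```python
# settings.py (sketch): enable FilesPipeline in place of the custom ImagesPipeline.
# The pipeline path and the store directory are taken from the answer above;
# point FILES_STORE at a directory that exists on your own machine.

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.files.FilesPipeline': 1,
}

# Directory the downloaded files are written to.
FILES_STORE = '/Users/chris/Scrapy/project/images'

# Assumption: these setting names come from later Scrapy documentation and
# mirror the default field names; they are optional and shown only to make
# the item contract explicit.
FILES_URLS_FIELD = 'file_urls'   # item field the pipeline reads URLs from
FILES_RESULT_FIELD = 'files'     # item field the pipeline fills with results
```

With this in place, the item only needs to carry its download URLs in file_urls instead of image_urls.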

Apart from those changes, the pipeline works exactly like the image pipeline. Because the files are stored as downloaded rather than re-encoded through Pillow, this approach obviously only works if you just need the file from the website in its original format.
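
To find the stored files afterwards, it helps to know how they are named. As far as I can tell from files.py, the default is a SHA-1 hash of the URL plus the URL's own extension, under a full/ subdirectory of FILES_STORE. The helper below is hypothetical (it is not part of Scrapy) and only approximates that naming scheme; the URL is made up for illustration.

```python
import hashlib
import os


def sketch_file_path(url):
    """Approximate the relative path FilesPipeline stores a download under.

    Hypothetical helper, not part of Scrapy: it mirrors what appears to be
    the pipeline's default naming (SHA-1 of the URL plus the URL's extension).
    """
    media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    media_ext = os.path.splitext(url)[1]  # the original extension is kept
    return 'full/%s%s' % (media_guid, media_ext)


# Made-up URL, for illustration only.
print(sketch_file_path('http://example.com/photos/cat.png'))
```

Unlike the image pipeline, which writes everything out as JPEG, the extension here follows the source URL, so a PNG stays a PNG.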

Licensed under: CC-BY-SA with attribution