Question

I have multiple spiders in one project. The problem is that right now I am defining LOG_FILE in settings.py like this:

LOG_FILE = "scrapy_%s.log" % datetime.now()

What I want is scrapy_SPIDERNAME_DATETIME,

but I am unable to include the spider name in the log file name.

I found

scrapy.log.start(logfile=None, loglevel=None, logstdout=None)

and called it in each spider's __init__ method, but it's not working.

Any help would be appreciated.


Solution

The spider's __init__() is not early enough to call log.start() by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.

In your spider class file:

from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
        # remove the current log
        # log.log.removeObserver(log.log.theLogPublisher.observers[0])
        # re-create the default Twisted observer which Scrapy checks
        log.log.defaultObserver = log.log.DefaultObserver()
        # start the default observer so it can be stopped
        log.log.defaultObserver.start()
        # trick Scrapy into thinking logging has not started
        log.started = False
        # start the new log file observer
        log.start(LOG_FILE)
        # continue with the normal spider init
        super(ExampleSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ...

And the output file might look like:

scrapy_example_2012-08-25 12:34:48.823896.log
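Incidentally, the raw datetime.now() string contains spaces and colons, which are awkward on the shell and invalid in Windows file names. A small hypothetical helper if you want a safer name (the helper name and strftime format are my own choices):

from datetime import datetime

def make_log_file_name(spider_name):
    # hypothetical helper: build a timestamp without spaces or colons,
    # since colons are not allowed in Windows file names
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return "scrapy_%s_%s.log" % (spider_name, stamp)

# e.g. make_log_file_name("example") -> "scrapy_example_2012-08-25_12-34-48.log"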

Other Tips

There should be a BOT_NAME in your settings.py. This is the project/spider name, so in your case it would be:

LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())

This is pretty much the same as what Scrapy does internally.
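For completeness, a minimal settings.py sketch of this suggestion (the BOT_NAME value is a placeholder; the datetime import must be at the top of the file). Note that BOT_NAME is project-wide, so all spiders share the same log file:

# settings.py
from datetime import datetime

BOT_NAME = "myproject"  # placeholder project name

# every spider in the project shares this one log file name
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())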

But why not use log.msg? The docs clearly state that it is meant for spider-specific messages. It might be easier to use it and just extract/grep/... the different spiders' log messages from one big log file.
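A rough sketch of that approach with the old scrapy.log API (spider name and message are illustrative):

from scrapy import log
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"

    def parse(self, response):
        # passing spider=self tags the entry with the spider name,
        # so one shared log file can still be grepped per spider
        log.msg("parsed %s" % response.url, level=log.INFO, spider=self)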

A more complicated approach would be to read the SPIDER_MODULES list from the settings and load all the spiders inside those packages.
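A hedged sketch of what that could look like; the iter_spider_names helper is my own, and it assumes iter_spider_classes is available in scrapy.utils.spider:

import importlib
import pkgutil

from scrapy.utils.spider import iter_spider_classes

def iter_spider_names(spider_modules):
    # spider_modules is the SPIDER_MODULES list from settings.py,
    # e.g. ["myproject.spiders"]
    for module_path in spider_modules:
        package = importlib.import_module(module_path)
        for _, modname, _ in pkgutil.iter_modules(package.__path__):
            module = importlib.import_module("%s.%s" % (module_path, modname))
            # yield the name of every spider class defined in the module
            for spider_cls in iter_spider_classes(module):
                yield spider_cls.name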

You can use Scrapy's storage URI parameters in your settings.py for the FEED_URI setting:

  1. %(name)s
  2. %(time)s

    For example: /tmp/crawled/%(name)s/%(time)s.log
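For instance, in settings.py (note these parameters control the feed export output, i.e. scraped items, not the log file; the path and format below are placeholders):

# settings.py -- feed export, not logging: Scrapy fills in
# %(name)s with the spider name and %(time)s with the start time
FEED_URI = "/tmp/crawled/%(name)s/%(time)s.json"
FEED_FORMAT = "jsonlines"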

Licensed under: CC-BY-SA with attribution