scrapy log issue
-
27-06-2021 - |
質問
i have multiple spiders in one project , problem is right now i am defining LOG_FILE in SETTINGS like
LOG_FILE = "scrapy_%s.log" % datetime.now()
what i want is scrapy_SPIDERNAME_DATETIME
but i am unable to provide spidername in log_file name ..
i found
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
and called it in each spider init method but its not working ..
any help would be appreciated
解決
The spider's __init__()
is not early enough to call log.start()
by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.
In your spider class file:
from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider
class ExampleSpider(BaseSpider):
name = "example"
allowed_domains = ["example.com"]
start_urls = ["http://www.example.com/"]
def __init__(self, name=None, **kwargs):
LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
# remove the current log
# log.log.removeObserver(log.log.theLogPublisher.observers[0])
# re-create the default Twisted observer which Scrapy checks
log.log.defaultObserver = log.log.DefaultObserver()
# start the default observer so it can be stopped
log.log.defaultObserver.start()
# trick Scrapy into thinking logging has not started
log.started = False
# start the new log file observer
log.start(LOG_FILE)
# continue with the normal spider init
super(ExampleSpider, self).__init__(name, **kwargs)
def parse(self, response):
...
And the output file might look like:
scrapy_example_2012-08-25 12:34:48.823896.log
他のヒント
There should be a BOT_NAME in your settings.py. This is the project/spider name. So in your case, this would be
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())
This is pretty much the same that Scrapy does internally
But why not use log.msg. The docs clearly state that this is for spider specific stuff. It might be easier to use this and just extract/grep/... the different spider log messages from a big log file.
A more compicated approach would be to get the location of the spider SPIDER_MODULES list and load all spiders inside these package.
You can use Scrapy's Storage URI parameters in your settings.py file for FEED URI.
- %(name)s
%(time)s
For example: /tmp/crawled/%(name)s/%(time)s.log