I'm building an app that uses Flask and Scrapy. When the root URL of my app is accessed, it processes some data and displays it. In addition, I also want to (re)start my spider if it is not already running. Since my spider takes about 1.5 hrs to finish running, I run it as a background process using threading. Here is a minimal example (you'll also need testspiders):
    import os
    from flask import Flask, render_template
    import threading
    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from testspiders.spiders.followall import FollowAllSpider

    def crawl():
        spider = FollowAllSpider(domain='scrapinghub.com')
        crawler = Crawler(Settings())
        crawler.configure()
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()
        log.start()
        reactor.run()

    app = Flask(__name__)

    @app.route('/')
    def main():
        run_in_bg = threading.Thread(target=crawl, name='crawler')
        thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
        if 'crawler' not in thread_names:
            run_in_bg.start()
        return 'hello world'

    if __name__ == "__main__":
        port = int(os.environ.get('PORT', 5000))
        app.run(host='0.0.0.0', port=port)
As a side note, the following lines were my ad hoc approach to checking whether my crawler thread is still running. If there's a more idiomatic approach, I'd appreciate some guidance.
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
    if 'crawler' not in thread_names:
        run_in_bg.start()
Moving on to the problem: if I save the above script as crawler.py, run python crawler.py, and access localhost:5000, I get the following error (ignore scrapy's HtmlXPathSelector deprecation errors):
exceptions.ValueError: signal only works in main thread
Although the spider runs, it doesn't stop, because the signals.spider_closed signal only works in the main thread (according to this error). As expected, subsequent requests to the root URL result in copious errors.
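(For context on where the ValueError comes from: CPython only allows signal handlers to be installed from the main thread, and Twisted's reactor.run() installs handlers for signals like SIGINT/SIGTERM by default, so starting it from a worker thread trips this rule. A minimal sketch of the restriction itself, with no Twisted or Scrapy involved:)

```python
import signal
import threading

results = []

def install_handler():
    # signal.signal() may only be called from the main thread; from a
    # worker thread it raises ValueError — the same restriction the
    # reactor hits when run in a background thread.
    try:
        signal.signal(signal.SIGTERM, signal.SIG_IGN)
        results.append(None)
    except ValueError as exc:
        results.append(exc)

worker = threading.Thread(target=install_handler)
worker.start()
worker.join()
# results[0] now holds the ValueError raised in the worker thread.
```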
How can I design my app to start my spider if it is not already crawling, while at the same time returning control back to my app immediately (i.e. I don't want to wait for the crawler to finish) for other stuff?