Question

I'm building an app that uses Flask and Scrapy. When the root URL of my app is accessed, it processes some data and displays it. In addition, I also want to (re)start my spider if it is not already running. Since my spider takes about 1.5 hours to finish, I run it in a background thread using threading. Here is a minimal example (you'll also need testspiders):

import os
from flask import Flask, render_template
import threading
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from testspiders.spiders.followall import FollowAllSpider

def crawl():
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    # stop the reactor when the spider closes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks until reactor.stop() is called

app = Flask(__name__)

@app.route('/')
def main():
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

    if 'crawler' not in thread_names:
        run_in_bg.start()

    return 'hello world'

if __name__ == "__main__":
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)

As a side note, the following lines were my ad hoc approach to checking whether my crawler thread is still running. If there's a more idiomatic approach, I'd appreciate some guidance.

run_in_bg = threading.Thread(target=crawl, name='crawler')
thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

if 'crawler' not in thread_names:
    run_in_bg.start()
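
(For comparison, I also considered keeping a single module-level handle and calling is_alive() on it instead of scanning thread names, roughly as sketched below; start_crawl_if_idle is just an illustrative name, and since a finished Thread object can't be restarted, a fresh one is created each time:)

import threading

crawler_thread = None

def start_crawl_if_idle():
    global crawler_thread
    # a Thread object can't be restarted, so create a fresh one
    # whenever the previous crawl has finished (or never started)
    if crawler_thread is None or not crawler_thread.is_alive():
        crawler_thread = threading.Thread(target=crawl, name='crawler')
        crawler_thread.start()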

Moving on to the problem: if I save the above script as crawler.py, run python crawler.py, and access localhost:5000, I get the following error (ignore Scrapy's HtmlXPathSelector deprecation warnings):

exceptions.ValueError: signal only works in main thread

Although the spider runs, it doesn't stop, because the signals.spider_closed signal only works in the main thread (according to this error). As expected, subsequent requests to the root URL result in copious errors.

How can I design my app so that it starts my spider if one is not already crawling, while immediately returning control to my app (i.e. I don't want to wait for the crawler to finish) for other work?

Solution

It's not the best idea to have Flask start long-running threads like this.

I would recommend using a task queue system such as Celery or RabbitMQ. Your Flask application can put the work it wants done in the background on the queue and then return immediately.

Then you can have workers, running outside of your main app process, consume those tasks and do all of your scraping.
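
As a minimal sketch of that split, assuming Celery with a Redis broker (the broker URL, the tasks module name, and shelling out to scrapy crawl via subprocess are all illustrative choices, not the only way to wire this up):

# tasks.py -- the Celery worker side
import subprocess

from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def crawl():
    # run the spider in its own process, so the Twisted reactor and its
    # signal handlers live in that process's main thread
    subprocess.check_call(
        ['scrapy', 'crawl', 'followall', '-a', 'domain=scrapinghub.com'])

# app.py -- the Flask side only enqueues and returns
from flask import Flask

from tasks import crawl

app = Flask(__name__)

@app.route('/')
def main():
    crawl.delay()  # returns immediately; a worker picks the task up
    return 'hello world'

Start the worker with celery -A tasks worker, then run the Flask app as before. Note that Celery won't deduplicate tasks for you: if you only ever want one crawl in flight, you'd still need something like a broker-side lock or a check on the task's state before enqueuing.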

Licensed under: CC-BY-SA with attribution