Parallelized Python Scraping into Database Across Heroku

Question

First off, if this "project" is short-lived, or generally won't be run in production, I suggest you don't start looking into "better technologies" until you really see that you need to. If you only ever are going to run this 3 times, it's a waste of time.

To your last question: Twisted is an async framework, much like "node", that will allow a higher concurrency factor on a single machine. Celery is distributed tasks, is very cool, and both are generally worth learning and suit you fine. (I wouldn't bother with Twisted unless the scale was huge). Instead of celery, for your simple case, you might consider "RedisQ", a Python module that does something similar (and has very concise documentation) in Redis.

To your MySQL question: that shouldn't be the case. A 1.5M rows table is not small, but inserts and updates should definitely not take minutes. Begin investigation by turning off any keys, indexes and primary keys you have.

To your Heroku architecture question: you would have 2 types of processes: a "web" process (which is your init_scrape.py), of which you will have 1 (heroku ps:scale web=1), and a "worker" process (of which you can have as many as you'd like, and is that increases your scale).

Your procfile will look something like:

web: python init_scrape.py
worker: python worker.py

Note that if you want to communicate with your init_scrape.py process, you must call it "web" in the Procfile. Note also that in that case you must bind a TCP listener (basically: spin up a simple http server) to the port os.environ['PORT']. Only "web" processes get routed HTTP requests from "outside" of Heroku.

Also, note that all your processes should never really "exit" (Or Heroku will simple re-spin them). When they have nothing to do, they should simple wait/poll for tasks. You can then increase or decrease the number of workers by using heroku ps:scale.

The main issue here, with regards to what you write, is that your master will not spin up workers. The worker processes will be in entirely different dynos. The worker will simply initialize the redis queue (as you menion), and maybe spin up a simple http server to communicate with, and then sit idly by.

The workers will need to be passed the redis queue name, and each worker will be in a dyno of its own.

Good luck!