Question

This is the first time I've actually run into timing issues with a task I have to tackle. I need to run a calculation (against a webservice) for approximately 7M records. That would take more than 180 hours, so I was thinking about running multiple instances of the webservice on EC2 and running rake tasks in parallel.

Since I have never done this before, I was wondering what needs to be considered. More precisely:

  • What's the maximum number of rake tasks I can run (is there any limit at all besides my own machine's power)?
  • What's the maximum number of concurrent connections to a Postgres 9.3 DB?
  • Is there anything to consider when running multiple active_record.save actions at the same time?

I am looking forward to hearing your thoughts. Best, Phil


Solution

rake instances

  • Every time you run rake, you start a new instance of your Ruby server, with all the associated memory and load-dependency overhead. Look in your Rakefile for the inits.
    • your number of instances is limited by memory and CPU
    • you must profile each instance's memory and CPU to know how many can run
    • you could write a program to monitor and calculate what's possible, but heuristics will work better for one-offs and first experiments.
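One practical way to run several rakes side by side is to hand each one a disjoint slice of the ID range up front, so no two workers touch the same records. A minimal sketch (the `records:process` task name is hypothetical, not from the question):

```ruby
# Split an ID range into roughly equal, disjoint slices -- one per rake worker.
def id_slices(first_id, last_id, workers)
  total = last_id - first_id + 1
  per   = (total.to_f / workers).ceil
  (0...workers).map do |i|
    lo = first_id + i * per
    hi = [lo + per - 1, last_id].min
    lo..hi
  end.reject { |r| r.first > r.last }   # drop empty slices if workers > records
end

slices = id_slices(1, 7_000_000, 8)
# Each slice can then be handed to its own rake invocation, e.g.:
#   slices.each { |r| spawn("rake records:process[#{r.first},#{r.last}]") }
```

Disjoint slices also sidestep most of the write-lock contention discussed below, since workers never race on the same rows.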

datastore

  • heuristically explore your database capacity, too.
    • watch for write-locks that create blocking
    • watch for slow reads due to missing indices
    • look at your postgres configs to see concurrency limits, cache size, etc.
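As a concrete concurrency check: Postgres 9.3 ships with a default `max_connections` of 100 (verify yours with `SHOW max_connections;` in psql), and every rake instance holds its own connection pool. A back-of-envelope sanity check, assuming a per-worker pool size you'd read from your `database.yml`:

```ruby
# Total connections demanded by N rake workers must stay below Postgres's
# max_connections, with room left for other clients.
def connections_needed(workers, pool_per_worker)
  workers * pool_per_worker
end

max_connections = 100   # assumed 9.3 default; read yours from postgresql.conf
raise "too many rake workers for the DB" if connections_needed(16, 5) > max_connections
```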

.save

  • each rake task is its own Ruby server, so multiple concurrent active_record.save actions bring:
    • blocking/waiting due to write-locking
    • one instance working from 'old' data that was read before another instance's .save
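The stale-read hazard can be demonstrated without a database at all: two workers doing a read-modify-write on shared state lose updates unless the read and write happen under one lock, which is exactly the situation of two rakes calling .save on a row read before the other's update. A self-contained sketch, using a Mutex as the stand-in for a row-level lock:

```ruby
# Four workers increment a shared counter 1000 times each. Without the
# lock, an increment computed from 'old' data can overwrite a newer
# value -- the same lost-update problem as concurrent .save calls.
counter = { value: 0 }
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    1000.times do
      lock.synchronize do        # analogous to locking the row before save
        counter[:value] += 1
      end
    end
  end
end
threads.each(&:join)
counter[:value]   # => 4000 with the lock in place
```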

operational complexity

  • the number of records (7MM) is just a multiplier for the operations performed on each record. The per-record operational complexity is the real source of limitation, since, theoretically, running 7MM workers would solve the problem in the minimum timescale
  • if 180hr is accurate (dubious), then (180 * 60 * 60 * 1000) / 7000000 ≈ 92.57 ms per record
  • Look for any shared-resource that is an IO blocker.
  • look for any common calculation that you can do in advance and cache. A lookup beats a calc.
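The budget arithmetic above, plus the "lookup beats a calc" idea as a memoizing hash, can be sketched as follows (`expensive_rate_for` is a hypothetical stand-in for whatever calculation your records share):

```ruby
# Back-of-envelope: total serial wall time spread over 7MM records.
budget_ms = (180 * 60 * 60 * 1000) / 7_000_000.0
budget_ms.round(2)   # => 92.57 ms per record, per the figure above

# "A lookup beats a calc": compute each shared value once, then serve
# every later request from the hash.
def expensive_rate_for(key)   # hypothetical slow calculation / webservice hit
  sleep 0.01                  # simulate the cost being avoided
  key.hash % 100
end

RATE_CACHE = Hash.new { |h, key| h[key] = expensive_rate_for(key) }

RATE_CACHE["DE"]   # first access: pays the full cost
RATE_CACHE["DE"]   # every later access: a hash lookup
```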

errata

  • leave headroom for base OS processes. These vary by environment; you mention AWS, but it's best to learn how to monitor any system for activity:
    1. run top in a separate screen / terminal as the rakes are running.
    2. Prefer to run 2 tops in different screens. sort 1 by memory, sort the other by CPU
    3. have a way to monitor the rakes
    4. watch for events that bubble up the top processes.
    5. if you do this long / well enough, you've profiled your headroom
  • run more rakes to fill your headroom
  • don't overrun your memory or you'll get swapping
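For step 3 above ("have a way to monitor the rakes"), a crude poller over `ps` is enough to start with. A sketch, assuming a Unix-like host as on EC2:

```ruby
# List running rake processes with their CPU and memory percentages,
# by filtering the output of ps.
def rake_status
  `ps -eo pid,pcpu,pmem,comm`.lines
    .select { |line| line.include?("rake") }
    .map    { |line| line.split(" ", 4) }   # => [pid, %cpu, %mem, command]
end

# rake_status.each { |pid, cpu, mem, _| puts "#{pid} cpu=#{cpu}% mem=#{mem}%" }
```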

You may want to consider beanstalk instead, but my guess is you'll find that more complicated than learning all these good foundations, first.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow