We have been running node.js in production for nearly a year, starting from 0.4 and currently on the 0.8 series. The web app is Express 2 and 3 based, with mongo, redis and memcached.
A few facts.
- node does not cope well with a large v8 heap; once it grows over 200MB you will start seeing increased cpu usage
- node always seems to leak memory, or at least grows to a large memory size without actually using it. I suspect memory fragmentation, as v8 profiling and valgrind show no leaks in either JS space or the resident heap. Early 0.8 was awful in this respect: rss could be 1GB with a 50MB heap.
- hanging requests are hard to track. We wrote our own middleware to monitor these, especially as our app is long-poll based
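A monitor for hanging requests could look roughly like the sketch below. This is not our actual middleware, just a minimal illustration of the idea: it follows the Connect/Express `(req, res, next)` convention and assumes a Node version where the response object emits `finish` and `close`. The names `hungRequestMonitor`, `thresholdMs` and `report` are made up for this example.

```javascript
// Connect/Express-style middleware that reports any request which has not
// been answered within `thresholdMs`. All names here are hypothetical.
function hungRequestMonitor(thresholdMs, report) {
  return function (req, res, next) {
    var start = Date.now();

    // Fires only if the response has not completed within the threshold.
    var timer = setTimeout(function () {
      report({
        method: req.method,
        url: req.url,
        waitedMs: Date.now() - start
      });
    }, thresholdMs);

    function clear() {
      clearTimeout(timer);
    }

    res.on('finish', clear); // response was sent
    res.on('close', clear);  // connection went away before a response
    next();
  };
}
```

It would be mounted early in the stack, for example `app.use(hungRequestMonitor(30000, console.error))`. For a long-poll app the threshold has to be longer than the normal poll timeout, otherwise every healthy poll gets reported.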
My suggestions.
- use multiple instances per machine, at least 1 per cpu. Balance with haproxy, nginx or similar, with session affinity
- write middleware to report hung connections, i.e. ones where the code never responded or latency was over a threshold
- restart instances often, at least weekly
- write a poller that prints out memory stats from the process module (process.memoryUsage()) once per minute
- use supervisord and fabric for easy process management
Monitor cpu and the reported memory stats, and restart when they cross a threshold.
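The memory poller and the restart-on-threshold idea can be combined in a few lines. The following is only a sketch under some assumptions: it uses `process.memoryUsage()`, which reports `rss`, `heapTotal` and `heapUsed` in bytes, and it assumes a supervisor such as supervisord restarts the process after it exits. The function names and the exit-on-threshold behaviour are illustrative, not what we literally run.

```javascript
// Formats one line of memory stats; `usage` has the shape returned by
// process.memoryUsage() (all values in bytes).
function formatMemoryStats(usage) {
  var mb = function (bytes) {
    return (bytes / 1048576).toFixed(1) + 'MB';
  };
  return 'rss=' + mb(usage.rss) +
    ' heapTotal=' + mb(usage.heapTotal) +
    ' heapUsed=' + mb(usage.heapUsed);
}

// Logs memory stats every `intervalMs` and exits once rss passes
// `maxRssBytes`, relying on the supervisor to start a fresh instance.
// Both parameter names are hypothetical.
function startMemoryPoller(intervalMs, maxRssBytes) {
  var timer = setInterval(function () {
    var usage = process.memoryUsage();
    console.log(new Date().toISOString() + ' ' + formatMemoryStats(usage));
    if (usage.rss > maxRssBytes) {
      console.error('rss over threshold, exiting so the supervisor restarts us');
      process.exit(1);
    }
  }, intervalMs);
  // Where supported, do not let the poller alone keep the process alive.
  if (timer.unref) timer.unref();
  return timer;
}
```

Calling `startMemoryPoller(60000, 500 * 1048576)` at startup would log once per minute and exit above roughly 500MB rss; the limit is an arbitrary example value. A hard `process.exit` drops in-flight requests, so a real setup would probably stop accepting connections and drain first.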