Question

Suppose I wanted to develop a Stack Overflow-like website. How do I estimate the amount of commodity hardware required to support it, assuming 1 million requests per day? Are there any case studies that explain the performance improvements possible in this situation?

I know that I/O is the major bottleneck in most systems. What are the possible options to improve I/O performance? A few that I know of are:

  1. caching
  2. replication

Solution

You can improve I/O performance in several ways, depending on your storage setup:

  1. Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
  2. Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
  3. Use fast disks (Performance Wise: SSD > FC > SATA).
  4. Segregate workloads by time of day, e.g. run backups at night and normal application I/O during the day.
  5. Turn off atime updates in your filesystem.
  6. Cache NFS file handles if storing data on an NFS server (cf. Facebook's Haystack work).
  7. Combine small files into larger chunks, as BigTable and HBase do (see the sketch after this list).
  8. Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
  9. Use a clustered storage system (yeah not exactly commodity hardware).
  10. Optimize/design your application for sequential disk accesses whenever possible.
  11. Use memcached. :)
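
To make item 7 concrete, here is a minimal Python sketch of the idea, not the actual Haystack or BigTable format; the class name, on-disk layout, and example keys are made up for illustration. Many small records are appended to one large file and located via an in-memory index, so each read costs one seek instead of one open() per tiny file:

```python
import os

class PackedStore:
    """Illustrative sketch: pack many small records into one large
    append-only file, with an in-memory index of (offset, length)."""

    def __init__(self, path):
        self.index = {}                 # key -> (offset, length), kept in RAM
        self.f = open(path, "ab+")      # append-only data file

    def put(self, key, data: bytes):
        self.f.seek(0, os.SEEK_END)     # writes always go to the end
        offset = self.f.tell()
        self.f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)             # one seek, one read, no extra open()
        return self.f.read(length)

store = PackedStore("/tmp/packed.dat")  # assumed path, for illustration
store.put("avatar:42", b"...image bytes...")
print(store.get("avatar:42"))
```

A real implementation would also persist the index and handle deletes/compaction, but the point is the same: the filesystem sees a few large sequential files instead of millions of tiny ones.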

You may want to look at the "Lessons Learned" section of the StackOverflow Architecture article.

OTHER TIPS

Check out this handy tool:

http://www.sizinglounge.com/

And here is another guide from Dell:

http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555

If you want your own Stack Overflow-like community, you can sign up with StackExchange.

You can read some case studies here:

High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study

1 million requests per day is about 12 per second. Stack Overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in the RAM of a 64 GB Dell PowerEdge 2970. I'm not sure where caching and replication would even come into play.
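
As a back-of-envelope check (the post count and average record size below are assumed figures purely for illustration, not Stack Overflow's real numbers):

```python
requests_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60                 # 86,400
avg_rps = requests_per_day / seconds_per_day
print(f"average load: {avg_rps:.1f} requests/second")   # ~11.6

# Assumed figures: 10 million posts at ~2 KB each after
# normalization/compression.
posts = 10_000_000
avg_post_bytes = 2 * 1024
dataset_gb = posts * avg_post_bytes / 1024**3
print(f"estimated dataset: {dataset_gb:.0f} GB")         # ~19 GB, fits in 64 GB RAM
```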

If you would rather not work that hard on normalization, a PowerEdge R900 with 256 GB is available.

If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K requests/second should not be a problem for a main-memory system.

The best way to avoid the I/O bottleneck is to not do I/O at all (as far as possible). That means a Prevayler-like architecture with batched writes (losing a few seconds of data is acceptable): keep the state in memory, append updates to a log file, and for replication also write them out to a socket, as sketched below.
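
Prevayler itself is a Java library; the following is only an illustrative Python equivalent of that style, and the replica address and flush interval are assumptions. All reads come from RAM, writes are appended to a buffer that is flushed to the log in batches, and the same serialized updates are pushed to a replica over a TCP socket:

```python
import json
import socket
import threading
import time

class InMemoryStore:
    """Illustrative prevayler-style store: reads served from RAM,
    writes batched to a log file and mirrored to a replica socket."""

    def __init__(self, log_path, replica_addr=None, flush_interval=2.0):
        self.data = {}                              # the whole dataset, in RAM
        self.log = open(log_path, "a")
        self.buffer = []
        self.lock = threading.Lock()
        self.replica = None
        if replica_addr:                            # e.g. ("10.0.0.2", 9000), assumed
            self.replica = socket.create_connection(replica_addr)
        threading.Thread(target=self._flusher, args=(flush_interval,),
                         daemon=True).start()

    def put(self, key, value):
        entry = json.dumps({"op": "put", "key": key, "value": value})
        with self.lock:
            self.data[key] = value                  # visible to readers immediately
            self.buffer.append(entry)
        if self.replica:
            self.replica.sendall((entry + "\n").encode())

    def get(self, key):
        return self.data.get(key)                   # no disk I/O on the read path

    def _flusher(self, interval):
        while True:                                 # accept losing the last few seconds
            time.sleep(interval)
            with self.lock:
                batch, self.buffer = self.buffer, []
            if batch:
                self.log.write("\n".join(batch) + "\n")
                self.log.flush()                    # one batched, sequential write

store = InMemoryStore("/tmp/app.log")               # assumed path, for illustration
store.put("question:1", {"title": "Estimating hardware", "votes": 3})
print(store.get("question:1"))
```

On restart you would rebuild the in-memory state by replaying the log, which is exactly the sequential-read pattern disks are good at.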

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow