How do you do website capacity planning? [closed]

https://stackoverflow.com/questions/9176040

26-04-2021
Question

I just read the book The Art of Capacity Planning (BTW, I liked it). In it the author explains how important it is to measure your services, find your ceilings, forecast your needs, ensure easygoing deployments, and so on. But throughout the book he draws on his experience at Flickr, where he was always dealing with the same product.

Many of us work in companies that build small to medium-sized projects for other companies. We have to understand their business and their needs, plan an architecture, a model, and so on.

Then the customer says "I need to support 1000 users". Well, how many requests per second is a user? How long are their sessions? How much data do they transfer? Which operations do they execute, and how long do those take?

Sometimes it is possible to know those figures (by monitoring their existing applications, or because they have already taken those measurements); sometimes it is not (because they do not have a current web site, or it is just not possible to know).

How do you make a guess about the number of servers, bandwidth, storage, etc.? Which reference figures do you use?

Regards.

Solution

Some points you need to know to do this planning:

How many users per day.
How much data you are going to manage.
How much data you are going to show to each user.
The average bandwidth each user may need.
The average time a user spends on your site.

The average numbers give some idea of what you need per month. Of course you also need to think about the peak numbers, but when you rent web servers and hosting you get bandwidth by the month and some gigabytes of disk, so the peak is not an issue at the starting point. What you must think about there is whether you run SQL queries that need too much RAM, or whether you share the computer with many other sites.
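As a rough illustration of turning those averages into a monthly figure, here is a minimal Python sketch. Every number in it (users per day, pages per visit, page weight) is a made-up assumption for the example, not a reference value.

```python
# All figures below are illustrative assumptions, not measurements.
users_per_day = 1_000      # hypothetical daily visitors
pages_per_visit = 8        # hypothetical average pages viewed per visit
avg_page_kb = 500          # hypothetical average page weight in kilobytes
days_per_month = 30


def monthly_transfer_gb(users, pages, page_kb, days=30):
    """Estimated monthly transfer in gigabytes (1 GB = 1,000,000 kB)."""
    return users * pages * page_kb * days / 1_000_000


estimate = monthly_transfer_gb(users_per_day, pages_per_visit, avg_page_kb, days_per_month)
print(f"Estimated transfer: {estimate:.0f} GB per month")
```

This only covers the average; a peak hour can be several times higher, so treat the result as a starting point for choosing a hosting plan.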

Measure

Without a site and without experience, you do not actually have measurements. Without measurements you cannot be sure, but you can follow some guidelines:

Whatever you do, try to make the growth of your data/features/runs linear, not exponential.
The speed of your site does not depend (only) on the capacity and speed of your computer. It depends on them only when the computer is at its limits. If the computer reaches its limit, you add resources. But speed must be taken care of when you design the software, and fast software also costs more to build.
Do you add millions of records to the database every day? You need more RAM and disk.
Do you have video and many big files to send? You need more bandwidth.
Do you have people who use the site to work? You need more speed and stability.
Are you building one more e-commerce site? You need more security and stability.

The goal is to have them all; what changes is the priority, that is, what you focus on first.

Planning for speed

Performance and capacity: two different animals*. Performance is based more on human work, and capacity is based more on computer resources. To make the site fast you first need to know how to make the computer run smoothly and fast, then to know the general tricks for making programs run fast, especially programs on the web, and then you actually need to spend more time on the program itself after it is running, to improve its performance in the critical areas.

Planning for expansion

Design the software well and allow for expansion in case you need more, so that your client has the opportunity to start small and pay more only if he needs it. So when you design your software, think as if you were going to run it in a web pool: take care of synchronization, take care of shared resources, make it possible to get data from different servers, and so on.

Planning with limits

OK, let's say the customer says he has only 1000 users and is not interested in either expansion or speed, and just needs a cost-effective site that does its job. In this case you also design within those limits. What are those limits? You do not place tens of synchronization checks, and you make it work like a single-threaded, single-pool program. You do not use any mutexes, any double checks, or any of the things that are needed when you have 2 pools or 2 computers running the same application. You only mark those points in the code so you can change them if an upgrade is needed.

You also do not write any code that uses resources on multiple computers. And when you run it, you make sure it runs under only one pool so that it works correctly.

This single-pool design is easier to develop, easier to debug, easier to control, easier to patch when code is buggy, and costs less, but it suffers in speed (one user waits for another on one thread pool) and cannot be expanded with more resources, which again has to do with speed.

Finding Statistics

If you do not know how many users you may have, you can use Alexa to look at sites similar to yours and see the average users and average page views they have per month. Then you can estimate the likely bandwidth.

Don't buy before you need it

Start with your hardware prediction, but do not go and rent 2 computers from day one. Start with the first, take your measurements, see how the data grows, and expand only when you need to.
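One way to decide when to expand is to fit a straight line through your own usage measurements and project when it will hit the current limit. The Python sketch below does that with an ordinary least-squares fit. The sample measurements and the 100 GB capacity are invented for the example, and a straight line is only an assumption about how your data grows.

```python
def days_until_full(samples, capacity_gb):
    """Project when disk usage reaches capacity_gb.

    samples: list of (day, used_gb) measurements taken on at least two
    different days. Fits a least-squares line through them and returns
    the number of days after the last sample at which the line reaches
    capacity_gb, or None if usage is not growing.
    """
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    sxy = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    slope = sxy / sxx                      # GB gained per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope
    return full_day - samples[-1][0]


# Hypothetical measurements: (day number, GB used).
measurements = [(0, 10.0), (30, 12.0), (60, 14.0), (90, 16.0)]
remaining = days_until_full(measurements, capacity_gb=100)
print(f"About {remaining:.0f} days until the disk is full")
```

If the measurements curve upward instead of following a line, that is itself a signal worth investigating before buying more hardware.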

Car or Formula One?

When the program is running, if you follow it you can find many, many things that need correction. I can tell you just two from my own experience.

After we put the program online, our customer started to add data. After some months we noticed the database was growing too much, more than we expected from the data being entered. We spent almost a week finding out why and fixing it: it was a design error that made some statistics data grow much faster than linearly. We corrected it and moved on.

After two years of running we noticed that we were making too many unnecessary calls to the SQL server. We traced it down and again found a design error; we corrected it and moved on.

Actually, we have found and fixed many small performance points every month. For me it is like Formula One. You decide what car you have: a Formula One car that needs constant tuning to get the maximum out of it, or a simple car that only needs a yearly service?

Customer Point of View

Then, the customer says "I need to support 1000 users". Well, the customer does not know programming and is trying to find a measure, from his point of view, for comparing proposals. Actually there are many more factors here, and 1000 users is not a well-defined parameter. Is it 1000 users per day, per minute, or per month? Do they need live chat support, do they need to see large amounts of data, or do they need to work fast? So maybe it is up to you to sell your program to the customer correctly, by explaining to him that a good program is just as good for one user as for one million users, and that the starting cost comes from the development and not from the number of users.
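To see how much the time period matters, this small Python sketch converts "1000 users" into an average request rate under three interpretations. The figure of 20 requests per visit is a hypothetical assumption chosen only for illustration.

```python
REQUESTS_PER_VISIT = 20   # hypothetical: page views plus background calls per visit


def avg_requests_per_second(users, period_seconds, requests_per_visit=REQUESTS_PER_VISIT):
    """Average request rate if `users` each make one visit within the period."""
    return users * requests_per_visit / period_seconds


periods = {"per minute": 60, "per day": 86_400, "per month": 30 * 86_400}
for label, seconds in periods.items():
    rate = avg_requests_per_second(1000, seconds)
    print(f"1000 users {label}: {rate:.4f} requests/s on average")
```

The same headline number spans several orders of magnitude of load, which is why it is worth asking the customer which one he means.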

Now, if this is a question about actually planning a site, then the simple answer is to start doing it, and the rest will be revealed. If this is a question because you are looking for answers for your client, then you must ask yourself: why does a Formula One car have a seat for only one person while your car fits five? Or how much does a movie cost? Or, we all know how to write, so why have we not all written and published a book? My point is that the cost actually comes from the time you spend on the project, and the number of users by itself cannot determine that.

Guess, Knowledge or Prediction?

How do you make a guess about the number of servers, bandwidth, storage, etc.? We actually do not guess. We have many sites, we automatically collect many statistics every day, we have many years of experience, and from the content of a site we know how many users it can have per day and how much bandwidth it can consume. We also have many databases running on our servers and can see how much data they use. For 99% of our sites all of these are low numbers. So this is knowledge and experience, backed by real live statistics. The prediction comes from monitoring the traffic and the usage of the sites: we try to make them better, to get more traffic and more users, and from what we achieve we try to predict whether they will need more resources in the future. Also, 99% of the sites are single-pool sites running very simple presentations.

* From the book

OTHER TIPS

Often this is very difficult, since the system is not even designed when the customer asks for the answer to this question, which makes it actually impossible to answer.

As a very rough rule of thumb we use 100 requests per second per server. The actual number will vary depending on the application and how the users use the system, but we have found it a good first estimate.

The disk usage for a document system is just the number of documents times the average document size. Bandwidth is the number of requests times the average size of a request.

You just document all of your assumptions and say that the hardware requirements are based on those assumptions.
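One way to keep the assumptions and the resulting hardware figures together is to write them down as data and derive the numbers from them. The Python sketch below does this with the 100 requests per second per server rule of thumb. The class name, field names, and all sample values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Assumptions:
    """Every field is an assumption to be confirmed with the customer."""
    peak_requests_per_second: float
    requests_per_day: float
    avg_response_kb: float
    documents: int
    avg_document_mb: float
    requests_per_server: float = 100.0   # rough rule of thumb

    def servers(self) -> int:
        return math.ceil(self.peak_requests_per_second / self.requests_per_server)

    def disk_gb(self) -> float:
        # number of documents times average size
        return self.documents * self.avg_document_mb / 1000

    def daily_transfer_gb(self) -> float:
        # number of requests times average size of a request
        return self.requests_per_day * self.avg_response_kb / 1_000_000


plan = Assumptions(
    peak_requests_per_second=250,
    requests_per_day=2_000_000,
    avg_response_kb=50,
    documents=400_000,
    avg_document_mb=2,
)
print(plan)
print(f"servers: {plan.servers()}, disk: {plan.disk_gb()} GB, "
      f"transfer: {plan.daily_transfer_gb()} GB/day")
```

Printing the assumptions next to the results makes it obvious which inputs a changed requirement would affect.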

While developing a recent ASP.NET MVC site, I used Selenium to load test it. Basically, you record a selection of macros in which you perform random tasks.

Then you use Selenium to simulate a number of users performing those macros. I tested my site with tens, hundreds, and then thousands of users. This lets you find trouble spots in code and in infrastructure before going live.
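The same idea can be sketched without a browser. The Python example below is not Selenium: it swaps in plain concurrent HTTP requests from the standard library, and it starts a tiny local demo server as a stand-in for the site under test so that it runs on its own. The handler, the function names, and the user counts are all assumptions for illustration.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for the site under test; point the URL at a real site instead."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # silence per-request logging
        pass


# Opener that ignores any proxy settings, so local requests stay local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def timed_get(url):
    """Fetch url once and return (status code, elapsed seconds)."""
    start = time.perf_counter()
    with opener.open(url, timeout=10) as response:
        response.read()
        status = response.status
    return status, time.perf_counter() - start


def run_load(url, simulated_users, requests_per_user):
    """Issue all requests using `simulated_users` concurrent workers."""
    total = simulated_users * requests_per_user
    with ThreadPoolExecutor(max_workers=simulated_users) as pool:
        return list(pool.map(timed_get, [url] * total))


server = ThreadingHTTPServer(("127.0.0.1", 0), DemoHandler)   # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

results = run_load(url, simulated_users=5, requests_per_user=10)
server.shutdown()
server.server_close()

latencies = sorted(elapsed for _, elapsed in results)
print(f"{len(results)} requests, slowest {latencies[-1] * 1000:.1f} ms")
```

Unlike recorded browser macros, this exercises only the server side, so it will miss problems in client-side scripts and page rendering.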

which figures of reference do you use?

There is really only one figure that needs to be looked at, and then extrapolated on: data. All figures will derive from data requirements.

Small example: A million requests per day for an 8-byte binary number will not crash anything and could be served by the simplest of web servers. The reason is that the request time will be fractions of a millisecond. There are 1000 (ms/s) * 60 (s/m) * 60 (m/h) * 24 (h/d) = 86.4 million milliseconds in one day, meaning that even if each request took a full millisecond, the 1 million requests would fit with plenty of room to spare, and the speed required to deliver 8 bytes per request would be in the 8 kB/s range.
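The arithmetic in the small example can be checked directly. This Python sketch assumes requests are served one at a time and each takes a full millisecond, which is the pessimistic case described above.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24      # milliseconds in one day
payload_bytes = 8                     # size of the binary number returned
service_time_ms = 1                   # pessimistic: a full millisecond per request

# Requests that fit in a day when served strictly one after another.
max_requests_per_day = MS_PER_DAY // service_time_ms

# Data rate when one 8-byte payload is sent every millisecond.
throughput_bytes_per_s = payload_bytes * 1000 // service_time_ms

print(MS_PER_DAY)               # 86400000
print(max_requests_per_day)     # 86400000
print(throughput_bytes_per_s)   # 8000, i.e. about 8 kB/s
```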

Real life version: Looking at the data will determine the requirements, and the data being retrieved is almost always in a database. The design of the database (even if only conceptual) can help determine how much data will be used.

There are multiple requirements in real life. First, the maximum capacity of the database, or filesystem, should be examined. This capacity can be calculated by looking at how much space each row of a table requires, summing the space consumed by each column (e.g. an id stored as a 4-byte integer takes 4 bytes of space). After summing the columns of one row for each table in the database, it is easy to tell how much storage each collection of tables will require (usually tables are linked through foreign keys).

After the table storage is considered, the users must be examined for requirements. Mainly of interest is how many tables each user will access per session (with no data this will be a guesstimate, so it is best to overestimate). Because we already know, or have a good idea of, the size of the database tables, we can estimate how much server memory each user will require. Comparing this memory usage with the number of expected users will help determine which server to use, or how many.

Next to figure out is how many rows will be inserted into the database as a result of user actions (again, an average guesstimate, or based on some collected test data). This is very speculative and is best done with testing. Without testing, assumptions should be overestimated. Based on how many rows each user will insert, it is possible to extrapolate the database size and the bandwidth requirements. These are determined by expanding the data requirement of one user to the requirements of n users per t time. The data required by n users makes it possible to see the bandwidth requirements over t time, and also determines how n users will grow the database over t time.
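The row-size and growth estimate can be sketched in a few lines of Python. The schema, the column sizes, and the insert rates below are all hypothetical, and the calculation ignores indexes and per-row storage overhead, which a real database adds on top.

```python
# Hypothetical schema: table -> {column name: storage size in bytes}.
TABLES = {
    "orders": {"id": 4, "user_id": 4, "total_cents": 8, "note": 200},
    "events": {"id": 8, "user_id": 4, "kind": 20, "created_at": 8},
}

# Hypothetical average number of rows each user inserts per day.
ROWS_PER_USER_PER_DAY = {"orders": 2, "events": 10}


def row_bytes(columns):
    """Space for one row: the sum of its column sizes."""
    return sum(columns.values())


def growth_bytes(n_users, t_days):
    """Database growth caused by n users over t days."""
    per_user_per_day = sum(
        row_bytes(TABLES[table]) * rows
        for table, rows in ROWS_PER_USER_PER_DAY.items()
    )
    return per_user_per_day * n_users * t_days


for table, columns in TABLES.items():
    print(f"{table}: {row_bytes(columns)} bytes per row")

yearly = growth_bytes(n_users=1000, t_days=365)
print(f"1000 users for one year: about {yearly / 1_000_000:.0f} MB")
```

Replacing the guessed insert rates with numbers from a test run is the quickest way to make the projection more trustworthy.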

In practice, we don't. We make sure we are able to expand rapidly (devops), have the option of falling back to using fewer resources per request, start with a very small number of users, and observe performance. Most small-medium projects don't want to spend much time and money on this. For a large or critical project it makes sense to create and run simulations.

Remember, one day of planning costs as much as an extra machine for a year.

You are using "capacity" to cover a number of non-functional qualities of a system, and are probably trying to encapsulate performance, capacity and scalability in one concept.

Let's start with performance. If you are dealing with a web-based architecture where you are serving resources, then this is really quite straightforward and can be split into 2 different KPIs: server response time and page load time (which should be called resource load time, since not all resources on the web are web pages).

Server response time measures the time to last byte for a request for a given resource. Please note that this does not include things such as content negotiation. You (or the business) need to specify the expected server response time for given types of resources. This is based on a single request/response, e.g. a response to a request for any resource that falls under the type 'Car Model' should take no more than 0.5 seconds, time to last byte.

Page load time takes things one step further: given a request for a resource, how long does it take to load that resource along with any dependent resources? It really has more meaning in the context of a web page. The Web being full of unknowns makes this a bit of a grey area, since all sorts of things come into play (the network, the client, content negotiation), so you need to specify this for a fixed/stabilised network and client (there are all sorts of tools to achieve this). It should also always be defined as an average, without introducing concurrency issues (we are still not thinking about capacity yet).

Once you have specified both, you can start to determine the immediate capacity of your system, i.e. how many requests per second for resources it can serve performantly (as specified above). There are loads of tools to help you measure this. This will give you an immediate measure of capacity. You'll notice I use the term immediate, because often the business might turn around and say: great, but what happens if we need to increase this capacity?
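Given load-test measurements, the immediate capacity is simply the highest load at which the response-time target is still met. The Python sketch below shows the idea. The measurements are invented, and the 0.5 second target is borrowed from the 'Car Model' example above.

```python
def immediate_capacity(measurements, target_seconds):
    """Return the highest measured load that still meets the target.

    measurements: list of (requests_per_second, avg_response_seconds)
    pairs from a load test. Returns None if no load level meets the target.
    """
    passing = [rps for rps, seconds in measurements if seconds <= target_seconds]
    return max(passing) if passing else None


# Hypothetical load-test results: (requests per second, average response time in s).
load_test = [(10, 0.12), (50, 0.15), (100, 0.28), (200, 0.47), (400, 1.90)]

capacity = immediate_capacity(load_test, target_seconds=0.5)
print(f"Immediate capacity: {capacity} requests/s")
```

The true limit lies somewhere between the last passing and the first failing load level, so adding measurements in that range narrows the estimate.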

So we move on to the third non-functional quality, scalability (n.b. there are more than 3 non-functional qualities of a system, including availability, reliability, validity, usability, accessibility, extensibility, and manageability). Given a certain capacity, by how much can I increase it performantly? There are all sorts of ways to increase capacity, but most systems by design have a bottleneck somewhere that creates a constraint.

Licensed under: CC-BY-SA with attribution
