Question

Is there a best algorithm for defining the capacity of a C# list in the constructor, if the general distribution of eventual sizes is known?

As a concrete example, if the numbers of values to be placed in each list has a mean of 500, and a standard deviation of 50, with approximately a normal distribution what is the best initial capacity for the list in terms of memory consumption?


Solution 4

I've done a little research and it seems that there is a "right" answer to this question.

First of all, I agree that this can be premature optimisation, so profiling before deciding to set a capacity at all is essential.

[Graph: memory wasted by initial capacity, for various standard deviations.]

The graph above was generated in Excel by sampling a normal distribution and measuring the space wasted by various initial list capacities, using 10,000 samples and a mean of 10,000. As you can see, it has several interesting features.

  1. For low standard deviations, picking a bad initial capacity can waste up to eight times the space of the best choice.
  2. For standard deviations that are high relative to the mean, there is less to be saved.
  3. Troughs, corresponding to the lowest memory wastage, occur at points that depend on the standard deviation.
  4. It is better to choose a value from the right half of the graph to avoid list reallocations.
  5. I couldn't find an exact formula for the minimum wastage, but mean + 1.75 × standard deviation seems to be the best choice based on this analysis (see the sketch below).

Caveat: YMMV with other distributions, means etc.
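For reference, a rough version of that analysis can be reproduced in code. The sketch below is a hedged C# re-creation of it (the original was done in Excel): it models List&lt;T&gt;'s doubling growth policy, draws sample counts from a normal distribution via the Box-Muller transform, and reports the average number of unused slots for a few candidate initial capacities, including mean + 1.75 × standard deviation. The standard deviation of 1,000, the seed, and the specific candidate capacities are illustrative assumptions, not the original figures.

```csharp
// Minimal sketch (not the original Excel workbook): estimates how much space a
// List<T> wastes for a given initial capacity, assuming the list doubles its
// capacity whenever it runs out of room. All constants are illustrative.
using System;

class CapacityWasteSimulation
{
    static void Main()
    {
        const int samples = 10_000;
        const double mean = 10_000, stdDev = 1_000; // assumed standard deviation
        var rng = new Random(42);

        // Candidate initial capacities, including the mean + 1.75 sigma heuristic.
        int[] capacities = { 10_000, 10_500, (int)(mean + 1.75 * stdDev), 12_000, 14_000 };

        foreach (int initialCapacity in capacities)
        {
            long wasted = 0;
            for (int i = 0; i < samples; i++)
            {
                int count = (int)Math.Round(NextGaussian(rng, mean, stdDev));

                // Model List<T> growth: start at the initial capacity and double
                // until there is room for 'count' elements.
                int capacity = initialCapacity;
                while (capacity < count) capacity *= 2;

                wasted += capacity - count; // unused slots for this sample
            }
            Console.WriteLine($"initial capacity {initialCapacity,6}: " +
                              $"average wasted slots {(double)wasted / samples:F1}");
        }
    }

    // Box-Muller transform: converts two uniform samples into one normal sample.
    static double NextGaussian(Random rng, double mean, double stdDev)
    {
        double u1 = 1.0 - rng.NextDouble(); // avoid log(0)
        double u2 = rng.NextDouble();
        double standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
        return mean + stdDev * standardNormal;
    }
}
```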

OTHER TIPS

Leave the list to decide. I wouldn't bother setting it (just use an empty constructor) unless you experience concrete performance problems, at which point there are probably other things you can fix first.

Premature optimisation is the root of all evil.

This is personal opinion rather than research-based, but remember that a List itself only holds a reference to each object, so it's probably better to err a little on the side of allocating space for a few too many references than to accidentally double the number of references you need. With that in mind, a full two or even three standard deviations extra (600 or 650) is probably not out of line. But, again, that's my opinion rather than a researched result.

If you go with the three-sigma rule (http://en.wikipedia.org/wiki/68-95-99.7_rule), a single sample will fall within three standard deviations of the mean 99.7% of the time.
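As a concrete illustration of that suggestion (using the question's figures, which are the only part taken from the post), a list sized at mean + 3 × standard deviation should almost never need to grow:

```csharp
// Sketch of the three-sigma suggestion; the class and method names are
// illustrative, and the figures (mean 500, standard deviation 50) come
// from the question.
using System.Collections.Generic;

class ThreeSigmaCapacityExample
{
    static List<double> CreateList()
    {
        const int mean = 500;
        const int stdDev = 50;

        // mean + 3 sigma covers ~99.7% of normally distributed sizes,
        // so growth (and the array copy it implies) should be rare.
        return new List<double>(mean + 3 * stdDev); // initial capacity 650
    }
}
```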

There's no right answer. It's going to be a tradeoff between memory usage and CPU. The larger you make the initial capacity, the more memory you're probably wasting, but you're saving CPU since the list doesn't have to be resized again later.
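To make the CPU side of that tradeoff visible, here is a small hedged sketch (names are illustrative) that watches List&lt;T&gt;.Capacity while items are added to an undersized list: each capacity change corresponds to a new backing array and a full copy of the existing elements.

```csharp
// Sketch: adding one more element than the initial capacity forces List<int>
// to allocate a larger backing array (doubling its capacity) and copy everything.
using System;
using System.Collections.Generic;

class GrowthExample
{
    static void Main()
    {
        var list = new List<int>(500);
        int lastCapacity = list.Capacity;

        for (int i = 0; i < 501; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                // Each capacity change means a new array allocation plus a copy.
                Console.WriteLine($"Resized from {lastCapacity} to {list.Capacity} at count {list.Count}");
                lastCapacity = list.Capacity;
            }
        }
    }
}
```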

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow