Question

Why does the classic implementation of Vector (ArrayList for Java people) double its internal array size on each expansion instead of tripling or quadrupling it?


Solution

When calculating the average time to insert into a vector, you need to allow for the non-growing inserts and the growing inserts.

Call the total number of operations to insert n items o_total, and the average per insert o_average.

If you insert n items, growing by a factor of A as required, then there are o_total = n + Σ A^i [ 0 < i < 1 + log_A(n) ] operations. In the worst case you use only 1/A of the allocated storage.

Intuitively, A = 2 means that at worst you have o_total = 2n, so o_average is O(1), and in the worst case you use 50% of the allocated storage.

For a larger A, you have a lower o_total, but more wasted storage.

For a smaller A, ototal is larger, but you don't waste so much storage. As long as it grows geometrically, it's still O(1) amortised insertion time, but the constant will get higher.
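A rough way to check this arithmetic is to simulate the inserts and count element writes, under the same assumptions as the analysis above (one operation per element written, including every copy during a resize). The class and method names here are illustrative, not any library's API:

```java
public class GrowthCost {
    // Returns the total element-write operations needed to append n items,
    // growing capacity by factor a whenever the backing array is full.
    static long totalOps(int n, double a) {
        long ops = 0;
        int capacity = 1;
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {
                ops += size; // copy every existing element to the new array
                capacity = Math.max(capacity + 1, (int) (capacity * a));
            }
            ops++;           // write the newly inserted element
            size++;
        }
        return ops;
    }

    public static void main(String[] args) {
        int n = 400_000;
        for (double a : new double[] {1.5, 2.0, 3.0}) {
            System.out.printf("A = %.1f: ops/insert = %.2f%n",
                    a, (double) totalOps(n, a) / n);
        }
    }
}
```

With A = 2 the ratio stays bounded by a small constant, matching the o_total = 2n intuition; smaller factors give a larger constant, larger factors a smaller one.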

For growth factors 1.25 (red), 1.5 (cyan), 2 (black), 3 (blue) and 4 (green), these graphs show point and average size efficiency (ratio of size to allocated space; more is better) on the left, and time efficiency (ratio of insertions to operations; more is better) on the right, for inserting 400,000 items. 100% space efficiency is reached for every growth factor just prior to resizing. The case A = 2 shows time efficiency between 25% and 50%, and space efficiency of about 50%, which is good for most cases:

[Figure: space and time efficiency graphs, C-like implementations]

For runtimes such as Java, arrays are zero-filled, so the number of operations to allocate is proportional to the size of the array. Taking this into account reduces the difference between the time-efficiency estimates:

[Figure: space and time efficiency graphs, Java-like implementations]

OTHER TIPS

Doubling the size of the array (or string) each time, so that it grows exponentially, is a good compromise between having enough cells in the array and wasting too much memory.

Say we start off with 10 elements:

1 - 10
2 - 20
3 - 40
4 - 80
5 - 160

When we triple the size, we grow too fast:

1 - 10
2 - 30
3 - 90
4 - 270
5 - 810
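The two tables above can be reproduced with a small loop; the helper name is illustrative:

```java
import java.util.Arrays;

public class GrowthTable {
    // Capacity after each of the first `steps` resizes,
    // starting from `start` and multiplying by `factor`.
    static int[] capacities(int start, int factor, int steps) {
        int[] caps = new int[steps];
        int cap = start;
        for (int i = 0; i < steps; i++) {
            caps[i] = cap;
            cap *= factor; // grow by the chosen factor on each resize
        }
        return caps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(capacities(10, 2, 5))); // doubling
        System.out.println(Arrays.toString(capacities(10, 3, 5))); // tripling
    }
}
```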

In practice you might grow 10 or 12 times. If you tripled, you might only do it 7 or 8 times; the runtime hit from reallocating that few times is too small to worry about, but you are more likely to overshoot the required size completely.

If you were to allocate an unusual-sized block of memory, then when that block is deallocated (either because you're resizing it or because it gets GC'd) there would be an unusual-sized hole in memory, which can cause headaches for the memory manager. So it's usually preferred to allocate memory in powers of two. In some cases the underlying memory manager will only give you blocks of certain sizes, and if you request an odd size it will round up to the next larger size. So rather than asking for 470 units, getting back 512 anyway, and then resizing again once you've used all 470 that you asked for, you might as well just ask for 512 to begin with.
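As a sketch of that rounding, a request can be bumped up to the next power of two with Integer.highestOneBit; the helper here is hypothetical, not a real allocator API:

```java
public class PowerOfTwo {
    // Rounds a positive request up to the next power of two,
    // mirroring what some memory managers do internally.
    static int nextPowerOfTwo(int n) {
        int highest = Integer.highestOneBit(n); // largest power of two <= n
        return (highest == n) ? n : highest << 1;
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(470)); // a 470-unit request becomes 512
    }
}
```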

Any multiple is a compromise. Make it too big and you waste too much memory; make it too small and you waste much time on reallocations and copying. I guess doubling is used because it works and is very easy to implement. I have also seen a proprietary STL-like library that uses 1.5 as the multiplier for the same reason - I guess its developers considered doubling too wasteful of memory.

If you are asking about the Java-specific implementation of Vector and ArrayList, then it's not necessarily doubled on each expansion.

From the Javadoc for Vector:

Each vector tries to optimize storage management by maintaining a capacity and a capacityIncrement. The capacity is always at least as large as the vector size; it is usually larger because as components are added to the vector, the vector's storage increases in chunks the size of capacityIncrement. An application can increase the capacity of a vector before inserting a large number of components; this reduces the amount of incremental reallocation.

One of the constructors for Vector allows you to specify the initial size and capacity increment for the Vector. The Vector class also provides the ensureCapacity(int minCapacity) and setSize(int newSize) methods, for manually adjusting the minimum size of the Vector and for resizing the Vector on your own.
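For example, with the java.util.Vector API quoted above, a non-zero capacityIncrement makes the vector grow by that fixed amount instead of doubling:

```java
import java.util.Vector;

public class VectorCapacityDemo {
    public static void main(String[] args) {
        // Initial capacity 10, capacityIncrement 5: when full, the Vector
        // grows by 5 elements rather than doubling.
        Vector<Integer> v = new Vector<>(10, 5);
        for (int i = 0; i < 11; i++) {
            v.add(i);                    // the 11th add triggers one resize
        }
        System.out.println(v.capacity()); // grew from 10 by the increment of 5

        v.ensureCapacity(100);            // pre-allocate before a bulk insert
        System.out.println(v.capacity() >= 100);
    }
}
```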

The ArrayList class is very similar:

Each ArrayList instance has a capacity. The capacity is the size of the array used to store the elements in the list. It is always at least as large as the list size. As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.

An application can increase the capacity of an ArrayList instance before adding a large number of elements using the ensureCapacity operation. This may reduce the amount of incremental reallocation.

If you are asking about the general implementation of a vector, then the choice of when to increase the size and by how much is a trade-off. Generally, vectors are backed by arrays, and arrays are of a fixed size. Resizing a vector because it's full means copying all the elements of the array into a new, larger array. If you make the new array too large, you have allocated memory that you will never use. If it's too small, you will soon have to grow again and repeat the copy - an operation that you don't want to perform very often.
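A minimal sketch of that copy-on-resize behaviour (the names are illustrative, not the JDK's actual implementation):

```java
import java.util.Arrays;

// Minimal growable-array sketch: a fixed-size backing array that is
// replaced by one twice as large whenever it fills up.
public class SimpleVector {
    private Object[] data = new Object[8]; // backing fixed-size array
    private int size = 0;

    void add(Object e) {
        if (size == data.length) {
            // Full: copy every element into a new array twice as large.
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = e;
    }

    int size() { return size; }
    int capacity() { return data.length; }
}
```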

Personally, I think it's an arbitrary choice. We could use base e instead of base 2 (instead of doubling, just multiply the size by (1 + e)).

If you are going to be adding large numbers of elements to the vector, then it would be advantageous to have a high base (to reduce the amount of copying you will be doing). On the flip side, if you only need to store a few members on average, then a low base will be fine and reduce the amount of overhead, hence speeding things up.

Base 2 is a compromise.

There is no asymptotic performance reason for doubling versus tripling or quadrupling, as they all have the same big-O performance profiles. However, in absolute terms, doubling tends to be more space-efficient in the normal scenario.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow