How are mixed, sizeless lists implemented in higher level languages?

https://softwareengineering.stackexchange.com/questions/324040

20-12-2020
|

Question

Most higher level (or scripting) languages out there have data structures that can hold different types of data (like numbers, strings and even functions) in the same structure, and you can also add elements without caring about their size, which you don't have to specify. They are sometimes called lists.

In contrast, lower level languages (say C++ or even better C which is the base language for many scripting languages) allow quite the opposite: they only support sized arrays which can only hold one specific type of data. You could virtually continue adding items after the specified size was filled, but that would be dangerous.

So how are these types of data structures found in higher level languages implemented in the lower level language they are written in?

The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when its size was filled, allocated a longer block of memory, copied the elements in that block and freed the first one, but that doesn't seem efficient at all and anyway only allows to add one specific type of item and only in a stack-like way (where you can only push or pop, not allocate an item at a random index).

Solution

Arrays, like all data structures, trade one kind of efficiency for another.

The efficiency that an array specializes in is that of rapidly looking up an element by its index. It does that, not by searching through the array for the correct element, but by performing a mathematical calculation. If an array has elements of size 8 bytes, and you want to look up the nth element, it will be located at the memory location which is 8 * n bytes from the beginning of the array.

If you want to add m elements to the end of an array, you can do that by simply allocating the additional memory at the end of the array, if it is free. If that memory is not free, you have to make a new array and copy the old data to it.

Lists work differently from arrays. Lists are data structures that link one element to the next with some kind of pointer. Each element can be stored anywhere in memory, so looking up an element quickly by index is no longer possible. You have to traverse the list, hopping from element to element through the links, until you find the element you want. Some lists have indexes such as b-trees or hashes that make this search process much faster, but it's still not as fast as calculating an index location in an array.

OTHER TIPS

I guess the kind of "lists" you had in mind are not the "linked lists" mentioned by Robert Harvey, but the kind of arrays which are called list in Python, List or ArrayList in C#, ArrayList in Java, or std::vector in C++. Another popular term for this kind of data structure is "dynamic array".

These are indeed implemented internally exactly the way you sketched it: each of them has a maximum capacity, and when that limit is reached, they automatically allocate new block of memory, and copy the content to that new block. A popular, general purpose reallocation strategy is to enlarge the block with a constant factor somewhere between 1 and 2. For most real world programs, this works surprisingly efficient, and if you have a case where this is not efficient enough, you can often control the capacity manually.

And yes, these lists typically support only elements of the same data type. But in Java and C#, all data types are derived (or at least can be derived) from a common base class "object". This lets the dynamic array data type internally store a reference to the content, which allows content values of different types (the references, however, are all of the same size, regardless of the type of the content). Though I am not a Python expert, I am pretty sure that the list implementation will internally work in a similar manner.

Since you wrote, you only have used C++: a std::vector in C++ is bound to a certain type, but by using void * or a pointer to a base type of an inheritance structure, one can come around this limitation. The latter can manage the type of the different elements of the vector by a mechanism called RTTI for you, for using the former you need to manage the type information by yourself somehow.

The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when its size was filled, allocated a longer block of memory, copied the elements in that block and freed the first one, but that doesn't seem efficient at all and anyway only allows to add one specific type of item and only in a stack-like way (where you can only push or pop, not allocate an item at a random index).

First of all, good for you for doing this exercise. It is highly educational to try to figure out how these higher-level data structures are implemented in a low-level language. Let me give you some additional thoughts on this.

Try going lower. I notice for example that you take for granted that there is such a thing as a memory allocator, and that it gives you back a block of the size you want, and you can free it when you are done with it. There was a day when those functions did not exist. Someone had to write them for you. Suppose you had functions that could allocate and free memory, but only in 4k blocks. How would you write malloc and free, given only those functions? This is a highly educational exercise.
Be less vague. You say that your proposed "if it doesn't fit, allocate a new block and copy" strategy is "inefficient". Efficiency is defined as value produced divided by resources consumed. State what value and resources you care about, and then work out the actual numeric efficiency of your implementation.
Implement more features. You say that you cannot add an item in the middle of a resize-when-full list, but why not? You can probably figure out how to do so. What is the efficiency of this operation? What about deleting an item from the middle of a list? Your list gets bigger when it gets full; does it get smaller when it gets empty? What's the efficiency of that?
Try a different data structure. Can you implement a linked list? What about a binary tree? What about a balanced binary tree? Can you use them as your abstract list data type? Work out the efficiency of each operation in each of these implementation choices.
Embrace genericity. Can you use templates to make abstract lists that contain elements of a particular type?

To answer your actual question:

Most higher level (or scripting) languages out there have data structures that can hold different types of data (like numbers, strings and even functions) in the same structure ... So how are these types of data structures found in higher level languages implemented in the lower level language they are written in?

You already deduced how variable-sized lists are implemented: there's an array somewhere, it is monitored for fullness, and reallocated when full. But what about list elements all of different sizes?

I can certainly describe for you how we implemented just such a data structure in the mid-1990s implementation of JavaScript at Microsoft. The solution is very simple and clear for the C developer: we used a discriminated union big enough to hold one of the largest thing that would go in the list.

To be precise, the discriminated union type was the OLE Automation VARIANT type used by Visual Basic, which is described here: https://en.wikipedia.org/wiki/Variant_type.

Since each VARIANT is 16 bytes, all you need to do is allocate arrays of 16 byte structs and double that when it gets full. That means that an array of a thousand two-byte integers takes 16K, not 2K, but who cares?

Sadly, this solution is insufficient; we have another problem. In JavaScript, arrays are sparse and associative. That is, arrays can be indexed 0, 1, 5, 1000, "giraffe", if you like. So a C-style dense, integer-indexed array is not going to cut it.

In fact JavaScript arrays used a dynamically grown hash table for lookups, and an associated linked list for enumerations. That way lookups were sublinear but enumerations did not change order as items were added. Indexing was achieved by converting indexes to strings and then hashing the string. To further improve performance in loops, the hash buckets were MRU linked lists.

Implementing such a data structure in C++ is highly educational; give it a shot.

Mixed types are accommodated by storing pointers to objects.

The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when its size was filled, allocated a longer block of memory, copied the elements in that block and freed the first one, but that doesn't seem efficient at all

It's quite efficient in time. Typically the new size is twice the old size. So suppose the original size is 128, then it doubles to 256, then to 512, and then to 1024. The first 128 elements are copied three times, the next 128 are copied twice, the next 256 once. The average number of copies is Sum(n/(2^n)) = 2.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange