Question

I need a data structure d with somewhat conflicting requirements. What are the different tradeoffs I could pick?

The same algorithm will be run repeatedly, once per time step:

  • Push one new element to the right (so all elements remain naturally ordered by the date they entered the data structure).
  • Update all elements, i.e. iterate over them (in any order, possibly in parallel) and mutate each in place.
  • Based on the result of the update, drain arbitrary elements (new or old) in order, i.e. some elements must be both returned to the user in order and removed from d.
  • The user should also be able to iterate easily, in order, over the remaining elements.

For instance, think of a batch of fruit where each piece takes a random time to ripen. One new piece of fruit is spawned each day, and each day we need to collect the ripe fruit (in order), while keeping the remaining pieces ordered by age.
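To make this concrete, here is a minimal sketch of one time step against a plain Vec (in Rust; the Fruit type and the deterministic stand-in for the random ripening time are just for illustration):

    struct Fruit {
        spawned_on: u64,    // day this fruit entered the structure (insertion order)
        days_to_ripen: u32, // remaining days until it is ripe
    }

    /// One time step over a plain Vec: push, update in place, drain ripe items in order.
    fn step(day: u64, fruits: &mut Vec<Fruit>) -> Vec<Fruit> {
        // 1. Push one new element to the right; the Vec stays ordered by insertion date.
        fruits.push(Fruit { spawned_on: day, days_to_ripen: (day % 5 + 1) as u32 });

        // 2. Update all elements in place (could also be done in parallel).
        for f in fruits.iter_mut() {
            f.days_to_ripen = f.days_to_ripen.saturating_sub(1);
        }

        // 3. Drain the ripe ones; order is preserved in both the returned and the kept list.
        let (ripe, unripe): (Vec<Fruit>, Vec<Fruit>) =
            fruits.drain(..).partition(|f| f.days_to_ripen == 0);
        *fruits = unripe;
        ripe
    }

    fn main() {
        let mut fruits = Vec::new();
        for day in 0..10 {
            let ripe = step(day, &mut fruits);
            println!("day {day}: collected {}, remaining {}", ripe.len(), fruits.len());
        }
    }

The question is what to use instead of the plain Vec when this naive version is too slow.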

Is there a dedicated data structure?
If not, is there a pattern for this case?
If not, what are the different tradeoffs I'll have to deal with? Is there any assumption I could make on the data (like drain frequency) to help me choose?
(For instance, I suspect that many pieces of fruit will ripen on the first update or within the first few updates, and only a few will last longer.)


Solution

Tree data structures allow O(log n) random access and removal as well as easy in-order traversal. A naive binary tree will likely perform worse than a flat array due to poor memory locality, but you are not forced to use binary trees.

You can implement k-ary trees that store an array of items per tree node. The size parameter k is tunable: for sufficiently large k you get memory locality benefits similar to an array's, while smaller values of k make random deletions cheaper.
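Rust's standard BTreeMap is already such a structure (a B-tree holding several entries per node), so it can serve as an off-the-shelf approximation. A rough sketch, assuming a hypothetical Fruit type and keying the map by a monotonically increasing insertion counter so that iteration order equals age order:

    use std::collections::BTreeMap;

    struct Fruit {
        days_to_ripen: u32, // remaining days until ripe
    }

    fn step(next_id: &mut u64, fruits: &mut BTreeMap<u64, Fruit>) -> Vec<(u64, Fruit)> {
        // Push: the monotonically increasing key preserves insertion (age) order.
        fruits.insert(*next_id, Fruit { days_to_ripen: (*next_id % 5 + 1) as u32 });
        *next_id += 1;

        // Update all elements in place.
        for f in fruits.values_mut() {
            f.days_to_ripen = f.days_to_ripen.saturating_sub(1);
        }

        // Drain ripe elements in key (= age) order: collect their keys, then remove them.
        let ripe_keys: Vec<u64> = fruits
            .iter()
            .filter(|(_, f)| f.days_to_ripen == 0)
            .map(|(&k, _)| k)
            .collect();
        ripe_keys
            .into_iter()
            .map(|k| (k, fruits.remove(&k).unwrap()))
            .collect()
    }

    fn main() {
        let (mut next_id, mut fruits) = (0, BTreeMap::new());
        for day in 0..10 {
            let ripe = step(&mut next_id, &mut fruits);
            println!("day {day}: collected {}, remaining {}", ripe.len(), fruits.len());
        }
    }

Iterating over the remaining elements in age order is then just iterating the map.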

Similarly, you can upgrade a linked list by storing not a single value per link but an array of items per link (an "unrolled" linked list). Of course, linked lists still have O(n) random access, regardless of such strategies.
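A minimal structural sketch of such an unrolled list (the chunk capacity is an assumption to tune; a real implementation would also keep a tail pointer so appends don't walk the chain):

    // Tunable chunk capacity: larger chunks improve locality,
    // smaller chunks make deletions within a chunk cheaper.
    const CHUNK: usize = 32;

    struct Chunk<T> {
        items: Vec<T>,                // up to CHUNK items stored contiguously
        next: Option<Box<Chunk<T>>>,
    }

    struct UnrolledList<T> {
        head: Option<Box<Chunk<T>>>,
    }

    impl<T> Chunk<T> {
        fn push_back(&mut self, value: T) {
            if let Some(next) = &mut self.next {
                next.push_back(value);           // not the last chunk: keep walking
            } else if self.items.len() < CHUNK {
                self.items.push(value);          // room left in the last chunk
            } else {
                self.next = Some(Box::new(Chunk { items: vec![value], next: None }));
            }
        }
    }

    impl<T> UnrolledList<T> {
        fn push_back(&mut self, value: T) {
            match &mut self.head {
                Some(chunk) => chunk.push_back(value),
                None => self.head = Some(Box::new(Chunk { items: vec![value], next: None })),
            }
        }
    }

Deleting an item only shifts entries within one small per-chunk array instead of the whole collection.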

Another strategy to improve memory locality is a kind of arena allocation: store all nodes in a dynamic array and link them by index. Not only will all nodes be close together in memory, but the links can be small integer indices instead of full-sized pointers. If your entries have an “empty” representation, you could also store the nodes in insertion order and iterate over the array while skipping empty items. Occasionally, you'll have to re-compact the array so that large holes don't accumulate in the middle.
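A sketch of the ordered-with-holes variant (the Arena name, the hole counter, and the compaction threshold are illustrative assumptions): slots stay in insertion order, removals leave holes, and the vector is re-compacted once holes dominate.

    /// Insertion-ordered slot array: removals leave holes (None), which iteration skips.
    struct Arena<T> {
        slots: Vec<Option<T>>,
        holes: usize,
    }

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { slots: Vec::new(), holes: 0 }
        }

        fn push(&mut self, value: T) {
            self.slots.push(Some(value)); // new items always go to the right
        }

        /// In-order iteration over live items, skipping holes.
        fn iter_mut(&mut self) -> impl Iterator<Item = &mut T> + '_ {
            self.slots.iter_mut().filter_map(|s| s.as_mut())
        }

        /// Remove and return (in order) every live item matching the predicate.
        fn drain_where(&mut self, mut pred: impl FnMut(&T) -> bool) -> Vec<T> {
            let mut out = Vec::new();
            for slot in self.slots.iter_mut() {
                if slot.as_ref().map_or(false, &mut pred) {
                    out.push(slot.take().unwrap());
                    self.holes += 1;
                }
            }
            // Occasionally re-compact so iteration doesn't crawl over long runs of holes.
            if self.holes > self.slots.len() / 2 {
                self.slots.retain(|s| s.is_some());
                self.holes = 0;
            }
            out
        }
    }

The per-step loop looks like the plain-Vec version above, except that removal is a cheap slot.take() and the O(n) compaction cost is paid only occasionally.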

If removals are not random but tend to occur at the old end (so that your data structure behaves like a probabilistic FIFO queue), and your entries have an empty representation that can be skipped over, then a circular buffer might be a good fit. Again, you would have to re-compact the buffer occasionally.
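For example, Rust's VecDeque is a ring buffer; with Option as the "empty" representation, removals just blank a slot, and trimming empty slots off the old end keeps the buffer compact (the stand-in ripening rule is an assumption):

    use std::collections::VecDeque;

    /// Ring buffer of slots; removals leave an empty slot (None). Since most removals
    /// happen near the old end, popping empty slots off the front keeps it compact.
    struct FruitRing {
        slots: VecDeque<Option<u32>>, // stand-in item: remaining days until ripe
    }

    impl FruitRing {
        fn step(&mut self, day: u32) -> Vec<u32> {
            // Push one new item to the right (stand-in for a random ripening time).
            self.slots.push_back(Some(day % 5 + 1));

            // Update all live items in place.
            for slot in self.slots.iter_mut() {
                if let Some(days) = slot {
                    *days -= 1;
                }
            }

            // Drain ripe items in order by emptying their slots.
            let mut ripe = Vec::new();
            for slot in self.slots.iter_mut() {
                if matches!(slot, Some(0)) {
                    ripe.push(slot.take().unwrap());
                }
            }

            // Cheap partial compaction: pop empty slots off the old end.
            while let Some(None) = self.slots.front() {
                self.slots.pop_front();
            }
            ripe
        }
    }

    fn main() {
        let mut ring = FruitRing { slots: VecDeque::new() };
        for day in 0..10 {
            let ripe = ring.step(day);
            println!("day {day}: collected {}, slots used {}", ripe.len(), ring.slots.len());
        }
    }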

You actually assume the reverse: that many items will be removed quickly, while a few will remain much longer. Here it might be useful to take inspiration from garbage collection strategies, e.g. a generational GC: store the young generation in a simple data structure such as a dynamic array. You can zero out entries when removing them and compact the survivors by moving them to the front. Occasionally, move the frontmost (oldest) items to a separate data structure for the long-lived generation. Depending on your access patterns and on the sizes of the generations, a more sophisticated data structure may be appropriate for the old generation.
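A sketch of that generational idea (the Fruit fields, the promotion threshold, and the choice of a plain Vec for the old generation are all assumptions made for illustration):

    struct Fruit {
        age: u32,           // steps since insertion
        days_to_ripen: u32, // remaining days until ripe
    }

    const PROMOTE_AFTER: u32 = 3; // survivors older than this move to the old generation

    struct Generations {
        young: Vec<Fruit>, // short-lived majority: scanned and compacted every step
        old: Vec<Fruit>,   // long-lived minority: touched rarely; could be a fancier structure
    }

    impl Generations {
        fn step(&mut self) -> Vec<Fruit> {
            // Push one new element into the young generation (stand-in ripening time).
            let t = (self.young.len() + self.old.len()) as u32 % 5 + 1;
            self.young.push(Fruit { age: 0, days_to_ripen: t });

            // Update every element in place.
            for f in self.old.iter_mut().chain(self.young.iter_mut()) {
                f.age += 1;
                f.days_to_ripen = f.days_to_ripen.saturating_sub(1);
            }

            // Drain ripe elements in age order: every old item is older than every young item.
            let mut ripe = Vec::new();
            for g in [&mut self.old, &mut self.young] {
                let (done, alive): (Vec<_>, Vec<_>) =
                    g.drain(..).partition(|f| f.days_to_ripen == 0);
                ripe.extend(done);
                *g = alive;
            }

            // Promote long-lived survivors from the front (oldest) of the young generation.
            while self.young.first().map_or(false, |f| f.age > PROMOTE_AFTER) {
                let f = self.young.remove(0); // a VecDeque would avoid the shift here
                self.old.push(f);
            }
            ripe
        }
    }

    fn main() {
        let mut gens = Generations { young: Vec::new(), old: Vec::new() };
        for day in 0..10 {
            let ripe = gens.step();
            println!("day {day}: ripe {}, young {}, old {}",
                     ripe.len(), gens.young.len(), gens.old.len());
        }
    }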

OTHER TIPS

That sounds like a doubly-linked list. Appending, (sequentially) enumerating, and deleting are all cheap operations.

Parallel algorithms on doubly-linked lists are harder. If you can't meet your performance goals with sequential enumeration, you could use a list of lists instead, although that makes every operation a bit trickier.
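For instance, a list of lists can have its chunks updated in parallel with scoped threads, while enumeration and draining stay sequential and in order (a Rust sketch; the Fruit type and the chunking are assumptions):

    use std::thread;

    struct Fruit {
        days_to_ripen: u32, // remaining days until ripe
    }

    /// Update every chunk in parallel; chunks are disjoint, so this is safe to do.
    fn parallel_update(chunks: &mut [Vec<Fruit>]) {
        thread::scope(|s| {
            for chunk in chunks.iter_mut() {
                s.spawn(move || {
                    for f in chunk.iter_mut() {
                        f.days_to_ripen = f.days_to_ripen.saturating_sub(1);
                    }
                });
            }
        });
    }

    fn main() {
        // Two chunks standing in for the "list of lists"; in practice chunk sizes are tuned.
        let mut chunks = vec![
            vec![Fruit { days_to_ripen: 1 }, Fruit { days_to_ripen: 3 }],
            vec![Fruit { days_to_ripen: 2 }],
        ];
        parallel_update(&mut chunks);

        // Drain ripe items chunk by chunk, preserving overall insertion order.
        let mut ripe = Vec::new();
        for chunk in chunks.iter_mut() {
            let (done, alive): (Vec<_>, Vec<_>) =
                chunk.drain(..).partition(|f| f.days_to_ripen == 0);
            ripe.extend(done);
            *chunk = alive;
        }
        let remaining: usize = chunks.iter().map(|c| c.len()).sum();
        println!("collected {}, remaining {}", ripe.len(), remaining);
    }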

Licensed under: CC-BY-SA with attribution