Question

I'm looking for an ordered data structure which allows very fast insertion. That's the only property required. Data will only be accessed and deleted from the top element.

To be more precise, I need two structures:

1) The first structure should allow an ordered insertion using an int value. On completing the insertion, it shall report the rank of the inserted element.

2) The second structure should allow insertion at a specified rank.

The number of elements to be stored is likely to be in thousands, or tens of thousands.

[edit] I must amend the volume hypothesis: even though, at any moment, the size of the ordered structure is likely to be in the tens of thousands, the total number of insertions is likely to be in the tens of millions per run.

Insertion time in O(1) would be nice, although O(log(log(n))) is very acceptable too. Currently I have some interesting candidates for the first structure only, but they are either O(log(n)) or unable to report the insertion rank (which is mandatory).

Was it helpful?

Solution

What about a form of skip list, specifically the "indexed skiplist" in the linked article? That should give O(lg N) insert and lookup, and O(1) access to the first node, for both your use cases.
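For reference, here is a minimal sketch of the width-augmented ("indexed") skip list idea in C. The names (`IndexedSkipList`, `isl_insert`) and the fixed `MAX_LEVEL` are my own choices, not from the linked article. Each forward link records how many elements it spans, so the widths jumped over during the search sum to the insertion rank:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_LEVEL 8            /* comfortable for ~2^MAX_LEVEL elements at p = 1/2 */

typedef struct SkipNode {
    int key;
    struct SkipNode *next[MAX_LEVEL];
    int width[MAX_LEVEL];      /* elements spanned by next[i]; NULL links span to a virtual end */
} SkipNode;

typedef struct {
    SkipNode head;
    int size;
    unsigned rng;              /* xorshift state: deterministic level choice */
} IndexedSkipList;

void isl_init(IndexedSkipList *s) {
    memset(s, 0, sizeof *s);
    s->rng = 2463534242u;
    for (int i = 0; i < MAX_LEVEL; i++)
        s->head.width[i] = 1;  /* distance to the virtual end of an empty list */
}

static int isl_random_level(IndexedSkipList *s) {
    s->rng ^= s->rng << 13; s->rng ^= s->rng >> 17; s->rng ^= s->rng << 5;
    unsigned r = s->rng;
    int lvl = 1;
    while ((r & 1) && lvl < MAX_LEVEL) { lvl++; r >>= 1; }
    return lvl;
}

/* Insert key and return its rank: the number of existing elements < key. */
int isl_insert(IndexedSkipList *s, int key) {
    SkipNode *update[MAX_LEVEL];
    int rank_at[MAX_LEVEL];    /* index of update[i] (head = index 0) */
    SkipNode *x = &s->head;
    int pos = 0;
    for (int i = MAX_LEVEL - 1; i >= 0; i--) {
        while (x->next[i] && x->next[i]->key < key) {
            pos += x->width[i];            /* sum the widths we jump over */
            x = x->next[i];
        }
        update[i] = x;
        rank_at[i] = pos;
    }
    int rank = pos;

    SkipNode *node = calloc(1, sizeof *node);
    node->key = key;
    int lvl = isl_random_level(s);
    for (int i = 0; i < MAX_LEVEL; i++) {
        if (i < lvl) {                     /* splice in: split the crossed link's width */
            node->next[i]  = update[i]->next[i];
            node->width[i] = update[i]->width[i] - (rank - rank_at[i]);
            update[i]->next[i]  = node;
            update[i]->width[i] = (rank - rank_at[i]) + 1;
        } else {                           /* this link now skips one more element */
            update[i]->width[i] += 1;
        }
    }
    s->size++;
    return rank;
}
```

`isl_insert` returns the count of existing elements strictly smaller than the new key, and the smallest element is always `s.head.next[0]`, so top access stays O(1).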

--Edit--

When I think of O(1) algorithms, I think of radix-based methods. Here is an O(1) insert with rank returned. The idea is to break the key up into nibbles and keep a count of all the inserted items that have each prefix. Unfortunately, the constant is high (<= 64 dereferences and additions), and the storage is O(2 x 2^INT_BITS), which is awful. This is the version for 16-bit ints; expanding to 32 bits should be straightforward.

int *p1; int *p2; int *p3; int *p4;
void **records;
unsigned int min = 0xFFFF;

int init(void)     {
   p1 = (int*)calloc(16,sizeof(int));
   p2 = (int*)calloc(256, sizeof(int));
   p3 = (int*)calloc(4096, sizeof(int));
   p4 = (int*)calloc(65536,sizeof(int));
   records = (void**)calloc(65536,sizeof(void*));
   return 0;
}

//records that we are storing one more item, 
//counts the number of smaller existing items
int Add1ReturnRank(int* p, int offset, int a) {
   int i, sum=0;
   p+=offset;
   for (i=0;i<a;i++)
      sum += p[i];
   p[i]++;
   return sum;
}

int insert(int key, void* data) {
   unsigned int i4 = (unsigned int)key;
   unsigned int i3= (i4>> 4);
   unsigned int i2= (i3>> 4);
   unsigned int i1= (i2>> 4);
   int rank = Add1ReturnRank(p1,0, i1&0xF);
   rank += Add1ReturnRank(p2,i2&0xF0,i2&0xF);
   rank += Add1ReturnRank(p3,i3&0xFF0,i3&0xF);
   rank += Add1ReturnRank(p4,i4&0xFFF0,i4&0xF);
   if (min > i4) { min = i4; }
   records[i4] = data;
   return rank;
}

This structure also supports O(1) GetMin and RemoveMin. (GetMin is instant, Remove has a constant similar to Insert.)

void* getMin(int* key) {
    return records[*key = min];
}

void* removeMin(int* key)  {
   void* data = records[min];
   unsigned int i4 = min;
   unsigned int i3= (i4>> 4);
   unsigned int i2= (i3>> 4);
   unsigned int i1= (i2>> 4);

   p4[i4]--;
   p3[i3]--;
   p2[i2]--;
   p1[i1]--;
   *key = min;
   while (!p1[i1]) {
      if (i1==15) { min = 0xFFFF; return data; } /* structure is now empty */
      i2 = (++i1)<<4;
   }
   while (!p2[i2])
      i3 = (++i2)<<4;
   while (!p3[i3])
      i4 = (++i3)<<4;
   while (!p4[i4])
      ++i4;
   min = i4;
   return data;
}

If your data is sparse and well distributed, you could remove the p4 counter and instead do an insertion sort at the p3 level. That would reduce storage by a factor of 16, at the cost of a higher worst-case insert when there are many similar values.
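As a rough illustration of that variation (the bucket layout and all names here are my own, not from the code above), the 65536-entry p4 array is replaced by one growable sorted array per 12-bit prefix, and the position found by the insertion sort supplies the last part of the rank:

```c
#include <stdlib.h>
#include <stdint.h>

/* Counters for the top three nibble levels, as in the radix version. */
static int p1[16], p2[256], p3[4096];

/* One bucket per 12-bit prefix: a growable array kept sorted by insertion sort. */
typedef struct { uint16_t *keys; int count, cap; } Bucket;
static Bucket buckets[4096];

/* Add one item at p[offset + a] and count smaller items sharing the prefix. */
static int add1_return_rank(int *p, int offset, int a) {
    int sum = 0;
    p += offset;
    for (int i = 0; i < a; i++) sum += p[i];
    p[a]++;
    return sum;
}

/* Insert key; returns the number of existing keys strictly smaller. */
int insert_sparse(uint16_t key) {
    unsigned i3 = key >> 4, i2 = key >> 8, i1 = key >> 12;
    int rank = add1_return_rank(p1, 0, i1 & 0xF);
    rank += add1_return_rank(p2, i2 & 0xF0, i2 & 0xF);
    rank += add1_return_rank(p3, i3 & 0xFF0, i3 & 0xF);

    /* Last nibble: insertion sort into the bucket instead of p4 counters. */
    Bucket *b = &buckets[i3];
    if (b->count == b->cap) {
        b->cap = b->cap ? b->cap * 2 : 4;
        b->keys = realloc(b->keys, b->cap * sizeof *b->keys);
    }
    int pos = b->count;
    while (pos > 0 && b->keys[pos - 1] >= key) {  /* shift keys >= key right */
        b->keys[pos] = b->keys[pos - 1];
        pos--;
    }
    b->keys[pos] = key;
    b->count++;
    return rank + pos;
}
```

Storage drops from 65536 counters to 4096 bucket headers plus the keys actually stored, while a bucket full of near-equal keys degrades the insert toward O(bucket size).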

Another idea to improve the storage would be to combine this idea with something like an extendible hash. Use the integer key as the hash value, and keep a count of the inserted nodes in the directory. Doing a sum over the relevant directory entries on an insert (as above) should still be O(1) with a large constant, but the storage would reduce to O(N).

OTHER TIPS

An order statistic tree seems to fit your need, at O(log N) time. Link

An order-statistics tree is an augmented (see AugmentedDataStructures) version of a BinarySearchTree that supports the additional operations Rank(x), which returns the rank of x (i.e., the number of elements with keys less than or equal to x), and FindByRank(k), which returns the k-th smallest element of the tree.
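A minimal sketch of that interface in C might look like the following. For brevity this uses a plain unbalanced BST with a subtree-size field, whereas a real order-statistic tree augments a balanced tree (e.g. red-black) to guarantee O(log N); all names here are my own:

```c
#include <stdlib.h>

typedef struct OSNode {
    int key;
    int size;                  /* number of nodes in this subtree */
    struct OSNode *left, *right;
} OSNode;

static int os_size(OSNode *n) { return n ? n->size : 0; }

/* Insert key (duplicates go right); returns the new subtree root. */
OSNode *os_insert(OSNode *n, int key) {
    if (!n) {
        n = calloc(1, sizeof *n);
        n->key = key;
        n->size = 1;
        return n;
    }
    if (key < n->key) n->left  = os_insert(n->left, key);
    else              n->right = os_insert(n->right, key);
    n->size = 1 + os_size(n->left) + os_size(n->right);
    return n;
}

/* Rank(x): number of keys <= x. */
int os_rank(OSNode *n, int x) {
    if (!n) return 0;
    if (x < n->key) return os_rank(n->left, x);
    return os_size(n->left) + 1 + os_rank(n->right, x);
}

/* FindByRank(k): the k-th smallest node (1-based); NULL if out of range. */
OSNode *os_select(OSNode *n, int k) {
    if (!n || k < 1 || k > n->size) return NULL;
    int r = os_size(n->left) + 1;     /* rank of this node within its subtree */
    if (k == r) return n;
    if (k < r)  return os_select(n->left, k);
    return os_select(n->right, k - r);
}
```

The minimum is `os_select(root, 1)`; with a balanced variant both operations, and the rank reported on insert, stay O(log N).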

If you only have tens of thousands of elements, the performance difference between O(log N) and O(1) asymptotic time complexity may not be as significant as you think. For example, with 100000 elements, the O(log N) method is only about 17 times slower.

log(100 000) / log(2) = 16.6096405

In this case the difference in constant factors (implementation, overheads) may be the real target of optimization. Fancy data structures usually have much higher overhead due to their inherent complexity (sometimes thousands of times slower), and their implementations tend to be less refined because they are less widely used.

You should benchmark (actually test) the different heap implementations to find the one with the best real-world performance.

You say you need an ordered data structure, which to me sounds like you need something that can yield all the elements contained in O(n) time.

But then you say you will only be accessing the top (least?) element, suggesting that you really just need something that can yield the minimum value, repeatedly - opening the door to something with a partial ordering.

Which is it?

If I understand your question correctly, I would recommend a dictionary whose keys are ranks and whose values are linked lists.

With the keys you get ranks, and with linked lists as the values you get O(1) insertion; removal is O(1) as well. You can implement a stack or queue with a linked list, which is what you want.

Or you can just use a doubly linked list, which guarantees O(1) insertion and removal. For ranking, you can embed that information within the nodes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow