One possible way to implement this data structure is to use a hybrid of an order statistic tree and a hash table.
An order statistic tree is a type of balanced binary search tree that, in addition to the normal binary search tree operations, supports two more operations:
- Rank(key), which returns the number of elements in the tree smaller than a given element, and
- Select(k), which returns the kth smallest element in the tree.
Order statistic trees can be built by augmenting a normal balanced binary search tree (say, a red/black tree or an AVL tree) with extra information that is preserved during rotations. In this way, all of the normal BST operations on an order statistic tree can be made to run in O(log n) time, with the extra operations also running in O(log n) time.
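To make the augmentation concrete, here's a minimal sketch in Python of a BST where each node carries its subtree size, which is all that rank and select need. This omits the rebalancing logic entirely (a real implementation would rebalance and fix up sizes during rotations), so it's an illustration of the size-augmentation idea, not a production order statistic tree.

```python
class Node:
    """BST node augmented with the size of its subtree."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree

def size(node):
    return node.size if node else 0

def insert(node, value):
    # Plain BST insert; a balanced version would also rotate here,
    # updating the size fields as part of each rotation.
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def rank(node, value):
    """Number of elements in the tree strictly smaller than value."""
    if node is None:
        return 0
    if value <= node.value:
        return rank(node.left, value)
    return size(node.left) + 1 + rank(node.right, value)

def select(node, k):
    """Return the kth smallest element (0-indexed)."""
    left = size(node.left)
    if k < left:
        return select(node.left, k)
    if k == left:
        return node.value
    return select(node.right, k - left - 1)
```

Because each recursive step only consults the cached subtree sizes, both rank and select touch one node per level, which is where the O(log n) bound comes from once the tree is balanced.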
Now, let's suppose that you were purely storing values, rather than key/value pairs. In this case, it would be very simple to implement the percentile lookups as follows. Store all of the values in the order statistic tree. To determine the percentile score for a given value, use the rank operation on the order statistic tree to look up what index that value appears at. This gives a number, in the range from 0 to n - 1 (where n is the number of elements in the tree), denoting the position of that score in the order statistic tree. You can then multiply that number by 99 / (n - 1), to get a percentile score for the value that runs in the range from 0 to 99, as required.
To determine the lowest value greater than some percentile, you can use the select operation as follows. Given a percentile between 0 and 99, multiply that percentile by (n - 1) / 99 to get a real number between 0 and n - 1, inclusive. Taking the ceiling of that number produces a natural number in the range 0 to n - 1, inclusive. The select operation on the order statistic tree can then be used to find the first value that is at or above the given percentile.
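The two conversions above are just inverse linear maps between ranks and percentiles; a small sketch of the arithmetic:

```python
import math

def rank_to_percentile(r, n):
    # Map a rank in [0, n - 1] to a percentile in [0, 99].
    return r * 99 / (n - 1)

def percentile_to_rank(p, n):
    # Map a percentile in [0, 99] to the first rank at or
    # above that percentile, via the ceiling.
    return math.ceil(p * (n - 1) / 99)
```

For example, with n = 100 elements, rank 0 maps to percentile 0 and rank 99 maps to percentile 99, and the ceiling in the reverse direction guarantees we land on the first element at or above the requested percentile.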
However, these operations assume that we have purely values in the data structure, not key/value pairs. To make this operation work for key/value pairs, we will augment our data structure as follows:
- Rather than just storing values, we will store key/value pairs in each node. The order statistic tree will sort the key/value pairs purely by their value, with the key carried around as satellite data.
- We will store a secondary hash table that maps keys to their associated values.
These two changes make it possible to implement the needed functionality for our data structure. To get the data structure to do percentile lookups by key, we first query the hash table with the given key to look up its associated value. We then do a percentile lookup on the value as done before. To get the data structure to tell us a key whose value is the first at or above a given percentile, we do a normal find-percentile operation on the order statistic tree as described above, then read off the key stored as satellite data in the node we find.
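To show how the two halves cooperate, here's a sketch of the hybrid structure in Python. For brevity it uses a sorted list with bisect in place of a real order statistic tree, so insert and delete are O(n) here rather than O(log n), but rank becomes a bisect and select becomes an index, which makes the flow of each operation easy to follow. The class and method names are my own, not from any library.

```python
import bisect
import math

class PercentileMap:
    """Hybrid sketch: a sorted list of (value, key) pairs stands in
    for the order statistic tree, plus a key -> value hash table."""

    def __init__(self):
        self.pairs = []   # (value, key) pairs kept sorted by value
        self.by_key = {}  # hash table mapping keys to values

    def insert(self, key, value):
        bisect.insort(self.pairs, (value, key))
        self.by_key[key] = value

    def delete(self, key):
        value = self.by_key.pop(key)
        self.pairs.remove((value, key))

    def percentile(self, key):
        value = self.by_key[key]                      # hash lookup
        r = bisect.bisect_left(self.pairs, (value,))  # rank of value
        return r * 99 / (len(self.pairs) - 1)

    def find_percentile(self, p):
        n = len(self.pairs)
        k = math.ceil(p * (n - 1) / 99)  # percentile -> rank
        return self.pairs[k][1]          # key is satellite data
```

Swapping the sorted list for the order statistic tree described above restores the O(log n) bounds: insert, delete, and the rank/select steps each become tree operations, while the hash table lookups stay O(1) expected.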
If we assume that the hash table uses chained hashing, then the time required for each operation is given below:
- Insert: O(log n) time to insert the value/key pair into the order statistic tree, plus O(1) amortized time to insert the key/value pair into the hash table. Total time is O(log n) amortized.
- Delete: O(log n) time to delete the value/key pair from the order statistic tree, plus O(1) amortized time to delete the key/value pair from the hash table. Total time is O(log n) amortized.
- Percentile: O(1) expected time to look up the value associated with the key, O(log n) time to do the rank operation, and O(1) extra time to map the rank to a percentile. Total time is O(log n) expected.
- Find-Percentile: O(1) time required to map the percentile to a rank, and O(log n) time required to do the select operation. Total time is O(log n) worst-case.
Hope this helps!