Question

Suppose that you have a large collection of key/value pairs, where the value is some arbitrary real number. You're interested in creating a data structure supporting the following operations:

  • Insert, which adds a new key/value pair to the collection,
  • Delete, which removes a key/value pair from the collection,
  • Percentile, which tells which percentile the value associated with a given key is in, and
  • Tell-Percentile, which accepts a percentile number and returns the key whose value is the lowest value at or above the given percentile.
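
In Java, for instance, the desired interface might be sketched as follows (the names are illustrative, not from any existing library):

    // Illustrative sketch of the desired API (names are hypothetical).
    interface PercentileMap<K> {
        void insert(K key, double value);   // add a key/value pair
        void delete(K key);                 // remove the pair for this key
        int percentile(K key);              // percentile (0-99) of the key's value
        K tellPercentile(int percentile);   // key with lowest value at or above the percentile
    }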

This data structure could be used, for example, to efficiently determine what percentile a given student is in when receiving a stream of nationwide test scores, or to identify hospitals that have unusually good or bad quality of service.

Is there a way to make these operations run efficiently (say, in sublinear time)?


Solution

One possible way to implement this data structure is to use a hybrid of an order statistic tree and a hash table.

An order statistic tree is a type of balanced binary search tree that, in addition to the normal binary search tree operations, supports two more operations:

  • Rank(key), which returns the number of elements in the tree smaller than a given element, and
  • Select(k), which returns the kth smallest element in the tree.

Order statistic trees can be built by augmenting a normal balanced binary search tree (say, a red/black tree or an AVL tree) with extra information that is preserved during rotations. In this way, all of the normal BST operations on an order statistic tree can be made to run in O(log n) time, with the extra operations also running in O(log n) time.
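
As a rough sketch of that augmentation (assuming double values; the rebalancing machinery of a red/black or AVL tree, which must keep the size fields up to date during rotations, is omitted), the size-augmented node and the two extra operations might look like this:

    // Minimal sketch of the order statistic idea: every node stores the
    // size of its own subtree. A real implementation must also update
    // these size fields during rebalancing rotations (omitted here).
    class OrderStatisticNode {
        double value;
        int size = 1;                 // nodes in the subtree rooted here
        OrderStatisticNode left, right;

        static int size(OrderStatisticNode n) { return n == null ? 0 : n.size; }

        // Rank: how many values in the tree are strictly smaller than v.
        static int rank(OrderStatisticNode n, double v) {
            if (n == null) return 0;
            if (v <= n.value) return rank(n.left, v);
            return size(n.left) + 1 + rank(n.right, v);
        }

        // Select: the k-th smallest value, with k in [0, size).
        static double select(OrderStatisticNode n, int k) {
            int leftSize = size(n.left);
            if (k < leftSize) return select(n.left, k);
            if (k == leftSize) return n.value;
            return select(n.right, k - leftSize - 1);
        }
    }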

Now, let's suppose that you were purely storing values, rather than key/value pairs. In this case, it would be very simple to implement the percentile lookups as follows. Store all of the values in the order statistic tree. To determine the percentile score for a given value, use the rank operation on the order statistic tree to look up what index that value appears at. This gives a number in the range from 0 to n - 1 (where n is the number of elements in the tree) denoting the position of that value in the order statistic tree. You can then multiply that number by 99 / (n - 1) to get a percentile score for the value that runs in the range from 0 to 99, as required.

To determine the lowest value at or above some percentile, you can use the select operation as follows. Given a percentile between 0 and 99, multiply that percentile by (n - 1) / 99 to get a real number between 0 and n - 1, inclusive. Taking the ceiling of that number produces a natural number in the range 0 to n - 1, inclusive. The select operation on the order statistic tree can then be used to find the first value at or above the given percentile.
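
Concretely, assuming integer percentiles from 0 to 99 and at least two elements, the two conversions might be sketched like this:

    // Map a rank in [0, n-1] to an integer percentile in [0, 99].
    // Assumes n > 1; integer division rounds down.
    static int rankToPercentile(int rank, int n) {
        return rank * 99 / (n - 1);        // 0 maps to 0, n-1 maps to 99
    }

    // Inverse scaling: map a percentile in [0, 99] back to a rank in
    // [0, n-1], rounded up so we land at or above the requested percentile.
    static int percentileToRank(int percentile, int n) {
        return (int) Math.ceil(percentile * (n - 1) / 99.0);
    }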

However, these operations assume that we have purely values in the data structure, not key/value pairs. To make this operation work for key/value pairs, we will augment our data structure as follows:

  1. Rather than just storing values, we will store key/value pairs in each node. The order statistic tree will sort the key/value pairs purely by their value, with the key carried around as satellite data.
  2. We will store a secondary hash table that maps keys to their associated values.

These two changes make it possible to implement the needed functionality for our data structure. To get the data structure to do percentile lookups by key, we first query the hash table with the given key to look up its associated value. We then do a percentile lookup on that value as before. To get the data structure to tell us the key whose value is the first at or above a given percentile, we do a normal find-percentile operation on the order statistic tree as described above, then read off the key stored as satellite data alongside the resulting value.
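
Putting the pieces together, a minimal sketch of the hybrid might look like the following. OrderStatisticTree here is a hypothetical interface standing in for the augmented tree described above; the JDK provides no such class, so that half is an assumption:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical interface for the order statistic tree half of the
    // structure; the JDK offers no such class.
    interface OrderStatisticTree<K> {
        void insert(double value, K key);   // key carried as satellite data
        void delete(double value, K key);
        int rank(double value);             // # of values strictly smaller
        K selectKey(int rank);              // key of the rank-th smallest value
    }

    // Sketch of the hybrid: a hash table for key -> value lookups plus
    // an order statistic tree sorted by value. Assumes n > 1 for the
    // percentile scaling.
    class PercentileTracker<K> {
        private final Map<K, Double> byKey = new HashMap<>();
        private final OrderStatisticTree<K> byValue;

        PercentileTracker(OrderStatisticTree<K> tree) { this.byValue = tree; }

        void insert(K key, double value) {
            byKey.put(key, value);              // O(1) amortized
            byValue.insert(value, key);         // O(log n)
        }

        void delete(K key) {
            Double value = byKey.remove(key);   // O(1) amortized
            if (value != null) byValue.delete(value, key);  // O(log n)
        }

        int percentile(K key) {
            int rank = byValue.rank(byKey.get(key));   // O(1) + O(log n)
            return rank * 99 / (byKey.size() - 1);     // scale rank to 0..99
        }

        K tellPercentile(int percentile) {
            int rank = (int) Math.ceil(percentile * (byKey.size() - 1) / 99.0);
            return byValue.selectKey(rank);            // O(log n)
        }
    }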

If we assume that the hash table uses chained hashing, then the time required for each operation is given below:

  • Insert: O(log n) time to insert the value/key pair into the order statistic tree, plus O(1) amortized time to insert the key/value pair into the hash table. Total time is O(log n) amortized.
  • Delete: O(log n) time to delete the value/key pair from the order statistic tree, plus O(1) amortized time to delete the key/value pair from the hash table. Total time is O(log n) amortized.
  • Percentile: O(1) expected time to look up the value associated with the key, O(log n) time to do the rank operation, and O(1) extra time to map the rank to a percentile. Total time is O(log n) expected.
  • Find-Percentile: O(1) time required to map the percentile to a rank, and O(log n) time required to do the select operation. Total time is O(log n) worst-case.

Hope this helps!

Other suggestions

There is a simple and highly efficient possibility:

If you can live with computing percentiles only once the student structure has been fully built up, then:

Use an ArrayList to build up the collection dynamically when you don't know the number of elements in advance.
If you do know it, start with a plain array directly; otherwise, create the array from the dynamic list (e.g., ArrayList in Java).

insert: not necessary; replaced by appending at the end and sorting once.
delete: not necessary, if you can live with that.
tell-percentile: even simpler, something very close to element[length * percentile]: O(1) (see the sketch below).
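
A minimal sketch of this array approach, assuming double scores and the build-once, sort-once discipline described above (the binary-search lookup mentioned further below is included as well):

    import java.util.Arrays;

    // Sketch of the build-once sorted-array approach: collect all
    // scores, sort once, then answer percentile queries by indexing.
    class PercentileArray {
        private final double[] scores;

        PercentileArray(double[] raw) {
            scores = raw.clone();
            Arrays.sort(scores);    // one O(n log n) sort after build-up
        }

        // Value at the given percentile (0-99): direct index, O(1).
        double tellPercentile(int percentile) {
            return scores[scores.length * percentile / 100];
        }

        // Percentile of a given value via binary search, O(log n).
        int percentileOf(double value) {
            int i = Arrays.binarySearch(scores, value);
            if (i < 0) i = -i - 1;                  // insertion point if absent
            return Math.min(99, i * 99 / (scores.length - 1));
        }
    }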

In practice, the array approach will be much faster than the balanced tree approach, at least in Java, when your application can build up the array once (e.g., a daily student evaluation: rebuild it daily).

I have implemented the above algorithm (mine) using a self-written ArrayListInt, which does the same as ArrayList but uses primitive types (double, int) instead of object types. I sorted it once, when all the data had been read.

You also wanted key/value lookups:
I would just add a TreeMap (a balanced tree). It is a bit doubtful whether the TreeMap plus the additional percentile array makes sense: that depends on how often you have to search, and on memory usage versus search time.

Update:

Results: TreeSet vs. sorted array (build up the array dynamically, then sort once at the end):

num elements: 1000        treeSet: 4.55989        array: 0.564159
num elements: 10000       treeSet: 2.662496       array: 1.157591
num elements: 100000      treeSet: 31.642027      array: 12.224639
num elements: 1000000     treeSet: 1319.283703    array: 140.293312
num elements: 10000000    treeSet: 21212.307545   array: 3222.844045

This number of elements (1e7) is now near the limit (1 GB of heap space); at the next step, memory would run out (it already happened at 1e7, but after cleaning up memory from the TreeSet run, the 1e7 measurement worked, too).

What is missing are the search times, but a sorted array with binary search is only beatable by a hash table.

Finally: if you can build up the student set once, e.g., daily, then the array approach gives a much simpler percentile search.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow