Question

I have a list of items; I want to sort them, but I want a small element of randomness, so that they are not strictly in order, only ordered on average.

How can I do this most efficiently?

I don't mind if the quality of the randomness is not especially good; it could, for example, simply be based on the chance ordering of the input, e.g. an early-terminated, incomplete sort.

The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so both the speed of the sort and the cost of calling random() need to be considered.

My current code does a std::sort (this being C++) and then a very short shuffle of just the early part of the array:

for(int i=0; i<3; i++) // I know I have more than 6 elements
    std::swap(order[i],order[i+rand()%3]);

No correct solution

OTHER TIPS

Use the first two passes of JSort: build the heap twice, but do not perform the insertion sort. If the element of randomness is not small enough, repeat.
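
A minimal C++ sketch of this, assuming std::make_heap can stand in for the two heap-building passes (the function name is illustrative; the insertion-sort pass that would normally finish JSort is omitted on purpose):

#include <algorithm>
#include <functional>
#include <vector>

// Roughly order a vector with the first two passes of JSort:
// a heap built from the front pulls small values toward the front,
// a heap built from the back pushes large values toward the back.
// The insertion-sort pass that would normally finish the job is omitted,
// so the result is only approximately sorted.
void rough_jsort(std::vector<int>& v)
{
    // Min-heap over the whole array: every parent <= its children.
    std::make_heap(v.begin(), v.end(), std::greater<int>());
    // Max-heap over the reversed array: the largest value ends up last.
    std::make_heap(v.rbegin(), v.rend());
}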


There is an approach that (unlike an incomplete JSort) allows finer control over the resulting randomness, and whose time complexity depends on that randomness (the more random the result is allowed to be, the lower the time complexity). Use heapsort with a soft heap. For a detailed description of the soft heap, see pdf 1 or pdf 2.

You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which one is less than the other, or whether they are equal (returning -1, 0 or 1). Then, in the predicate, introduce a rare (configurable) case where the answer is random, by using a random number:

pseudocode:

if random(1000) == 0 then
  return random(2)-1   <-- -1, 0 or 1, chosen at random

Here we have a 1/1000 chance of "scrambling" two elements, but that number really depends on the size of the container you are sorting.

Another refinement, in the 1-in-1000 case, would be to exclude the "right" answer, because returning it would not scramble the result!

Edit:

if random(100 * container_size) == 0 then <-- here I consider the container size
{
   if element_1 < element_2
      return random(1); <-- do not return the "correct" value of -1
   else if element_1 > element_2
      return random(1)-1; <-- do not return the "correct" value of 1
   else
      return random(1)==0 ? -1  : 1; <-- do not return 0
}

In my pseudocode: random(x) = y, where 0 <= y <= x.
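
Here is a rough C++ rendering of the edited pseudocode (names are illustrative, and std::rand() stands in for random()). Note that a comparator that sometimes lies does not satisfy the strict weak ordering that std::sort demands, so this sketch drives it with a plain insertion sort, which stays within bounds even when the comparator is inconsistent:

#include <cstdlib>
#include <utility>
#include <vector>

// qsort-style comparator: usually truthful, but roughly once every
// 100 * container_size calls it deliberately returns a wrong answer
// (the exact probability also depends on RAND_MAX).
int fuzzy_compare(int a, int b, std::size_t container_size)
{
    if (std::rand() % (100 * container_size) == 0) {
        if (a < b) return std::rand() % 2;          // never the "correct" -1
        if (a > b) return std::rand() % 2 - 1;      // never the "correct" 1
        return (std::rand() % 2 == 0) ? -1 : 1;     // never the "correct" 0
    }
    return (a < b) ? -1 : (a > b ? 1 : 0);
}

// Insertion sort driven by the fuzzy comparator: the occasional wrong
// answer leaves small pockets of disorder but never goes out of bounds.
void fuzzy_sort(std::vector<int>& v)
{
    for (std::size_t i = 1; i < v.size(); ++i)
        for (std::size_t j = i; j > 0 && fuzzy_compare(v[j - 1], v[j], v.size()) > 0; --j)
            std::swap(v[j - 1], v[j]);
}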

One possibility, which requires a bit more space but guarantees that existing sort algorithms can be used without modification, is to create a copy of the sort value(s), modify the copies in some fashion prior to sorting, and then sort on the modified value(s).

For example, if the data to be sorted is a simple character field Name[N], then add a field (assuming the data is in a structure or class) called NameMod[N]. Fill NameMod with a copy of Name, but add some randomization: 3% of the time (or some appropriate amount) change the first character of the name (e.g., shift it by +/- one or two characters), and 10% of the time change the second character by +/- a few characters.

Then run it through whatever sort algorithm you prefer. The benefit is that you can easily change those percentages and the randomness, and the sort algorithm will still work (it has no problems with a compare function returning inconsistent results, because the keys are fixed before the sort starts).
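
A small C++ sketch of what this could look like, assuming a hypothetical Record type in which Name is the real data and NameMod is the perturbed copy used only as the sort key (the 3% and 10% figures are the ones suggested above):

#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

struct Record {
    std::string Name;     // the real data
    std::string NameMod;  // perturbed copy, used only as the sort key
};

void nearly_sort_by_modified_key(std::vector<Record>& recs)
{
    for (Record& r : recs) {
        r.NameMod = r.Name;
        if (!r.NameMod.empty() && std::rand() % 100 < 3)     // ~3% of records
            r.NameMod[0] += std::rand() % 5 - 2;             // shift 1st char by -2..+2
        if (r.NameMod.size() > 1 && std::rand() % 100 < 10)  // ~10% of records
            r.NameMod[1] += std::rand() % 5 - 2;             // shift 2nd char by -2..+2
    }
    // The keys are fixed before sorting, so the comparator is perfectly
    // consistent and any standard sort algorithm can be used unchanged.
    std::sort(recs.begin(), recs.end(),
              [](const Record& a, const Record& b) { return a.NameMod < b.NameMod; });
}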

If you are sure that every element is at most k positions away from where it should be, you can reduce quicksort's N log(N) sorting time down to N log(k)....

Edit:

More specifically, you would create N/k buckets, each containing k consecutive elements.

You can quicksort each bucket, which takes k * log(k) time, and then order the N/k buckets among themselves, which takes N/k * log(N/k) time. Summing the per-bucket cost over all N/k buckets gives N * log(k), so overall you can sort in roughly N * log(max(N/k, k)).
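
A rough C++ sketch of the bucket pass, under the assumption that the blocks overlap: blocks of 2k elements advancing by k, which is one way to let an element cross into the neighbouring bucket (the answer itself does not fix the exact block layout):

#include <algorithm>
#include <vector>

// Sort an array in which every element is at most k positions from its
// final place: sort overlapping blocks of 2*k elements, advancing by k.
// Each block costs O(k log k) and there are about N/k blocks, so the
// total is O(N log k).
void sort_k_displaced(std::vector<int>& v, std::size_t k)
{
    if (k == 0 || v.size() < 2) return;
    for (std::size_t start = 0; start < v.size(); start += k) {
        std::size_t end = std::min(start + 2 * k, v.size());
        std::sort(v.begin() + start, v.begin() + end);  // one block
    }
}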

This can be useful because you can run sorting for each bucket in parallel, reducing total running time.

This works if you are sure that every element in the list is at most k indices away from its correct position after the sorting, but I do not think you meant to assume any such restriction.

Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing the merged elements. For the other merge iterations, do not compare the elements; instead select an element from the same part as in the previous step. It is not necessary to use an RNG to decide how to treat each element: just ignore the sorting order for every N-th element.
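
A minimal C++ sketch of such a merge, assuming the two halves are already sorted and an illustrative skip parameter that decides which merge steps ignore the comparison:

#include <cstddef>
#include <vector>

// Merge two sorted ranges, but on every skip-th step reuse the side chosen
// on the previous step instead of comparing, so the output is only roughly
// sorted and no RNG is needed inside the loop.
std::vector<int> sloppy_merge(const std::vector<int>& a,
                              const std::vector<int>& b, int skip)
{
    if (skip < 1) skip = 1;
    std::vector<int> out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    bool take_a = true;  // side taken on the previous step
    for (int step = 1; i < a.size() && j < b.size(); ++step) {
        if (step % skip != 0)            // ordinary step: compare as usual
            take_a = (a[i] <= b[j]);
        if (take_a) out.push_back(a[i++]);
        else        out.push_back(b[j++]);
    }
    // One side is exhausted; copy the rest unchanged.
    out.insert(out.end(), a.begin() + i, a.end());
    out.insert(out.end(), b.begin() + j, b.end());
    return out;
}

The caller would std::sort each half and then merge them with, say, skip = 10, so that every tenth element ignores the sorting order.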

Another variant of this approach nearly sorts an array almost in place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use a standard C++ algorithm with an appropriately modified iterator, like boost::permutation_iterator.) Reserve some limited space at the end of the array. Merge the parts, starting from the end. If the merged part is about to overwrite one of the not-yet-merged elements, just select that element; otherwise select the element in sorted order. The level of randomness is determined by the amount of reserved space.

Assuming you want the array sorted in ascending order, I would do the following:

for M iterations
  pick a random index i
  pick a random index k
  if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])

M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
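In C++ this loop might look like the following sketch (using <random>; the function name is illustrative):

#include <random>
#include <utility>
#include <vector>

// Perform M random "fixing" swaps: pick two indices and swap them only if
// their index order and value order disagree. More iterations -> more sorted.
void roughly_sort_by_swaps(std::vector<int>& a, std::size_t M, std::mt19937& rng)
{
    if (a.size() < 2) return;
    std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
    for (std::size_t m = 0; m < M; ++m) {
        std::size_t i = pick(rng);
        std::size_t k = pick(rng);
        if ((i < k) != (a[i] < a[k]))  // positions and values disagree
            std::swap(a[i], a[k]);
    }
}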

Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.

This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
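
A C++ sketch of one way to do this, with its own assumptions: the "map" is a sorted random sample, each element's estimated position comes from a binary search in that sample, and elements are scattered into buckets rather than swapped in place (the n/16 sample size is arbitrary, and std::sample needs C++17):

#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

std::vector<int> roughly_sort_by_sample(const std::vector<int>& v, std::mt19937& rng)
{
    if (v.size() < 2) return v;

    // Sort a small random sample; it acts as the "map" of the value range.
    std::size_t sample_size = std::max<std::size_t>(1, v.size() / 16);
    std::vector<int> sample;
    std::sample(v.begin(), v.end(), std::back_inserter(sample), sample_size, rng);
    std::sort(sample.begin(), sample.end());

    // Binary search in the sample estimates where each element belongs.
    std::vector<std::vector<int>> buckets(sample.size() + 1);
    for (int x : v) {
        std::size_t b = std::lower_bound(sample.begin(), sample.end(), x) - sample.begin();
        buckets[b].push_back(x);
    }

    // Concatenating the buckets gives a nearly sorted result; the order
    // inside a bucket is simply the chance order of the input.
    std::vector<int> out;
    out.reserve(v.size());
    for (const auto& bucket : buckets)
        out.insert(out.end(), bucket.begin(), bucket.end());
    return out;
}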

Bubblesort to the rescue!

For an unsorted array, you could pick a few random elements and bubble them up or down (perhaps by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you cannot be sure the whole array ends up sorted, because elements are moved and you cannot ensure that you touched every element exactly once.
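
A small C++ sketch of that idea, bubbling each randomly picked element toward the front until it meets a smaller neighbour (the number of passes and the function name are illustrative):

#include <random>
#include <utility>
#include <vector>

// Pick a random element `passes` times and bubble it toward the front
// while it is smaller than its left neighbour. Few passes give a rough
// order; even many passes do not guarantee a fully sorted array.
void bubble_random_elements(std::vector<int>& a, std::size_t passes, std::mt19937& rng)
{
    if (a.size() < 2) return;
    std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
    for (std::size_t p = 0; p < passes; ++p) {
        std::size_t i = pick(rng);
        while (i > 0 && a[i] < a[i - 1]) {
            std::swap(a[i], a[i - 1]);
            --i;
        }
    }
}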

BTW: this kind of problem tends to occur in game-playing engines, where the list of candidate moves is kept more-or-less sorted (because of weighted sampling), sorting after each iteration is too expensive, and only one or a few elements are expected to move.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow