How big is the performance gap between std::sort and std::stable_sort in practice?

https://stackoverflow.com/questions/810951

03-07-2019
|

Question

Both should run in O(n log n), but in general sort is faster than stable_sort. How big is the performance gap in practice? Do you have some experience about that?

I want to sort a very large number of structs that have a size of about 20 bytes. The stability of the result would be nice in my case, but it is not a must. At the moment the underlying container is a plain array, perhaps it could be changed to a std::deque later on.

Solution

There are good answers that compared the algorithms theoretically. I benchmarked std::sort and std::stable_sort with google/benchmark for curiosity's sake.

It is useful to point out ahead of time that;

Benchmark machine has 1 X 2500 MHz CPU and 1 GB RAM
Benchmark OS Arch Linux 2015.08 x86-64
Benchmark compiled with g++ 5.3.0 and clang++ 3.7.0 (-std=c++11, -O3 and -pthread)
BM_Base* benchmark tries to measure the time populating std::vector<>. That time should be subtracted from the sorting results for better comparison.

First benchmark sorts std::vector<int> with 512k size.

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:37:43
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean              24730499   24726189         28
BM_BaseInt/512k_stddev              293107     310668          0
...
BM_SortInt/512k_mean              70967679   70799990         10
BM_SortInt/512k_stddev             1300811    1301295          0
...
BM_StableSortInt/512k_mean        73487904   73481467          9
BM_StableSortInt/512k_stddev        979966     925172          0

[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:39:07
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseInt/512k_mean              26198558   26197526         27
BM_BaseInt/512k_stddev              320971     348314          0
...
BM_SortInt/512k_mean              70648019   70666660         10
BM_SortInt/512k_stddev             2030727    2033062          0
...
BM_StableSortInt/512k_mean        82004375   81999989          9
BM_StableSortInt/512k_stddev        197309     181453          0

Second benchmark sorts std::vector<S> with 512k size (sizeof(Struct S) = 20).

[ g++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:49:32
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean           26485063   26410254         26
BM_BaseStruct/512k_stddev           270355     128200          0
...
BM_SortStruct/512k_mean           81844178   81833325          8
BM_SortStruct/512k_stddev           240868     204088          0
...
BM_StableSortStruct/512k_mean    106945879  106857114          7
BM_StableSortStruct/512k_stddev   10446119   10341548          0

[ clang++ ]# benchmark_sorts --benchmark_repetitions=10
Run on (1 X 2500 MHz CPU )
2016-01-08 01:53:01
Benchmark                         Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------
...
BM_BaseStruct/512k_mean           27327329   27280000         25
BM_BaseStruct/512k_stddev           488318     333059          0 
...
BM_SortStruct/512k_mean           78611207   78407400          9
BM_SortStruct/512k_stddev           690207     372230          0 
...
BM_StableSortStruct/512k_mean    109477231  109333325          8
BM_StableSortStruct/512k_stddev   11697084   11506626          0

Anyone who likes to run the benchmark, here is the code,

#include <vector>
#include <random>
#include <algorithm>

#include "benchmark/benchmark_api.h"

#define SIZE 1024 << 9

static void BM_BaseInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }
  }
}
BENCHMARK(BM_BaseInt)->Arg(SIZE);

static void BM_SortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_SortInt)->Arg(SIZE);

static void BM_StableSortInt(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back(dist(mt));
    }

    std::stable_sort(v.begin(), v.end());
  }
}
BENCHMARK(BM_StableSortInt)->Arg(SIZE);


struct S {
  int key;
  int arr[4];
};

static void BM_BaseStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }
  }
}
BENCHMARK(BM_BaseStruct)->Arg(SIZE);

static void BM_SortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::sort(v.begin(), v.end(),
              [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_SortStruct)->Arg(SIZE);

static void BM_StableSortStruct(benchmark::State &state) {
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_int_distribution<int> dist;

  while (state.KeepRunning()) {
    std::vector<S> v;
    v.reserve(state.range_x());
    for (int i = 0; i < state.range_x(); i++) {
      v.push_back({dist(mt)});
    }

    std::stable_sort(v.begin(), v.end(),
                     [](const S &a, const S &b) { return a.key < b.key; });
  }
}
BENCHMARK(BM_StableSortStruct)->Arg(SIZE);


BENCHMARK_MAIN();

OTHER TIPS

std::stable_sort performs NlogN comparisons when sufficient memory is available. When insufficient memory is available, it degrades to N((logN)^2) comparisons. Therefore it is roughly of the same efficiency as std::sort (which performs O(NlogN) comparisons in both average and worst case) when memory is available.

For those interested, sort() uses an introsort (quicksort which switches to heapsort when the recursion reaches a certain depth) and stable_sort() uses a merge sort.

Big enough to warrant a separate function that does stable sort and not have std::sort() do it transparently.

Sometimes std::stable_sort() is needed because it maintains order of elements that are equal.

And the conventional advice is that, if maintaining order is not important, you should use std::sort() instead.

However, its context dependant. There is plenty of data that is best sorted with stable sort even if you don't need to maintain order:

Quicksort quickly becomes worst-case performance if the data has consistently poor pivot points.

The Burrows-Wheeler Transform is an algorithm used as part of data compression, e.g. bzip2. It requires sorting all the rotations of the text. For the majority of text data, merge sort (as often used by std::stable_sort()) is substantially faster than quicksort (as usually used by std::sort()).

bbb is a BWT implementation that notes the advantages of std::stable_sort() over sort() for this application.

How big is the performance gap in practice? Do you have some experience about that?

Yes, but it didn't go the way you would expect.

I took a C implementation of the Burrows-Wheeler Transform and C++-ified it. Turned out a lot slower than the C code (though the code was cleaner). So I put timing instrumentation in there and it appeared that the qsort was performing faster than the std::sort. This was running in VC6. It was then recompiled with stable_sort and the tests ran faster than the C version. Other optimisations managed to push the C++ version ~25% quicker than the C version. I think it was possible to eke more speed but the clarity of the code was disappearing.

If you are sorting a large number of structs, the IO speed of your memory/disk starts to become more important than the asymptotic running time. Furthermore, memory usage should also be taken into consideration.

I tried std::stable_sort on 2Gb of data (64B structs), not knowing that std::stable_sort creates an internal copy of the data. The swap trashing that followed almost locked up my pc.

Using the unstable std::sort reduces memory usage by a factor of 2, which is useful when sorting large arrays. I terminated the std::stable_sort, so I cannot determine how much slower it was. However, if stable sort is not required, then I think it is better to use the unstable std::sort.

Was looking for something similar - but was surprised no one talked about Auxiliary space.

As I believe - the implementation of both stable_sort and sort is supposed to guarantee O(NlogN) for all (Best, Average & Worst) cases.

However, the difference exists in the Auxiliary space used. stable_sort needs an Auxiliary space of O(N).

May be the difference in performance lies in acquiring that space. :)
Otherwise, theoretically - they are supposed to be same w.r.t performance.

sort should do what you need unless you need this -> stable_sort preserves the relative order of the elements with equivalent values.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow