Improving the Quick sort

https://stackoverflow.com/questions/1688264

18-09-2019
|

Question

If possible, how can I improve the following quick sort(performance wise). Any suggestions?

void main()
    {
      quick(a,0,n-1);
    }

    void quick(int a[],int lower,int upper)
    {
       int loc;
       if(lower<upper)
       {
        loc=partition(a,lower,upper);
        quick(a,lower,loc-1);
        quick(a,loc+1,upper);

       }
    }

    /* Return type: int
      Parameters passed: Unsorted array and its lower and upper bounds */

    int partition(int a[],int lower,int upper)
    {
      int pivot,i,j,temp;
      pivot=a[lower];
      i=lower+1;
      j=upper;
      while(i<j)
        {
            while((i<upper)&&(a[i]<=pivot))
            i++;
            while((a[j]>pivot))
            j--;
            if(i<j)
                {
                    temp=a[i];
                    a[i]=a[j];
                    a[j]=temp;
                }

        }//end while

        if(pivot>a[j])
        {
             temp=a[j];
             a[j]=a[lower];
             a[lower]=temp;
        }

         return(j);

}//end partition

Solution

The quicksort algorithm is easily parallelized. If you have multiple cores to work with, you could see quite a bit of speed up. Depending on how large your data set is, it could easily provide you with more speed up than any other optimization. However, if you have only one processor or a relatively small data set, there won't be much of a speed up.

I could elaborate more if this is a possibility.

OTHER TIPS

Choose a better pivot: eg. in median-of-three you pick 3 (random) elements and choose the pivot as the median element
When length(a[]) < M (in practice choose M = 9) stop sorting. After qsort() finished apply insert sort which would take roughly M * N = O(N). This avoids many function calls close to leaf of the divide-et-impera recursion tree.

The first suggestion would be: replace one of the recursive calls with iteration. And I mean real iteration, not a manually implemented stack for recursion. I.e. instead of making two "new" calls to quick from quick, "recycle" your current call to quick to process one branch of recursion, and call quick recursively to process another branch.

Now, if you make sure that you always make real recursive call for the shorter partition (and use iteration for the longer one), it will guarantee that the depth of recursion will never exceed log N even in the worst case, i.e. regardless of how well you choose your median.

This all is implemented in qsort algorithm that comes with GCC standard library. Take a look at the source, it should be useful.

Roughly, a more practical implementation that follows the above suggestion might look as follows

void quick(int a[], int lower, int upper)
{
  while (lower < upper)
  {
    int loc = partition(a, lower, upper);
    if (loc - lower < upper - loc)
    { /* Lower part is shorter... */
      quick(a, lower, loc - 1); /* ...process it recursively... */
      lower = loc + 1; /* ...and process the upper part on the next iteration */
    }
    else
    { /* Upper part is shorter... */
      quick(a, loc + 1, upper); /* ...process it recursively... */
      upper = loc - 1; /* ...and process the lower part on the next iteration */
    }
  }
}

This is just a sketch of the idea, of course. Not tested. Again, take a look at GCC implementation for the same idea. They also replace the remaining recursive call with "manual" recursion, but it is not really necessary.

Sorting small blocks (<5 elements) with a loopless algorithm may improve performance. I found only one example how to write a loopless sorting algorithm for 5 elements: http://wiki.tcl.tk/4118

The idea can be used to generate loopless sorting algorithms for 2,3,4,5 elements in C.

EDIT: i tried it on one set of data, and i measured 87% running time compared to the code in the question. Using insertion sort on <20 blocks resulted 92% on the same data set. This measurement is not representative but may show that this is a way how can You improve Your quicksort code.

EDIT: this example code write out loopless sorting functions for 2-6 elements:

#include <stdio.h>

#define OUT(i,fmt,...)  printf("%*.s"fmt"\n",i,"",##__VA_ARGS__);

#define IF( a,b, i, block1, block2 ) \
  OUT(i,"if(t[%i]>t[%i]){",a,b) block1 OUT(i,"}else{") block2 OUT(i,"}")

#define AB2(i,a,b)         IF(a,b,i,P2(i+2,b,a),P2(i+2,a,b))
#define  P2(i,a,b)         print_perm(i,2,(int[2]){a,b});

#define AB3(i,a,b,c)       IF(a,b,i,BC3(i+2,b,a,c),BC3(i+2,a,b,c))
#define AC3(i,a,b,c)       IF(a,c,i, P3(i+2,c,a,b), P3(i+2,a,c,b))
#define BC3(i,a,b,c)       IF(b,c,i,AC3(i+2,a,b,c), P3(i+2,a,b,c))
#define  P3(i,a,b,c)       print_perm(i,3,(int[3]){a,b,c});

#define AB4(i,a,b,c,d)     IF(a,b,i,CD4(i+2,b,a,c,d),CD4(i+2,a,b,c,d))
#define AC4(i,a,b,c,d)     IF(a,c,i, P4(i+2,c,a,b,d), P4(i+2,a,c,b,d))
#define BC4(i,a,b,c,d)     IF(b,c,i,AC4(i+2,a,b,c,d), P4(i+2,a,b,c,d))
#define BD4(i,a,b,c,d)     IF(b,d,i,BC4(i+2,c,d,a,b),BC4(i+2,a,b,c,d))
#define CD4(i,a,b,c,d)     IF(c,d,i,BD4(i+2,a,b,d,c),BD4(i+2,a,b,c,d))
#define  P4(i,a,b,c,d)     print_perm(i,4,(int[4]){a,b,c,d});

#define AB5(i,a,b,c,d,e)   IF(a,b,i,CD5(i+2,b,a,c,d,e),CD5(i+2,a,b,c,d,e))
#define AC5(i,a,b,c,d,e)   IF(a,c,i, P5(i+2,c,a,b,d,e), P5(i+2,a,c,b,d,e))
#define AE5(i,a,b,c,d,e)   IF(a,e,i,CB5(i+2,e,a,c,b,d),CB5(i+2,a,e,c,b,d))
#define BE5(i,a,b,c,d,e)   IF(b,e,i,AE5(i+2,a,b,c,d,e),DE5(i+2,a,b,c,d,e))
#define BD5(i,a,b,c,d,e)   IF(b,d,i,BE5(i+2,c,d,a,b,e),BE5(i+2,a,b,c,d,e))
#define CB5(i,a,b,c,d,e)   IF(c,b,i,DC5(i+2,a,b,c,d,e),AC5(i+2,a,b,c,d,e))
#define CD5(i,a,b,c,d,e)   IF(c,d,i,BD5(i+2,a,b,d,c,e),BD5(i+2,a,b,c,d,e))
#define DC5(i,a,b,c,d,e)   IF(d,c,i, P5(i+2,a,b,c,d,e), P5(i+2,a,b,d,c,e))
#define DE5(i,a,b,c,d,e)   IF(d,e,i,CB5(i+2,a,b,c,e,d),CB5(i+2,a,b,c,d,e))
#define  P5(i,a,b,c,d,e)   print_perm(i,5,(int[5]){a,b,c,d,e});

#define AB6(i,a,b,c,d,e,f) IF(a,b,i,CD6(i+2,b,a,c,d,e,f),CD6(i+2,a,b,c,d,e,f))
#define AC6(i,a,b,c,d,e,f) IF(a,c,i, P6(i+2,c,a,b,d,e,f), P6(i+2,a,c,b,d,e,f))
#define AE6(i,a,b,c,d,e,f) IF(a,e,i,CB6(i+2,e,a,c,b,d,f),CB6(i+2,a,e,c,b,d,f))
#define BD6(i,a,b,c,d,e,f) IF(b,d,i,DF6(i+2,c,d,a,b,e,f),DF6(i+2,a,b,c,d,e,f))
#define BE6(i,a,b,c,d,e,f) IF(b,e,i,AE6(i+2,a,b,c,d,e,f),DE6(i+2,a,b,c,d,e,f))
#define CB6(i,a,b,c,d,e,f) IF(c,b,i,DC6(i+2,a,b,c,d,e,f),AC6(i+2,a,b,c,d,e,f))
#define CD6(i,a,b,c,d,e,f) IF(c,d,i,EF6(i+2,a,b,d,c,e,f),EF6(i+2,a,b,c,d,e,f))
#define DB6(i,a,b,c,d,e,f) IF(d,b,i,BE6(i+2,a,b,c,d,e,f),BE6(i+2,c,d,a,b,e,f))
#define DC6(i,a,b,c,d,e,f) IF(d,c,i, P6(i+2,a,b,c,d,e,f), P6(i+2,a,b,d,c,e,f))
#define DE6(i,a,b,c,d,e,f) IF(d,e,i,CB6(i+2,a,b,c,e,d,f),CB6(i+2,a,b,c,d,e,f))
#define DF6(i,a,b,c,d,e,f) IF(d,f,i,DB6(i+2,a,b,e,f,c,d),BE6(i+2,a,b,c,d,e,f))
#define EF6(i,a,b,c,d,e,f) IF(e,f,i,BD6(i+2,a,b,c,d,f,e),BD6(i+2,a,b,c,d,e,f))
#define  P6(i,a,b,c,d,e,f) print_perm(i,6,(int[6]){a,b,c,d,e,f});

#define SORT(ABn,n,...) \
  OUT( 0, ""); \
  OUT( 0, "inline void sort" #n "( int t[" #n "] ){" ) \
  OUT( 2, "int tmp;" ) \
  ABn( 2, __VA_ARGS__ ) \
  OUT( 0, "}" )

void print_perm( int ind, int n, int t[n] ){
  printf("%*.s", ind-1, "");
  for( int i=0; i<n; i++ )
    if( i != t[i] ){
      printf(" tmp=t[%i]; t[%i]=",i,i);
      for( int j=t[i],k; j!=i; k=j,j=t[j],t[k]=k)
        printf("t[%i]; t[%i]=",j,j);
      printf("tmp;");
    }
  printf("\n");
}

int main( void ){
  SORT( AB2, 2, 0,1 )
  SORT( AB3, 3, 0,1,2 )
  SORT( AB4, 4, 0,1,2,3 )
  SORT( AB5, 5, 0,1,2,3,4 )
  SORT( AB6, 6, 0,1,2,3,4,5 )
}

The generated code 3718 lines long:

sort2():    8
sort3():   24
sort4():   96
sort5():  512
sort6(): 3072

Try another sort algorithm.

Depending on your data, you may be able to trade memory for speed.

According to Wikipedia

Quick sort has a best case O(n log n) performance and O(1) space
Merge sort has a fixed O(n log n) performance and O(n) space
Radix sort has a fixed O(n . <number of digits>) perfomance and O(n . <number of digits>) space

Edit

Apparently your data is integers. With 2.5M integers in the range [0, 0x0fffffff], my implementation of radix-sort is about 4 times as fast as your implementation of quick-sort.

$ ./a.out
qsort time: 0.39 secs
radix time: 0.09 secs
good: 2000; evil: 0

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 2560000
#define RANDOM_NUMBER (((rand() << 16) + rand()) & 0x0fffffff)

int partition(int a[], int lower, int upper) {
  int pivot, i, j, temp;
  pivot = a[lower];
  i = lower + 1;
  j = upper;
  while (i < j) {
    while((i < upper) && (a[i] <= pivot)) i++;
    while (a[j] > pivot) j--;
    if (i < j) {
      temp = a[i];
      a[i] = a[j];
      a[j] = temp;
    }
  }
  if (pivot > a[j]) {
    temp = a[j];
    a[j] = a[lower];
    a[lower] = temp;
  }
  return j;
}

void quick(int a[], int lower, int upper) {
  int loc;
  if (lower < upper) {
    loc = partition(a, lower, upper);
    quick(a, lower, loc-1);
    quick(a, loc+1, upper);
  }
}

#define NBUCKETS 256
#define BUCKET_SIZE (48 * (1 + ARRAY_SIZE / NBUCKETS))

/* "waste" memory */
int bucket[NBUCKETS][BUCKET_SIZE];

void radix(int *a, size_t siz) {
  unsigned shift = 0;
  for (int dummy=0; dummy<4; dummy++) {
    int bcount[NBUCKETS] = {0};
    int *aa = a;
    size_t s = siz;
    while (s--) {
      unsigned v = ((unsigned)*aa >> shift) & 0xff;
      if (bcount[v] == BUCKET_SIZE) {
        fprintf(stderr, "not enough memory.\n");
        fprintf(stderr, "v == %u; bcount[v] = %d.\n", v, bcount[v]);
        exit(EXIT_FAILURE);
      }
      bucket[v][bcount[v]++] = *aa++;
    }
    aa = a;
    for (int k=0; k<NBUCKETS; k++) {
      for (int j=0; j<bcount[k]; j++) *aa++ = bucket[k][j];
    }
    shift += 8;
  }
}

int ar1[ARRAY_SIZE];
int ar2[ARRAY_SIZE];

int main(void) {
  clock_t t0;

  srand(time(0));
  for (int k=0; k<ARRAY_SIZE; k++) ar1[k] = ar2[k] = RANDOM_NUMBER;

  t0 = clock(); while (clock() == t0) /* void */; t0 = clock();
  do {
    quick(ar1, 0, ARRAY_SIZE - 1);
  } while (0);
  double qsort_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

  t0 = clock(); while (clock() == t0) /* void */; t0 = clock();
  do {
    radix(ar2, ARRAY_SIZE);
  } while (0);
  double radix_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

  printf("qsort time: %.2f secs\n", qsort_time);
  printf("radix time: %.2f secs\n", radix_time);

  /* compare sorted arrays by sampling */
  int evil = 0;
  int good = 0;
  for (int chk=0; chk<2000; chk++) {
    size_t index = RANDOM_NUMBER % ARRAY_SIZE;
    if (ar1[index] == ar2[index]) good++;
    else evil++;
  }
  printf("good: %d; evil: %d\n", good, evil);

  return 0;
}

The wikipedia article on quicksort has a bunch of ideas.

You can a eliminate recuriosn overhead by using QuickSort with Explicit Stack

void quickSort(int a[], int lower, int upper)
{
    createStack(); 
    push(lower); 
    push(upper);

    while (!isEmptyStack()) {
    upper=poptop();
    lower=poptop();
       while (lower<upper) {
                 pivPos=partition(selectPivot(a, size), a, lower, upper);
                 push(lower);
                 push(pivPos-1);
                 lower = pivPos+1; // end = end;
       }
   }
}

You can use better pivot selection technique such as:
1. median of 3
2. median of medians
3. random pivot

Currently the most advanced quicksort widely used is implemented in Java's DualPivotQuicksort.java So you can simply follow that approach and you will see a nice performance improvement:

Use Insertion sort for small arrays (47 is the number used in Java)
Use that dual-pivot quicksort choosing the 2nd and 4th elements of 5 as the two pivots
Consider using mergesort for arrays with runs of sorted numbers

Or if you want to add a bit more complexity then code a 3-pivot quicksort.

If this is not just for learning use qsort from stdlib.h

Per your code, when sorted lenght is 10, the deepest stack looks like

#0  partition (a=0x7fff5ac42180, lower=3, upper=5) 
#1  0x000000000040062f in quick (a=0x7fff5ac42180, lower=3, upper=5) 
#2  0x0000000000400656 in quick (a=0x7fff5ac42180, lower=0, upper=5) 
#3  0x0000000000400644 in quick (a=0x7fff5ac42180, lower=0, upper=8) 
#4  0x0000000000400644 in quick (a=0x7fff5ac42180, lower=0, upper=9) 
#5  0x00000000004005c3 in main

Besides the algo itself, if we consider the system behavior such as stack activities as well, something else except normal call stack costs (push/pop) might downgrade the performance a lots, e.g. context switching if multi-task system, you know most OS are multi-task, and deeper the stack higher costs the switching, plus cache miss or cachline boundary across.

I believe in this case if the length become bigger, you'll lose when comparing to some other sorting algos.

for example, when length is up to 40, the stack may look like (may more entries than shown as below):

#0  partition (a=0x7fff24810cd0, lower=8, upper=9) 
#1  0x000000000040065d in quick (a=0x7fff24810cd0, lower=8, upper=9)  
#2  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=8, upper=10) 
#3  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=8, upper=14) 
#4  0x0000000000400684 in quick (a=0x7fff24810cd0, lower=4, upper=14) 
#5  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=4, upper=18) 
#6  0x0000000000400684 in quick (a=0x7fff24810cd0, lower=0, upper=18) 
#7  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=0, upper=26) 
#8  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=0, upper=38) 
#9  0x0000000000400672 in quick (a=0x7fff24810cd0, lower=0, upper=39) 
#10 0x00000000004005ee in main

too deeper the stack.

Function inlining is helpful too, but it increases code footprint of the final executable. I agree with @Swiss, parallel programming might help you so much.

completely dumb answer, but... compile your code in release mode and turn on optimizations !

The first thing to do is benchmark it. And benchmark it against the standard qsort, and against the c++ std::sort and std::stable_sort.

Your results, should your data-set be big enough, will show that the std::sort is superior to qsort in all cases except those where the std::stable_sort is superior. This is because the std::sort is templated, and therefore the comparision can be inlined.

Your code inlines the comparison because its not generic. If your code is slower than qsort, you have a problem right there.

The faster sort would be to sort parts in parallel, e.g. openmp, and then merge them back together.

Copied from my answer to answer question.

Edit: This post assumes you already do obvious things like take advantage of tail recursion to get rid of the unnecessary call overhead.

People like to criticize the quicksort for poor performance with certain inputs, especially when the user has control of the input. The following approach yields performance matching midpoint selection but expected complexity exponentially approaches O(n log n) as the list grows in size. In my experience it significantly outperforms best-of-3 selection due to much less overhead in the majority case. It should perform evenly with midpoint selection for "expected" inputs, but is not vulnerable to poor inputs.

Initialize the quicksort with a random positive integer I. That value will be used throughout the sorting process (don't have to generate multiple numbers).
Pivot is selected as I mod SectionSize.

For additional performance, you should always switch your quicksort to shell sort for "small" list segments - I've seen lengths from 15-100 chosen as the cutoff.

multithreading ?

/*
 * multiple-thread quick-sort.
 * Works fine on uniprocessor machines as well.
 */

#include <unistd.h>
#include <stdlib.h>
#include <thread.h>

/* don't create more threads for less than this */
#define SLICE_THRESH   4096

/* how many threads per lwp */
#define THR_PER_LWP       4

/* cast the void to a one byte quanitity and compute the offset */
#define SUB(a, n)      ((void *) (((unsigned char *) (a)) + ((n) * width)))

typedef struct {
  void    *sa_base;
  int      sa_nel;
  size_t   sa_width;
  int    (*sa_compar)(const void *, const void *);
} sort_args_t;

/* for all instances of quicksort */
static int threads_avail;

#define SWAP(a, i, j, width)
{ 
  int n; 
  unsigned char uc; 
  unsigned short us; 
  unsigned long ul; 
  unsigned long long ull; 

  if (SUB(a, i) == pivot) 
    pivot = SUB(a, j); 
  else if (SUB(a, j) == pivot) 
    pivot = SUB(a, i); 

  /* one of the more convoluted swaps I've done */ 
  switch(width) { 
  case 1: 
    uc = *((unsigned char *) SUB(a, i)); 
    *((unsigned char *) SUB(a, i)) = *((unsigned char *) SUB(a, j)); 
    *((unsigned char *) SUB(a, j)) = uc; 
    break; 
  case 2: 
    us = *((unsigned short *) SUB(a, i)); 
    *((unsigned short *) SUB(a, i)) = *((unsigned short *) SUB(a, j)); 
    *((unsigned short *) SUB(a, j)) = us; 
    break; 
  case 4: 
    ul = *((unsigned long *) SUB(a, i)); 
    *((unsigned long *) SUB(a, i)) = *((unsigned long *) SUB(a, j)); 
    *((unsigned long *) SUB(a, j)) = ul; 
    break; 
  case 8: 
    ull = *((unsigned long long *) SUB(a, i)); 
    *((unsigned long long *) SUB(a,i)) = *((unsigned long long *) SUB(a,j)); 
    *((unsigned long long *) SUB(a, j)) = ull; 
    break; 
  default: 
    for(n=0; n<width; n++) { 
      uc = ((unsigned char *) SUB(a, i))[n]; 
      ((unsigned char *) SUB(a, i))[n] = ((unsigned char *) SUB(a, j))[n]; 
      ((unsigned char *) SUB(a, j))[n] = uc; 
    } 
    break; 
  } 
}

static void *
_quicksort(void *arg)
{
  sort_args_t *sargs = (sort_args_t *) arg;
  register void *a = sargs->sa_base;
  int n = sargs->sa_nel;
  int width = sargs->sa_width;
  int (*compar)(const void *, const void *) = sargs->sa_compar;
  register int i;
  register int j;
  int z;
  int thread_count = 0;
  void *t;
  void *b[3];
  void *pivot = 0;
  sort_args_t sort_args[2];
  thread_t tid;

  /* find the pivot point */
  switch(n) {
  case 0:
  case 1:
    return 0;
  case 2:
    if ((*compar)(SUB(a, 0), SUB(a, 1)) > 0) {
      SWAP(a, 0, 1, width);
    }
    return 0;
  case 3:
    /* three sort */
    if ((*compar)(SUB(a, 0), SUB(a, 1)) > 0) {
      SWAP(a, 0, 1, width);
    }
    /* the first two are now ordered, now order the second two */
    if ((*compar)(SUB(a, 2), SUB(a, 1)) < 0) {
      SWAP(a, 2, 1, width);
    }
    /* should the second be moved to the first? */
    if ((*compar)(SUB(a, 1), SUB(a, 0)) < 0) {
      SWAP(a, 1, 0, width);
    }
    return 0;
  default:
    if (n > 3) {
      b[0] = SUB(a, 0);
      b[1] = SUB(a, n / 2);
      b[2] = SUB(a, n - 1);
      /* three sort */
      if ((*compar)(b[0], b[1]) > 0) {
        t = b[0];
        b[0] = b[1];
        b[1] = t;
      }
      /* the first two are now ordered, now order the second two */
      if ((*compar)(b[2], b[1]) < 0) {
        t = b[1];
        b[1] = b[2];
        b[2] = t;
      }
      /* should the second be moved to the first? */
      if ((*compar)(b[1], b[0]) < 0) {
        t = b[0];
        b[0] = b[1];
        b[1] = t;
      }
      if ((*compar)(b[0], b[2]) != 0)
        if ((*compar)(b[0], b[1]) < 0)
          pivot = b[1];
        else
          pivot = b[2];
    }
    break;
  }
  if (pivot == 0)
    for(i=1; i<n; i++) {
      if (z = (*compar)(SUB(a, 0), SUB(a, i))) {
        pivot = (z > 0) ? SUB(a, 0) : SUB(a, i);
        break;
      }
    }
  if (pivot == 0)
    return;

  /* sort */
  i = 0;
  j = n - 1;
  while(i <= j) {
    while((*compar)(SUB(a, i), pivot) < 0)
      ++i;
    while((*compar)(SUB(a, j), pivot) >= 0)
      --j;
    if (i < j) {
      SWAP(a, i, j, width);
      ++i;
      --j;
    }
  }

  /* sort the sides judiciously */
  switch(i) {
  case 0:
  case 1:
    break;
  case 2:
    if ((*compar)(SUB(a, 0), SUB(a, 1)) > 0) {
      SWAP(a, 0, 1, width);
    }
    break;
  case 3:
    /* three sort */
    if ((*compar)(SUB(a, 0), SUB(a, 1)) > 0) {
      SWAP(a, 0, 1, width);
    }
    /* the first two are now ordered, now order the second two */
    if ((*compar)(SUB(a, 2), SUB(a, 1)) < 0) {
      SWAP(a, 2, 1, width);
    }
    /* should the second be moved to the first? */
    if ((*compar)(SUB(a, 1), SUB(a, 0)) < 0) {
      SWAP(a, 1, 0, width);
    }
    break;
  default:
    sort_args[0].sa_base          = a;
    sort_args[0].sa_nel           = i;
    sort_args[0].sa_width         = width;
    sort_args[0].sa_compar        = compar;
    if ((threads_avail > 0) && (i > SLICE_THRESH)) {
      threads_avail--;
      thr_create(0, 0, _quicksort, &sort_args[0], 0, &tid);
      thread_count = 1;
    } else
      _quicksort(&sort_args[0]);
    break;
  }
  j = n - i;
  switch(j) {
  case 1:
    break;
  case 2:
    if ((*compar)(SUB(a, i), SUB(a, i + 1)) > 0) {
      SWAP(a, i, i + 1, width);
    }
    break;
  case 3:
    /* three sort */
    if ((*compar)(SUB(a, i), SUB(a, i + 1)) > 0) {
      SWAP(a, i, i + 1, width);
    }
    /* the first two are now ordered, now order the second two */
    if ((*compar)(SUB(a, i + 2), SUB(a, i + 1)) < 0) {
      SWAP(a, i + 2, i + 1, width);
    }
    /* should the second be moved to the first? */
    if ((*compar)(SUB(a, i + 1), SUB(a, i)) < 0) {
      SWAP(a, i + 1, i, width);
    }
    break;
  default:
    sort_args[1].sa_base          = SUB(a, i);
    sort_args[1].sa_nel           = j;
    sort_args[1].sa_width         = width;
    sort_args[1].sa_compar        = compar;
    if ((thread_count == 0) && (threads_avail > 0) && (i > SLICE_THRESH)) {
      threads_avail--;
      thr_create(0, 0, _quicksort, &sort_args[1], 0, &tid);
      thread_count = 1;
    } else
      _quicksort(&sort_args[1]);
    break;
  }
  if (thread_count) {
    thr_join(tid, 0, 0);
    threads_avail++;
  }
  return 0;
}

void
quicksort(void *a, size_t n, size_t width,
          int (*compar)(const void *, const void *))
{
  static int ncpus = -1;
  sort_args_t sort_args;

  if (ncpus == -1) {
    ncpus = sysconf( _SC_NPROCESSORS_ONLN);

    /* lwp for each cpu */
    if ((ncpus > 1) && (thr_getconcurrency() < ncpus))
      thr_setconcurrency(ncpus);

    /* thread count not to exceed THR_PER_LWP per lwp */
    threads_avail = (ncpus == 1) ? 0 : (ncpus * THR_PER_LWP);
  }
  sort_args.sa_base = a;
  sort_args.sa_nel = n;
  sort_args.sa_width = width;
  sort_args.sa_compar = compar;
  (void) _quicksort(&sort_args);
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow