Question

I was curious to know how to select a sorting algorithm based on the input, so that I can get the best efficiency.

Should the choice depend on the size of the input, on how the input is arranged (ascending/descending), on the data structure used, or on something else?

Solution

The things that matter for algorithms in general, and for sorting algorithms in particular, are as follows:

(*) Correctness - This is the most important thing. An algorithm that is super fast and efficient but wrong is worth nothing. In sorting, even if you have 2 candidates that both sort correctly, if you need a stable sort you will choose the stable algorithm, even if it is less efficient, because it is correct for your purpose and the other is not.
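To make the stability requirement concrete, here is a small sketch in Python. The `employees` records are made up for illustration; the only library fact it relies on is that Python's built-in `sorted()` is documented as stable.

```python
from operator import itemgetter

# Hypothetical records: (name, department), already ordered by name.
employees = [
    ("Alice", "Sales"),
    ("Bob", "Engineering"),
    ("Carol", "Sales"),
    ("Dave", "Engineering"),
]

# sorted() is stable: records with equal keys keep their input order,
# so names stay alphabetical inside each department.
by_department = sorted(employees, key=itemgetter(1))

for name, department in by_department:
    print(department, name)
```

With an unstable algorithm, Alice and Carol (or Bob and Dave) could come out in either order, which would be wrong if the caller depends on the earlier ordering by name.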

Next come the trade-offs between running time, required space, and implementation time. (If you need to implement something from scratch rather than use a library, for a minor performance gain, it is probably not worth it.)

Some things to take into consideration when thinking about the trade-offs mentioned above:

  1. Size of the input (for example: for small inputs, insertion sort is empirically faster than more advanced algorithms, even though it takes O(n^2) time).
  2. Location of the input (algorithms for sorting on disk differ from algorithms for sorting in RAM, because disk reads are much less efficient when they are not sequential. The algorithm usually used to sort on disk is a variation of merge sort).
  3. How is the data distributed? If the data is likely to be "almost sorted", even the usually terrible bubble sort may sort it in just 2-3 passes and be very fast compared to other algorithms.
  4. What libraries do you already have available? How much work will it take to implement something new? Will it be worth it?
  5. Type (and range) of the input - for enumerable data (integers, for example), an algorithm designed for integers (like radix sort) might be more efficient than a general-purpose algorithm.
  6. Latency requirements - if you are designing a missile head and the result must be returned within a specific amount of time, quicksort, which can degrade to quadratic running time in the worst case, might not be a good choice. You might want an algorithm with a guaranteed O(n log n) worst case instead (such as merge sort or heapsort).
  7. Your hardware - if, for example, you have a huge cluster and a huge data set, a distributed sorting algorithm will probably be better than trying to do all the work on one machine.
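As a rough illustration of point 3, the sketch below counts how many passes an early-exit bubble sort needs. The function name `bubble_sort_passes` is just an illustrative choice, and the example is not a recommendation to use bubble sort in general. Note that the favorable case here assumes the out-of-place elements are close to their final positions; a single small element stranded near the end of the list still costs many passes.

```python
def bubble_sort_passes(items):
    """Sort a copy of items with early-exit bubble sort.

    Returns (sorted_list, number_of_passes).
    """
    data = list(items)
    n = len(data)
    passes = 0
    while True:
        passes += 1
        swapped = False
        for i in range(n - 1):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        n -= 1  # the largest remaining element has bubbled into place
        if not swapped:
            break  # a pass without swaps means the data is sorted
    return data, passes


# Almost sorted: two adjacent pairs are out of order.
nearly_sorted = list(range(1000))
nearly_sorted[10], nearly_sorted[11] = nearly_sorted[11], nearly_sorted[10]
nearly_sorted[500], nearly_sorted[501] = nearly_sorted[501], nearly_sorted[500]

reversed_input = list(range(1000, 0, -1))

_, passes_nearly = bubble_sort_passes(nearly_sorted)
_, passes_reversed = bubble_sort_passes(reversed_input)
print(passes_nearly, passes_reversed)
```

The nearly sorted input finishes after one fixing pass plus one confirming pass, while the reversed input needs on the order of n passes.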

Other tips

It should be based on all those things.

  • You need to take the size of your data into account, since insertion sort can be faster than quicksort for small data sets, etc.

  • You need to know the arrangement of your data, because each algorithm has different worst/average/best-case asymptotic runtimes (for some the worst and average cases are the same, whereas others have a worst case that is significantly worse than the average).

  • And you obviously need to know the data structure used, since there are some very specialized sorting algorithms if your data is already in a special format, or if you can efficiently put it into a new data structure that will do the sorting for you automatically (a la BSTs or heaps).
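As a minimal sketch of the last point, letting a data structure do the sorting, the following uses a binary heap from Python's standard `heapq` module. The `heap_sort` helper is an illustrative name, not something from the answers above.

```python
import heapq


def heap_sort(items):
    """Return a sorted list by pushing everything through a binary heap."""
    heap = list(items)
    heapq.heapify(heap)  # builds the heap in O(n)
    # Each pop removes the current minimum in O(log n).
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(heap_sort([5, 1, 4, 1, 3]))
```

If only the smallest few elements are needed, the heap can be popped a few times instead of being drained completely, which avoids the cost of a full sort.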

The 2 main things that determine your choice of a sorting algorithm are time complexity and space complexity. Depending on your scenario, and the resources (time and memory) available to you, you might need to choose between sorting algorithms, based on what each sorting algorithm has to offer.

The actual performance of a sorting algorithm also depends on the input data, and it helps to know certain characteristics of that data beforehand, such as the size of the input and how sorted the array already is.

For example, if you know beforehand that the input consists only of non-negative integers from a small, bounded range (say, values below 1000), you can use counting sort to sort such an array in linear time.
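A minimal counting sort sketch for that situation is shown below. The function name and the `max_value` parameter are illustrative assumptions; the technique only pays off when `max_value` is small relative to the number of elements, since it runs in O(n + k) time and uses O(k) extra space for k = `max_value + 1`.

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in the range 0..max_value (inclusive)."""
    counts = [0] * (max_value + 1)
    for v in values:
        if not 0 <= v <= max_value:
            raise ValueError(f"value out of range: {v}")
        counts[v] += 1  # tally how often each value occurs

    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)  # emit each value count times
    return result


print(counting_sort([3, 0, 999, 3, 42], max_value=999))
```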

The choice of a sorting algorithm depends on the constraints of space and time, and also the size/characteristics of the input data.

At a very high level, you need to consider the ratio of insertions to comparisons for each algorithm.

For integers in a file, this isn't going to be hugely relevant, but if, say, you're sorting files based on their contents, each comparison is expensive and you'll naturally want to do as few comparisons as possible.
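One way to see how many comparisons a sort actually performs is to wrap the comparison in a counter. The sketch below does this for Python's built-in `sorted()` via `functools.cmp_to_key`. The helper name `count_comparisons` is an illustrative assumption, and the exact counts depend on the Python implementation and version, so none are shown here.

```python
import random
from functools import cmp_to_key


def count_comparisons(items):
    """Sort items and report how many comparisons sorted() performed."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)  # negative, zero, or positive

    result = sorted(items, key=cmp_to_key(compare))
    return result, calls


ordered = list(range(200))
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed for repeatability

_, calls_ordered = count_comparisons(ordered)
_, calls_shuffled = count_comparisons(shuffled)
print(calls_ordered, calls_shuffled)
```

When each comparison is costly, as with file contents, it can also help to compute a cheap sort key once per element (for example through the `key=` argument) rather than re-reading the data on every comparison.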

Licensed under: CC-BY-SA with attribution