Question

I have an array of numbers that I want to sort and deduplicate. This answer suggests using set and sorted for that kind of operation. The order of the two operations shouldn't change the result, so I measured the computation time of both orderings.

from numpy.random import randint
from time import perf_counter

n = 1000000

A = randint(0, n, n)          # one million random integers in [0, n)

t1 = perf_counter()
B = sorted(set(A))            # deduplicate first, then sort
t2 = perf_counter()
C = set(sorted(A))            # sort first, then deduplicate
t3 = perf_counter()

print(t2 - t1, t3 - t2)

# 0.48011 1.339263

sorted(set(A)) is roughly three times faster than set(sorted(A)).

What makes the one faster than the other? Also, is there any faster way to do it?


Solution

This is because when you call:

set(sorted(A))

you are sorting the original full list and then filtering out the duplicate values. However, when you call:

sorted(set(A))

you are first shrinking the input by removing the duplicate values with set and then sorting the much smaller result, hence the shorter time.

Hope that makes sense.

For Example

>>> A = [1,2,3,1,2,3,1,2,3,1,2,3]
>>> A = sorted(A)      # sorts all 12 elements
>>> A
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

>>> set(A)             # then removes the duplicates
{1, 2, 3}

On the other hand:

>>> A = [3,2,1,3,2,1,3,2,1,3,2,1]
>>> A = set(A)         # removes the duplicates first, leaving only 3 values
>>> A
{1, 2, 3}

>>> sorted(A)          # then sorts just those 3 values
[1, 2, 3]

And as @Barmar said, sorting is much slower than removing duplicates, so there is a real time gain when you only have to sort the much smaller deduplicated collection (a quarter of the original length in the example above). You can see the asymmetry by timing the two steps in isolation, as in the sketch below.
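
Here is a minimal sketch (with an assumed setup of one million random integers, not the question's exact benchmark) that times the two steps separately with timeit; deduplicating with set is a single hashing pass, while sorting is comparison-based and dominates the total:

import timeit

setup = "import random; a = [random.randrange(1000000) for _ in range(1000000)]"

# Deduplication alone: one hashing pass over the list, roughly O(n)
print(timeit.timeit("set(a)", setup=setup, number=10))

# Sorting alone: comparison-based, roughly O(n log n)
print(timeit.timeit("sorted(a)", setup=setup, number=10))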

Time Benchmarking

a = [1,2,3] * 10000

set(sorted(a)) --> 0.1890001297
sorted(set(a)) --> 0.0079998970
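
For a reproducible version of that comparison, a sketch along these lines (using timeit; the absolute numbers will differ by machine and Python version) should show the same ordering, since sorted(set(a)) only ever sorts the three distinct values:

import timeit

setup = "a = [1, 2, 3] * 10000"

# Sort all 30000 elements, then deduplicate
print(timeit.timeit("set(sorted(a))", setup=setup, number=100))

# Deduplicate down to 3 values, then sort
print(timeit.timeit("sorted(set(a))", setup=setup, number=100))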