Question

I have an array of numbers that I want to sort and deduplicate. This answer suggests using set and sorted for that kind of operation. The order of the two operations shouldn't change the result, so I measured the computation time of both.

from numpy.random import randint
from time import perf_counter  # time.clock was removed in Python 3.8

n = 1000000

A = randint(0, n, n)

t1 = perf_counter()
B = sorted(set(A))
t2 = perf_counter()
C = set(sorted(A))
t3 = perf_counter()

print(t2 - t1, t3 - t2)

>>> 0.48011 1.339263

sorted(set(A)) is roughly three times faster than set(sorted(A)).

What makes the one faster than the other? Also, is there any faster way to do it?


The solution

This is because when you call:

set(sorted(A))

you are sorting the original full list and then filtering out the duplicate values. However, when you call:

sorted(set(A))

you first shorten the list by removing the duplicate values with set, and then sort the much smaller collection, hence the shorter time.

Hope that makes sense.
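The effect is easy to measure directly with timeit on a list with many duplicates (a sketch; the exact numbers will vary by machine):

```python
from timeit import timeit
from random import randrange

# A list with heavy duplication: 1,000,000 draws from only 1,000 distinct values
data = [randrange(1000) for _ in range(1_000_000)]

t_set_first = timeit(lambda: sorted(set(data)), number=5)
t_sort_first = timeit(lambda: set(sorted(data)), number=5)

# set(data) has at most 1,000 elements, so the sort in sorted(set(data))
# runs on a tiny input; set(sorted(data)) must sort all 1,000,000 items first.
print(t_set_first, t_sort_first)
```

The heavier the duplication, the bigger the gap, because deduplicating first shrinks the input that the O(n log n) sort has to process.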

For Example

>>> A = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> A = sorted(A)
>>> A
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
>>> set(A)
{1, 2, 3}

On the other hand:

>>> A = [3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1]
>>> A = set(A)
>>> A
{1, 2, 3}
>>> sorted(A)
[1, 2, 3]

And as @Barmar said, sorting (O(n log n)) is much slower than removing duplicates (O(n) on average with a set), so there is a real time gain when you only have to sort a much smaller collection (1/4 of the list in the example above).

Time Benchmarking

a = [1, 2, 3] * 10000

set(sorted(a)) --> 0.1890001297 s
sorted(set(a)) --> 0.0079998970 s
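As for the "faster way" part of the question: since A is already a NumPy array, np.unique performs the sort and the de-duplication in a single vectorized call, which is typically faster than round-tripping a large array through a Python set. A sketch:

```python
import numpy as np

n = 1000000
A = np.random.randint(0, n, n)

# np.unique returns the sorted unique values in one vectorized pass,
# equivalent to sorted(set(A)) but without boxing every element.
B = np.unique(A)

print(B[:5])
```

B comes back as a sorted NumPy array with no repeated values.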
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow