Question

I've always been a bit confused about this, possibly due to my lack of understanding of compilers. But let's use Python as an example. If we had some large list of numbers called numlist and wanted to get rid of any duplicates, we could call set() on the list, e.g. set(numlist). In return we would have a set of our numbers. This operation, to the best of my knowledge, is done in O(n) time. Though if I were to write my own algorithm to handle this operation, the absolute best I could ever hope for is O(n^2).

What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?


Solution

You can do this in O(n) in any language; in Python it looks basically like this:

# Get min and max values O(n).

min_val = old_list[0]
max_val = old_list[0]
for x in old_list:
    if x < min_val:
        min_val = x
    if x > max_val:
        max_val = x

# Initialise boolean list O(n).

is_in_list = [False] * (max_val - min_val + 1)

# Change booleans for values in the old list O(n).

for x in old_list:
    is_in_list[x - min_val] = True

# Create new list from booleans O(n) (or O(1) based on integer range).

new_list = []
for i in range(min_val, max_val + 1):
    if is_in_list[i - min_val]:
        new_list.append(i)

I'm assuming here that append is an O(1) operation, which it should be unless the implementer was brain-dead. So with a constant number of steps, each of them O(n), you still have an O(n) operation.

Whether the steps are explicitly done in your code or whether they're done under the covers of a language is irrelevant. Otherwise you could claim that the C qsort was one operation and you now have the holy grail of an O(1) sort routine :-)

As many people have discovered, you can often trade space complexity for time complexity. For example, the above only works because we're allowed to introduce the is_in_list and new_list variables. If that were not allowed, the next best solution might be sorting the list (probably no better than O(n log n)) followed by an O(n) pass to remove the duplicates.

As an extreme example, you can use that same extra-space method to sort an arbitrary number of 32-bit integers (say, with each value occurring 255 times or fewer) in O(n) time, provided you can allocate about four billion bytes for storing the counts.

Simply initialise all the counts to zero and run through each position in your list, incrementing the count based on the number at that position. That's O(n).

Then start at the beginning of the list and run through the count array, writing that many copies of each value back into the list. That's O(1), with the "1" being about four billion of course, but still constant time :-)

That's also O(1) space complexity but a very big "1". Typically trade-offs aren't quite that severe.
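A scaled-down sketch of that idea in Python, using 16-bit values so the count array stays small; the 32-bit case described above is the same code with 2**32 counts, i.e. the roughly four billion bytes mentioned (the function name and the 16-bit range are just for illustration):

def counting_sort_16bit(nums):
    # One byte-sized count per possible value; assumes each value
    # appears at most 255 times, as stated above.
    counts = bytearray(2 ** 16)
    # Tally every value: O(n).
    for x in nums:
        counts[x] += 1
    # Walk the fixed-size count array and write the values back out:
    # O(1) with respect to n, the "1" being the size of the value range.
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result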

OTHER TIPS

You can do this in Θ(n) average time using a hash table. Lookup and insertion in a hash table are Θ(1) on average. Thus, you just run through the n items and, for each one, check whether it is already in the hash table and insert it if it is not.
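As a sketch (Python's set is a hash table under the hood, so it serves as the table here; the helper name is just for illustration):

def dedupe(items):
    seen = set()                 # hash table: average-case O(1) lookup/insert
    result = []
    for item in items:           # n items
        if item not in seen:     # already in the table?
            seen.add(item)       # no: insert it
            result.append(item)
    return result

Unlike set(numlist) on its own, this sketch also preserves the order in which elements first appear.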

What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?

The asymptotic complexity of an algorithm does not change whether it is implemented by the language implementers or by a user of the language. As long as both are implemented in a Turing-complete language with a random-access memory model, they have the same capabilities, and algorithms implemented in each will have the same asymptotic complexity. If an algorithm is theoretically O(f(n)), it does not matter whether it is implemented in assembly language, C#, or Python; it will still be O(f(n)).

The complexity bound of an algorithm is completely unrelated to whether it is implemented 'internally' or 'externally'.

Taking a list and turning it into a set through set() is O(n).

This is because set is implemented as a hash set. That means that checking whether something is in the set, or adding something to the set, takes only O(1) on average, i.e. constant time. Thus, to make a set from an iterable (such as a list), you just start with an empty set and add the elements of the iterable one by one. Since there are n elements and each insertion takes O(1), the total time to convert an iterable to a set is O(n).
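In other words, set(numlist) conceptually amounts to something like this sketch (CPython's real implementation is in C, but the shape of the work is the same):

result = set()
for element in numlist:   # n elements
    result.add(element)   # O(1) on average per insertion
# result now holds each distinct value exactly once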

To understand how the hash implementation works, see the Wikipedia article on hash tables.

Offhand I can't think of how to do this in O(n), but here is the cool thing:

The difference between n^2 and n is so massive that the difference between you implementing it and Python implementing it is tiny compared to the choice of algorithm. An O(n^2) algorithm is always worse than an O(n) one for large enough inputs, even if the O(n^2) one is written in C and the O(n) one in Python. You should never assume that kind of difference comes from the fact that you're not writing in a low-level language.
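For contrast, a hand-rolled dedupe along the lines the question describes might look like this sketch; the `in` check scans a plain Python list, so each check is O(n) and the whole thing is O(n^2):

result = []
for x in numlist:
    if x not in result:      # linear scan of result: O(n) per check
        result.append(x)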

That said, if you want to implement your own, you can sort and then remove duplicates: the sort is O(n log n) and removing the duplicates is O(n), as in the sketch below...
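A sketch of that approach (the helper name is just for illustration):

def dedupe_by_sorting(nums):
    ordered = sorted(nums)       # O(n log n)
    result = []
    for x in ordered:            # single O(n) pass dropping adjacent duplicates
        if not result or result[-1] != x:
            result.append(x)
    return result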

There are two issues here.

Time complexity (which is expressed in big-O notation) is a formal measure of how long an algorithm takes to run for a given input size. It's more about how well an algorithm scales than about its absolute speed.

The actual speed (say, in milliseconds) of an algorithm is the time complexity multiplied by a constant (in an ideal world).

Two people could implement the same duplicate-removal algorithm with O(n log n) complexity, but if one writes it in Python and the other writes it in optimised C, then the C program will be faster.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow