
I am trying to remove duplicates from a list of unicode strings without changing the order in which the elements appear (so I don't want to use a set).


result = [u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
for e in result:
    count_e = result.count(e)
    if count_e > 1:
        for i in range(0, count_e - 1):
            result.remove(e)
print result


Output:

[u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']

Expected Output:

[u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']

So, is there a way of doing this as simply as possible?



You don't actually have duplicates in your list: one entry is http://catb.org while another is http://www.catb.org.

You'll have to figure out a way to determine whether the URL has www. in front or not.
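
One possible way to make that determination, as a rough sketch only: parse each URL, drop a leading www. from the host, and compare the normalized forms while keeping the original strings. The names strip_www and dedupe_ignoring_www are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2


def strip_www(url):
    """Return url with a leading 'www.' removed from its host (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith('www.'):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def dedupe_ignoring_www(urls):
    """Keep the first URL seen for each www-insensitive form, in original order."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_www(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [u'http://google.com',
        u'http://catb.org/~esr/faqs/hacker-howto.html',
        u'http://www.catb.org/~esr/faqs/hacker-howto.html',
        u'http://amazon.com',
        u'http://yahoo.com']
deduped = dedupe_ignoring_www(urls)
```

Whichever spelling of a URL appears first is the one that survives, so the result depends on the order of the input list.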


You can create a new list and add items to it if they're not already in it.

result = [...]  # your list items
uniq = []
for item in result:
    # keep only the first occurrence of each item
    if item not in uniq:
        uniq.append(item)

You could use a set and then sort it by the original index:

sorted(set(result), key=result.index)

This works because index returns the position of the first occurrence, so the items stay in order of first appearance in the original list.
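
One caveat: result.index rescans the list for every element, so the sort key costs quadratic time on long lists. A sketch of a single-pass alternative that also keeps first-appearance order, using collections.OrderedDict from the standard library (Python 2.7 and later); the sample list here is only illustrative:

```python
from collections import OrderedDict

result = [u'http://google.com', u'http://amazon.com',
          u'http://google.com', u'http://yahoo.com']

# fromkeys keeps one key per distinct item, in order of first appearance.
unique = list(OrderedDict.fromkeys(result))

# Same answer as the set-and-sort approach above.
assert unique == sorted(set(result), key=result.index)
```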

I also notice that one of the strings in your original list isn't a unicode string, so you might want to do something like:

u = [unicode(s) for s in result]
result = sorted(set(u), key=u.index)

EDIT: 'http://google.com' and 'http://www.google.com' are not string duplicates. If you want to treat them as such, you could do something like:

def remove_www(s):
    s = unicode(s)
    prefix = u'http://'
    # u'http://www.' is 11 characters long, u'http://' is 7
    suffix = s[11:] if s.startswith(u'http://www.') else s[7:]
    return prefix+suffix

And then replace the earlier code with

u = [remove_www(s) for s in result]
result = sorted(set(u), key=u.index)

Here is a method that modifies result in place:

result = [u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', 'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
seen = set()
i = 0
while i < len(result):
    if result[i] not in seen:
        seen.add(result[i])
        i += 1
    else:
        # duplicate: delete it, the next item slides into position i
        del result[i]
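
As a variation on the in-place idea, and only a sketch: build the de-duplicated list in one pass and write it back with slice assignment, so that every other reference to the same list sees the change. dedupe_in_place is a hypothetical name, and the short sample list is just for illustration.

```python
def dedupe_in_place(items):
    """Remove later duplicates from items, mutating the list itself."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    items[:] = kept  # slice assignment replaces the contents, not the object


result = [u'a', u'b', u'a', u'c', u'b']
alias = result            # second reference to the same list object
dedupe_in_place(result)
```

This avoids repeated del calls, each of which has to shift the remaining elements of the list.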
License: CC-BY-SA with attribution
Not affiliated with StackOverflow