I am trying to remove duplicates from a list of unicode strings without changing the order in which the elements appear (so I don't want to use a set).

Program:

result = [u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
result.reverse()
for e in result:
    count_e = result.count(e)
    if count_e > 1:
        for i in range(0, count_e - 1):
            result.remove(e)
result.reverse()
print result

Output:

[u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']

Expected Output:

[u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']

So, is there a way of doing this as simply as possible?

Solution

You actually don't have duplicates in your list. One time you have http://catb.org while another time you have http://www.catb.org.

You'll have to figure out a way to determine whether the URL has www. in front or not, and compare the URLs with that prefix ignored.
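
One way to approach that check, as a minimal sketch: compare URLs by a normalized key in which a leading www. is dropped from the host, and keep the first URL seen for each key. The helper names host_key and dedupe_by_key are made up for illustration, the code assumes Python 3 (urllib.parse), and treating the www and non-www hosts as the same site is an assumption that does not hold for every domain.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Return a comparison key for url that ignores a leading 'www.' on the host."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return (parts.scheme, host, parts.path, parts.query)


def dedupe_by_key(urls, key=host_key):
    """Keep the first URL for each key, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        k = key(url)
        if k not in seen:
            seen.add(k)
            unique.append(url)
    return unique


# Illustrative input, modelled on the URLs in the question.
urls = [
    u'http://google.com',
    u'http://catb.org/~esr/faqs/hacker-howto.html',
    u'http://www.catb.org/~esr/faqs/hacker-howto.html',
    u'http://amazon.com',
    u'http://yahoo.com',
]
print(dedupe_by_key(urls))
```

The original strings are returned unchanged; only the comparison uses the normalized key, so the first spelling of each URL is the one that survives.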

Other tips

You can create a new list and add items to it if they're not already in it.

result = [...]  # some list items
uniq = []
for item in result:
    if item not in uniq:
        uniq.append(item)

You could use a set and then sort it by the original index:

sorted(set(result), key=result.index)

This works because index returns the position of the first occurrence, so the unique items end up ordered by their first appearance in the original list.
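
To see that behaviour concretely, here is a small sketch on a made-up list (the values are illustrative, not from the question). Because every index call scans the list from the start, the sorted approach takes quadratic time on long lists. dict.fromkeys, shown for comparison, gives the same result in one pass, assuming Python 3.7 or later, where dicts keep insertion order.

```python
items = ['b', 'a', 'b', 'c', 'a']  # illustrative data

# list.index reports only the first position of a value,
# so sorting the unique values by it restores first-seen order.
by_index = sorted(set(items), key=items.index)

# Alternative (assumes Python 3.7+): dict keys are unique and ordered.
by_dict = list(dict.fromkeys(items))

print(by_index)
print(by_dict)
```

Both calls require the items to be hashable, which holds for strings.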

I also notice that one of the strings in your original isn't a unicode string. So you might want to do something like:

u = [unicode(s) for s in result]
uniq = sorted(set(u), key=u.index)

EDIT: 'http://google.com' and 'http://www.google.com' are not string duplicates. If you want to treat them as such, you could do something like:

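def remove_www(s):
    # Assumes every URL starts with http://; drops a leading "www." from the host.
    s = unicode(s)
    prefix = u'http://'
    # len(u'http://www.') == 11 and len(u'http://') == 7
    suffix = s[11:] if s.startswith(u'http://www.') else s[7:]
    return prefix+suffix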

And then replace the earlier code with

u = [remove_www(s) for s in result]
uniq = sorted(set(u), key=u.index)

Here is a method that modifies result in place:

result = [u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', 'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
seen = set()
i = 0
while i < len(result):
    if result[i] not in seen:
        seen.add(result[i])
        i += 1
    else:
        del result[i]
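
The same idea can be wrapped in a function so it is easy to reuse and check. This is only a sketch, assuming Python 3 and hashable items; dedupe_in_place is a hypothetical name. Note that it swaps the del loop for slice assignment, which still mutates the original list object but avoids shifting the remaining elements on every deletion.

```python
def dedupe_in_place(items):
    """Remove later duplicates from items, mutating the list itself."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    # Slice assignment replaces the contents but keeps the same list object,
    # so other references to the list see the change too.
    items[:] = kept


data = [u'a', u'b', u'a', u'c', u'b']  # illustrative data
alias = data
dedupe_in_place(data)
print(data)
```

Like other in-place list methods, the function returns None, so the result is read from the list that was passed in.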

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow