Question
I am trying to remove duplicates from a list of unicode strings without changing the order in which the elements appear (so I don't want to use set).
Program:
result = [u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
result.reverse()
for e in result:
    count_e = result.count(e)
    if count_e > 1:
        for i in range(0, count_e - 1):
            result.remove(e)
result.reverse()
print result
Output:
[u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']
Expected Output:
[u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']
So, is there a way to do this as simply as possible?
Solution
You actually don't have duplicates in your list. One time you have http://catb.org while another time you have http://www.catb.org.
You'll have to figure out a way to determine whether the URL has www. in front or not.
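One way to do that, as a rough sketch rather than a definitive solution: parse each URL, drop a leading www. from the host, and use the normalized form only as the comparison key. The helper names normalize_url and dedupe_urls are made up for this illustration, and the import fallback assumes the code may run on either Python 2 or Python 3.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def normalize_url(url):
    """Return url with a leading 'www.' removed from the host (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith('www.'):
        host = host[4:]  # len('www.') == 4
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, comparing normalized forms."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)  # keep the spelling that appeared first
    return unique
```

dedupe_urls returns the original strings; if you would rather store the normalized spelling, append key instead of url.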
Other tips
You can create a new list and add items to it if they're not already in it.
result = [u'a', u'b', u'a']  # placeholder for some list items
uniq = []
for item in result:
    if item not in uniq:
        uniq.append(item)
You could use a set and then sort it by the original index:
sorted(set(result), key=result.index)
This works because index returns the position of the first occurrence, so the items stay in order of first appearance in the original list.
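Note that result.index is a linear scan, so this sort does one list search per unique item. If the list is large, a possible alternative (a sketch, not part of the answer above) is collections.OrderedDict.fromkeys, which keeps keys in first-insertion order on both Python 2.7 and Python 3. The name unique_in_order is hypothetical.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Return items without duplicates, keeping first-appearance order."""
    # fromkeys ignores repeated keys, and OrderedDict remembers insertion order.
    return list(OrderedDict.fromkeys(items))


sample = [u'b', u'a', u'b', u'c', u'a']
by_index = sorted(set(sample), key=sample.index)  # the set-and-sort approach
by_dict = unique_in_order(sample)                 # same result, single pass
```

Both variants require the items to be hashable, which holds for strings.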
I also notice that one of the strings in your original isn't a unicode string. So you might want to do something like:
u = [unicode(s) for s in result]
uniq = sorted(set(u), key=u.index)
EDIT: 'http://google.com' and 'http://www.google.com' are not string duplicates. If you want to treat them as such, you could do something like:
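def remove_www(s):
    # Assumes every URL starts with u'http://'.
    s = unicode(s)
    prefix = u'http://'
    # u'http://www.' is 11 characters long; u'http://' is 7.
    suffix = s[11:] if s.startswith(u'http://www.') else s[7:]
    return prefix + suffix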
And then replace the earlier code with
u = [remove_www(s) for s in result]
uniq = sorted(set(u), key=u.index)
Here is a method that modifies result in place:
result = [u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', 'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
seen = set()
i = 0
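while i < len(result):
    if result[i] not in seen:
        # First time this item appears: remember it and move on.
        seen.add(result[i])
        i += 1
    else:
        # Duplicate: delete it and stay at the same index.
        del result[i]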
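Each del result[i] shifts the remaining elements, so the loop above can get slow on long lists. A possible variation, sketched under the assumption that you only need the same list object updated: collect the items to keep first and write them back with slice assignment. dedupe_in_place is an illustrative name, and the short URL list is sample data.

```python
def dedupe_in_place(items):
    """Remove duplicates from items in place, keeping first occurrences."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    # Slice assignment replaces the contents but keeps the same list object.
    items[:] = kept


urls = [u'http://google.com', u'http://amazon.com',
        u'http://google.com', u'http://yahoo.com']
alias = urls  # a second reference to the same list
dedupe_in_place(urls)
```

Because the list object itself is unchanged, any other reference to it (alias here) sees the de-duplicated contents as well.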