Question

I have an ordered list of things to process that includes some duplicates and I only want to process the first occurrence. Presently, I'm doing it like this in Python v2.7:

seen = set()
for (value, fmt) in formats:
  if fmt not in seen:
    seen.add(fmt)
    process(value, fmt)

Is there anyway to simultaneously insert a new element into seen and detect whether or not it was already present? (This would avoid the repeated lookup of fmt in the set.)

seen = set()
for (value, fmt) in formats:
  # myInsert() would return true if item was not already present.
  if seen.myInsert(fmt):
    process(value, fmt)

Or, alternatively, can I somehow just filter my formats to exclude duplicate entries before looping?

unique_formats = removeDuplicates(formats, key=itemgetter(1))
for (value, fmt) in unique_formats:
  process(value, fmt)
Était-ce utile?

La solution

You could take the length of the set before and after the add(). If it didn't change, the format was already in the set.

seen = set()
for (value, fmt) in formats:
    l1 = len(seen)
    seen.add(fmt)
    if l1 != len(seen):
         process(value, fmt)

Your question presumes that the in test is a costly operation. This turns out not to be the case. Using len() can take more time, although both are quite fast;

In [4]: seen = set(range(10000))

In [5]: %timeit 5995 in seen
10000000 loops, best of 3: 122 ns per loop

In [6]: %timeit len(seen)
1000000 loops, best of 3: 167 ns per loop

(measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)

Autres conseils

I think your first approach is the best. Even the unique_everseen recipe from itertools recipes is using the same approach.

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

You have to use a set:

for (value, fmt) in set(formats):
   process(value, fmt)
from ordereddict import OrderedDict
unique_formats = list(OrderedDict.fromkeys(format))
process(unique_formats)

This will preserve the order and remove duplicates

You can use itertools.groupby to group by the second element of the pair, and then only consider the first value.

>>> from itertools import imap, groupby
>>> from operator import itemgetter
>>> formats = [(1, 'xxx'), (2, 'xxx'), (3, 'yyy'), (4, 'yyy')]
>>> for fmt, value in imap(lambda (x, y): (x, next(y)[0]), groupby(formats, itemgetter(1)):
...    print('%s: %s', fmt, value)
...
xxx: 1
yyy: 3

If your list is in order, you can be sure that identical formats will be adjacent to each other. This means you don't need to use a set to keep track of past value. Just use a single variable to record the last format processed:

last = None
for (value, fmt) in formats:
    if fmt != last:
        last = fmt
        process(value, fmt)
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top