How to process each item in an ordered list with duplicates only once in Python?

Question 1

You could take the length of the set before and after the add(). If it didn't change, the format was already in the set.

seen = set()
for (value, fmt) in formats:
    l1 = len(seen)
    seen.add(fmt)
    if l1 != len(seen):
         process(value, fmt)

Your question presumes that the in test is a costly operation. This turns out not to be the case. Using len() can take more time, although both are quite fast;

In [4]: seen = set(range(10000))

In [5]: %timeit 5995 in seen
10000000 loops, best of 3: 122 ns per loop

In [6]: %timeit len(seen)
1000000 loops, best of 3: 167 ns per loop

(measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)

Question 2

I think your first approach is the best. Even the unique_everseen recipe from itertools recipes is using the same approach.

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

Question 3

You have to use a set:

for (value, fmt) in set(formats):
   process(value, fmt)

Question 4

from ordereddict import OrderedDict
unique_formats = list(OrderedDict.fromkeys(format))
process(unique_formats)

This will preserve the order and remove duplicates

Question 5

You can use itertools.groupby to group by the second element of the pair, and then only consider the first value.

>>> from itertools import imap, groupby
>>> from operator import itemgetter
>>> formats = [(1, 'xxx'), (2, 'xxx'), (3, 'yyy'), (4, 'yyy')]
>>> for fmt, value in imap(lambda (x, y): (x, next(y)[0]), groupby(formats, itemgetter(1)):
...    print('%s: %s', fmt, value)
...
xxx: 1
yyy: 3

Question 6

If your list is in order, you can be sure that identical formats will be adjacent to each other. This means you don't need to use a set to keep track of past value. Just use a single variable to record the last format processed:

last = None
for (value, fmt) in formats:
    if fmt != last:
        last = fmt
        process(value, fmt)