Question

I have a system that accepts messages containing URLs. If certain keywords appear in a message, an API call is made with the URL as a parameter.

To conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.

So if this URL ---> http://instagram.com/p/gHVMxltq_8/ comes in and is submitted to the API:

    from urllib.parse import urlparse   # Python 2: from urlparse import urlparse
    import requests

    url = incoming.msg['urls']
    parsed = urlparse(url)               # keep the original string for the API call
    if parsed.netloc == "instagram.com":
        r = requests.get("http://api.some.url/show?url=%s" % url)

and then 3 seconds later the same URL comes in, I don't want it submitted to the API.

What programming method might I deploy to eliminate or limit duplicate URLs being submitted to the API within a given time range?

UPDATE USING TIM PETERS' METHOD:

    limit = DecayingSet(86400)
    l = limit.add(longUrl)
    if l == False:
        pass
    else:
        r = requests.get("http://api.some.url/show?url=%s" % url)

This snippet is inside a long-running process that accepts streaming messages via TCP.

Every time I pass the same URL in, l returns True.

But when I try it in the interpreter everything is good: it returns False when the set's timeout hasn't expired.

Does it have to do with the fact that the script is running while the set is being added to?

Instance issues?
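
For illustration, this is the difference I'm wondering about (the receive call below is a hypothetical stand-in for my TCP loop):

    # If this line runs once per incoming message, the set starts empty
    # every time, so add() always returns True:
    #
    #     limit = DecayingSet(86400)   # inside the receive loop
    #
    # versus creating the instance once, before the loop:

    limit = DecayingSet(86400)
    while True:
        msg = receive_tcp_message()   # hypothetical receive call
        longUrl = msg['urls']
        if limit.add(longUrl):
            r = requests.get("http://api.some.url/show?url=%s" % longUrl)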

Solution

Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,

from time import time
from collections import deque

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        self.timeout = timeout
        self.d = deque()          # (timestamp, thing) pairs, oldest first
        self.present = set()      # things currently in the set

    def add(self, thing):
        # Return True if `thing` hasn't been added in the last
        # `timeout` seconds, else return False.
        self.clean()   # purge expired entries before checking
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)

As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so it gets out at once if no items have expired. Lots of possibilities.

Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
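
A quick interactive check (the 2-second timeout is purely for the demo):

from time import sleep

limit = DecayingSet(2)
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # True: first sighting
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # False: duplicate inside the window
sleep(2.1)
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # True again: the old entry expired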

OTHER TIPS

Suppose your desired interval is 1 hour. Keep two counters that increment every hour, but offset 30 minutes from each other; i.e., counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.

Now if a link comes in with either of the two counters the same as an earlier link, consider it a duplicate.

The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the URL exists.

Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB); and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17, and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if the bit is turned on.

If neither bit is turned on, you can be sure it is a fresh URL within the hour, so you are done (quickly).

If either bit is turned on, look up the URL in the database table to check whether it was really that URL or some other URL that hashed to the same bit.

Further update: Bloom filters are an extension of this scheme.
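
If it helps, here is a minimal sketch of the fast path described above, assuming SHA-1 for the hash and an arbitrary bit-array size (both choices are illustrative; the slow-table confirmation is left out):

import time
import hashlib

N = 1 << 20                # n bits of fast memory; the size is arbitrary here
bits = bytearray(N // 8)
INTERVAL = 3600            # desired interval: 1 hour

def counters():
    # Counter A increments every hour; counter B also increments
    # every hour, but offset 30 minutes from A.
    now = time.time()
    return int(now // INTERVAL), int((now + INTERVAL // 2) // INTERVAL)

def bit_for(counter, url):
    # Hash "counter,url" into the range 0 to N - 1.
    digest = hashlib.sha1(("%d,%s" % (counter, url)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % N

def check_and_mark(url):
    # True means a bit was already on: *possibly* a duplicate, so confirm
    # against the slow database table. False means the URL is definitely
    # fresh within the interval.
    maybe_dup = False
    for c in counters():
        i = bit_for(c, url)
        if bits[i // 8] & (1 << (i % 8)):
            maybe_dup = True
        bits[i // 8] |= 1 << (i % 8)
    return maybe_dup

Note that nothing here clears old bits as the counters roll over; stale bits only add false positives, which the database lookup already absorbs, but a long-running process would want to reset the array periodically.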

I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:

import time

SOME_TIMEOUT = 3600  # seconds; the value is just an example, use whatever window you need
urls = {}            # maps url -> time of last submission

and then for each URL:

if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    pass  # Don't submit the data
else:
    urls[url] = time.time()
    # Submit the data
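
One thing to watch in a long-running process: nothing ever removes old entries, so the dict grows without bound. A small (hypothetical) pruning pass run occasionally keeps it in check:

def prune(urls, timeout):
    # drop entries older than `timeout` seconds
    cutoff = time.time() - timeout
    for u in [u for u, t in urls.items() if t < cutoff]:
        del urls[u]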
Licensed under: CC-BY-SA with attribution