Question

I have a system that accepts messages containing URLs. If certain keywords appear in a message, an API call is made with the URL as a parameter.

To conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.

So if this URL ---> http://instagram.com/p/gHVMxltq_8/ comes in and is submitted to the API:

    from urllib.parse import urlparse   # Python 2: from urlparse import urlparse
    import requests

    url = incoming.msg['urls']
    parsed = urlparse(url)               # keep the original string for the API call
    if parsed.netloc == "instagram.com":
        r = requests.get("http://api.some.url/show?url=%s" % url)

and then 3 seconds later the same URL comes in, I don't want it submitted to the API.

What programming method might I deploy to eliminate or limit duplicate URLs being submitted to the API within a given time range?

UPDATE USING TIM PETERS' METHOD:

    limit = DecayingSet(86400)
    l = limit.add(longUrl)
    if l == False:
        pass
    else:
        r = requests.get("http://api.some.url/show?url=%s" % url)

This snippet is inside a long-running process that accepts streaming messages via TCP.

Every time I pass the same URL in, l returns True.

But when I try it in the interpreter everything is good: it returns False when the set's timeout hasn't expired.

Does it have to do with the fact that the script is running while the set is being added to?

Instance issues?
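
For illustration, this is the difference I'm wondering about (the receive call below is a hypothetical stand-in for my TCP loop):

    # If this line runs once per incoming message, the set starts empty
    # every time, so add() always returns True:
    #
    #     limit = DecayingSet(86400)   # inside the receive loop
    #
    # versus creating the instance once, before the loop:

    limit = DecayingSet(86400)
    while True:
        msg = receive_tcp_message()   # hypothetical receive call
        longUrl = msg['urls']
        if limit.add(longUrl):
            r = requests.get("http://api.some.url/show?url=%s" % longUrl)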

Solution

Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,

from time import time
from collections import deque

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        self.timeout = timeout
        self.d = deque()          # (timestamp, thing) pairs, oldest first
        self.present = set()      # things currently in the set

    def add(self, thing):
        # Return True if `thing` hasn't been added in the last
        # `timeout` seconds, else return False.
        self.clean()   # purge expired entries before checking
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)

As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so it gets out at once if no items have expired. Lots of possibilities.

Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
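
A quick interactive check (the 2-second timeout is purely for the demo):

from time import sleep

limit = DecayingSet(2)
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # True: first sighting
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # False: duplicate inside the window
sleep(2.1)
print(limit.add("http://instagram.com/p/gHVMxltq_8/"))  # True again: the old entry expired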

OTHER TIPS

Suppose your desired interval is 1 hour. Keep two counters that increment every hour, but offset 30 minutes from each other; i.e., counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.

Now if a link comes in with either of the two counters the same as an earlier link, consider it a duplicate.

The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the URL exists.

Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB); and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17, and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if the bit is turned on.

If neither bit is turned on, you can be sure it is a fresh URL within the hour, so you are done (quickly).

If either bit is turned on, look up the URL in the database table to check whether it was really that URL or some other URL that hashed to the same bit.

Further update: Bloom filters are an extension of this scheme.
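
If it helps, here is a minimal sketch of the fast path described above, assuming SHA-1 for the hash and an arbitrary bit-array size (both choices are illustrative; the slow-table confirmation is left out):

import time
import hashlib

N = 1 << 20                # n bits of fast memory; the size is arbitrary here
bits = bytearray(N // 8)
INTERVAL = 3600            # desired interval: 1 hour

def counters():
    # Counter A increments every hour; counter B also increments
    # every hour, but offset 30 minutes from A.
    now = time.time()
    return int(now // INTERVAL), int((now + INTERVAL // 2) // INTERVAL)

def bit_for(counter, url):
    # Hash "counter,url" into the range 0 to N - 1.
    digest = hashlib.sha1(("%d,%s" % (counter, url)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % N

def check_and_mark(url):
    # True means a bit was already on: *possibly* a duplicate, so confirm
    # against the slow database table. False means the URL is definitely
    # fresh within the interval.
    maybe_dup = False
    for c in counters():
        i = bit_for(c, url)
        if bits[i // 8] & (1 << (i % 8)):
            maybe_dup = True
        bits[i // 8] |= 1 << (i % 8)
    return maybe_dup

Note that nothing here clears old bits as the counters roll over; stale bits only add false positives, which the database lookup already absorbs, but a long-running process would want to reset the array periodically.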

I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:

import time

SOME_TIMEOUT = 3600  # seconds; the value is just an example, use whatever window you need
urls = {}            # maps url -> time of last submission

and then for each URL:

if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    pass  # Don't submit the data
else:
    urls[url] = time.time()
    # Submit the data
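
One thing to watch in a long-running process: nothing ever removes old entries, so the dict grows without bound. A small (hypothetical) pruning pass run occasionally keeps it in check:

def prune(urls, timeout):
    # drop entries older than `timeout` seconds
    cutoff = time.time() - timeout
    for u in [u for u, t in urls.items() if t < cutoff]:
        del urls[u]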
Licensed under: CC-BY-SA with attribution