Java HashMap-like data structure that locks the key-value pair on get and unlocks it on put - for crawler politeness

StackOverflow https://stackoverflow.com/questions/23525138

Question

Original problem: I have a multithreaded crawler application, and I need to ensure that the target servers are not flooded with requests. To accomplish that, consecutive requests to a particular domain must be separated by a certain interval (say 200 ms).

My solution: As a naive solution to this problem, I thought of maintaining a map from domain name to last request time. When a new request to a domain is about to be sent, I check the last request time in the map and delay the request by whatever time is needed to maintain the interval between requests.

Flaw in the current solution and the fix: This naive approach fails if two or more threads request the same primary domain at the same time. All of those threads sleep until the interval has elapsed and then send their requests simultaneously, which floods the target server.
Fix: The fix I am thinking of is to lock the map key on get, so that all other threads wait, and to unlock it on put, so that the next waiting thread can continue.
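
To make the flaw concrete, here is a minimal sketch of the naive approach with the clock passed in as a parameter. The class and method names (`NaiveScheduler`, `wakeUpTime`, `recordRequest`) are made up for this illustration, and the timestamps are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the naive approach: the map holds the time of
// the last request that was actually sent, and it is only updated after the
// requesting thread has finished sleeping.
class NaiveScheduler {
    private final Map<String, Long> lastRequest = new HashMap<>();
    private final long intervalMs = 200;

    // Time at which a thread asking at `now` believes it may send its request.
    synchronized long wakeUpTime(String domain, long now) {
        Long last = lastRequest.get(domain);
        return last == null ? now : Math.max(last + intervalMs, now);
    }

    // Called only after the request has been sent.
    synchronized void recordRequest(String domain, long sentAt) {
        lastRequest.put(domain, sentAt);
    }
}

class NaiveSchedulerDemo {
    public static void main(String[] args) {
        NaiveScheduler scheduler = new NaiveScheduler();
        scheduler.recordRequest("example.com", 1_000L);

        // Two threads ask at almost the same moment, before either has sent anything.
        long first = scheduler.wakeUpTime("example.com", 1_050L);
        long second = scheduler.wakeUpTime("example.com", 1_060L);

        // Both are told to wake up at 1200, so both requests go out together.
        System.out.println(first + " " + second); // prints "1200 1200"
    }
}
```

Synchronizing the individual map operations does not help here, because the gap is between reading the old time and recording the new one.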

Is this the correct way to approach the problem? If so, is there a map data structure that is implemented like this?
Or is there a better way to approach this problem?


Solution

You can make a simple adaptation to the map approach to solve your problem. First I will show some code, then I will explain it:

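import java.util.HashMap;
import java.util.Map;

class Scheduler {
    // Planned request time (ms since epoch) of the most recent caller, per key.
    private final Map<String, Long> map = new HashMap<>();
    private final long interval = 200; // ms

    // Blocks the calling thread until its planned request time has arrived.
    public void sleepAsLongAsNecessary(String key) throws InterruptedException {
        long now = System.currentTimeMillis();
        long deadline = next(now, key);
        while (now < deadline) {
            Thread.sleep(deadline - now);
            now = System.currentTimeMillis();
        }
    }

    // Reserves the next free slot for the key. It is synchronized so that
    // concurrent callers each get a different slot.
    private synchronized long next(long now, String key) {
        Long oldValue = map.get(key);
        long newValue = oldValue == null ? now : Math.max(oldValue + interval, now);
        map.put(key, newValue);
        return newValue;
    }
}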

This looks pretty similar to the solution you described. The important difference is:

The map does not store the last request time. Instead, it stores the planned request time of the last thread that asked for a request time. If several threads ask for a request time at the same time, each of them gets a different planned request time.

As soon as a thread knows its planned request time, it sleeps for as long as necessary.
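
To see how the planned request times spread out, here is a small sketch that repeats the bookkeeping of `next(...)` with the clock passed in, so that the results are reproducible. The names `PlannedTimes` and `PlannedTimesDemo` and the timestamps are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Same bookkeeping as next(...) above, with the current time supplied by the
// caller. The class name is made up for this sketch.
class PlannedTimes {
    private final Map<String, Long> planned = new HashMap<>();
    private final long intervalMs = 200;

    synchronized long next(long now, String key) {
        Long old = planned.get(key);
        long slot = old == null ? now : Math.max(old + intervalMs, now);
        planned.put(key, slot);
        return slot;
    }
}

class PlannedTimesDemo {
    public static void main(String[] args) {
        PlannedTimes times = new PlannedTimes();
        long now = 10_000L;
        // Three callers arriving at the same instant get slots 200 ms apart.
        System.out.println(times.next(now, "example.com"));         // 10000
        System.out.println(times.next(now, "example.com"));         // 10200
        System.out.println(times.next(now, "example.com"));         // 10400
        // A different key has its own schedule.
        System.out.println(times.next(now, "example.org"));         // 10000
        // A caller arriving after the backlog has passed is not delayed.
        System.out.println(times.next(now + 1_000, "example.com")); // 11000
    }
}
```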

Other tips

Can you not use ArrayBlockingQueue or ConcurrentLinkedQueue (or similar queue structures from java.util.concurrent)? Add the requests to the queue and pull them out at regular intervals.
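
A rough sketch of what the queue idea could look like, assuming one queue per domain and a `ScheduledExecutorService` as the timer. `DomainQueue` and the `fetcher` callback are hypothetical names for this illustration, not an existing API.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical per-domain queue: crawler threads submit URLs, and a single
// timer thread hands at most one URL to the fetcher per interval.
class DomainQueue {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    DomainQueue(long intervalMs, Consumer<String> fetcher) {
        // Fixed delay: the interval is measured from the end of one fetch to
        // the start of the next, so two fetches never run back to back.
        timer.scheduleWithFixedDelay(() -> {
            String url = pending.poll();
            if (url == null) {
                return; // nothing queued for this tick
            }
            try {
                fetcher.accept(url);
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs of the task.
                System.err.println("fetch failed for " + url + ": " + e);
            }
        }, 0, intervalMs, TimeUnit.MILLISECONDS);
    }

    void submit(String url) {
        pending.add(url);
    }

    void shutdown() {
        timer.shutdownNow();
    }
}

class DomainQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        DomainQueue queue = new DomainQueue(200,
                url -> System.out.println(System.currentTimeMillis() + " fetch " + url));
        queue.submit("http://example.com/a");
        queue.submit("http://example.com/b");
        queue.submit("http://example.com/c");
        Thread.sleep(1_000); // give the timer time to drain the queue
        queue.shutdown();
    }
}
```

With this layout each domain needs its own queue and timer task, so the cost grows with the number of domains.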

This is reasonable for a small set of domain names. If your domain set is large, using a map is a better idea. The solution you proposed seems fine. You might need to use ConcurrentHashMap, but you would still need to lock individual keys after each request is handled. The only advantage of using ConcurrentHashMap is that the map itself is not locked, so read operations can proceed freely.
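
One possible way to get per-key atomicity without an explicit lock is `ConcurrentHashMap.compute`, which performs the whole read-modify-write for a key atomically. The sketch below is only an illustration; `ConcurrentScheduler` is a made-up name, and the clock is passed in to keep the output reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical scheduler that stores the planned request time per key and
// relies on compute(...) for an atomic update of a single key.
class ConcurrentScheduler {
    private final ConcurrentHashMap<String, Long> planned = new ConcurrentHashMap<>();
    private final long intervalMs = 200;

    long next(long now, String key) {
        // The remapping function runs atomically for this key, so no two
        // callers can observe the same old value.
        return planned.compute(key,
                (k, old) -> old == null ? now : Math.max(old + intervalMs, now));
    }
}

class ConcurrentSchedulerDemo {
    public static void main(String[] args) {
        ConcurrentScheduler scheduler = new ConcurrentScheduler();
        long now = 10_000L;

        // Four tasks ask for a slot for the same domain concurrently.
        List<CompletableFuture<Long>> futures = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            futures.add(CompletableFuture.supplyAsync(() -> scheduler.next(now, "example.com")));
        }

        TreeSet<Long> slots = new TreeSet<>();
        for (CompletableFuture<Long> future : futures) {
            slots.add(future.join());
        }
        // Which task gets which slot varies, but the set of slots does not.
        System.out.println(slots); // [10000, 10200, 10400, 10600]
    }
}
```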

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow