Question

I have written some PostgreSQL database client code to update a central database with a table of IP addresses and host names from multiple clients. There are two tables: one to hold mappings between IP addresses and host names, and one to hold a queue of IP addresses that have not yet been resolved to host names.

Here is the IP-address-to-host-name mapping table:

CREATE TABLE g_hostmap(
    appliance_id     INTEGER,
    ip               INET,
    fqdn             TEXT,
    resolve_time     TIMESTAMP, 
    expire_time      TIMESTAMP,
    UNIQUE(appliance_id, ip))

Here is the work queue table:

CREATE TABLE g_hostmap_work(
    ip               INET,
    input_table      TEXT)

The database clients each pull requests from a single work queue table. Each request contains a private IPv4 address for which a host name is requested.

The workflow is as follows: each client periodically queries the central database work queue for a list of IP addresses for which host names are needed, performs a reverse DNS look-up on the addresses, and then updates the host name table with the (IP address, host name) pairs, one at a time. I wish to minimize the likelihood that multiple clients will duplicate effort by attempting to resolve the same IP addresses simultaneously.

I limit each batch of updates to the larger of 10 rows or 10% of the size of the work queue in rows. The timing of the clients is somewhat independent. How can I further minimize contention for the DNS name server and host name table during the update process? My customer is concerned that there will be much duplication of effort.
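For reference, the batch limit described above can be computed in a single query; this is only a sketch of one way to apply the 10-row/10% rule server-side, not necessarily how the clients currently do it:

```sql
-- Sketch: cap each batch at the larger of 10 rows or 10% of the queue.
-- GREATEST and the integer division are assumptions about how the
-- client computes its limit.
SELECT GREATEST(10, COUNT(*) / 10) AS batch_limit
FROM g_hostmap_work;
```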

Here is the initial query for a count of items in the work queue:

SELECT COUNT(*)
       FROM g_hostmap_work queued
       LEFT JOIN g_hostmap cached
            ON queued.ip = cached.ip
            AND now() < cached.expire_time

Here is the query to return a subset of items in the work queue:

SELECT queued.ip, queued.input_table, cached.expire_time
       FROM g_hostmap_work queued
       LEFT JOIN g_hostmap cached
            ON queued.ip = cached.ip
            AND now() < cached.expire_time
       LIMIT 10

Here is an example of a single INSERT statement to update the database with a new IP address/host name mapping:

INSERT INTO g_hostmap_20131230 VALUES
(NULL, '192.168.54.133', 'powwow.site', now(), now() + 900 * INTERVAL '1 SECOND')
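On PostgreSQL 9.5 or later, the UNIQUE(appliance_id, ip) constraint would also allow this to be written as an upsert, so a re-resolved address refreshes its expiry instead of raising a unique-violation error. A sketch, with one caveat: rows inserted with a NULL appliance_id never conflict with each other, because NULLs compare as distinct under a plain UNIQUE constraint:

```sql
-- Sketch: upsert against the mapping table (assumes PostgreSQL 9.5+).
INSERT INTO g_hostmap (appliance_id, ip, fqdn, resolve_time, expire_time)
VALUES (1, '192.168.54.133', 'powwow.site', now(),
        now() + 900 * INTERVAL '1 SECOND')
ON CONFLICT (appliance_id, ip)
DO UPDATE SET fqdn         = EXCLUDED.fqdn,
              resolve_time = EXCLUDED.resolve_time,
              expire_time  = EXCLUDED.expire_time;
```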

Solution

I'm gonna make a kind of odd-sounding suggestion. Add an auto-incrementing bigint column to the source table, and create a set of 10 partial indexes using modulo division. Here's a simple test case:

create table queue (id bigserial, input text);
create index q0 on queue (id) where id%10=0;
create index q1 on queue (id) where id%10=1;
create index q2 on queue (id) where id%10=2;
create index q3 on queue (id) where id%10=3;
create index q4 on queue (id) where id%10=4;
create index q5 on queue (id) where id%10=5;
create index q6 on queue (id) where id%10=6;
create index q7 on queue (id) where id%10=7;
create index q8 on queue (id) where id%10=8;
create index q9 on queue (id) where id%10=9;
insert into queue select generate_series(1,50000),'this';

What we've done here is create a set of partial indexes, each covering one tenth of the table. Next, we'll select a chunk of one of those buckets to work on:

begin;
select * from queue where id%10=0 limit 100 for update;
 id  | input 
-----+-------
  10 | this
  20 | this
  30 | this
 ...
-- do work here --
commit;
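As a sanity check, the planner should satisfy the id%10=0 predicate with the matching partial index rather than a sequential scan. A sketch; the exact plan shape depends on statistics and version:

```sql
-- Assumption: once the table is analyzed, the plan for this query
-- references partial index q0 (as an Index Scan or Bitmap Index Scan).
ANALYZE queue;
EXPLAIN SELECT * FROM queue WHERE id % 10 = 0 LIMIT 100;
```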

Now the interesting part. With this setup, if you have more than 10 workers, you just cycle them through the bucket numbers. Two workers that land on the same bucket will simply block when the SELECT ... FOR UPDATE above runs, while workers on any other bucket (remainders 1 through 9) carry on unblocked.

begin;
select * from queue where id%10=1 limit 100 for update;
 id  | input 
-----+-------
   1 | this
  11 | this
  21 | this
  31 | this
-- do work here
commit;

This way all the work is divided into 10 buckets. Want more buckets? Change the number after the % and increase the number of indexes to match.
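If you go to a larger bucket count, the index definitions can be generated in one pass instead of written by hand. A sketch using a DO block (assumes PostgreSQL 9.1+ for format(); the bucket count of 16 and the q-prefix naming are arbitrary choices, and each worker's queries must use the same divisor):

```sql
-- Sketch: generate one partial index per modulo bucket.
DO $$
DECLARE
    buckets integer := 16;  -- assumption: desired number of buckets
    i integer;
BEGIN
    FOR i IN 0 .. buckets - 1 LOOP
        -- %% emits a literal % inside format()
        EXECUTE format(
            'CREATE INDEX q%s ON queue (id) WHERE id %% %s = %s',
            i, buckets, i);
    END LOOP;
END $$;
```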

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow