Question

I'm re-implementing an app to support a national engineering contest, moving it from a local server to the cloud.

In order to tell a team where they stand at the moment, the query has the form

select 1 + count(*) from team where score < ?

Team scores change very dynamically. There may be up to 2 million teams, and I need to handle at least 10 of these queries per second.

The original gets the needed performance (actually, it already did on 1999 hardware) by using a separate Berkeley DB of team/score records. Berkeley DB has a "record number" feature that provides exactly the right functionality, and it's very fast.

Heroku apparently has no way to support Berkeley DB. PostgreSQL, their standard DB, does select count(*) with a full table or index scan, which is way too slow.

Any ideas on how to proceed? I am not wedded to Heroku, but must move to a cloud solution of some kind.


Solution

Use Redis to store your team data in a sorted set. The ZRANK command will then return the count you need. Redis is very fast in general, and ZRANK runs in expected O(log N) time; sorted sets are implemented with skip lists.
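For example, a rough sketch in redis-cli form; the key name team_scores and member names like team:42 are made up for illustration:

ZADD team_scores 1340 team:99
ZADD team_scores 1500 team:42
ZADD team_scores 1720 team:7

ZRANK team_scores team:42

ZRANK returns 1 here, the 0-based position in ascending score order, so the standing by the question's formula is ZRANK + 1 = 2. ZADD also updates the score of an existing member in O(log N), and ZREVRANK gives the position counted from the highest score if that direction is wanted.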

OTHER TIPS

Create a rank table and update as frequently as adequate. Include the category (open or official) and score so you don't have to join it to the team table at query time:

create table "rank" (
    team integer primary key, 
    category integer,
    score integer,
    rank_consolidated integer,
    rank_category integer
);

begin;
truncate table "rank"
;
insert into "rank" (team, category, score, rank_consolidated, rank_category)
select 
    team, category, score,
    rank() over(order by score desc) rank_consolidated,
    rank() over(partition by category order by score desc) rank_category
from team
;
commit
;
select * from "rank" where team = 11;

As for the exact ranking behavior, look into window functions.
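As an illustration (with made-up scores 300, 200, 200, 100), the three common ranking functions differ only in how they treat ties:

select team, score,
    rank()       over (order by score desc) as rnk,        -- 1, 2, 2, 4: gaps after ties
    dense_rank() over (order by score desc) as dense_rnk,  -- 1, 2, 2, 3: no gaps
    row_number() over (order by score desc) as row_num     -- 1, 2, 3, 4: ties broken arbitrarily
from team;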

Putting an index on score should at least avoid the full table scan: the count can then be answered by scanning only the index entries matching score < ?.
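A minimal sketch, assuming the team table and query from the question (the index name is made up):

create index team_score_idx on team (score);

-- the count is now answered from the matching index range rather than the whole table
select 1 + count(*) from team where score < ?;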

If it's read from a lot more heavily than it's written to and it always has to be up to date, then this is an ideal job for a trigger-maintained summary table (sort of a materialized view).

Have a trigger on the team table that fires AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW and executes a trigger function that updates the team_summary entry for that team with the new score.

The team_summary table can be accessed via a simple, direct index lookup by equality, so it'll be crazy fast. Since Pg supports simultaneous readers and writers, the team_summary table will remain responsive even if it's being updated very frequently. The only thing you really need to do for best results is set FILLFACTOR to something like 50 on the team_summary table so that HOT updates can work well, and make sure autovacuum runs quite often to spread out the vacuum I/O churn.
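A sketch of those two settings, with hypothetical team_summary column names (not from the original answer):

create table team_summary (
    team_id integer primary key,
    score   integer not null
) with (fillfactor = 50);  -- leave free space on each page so updates can be HOT

alter table team_summary set (
    autovacuum_vacuum_scale_factor = 0.05,  -- vacuum once roughly 5% of rows are dead
    autovacuum_vacuum_cost_delay = 0        -- don't throttle vacuum on this small, hot table
);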

Writing the trigger should be pretty trivial. You just have to be careful to write a concurrency-safe trigger that won't break when multiple connections update the same team at once. Something like:

UPDATE team_summary SET score = score + 1 WHERE team_id = NEW.team_id;

should be fine under both SERIALIZABLE and READ COMMITTED isolation. See the Concurrency Control chapter of the PostgreSQL documentation. The only hard bit is that you must make sure to insert a new row into team_summary before inserting a team's first row into team, so the trigger doesn't have to handle the surprisingly tricky case where the team_summary row doesn't exist yet. Getting the upsert/merge right for that is kind of tricky.
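On PostgreSQL 9.5 and later, INSERT ... ON CONFLICT sidesteps that upsert problem entirely; a rough sketch of such a trigger (column names assumed here, not the answer's exact design):

create or replace function team_summary_sync() returns trigger as $$
begin
    if tg_op = 'DELETE' then
        delete from team_summary where team_id = old.team_id;
        return old;
    end if;
    -- INSERT or UPDATE: keep the summary row in step with the new score
    insert into team_summary (team_id, score)
    values (new.team_id, new.score)
    on conflict (team_id) do update set score = excluded.score;
    return new;
end;
$$ language plpgsql;

create trigger team_summary_sync
after insert or update or delete on team
for each row execute procedure team_summary_sync();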

If the write rate is also very high and you can get away with the results being updated only every few seconds/minutes, use Clodoaldo's approach instead.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow