Selecting date intervals, doing it fast, and always returning the latest entry with the result

https://stackoverflow.com/questions/785172

16-09-2019
|

Question

I have a database with a table, storing changes in account-balance across a couple of accounts, with three columns;

float   balance, #The account balance after the change
Date    date,    #Date that balance change occurred
int     aid      #Account that the balance change occurred on

It contains a couple of entries for each day of the year, and I want to retrieve the balance of every five days. I also want it to separate between accounts (ie if two changes occurred on the same day, but on separate accounts, return both).

The problem is this: Sometimes there will be several days (or weeks) where there is no data available. When that occurs, I want to make sure to return the latest entry before the "hole" in the dataset. This is a simplified version of the problem, the actual database is big (several gigabytes), the size is the reason why I want to return a subset of the data. It cannot use platform specific methods, because it needs to work on both oracle and mySQL.

My question is: Is there any way to do this fast? I would be able to write a query that gets the job done, but I am hoping there is some devil magic way of doing it that does not require lots of nested queries and aggregate functions..

Solution

I would use Andomar's Period table idea, but I would try a slightly different final query. This assumes that your Account_Balances table has a PK on aid and date. If you ended up with two balances for the same account for the same exact date and time then you would get some duplicate rows.

SELECT
     P.start_date,
     P.end_date,
     AB1.account_id,
     AB1.balance
FROM
     Periods P
LEFT OUTER JOIN Account_Balances AB1 ON
     AB1.date <= P.end_date
LEFT OUTER JOIN Account_Balances AB2 ON
     AB2.aid = AB1.aid AND
     AB2.date > AB1.date AND
     AB2.date <= P.end_date
WHERE
     AB2.aid IS NULL

If the account has no rows before or during the given period you will not get a row back for it.

OTHER TIPS

You can do this in a relatively straightforward way by creating a period table, which you can join with the accounts table to create one row per account per period.

Here's an example. Let's set up some temporary tables:

create table #balance (
    id int identity,
    balance float,
    date datetime,
    aid int
)

create table #period (
    id int identity,
    startdt datetime,
    enddt datetime
)

Enter some test data:

insert into #yourtable (balance, date, aid) values (4,'2009-01-01',1)
insert into #yourtable (balance, date, aid) values (5,'2009-01-10',1)
insert into #yourtable (balance, date, aid) values (6,'2009-01-10',1)
insert into #yourtable (balance, date, aid) values (7,'2009-01-16',1)
insert into #yourtable (balance, date, aid) values (2,'2009-01-01',2)
insert into #yourtable (balance, date, aid) values (3,'2009-01-10',2)
insert into #yourtable (balance, date, aid) values (4,'2009-01-10',2)
insert into #yourtable (balance, date, aid) values (5,'2009-01-16',2)

insert into #period (startdt, enddt) values ('2009-01-01','2009-01-06')
insert into #period (startdt, enddt) values ('2009-01-06','2009-01-11')
insert into #period (startdt, enddt) values ('2009-01-11','2009-01-16')
insert into #period (startdt, enddt) values ('2009-01-16','2009-01-21')

Now let's query all periods:

from #period p

Add one row for each balance before the end of the period:

left join #balance b1 on 
    b1.date <= p.enddt

Search for balances in between the balance from the first join, and the end of the period:

left join #balance b2 on 
    b2.aid = b1.aid
    and b1.id < b2.id
    and b2.date <= p.enddt

Then filter out the rows that are not the last balance for their period.

where
    b2.aid is null

The b2 join basically looks for the "in-between" value, and by saying it's id is null, you say no in-between row exists. The final query looks like this:

select 
    b1.aid
,   p.startdt
,   b1.balance
from #period p
left join #balance b1 on 
    b1.date <= p.enddt
left join #balance b2 on 
    b2.aid = b1.aid
    and b1.id < b2.id
    and b2.date <= p.enddt
where
    b2.aid is null
order by b1.aid, p.startdt

Note: the queries assume a balance with a later date always has a larger id. If you never have to balances with exactly the same end date, you can replace "b1.id < b2.id" with "b1.date < b2.date".

If you wait for postgresql 8.4 you might be able to make use of Window Functions

http://www.postgresql.org/docs/8.4/static/tutorial-window.html

http://www.postgresql.org/docs/8.4/static/functions-window.html

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow