How do I improve a query that selects records with the MAX date across joined tables in PostgreSQL?

https://stackoverflow.com/questions/12381403

01-07-2021

Question

I have three large tables as follows...

property
--------
property_id
other_prop_data

transfer_property
-----------------
property_id
transfer_id

transfer
--------
transfer_id
contract_date
transfer_price

I want to return a list of unique property IDs for all Transfers that occurred between '2012-01-01' and '2012-06-30'. Here's the code I have so far...

SELECT *
FROM property p
JOIN
(
  SELECT t.transfer_id, t.contract_date, t.transfer_price::integer, tp.property_id
  FROM transfer t
  LEFT JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
  WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30'
) transfer1 ON transfer1.property_id = p.property_id

AND NOT EXISTS
(
  SELECT transfer2.transfer_id
  FROM
  (
    SELECT t.transfer_id, t.contract_date, t.transfer_price::integer, tp.property_id
    FROM transfer t
    LEFT JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
    WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30'
  ) AS transfer2
  WHERE transfer2.property_id = transfer1.property_id
  AND transfer2.contract_date > transfer1.contract_date
)

This works (I think) but is very slow.

I have found several similar queries under https://stackoverflow.com/questions/tagged/greatest-n-per-group, but most of the ones I found were self-joins on a single table, not joined relational tables as above.

I know in MySQL you can use User Variables, but I do not know how to do this in PostgreSQL, or if it is the ideal solution in this case.

Does anybody have any suggestions on how to improve this query (or even how to do it using a completely different method than mine above)?

Any help is very much appreciated. Thanks!

Regards,

Chris

PS: I have also tried variations on DISTINCT and MAX, but I am not convinced they picked the records with the most recent date the way I was using them.

EDIT: Sorry folks, I should also add that I am running my queries in pgAdmin 1.12.3.

Solution

"I want to return a list of unique property IDs for all Transfers that occurred between '2012-01-01' and '2012-06-30'."

To me, that reads as:

SELECT DISTINCT tp.property_id
FROM transfer t
JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30';

Now put that in a CTE or subquery, and you are done:

WITH x1 AS (
    SELECT DISTINCT tp.property_id AS property_id
    FROM transfer t
    JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
    WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30'
)
SELECT ...
FROM property p
JOIN x1 ON x1.property_id = p.property_id;

I don't understand the purpose of the NOT EXISTS subquery. Are you only interested in the MAX?

UPDATE: It appears (from the title) that you only want the max date per property. That can be done with your NOT EXISTS construct, or with MAX(...) and GROUP BY in the subquery, like this:

WITH m1 AS (
    -- GROUP BY already yields one row per property_id, so DISTINCT is not needed
    SELECT tp.property_id AS property_id
         , MAX(t.contract_date) AS contract_date
    FROM transfer t
    JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
    WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30'
    GROUP BY tp.property_id
)
SELECT ...
FROM property p
JOIN m1 ON m1.property_id = p.property_id;
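The MAX(...) form returns the latest date per property but not the rest of that transfer's row (transfer_id, transfer_price). If those columns are needed too, one option is PostgreSQL's DISTINCT ON. The following is only a sketch using the table and column names from the question; the alias "latest" and the use of transfer_id as a tie-breaker for transfers sharing a date are assumptions, not something the question specifies.

```sql
-- Sketch: one row per property, holding its most recent transfer in the range.
SELECT p.*, latest.transfer_id, latest.contract_date, latest.transfer_price
FROM property p
JOIN (
    SELECT DISTINCT ON (tp.property_id)
           tp.property_id,
           t.transfer_id,
           t.contract_date,
           t.transfer_price
    FROM transfer t
    JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
    WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30'
    -- DISTINCT ON keeps the first row of each property_id group, so the
    -- ORDER BY must lead with property_id and then sort newest date first.
    -- transfer_id DESC is an assumed tie-breaker for same-day transfers.
    ORDER BY tp.property_id, t.contract_date DESC, t.transfer_id DESC
) latest ON latest.property_id = p.property_id;
```

Unlike the NOT EXISTS version, which returns every transfer tied for the latest date, this keeps exactly one row per property.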

OTHER TIPS

Try using ROW_NUMBER() OVER (...) in PostgreSQL. Here is an example:

SELECT *
FROM property p
JOIN
(
  SELECT t.transfer_id, t.contract_date,
         t.transfer_price::integer, tp.property_id,
         ROW_NUMBER() OVER
           (PARTITION BY tp.property_id
            ORDER BY t.contract_date DESC) AS rn
  FROM transfer t
  LEFT JOIN transfer_property tp
        ON tp.transfer_id = t.transfer_id
  WHERE t.contract_date BETWEEN '2012-01-01'
                            AND '2012-06-30'
) transfer1
       ON transfer1.property_id = p.property_id
WHERE transfer1.rn = 1;
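Since the original complaint was speed, it may also be worth checking whether the filter and join columns are indexed. The question does not say which indexes exist, so the following is only a sketch under the assumption that there are none on these columns; the index names are hypothetical.

```sql
-- Hypothetical index names; skip any that duplicate indexes you already have.
CREATE INDEX transfer_contract_date_idx
    ON transfer (contract_date);
CREATE INDEX transfer_property_transfer_id_property_id_idx
    ON transfer_property (transfer_id, property_id);

-- Compare the plan and timing before and after creating the indexes.
EXPLAIN ANALYZE
SELECT DISTINCT tp.property_id
FROM transfer t
JOIN transfer_property tp ON tp.transfer_id = t.transfer_id
WHERE t.contract_date BETWEEN '2012-01-01' AND '2012-06-30';
```

Whether the planner actually uses these indexes depends on the data distribution, so the EXPLAIN ANALYZE output is the thing to trust rather than the sketch itself.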

Given the skeleton tables:

create table property( property_id serial primary key );

create table transfer(
    transfer_id serial primary key,
    contract_date date not null
);

create table transfer_property (
    property_id integer references property(property_id),
    transfer_id integer references transfer(transfer_id)
);

and data:

insert into property
select nextval('property_property_id_seq') 
from generate_series(1,10);

insert into transfer 
select nextval('transfer_transfer_id_seq'), 
       DATE '2012-01-01' + x * INTERVAL '1 month'
from generate_series(1,10) x;

-- Repeat this 4 or 5 times to produce a pile of duplicate entries
insert into transfer_property (transfer_id,property_id)
select transfer_id, property_id
from property cross join transfer
order by random()
limit 40;

use:

select distinct property_id
from transfer_property tp
inner join transfer t on (tp.transfer_id = t.transfer_id)
where t.contract_date between '2012-01-01' and '2012-06-30';

If this is inadequate or I have misinterpreted the question, please post sample data and a real schema that shows the meaningful relationships and the expected results.

Licensed under: CC-BY-SA with attribution
