Question

I am stuck on this one. I wish I could do it in pure SQL, but at this point any solution will do.

I have two tables, ta and tb, containing lists of events that occurred at approximately the same time. The goal is to find "orphan" records in ta that have no matching record in tb. E.g.:

create table ta ( dt date, id varchar(1));
insert into ta values( to_date('20130101 13:01:01', 'yyyymmdd hh24:mi:ss') , '1' );
insert into ta values( to_date('20130101 13:01:02', 'yyyymmdd hh24:mi:ss') , '2' );
insert into ta values( to_date('20130101 13:01:03', 'yyyymmdd hh24:mi:ss') , '3' );


create table tb ( dt date, id varchar(1));
insert into tb values( to_date('20130101 13:01:05', 'yyyymmdd hh24:mi:ss') , 'a' );
insert into tb values( to_date('20130101 13:01:06', 'yyyymmdd hh24:mi:ss') , 'b' );

Let's say I must use a threshold of ±5 seconds. The query to find candidate matches would then look something like:

  select
    ta.id ida,
    tb.id idb
  from
    ta, tb
  where 
    tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
  order by 1,2 

(fiddle: http://sqlfiddle.com/#!4/b58f7c/5)

The rules are:

  • Events are mapped 1 to 1
  • The closest event on tb for a given one in ta will be considered the correct mapping.

That said, the resulting query should return something like

IDA | IDB
1   | a
2   | b
3   | null  <-- orphan event

However, the sample query above shows exactly the issue I am having: when the time windows overlap, it is difficult to systematically choose the correct row.
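To make the overlap concrete, here is a minimal sketch (not part of the original query) that adds the gap in seconds to each candidate pair. The gap_seconds alias is an illustrative name. With the sample data above, every ta row is within 5 seconds of both tb rows, so all six pairings qualify.

```sql
-- Same candidate join as the sample query, with the distance made explicit.
-- Subtracting two Oracle DATEs yields days, so multiply by 86400 for seconds.
select ta.id ida,
       tb.id idb,
       round(abs(tb.dt - ta.dt) * 86400) gap_seconds  -- illustrative alias
  from ta
  join tb
    on tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
 order by ta.id, gap_seconds, tb.id;
```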

dense_rank() seems to be the answer to select the correct rows, but what partitioning/sorting will place them right?
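As a hedged illustration of why ranking within each ta row is not enough by itself: the sketch below uses row_number() (rather than dense_rank(), so ties are broken deterministically) to keep the closest tb row per ta.id. The names rn and gap_seconds are illustrative. Each ta row is ranked independently, so nothing stops several ta rows from claiming the same tb row. With the sample data all three ta rows end up paired with 'a', which breaks the 1-to-1 rule.

```sql
-- Closest tb candidate per ta row; this does NOT enforce one-to-one mapping.
select ida, idb, gap_seconds
  from (
        select ta.id ida,
               tb.id idb,
               round(abs(tb.dt - ta.dt) * 86400) gap_seconds,
               row_number() over (partition by ta.id
                                  order by abs(tb.dt - ta.dt), tb.id) rn
          from ta
          join tb
            on tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
       )
 where rn = 1
 order by ida;
```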

Worth mentioning: I am doing this on Oracle 11gR2.


Solution

It seems like this should be possible with a single SQL statement using Oracle's analytic functions, perhaps with some combination of row_number(), lag(), and max() over. But I simply couldn't wrap my head around it. I kept on wanting to embed one analytic function within another, and I don't think you can do that. You can go in steps using Common Table Expressions, but I couldn't figure out how to make it work.

But a procedural solution is fairly straightforward using PL/SQL along with an extra table to store your result. I use row_number() to assign a chronological rank to each row in each of your source tables. You want a deterministic result, so it's important to have a tie breaker in case you have duplicate date-times, hence my order by of dt, id. Here is a SQL-Fiddle demo.

Or look at the code below:

create table result ( 
  dif number, 
  ida varchar(1),
  idb varchar(1),
  dta date,
  dtb date
);

declare
  -- Highest row numbers matched so far in ta and tb; 0 means nothing matched yet.
  prevA integer := 0;
  prevB integer := 0;
begin
  for rec in (
    with 
    ordered_ta as (
      select dt dta,
             id ida,
             row_number() over (order by dt, id) rowNumA
        from ta
    ),
    ordered_tb as (
      select dt dtb,
             id idb, 
             row_number() over (order by dt, id) rowNumB 
        from tb
    )
    select ta.*,
           tb.*,
           abs(dta - dtb) * 86400 dif
      from ordered_ta ta
      join ordered_tb tb
        on dtb between (dta - 5/86400) and (dta + 5/86400)
     order by rowNumA, rowNumB
  )
  loop
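    -- Accept a pair only if both rows come after the last matched pair,
    -- so each ta row and each tb row is used at most once.
    if rec.rowNumA > prevA and rec.rowNumB > prevB then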
      prevA := rec.rowNumA;
      prevB := rec.rowNumB;
      insert into result values (
        rec.dif,
        rec.ida,
        rec.idb,
        rec.dta,
        rec.dtb
      );
    end if;
  end loop;
end;
/

-- Matched pairs first, then orphans from ta, then orphans from tb.
select * from result
union all
select null dif, id ida, null idb, dt dta, null dtb
  from ta
 where id not in (select ida from result)
union all
select null dif, null ida, id idb, null dta, dt dtb
  from tb
 where id not in (select idb from result)
;
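
If only the IDA | IDB shape from the question is needed, a left join from ta to the result table is one way to get it. This is a sketch that assumes the PL/SQL block above has already populated result. With the sample data the loop should keep the pairs 1/a and 2/b, leaving 3 with a null idb.

```sql
-- One row per ta event; idb is null when the event has no match (an orphan).
select ta.id ida,
       r.idb
  from ta
  left join result r
    on r.ida = ta.id
 order by ta.id;
```

Because the loop inserts each ta row at most once, the left join should not duplicate any ta row.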
Licensed under: CC-BY-SA with attribution