join two Oracle tables, date to not overlapping date ranges

https://stackoverflow.com/questions/10911032

12-06-2021
|

题

I have two tables:

trips: id_trip, id_object, trip_date, delta (8980026 rows)
ranges: id_range, id_object, date_since, date_until (18490 rows)

I need to optimize the following select statement

  select r.id_range, sum(t.delta) sum_deltas
    from trips t,
         ranges r
   where t.id_object = r.id_object
     and t.trip_date between r.date_since and r.date_until
group by r.id_range

according to the condition there is always exactly one matching row for trip in 'ranges'

the trips table is constantly growing, but there are no updates or deletions
table ranges may change from time to time in any way (deletions, updates, inserts), so function based index is not the way :(
there are indexes on id_object (in both tables) and date_since (in trips)

Does anyone have an idea how to speed things up, is it even possible?

解决方案

It's always possible to speed things up; it just may not be worth the time / effort / money / disk-space / additional overheads etc.

Firstly please use the explicit join syntax. It's been the SQL standard for a few decades now and it helps avoid a lot of potential errors. Your query would become:

select r.id_range, sum(t.delta) sum_deltas
  from trips t
  join ranges r
    on t.id_object = r.id_object
   and t.trip_date between r.date_since and r.date_until
 group by r.id_range

This query would imply that you need two indexes - unique if possible. On ranges you should have an index on id_object, date_since, date_until. The index on trips would be id_object, trip_date. If trips were smaller I might consider adding delta on to the end of that index so you never enter the table at all but only do a index scan. As it stands you're going to have to do a table access by index rowid.

Having written all that your problem may be slightly different. You're going to be full-scanning both tables with this query. Your problem might be the indexes. If the optimizer is using the indexes then it's possible you're doing an index unique/range scan for each id_object in trips or ranges and then, because of the use of columns not in the indexes you will be doing an table access by index rowid. This can be massively expensive.

Try adding a hint to force a full-scan of both tables:

select /*+ full(t) full(r) */ r.id_range, sum(t.delta) sum_deltas
  from trips t
  join ranges r
    on t.id_object = r.id_object
   and t.trip_date between r.date_since and r.date_until
 group by r.id_range

其他提示

You may want to look at your data segmentation (i.e. partition your data by certain dates, causing the query to only hit the appropriate partitions) and indices, these could probably speed up the querying process.

Also, you could consider a data warehouse... You say Trips never gets updated or deleted so it is an ideal candidate for denormalization into a data structure more suited to report generation and ad-hoc queries.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow