Left outer join on two columns performance issue

https://stackoverflow.com/questions/444820

22-07-2019

Question

I'm using a SQL query that is similar to the following form:

SELECT col1, col2
FROM table1
LEFT OUTER JOIN table2
ON table1.person_uid = table2.person_uid
AND table1.period = table2.period

And it's either way too slow or something's deadlocking because it takes at least 4 minutes to return. If I were to change it to this:

SELECT col1, col2
FROM table1
LEFT OUTER JOIN table2
ON table1.person_uid = table2.person_uid
WHERE table1.period = table2.period

then it works fine (albeit not returning the right number of rows). Is there any way to speed this up?

UPDATE: It does the same thing if I switch the last two lines of the latter query:

SELECT col1, col2
FROM table1
LEFT OUTER JOIN table2
ON table1.period = table2.period
WHERE table1.person_uid = table2.person_uid

UPDATE 2: These are actually views that I'm joining. Unfortunately, they're on a database I don't have control over, so I can't (easily) make any changes to the indexing. I am inclined to agree that this is an indexing issue though. I'll wait a little while before accepting an answer in case there's some magical way to tune this query that I don't know about. Otherwise, I'll accept one of the current answers and try to figure out another way to do what I want to do. Thanks for everybody's help so far.

Solution

Bear in mind that statements 2 and 3 are different to the first one.

How? Well, you're doing a left outer join and your WHERE clause isn't taking that into account (like the ON clause does). At a minimum, try:

SELECT col1, col2
FROM table1, table2
WHERE table1.person_uid = table2.person_uid (+)
AND table1.period = table2.period (+)

and see if you get the same performance issue.

What indexes do you have on these tables? Is this relationship defined by a foreign key constraint?

What you probably need is a composite index on both person_uid and period (on both tables).
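As a rough illustration of what a composite index changes, here is a minimal sketch using Python's built-in sqlite3 module. It is not Oracle, so the plan text and optimizer behaviour will differ from what you see on your database; the table layouts and the index name idx_t2_person_period are made up for the example.

```python
import sqlite3

# Assumption: tiny in-memory stand-ins for table1/table2 (SQLite, not Oracle).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (person_uid INTEGER, period TEXT, col1 TEXT);
    CREATE TABLE table2 (person_uid INTEGER, period TEXT, col2 TEXT);
""")

QUERY = """
    SELECT col1, col2
    FROM table1
    LEFT OUTER JOIN table2
      ON table1.person_uid = table2.person_uid
     AND table1.period = table2.period
"""

def plan(sql):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The descriptive text is the last column of each plan row.
    return [row[-1] for row in rows]

plan_before = plan(QUERY)

# Hypothetical composite index covering both join columns on the right-hand table.
conn.execute("CREATE INDEX idx_t2_person_period ON table2 (person_uid, period)")

plan_after = plan(QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

After the index is created, the plan step for table2 should mention idx_t2_person_period instead of a scan or a temporary automatic index. On Oracle you would check the same thing with its own execution plan tooling.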

OTHER TIPS

I think you need to understand why the last two are not the same query as the first one. If you do a left join and then add a WHERE clause referencing a field in the table on the right side of the join (the one which may not always have a record to match the first table), then you have effectively changed the join to an inner join. There is one exception to this, and that is if you reference something like

SELECT col1, col2
FROM table1
LEFT OUTER JOIN table2
ON table1.person_uid = table2.person_uid
WHERE table2.person_uid is null

In this case you are asking for the records which don't have a matching record in the second table. But other than this special case, you are changing the left join to an inner join if you reference a field from table2 in the WHERE clause.
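To make the difference concrete, here is a small sketch using Python's built-in sqlite3 module with made-up rows (the data and period values are assumptions, and SQLite stands in for Oracle). It runs the ON version, the WHERE version, a plain inner join, and the IS NULL special case side by side.

```python
import sqlite3

# Assumption: three people in table1, only some of whom match in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (person_uid INTEGER, period TEXT);
    CREATE TABLE table2 (person_uid INTEGER, period TEXT);
    INSERT INTO table1 VALUES (1, '2019A'), (2, '2019A'), (3, '2019B');
    INSERT INTO table2 VALUES (1, '2019A'), (2, '2018B');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Both conditions in ON: every table1 row survives, unmatched ones get NULLs.
on_rows = run("""
    SELECT table1.person_uid, table2.period
    FROM table1
    LEFT OUTER JOIN table2
      ON table1.person_uid = table2.person_uid
     AND table1.period = table2.period
    ORDER BY table1.person_uid
""")

# Second condition moved to WHERE: rows where table2.period is NULL or
# different are filtered out after the join.
where_rows = run("""
    SELECT table1.person_uid, table2.period
    FROM table1
    LEFT OUTER JOIN table2
      ON table1.person_uid = table2.person_uid
    WHERE table1.period = table2.period
    ORDER BY table1.person_uid
""")

# A plain inner join on both columns, for comparison.
inner_rows = run("""
    SELECT table1.person_uid, table2.period
    FROM table1
    INNER JOIN table2
      ON table1.person_uid = table2.person_uid
     AND table1.period = table2.period
    ORDER BY table1.person_uid
""")

# The special case: keep only table1 rows with no partner in table2.
anti_rows = run("""
    SELECT table1.person_uid
    FROM table1
    LEFT OUTER JOIN table2
      ON table1.person_uid = table2.person_uid
    WHERE table2.person_uid IS NULL
""")

print(on_rows)     # all three people, two with a NULL period
print(where_rows)  # only person 1
print(inner_rows)  # same as where_rows
print(anti_rows)   # only person 3
```

With this data the WHERE version returns exactly what the inner join returns, while the ON version keeps all three rows from table1.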

If your query is not fast enough, I would look at your indexing.

Anything anyone tells you based on the information you provided is a guess.

Look at the execution plan for the query. If you don't see a reason for the slowness in the plan, then post the plan here.

http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009

Do you have covering indexes on person_uid and period for both tables?

If not, add them and try again.

Take a look at the execution plan and see what the query is actually doing.

Also: What are the datatypes of the fields? Are they the same in both tables? An implicit cast can really slow things down.

Do these tables have indexes on the columns you're joining? Install Oracle's free SQLDeveloper product and use it to do an "explain" on that query and see if it's doing sequential scans of both tables.

In a left join you'd be scanning table1 for each unique combination of (person_uid,period) then searching table2 for all corresponding records there. If table2 doesn't have an appropriate index, this can involve scanning the whole of that table too.

My best guess, without seeing an execution plan, is that the first query (the only one which seems to be correct) is having to table scan table2 as well as table1.

As you say that you can't change the indexes, you need to change the query. As far as I can tell, there is only one realistic alternative...

SELECT
   col1, col2
FROM
   table2
FULL OUTER JOIN
   table1
      ON table1.person_uid = table2.person_uid
      AND table1.period = table2.period
WHERE
   table1.person_uid IS NOT NULL

The hope here is that you scan table2 for each unique combination of (person_uid, period), but make use of indexes on table1. (As opposed to scanning table1 and making use of indexes on table2, which is what I expected from your query.)

If table1 doesn't have appropriate indexes, however, you'll be very unlikely to see any performance improvement at all...

Dems.

In one of the updates the OP states that he is actually querying views, not tables. In that case, performance could well be improved by querying the underlying tables directly, especially if the views are complex and join to many other tables that contain no information he needs, or if they are views that call other views.

ANSI join syntax draws a very clear distinction between JOIN conditions and FILTER predicates; this is very important when writing outer joins. Using the emp/dept tables, look at the results from the following two outer joins:

Q1
SELECT dname, d.deptno, e.ename, e.mgr, d.loc
FROM dept d
LEFT OUTER JOIN emp e
on  d.deptno = e.deptno
and loc in ('NEW YORK','BOSTON' )
;

DNAME              DEPTNO ENAME             MGR LOC
-------------- ---------- ---------- ---------- -------------
ACCOUNTING             10 CLARK            7839 NEW YORK
ACCOUNTING             10 KING                  NEW YORK
ACCOUNTING             10 MILLER           7782 NEW YORK
RESEARCH               20                       DALLAS
SALES                  30                       CHICAGO
OPERATIONS             40                       BOSTON

====

Q2
SELECT dname, d.deptno, e.ename, e.mgr, d.loc
FROM dept d
LEFT OUTER JOIN emp e
on  d.deptno = e.deptno
where loc in ('NEW YORK','BOSTON' )
;

DNAME              DEPTNO ENAME             MGR LOC
-------------- ---------- ---------- ---------- -------------
ACCOUNTING             10 CLARK            7839 NEW YORK
ACCOUNTING             10 KING                  NEW YORK
ACCOUNTING             10 MILLER           7782 NEW YORK
OPERATIONS             40                       BOSTON

The first example, Q1, is an example of "joining on a constant". Essentially, the filter condition is applied prior to performing the outer join, so you eliminate rows which are subsequently added back as part of the outer join. It's not necessarily wrong, but is that the query that you really asked for? Often it is the results shown in Q2 that are required, where the filter is applied after the (outer) join.
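If you want to reproduce the Q1/Q2 difference without an Oracle instance, the following sketch uses Python's built-in sqlite3 module. The dept rows match the output above; the emp rows are a cut-down sample (SMITH and ALLEN are included only so that departments 20 and 30 have employees), so treat the data as an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER, dname TEXT, loc TEXT);
    CREATE TABLE emp  (ename TEXT, mgr INTEGER, deptno INTEGER);
    INSERT INTO dept VALUES
        (10, 'ACCOUNTING', 'NEW YORK'),
        (20, 'RESEARCH',   'DALLAS'),
        (30, 'SALES',      'CHICAGO'),
        (40, 'OPERATIONS', 'BOSTON');
    -- Assumption: a small subset of employees, enough to show the effect.
    INSERT INTO emp VALUES
        ('CLARK',  7839, 10),
        ('KING',   NULL, 10),
        ('MILLER', 7782, 10),
        ('SMITH',  7902, 20),
        ('ALLEN',  7698, 30);
""")

# Q1: the location test is part of the join condition, so every dept row
# is kept and only the matching of emp rows is restricted.
q1 = conn.execute("""
    SELECT d.dname, e.ename
    FROM dept d
    LEFT OUTER JOIN emp e
      ON d.deptno = e.deptno
     AND d.loc IN ('NEW YORK', 'BOSTON')
    ORDER BY d.deptno, e.ename
""").fetchall()

# Q2: the location test is a filter applied after the outer join, so
# departments in other locations disappear from the result.
q2 = conn.execute("""
    SELECT d.dname, e.ename
    FROM dept d
    LEFT OUTER JOIN emp e
      ON d.deptno = e.deptno
    WHERE d.loc IN ('NEW YORK', 'BOSTON')
    ORDER BY d.deptno, e.ename
""").fetchall()

print(len(q1), q1)
print(len(q2), q2)
```

Q1 keeps RESEARCH and SALES with empty employee columns, even though those departments do have employees in the emp table; Q2 drops those departments altogether.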

There is also a performance implication for large data sets. In many cases, joining on a constant has to be resolved internally by the optimizer by creating a lateral view, which can usually only be optimized via a nested loop join rather than a hash join.

For developers who are familiar with the Oracle outer join syntax, the query would probably have been written as

SELECT dname, d.deptno, e.ename, e.mgr, d.loc
FROM dept d
        ,emp e
where  d.deptno = e.deptno(+)
and loc in ('NEW YORK','BOSTON' )

This query is semantically equivalent to Q2 above.

So in summary, it's extremely important that you understand the difference between the JOIN clause and the WHERE clause when writing ANSI outer joins.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow