Question

There are two ways to do a multi-table query:

Query 1:

select t1.a, t2.b from t1, t2 where t1.a = t2.a

Query 2:

For each row returned by:

select t1.a from t1

run another query, substituting that row's value for '??':

select t2.b from t2 where t2.a = '??'

Which one has better performance when the tables are very large?


Answer

You should always let the DBMS do as much work in a single query as possible.

The DBMS knows how many tuples each table contains, and it has a way to estimate the number of tuples the result will have. Modern DBMSs have very complex algorithms responsible for finding the most efficient way to execute any query (the planner).
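Most systems let you ask the planner what it decided. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS. The schema, the index name t2_a and the explain helper are made up for illustration, and the plan text differs between SQLite versions and between database systems:

```python
import sqlite3

# In-memory database with the two tables from the question (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (a INTEGER, b TEXT);
    CREATE INDEX t2_a ON t2 (a);
""")


def explain(conn, sql):
    """Return the plan steps SQLite reports for a query, as strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row is the human-readable description of the step.
    return [row[-1] for row in rows]


plan = explain(conn, "select t1.a, t2.b from t1, t2 where t1.a = t2.a")
for step in plan:
    print(step)
```

Other systems have their own equivalent of this statement, so check your DBMS's documentation for the exact syntax.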

Unless you know what you are doing and why you are doing it (i.e. you know your algorithm will run faster than the DBMS's and, more importantly, why), you should just let the DBMS do its job.

Answering your question more precisely:

Your query #1 can be answered using various methods, depending on the size of the tables. Let us assume that both are HUGE. One way to evaluate it is a sort-based join: you sort both tables on the join attribute and then merge them. The cost is roughly the time it takes to run merge sort on each table. Each page of each table will be read and written a few times (depending on how much buffer space is available in the DBMS), so each tuple in T1 and T2 will be read or written, say, a dozen times.
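The sort-then-merge idea can be sketched in a few lines. This is an in-memory illustration only: a real DBMS sorts pages on disk with an external merge sort, and the function name and sample rows below are made up:

```python
def sort_merge_join(t1, t2):
    """Join t1 (a list of a-values) with t2 (a list of (a, b) pairs) on a."""
    left = sorted(t1)
    right = sorted(t2, key=lambda row: row[0])
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j][0]:
            i += 1          # left key too small, advance left
        elif left[i] > right[j][0]:
            j += 1          # right key too small, advance right
        else:
            key = left[i]
            # Find the end of the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair every left row with this key against that whole run.
            while i < len(left) and left[i] == key:
                for k in range(j, j_end):
                    result.append((key, right[k][1]))
                i += 1
            j = j_end
    return result


t1 = [3, 1, 2, 2]
t2 = [(2, "x"), (3, "y"), (2, "z"), (4, "w")]
joined = sort_merge_join(t1, t2)
print(joined)
```

After the sort, each input is walked once from start to end, which is why the sorting dominates the cost.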

If we implement your method, there will be as many queries as there are tuples in T1. Let us assume T2 does not have an index; then each of those queries has to read every tuple in T2, so T2 is scanned in full once per tuple in T1.

If you have an index on T2, you can expect to read only a few pages for each tuple in T1. So the cost of your approach is the cost of reading T1 plus, for each tuple in T1, a few page reads (2-5) to find the matching tuples in T2.
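To put rough numbers on these estimates, here is a back-of-the-envelope sketch. The table sizes are invented, the formulas are the simplified ones described above (counting page reads only), and the defaults of 3 pages per index lookup and 12 passes for the sort are assumptions taken from the "2-5" and "a dozen" figures:

```python
def cost_without_index(t1_rows, t1_pages, t2_pages):
    """Read T1 once, then scan all of T2 for every tuple in T1."""
    return t1_pages + t1_rows * t2_pages


def cost_with_index(t1_rows, t1_pages, pages_per_lookup=3):
    """Read T1 once, then do one index lookup in T2 per tuple in T1."""
    return t1_pages + t1_rows * pages_per_lookup


def cost_sort_merge(t1_pages, t2_pages, passes=12):
    """Every page of both tables is read or written about `passes` times."""
    return passes * (t1_pages + t2_pages)


# Both tables huge (hypothetical sizes).
no_index = cost_without_index(1_000_000, 10_000, 50_000)
with_index = cost_with_index(1_000_000, 10_000)
merge = cost_sort_merge(10_000, 50_000)
print(no_index, with_index, merge)

# T1 tiny, T2 huge: the per-tuple lookup now needs far fewer page reads.
small_lookup = cost_with_index(100, 1)
small_merge = cost_sort_merge(1, 50_000)
print(small_lookup, small_merge)
```

These figures ignore caching and the per-query overhead, so treat them as orders of magnitude rather than predictions.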

If T1 is very small and T2 is very large, the strategy of query 2 will be faster than the sort-based join! But the DBMS will discover that and will execute EXACTLY your algorithm to answer Q1 (it is known as a loop-based join, or nested-loop join). Furthermore, each query you send to the DBMS takes time to be processed (an overhead that method 1 does not have).
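The per-query overhead is easy to see by counting statements. The following sketch runs both approaches from the question against an in-memory SQLite database. The sample rows and function names are made up, and the '??' placeholder is bound as a parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (a INTEGER, b TEXT);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 'x'), (2, 'y'), (2, 'z'), (4, 'w');
""")


def single_query(conn):
    """Query 1: one statement, the DBMS performs the join."""
    rows = conn.execute(
        "select t1.a, t2.b from t1, t2 where t1.a = t2.a"
    ).fetchall()
    return rows, 1  # exactly one statement was sent


def query_per_row(conn):
    """Query 2: one statement for t1, then one more per row of t1."""
    statements = 1
    rows = []
    for (a,) in conn.execute("select t1.a from t1").fetchall():
        statements += 1
        for (b,) in conn.execute("select t2.b from t2 where t2.a = ?", (a,)):
            rows.append((a, b))
    return rows, statements


rows1, n1 = single_query(conn)
rows2, n2 = query_per_row(conn)
print(n1, n2)
```

Both functions return the same rows, but the second sends one statement per tuple in T1 on top of the initial one. With a client/server DBMS, each of those statements also costs a network round trip.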

This is a common mistake among inexperienced DBMS programmers: let the DB do a little work, then do some more work for every tuple.

Instead, you should think in terms of letting the DBMS do all the processing in as few queries as possible. It will pay off in performance.

Finally, if you are really interested in performance, grab the documentation of your favorite DBMS and read how it does query evaluation and how you can improve it.

--dmg

License: CC-BY-SA with attribution
Not affiliated with StackOverflow