Question

As title.

I have seen this discussed, with people saying that we should consider UNION ALL first for performance. My question is: which one should I use when I want to eliminate duplicate records? In the source of our project, I have seen that a developer uses UNION ALL to fetch the records and then filters out the duplicates in Java. Is that necessary? Would UNION be the better choice in this case?

Thanks!

Solution

In the past (mid-nineties) some RDBMS engines had very poor implementations of UNION, so a suggestion to switch to UNION ALL and filter on the client very often paid off. The performance of UNION has been optimized since then, however, so with modern RDBMSs the decision has to be made case by case:

  • When you do a UNION, the database must eliminate duplicates for you. If the number of records returned from a query is small (say, a few hundred to a thousand), it does not matter where the duplicates are eliminated, so you might as well do it on the RDBMS side.
  • When the number of records gets into the tens of thousands, you may be able to eliminate duplicates in a way that is smarter than the RDBMS's by exploiting specific properties of your data. In this case you would use UNION ALL.
  • If the number of rows is large and the share of duplicates is very large (say, you UNION ALL from five tables, with 70% of rows being duplicates), it may be better to save on network bandwidth and client memory by having the RDBMS eliminate duplicates, reducing the size of the data transferred back to you by 70%.
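
The two options in the bullets above can be put side by side in a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database only so that the example is self-contained; the tables `a` and `b` and their contents are made up for illustration, and the same pattern would apply from Java over JDBC.

```python
import sqlite3

# Hypothetical sample data: rows (2, 'bob') and (3, 'cy') exist in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO b VALUES (2, 'bob'), (3, 'cy'), (4, 'dee');
""")

# Option 1: the database removes duplicates, so only distinct rows are transferred.
union_rows = conn.execute(
    "SELECT id, name FROM a UNION SELECT id, name FROM b"
).fetchall()

# Option 2: transfer every row, then remove duplicates on the client.
all_rows = conn.execute(
    "SELECT id, name FROM a UNION ALL SELECT id, name FROM b"
).fetchall()
client_rows = list(dict.fromkeys(all_rows))  # keeps the first occurrence of each row

print(len(all_rows), "rows transferred with UNION ALL")
print(len(union_rows), "rows transferred with UNION")
```

With this sample data, UNION ALL transfers 6 rows and UNION transfers 4; after client-side de-duplication both approaches hold the same 4 distinct rows. The difference is where the work happens and how much data crosses the wire.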

To summarize, there is no universal answer. You need to do some calculations and profile your queries before making a decision one way or the other.
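
As a rough sketch of what such profiling could look like, the snippet below times both approaches against an in-memory SQLite database using Python's standard library. The tables `t1` and `t2`, the row counts, and the helper names are assumptions chosen for illustration. An in-memory database has no network cost, so the numbers it prints say nothing about transfer savings; a real measurement should run against your actual RDBMS and data.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (v INTEGER)")
conn.execute("CREATE TABLE t2 (v INTEGER)")
# Hypothetical data: the two ranges overlap, so 10,000 values appear in both tables.
conn.executemany("INSERT INTO t1 VALUES (?)", ((i,) for i in range(0, 20000)))
conn.executemany("INSERT INTO t2 VALUES (?)", ((i,) for i in range(10000, 30000)))


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def dedup_in_db():
    # The database eliminates duplicates before returning rows.
    return conn.execute("SELECT v FROM t1 UNION SELECT v FROM t2").fetchall()


def dedup_on_client():
    # The database returns every row; the client eliminates duplicates.
    rows = conn.execute("SELECT v FROM t1 UNION ALL SELECT v FROM t2").fetchall()
    return list(set(rows))


db_rows, db_secs = timed(dedup_in_db)
client_rows, client_secs = timed(dedup_on_client)
print(f"UNION:              {len(db_rows)} rows in {db_secs:.4f}s")
print(f"UNION ALL + client: {len(client_rows)} rows in {client_secs:.4f}s")
```

A single run like this is noisy; repeating each measurement several times and comparing medians gives a more trustworthy picture.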

OTHER TIPS

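According to the SQL specification:

  • UNION ALL returns all rows from both queries, duplicates included. The rows usually come back in the order the queries produce them, but no order is guaranteed without an ORDER BY.
  • UNION removes duplicates. Row order is likewise unspecified (the output often happens to be sorted as a side effect of de-duplication).

So use UNION to remove duplicates, and add an ORDER BY if you depend on the row order.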
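
A small sketch of the difference, using Python's built-in sqlite3 module and two made-up tables `x` and `y`. An explicit ORDER BY is added to both queries so that the printed order does not depend on how a particular engine happens to evaluate the union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (n INTEGER);
    CREATE TABLE y (n INTEGER);
    INSERT INTO x VALUES (3), (1), (2);
    INSERT INTO y VALUES (2), (3), (5);
""")

# UNION ALL keeps every row from both queries, including repeats.
with_dupes = conn.execute(
    "SELECT n FROM x UNION ALL SELECT n FROM y ORDER BY n"
).fetchall()

# UNION returns each distinct row once.
distinct = conn.execute(
    "SELECT n FROM x UNION SELECT n FROM y ORDER BY n"
).fetchall()

print(with_dupes)  # [(1,), (2,), (2,), (3,), (3,), (5,)]
print(distinct)    # [(1,), (2,), (3,), (5,)]
```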


The reason to avoid UNION when you do not need duplicates removed is that the database typically removes them by sorting (or, in some engines, hashing) the result set. Sorting can be expensive, particularly for large result sets.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow