Should an ORDER BY ever affect the number of rows returned a SELECT query?

https://stackoverflow.com/questions/10524949

sql
ingres

07-06-2021
|

Question

I am going through the pain of normalising a horrible legacy database, and have discovered what I think is a bug with the DBMS.

This query returns the results as I expect:

SELECT DISTINCT RIGHT(SQUEEZE(thing_id), 2) AS thing_id, TRIM(thing_name)
FROM thing
ORDER BY thing_id, thing_name;
(16 rows)

The first time I ran the query, I had inadvertently used the wrong columns in the ORDER BY, as below:

SELECT DISTINCT RIGHT(SQUEEZE(thing_id), 2) AS thing_id, TRIM(thing_name)
FROM thing
ORDER BY thing_name, location;
(33 rows)

Note that the only thing to change is the ORDER BY, and the number of rows returned increases from 16 to 33. The results it gives are not DISTINCT as the query specifies.

I believe this is an outright bug, but a coworker says that it is normal because when we order by "location", it is selected an included invisibly in the results.

Should the ORDER BY ever affect the number of rows returned in a SELECT query?

EDIT: I had another person look at the queries AND I copied the queries to two seperate files, then ran a diff command on them. It is 100% certain that the only difference between the two queries is the columns listed in the ORDER BY.

UPDATE: Ingres have since release patch 14301 with bugfix: "Bug 126640 (GENERIC) Query with order-by expression and distinct aggregate returns more rows than expected. The columns in the order-by expression are not in the select list."

i.e. The query in question will now result in an error as the results are not correct.

Solution

Since the order of operations in SQL is:

FROM > WHERE > GROUP BY > HAVING > SELECT > ORDER BY

this does seem to be a bug indeed. Which DBMS is it?

OTHER TIPS

The issue I see is that the second query has a column (location) in the ORDER BY that is not included in the SELECT DISTINCT list. Actually both of the queries are invalid SQL (despite that Ingres seems to allow them). I simplified them (so the first is ok) to:

Query one (valid SQL):

SELECT DISTINCT 
      thing_id 
    , thing_name
FROM thing
ORDER BY thing_id
       , thing_name ;

Query two (invalid SQL, should produce error):

SELECT DISTINCT 
      thing_id 
    , thing_name
FROM thing
ORDER BY thing_name
       , location;

Why should it give error? Because the ORDER BY should be processed after SELECTand DISTINCT. So, two or more rows in the original table may have same thing_id and thing_name but different location. These rows will be collapsed into one. So, there is no location value to be useed for the ordering. Even if there was (a hidden location valiue) kept, which one of the many should it be?

SELECT DISTINCT queries can be rewritten with SELECT ALL and GROUP BY (also invalid in this case):

SELECT ALL
      thing_id 
    , thing_name
FROM thing
GROUP BY thing_id 
       , thing_name
ORDER BY thing_name
       , location;

The above (query 2) actually does produce error in PostgreSQL, SQL-Server and Oracle. Test it in SQL-Fiddle

From the erroneous number of rows returned by the second query in Ingres, I guess that what happens behind the scenes is that he location is secretly kept in the SELECT list so it can be used for the ORDER BY and then removed. This, in combination with the DISTINCT results in the non-standard buggy behaviour:

SELECT DISTINCT 
      thing_id 
    , thing_name
   (, location         --- hidden column) 
FROM thing
ORDER BY thing_name
       , location;

You can call it either a bug or a feature, it doesn't matter, as long as you know it should really not be allowed in the first place.

It seems that a similar issue has been reported a year ago in Actian forum: Problem with DISTINCT + ORDER BY and supposedly fixed. No idea which version they are refering to or if it has been actually fixed or not (and what "fix" means).

If you want the query to be valid and behave as you expect, you could use somethign like this:

SELECT
      RIGHT(SQUEEZE(thing_id), 2)  AS squeezed_thing_id
    , TRIM(thing_name)             AS trimmed_thing_name 
    , MIN(location)                AS a_location            --- or MAX()
FROM 
    thing
GROUP BY 
      RIGHT(SQUEEZE(thing_id), 2)
    , TRIM(thing_name)     
ORDER BY 
      trimmed_thing_name
    , a_location ;

No, order by should not affect the number of rows returned.

If what you're saying is true that would be a very strange and fundamental bug. I would triple (then quadruple) check the return columns on both queries to make absolutely sure that this isn't a simple user error before emailing the Ingres guys.

Edit

Actually, I would also try running the queries on a backup database. If you don't have a backup available, you might be able to get away with copying the database that you have (if Ingres supports that).

I've had queries returning nonsensical results on machines with faulty hardware before. That's definitely going to be something you want to check.

If you add location to the first query, then both will return the same number of rows. The reason why the second query returns more rows is because in that query you do not select location. As stated previously, some RDBMS do not allow it, Ingres does. Probably because the SQL standard does not forbid it ?.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow