The Relational Model & Queries That Naturally Return Duplicate Rows

https://stackoverflow.com/questions/3891119

28-09-2019

Question

It's commonly understood that in the relational model:

Every relational operation should yield a relation.
Relations, being sets, cannot contain duplicate rows.

Imagine a 'USERS' relation that contains the following data.

ID FIRST_NAME LAST_NAME
 1 Mark       Stone
 2 Jane       Stone
 3 Michael    Stone

If someone runs a query select LAST_NAME from USERS, a typical database will return:

LAST_NAME
Stone
Stone
Stone

Since this is not a relation - because it contains duplicate rows - what should an ideal RDBMS return?

Solution

"But some information is lost - that there are 3 users with that last name."

If the count of users with that name is what you are interested in, then the query of your example is not the question you should be asking.

The query in your example answers the question "What are all the last names such that there exists a user who has that last name?".

If the question you want to ask is "how many users are there that are named 'Stone'", then the query you should submit is Select count(...) from users where last_name = 'Stone';
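As a minimal sketch of that counting query, the following uses Python's built-in sqlite3 module with an in-memory copy of the USERS table from the question. SQLite and COUNT(*) as the counted expression are assumptions made for illustration; the answer above leaves the argument of count unspecified.

```python
import sqlite3

# In-memory database holding the USERS rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Mark", "Stone"), (2, "Jane", "Stone"), (3, "Michael", "Stone")],
)

# COUNT(*) is an assumed choice: it counts the rows that satisfy the WHERE clause.
(stone_count,) = conn.execute(
    "SELECT COUNT(*) FROM users WHERE last_name = ?", ("Stone",)
).fetchone()
print(stone_count)  # 3
conn.close()
```

The result is a single row holding the number 3, so the "lost" information is recovered by asking for it directly.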

Projection always "loses" information, namely the information tied to the attributes that are projected away. I don't see how a known property of a useful relational operator can be used as an argument against that operator.

OTHER TIPS

In an RDBMS, a relational projection on the last name column alone would return a set of tuples with distinct values of last name. There would be no duplicate tuples.
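To illustrate, here is a rough sketch that models a relation as a Python set of tuples. The project helper is hypothetical, written only for this example, and addressing attributes by position is a simplification (the relational model addresses them by name).

```python
# A relation modeled as a set of tuples (ID, FIRST_NAME, LAST_NAME).
users = {
    (1, "Mark", "Stone"),
    (2, "Jane", "Stone"),
    (3, "Michael", "Stone"),
}


def project(relation, *positions):
    """Hypothetical helper: project a relation onto the given attribute positions."""
    # The set comprehension collapses identical tuples, as a set must.
    return {tuple(row[i] for i in positions) for row in relation}


last_names = project(users, 2)
print(last_names)  # {('Stone',)}
```

Because the result is itself a set, the three source tuples collapse into the single tuple ('Stone',) with no extra step.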

In SQL, it is true that you get duplicates unless you specify the DISTINCT keyword. That's because SQL is not a truly relational language, among other reasons because SQL tables and table expressions are not proper relations. An SQL DBMS is not an RDBMS.
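The difference is easy to observe. The sketch below assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for "a typical database" and loads the USERS rows from the question into an in-memory table.

```python
import sqlite3

# In-memory database holding the USERS rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Mark", "Stone"), (2, "Jane", "Stone"), (3, "Michael", "Stone")],
)

# Plain SELECT: one output row per source row (a bag, not a set).
plain_rows = conn.execute("SELECT last_name FROM users").fetchall()

# SELECT DISTINCT: duplicates removed, which matches relational projection.
distinct_rows = conn.execute("SELECT DISTINCT last_name FROM users").fetchall()

print(plain_rows)     # [('Stone',), ('Stone',), ('Stone',)]
print(distinct_rows)  # [('Stone',)]
conn.close()
```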

"what should an ideal RDBMS return?"

As David indicated, it should return (in your example) one single row.

An SQL DBMS is only a relational one if it treats every SELECT as if SELECT DISTINCT were requested. (But there are a few tiny additional conditions to be met too.)

The reason is that the "meaning" of that single row is as follows: "There exists some user who has a first_name, has an ID, and whose last_name is 'Stone'".

There is never any logical need to repeat that statement a second time. The authoritative reference that you asked for is Ted Codd himself: "If something is true, then saying it twice won't make it any truer."

I'm not sure I see a problem with the returned values. There are three records that contain "Stone" for LAST_NAME. This would have been obvious if FIRST_NAME or ID had been included in the query, but it was not. Usually, the DISTINCT keyword is used to handle this and ensure that there will be no duplicates.

In fact, if my database started applying DISTINCT automatically (which it sounds like you think maybe it should), I'd be somewhat annoyed. Seeing duplicate rows when you don't expect them is often the break you need when debugging some weird data problem in a database.

I would argue that your original query did not return duplicate rows. It returned 3 separate rows of data from the database, of which you included only the last name column. I would say that your question is not phrased correctly, and that is why RDBMSs behave the way they do (which, I would also argue, is the correct way).

To translate your query:

select LAST_NAME from USERS

into English, it would be:

"tell me the last name of all the users"

If I went into a high school gym class and asked the teacher, "Using your class list sheet, tell me the last name of all the students in your class", and there were twin brothers in the class, I would expect him to list their last name twice (or at least to ask me whether he should). He would just go down the list of people in the class and read off their last names.

If you wanted to ask the question "what are the different last names of students in the class", he would not repeat any name. That is what the DISTINCT keyword exists for.

So the query would be:

select distinct LAST_NAME from USERS

And if you were actually interested in the number of unique last names, the question in English is "How many different last names are there among the students in the class?" or, using your example:

select count(distinct LAST_NAME) from USERS

whereas:

select count(LAST_NAME) from USERS

would mean in English: "How many people in the class have a last name?"
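That English reading is quite literal: count(LAST_NAME) counts only the rows where LAST_NAME is not NULL. A small sketch, assuming SQLite via Python's sqlite3 module and one extra hypothetical user (ID 4, "Alex", no last name) that is not part of the original example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "Mark", "Stone"),
        (2, "Jane", "Stone"),
        (3, "Michael", "Stone"),
        (4, "Alex", None),  # hypothetical row with a NULL last name
    ],
)

# Number of different (non-NULL) last names.
(distinct_names,) = conn.execute(
    "SELECT COUNT(DISTINCT last_name) FROM users"
).fetchone()

# Number of users that have a (non-NULL) last name.
(with_last_name,) = conn.execute("SELECT COUNT(last_name) FROM users").fetchone()

# Number of users, whether or not they have a last name.
(all_users,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()

print(distinct_names, with_last_name, all_users)  # 1 3 4
conn.close()
```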

Licensed under: CC-BY-SA with attribution
