Вопрос

I have a question regarding relational algebra and how the theta-join, equi-join, and natural-join can be classified in relation to each other.

In a comment on stackoverflow.com sqlvogel quotes E.F. Codd, the computer scientist who invented the relational model for database management, "[the] Natural[-join] is a subset of Equi[-join] which is a subset of Theta[-join]."* (stackoverflow.com retrieved on Feb 12 at 9:01PM).

I understand how the equi-join can be seen as a subset of theta-join. They both join and they both use a fixed number of attributes, but the equi-join is restricted to using the equality operator (=). The natural-join on the other hand is the set of all combinations of tuples (not a fixed number of attributes) in joining two relationships and restricted to using the equality operator (=). The equi-join and theta-join differ in what operators they can use, but I see the natural-join as being different in a different dimension as it doesn't need any attributes in it's definition.

Can anyone shed some light on this issue?

* I cannot find the original source for this quote.

Это было полезно?

Решение

Summary: Natural joins are a subset of equi-joins, if you ignore technical details and focus on the intention of the join. If you include these technical details, things get a whole lot more complicated fast. Then, a natural join can be a subset of equi-join, depending on the definitions used.


In my opinion it really depends on with respect to what the subset is taken.

Edit: This paragraph is heavily expanded on in the Wikipedia relational algebra section below, but left here to preserve the context of the original answer.

If we're talking about what the operations technically do, then equi-join is clearly a subset of theta-join, since it can only join on equality, while theta can join on greater than comparisons, less than comparisons, etc. In addition, both equi-join and theta-join operate on arbitrary attributes. However, natural join is dissimilar to the other two joins in this respect. Strictly speaking, equi-join and theta-join can only operate on relations that do not share common attribute names. Natural join will join two relations on equality of the common attribute names and "drop" the duplicate attributes. I believe this is the definition you are going by when you say natural joins are different, in which case you are correct.

If we're talking about what the operations semantically (or practically) do, than I'd say they are subsets of each other. Theta-join joins based on a variety of boolean conditions between arbitrary attributes. Equi-join is restricted to only an equality comparison. Natural join is further restricted in that the attributes aren't arbitrary, they must share names.


Edit: A more technical answer:

Disclaimer: I assume first-year level knowledge of set theory.

Definitions

For a technical answer, we need technical definitions. So what technical definitions do we abide by? If you google relational algebra, on the first page you'll probably get Wikipedia links, Youtube videos, and lecture notes from various university database courses (and notably, no link to a paper by Codd). They all agree on the "basics" of selection, projection, joins, etc. But, on some of the details they differ, such as:

  • Do we allow theta-joins to occur between two relations that share an attribute name, and if so, how do we deal with it?
  • In selections and theta-joins, what predicates and propositions are allowed?
  • Can natural joins be performed when no attributes share a name (in which case it degenerates to a Cartesian product)?

I'm going to look at two definitions of relational algebra: Wikipedia's, and Codd's.

Wikipedia's relational algebra

I start with Wikipedia's relational algebra article, since it is quite close to what universities tend to teach nowadays in an introductory database course. This article, however, fails to cite its sources effectively, especially at the technical bits. It is also fairly mathematical. As such I really don't recommend this article for anyone looking at relational algebra for the first time.

Natural Join

Wikipedia defines a natural join as:

[...] the set of all combinations of tuples in R and S that are equal on their common attribute names.

Also,

It is usually required that R and S must have at least one common attribute, but if this constraint is omitted, and R and S have no common attributes, then the natural join becomes exactly the Cartesian product.

Since, as we will see, Codd requires a common attribute name, we shall go with the definition allowing a degenerate natural join. Note that common attributes only appear once in the resultant relation. So if relation R has m attributes, relation S has n attributes, and they have x attributes in common (x ≥ 0, xm, xn), then the resultant relation R NATURAL JOIN S has m + n - x attributes.

Theta-joins and equi-joins

Wikipedia defines a θ-join between relations R and S as a join satisfying the θ-relation a θ b or a θ v, where a and b are attribute names, v is a constant value and θ is one of {<, ≤, =, ≥, >}.

Also,

The result of this operation consists of all combinations of tuples in R and S that satisfy the relation θ. The result of the θ-join is defined only if the headers of S and R are disjoint, that is, do not contain a common attribute.

An equi-join is simply a θ-join whose operation is "=".

Relationship between the joins

Simply by definition, equi-join is a subset of theta-join.

If R and S have attributes in common, then a theta-join cannot directly operate on the relations. Therefore, natural joins can produce relations that theta-joins cannot. If they do not have attributes in common, then the natural join degenerates to a Cartesian product. The theta-join is now legal. It will be able to join, but can produce different results dependant on the particular θ relation used. In the general case then, theta-joins can produce relations which the natural join cannot.

So in conclusion, we have that equi-joins are a subset of theta-joins, natural joins are not subsets of theta-joins, and theta-joins are not subsets of natural joins either (this is what I originally meant when I said they were dissimilar). It follows that natural joins are not a subset of equi-joins.

Codd's original relational algebra1

Now, let's go to the source and look at an early paper by Codd. Admittedly, this is the first time I've read it, so feel free to point out any mistakes I may have made. However, in my opinion, this paper is a much easier read than the Wikipedia article if you discard any presuppositions you might have about relational algebra.

A few notes:

  • In this paper, Codd never actually terms his operations an "algebra".
  • Codd focuses on binary relations (i.e. relations with only two attributes), but shows at the end of section 2.1.3 that relations of greater degree are easily reducible to binary relations.
  • Really, the only operation that is identical to the Wikipedia definition is projection.

Domains and Roles

In section 1.3 a relation is defined as a subset of the Cartesian product of a finite amount of sets. Each one of these sets is called a domain, and has a domain name2. Relations can have columns with the same domain name (i.e. two of the columns come from the same set). In this case the columns with identical domain names can be distinguished by a unique role name. So, Wikipedia's attribute name is similar to Codd's domain_name.role_name3.

Joins

In section 2.1.3, Codd defines a join as possible only if the two relations share a domain. Furthermore he states,

A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that π12(U) = R and π23(U) = S. Any such ternary relation is called a join of R with S. If R, S are binary relations such that π2(R) = π1(S), then R is joinable with S.

That's quite theoretic, but what it essentially means is that a join can only happen between R and S if the relations R and S can be completely recovered from the joined relation. This implies that the set of values of the column being joined on must be the same in R and S.

This is more restrictive than SQL's joins and the joins defined in the Wikipedia article (natural join included).

Natural Join

Given the above, the definition for Natural Join falls out easily:

R x S = { (a, b, c): R (a, b) ⋀ S (b, c) }

Restrictions

We'll take a quick detour from joins to look at section 2.1.5 which describes Restriction. It is an operation similar to Selection (as defined by Wikipedia). A relation R can be restricted by relation S in the following manner: A new relation is formed, R', consists of the tuples of R where a subset of its column's value's is an element of a subset of S's columns. I'm not sure I did a good job of explaining that, so here's some equivalent SQL:

SELECT * FROM R 
WHERE some_attributes IN (
    SELECT some_comparable_attributes FROM S)

Theta-joins and equi-joins

The paper does not provide an explicit definition for theta and equi-joins. Also, it doesn't appear possible to simply define a theta-join (though please correct me if I'm wrong). We can, however, define a series of operations on relations R, and S, that gives a result pretty much identical4 to the equi-join defined in the Wikipedia article.

We will take advantage of the fact that we only have to consider binary relations. So let R have the columns ( a, b ) and S have the columns ( b, c ). They are not necessarily joinable since πb(R) is not necessarily equal to πb(S). Then form the relations R' by restricting R with S with respect to attribute b, and S' by restricting S by R again with respect to attribute b. Now, πb(R') = πb(S'), so we can take the Natural Join of R' and S'.

Relationship between the joins

It falls out from the definition of the equi-join given, but if πb(R) = πb(S), then the equi-join is equivalent to the Natural Join. Therefore, natural join is a subset of the equi-join.


Further reading: I find this answer gives more helpful information on theta-joins and relational algebras.


Footnotes

  1. I'm not sure how fair it is to call this Codd's original relational algebra as I found out in a CS.SE answer that two years later, he would define a relational algebra very similar to the Wikipedia one (though thankfully allowing ≠ in θ-joins). Nevertheless, a large point of this answer is to emphasize that there are differing definitions of relational algebra, and a definition must be established before any form of proof can be done.

  2. I'm glossing over the the step where domain position is defined first, and the distinction between a relation and a relationship.

  3. Codd defines an additional generation identifier, which we'll again gloss over.

  4. In Codd's theory, the columns to be joined on must have the same domain. This is not a requirement according to the theory presented in Wikipedia, though really, you shouldn't be making those kinds of joins anyway.

Другие советы

He probably referred to the following :

Natural joins are equi-joins where the equality comparison is to be applied on attributes that have the same name.

TABLE_A JOIN TABLE_B ON TABLE_A.X = TABLE_B.Y is an equi-join, but not a natural join.

Hence all natural joins are indeed equi-joins, but not all equi-joins are natural joins.

Hence the set of all possible natural joins between tables is a subset (often proper) of the set of all possible equi-joins between those same tables.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top