Does the 'difference' operation add expressiveness to a query language that already includes 'join'?

https://cs.stackexchange.com/questions/2

15-10-2019
|

Pergunta

The set difference operator (e.g., EXCEPT in some SQL variants) is one of the many fundamental operators of relational algebra. However, there are some databases that do not support the set difference operator directly, but which support LEFT JOIN (a kind of outer join), and in practice this can be used instead of a set difference operation to achieve the same effect.

Does this mean that the expressive power of a query language is the same even without the set difference operator, so long as the LEFT JOIN operator is maintained? How would one prove this fact?

Solução

In relational algebra, we shall first provide an informal definition of left (outer) join, and proceed to prove that it, renaming, selection, join and projection can construct difference, as well as that difference, selection and union can be used to construct left (outer) join. Actually, we'll end up doing it in the reverse order: we'll show how to construct left joins using differences, and then we'll show how to construct differences using left joins.

Let $R$ and $S$ have schemata $(R', T)$ and $(T, S')$, respectively, where $R'$ and $S'$ are the sets of attributes in one schema but not the other, and $T$ is the set of common attributes.

Let $w = (\epsilon, \epsilon, ..., \epsilon)$ be the null tuple for schema $S'$. That is, it is the tuple consisting of all null values for each attribute of $S'$. Then, we define left outer join as follows: R LEFT JOIN S is the set of all tuples $(r, t, s)$ belonging to schema $(R', T, S')$ where...

$(r, t)$ is a tuple in $R$;
(a) $(t, s)$ is a tuple of $S$ or (b) $s = w$;
If $(r, t, s)$ is in the set for $s \neq w$, then $(r, t, w)$ is not in the set.

Example: $R$'s schema is $(A_{1}, A_{2}, A_{3})$, $S$'s schema is $(A_{2}, A_{3}, A_{4})$, and we have that $R = \{(1, 2, 3), (4, 5, 6)\}$ and $S = \{(2, 3, 4), (2, 3, 6)\}$. By (1) and (2) we get the intermediate result $\{(1, 2, 3, 4),(1, 2, 3, 6), (1, 2, 3, \epsilon),(4, 5, 6, \epsilon)\}$. By (3) we must remove $(1, 2, 3, \epsilon)$, since we have (for instance) $(1, 2, 3, 4)$ and $s = 4 \neq \epsilon = w$. We are thus left with $\{(1, 2, 3, 4), (1, 2, 3, 6), (4, 5, 6, \epsilon)\}$, the expected result for a left join.

Theorem: R LEFT JOIN S is equivalent to (R EQUIJOIN S) UNION ((((PROJECT_T R) DIFFERENCE (PROJECT_T S)) EQUIJOIN R) JOIN w).

Proof: (R EQUIJOIN S) gives us everything required by (1) and (2a). We claim that ((((PROJECT_T R) DIFFERENCE (PROJECT_T S)) EQUIJOIN R) JOIN w) gives us everything of the form (r, t, w) required by (2b) and (3).

To see this, first notice that (((PROJECT_T R) DIFFERENCE (PROJECT_T S)) EQUIJOIN R) is the set of all tuples in $R$ for which there is no corresponding tuple in $S$. To see that, it suffices to note that by projecting the common attributes out of $R$ and $S$ (the attribute set $T$) and taking the difference, one is left with all and only those tuples (with schema $T$) which are represented in $R$ but not $S$. By the EQUIJOIN with $R$, we recover all and only those tuples in $R$ which have values for attributes in $T$ which are present in $R$ but not in $S$; namely, precisely the set of tuples we have so far claimed.

Next, notice that the schema of (((PROJECT_T R) DIFFERENCE (PROJECT_T S)) is the same as that of $R$ (namely, $(R', T)$), while the schema of $w$ is $S'$. The JOIN operation is therefore a Cartesian product, we we get all tuples of the form $(r, t, w)$ where there is no $(t, s)$ in $S$ corresponding to $(r, t)$ in $R$.

To see that this is precisely the set of tuples we needed to add to R EQUIJOIN S in order to construct R LEFT JOIN S, consider the following: by construction, (3) is satisfied, since R EQUIJOIN S cannot contain $(r, t, s)$ if ((((PROJECT_T R) DIFFERENCE (PROJECT_T S)) EQUIJOIN R) JOIN w) contains $(r, t, w)$ (if it did, then the second part's containing $(r, t, w)$ would be a contradiction); if we were to add another $(r, t, w)$ not in ((((PROJECT_T R) DIFFERENCE (PROJECT_T S)) EQUIJOIN R) JOIN w), then there would be a $(t, s)$ in $S$ corresponding to $(r, t)$ in $R$, and by the definition of EQUIJOIN, $(r, t, s)$ would also be in R LEFT JOIN S, a contradiction of (3). This completes the proof.

Now we show that left join can be used to construct difference:

Theorem: R DIFFERENCE S is equivalent to PROJECT_T(SELECT_{t'=w}(R LEFT JOIN (SELECT_{s=s'}(((S JOIN RENAME_{T->T'}(S)))))))

Proof: Notice that here, $R'$ and $S'$ are empty, since all attributes are shared for DIFFERENCE to make sense. First, we create a new relation from $S$ by duplicating the attribute set in $S$ (handled by RENAME and JOIN) so that it consists of tuples $(t, t')$ on attribute set $(T, T')$ where $t = t'$ (handled by the SELECT). The left join leaves us with tuples of the form $(t, t')$ where $t=t'$ or $t'=w$. Now, to get rid of entries which do also appear in $S$, we must keep only the tuples of the form $(t, w)$, which is handled by the outermost SELECT. The last PROJECT gets rid of the temporary attribute set $T'$ and leaves us with the difference in terms of the original schema.

Example: Let $R = \{(1, 2), (3, 4), (5, 6)\}$ and $S = \{(3, 4), (5, 6), (7, 8)\}$. We first get $S$ with the RENAMEd attribute set $T'$: $\{(3, 4), (5, 6), (7, 8)\}$. The JOIN operation gives us the Cartesian product with all nine possible pairings; this set is not written here for reasons of formatting. The SELECT then pares this down to $\{(3, 4, 3, 4), (5, 6, 5, 6), (7, 8, 7, 8)\}$. The LEFT JOIN with $R$ gives $\{(1, 2, \epsilon, \epsilon), (3, 4, 3, 4), (5, 6, 5, 6)\}$. The SELECT gives $\{(1, 2, \epsilon, \epsilon)\}$. The PROJECT gives $\{(1, 2)\}$, the desired answer.

Outras dicas

LEFT JOIN as implemented by SQL, does not produce relations as its result (because some attributes of the result will not have a value).

Ergo, LEFT JOIN as implemented by SQL, is not a direct counterpart of any relational algebra operator.

Ergo, The relational difference operator cannot be expressed in terms of LEFT JOIN (because LEFT JOIN cannot possibly be part of the relational algebra, this being because LEFT JOIN produces something that is not a relation, thus violating closure of the algebra).

Any set of primitive operators of a relational algebra that satisfies closure that I know of, always includes either relational difference or else relational semidifference.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange