Question

It comes from big data perspective. Basically, many frameworks (like Apache Spark) "compensate" lack of relational operations by providing Functor/Monad-like interfaces and there is a similar movement towards cats-to-SQL conversions (Slick in Scala). For instance, we need natural join (assuming no repetitions on indexes) for element-wise multiplication of vectors from SQL-perspective, which could be considered as zip + map(multiply) (Spark's MLib, however, already has ElementwiseProduct) in Category Theory's applications.

Simply saying (following examples are in Scala):

  • the referenced subcase of join can be thought as applicative functor (over sorted collection), which in its turn gives us zip: List(1,2,3).ap(List(2,4,8).map(a => (b: Int) => a * b)) --> (List(1,2,3) zip List(2,4,8)).map(x => x._1 * x._2). Moreover, we can induce it to some other joins, assuming some preprocessing (groupBy operator or just surjection, or generally - an epimorphism).

  • other joins and selection can be thought as monad. For instance, WHERE is just: List(1,2,2,4).flatMap(x => if (x < 3) List(x) else List.empty) --> List(1,2,2,4).filter(_ < 3)

  • data itself is just ADT (GADT too?), which in its turn looks like a simple Set-category (or more generally speaking - Cartesian-closed), so it should (I suppose) cover Set-based operations (due to Curry-Howard-Lambek itself) and also operations like RENAME (at least in practice).

  • aggregation corresponds to fold/reduce (catamorphism)

So, what I'm asking is can we build an isomorphism between (maybe subset of) category theory and (the whole) relational algebra or is there something uncovered? If it works, what exact "subset" of categories is isomorphic to relalgebra?

You can see that my own assumptions are quite broad while formal solutions like Curry-Howard-Lambek correspondence for logic-cats-lambda are more precise - so actually, I'm asking for a reference to an accomplished study (that shows a direct relationship) with more examples in Scala/Haskell.

Edit: the accepted answer made me think that I went too far representing joins and conditions as a monad (especially using an empty value that effectively instantiates FALSE), I think pullbacks should suffice at least for relalgebra subset of SQL. Monads are better for higher order (nesting) stuff like GROUP BY, which isn’t part of relalgebra.

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top