Scala for comprehension efficiency?

https://stackoverflow.com/questions/4211103

25-09-2019
Question

In the book "Programming in Scala", chapter 23, the author gives an example like this:

case class Book(title: String, authors: String*)
val books: List[Book] = // list of books, omitted here
// find all authors who have published at least two books

for (b1 <- books; b2 <- books if b1 != b2;
    a1 <- b1.authors; a2 <- b2.authors if a1 == a2)
yield a1

The author says this will be translated into:

books flatMap (b1 =>
   books filter (b2 => b1 != b2) flatMap (b2 =>
      b1.authors flatMap (a1 =>
        b2.authors filter (a2 => a1 == a2) map (a2 =>
           a1))))

But if you look at the definitions of map and flatMap (in TraversableLike.scala), you will find that they are defined as for loops:

   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
    val b = bf(repr)
    b.sizeHint(this) 
    for (x <- this) b += f(x)
    b.result
  }

  def flatMap[B, That](f: A => Traversable[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = {
    val b = bf(repr)
    for (x <- this) b ++= f(x)
    b.result
  }

Well, I guess this for will in turn be translated to foreach, and then to a while statement, which is a construct rather than an expression. Scala doesn't have a for construct, because it wants a for to always yield something.

So what I want to discuss is this: why does Scala do this "for translation"? The author's example uses 4 generators, which will end up as a 4-level nested loop. I think it will have really horrible performance when books is large.

Scala encourages people to use this kind of syntactic sugar. You constantly see code that makes heavy use of filter, map and flatMap, and it seems programmers are forgetting that what they are really doing is nesting one loop inside another, while all they gain is code that looks a bit shorter. What do you think?

Solution

For comprehensions are syntactic sugar for monadic transformation, and, as such, are useful in all sorts of places. At that, they are much more verbose in Scala than the equivalent Haskell construct (of course, Haskell is non-strict by default, so one can't talk about performance of the construct like in Scala).

Also important, this construct keeps what is being done clear, and avoids quickly escalating indentation or unnecessary private method nesting.

As to the final consideration, whether that hides the complexity or not, I'll posit this:

for {
  b1 <- books
  b2 <- books
  if b1 != b2
  a1 <- b1.authors
  a2 <- b2.authors 
  if a1 == a2
} yield a1

It is very easy to see what is being done, and the complexity is clear: b^2 * a^2 (the filter won't alter the complexity), where b is the number of books and a is the number of authors per book. Now, write the same code in Java, either with deep indentation or with private methods, and try to ascertain, at a quick glance, what the complexity of the code is.

So, imho, this doesn't hide the complexity, but, on the contrary, makes it clear.
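To make the complexity claim concrete, here is a minimal sketch that counts how many times the innermost step of the comprehension is reached. The names ComplexityDemo, makeBooks and innerSteps are made up for this illustration, and the generated books are hypothetical test data. With the b1 != b2 guard the count comes out as b * (b - 1) * a * a, which is still on the order of b^2 * a^2.

```scala
object ComplexityDemo {
  case class Book(title: String, authors: String*)

  // Hypothetical data: `b` distinct books, each with `a` authors.
  def makeBooks(b: Int, a: Int): List[Book] =
    List.tabulate(b)(i => Book(s"book$i", (0 until a).map(j => s"author$i-$j"): _*))

  // Counts how often the innermost generator produces an (a1, a2) pair,
  // i.e. how many author comparisons the original comprehension performs.
  def innerSteps(books: List[Book]): Long = {
    var count = 0L
    for {
      b1 <- books
      b2 <- books
      if b1 != b2
      a1 <- b1.authors
      a2 <- b2.authors
    } count += 1
    count
  }

  def main(args: Array[String]): Unit = {
    println(innerSteps(makeBooks(10, 3)))
    println(innerSteps(makeBooks(20, 3)))
  }
}
```

Doubling the number of books roughly quadruples the count, which is exactly what the shape of the comprehension suggests.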

As for the map/flatMap/filter definitions you mention, they do not belong to List or any other class, so they won't be applied. Basically,

for(x <- List(1, 2, 3)) yield x * 2

is translated into

List(1, 2, 3) map (x => x * 2)

and that is not the same thing as

map(List(1, 2, 3), ((x: Int) => x * 2))

which is how the definition you passed would be called. For the record, the actual implementation of map on List is:

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this) 
  for (x <- this) b += f(x)
  b.result
}

OTHER TIPS

I write code so that it's easy to understand and maintain. I then profile. If there's a bottleneck that's where I devote my attention. If it's in something like you've described I'll attack the problem in a different manner. Until then, I love the "sugar." It saves me the trouble of writing things out or thinking hard about it.

There are actually 6 loops: one for each filter/flatMap/map.

The filter->map pairs can be done in one loop by using lazy views of the collections (or the iterator method).
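As a rough illustration of that point, the sketch below logs the order in which the predicate and the mapping function are called, once on a strict List and once on a view of it. FusedFilterMapDemo, strictRun and lazyRun are hypothetical names chosen for this example, and the exact traversal details are the library's business, so treat this as a demonstration rather than a specification.

```scala
import scala.collection.mutable.ListBuffer

object FusedFilterMapDemo {
  val xs: List[Int] = List(1, 2, 3, 4)

  // Strict: filter walks the whole list and builds an intermediate list,
  // then map walks that intermediate list.
  def strictRun(): (List[Int], List[String]) = {
    val log = ListBuffer.empty[String]
    val result = xs
      .filter { x => log += s"filter $x"; x % 2 == 0 }
      .map { x => log += s"map $x"; x * 10 }
    (result, log.toList)
  }

  // Lazy: on a view, filter and map are applied element by element
  // during a single traversal, which is forced by toList.
  def lazyRun(): (List[Int], List[String]) = {
    val log = ListBuffer.empty[String]
    val result = xs.view
      .filter { x => log += s"filter $x"; x % 2 == 0 }
      .map { x => log += s"map $x"; x * 10 }
      .toList
    (result, log.toList)
  }

  def main(args: Array[String]): Unit = {
    println(strictRun())
    println(lazyRun())
  }
}
```

Both runs produce the same result; only the interleaving of the calls differs.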

In general, it is running two nested loops over books to find all book pairs, and then two nested loops to find out whether an author of one book is in the list of authors of the other.

Using simple data structures, you would do the same when coding explicitly.

And of course, the example here is to show a complex 'for' loop, not to write the most efficient code. E.g., instead of a sequence of authors, one could use a Set and then find if the intersection is non empty:

for (b1 <- books; b2 <- books if b1 != b2; a <- (b1.authors & b2.authors)) yield a
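A runnable sketch of the Set-based idea, assuming a variant of Book whose authors field is a Set[String] (that variant and the sample data are made up here, they are not from the book). The sketch keeps a b1 != b2 guard so that a book is not paired with itself.

```scala
object SetIntersectionDemo {
  // Hypothetical variant of Book that stores its authors in a Set.
  case class Book(title: String, authors: Set[String])

  // Invented sample data.
  val books: List[Book] = List(
    Book("A", Set("Ann", "Bob")),
    Book("B", Set("Ann")),
    Book("C", Set("Cid"))
  )

  // `&` is set intersection; each shared author is yielded once per ordered pair.
  val shared: List[String] =
    for {
      b1 <- books
      b2 <- books
      if b1 != b2
      a <- b1.authors & b2.authors
    } yield a

  // Collapse the duplicates that come from seeing each pair in both orders.
  val sharedDistinct: Set[String] = shared.toSet

  def main(args: Array[String]): Unit = {
    println(shared)
    println(sharedDistinct)
  }
}
```

This still visits every pair of books, so it only speeds up the per-pair author check, not the pairing itself.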

Note that in 2.8, the filter call was changed to withFilter, which is lazy and avoids constructing an intermediate structure. See the related question "guide to move from filter to withFilter?".

I believe the reason that for is translated to map, flatMap and withFilter (as well as value definitions if present) is to make the use of monads easier.
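For reference, here is a sketch of what that translation looks like for the book example when the guards become withFilter calls. It is hand-written in the spirit of the rewriting rules, not copied from compiler output, and the sample books are invented; the point is only that both forms give the same result.

```scala
object WithFilterDemo {
  case class Book(title: String, authors: String*)

  // Invented sample data.
  val books: List[Book] = List(
    Book("A", "Ann", "Bob"),
    Book("B", "Ann"),
    Book("C", "Cid")
  )

  // The comprehension from the question.
  val viaFor: List[String] =
    for {
      b1 <- books
      b2 <- books
      if b1 != b2
      a1 <- b1.authors
      a2 <- b2.authors
      if a1 == a2
    } yield a1

  // Hand-written equivalent: generators become flatMap/map,
  // guards become withFilter (no intermediate collection is built).
  val viaCalls: List[String] =
    books.flatMap(b1 =>
      books.withFilter(b2 => b1 != b2).flatMap(b2 =>
        b1.authors.flatMap(a1 =>
          b2.authors.withFilter(a2 => a1 == a2).map(a2 => a1))))

  def main(args: Array[String]): Unit = {
    println(viaFor)
    println(viaCalls)
  }
}
```

Any type that defines map, flatMap and withFilter with suitable signatures can be used in the same way, which is what makes the sugar work for things other than collections.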

In general I think if the computation you are doing involves looping 4 times, it is fine using the for loop. If the computation can be done more efficiently and performance is important then you should use the more efficient algorithm.

One follow-up to @IttayD's answer on the algorithm's efficiency. It's worth noting that the algorithm in the original post (and in the book) is a nested loop join. In practice, this isn't an efficient algorithm for large datasets, and most databases would use a hash aggregate here instead. In Scala, a hash aggregate would look something like:

(for (book <- books;
      author <- book.authors) yield (book, author)
).groupBy(_._2).filter(_._2.size > 1).keys
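Here is the same idea as a self-contained sketch with invented sample data (HashAggregateDemo and its members are names picked for this example). It makes one pass to build (book, author) pairs, groups them by author, and keeps the authors that appear in more than one pair.

```scala
object HashAggregateDemo {
  case class Book(title: String, authors: String*)

  // Invented sample data.
  val books: List[Book] = List(
    Book("A", "Ann", "Bob"),
    Book("B", "Ann"),
    Book("C", "Cid")
  )

  // One (book, author) pair per author entry.
  val pairs: List[(Book, String)] =
    for {
      book <- books
      author <- book.authors
    } yield (book, author)

  // Group by author and keep authors with more than one pair.
  // Assumes an author is not listed twice on the same book.
  val prolific: Set[String] =
    pairs.groupBy(_._2).filter(_._2.size > 1).keys.toSet

  def main(args: Array[String]): Unit =
    println(prolific)
}
```

The work here grows with the total number of author entries rather than with the square of the number of books.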

Licensed under: CC-BY-SA with attribution
