Tail optimization guarantee - loop encoding in Haskell

https://stackoverflow.com/questions/9149183

22-04-2021
|

Question

So the short version of my question is, how are we supposed to encode loops in Haskell, in general? There is no tail optimization guarantee in Haskell, bang patterns aren't even a part of the standard (right?), and fold/unfold paradigm is not guaranteed to work in all situation. Here's case in point were only bang-patterns did the trick for me of making it run in constant space (not even using $! helped ... although the testing was done at Ideone.com which uses ghc-6.8.2).

It is basically about a nested loop, which in list-paradigm can be stated as

prod (sum,concat) . unzip $ 
    [ (c, [r | t]) | k<-[0..kmax], j<-[0..jmax], let (c,r,t)=...]
prod (f,g) x = (f.fst $ x, g.snd $ x)

Or in pseudocode:

let list_store = [] in
for k from 0 to kmax
    for j from 0 to jmax
        if test(k,j) 
            list_store += [entry(k,j)]
        count += local_count(k,j)
result = (count, list_store)

Until I added the bang-pattern to it, I got either a memory blow-out or even a stack overflow. But bang patterns are not part of the standard, right? So the question is, how is one to code the above, in standard Haskell, to run in constant space?

Here is the test code. The calculation is fake, but the problems are the same. EDIT: The foldr-formulated code is:

testR m n = foldr f (0,[]) 
               [ (c, [(i,j) | (i+j) == d ])
                 | i<- [0..m], j<-[0..n], 
                   let c = if (rem j 3) == 0 then 2 else 1 ]
  where d = m + n - 3
    f (!c1, [])     (!c, h) = (c1+c,h) 
    f (!c1, (x:_))  (!c, h) = (c1+c,x:h)

Trying to run print $ testR 1000 1000 produces stack overflow. Changing to foldl only succeeds if using bang-patterns in f, but it builds the list in reversed order. I'd like to build it lazily, and in the right order. Can it be done with any kind of fold, for the idiomatic solution?

EDIT: to sum up the answer I got from @ehird: there's nothing to fear using bang pattern. Though not in standard Haskell itself it is easily encoded in it as f ... c ... = case (seq c False) of {True -> undefined; _ -> ...}. The lesson is, only pattern match forces a value, and seq does NOT force anything by itself, but rather arranges that when seq x y is forced - by a pattern match - x will be forced too, and y will be the answer. Contrary to what I could understand from the Online Report, $! does NOT force anything by itself, though it is called a "strict application operator".

And the point from @stephentetley - strictness is very important in controlling the space behaviour. So it is perfectly OK to encode loops in Haskell with proper usage of strictness annotations with bang patterns, where needed, to write any kind of special folding (i.e. structure-consuming) function that is needed - like I ended up doing in the first place - and rely on GHC to optimize the code.

Thank you very much to all for your help.

Solution

Bang patterns are simply sugar for seq — whenever you see let !x = y in z, that can be translated into let x = y in x `seq` z. seq is standard, so there's no issue with translating programs that use bang patterns into a portable form.

It is true that Haskell makes no guarantees about performance — the report does not even define an evaluation order (only that it must be non-strict), let alone the existence or behaviour of a runtime stack. However, while the report doesn't specify a specific method of implementation, you can certainly optimise for one.

For example, call-by-need (and thus sharing) is used by all Haskell implementations in practice, and is vital for optimising Haskell code for memory usage and speed. Indeed, the pure memoisation trick¹ (as relies on sharing (without it, it'll just slow things down).

This basic structure lets us see, for example, that stack overflows are caused by building up too-large thunks. Since you haven't posted your entire code, I can't tell you how to rewrite it without bang patterns, but I suspect [ (c, [r | t]) | ... ] should become [ c `seq` r `seq` t `seq` (c, [r | t]) | ... ]. Of course, bang patterns are more convenient; that's why they're such a common extension! (On the other hand, you probably don't need to force all of those; knowing what to force is entirely dependent on the specific structure of the code, and wildly adding bang patterns to everything usually just slows things down.)

Indeed, "tail recursion" per se does not mean all that much in Haskell: if your accumulator parameters aren't strict, you'll overflow the stack when you later try to force them, and indeed, thanks to laziness, many non-tail-recursive programs don't overflow the stack; printing repeat 1 won't ever overflow the stack, even though the definition — repeat x = x : repeat x — clearly has recursion in a non-tail position. This is because (:) is lazy in its second argument; if you traverse the list, you'll have constant space usage, as the repeat x thunks are forced, and the previous cons cells are thrown away by the garbage collector.

On a more philosophical note, tail-recursive loops are generally considered suboptimal in Haskell. In general, rather than iteratively computing a result in steps, we prefer to generate a structure with all the step-equivalents at the leaves, and do a transformation (like a fold) on it to produce the final result. This is a much higher-level view of things, made efficient by laziness (the structure is built up and garbage-collected as it's processed, rather than all at once).²

This can take some getting used to at first, and it certainly doesn't work in all cases — extremely complicated loop structures might be a pain to translate efficiently³ — but directly translating tail-recursive loops into Haskell can be painful precisely because it isn't really all that idiomatic.

As far as the paste you linked to goes, id $! x doesn't work to force anything because it's the same as x `seq` id x, which is the same as x `seq` x, which is the same as x. Basically, whenever x `seq` y is forced, x is forced, and the result is y. You can't use seq to just force things at arbitrary points; you use it to cause the forcing of thunks to depend on other thunks.

In this case, the problem is that you're building up a large thunk in c, so you probably want to make auxk and auxj force it; a simple method would be to add a clause like auxj _ _ c _ | seq c False = undefined to the top of the definition. (The guard is always checked, forcing c to be evaluated, but always results in False, so the right-hand side is never evaluated.)

Personally, I would suggest keeping the bang pattern you have in the final version, as it's more readable, but f c _ | seq c False = undefined would work just as well too.

¹ See Elegant memoization with functional memo tries and the data-memocombinators library.

² Indeed, GHC can often even eliminate the intermediate structure entirely using fusion and deforestation, producing machine code similar to how the computation would be written in a low-level imperative language.

³ Although if you have such loops, it's quite possible that this style of programming will help you simplify them — laziness means that you can easily separate independent parts of a computation out into separate structures, then filter and combine them, without worrying that you'll be duplicating work by making intermediate computations that will later be thrown away.

OTHER TIPS

OK let's work from the ground up here.

You have a list of entries

entries = [(k,j) | j <- [0..jmax], k <- [0..kmax]]

And based on those indexes, you have tests and counts

tests m n = map (\(k,j) -> j + k == m + n - 3) entries
counts = map (\(_,j) -> if (rem j 3) == 0 then 2 else 1) entries

Now you want to build up two things: a "total" count, and the list of entries that "pass" the test. The problem, of course, is that you want to generate the latter lazily, while the former (to avoid exploding the stack) should be evaluated strictly.

If you evaluate these two things separately, then you must either 1) prevent sharing entries (generate it twice, once for each calculation), or 2) keep the entire entries list in memory. If you evaluate them together, then you must either 1) evaluate strictly, or 2) have a lot of stack space for the huge thunk created for the count. Option #2 for both cases is rather bad. Your imperative solution deals with this problem simply by evaluating simultaneously and strictly. For a solution in Haskell, you could take Option #1 for either the separate or the simultaneous evaluation. Or you could show us your "real" code and maybe we could help you find a way to rearrange your data dependencies; it may turn out you don't need the total count, or something like that.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow