Question

I'm an intermediate Haskell programmer with tons of experience in strict FP and non-FP languages. Most of my Haskell code analyzes moderately large datasets (10^6..10^9 things), so laziness is always lurking. I have a reasonably good understanding of thunks, WHNF, pattern matching, and sharing, and I've been able to fix leaks with bang patterns and seq, but this profile-and-pray approach feels sordid and wrong.

I want to know how experienced Haskell programmers approach laziness at design time. I'm not asking about easy items like Data.ByteString.Lazy or foldl'; rather, I want to know how you think about the lower-level lazy machinery that causes runtime memory problems and tricky debugging.

How do you think about thunks, pattern matching, and sharing during design time?

What design patterns and idioms do you use to avoid leaks?

How did you learn these patterns and idioms, and do you have some good refs?

How do you avoid premature optimization of non-leaking non-problems?

(Amended 2014-05-15 for time budgeting):

Do you budget substantial project time for finding and fixing memory problems?

Or do your design skills typically circumvent memory problems, so that you get the expected memory consumption early in the development cycle?


Solution

I think most of the trouble with "strictness leaks" happens because people don't have a good conceptual model. Haskellers without a good conceptual model tend to have and propagate the superstition that stricter is better. Perhaps this intuition comes from toying with small examples and tight loops, but it is incorrect. It's just as important to be lazy at the right times as to be strict at the right times.

There are two camps of data types, usually referred to as "data" and "codata". It is essential to respect the patterns of each one.

  • Operations which produce "data" (Int, ByteString, ...) must be forced close to where they occur. If I add a number to an accumulator, I am careful to make sure that it will be forced before I add another one. A good understanding of laziness is very important here, especially its conditional nature (i.e. strictness propositions don't take the form "X gets evaluated" but rather "when Y is evaluated, so is X").
  • Operations which produce and consume "codata" (lists most of the time, trees, most other recursive types) must do so incrementally. A codata -> codata transformation should usually produce some information for each bit of information it consumes (modulo skipping, as in filter). Another important discipline for codata is to use it linearly whenever possible -- i.e. use the tail of a list exactly once; use each branch of a tree exactly once. This ensures that the GC can collect pieces as they are consumed. A sketch of both disciplines follows this list.
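
Here is a minimal sketch of both disciplines (the function names are illustrative, not from any library):

{-# LANGUAGE BangPatterns #-}

-- Data discipline: force the accumulator at every step, so that when
-- the next cons is consumed, the previous addition has already happened.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

-- Codata discipline: emit one cons per cons consumed, and use the tail
-- exactly once, so the GC can reclaim each cell as it goes by.
doubles :: [Int] -> [Int]
doubles []     = []
doubles (x:xs) = 2 * x : doubles xs

sumStrict runs in constant space on a long finite list, and doubles works even on infinite input because it never demands more than one cons ahead of its consumer.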

Special care is needed when you have codata that contains data. E.g. iterate (+1) 0 !! 1000 will end up building a size-1000 thunk before evaluating it. You need to think about conditional strictness again -- the way to prevent this case is to ensure that when a cons of the list is consumed, the addition of its element occurs. iterate violates this, so we need a better version.

iterate' :: (a -> a) -> a -> [a]
iterate' f x = x : (x `seq` iterate' f (f x))
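
With this definition, demanding a cell's tail forces the element in front of it first, so iterate' (+1) 0 !! 1000 keeps at most one pending addition alive instead of a chain of 1000. (GHC's Data.List has since gained an iterate' with the same intent, in base 4.11 and later.)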

As you start composing things, of course it gets harder to tell when bad cases happen. In general it is hard to make efficient data structures / functions that work equally well on data and codata, and it's important to keep in mind which is which (even in a polymorphic setting where it's not guaranteed, you should have one in mind and try to respect it).
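
A small illustration of that tension, as a sketch: the two canonical folds each respect one camp and betray the other.

import Data.List (foldl')

-- foldl' treats its list as data: it forces an accumulator per element,
-- runs in constant space on finite lists, and never returns on [1..].
total :: [Int] -> Int
total = foldl' (+) 0

-- foldr treats its list as codata: it can stop early, so it works on
-- infinite input, but it would build thunks if used for accumulation.
anyBig :: [Int] -> Bool
anyBig = foldr (\x rest -> x > 10 || rest) False   -- anyBig [1..] == True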

Sharing is tricky, and I think I approach it mostly on a case-by-case basis. Because it's tricky, I try to keep it localized, choosing not to expose large data structures to module users in general. This can usually be done by exposing combinators for generating the thing in question, and then producing and consuming it all in one go (the codensity transformation on monads is an example of this).
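
As an illustration of why I keep sharing localized (a sketch; the names are made up): holding two folds over one shared list pins its whole spine, while producing and consuming in one go keeps only the accumulators live.

{-# LANGUAGE BangPatterns #-}

-- Shared: sum and length both reference xs, so the spine built while
-- summing must stay in memory until length has walked it too.
meanShared :: [Double] -> Double
meanShared xs = sum xs / fromIntegral (length xs)

-- One pass: each cell can be generated and collected as it is consumed.
meanOnePass :: [Double] -> Double
meanOnePass = go 0 (0 :: Int)
  where
    go !s !n []     = s / fromIntegral n
    go !s !n (x:xs) = go (s + x) (n + 1) xs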

My design goal is to get every function to be respectful of the data / codata patterns of my types. I can usually hit it (though sometimes it requires some heavy thought -- it has become natural over the years), and I seldom have leak problems when I do. But I don't claim that it's easy -- it requires experience with the canonical libraries and patterns of the language. These decisions are not made in isolation, and everything has to be right at once for it to work well. One poorly tuned instrument can ruin the whole concert (which is why "optimization by random perturbation" almost never works for these kinds of issues).

Apfelmus's Space Invariants article is helpful for developing your space/thunk intuition further. Also see Edward Kmett's comment on the original answer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow