Is everything in Haskell stored in thunks, even simple values?

https://stackoverflow.com/questions/8478412

14-03-2021
|

Pergunta

What do the thunks for the following value/expression/function look like in the Haskell heap?

val = 5                -- is `val` a pointer to a box containing 5?
add x y = x + y        
result = add 2 val     
main = print $ result

Would be nice to have a picture of how these are represented in Haskell, given its lazy evaluation mode.

Solução

Official answer

It's none of your business. Strictly implementation detail of your compiler.

Short answer

Yes.

Longer answer

To the Haskell program itself, the answer is always yes, but the compiler can and will do things differently if it finds out that it can get away with it, for performance reasons.

For example, for '''add x y = x + y''', a compiler might generate code that works with thunks for x and y and constructs a thunk as a result. But consider the following:

foo :: Int -> Int -> Int
foo x y = x * x + y * y

Here, an optimizing compiler will generate code that first takes x and y out of their boxes, then does all the arithmetic, and then stores the result in a box.

Advanced answer

This paper describes how GHC switched from one way of implementing thunks to another that was actually both simpler and faster: http://research.microsoft.com/en-us/um/people/simonpj/papers/eval-apply/

Outras dicas

In general, even primitive values in Haskell (e.g. of type Int and Float) are represented by thunks. This is indeed required by the non-strict semantics; consider the following fragment:

bottom :: Int
bottom = div 1 0

This definition will generate a div-by-zero exception only if the value of bottom is inspected, but not if the value is never used.

Consider now the add function:

add :: Int -> Int -> Int
add x y = x+y

A naive implementation of add must force the thunk for x, force the thunk for y, add the values and create an (evaluated) thunk for the result. This is a huge overhead for arithmetic compared to strict functional languages (not to mention imperative ones).

However, an optimizing compiler such as GHC can mostly avoid this overhead; this is a simplified view of how GHC translates the add function:

add :: Int -> Int -> Int
add (I# x) (I# y) = case# (x +# y) of z -> I# z

Internally, basic types like Int is seen as datatype with a single constructor. The type Int# is the "raw" machine type for integers and +# is the primitive addition on raw types. Operations on raw types are implemented directly on bit-patterns (e.g. registers) --- not thunks. Boxing and unboxing are then translated as constructor application and pattern matching.

The advantage of this approach (not visible in this simple example) is that the compiler is often capable of inlining such definitions and removing intermediate boxing/unboxing operations, leaving only the outermost ones.

It would be absolutely correct to wrap every value in a thunk. But since Haskell is non-strict, compilers can choose when to evaluate thunks/expressions. In particular, compilers can choose to evaluate an expression earlier than strictly necessary, if it results in better code.

Optimizing Haskell compilers (GHC) perform Strictness analysis to figure out, which values will always be computed.

In the beginning, the compiler has to assume, that none of a function's arguments are ever used. Then it goes over the body of the function and tries to find functions applications that 1) are known to be strict in (at least some of) their arguments and 2) always have to be evaluated to compute the function's result.

In your example, we have the function (+) that is strict in both it's arguments. Thus the compiler knows that both x and y are always required to be evaluated at this point. Now it just so happens, that the expression x+y is always necessary to compute the function's result, therefore the compiler can store the information that the function add is strict in both x and y.

The generated code for add* will thus expect integer values as parameters and not thunks. The algorithm becomes much more complicated when recursion is involved (a fixed point problem), but the basic idea remains the same.

Another example:

mkList x y = 
    if x then y : []
         else []

This function will take x in evaluated form (as a boolean) and y as a thunk. The expression x needs to be evaluated in every possible execution path through mkList, thus we can have the caller evaluate it. The expression y, on the other hand, is never used in any function application that is strict in it's arguments. The cons-function : never looks at y it just stores it in a list. Thus y needs to be passed as a thunk in order to satisfy the lazy Haskell semantics.

mkList False undefined -- absolutely legal

*: add is of course polymorphic and the exact type of x and y depends on the instantiation.

Short answer: Yes.

Long answer:

val = 5

This has to be stored in a thunk, because imagine if we wrote this anywhere in our code (like, in a library we imported or something):

val = undefined

If this has to be evaluated when our program starts, it would crash, right? If we actually use that value for something, that would be what we want, but if we don't use it, it shouldn't be able to influence our program so catastrophically.

For your second example, let me change it a little:

div x y = x / y

This value has to be stored in a thunk as well, because imagine some code like this:

average list =
  if null list
     then 0
     else div (sum list) (length list)

If div was strict here, it would be evaluated even when the list is null (aka. empty), meaning that writing the function like this wouldn't work, because it wouldn't have a chance to return 0 when given the empty list, even though this is what we would want in this case.

Your final example is just a variation of example 1, and it has to be lazy for the same reasons.

All this being said, it is possible to force the compiler to make some values strict, but that goes beyond the scope of this question.

I think the others answered your question nicely, but just for completeness's sake let me add that GHC offers you the possibility of using unboxed values directly as well. This is what Haskell Wiki says about it:

When you are really desperate for speed, and you want to get right down to the “raw bits.” Please see GHC Primitives for some information about using unboxed types.

This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments. Strict and unpacked constructor fields can also help a lot. Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.

One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.

As mentioned, it's non-portable, so you need a GHC language extension. See here for their docs.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow