Question

In order to compare performance with lists being slow in this GHC bug I'm trying to get the following loop as fast as possible:

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Control.Monad
import Data.Word


main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      print (fromIntegral i / fromIntegral (maxBound :: Word32))


loop :: Word32 -> (Word32 -> IO ()) -> IO ()
loop n f = go 0
  where
    go !i | i == n = return ()
    go !i          = f i >> go (i + 1)

compiled with ghc -O loop.hs.

However, running this takes 50 seconds on my computer - 10 times slower than the equivalent C program:

#include "limits.h"
#include "stdint.h"
#include "stdio.h"

int main(int argc, char const *argv[])
{
  for (uint32_t i = 0; i < UINT_MAX; ++i)
  {
    if (i % 100000000 == 0) printf("%f\n", (float) i / (float) UINT_MAX );
  }
  return 0;
}

compiled with gcc -O2 -std=c99 -o testc test.c.


Using the freshly released GHC 7.8 or using -O2 did not improve the performance.

However, compiling with the -fllvm flag (on either ghc version) brought a 10x speed improvement, bringing the performance on par with C.

Questions:

  1. Why is GHC's native codegen so much slower for my loop?
  2. Is there a way to improve my loop so that it is fast also without -fllvm, or is this already the fastest IO loop over Word32 one can achive?
Was it helpful?

Solution

Let's inspect the assembly. I modified the main function a bit so that the output becomes a bit clearer (but the performance remains identical). I used GHC 7.8.2 with -O2.

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

There is a lot of clutter, so I try to only include the interesting parts:

Native Codegen

Main_zdwa_info:
.Lc3JD: /* check if there's enough space for stack growth */
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc3JO /* this jumps to some GC code that grows the stack, then
                 reenters the main closure */
.Lc3JP:
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax /* issue: unnecessary moves */
    movq %rax,%rbx      
    movq %r14,%rax
    xorq %rdx,%rdx
    divq %rbx /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU
.Lc3JV: 
   /* do the printing. Code omitted. */
.Lc3JN:
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

LLVM

 Main_zdwa_info:
/* code omitted: the same stack-checking stuff as in native */
.LBB1_1:
    movl    $4294967295, %esi /* load the bound */
    movabsq $-6067343680855748867, %rdi /*load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:              
    incl    %ecx
.LBB1_2:  
    cmpq    %rsi, %rcx
    je  .LBB1_6 /* check bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4 
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:                       
    /* Code omitted: return from main */

Observations

  • IO overhead is nonexistent in both assemblies. The zero-byte RealWorld state token is conspicuously absent.

  • Native codegen doesn't do much strength reduction, in contrast to LLVM, which readily converts the modulus into multiplication, shift and magic numbers.

  • Native codegen redoes the stack space checking on each iteration, while LLVM doesn't. It doesn't seem to be a significant overhead, however.

  • Native codegen is just plain bad here at looping and register allocation. It shuffles around registers and loads the bound on each iteration. LLVM emits code comparable to hand-written code in tidiness.

As to your question:

Is there a way to improve my loop so that it is fast also without -fllvm, or is this >already the fastest IO loop over Word32 one can achieve?

The best you can do here is manual strength reduction, I think, though I personally find that option unacceptable. However, after doing that your code will be still significantly slower. I also ran the following trivial loop, and it's twice as fast with LLVM than with native:

import Data.Word
main = go 0 where
    go :: Word32 -> IO ()
    go i | i == maxBound = return ()
    go i = go (i + 1)

The culprit is again unnecessary register-shuffling and bound-loading. There isn't really any way to remedy these kind of low level issues, aside from switching to LLVM.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top