Question

Converting a non-negative Integer to its list of digits is commonly done like this:

import Data.Char

digits :: Integer -> [Int]
digits = (map digitToInt) . show
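
For example, in GHCi:

λ> digits 1234
[1,2,3,4]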

I'm trying to find a more direct way to perform the task, without involving a string conversion, but I can't come up with anything faster.

What I've tried so far:

The baseline:

digits :: Int -> [Int]
digits = (map digitToInt) . show

This one came from another StackOverflow question:

digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)

Trying to roll my own:

digits3 :: Int -> [Int]
digits3 = reverse . revDigits3

revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
               (0, digit) -> [digit]
               (rest, digit) -> digit:(revDigits3 rest)

This one was inspired by showInt in Numeric:

digits4 :: Int -> [Int]
digits4 n0 = go n0 [] where
    go n cs
        | n < 10    =  n:cs
        | otherwise =  go q (r:cs)
        where
        (q,r) = n `quotRem` 10
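
As a quick sanity check (an illustrative GHCi session, not from the original post), all four versions agree on a sample input:

λ> map ($ 1234) [digits, digits2, digits3, digits4]
[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]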

Now, the benchmarks. Note: I'm forcing evaluation using filter.

λ>:set +s
λ>length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)

This is the baseline. Now, for digits2:

λ>length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)

That's 3.46 times longer.

λ>length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)

digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3, avoiding the reverse.

λ>length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)

Strangely, this is even slower: 5.24 times slower.

And the last one:

λ>length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)

10.43 times slower.

I was under the impression that using only arithmetic and cons would outperform anything involving a string conversion. Apparently there's something I fail to grasp.

So, what's the deal? Why is digits so fast?

I'm using GHC 6.12.3.


Solution

Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)


Analysis:

Version 1: show is pretty good for Ints, especially ones as short as we have here. Making strings actually tends to be decent in GHC; however, reading strings and writing large strings to files (or stdout, although you wouldn't want to do that) are where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.

Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10)) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the roughly 5.5 GB allocated in the heap in the data below) and slower computations. (Long story short: don't use reverse.)
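
To see that strictness in isolation, here's an illustrative GHCi experiment (mine, not part of the benchmark): without reverse, the pipeline can yield its first element immediately; with reverse, nothing is produced until the end of the list is reached.

λ> head (iterate (`div` 10) 123)            -- returns 123 immediately
123
λ> head (reverse (iterate (`div` 10) 123))  -- diverges: reverse needs the entire (here infinite) list

In digits2 the takeWhile keeps the list finite, but reverse still has to traverse all of it before the first (`mod` 10) can run.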

Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).
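
For illustration, here is a sketch (my code, not the original poster's) of a reverse-free variant that fixes the direction by consing onto an accumulator, so the digits come out most-significant first with no reverse:

digits3' :: Int -> [Int]
digits3' n0 = go n0 []
  where
    -- Peel off the least-significant digit and cons it onto the front
    -- of the accumulator, so the final list is already in order.
    go n acc = case n `divMod` 10 of
      (0, d)    -> d : acc
      (rest, d) -> go rest (d : acc)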

Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem computes the quotient and the remainder in a single operation (on most hardware, one division instruction yields both), so you don't pay for two separate divisions per digit. Furthermore, you cons onto the list as discussed above, so you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to consume it. As you can see, the performance benefits from this.

This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.
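
As an aside (not from the original post), the same quotRem-plus-accumulator pattern generalizes to an arbitrary base; a minimal sketch, with digitsBase a hypothetical name:

-- Hypothetical generalization of digits4 to any base >= 2.
digitsBase :: Int -> Int -> [Int]
digitsBase base n0 = go n0 []
  where
    go n cs
      | n < base  = n : cs
      | otherwise = go q (r : cs)
      where (q, r) = n `quotRem` base

-- digitsBase 10 1234 == [1,2,3,4]
-- digitsBase 2  5    == [1,0,1]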

Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also, compile with the -O3 flag: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.


Data:

I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.

Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.

Dumping to -sstderr gives us output that looks like this, in the case of digits1:

digits1 +RTS -sstderr
12000000
   2,885,827,628 bytes allocated in the heap
         446,080 bytes copied during GC
           3,224 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5504 collections,     0 parallel,  0.06s,  0.03s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.61s  (  1.66s elapsed)
  GC    time    0.06s  (  0.03s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.67s  (  1.69s elapsed)

  %GC time       3.7%  (1.5% elapsed)

  Alloc rate    1,795,998,050 bytes per MUT second

  Productivity  96.3% of total user, 95.2% of total elapsed

There are three key statistics here:

  1. Total memory in use: only 1MB means this version is very space-efficient.
  2. Total time: 1.67s in total (1.61s of it mutator time) means nothing on its own, but we'll see how it looks against the other implementations.
  3. Productivity: this is just 100% minus time spent garbage collecting; since we're at 96.3%, we're not creating a lot of objects that we leave lying around in memory.

Alright, let's move on to version 2.

digits2 +RTS -sstderr
12000000
   5,512,869,824 bytes allocated in the heap
       1,312,416 bytes copied during GC
           3,336 bytes maximum residency (1 sample(s))
          13,048 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 10515 collections,     0 parallel,  0.06s,  0.04s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.20s  (  3.25s elapsed)
  GC    time    0.06s  (  0.04s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.26s  (  3.29s elapsed)

  %GC time       1.9%  (1.2% elapsed)

  Alloc rate    1,723,838,984 bytes per MUT second

  Productivity  98.1% of total user, 97.1% of total elapsed

Alright, so we're seeing an interesting pattern.

  1. The same amount of memory used. This suggests it's a pretty good implementation, although we may need to test on larger sample inputs to see a difference.
  2. It takes twice as long. We'll come back to some speculation as to why this is later.
  3. It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't tell us anything significant.

Version 3:

digits3 +RTS -sstderr
12000000
   3,231,154,752 bytes allocated in the heap
         832,724 bytes copied during GC
           3,292 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  6163 collections,     0 parallel,  0.02s,  0.02s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.09s  (  2.08s elapsed)
  GC    time    0.02s  (  0.02s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.11s  (  2.10s elapsed)

  %GC time       0.7%  (1.0% elapsed)

  Alloc rate    1,545,701,615 bytes per MUT second

  Productivity  99.3% of total user, 99.3% of total elapsed

Alright, so we're seeing some strange patterns.

  1. We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
  2. We're not quite at digits1, but we've got digits2 beat pretty easily.
  3. Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)

And finally, version 4:

digits4 +RTS -sstderr
12000000
   1,347,856,636 bytes allocated in the heap
         270,692 bytes copied during GC
           3,180 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  2570 collections,     0 parallel,  0.00s,  0.01s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.09s  (  1.08s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.09s  (  1.09s elapsed)

  %GC time       0.0%  (0.8% elapsed)

  Alloc rate    1,234,293,036 bytes per MUT second

  Productivity 100.0% of total user, 100.5% of total elapsed

Wowza! Let's break it down:

  1. We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
  2. We cut off about 32% of our original time, which is pretty impressive.
  3. I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.

Other tips

Answering the question "why rem instead of mod?" from the comments: when dealing with positive values, rem x y == mod x y, so the only consideration of note is performance:

> import Test.QuickCheck
> quickCheck (\x y -> x > 0 && y > 0 ==> x `rem` y == x `mod` y)

So what about the performance? Unless you have a good reason not to (and being lazy isn't a good reason, nor is not knowing Criterion), use a proper benchmarking tool. I used Criterion:

$ cat useRem.hs 
import Criterion
import Criterion.Main

list :: [Integer]
list = [1..10000]

main = defaultMain
        [ bench "mod" (nf (map (`mod` 7)) list)
        , bench "rem" (nf (map (`rem` 7)) list)
        ]
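
To reproduce, compile with optimizations (this assumes the criterion package is installed):

$ ghc -O2 --make useRem.hs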

Running this shows rem is measurably better than mod (compiled with -O2):

$ ./useRem 
...
benchmarking mod
...
mean: 590.4692 us, lb 589.2473 us, ub 592.1766 us, ci 0.950

benchmarking rem
...
mean: 394.1580 us, lb 393.2415 us, ub 395.4184 us, ci 0.950
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow