Question

I was reading some random blog where someone tries to perform a simple string processing operation in Haskell and gets rather slow code. Some of the problems with his (final, down the page a ways) code:

  1. The whole file is read in at once.
  2. He uses the relatively expensive isSpace and then compares the resulting program to C code that only considers simple spaces and newlines.
  3. The way he uses scanl looks extremely pipeline-unfriendly, using a computed character as input to each step when that is not necessary.

The most natural approach, I think, is to use lazy ByteStrings (as some of his earlier attempts do) and to scrap the scanl in favor of zipWith', zipping the string with the string shifted over one: zipWith f s (cons ' ' s)
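
For concreteness, a minimal sketch of the zip-with-shift idea over lazy ByteStrings (using the list-producing zipWith plus pack rather than a fused zipWith'; the function name is mine):

import Data.Char (isSpace, toUpper)
import qualified Data.ByteString.Lazy.Char8 as L

-- Pair each character with its predecessor by zipping the string
-- against itself shifted one position to the right.
capitalizeZip :: L.ByteString -> L.ByteString
capitalizeZip s = L.pack (L.zipWith step (L.cons ' ' s) s)
  where
    step prev c = if isSpace prev then toUpper c else c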

The problem

Zipping a lazy ByteString with a shifted version of itself doesn't take advantage of the relationship between the two strings. It performs many unnecessary checks for end-of-chunk and end-of-string. I'm sure I could write a specialized function that traverses a ByteString with a two-character "window", and I'm sure a better programmer than I could write one that takes advantage of the details of the chunk representation, but I'd prefer to find a more accessible approach. Any ideas?

Edited to add: another approach might be to use foldr to produce a ByteString builder, following the same general approach but using (hopefully unboxed) tuples to avoid data dependency; I'm not sure I quite understand those builders or their efficiency.
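
For what it's worth, here is a rough, unverified sketch of that foldr-into-Builder idea, threading the previous character as a function argument rather than through a tuple (the names are mine; whether GHC unboxes anything here is untested):

import Data.Char (isSpace, toUpper)
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as L

-- foldr builds a continuation that receives the previous character;
-- each step emits one char8 into the Builder and passes the current
-- character along as the new "previous".
capitalizeF :: L.ByteString -> L.ByteString
capitalizeF s = B.toLazyByteString (L.foldr step (const mempty) s ' ')
  where
    step c k prev = B.char8 (if isSpace prev then toUpper c else c) <> k c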

Solution

I'll be using the following imports.

import Data.Char
import Data.List
import qualified Data.Text.Lazy as T

import Criterion.Main
import Test.QuickCheck

I managed to get blazing speeds compared to this reference implementation from the blog post:

capitalize :: T.Text -> T.Text
capitalize = T.tail . T.scanl (\a b -> if isSpace a then toUpper b else b) ' '

Using mapAccumL is much faster. Here are String and Text versions.

-- The accumulator is the previous character; upper-case the current
-- character whenever the previous one was whitespace.
{-# INLINE f #-}
f :: Char -> Char -> (Char, Char)
f a b = (b, if isSpace a then toUpper b else b)

string :: String -> String
string = snd . mapAccumL f ' '

text :: T.Text -> T.Text
text = snd . T.mapAccumL f ' '

First, let's make sure that the optimization is valid:

λ. quickCheck $ \xs -> 
    capitalize (T.pack xs) == text (T.pack xs)
+++ OK, passed 100 tests.
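
Sketching the harness I used for timing (the lorem.txt file name and the Data.Text.Lazy.IO import are assumptions of mine):

import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
  txt <- TIO.readFile "lorem.txt"  -- hypothetical test corpus
  defaultMain
    [ bench "reference" $ nf capitalize txt
    , bench "string"    $ nf string (T.unpack txt)
    , bench "text"      $ nf text txt
    ]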

Now for some benchmark results from criterion, running each function on a 3.2 MB file of Lorem Ipsum. Here's our reference speed.

benchmarking reference
collecting 100 samples, 1 iterations each, in estimated 56.19690 s
mean: 126.4616 ms, lb 126.0039 ms, ub 128.6617 ms, ci 0.950
std dev: 4.432843 ms, lb 224.7290 us, ub 10.55986 ms, ci 0.950

The String version is only about 30% slower than the reference Text implementation, and the mapAccumL version over Text is almost twice as fast!

benchmarking string
collecting 100 samples, 1 iterations each, in estimated 16.45751 s
mean: 165.1451 ms, lb 165.0927 ms, ub 165.2112 ms, ci 0.950
std dev: 301.0338 us, lb 250.2601 us, ub 370.2991 us, ci 0.950

benchmarking text
collecting 100 samples, 1 iterations each, in estimated 16.88929 s
mean: 67.67978 ms, lb 67.65432 ms, ub 67.72081 ms, ci 0.950
std dev: 162.8791 us, lb 114.9346 us, ub 246.0348 us, ci 0.950

But there are even more easy gains to be had. Data.Char.isSpace is known for its performance issues, so let's try the fast Data.Attoparsec.Char8.isSpace instead. Our QuickCheck test won't pass, since the attoparsec version only recognizes ASCII whitespace, but the performance is great.
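
The change is just a different isSpace inside f; a sketch of the variant (f' is my name for it):

import qualified Data.Attoparsec.Char8 as A

{-# INLINE f' #-}
f' :: Char -> Char -> (Char, Char)
f' a b = (b, if A.isSpace a then toUpper b else b)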

benchmarking string/atto
collecting 100 samples, 1 iterations each, in estimated 12.91881 s
mean: 129.2176 ms, lb 129.1328 ms, ub 129.4941 ms, ci 0.950
std dev: 705.3433 us, lb 238.2757 us, ub 1.568524 ms, ci 0.950

benchmarking text/atto
collecting 100 samples, 1 iterations each, in estimated 15.76300 s
mean: 38.63183 ms, lb 38.62850 ms, ub 38.63730 ms, ci 0.950
std dev: 21.41514 us, lb 15.27777 us, ub 33.98801 us, ci 0.950

We're now about 3x faster than the original reference. For comparison, the very fast Python code (which just calls out to C),

print open('lorem.txt').read().title()

rips through the text file in 30ms.

OTHER TIPS

Lazy I/O can be a problem, but it is the simplest way to approach this small task.

import Data.Text.Lazy (toTitle)
import Data.Text.Lazy.IO (readFile, putStr)
import Prelude hiding (readFile, putStr)

main = readFile "file" >>= putStr . toTitle

It will actually spend time doing the Unicode word-splitting and title-casing correctly, but that's probably what you want. If you want to avoid lazy I/O, the pipes-text package should produce something that's not much bigger.

If you really want to treat everything as ASCII and assume all the words start with a letter, I still think lazy I/O is a win here, but it's a bit more complex.

import Data.Bits ((.&.), complement)
import Data.ByteString.Lazy (ByteString, cons', readFile, uncons)
import Data.ByteString.Lazy.Char8 (lines, putStrLn, unlines, unwords, words)
import Prelude hiding (putStrLn, readFile, lines, unlines, unwords, words)

capitalize :: ByteString -> ByteString
capitalize word = case uncons word of
  -- Clearing bit 0x20 upper-cases an ASCII letter (and leaves
  -- already-upper-case letters alone).
  Just (h, t) -> cons' (h .&. complement 32) t
  Nothing     -> word

main = readFile "file"
   >>= putStrLn . unlines
                . map (unwords . map capitalize . words)
                . lines

Again, avoiding lazy I/O is as simple as using pipes-bytestring.

There was also a Reddit thread about that post, and they seem to get great performance out of the Builder abstraction, plus a better way of upper-casing. The Builder abstraction will probably be faster than my bytestring hack because it chunks the output data better before writing it.
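
Roughly along those lines, here is an unverified sketch that combines a Builder with the bit-trick upper-casing (the function name and the ASCII whitespace test are my own):

import Data.Bits ((.&.), complement)
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Fold over the bytes, emitting each (possibly upper-cased) byte into
-- a Builder, which re-chunks the output before it is written out.
capitalizeAscii :: BL.ByteString -> BL.ByteString
capitalizeAscii = B.toLazyByteString . snd . BL.foldl' step (32, mempty)
  where
    isAsciiSpace :: Word8 -> Bool
    isAsciiSpace w = w == 32 || w == 10 || w == 9 || w == 13

    upper :: Word8 -> Word8
    upper w | w >= 97 && w <= 122 = w .&. complement 32  -- clear the case bit
            | otherwise           = w

    step (prev, acc) w =
      (w, acc <> B.word8 (if isAsciiSpace prev then upper w else w))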
