Question

I was reading some random blog where someone tries to perform a simple string processing operation in Haskell and gets rather slow code. Some of the problems with his (final, down the page a ways) code:

  1. The whole file is read in at once.
  2. He uses the relatively expensive isSpace and then compares the resulting program to C code that only considers simple spaces and newlines.
  3. The way he uses scanl looks extremely pipeline-unfriendly, using a computed character as input to each step when that is not necessary.

The most natural approach, I think, is to use lazy ByteStrings (as some of his earlier attempts do) and to scrap the scanl in favor of zipWith', zipping the string with the string shifted over one: zipWith f s (cons ' ' s)
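
For concreteness, a minimal sketch of the zip-with-shift idea over lazy ByteStrings (using the list-producing zipWith plus pack rather than a fused zipWith'; the function name is mine):

import Data.Char (isSpace, toUpper)
import qualified Data.ByteString.Lazy.Char8 as L

-- Pair each character with its predecessor by zipping the string
-- against itself shifted one position to the right.
capitalizeZip :: L.ByteString -> L.ByteString
capitalizeZip s = L.pack (L.zipWith step (L.cons ' ' s) s)
  where
    step prev c = if isSpace prev then toUpper c else c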

The problem

Zipping a lazy ByteString with a shifted version of itself doesn't take advantage of the relationship between the two strings. It performs many unnecessary checks for end-of-chunk and end-of-string. I'm sure I could write a specialized function that traverses a ByteString with a two-character "window", and I'm sure a better programmer than I could write one that takes advantage of the details of the chunk representation, but I'd prefer to find a more accessible approach. Any ideas?

Edited to add: another approach might be to use foldr to produce a ByteString builder, following the same general approach but using (hopefully unboxed) tuples to avoid data dependency; I'm not sure I quite understand those builders or their efficiency.
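
For what it's worth, here is a rough, unverified sketch of that foldr-into-Builder idea, threading the previous character as a function argument rather than through a tuple (the names are mine; whether GHC unboxes anything here is untested):

import Data.Char (isSpace, toUpper)
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as L

-- foldr builds a continuation that receives the previous character;
-- each step emits one char8 into the Builder and passes the current
-- character along as the new "previous".
capitalizeF :: L.ByteString -> L.ByteString
capitalizeF s = B.toLazyByteString (L.foldr step (const mempty) s ' ')
  where
    step c k prev = B.char8 (if isSpace prev then toUpper c else c) <> k c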

Solution

I'll be using the following imports.

import Data.Char
import Data.List
import qualified Data.Text.Lazy as T

import Criterion.Main
import Test.QuickCheck

I managed to get blazing speeds compared to this reference implementation from the blog post:

capitalize :: T.Text -> T.Text
capitalize = T.tail . T.scanl (\a b -> if isSpace a then toUpper b else b) ' '

Using mapAccumL is much faster. Here are String and Text versions.

-- The accumulator is the previous character; upper-case the current
-- character whenever the previous one was whitespace.
{-# INLINE f #-}
f :: Char -> Char -> (Char, Char)
f a b = (b, if isSpace a then toUpper b else b)

string :: String -> String
string = snd . mapAccumL f ' '

text :: T.Text -> T.Text
text = snd . T.mapAccumL f ' '

First, let's make sure that the optimization is valid:

λ. quickCheck $ \xs -> 
    capitalize (T.pack xs) == text (T.pack xs)
+++ OK, passed 100 tests.
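
Sketching the harness I used for timing (the lorem.txt file name and the Data.Text.Lazy.IO import are assumptions of mine):

import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
  txt <- TIO.readFile "lorem.txt"  -- hypothetical test corpus
  defaultMain
    [ bench "reference" $ nf capitalize txt
    , bench "string"    $ nf string (T.unpack txt)
    , bench "text"      $ nf text txt
    ]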

Now for some benchmark results from criterion, running each function on a 3.2 MB file of Lorem Ipsum. Here's our reference speed.

benchmarking reference
collecting 100 samples, 1 iterations each, in estimated 56.19690 s
mean: 126.4616 ms, lb 126.0039 ms, ub 128.6617 ms, ci 0.950
std dev: 4.432843 ms, lb 224.7290 us, ub 10.55986 ms, ci 0.950

The String version is only about 30% slower than the reference Text implementation, and the mapAccumL version over Text is almost twice as fast!

benchmarking string
collecting 100 samples, 1 iterations each, in estimated 16.45751 s
mean: 165.1451 ms, lb 165.0927 ms, ub 165.2112 ms, ci 0.950
std dev: 301.0338 us, lb 250.2601 us, ub 370.2991 us, ci 0.950

benchmarking text
collecting 100 samples, 1 iterations each, in estimated 16.88929 s
mean: 67.67978 ms, lb 67.65432 ms, ub 67.72081 ms, ci 0.950
std dev: 162.8791 us, lb 114.9346 us, ub 246.0348 us, ci 0.950

But there are even more easy gains to be had. Data.Char.isSpace is known for its performance issues, so let's try the fast Data.Attoparsec.Char8.isSpace instead. Our QuickCheck test won't pass, since the attoparsec version only recognizes ASCII whitespace, but the performance is great.
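
The change is just a different isSpace inside f; a sketch of the variant (f' is my name for it):

import qualified Data.Attoparsec.Char8 as A

{-# INLINE f' #-}
f' :: Char -> Char -> (Char, Char)
f' a b = (b, if A.isSpace a then toUpper b else b)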

benchmarking string/atto
collecting 100 samples, 1 iterations each, in estimated 12.91881 s
mean: 129.2176 ms, lb 129.1328 ms, ub 129.4941 ms, ci 0.950
std dev: 705.3433 us, lb 238.2757 us, ub 1.568524 ms, ci 0.950

benchmarking text/atto
collecting 100 samples, 1 iterations each, in estimated 15.76300 s
mean: 38.63183 ms, lb 38.62850 ms, ub 38.63730 ms, ci 0.950
std dev: 21.41514 us, lb 15.27777 us, ub 33.98801 us, ci 0.950

We're now about 3x faster than the original reference. For comparison, the very fast Python code (which just calls out to C),

print open('lorem.txt').read().title()

rips through the text file in 30ms.

OTHER TIPS

Lazy I/O can be a problem, but it is the simplest way to approach this small task.

import Data.Text.Lazy (toTitle)
import Data.Text.Lazy.IO (readFile, putStr)
import Prelude hiding (readFile, putStr)

main = readFile "file" >>= putStr . toTitle

It will actually spend time doing the Unicode word-splitting and title-casing correctly, but that's probably what you want. If you want to avoid lazy I/O, the pipes-text package should produce something that's not much bigger.

If you really want to treat everything as ASCII and assume all the words start with a letter, I still think lazy I/O is a win here, but it's a bit more complex.

import Data.Bits ((.&.), complement)
import Data.ByteString.Lazy (ByteString, cons', readFile, uncons)
import Data.ByteString.Lazy.Char8 (lines, putStrLn, unlines, unwords, words)
import Prelude hiding (putStrLn, readFile, lines, unlines, unwords, words)

capitalize :: ByteString -> ByteString
capitalize word = case uncons word of
  -- Clearing bit 0x20 upper-cases an ASCII letter (and leaves
  -- already-upper-case letters alone).
  Just (h, t) -> cons' (h .&. complement 32) t
  Nothing     -> word

main = readFile "file"
   >>= putStrLn . unlines
                . map (unwords . map capitalize . words)
                . lines

Again, avoiding lazy I/O is as simple as using pipes-bytestring.

There was also a Reddit thread about that post, and they seem to get great performance out of the Builder abstraction, plus a better way of upper-casing. The Builder abstraction will probably be faster than my bytestring hack because it chunks the output data better before writing it.
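
Roughly along those lines, here is an unverified sketch that combines a Builder with the bit-trick upper-casing (the function name and the ASCII whitespace test are my own):

import Data.Bits ((.&.), complement)
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Fold over the bytes, emitting each (possibly upper-cased) byte into
-- a Builder, which re-chunks the output before it is written out.
capitalizeAscii :: BL.ByteString -> BL.ByteString
capitalizeAscii = B.toLazyByteString . snd . BL.foldl' step (32, mempty)
  where
    isAsciiSpace :: Word8 -> Bool
    isAsciiSpace w = w == 32 || w == 10 || w == 9 || w == 13

    upper :: Word8 -> Word8
    upper w | w >= 97 && w <= 122 = w .&. complement 32  -- clear the case bit
            | otherwise           = w

    step (prev, acc) w =
      (w, acc <> B.word8 (if isAsciiSpace prev then upper w else w))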
