Writing “wc -l” using Iteratee library - how to filter for newline?

https://stackoverflow.com/questions/7986307

20-02-2021
|

Question

I am trying to come up with equivalent of "wc -l" using Haskell Iteratee library. Below is the code for "wc" (which just counts the words - similar to the code in iteratee example on hackage), and runs very fast:

{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString


length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
length1 = liftI (step 0)
  where
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs))
    step !i stream     = idone i stream
{-# INLINE length1 #-}
main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
  {- Time measured on a linux x86 box: 
  $ time ./test ## above haskell compiled code
  4950996

  real    0m0.013s
  user    0m0.004s
  sys     0m0.007s

  $  time wc -c /usr/share/dict/words
  4950996 /usr/share/dict/words

  real    0m0.003s
  user    0m0.000s
  sys     0m0.002s
  -}

Now, how do you extend it to count the number of lines that too runs fast? I did a version using Prelude.filter to filter only "\n" to length but it is slower than linux "wc -l" because of too much memory, and gc (lazy evaluation, I guess). So, I wrote another version using Data.ListLike.filter but it won't compile because it doesn't type check - help here would be appreciated:

  {-# LANGUAGE BangPatterns #-}
  import Data.Iteratee as I
  import Data.ListLike as LL
  import Data.Iteratee.IO
  import Data.ByteString
  import Data.Char
  import Data.ByteString.Char8 (pack)

  numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
  numlines = liftI $ step 0
    where
      step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x ->  x == Data.ByteString.Char8.pack "\n")  xs))
      step !i stream = idone i stream
  {-# INLINE numlines #-}

  main = do
    i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
    result <- run i'
    print result

Solution

There are a lot of good answers already; I have very little to offer performance-wise but a few style points.

First, I would write it this way:

import Prelude as P
import Data.Iteratee
import qualified Data.Iteratee as I
import qualified Data.Iteratee.IO as I
import qualified Data.ByteString as B
import Data.Char
import System.Environment

-- numLines has a concrete stream type so it's not necessary to provide an
-- annotation later.  It could have a more general type.
numLines :: Monad m => I.Iteratee B.ByteString m Int
numLines = I.foldl' step 0
 where
  --step :: Int -> Word8 -> Int
  step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc

main = do
  f:_   <- getArgs
  words <- run =<< I.enumFile 65536 f numLines
  print words

The biggest difference is that this uses Data.Iteratee.ListLike.foldl'. Note that only the individual stream elements matter to the step function, not the stream type. It's exactly the same function as you would use with e.g. Data.ByteString.Lazy.foldl'.

Using foldl' also means that you don't need to manually write iteratees with liftI. I would discourage users from doing so unless absolutely necessary. The result is usually longer and harder to maintain with little to no benefit.

Finally, I've increased the buffer size significantly. On my system this is marginally faster than enumerators default of 4096, which is again marginally faster (with iteratee) than your choice of 1024. YMMV with this setting of course.

OTHER TIPS

So I did some experimenting and I got a wc -l that is only about twice as slow as "wc -l" This is better performance than even the wc -c version shown above.

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import Control.Monad.IO.Class (liftIO)
import Data.Int

numlines :: Int64 -> E.Iteratee BS.ByteString IO ()
numlines n = do 
    chunk <- EB.take 1024
    case chunk of 
        "" -> do liftIO $ print n
                 return ()
        a -> do let ct = BSL.count '\n' a
                numlines (n+ct)

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 0
    E.run_ i

Running it vs. native:

Eriks-MacBook-Air:skunk erikhinton$ time wc -l "/usr/share/dict/words"
  235886 /usr/share/dict/words

real    0m0.009s
user    0m0.006s
sys 0m0.002s
Eriks-MacBook-Air:skunk erikhinton$ time ./wcl
235886

real    0m0.019s
user    0m0.013s
sys 0m0.005s

[EDIT]

Here's an even faster, smaller footprint and far more concise/expressive way of doing it. These enumerators are starting to get fun.

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import qualified Data.Enumerator.List as EL
import Control.Monad.IO.Class (liftIO)
import Data.Int

numlines :: E.Iteratee BS.ByteString IO ()
numlines = do
           num <- EL.fold (\n b -> (BS.count '\n' b) + n ) 0
           liftIO . print $ num

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines
    E.run_ i

And the timing

Eriks-MacBook-Air:skunk erikhinton$ time ./wcl2
235886

real    0m0.015s
user    0m0.010s
sys 0m0.004s

If you're reading ByteString chunks, you can use the count function from Data.ByteString, the relevant step would then be

step !i (Chunk xs) = liftI (step $ i + count 10 xs)

(perhaps with a fromIntegral). Data.ByteString.count is pretty fast, that shouldn't be too much slower than wc -l.

I figured out how to fix the type error. The key to fixing type error is understanding the relationship between Data.ListLike.filter and ByteString input that is being passed to that filter. Here is the type of Data.ListLike.filter:

Data.ListLike.filter
:: Data.ListLike.Base.ListLike full item =>
 (item -> Bool) -> full -> full

full refers to the stream in the context of an enumerator/iteratee, if I understand it correctly. item refers to the element of the stream.

Now, if we want to filter on newline in the input file, we have to know the type of input file stream, and the type of elements in that stream. In this case, input file is being read as ByteString stream. ByteString is documented as a space-efficient representation of a Word8 vector. So, item type here is Word8.

So, when we write the filter, in the step function, we have to make sure that Bool operation is defined for Word8 since that is the type of the item being passed to the filter (as explained above). We are filtering for newline. So, the bool function like the one below which builds a Word8 representation of newline, and check for equality against x of type Word8, should work:

\x ->  x ==  Data.ByteString.Internal.c2w '\n'

There is still one more missing piece - for some reasons, the compiler (v7.0.3 Mac) is unable to deduce the type of el in numfile type signature (if anyone has ideas on why it is so, please do discuss). So, telling it explicitly that it is Word8 solves the compilation issue:

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a

Full code below - it compiles, and runs quite fast.

{-# LANGUAGE BangPatterns,FlexibleContexts #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
import GHC.Word (Word8)
import Data.ByteString.Internal (c2w)

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a
numlines = liftI $ step 0
  where
    step !i (Chunk xs) = let newline = c2w '\n' in liftI (step $i + fromIntegral (LL.length $ LL.filter (\x ->  x == newline) xs))
    step !i stream = idone i stream
{-# INLINE numlines #-}


main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
{- Time to run on mac OSX:

$ time ./test ## above compiled program: ghc --make -O2 test.hs
235886

real  0m0.011s
user  0m0.007s
sys 0m0.004s
$ time wc -l /usr/share/dict/words 
235886 /usr/share/dict/words

real  0m0.005s
user  0m0.002s
sys 0m0.002s
-}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow