Parsing PPM images in Haskell

Question

Lets start with question 2 because it is easier to answer. Your approach is correct: as you parse things, you remove those characters from the input string, and return a tuple containing the result of the parse, and the remaining string. However, thereis no reason to write all this from scratch (except perhaps as an academic exercise) - there are plenty of parsers which will take care of this issue for you. The one I will use is Parsec. If you are new to monadic parsing you should first read the section on Parsec in RWH.

As for question 1, if you use ByteString instead of String, then parsing single bytes is easy since single bytes are the atomic elements of ByteStrings!

There is also the issue of the Char/ByteString interface. With Parsec, this is a non-issue since you can treat a ByteString as a sequence of Byte or Char - we will see this later.

I decided to just write the full parser - this is a very simple language so with all the primitives defined for you in the Parsec library, it is very easy and very concise.

The file header:

import Text.Parsec.Combinator
import Text.Parsec.Char
import Text.Parsec.ByteString
import Text.Parsec 
import Text.Parsec.Pos

import Data.ByteString (ByteString, pack)
import qualified Data.ByteString.Char8 as C8

import Control.Monad (replicateM)
import Data.Monoid

First, we write the 'primitive' parsers - that is, parsing bytes, parsing textual numbers, and parsing whitespace (which the PPM format uses as a seperator):

parseIntegral :: (Read a, Integral a) => Parser a
parseIntegral = fmap read (many1 digit)

digit parses a single digit - you'll notice that many function names explain what the parser does - and many1 will apply the given parser 1 or more times. Then we read the resulting string to return an actual number (as opposed to a string). In this case, the input ByteString is being treated as text.

parseByte :: Integral a => Parser a
parseByte = fmap (fromIntegral . fromEnum) $ tokenPrim show (\pos tok _ -> updatePosChar pos tok) Just

For this parser, we parse a single Char - which is really just a byte. It is just returned as a Char. We could safely make the return type Parser Word8 because the universe of values that can be returned is [0..255]

whitespace1 :: Parser ()
whitespace1 = many1 (oneOf "\n ") >> return ()

oneOf takes a list of Char and parses any one of the characters in the order given - again, the ByteString is being treated as Text.

Now we can write the parser for the header.

parseHeader :: Parser Header 
parseHeader = do
  f <- choice $ map try $ 
         [string "P3" >> return TextualBitmap
         ,string "P6" >> return BinaryBitmap]
  w <- whitespace1 >> parseIntegral
  h <- whitespace1 >> parseIntegral
  d <- whitespace1 >> parseIntegral
  return $ Header f w h d

A few notes. choice takes a list of parsers and tries them in order. try p takes the parser p, and 'remembers' the state before p starts parsing. If p succeeds, then try p == p. If p fails, then the state before p started is restored and you pretend you never tried p. This is necessary due to how choice behaves.

For the pixels, we have two choices as of now:

parseTextual :: Header -> Parser [Pixel]
parseTextual h = do
  xs <- replicateM (3 * width h * height h) (whitespace1 >> parseIntegral)
  return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs

We could use many1 (whitespace 1 >> parseIntegral) - but this wouldn't enforce the fact that we know what the length should be. Then, converting the list of numbers to a list of pixels is trivial.

For binary data:

parseBinary :: Header -> Parser [Pixel]
parseBinary h = do
  whitespace1
  xs <- replicateM (3 * width h * height h) parseByte
  return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs

Note how the two are almost identical. You could probably generalize this function (it would be especially useful if you decided to parse the other types of pixel data - monochrome and greyscale).

Now to bring it all together:

parsePPM :: Parser PPM
parsePPM = do
  h <- parseHeader
  fmap (PPM h) $ 
       case format h of
         TextualBitmap -> parseTextual h
         BinaryBitmap  -> parseBinary  h

This should be self-explanatory. Parse the header, then parse the body based on the format. Here are some examples to try it on. They are the ones from the specification page.

example0 :: ByteString
example0 = C8.pack $ unlines 
  ["P3"
  , "4 4"
  , "15"
  , " 0  0  0    0  0  0    0  0  0   15  0 15"
  , " 0  0  0    0 15  7    0  0  0    0  0  0"
  , " 0  0  0    0  0  0    0 15  7    0  0  0"
  , "15  0 15    0  0  0    0  0  0    0  0  0" ]

example1 :: ByteString
example1 = C8.pack ("P6 4 4 15 ") <> 
  pack [0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 15, 0, 0, 0, 0, 15, 7, 
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 7, 0, 0, 0, 15,
        0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

Several notes: this doesn't handle comments, which are part of the spec. The error messages are not very useful; you can use the <?> function to create your own error messages. The spec also indicates 'The lines should not be longer than 70 characters.' - this is also not enforced.

edit:

Just because you see do-notation, doesn't necessarily mean that you are working with impure code. Some monads (like this parser) are still pure - they are just used for convenience. For example, you can write your parser with the type parser :: String -> (a, String), or, what we have done here, is we use a new type: data Parser a = Parser (String -> (a, String)) and have parser :: Parser a; we then write a monad instance for Parser to get the useful do-notation. To be clear, Parsec supports monadic parsing, but our parser is not monadic - or rather, uses the Identity monad, which is just newtype Identity a = Identity { runIdentity :: a }, and is only necessary because if we used type Identity a = a we would have 'overlapping instances' errors everywhere, which is not good.

>:i Parser
type Parser = Parsec ByteString ()
        -- Defined in `Text.Parsec.ByteString'
>:i Parsec
type Parsec s u = ParsecT s u Data.Functor.Identity.Identity
        -- Defined in `Text.Parsec.Prim'

So then, the type of Parser is really ParsecT ByteString () Identity. That is, the parser input is ByteString, the user state is () - which just means we aren't using the user state, and the monad in which we are parsing is Identity. ParsecT is itself just a newtype of:

forall b.
    State s u
    -> (a -> State s u -> ParseError -> m b)
    -> (ParseError -> m b)
    -> (a -> State s u -> ParseError -> m b)
    -> (ParseError -> m b)
    -> m b

All those functions in the middle are just used to pretty-print errors. If you are parsing 10's of thousands of characters and an error occurs, you won't be able to just look at it and see where that happened - but Parsec will tell you the line and column. If we specialize all the types to our Parser, and pretend that Identity is just type Identity a = a, then all the monads disappear and you can see that the parser is not impure. As you can see, Parsec is a lot more powerful than is required for this problem - I just used it due to familiarity, but if you were willing to write your own primitive functions like many and digit, then you could get away with using newtype Parser a = Parser (ByteString -> (a, ByteString)).