How to Lex, Parse, and Serialize-to-XML Email Messages using Alex and Happy

Question

Split the task into five steps:

Step #1: specify the complete, authoritative BNF for the From Header

Step #2: create a lexical analysis function, lex, that breaks the From Header into a sequence of small chunks, such as from:, displayName, and angleAddress. These small chunks are called tokens.

lex :: String -> [Token]

Step #3: define a data type, From, to represent the From Header

Step #4: create a parser function, parse, that consumes the sequence of tokens and produces a parse tree of type From

parse :: [Token] -> From

Step #5: create a function, serialize, that walks the parse tree and generates XML

serialize :: From -> String

Step #1: specify the complete, authoritative BNF for the From Header

The complete, authoritative BNF for the From header is specified in RFC 5322. I extracted the portions applicable to the From header:

http://www.xfront.com/parsing/RFC-5322/From-Header/From-Header-BNF.pdf

Step #2: create a lexer that breaks up the From Headers into tokens

Here is an example that shows how From headers will be tokenized:

Tokenize this From header:

From: "John Doe" <john@doe.org>

The output of the lexer is this list of tokens:

[ 
  TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
]

Each item in the list consists of a constructor that names the token, position information, and optionally a value (the matched text). The position information is the parenthesized part: AlexPn is the constructor of Alex's position type, and its three numbers are the absolute character offset from the start of the input (counting from 0), the line number, and the column number (both counting from 1).
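To make the shape of these tokens concrete, here is a minimal sketch of a token type that would produce the output above. The AlexPosn declaration mirrors what Alex's "posn" wrapper generates; the Token constructors are inferred from the example output, and tokenPosn and exampleTokens are hypothetical names introduced only for illustration. The real lexer linked below may define more tokens.

```haskell
-- Position type in the shape Alex's "posn" wrapper generates:
-- absolute character offset, line number, column number.
data AlexPosn = AlexPn !Int !Int !Int
    deriving (Eq, Show)

-- Assumed token type, inferred from the example lexer output.
data Token
    = TokenFrom AlexPosn
    | TokenDisplayName AlexPosn String
    | TokenAngleAddress AlexPosn String
    deriving (Eq, Show)

-- Hypothetical helper: extract the position of any token,
-- for example to report where a parse error occurred.
tokenPosn :: Token -> AlexPosn
tokenPosn (TokenFrom p)           = p
tokenPosn (TokenDisplayName p _)  = p
tokenPosn (TokenAngleAddress p _) = p

-- The token list from the example, written out by hand.
exampleTokens :: [Token]
exampleTokens =
    [ TokenFrom (AlexPn 0 1 1)
    , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
    , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
    ]
```

Note that the display name token starts at offset 6 and column 7: the six characters of "From: " precede it, and columns count from 1.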

Below is the lexer for the BNF. Observe the one-to-one mapping between the BNF and the token definitions. For example, the BNF has this production rule:

qcontent  = ( qtext  |  quoted-pair )

The lexer has this token definition:

@qcontent = ( $qtext | @quoted_pair )

Aside from minor syntactic differences, they are identical. That is really powerful: because the token definitions are transcribed almost verbatim from the BNF, a correct definition of the email “From header” (i.e., a correct BNF) gives us high confidence that the lexer is correct too.

Here is the lexer:

http://www.xfront.com/parsing/RFC-5322/From-Header/Lexer.x.txt

Step #3: define a data type to represent the From Header

The parser turns the sequence of tokens into a value of the following From data type, which is the internal representation of a From Header:

data From
    = From MailboxList
    deriving Show

type MailboxList
    = [ Mailbox ]

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show

data DisplayName
    = DisplayName String
    deriving Show

data AngleAddress
    = AngleAddress String
    deriving Show

data AddressSpecification
    = AddressSpecification String
    deriving Show

Step #4: create a parser -- consume the sequence of tokens and produce a parse tree of type "From"

Here is an example that shows how From Headers will be parsed:

Parse this From header:

From: "John Doe" <john@doe.org>

The output of the parser is this parse tree:

From 
    [
        LongMailbox 
            (DisplayName "John Doe") 
            (AngleAddress "john@doe.org")
    ]
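The step from the token list to this parse tree can be sketched by hand in plain Haskell. This is not the Happy-generated parser linked below, only an illustration under stated assumptions: the Token type is inferred from the lexer example, the names parseFrom, mailboxes, and stripDelims are hypothetical, and the sketch covers only the two mailbox forms whose tokens appear in this article (it omits ShortMailbox and any separator tokens between mailboxes).

```haskell
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)

-- Assumed token type, inferred from the lexer example.
data Token
    = TokenFrom AlexPosn
    | TokenDisplayName AlexPosn String
    | TokenAngleAddress AlexPosn String
    deriving (Eq, Show)

-- Subset of the From data type from Step #3.
data From = From [Mailbox] deriving (Eq, Show)

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    deriving (Eq, Show)

data DisplayName = DisplayName String deriving (Eq, Show)
data AngleAddress = AngleAddress String deriving (Eq, Show)

-- Drop the first and last character (the surrounding quotes or angle
-- brackets). Assumes the lexer guarantees the delimiters are present.
stripDelims :: String -> String
stripDelims s = drop 1 (take (length s - 1) s)

-- A From header must begin with the From: token.
parseFrom :: [Token] -> Either String From
parseFrom (TokenFrom _ : rest) = From <$> mailboxes rest
parseFrom _                    = Left "expected the From: token first"

-- Consume tokens mailbox by mailbox until the list is empty.
mailboxes :: [Token] -> Either String [Mailbox]
mailboxes [] = Right []
mailboxes (TokenDisplayName _ n : TokenAngleAddress _ a : rest) =
    (LongMailbox (DisplayName (stripDelims n)) (AngleAddress (stripDelims a)) :)
        <$> mailboxes rest
mailboxes (TokenAngleAddress _ a : rest) =
    (AngleMailbox (AngleAddress (stripDelims a)) :) <$> mailboxes rest
mailboxes (t : _) = Left ("unexpected token: " ++ show t)
```

Applied to the three example tokens, parseFrom yields a Right value holding the parse tree shown above. The quotes and angle brackets that the lexer kept in the token values are removed while the tree is built.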

Here is the parser:

http://www.xfront.com/parsing/RFC-5322/From-Header/Parser.y.txt

Step #5: walk the parse tree and add XML start-tag, end-tag pairs around values

There is a function for every grammar production. For example, here is the function for the From grammar production:

serialize :: From -> String
serialize (From mailboxList) = "<From>" ++ serializeMailboxList mailboxList ++ "</From>"

The function's argument is the root of the parse tree, which is labeled From. The function calls another function, serializeMailboxList, to process the children of the root. The result is wrapped in a <From> start-tag and a </From> end-tag.
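The same pattern continues down the tree. The following is a sketch of how the remaining functions might look, not the serializer linked below: the element names other than From, the helpers element and escape, and the function serializeMailbox are assumptions made for illustration. Escaping is included because an address or display name may contain characters that are special in XML.

```haskell
-- The From data type from Step #3.
data From = From MailboxList deriving Show
type MailboxList = [Mailbox]

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show

data DisplayName = DisplayName String deriving Show
data AngleAddress = AngleAddress String deriving Show
data AddressSpecification = AddressSpecification String deriving Show

-- Hypothetical helper: escape the characters that are special in XML text.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- Hypothetical helper: wrap a body in a start-tag, end-tag pair.
element :: String -> String -> String
element name body = "<" ++ name ++ ">" ++ body ++ "</" ++ name ++ ">"

serialize :: From -> String
serialize (From mailboxList) = element "From" (serializeMailboxList mailboxList)

-- One function per grammar production; the element names are assumed.
serializeMailboxList :: MailboxList -> String
serializeMailboxList = concatMap serializeMailbox

serializeMailbox :: Mailbox -> String
serializeMailbox (LongMailbox (DisplayName n) (AngleAddress a)) =
    element "Mailbox"
        (element "DisplayName" (escape n) ++ element "AngleAddress" (escape a))
serializeMailbox (AngleMailbox (AngleAddress a)) =
    element "Mailbox" (element "AngleAddress" (escape a))
serializeMailbox (ShortMailbox (AddressSpecification s)) =
    element "Mailbox" (element "AddressSpecification" (escape s))
```

Each function handles exactly one node type and delegates its children to the function for the child's type, so the serializer mirrors the grammar in the same way the lexer does.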

Here is the XML serializer:

http://www.xfront.com/parsing/RFC-5322/From-Header/serialize.hs.txt