Split the task into five steps:
Step #1: specify the complete, authoritative BNF for the From Header
Step #2: create a lexical analysis function, lex, that breaks the From Header into a sequence of small chunks, such as from:, displayName, angleAddress, and so on. These small chunks are called tokens.
lex :: String -> [Token]
Step #3: define a data type, From, to represent the From Header
Step #4: create a parser function, parse, that consumes the sequence of tokens and produces a parse tree of type From
parse :: [Token] -> From
Step #5: create a function, serialize, that walks the parse tree and generates XML
serialize :: From -> XML
Step #1: specify the complete, authoritative BNF for the data format
The complete, authoritative BNF for the From header is specified in RFC 5322. I extracted the portions applicable to the From header:
http://www.xfront.com/parsing/RFC-5322/From-Header/From-Header-BNF.pdf
Step #2: create a lexer that breaks up the From Headers into tokens
Here is an example that shows how From headers will be tokenized:
Tokenize this From header:
From: "John Doe" <john@doe.org>
The output of the lexer is this list of tokens:
  [ TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
  ]
Each item in the list consists of a label for the token, position information, and then optionally a value. The position information is the part in parentheses. AlexPn is the constructor that marks position information. The three numbers that follow give the location of the token: the absolute character offset from the start of the input (counting from 0), the line number, and the column number (both counting from 1).
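To make the shape of these tokens concrete, here is a minimal sketch of a token type that would print like the list above. It is not taken from the linked Lexer.x: only the three constructors seen in the example are listed, and the helpers tokenPosn and showPosn and the name exampleTokens are hypothetical names introduced for illustration. The AlexPosn declaration mirrors the position record that Alex generates with its posn wrapper.

```haskell
-- Position record with the same shape as the one Alex's "posn" wrapper
-- generates: absolute character offset, line number, column number.
data AlexPosn = AlexPn !Int !Int !Int
  deriving (Eq, Show)

-- Hypothetical token type; only the constructors that appear in the
-- example output are listed here.
data Token
  = TokenFrom AlexPosn
  | TokenDisplayName AlexPosn String
  | TokenAngleAddress AlexPosn String
  deriving (Eq, Show)

-- Extract the position from any token (hypothetical helper).
tokenPosn :: Token -> AlexPosn
tokenPosn (TokenFrom p) = p
tokenPosn (TokenDisplayName p _) = p
tokenPosn (TokenAngleAddress p _) = p

-- Render a position as "line:column", e.g. for error messages.
showPosn :: AlexPosn -> String
showPosn (AlexPn _ line col) = show line ++ ":" ++ show col

-- The token list from the example, written as a Haskell value.
exampleTokens :: [Token]
exampleTokens =
  [ TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
  ]
```

Carrying the position in every token lets later stages report where in the header a problem occurred. For instance, map (showPosn . tokenPosn) exampleTokens yields ["1:1","1:7","1:18"].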
Below is the lexer for the BNF. Observe the one-to-one mapping between the BNF and the token definitions. For example, the BNF has this production rule:
qcontent = ( qtext | quoted-pair )
The lexer has this token definition:
@qcontent = ( $qtext | @quoted_pair )
Aside from minor syntactic differences, they are identical. That is really powerful: the lexer is a near-verbatim transcription of the BNF, so if the definition of the email “From header” is correct (i.e., the BNF is correct), we can be fairly confident that the lexer is correct too.
Here is the lexer:
http://www.xfront.com/parsing/RFC-5322/From-Header/Lexer.x.txt
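To illustrate what the qcontent rule recognizes, here is a rough hand-written equivalent in plain Haskell. It is not how the linked lexer is built (that one uses the token definitions shown above); it is only a sketch that follows the RFC 5322 definitions of qtext and quoted-pair, and it leaves out the obsolete alternatives (obs-qtext, obs-qp). The function names isQtext, isQuotable, and qcontent are hypothetical.

```haskell
import Data.Char (ord)

-- qtext = %d33 / %d35-91 / %d93-126
-- Printable US-ASCII except the double quote (34) and the backslash (92).
isQtext :: Char -> Bool
isQtext c = n == 33 || (n >= 35 && n <= 91) || (n >= 93 && n <= 126)
  where
    n = ord c

-- quoted-pair = "\" (VCHAR / WSP)
-- This predicate tests the character that follows the backslash:
-- a visible character, a space, or a horizontal tab.
isQuotable :: Char -> Bool
isQuotable c = (n >= 33 && n <= 126) || c == ' ' || c == '\t'
  where
    n = ord c

-- Consume one qcontent from the front of the input.
-- Returns the matched text and the remaining input, or Nothing.
qcontent :: String -> Maybe (String, String)
qcontent ('\\' : c : rest)
  | isQuotable c = Just (['\\', c], rest)
qcontent (c : rest)
  | isQtext c = Just ([c], rest)
qcontent _ = Nothing
```

The two alternatives of the production become the first two equations of qcontent, which is the same one-to-one correspondence the lexer exploits. A bare double quote is neither qtext nor the start of a quoted-pair, so qcontent "\"x" is Nothing, while qcontent "ab" is Just ("a","b").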
Step #3: define a data type to represent the From Header
The parse tree built from the sequence of tokens will be represented internally using this From data type:
data From
  = From MailboxList
  deriving Show

type MailboxList
  = [ Mailbox ]

data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving Show

data DisplayName
  = DisplayName String
  deriving Show

data AngleAddress
  = AngleAddress String
  deriving Show

data AddressSpecification
  = AddressSpecification String
  deriving Show
Step #4: create a parser -- consume the sequence of tokens and produce a parse tree of type "From"
Here is an example that shows how From Headers will be parsed:
Parse this From header:
From: "John Doe" <john@doe.org>
The output of the parser is this parse tree:
From
  [ LongMailbox
      (DisplayName "John Doe")
      (AngleAddress "john@doe.org")
  ]
Here is the parser:
http://www.xfront.com/parsing/RFC-5322/From-Header/Parser.y.txt
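The linked Parser.y is not reproduced here. As a rough stand-in, the following hand-written sketch shows the same token-to-tree step for the example above. Several things are assumptions made for illustration: the constructors TokenAddrSpec and TokenComma do not appear in the lexer example, the helper unwrap is hypothetical, and parse returns Either String From instead of From so that malformed input yields an error message rather than a crash. The sketch removes the surrounding quotes and angle brackets because the parse tree above shows the values without them.

```haskell
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)

-- The first three constructors appear in the lexer example;
-- TokenAddrSpec and TokenComma are assumed for this sketch.
data Token
  = TokenFrom AlexPosn
  | TokenDisplayName AlexPosn String
  | TokenAngleAddress AlexPosn String
  | TokenAddrSpec AlexPosn String
  | TokenComma AlexPosn
  deriving (Eq, Show)

data From = From MailboxList deriving (Eq, Show)
type MailboxList = [Mailbox]
data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving (Eq, Show)
data DisplayName = DisplayName String deriving (Eq, Show)
data AngleAddress = AngleAddress String deriving (Eq, Show)
data AddressSpecification = AddressSpecification String deriving (Eq, Show)

-- Remove one surrounding pair of delimiters if both are present.
unwrap :: Char -> Char -> String -> String
unwrap open close s = case s of
  (c : rest@(_ : _)) | c == open && last rest == close -> init rest
  _ -> s

-- from = "From:" mailbox-list
parse :: [Token] -> Either String From
parse (TokenFrom _ : rest) = From <$> mailboxList rest
parse _ = Left "expected From: at the start of the header"

-- mailbox-list = mailbox *("," mailbox)
mailboxList :: [Token] -> Either String MailboxList
mailboxList ts = do
  (m, rest) <- mailbox ts
  case rest of
    [] -> Right [m]
    TokenComma _ : more -> (m :) <$> mailboxList more
    _ -> Left "expected a comma or the end of the header"

-- One mailbox in any of its three forms, plus the unconsumed tokens.
mailbox :: [Token] -> Either String (Mailbox, [Token])
mailbox (TokenDisplayName _ n : TokenAngleAddress _ a : rest) =
  Right (LongMailbox (DisplayName (unwrap '"' '"' n)) (AngleAddress (unwrap '<' '>' a)), rest)
mailbox (TokenAngleAddress _ a : rest) =
  Right (AngleMailbox (AngleAddress (unwrap '<' '>' a)), rest)
mailbox (TokenAddrSpec _ a : rest) =
  Right (ShortMailbox (AddressSpecification a), rest)
mailbox _ = Left "expected a mailbox"

exampleTokens :: [Token]
exampleTokens =
  [ TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
  ]
```

With these definitions, parse exampleTokens evaluates to Right (From [LongMailbox (DisplayName "John Doe") (AngleAddress "john@doe.org")]), which matches the parse tree shown above apart from the Right wrapper.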
Step #5: walk the parse tree and add XML start-tag, end-tag pairs around values
There is a function for every grammar production. For example, here is the function for the From grammar production:
serialize :: From -> String
serialize (From mailboxList) = "<From>" ++ serializeMailboxList mailboxList ++ "</From>"
The function's argument is the root of the parse tree, which is labeled From. The function calls another function, serializeMailboxList, to process the children of the root. The result is wrapped in a From start-tag, end-tag pair. Note that the XML is produced as a plain String.
Here is the XML serializer:
http://www.xfront.com/parsing/RFC-5322/From-Header/serialize.hs.txt
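For readers who want to see the one-function-per-production pattern end to end without opening the link, here is a minimal sketch. Only serialize and the name serializeMailboxList come from the text above; the other function names, the element names Mailbox, DisplayName, AngleAddress, and AddressSpecification, and the helpers element and escape are assumptions and may differ from the linked serialize.hs.

```haskell
data From = From MailboxList deriving Show
type MailboxList = [Mailbox]
data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving Show
data DisplayName = DisplayName String deriving Show
data AngleAddress = AngleAddress String deriving Show
data AddressSpecification = AddressSpecification String deriving Show

-- Escape the characters that are special in XML character data
-- (hypothetical helper, not mentioned in the original text).
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c = [c]

-- Wrap already-serialized content in a start-tag, end-tag pair.
element :: String -> String -> String
element name content = "<" ++ name ++ ">" ++ content ++ "</" ++ name ++ ">"

serialize :: From -> String
serialize (From mailboxList) = element "From" (serializeMailboxList mailboxList)

serializeMailboxList :: MailboxList -> String
serializeMailboxList = concatMap serializeMailbox

-- The element name "Mailbox" is an assumption.
serializeMailbox :: Mailbox -> String
serializeMailbox (LongMailbox n a) =
  element "Mailbox" (serializeDisplayName n ++ serializeAngleAddress a)
serializeMailbox (AngleMailbox a) = element "Mailbox" (serializeAngleAddress a)
serializeMailbox (ShortMailbox s) = element "Mailbox" (serializeAddressSpecification s)

serializeDisplayName :: DisplayName -> String
serializeDisplayName (DisplayName s) = element "DisplayName" (escape s)

serializeAngleAddress :: AngleAddress -> String
serializeAngleAddress (AngleAddress s) = element "AngleAddress" (escape s)

serializeAddressSpecification :: AddressSpecification -> String
serializeAddressSpecification (AddressSpecification s) =
  element "AddressSpecification" (escape s)
```

Applied to the parse tree from Step #4, serialize (From [LongMailbox (DisplayName "John Doe") (AngleAddress "john@doe.org")]) produces <From><Mailbox><DisplayName>John Doe</DisplayName><AngleAddress>john@doe.org</AngleAddress></Mailbox></From>. Escaping the leaf values keeps the output well-formed if a display name contains an ampersand or an angle bracket.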