Question

I am working toward being able to input any email message and output an equivalent XML encoding.

I am starting small, with one of the email headers -- the "From Header"

Here is an example of a From Header:

From: John Doe <john@doe.org>

I want it transformed into this XML:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

I want to use the lexical analyzer generator "Alex" (http://www.haskell.org/alex/doc/html/) to build a lexer that breaks apart (tokenizes) the From Header.

I want to use the parser generator "Happy" (http://www.haskell.org/happy/) to build a parser that processes the tokens and generates a parse tree.

Then I want to use a serializer to walk the parse tree and output XML.

The format of the From Header is specified by the Internet Message Format (IMF), RFC 5322 (https://www.rfc-editor.org/rfc/rfc5322).

Here are a few more examples of From Headers and the desired XML output:

From Header with no display name:

From: <john@doe.org>

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

From Header with no display name and no angle brackets around the address:

From: john@doe.org

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

From Header with multiple mailboxes, each separated by a comma:

From: <john@doe.org>, "Simon St. John" <simon@stjohn.org>, sally@smith.org

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
    <Mailbox>
        <DisplayName>Simon St. John</DisplayName>
        <Address>simon@stjohn.org</Address>
    </Mailbox>
    <Mailbox>
        <Address>sally@smith.org</Address>
    </Mailbox>
</From>

RFC 5322 says that the syntax for comment is: ( … ). Here is a From Header containing a comment:

From: (this is a comment) "John Doe" <john@doe.org>

I want all comments removed during lexing.

The desired XML output is this:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

The RFC says that there can be "folding whitespace" scattered throughout the From Header. Here is a From Header with the From: token on the first line, the display name on the second line, and the address on the third line:

From: 
    "John Doe" 
    <john@doe.org>

The XML output should not be affected by the folding whitespace:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

The RFC says that the part of the address after the @ character can be a domain literal, that is, a string enclosed in square brackets, such as this:

From: "John Doe" <john@[website]>

I must admit that I have never seen emails with that. Nonetheless, the RFC says it is allowed, so I certainly want my lexer and parser to handle such inputs. Here is the desired output:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@[website]</Address>
    </Mailbox>
</From>

Error Handling

I want an error generated if the From Header is incorrect. Here are a couple of examples of erroneous From Headers and the desired output:

The display name is erroneously placed after the address:

From: <john@doe.org> "John Doe"

The output should specify the location where the error was discovered:

serialize: parse error at line 1 and column 22. Error occurred at "John Doe"

This From Header has an erroneous "23" before the display name:

From: 23 "John Doe" <john@doe.org>

Again, the output should specify the location where the error was discovered:

serialize: parse error at line 1 and column 10. Error occurred at "John Doe"

Would you please show how to implement the lexer, parser, and serializer?


Solution

Split the task into five steps:

Step #1: specify the complete, authoritative BNF for the From Header

Step #2: create a lexical analysis function, lex, that breaks the From Header into a sequence of small chunks, such as from:, displayName, angleAddress, and so on. These small chunks are called tokens

lex :: String -> [Token]

Step #3: define a data type, From, to represent the From Header

Step #4: create a parser function, parse, that consumes the sequence of tokens and produces a parse tree of type From

parse :: [Token] -> From

Step #5: create a function, serialize, that walks the parse tree and generates XML

serialize :: From -> String
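
The three functions compose into a single pipeline. Here is a minimal sketch that assumes nothing about the real lexer and parser beyond the signatures above. The helper name pipeline is hypothetical; it takes the three stages as arguments so that it compiles on its own. Note that the Prelude already exports a function named lex, so a module that defines its own lex would need "import Prelude hiding (lex)".

```haskell
-- Hypothetical helper (not part of the original solution): chain a lexer,
-- a parser, and a serializer into one String -> String transformation.
pipeline
    :: (String -> [token])   -- step #2: lexer
    -> ([token] -> tree)     -- step #4: parser
    -> (tree -> String)      -- step #5: serializer
    -> String
    -> String
pipeline lexer parser serializer = serializer . parser . lexer
```

With the real functions in scope, the whole transformation would be pipeline lex parse serialize.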

Step #1: specify the complete, authoritative BNF for the From Header

The complete, authoritative BNF for the From header is specified in RFC 5322. I extracted the portions applicable to the From header:

http://www.xfront.com/parsing/RFC-5322/From-Header/From-Header-BNF.pdf

Step #2: create a lexer that breaks up the From Headers into tokens

Here is an example that shows how From headers will be tokenized:

Tokenize this From header:

From: "John Doe" <john@doe.org>

The output of the lexer is this list of tokens:

[ 
  TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
]

Each item in the list consists of a label for the token, position information, and then optionally a value. The position information is the part in parentheses: AlexPn is the constructor of Alex's position type, and its three numbers are the absolute character offset of the token from the start of the input (counting from 0), the line number, and the column number (both counting from 1).
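
To make the shape of these tokens concrete, here is a sketch of a token type inferred from the sample output above. It is an assumption, not the type from the linked lexer, which needs more constructors (for example for commas and bare addresses). AlexPosn is redefined locally with the same shape that Alex's "posn" wrapper generates, so that the sketch compiles on its own. The helpers tokenPosn and describePosn are hypothetical.

```haskell
-- Same shape as the AlexPosn generated by Alex's "posn" wrapper:
-- absolute character offset, line number, column number.
data AlexPosn = AlexPn !Int !Int !Int
    deriving (Eq, Show)

-- Hypothetical token type covering only the three tokens in the sample output.
data Token
    = TokenFrom AlexPosn
    | TokenDisplayName AlexPosn String
    | TokenAngleAddress AlexPosn String
    deriving (Eq, Show)

-- Extract the position carried by any token.
tokenPosn :: Token -> AlexPosn
tokenPosn (TokenFrom p)           = p
tokenPosn (TokenDisplayName p _)  = p
tokenPosn (TokenAngleAddress p _) = p

-- Render a position in the style of the desired error messages.
describePosn :: AlexPosn -> String
describePosn (AlexPn _ line col) =
    "line " ++ show line ++ " and column " ++ show col
```

Carrying the position in every token is what lets a later stage report where an error was discovered.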

Below is the lexer for the BNF. Observe the one-to-one mapping between the BNF and the token definitions. For example, the BNF has this production rule:

qcontent  = ( qtext  |  quoted-pair )

The lexer has this token definition:

@qcontent = ( $qtext | @quoted_pair )

Aside from minor syntactic differences, they are identical. This correspondence is the main benefit of the approach: assuming the definition of the email “From header” is correct (i.e., the BNF is correct), a lexer that mirrors it rule by rule is very likely to be correct too.

Here is the lexer:

http://www.xfront.com/parsing/RFC-5322/From-Header/Lexer.x.txt

Step #3: define a data type to represent the From Header

The parse tree that the parser builds from the sequence of tokens is represented using this From data type:

data From
    = From MailboxList
    deriving Show

type MailboxList
    = [ Mailbox ]

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show

data DisplayName
    = DisplayName String
    deriving Show

data AngleAddress
    = AngleAddress String
    deriving Show

data AddressSpecification
    = AddressSpecification String
    deriving Show
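
As an illustration of how these types fit together, here is a sketch of the value that would represent the three-mailbox example from the question. The types are repeated so that the block compiles on its own. The names example, address, and addresses are hypothetical, and storing the display name without its quotes is an assumption that follows the sample parse tree shown in Step #4.

```haskell
data From = From MailboxList
    deriving Show

type MailboxList = [Mailbox]

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show

data DisplayName = DisplayName String
    deriving Show

data AngleAddress = AngleAddress String
    deriving Show

data AddressSpecification = AddressSpecification String
    deriving Show

-- From: <john@doe.org>, "Simon St. John" <simon@stjohn.org>, sally@smith.org
example :: From
example = From
    [ AngleMailbox (AngleAddress "john@doe.org")
    , LongMailbox (DisplayName "Simon St. John") (AngleAddress "simon@stjohn.org")
    , ShortMailbox (AddressSpecification "sally@smith.org")
    ]

-- Hypothetical helper: the address string of a mailbox, whichever form it has.
address :: Mailbox -> String
address (LongMailbox _ (AngleAddress a))        = a
address (AngleMailbox (AngleAddress a))         = a
address (ShortMailbox (AddressSpecification a)) = a

-- Hypothetical helper: all addresses in a From header, in order.
addresses :: From -> [String]
addresses (From mailboxes) = map address mailboxes
```

Each of the three mailbox forms in the question maps to exactly one constructor of Mailbox, which is why a function such as address needs one equation per form.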

Step #4: create a parser -- consume the sequence of tokens and produce a parse tree of type "From"

Here is an example that shows how From Headers will be parsed:

Parse this From header:

From: "John Doe" <john@doe.org>

The output of the parser is this parse tree:

From 
    [
        LongMailbox 
            (DisplayName "John Doe") 
            (AngleAddress "john@doe.org")
    ]
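
The linked parser is generated by Happy. To show the idea without the generator, here is a hand-written sketch in plain Haskell. It is a simplification and not the author's parser: it handles only a single mailbox and only the three token kinds from the sample lexer output. The names parseSingle, parseMailbox, finish, stripDelimiters, and parseError are hypothetical, and the token and position types are assumed from the sample output. It also shows how the position stored in a token can produce the error message requested in the question.

```haskell
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)

-- Assumed token type, covering only the tokens in the sample lexer output.
data Token
    = TokenFrom AlexPosn
    | TokenDisplayName AlexPosn String
    | TokenAngleAddress AlexPosn String
    deriving (Eq, Show)

data From = From [Mailbox] deriving (Eq, Show)
data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    deriving (Eq, Show)
data DisplayName = DisplayName String deriving (Eq, Show)
data AngleAddress = AngleAddress String deriving (Eq, Show)

-- Drop the first and last character (the quotes or the angle brackets).
stripDelimiters :: String -> String
stripDelimiters = drop 1 . reverse . drop 1 . reverse

-- Parse "From:" followed by exactly one mailbox.
parseSingle :: [Token] -> Either String From
parseSingle (TokenFrom _ : rest) = fmap (\m -> From [m]) (parseMailbox rest)
parseSingle (t : _)              = Left (parseError t)
parseSingle []                   = Left "parse error: empty input"

parseMailbox :: [Token] -> Either String Mailbox
parseMailbox (TokenDisplayName _ n : TokenAngleAddress _ a : rest) =
    finish (LongMailbox (DisplayName (stripDelimiters n))
                        (AngleAddress (stripDelimiters a))) rest
parseMailbox (TokenAngleAddress _ a : rest) =
    finish (AngleMailbox (AngleAddress (stripDelimiters a))) rest
-- A display name followed by something other than an angle address.
parseMailbox (TokenDisplayName _ _ : t : _) = Left (parseError t)
-- Anything else, including a display name with nothing after it.
parseMailbox (t : _) = Left (parseError t)
parseMailbox []      = Left "parse error: unexpected end of input"

-- A mailbox is accepted only if no tokens are left over.
finish :: Mailbox -> [Token] -> Either String Mailbox
finish mailbox []      = Right mailbox
finish _       (t : _) = Left (parseError t)

-- Build an error message from the position and text of the offending token.
parseError :: Token -> String
parseError t =
    "parse error at line " ++ show line ++ " and column " ++ show col
        ++ ". Error occurred at " ++ text
  where
    (AlexPn _ line col, text) = case t of
        TokenFrom p           -> (p, "From:")
        TokenDisplayName p s  -> (p, s)
        TokenAngleAddress p s -> (p, s)
```

For the erroneous header with the display name after the address, the leftover TokenDisplayName at column 22 is the token passed to parseError.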

Here is the parser:

http://www.xfront.com/parsing/RFC-5322/From-Header/Parser.y.txt

Step #5: walk the parse tree and add XML start-tag, end-tag pairs around values

There is a function for every grammar production. For example, here is the function for the From grammar production:

serialize :: From -> String
serialize (From mailboxList) = "<From>" ++ serializeMailboxList mailboxList ++ "</From>"

The function's argument is the root of the parse tree, which has the label, From. The function calls another function, serializeMailboxList, to process the children of the root. The result is wrapped within From start-tag, end-tag pairs.
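
The remaining functions follow the same pattern. Here is a sketch of how they might look; it is not the author's serializer, which is at the link below. The data types are repeated so that the block compiles on its own. The helpers element, textElement, and escapeXml are hypothetical, and escaping the characters &, <, and > in text content is an added assumption. Like the serialize function above, the sketch emits no indentation or line breaks.

```haskell
data From = From MailboxList deriving Show
type MailboxList = [Mailbox]
data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show
data DisplayName = DisplayName String deriving Show
data AngleAddress = AngleAddress String deriving Show
data AddressSpecification = AddressSpecification String deriving Show

serialize :: From -> String
serialize (From mailboxList) =
    "<From>" ++ serializeMailboxList mailboxList ++ "</From>"

-- One <Mailbox> element per mailbox, in order.
serializeMailboxList :: MailboxList -> String
serializeMailboxList = concatMap serializeMailbox

-- All three mailbox forms produce an <Address>; only LongMailbox
-- also produces a <DisplayName>.
serializeMailbox :: Mailbox -> String
serializeMailbox mailbox = element "Mailbox" (body mailbox)
  where
    body (LongMailbox (DisplayName n) (AngleAddress a)) =
        textElement "DisplayName" n ++ textElement "Address" a
    body (AngleMailbox (AngleAddress a)) = textElement "Address" a
    body (ShortMailbox (AddressSpecification a)) = textElement "Address" a

-- Hypothetical helper: wrap already-serialized content in a start-tag, end-tag pair.
element :: String -> String -> String
element name content = "<" ++ name ++ ">" ++ content ++ "</" ++ name ++ ">"

-- Hypothetical helper: wrap plain text, escaping it first.
textElement :: String -> String -> String
textElement name = element name . escapeXml

-- Hypothetical helper: escape the characters that are special in XML text.
escapeXml :: String -> String
escapeXml = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar c   = [c]
```

Keeping the tag wrapping in one helper means every production function only has to decide which children to emit.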

Here is the XML serializer:

http://www.xfront.com/parsing/RFC-5322/From-Header/serialize.hs.txt

Licensed under: CC-BY-SA with attribution