Using the Alex lexer generator, I am creating a lexer to tokenize email "From" headers. Here is an example header:
From: "John Doe" <john@doe.org>
"John Doe" is called the "display name" and let's assume that it can consist of any ASCII characters.
Likewise, let's assume that the parts of the email address can consist of any ASCII characters.
Below is my Alex program. When I run it on the above "From header" I just get one token:
[TokenString "From: \"John Doe\" <john@doe.org>"]
Apparently this rule:
$us_ascii_character+ { \s -> TokenString s }
takes precedence over all the other rules. Why?
I thought that precedence was determined by the order in which the rules are listed in my program: check whether the input matches the first rule; if it doesn't, check whether it matches the second rule, and so forth. Is that not the case?
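For reference, the Alex documentation describes a different selection rule: the rule that matches the longest prefix of the input wins, and the order of the rules in the file only breaks ties between matches of equal length. Below is a minimal sketch of that selection rule in plain Haskell. It is a toy model, not Alex's actual implementation, and the names Rule, literal, many1 and pickRule are hypothetical helpers introduced only for illustration.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical toy rule: a name plus a function returning the length of the
-- prefix it matches (Nothing if it does not match at the start of the input).
data Rule = Rule { ruleName :: String, matchLen :: String -> Maybe Int }

-- Matches one fixed keyword at the start of the input.
literal :: String -> Rule
literal kw = Rule kw (\s -> if kw `isPrefixOf` s then Just (length kw) else Nothing)

-- Matches one or more leading characters satisfying a predicate
-- (comparable to $set+ in an Alex rule).
many1 :: String -> (Char -> Bool) -> Rule
many1 name p = Rule name (\s -> case length (takeWhile p s) of
                                  0 -> Nothing
                                  n -> Just n)

-- Pick the rule with the longest match; on a tie the earliest rule wins.
pickRule :: [Rule] -> String -> Maybe (String, Int)
pickRule rules input = foldl better Nothing candidates
  where
    candidates = [ (ruleName r, n) | r <- rules, Just n <- [matchLen r input] ]
    better Nothing c = Just c
    better (Just (bestName, bestLen)) (name, len)
      | len > bestLen = Just (name, len)         -- strictly longer match wins
      | otherwise     = Just (bestName, bestLen) -- tie or shorter: keep earlier rule
```

With the rule list [literal "From", many1 "anyAscii" (< '\128')], pickRule selects "anyAscii" for the whole example header, because that rule consumes the entire line while "From" consumes only four characters. On the input "From" alone, both rules match four characters and the earlier rule is chosen.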
How do I express my rules such that the lexer tokenizes the "From header" into these tokens:
From, :, "John Doe", <, john, @, doe, ., org, >
and the display name and email parts can consist of any ASCII characters?
Here is my Alex lexer:
{
module Main (main) where
}
%wrapper "basic"
$digit = 0-9
$alpha = [a-zA-Z]
$us_ascii_character = [\t\n\r\ !\"\#\$\%\&\'\(\)\*\+\,\-\.\/0123456789\:\;\<\=\>\?\@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\}~\DEL]
tokens :-
$white+ ;
\(.*\) ;
From { \s -> TokenFrom }
: { \s -> TokenColon }
\" { \s -> TokenQuote }
\< { \s -> TokenLeftAngleBracket }
> { \s -> TokenRightAngleBracket }
@ { \s -> TokenAtSign }
\. { \s -> TokenPeriod }
$us_ascii_character+ { \s -> TokenString s }
{
-- Each action has type :: String -> Token
-- The token type:
data Token =
    TokenFrom |
    TokenColon |
    TokenQuote |
    TokenLeftAngleBracket |
    TokenRightAngleBracket |
    TokenAtSign |
    TokenPeriod |
    TokenString String
    deriving (Eq,Show)