I am using Alex to create a lexer that tokenizes email "From" headers. Here is an example header:

From: "John Doe" <john@doe.org>

"John Doe" is called the "display name" and let's assume that it can consist of any ASCII characters.

Likewise, let's assume that the parts of the email address can consist of any ASCII characters.

Below is my Alex program. When I run it on the above "From header" I just get one token:

[TokenString "From: \"John Doe\" <john@doe.org>"]

Apparently this rule:

$us_ascii_character+    { \s -> TokenString s }

takes precedence over all the other rules. Why?

I thought that precedence was based on the order in which the rules are physically listed in my program: check whether the input matches the first rule; if it doesn't, check the second rule, and so forth. Is that not the case?

How do I express my rules such that the lexer tokenizes the "From header" into these tokens:

From, :, "John Doe", <, john, @, doe, ., org, >

and the display name and email parts can consist of any ASCII characters?

Here is my Alex lexer:

{
module Main (main) where
}

%wrapper "basic"

$digit      = 0-9           
$alpha      = [a-zA-Z]      
$us_ascii_character     = [\t\n\r\ !\"\#\$\%\&\'\(\)\*\+\,\-\.\/0123456789\:\;\<\=\>\?\@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\}~\DEL]

tokens :-

  $white+           ;
  \(.*\)           ;
  From             { \s -> TokenFrom }
  :                { \s -> TokenColon }
  "                { \s -> TokenQuote }
  \<               { \s -> TokenLeftAngleBracket }
  >                { \s -> TokenRightAngleBracket }
  @                { \s -> TokenAtSign }
  \.               { \s -> TokenPeriod }
  $us_ascii_character+     { \s -> TokenString s }

{
-- Each action has type :: String -> Token

-- The token type:
data Token =
    TokenFrom                 |
    TokenColon                |
    TokenQuote                |
    TokenLeftAngleBracket     |
    TokenRightAngleBracket    |
    TokenAtSign               |
    TokenPeriod               |
    TokenString String
    deriving (Eq,Show)

-- A minimal driver: read the header from stdin and print the token list.
main :: IO ()
main = do
  s <- getContents
  print (alexScanTokens s)
}

Solution

You have misunderstood how the matching rule is chosen:

When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.

as stated in the documentation. Only if several rules match an equally long prefix does the order in which they are specified matter.
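
Here is a minimal illustration of that behaviour (TokenIf and TokenIdent are invented just for this example; it reuses the $alpha macro defined in your program):

  tokens :-

    if          { \s -> TokenIf }
    $alpha+     { \s -> TokenIdent s }

On the input "ifx" the second rule wins, because it matches three characters while "if" matches only two. On the input "if" both rules match two characters, and only then does the order in the file decide, so the first rule wins.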

Since

$us_ascii_character+

matches the entire input, you get only a single [TokenString "From: \"John Doe\" <john@doe.org>"].

To tokenise the input as desired, if I understand correctly, you need to use a rule like

\" [^\"]* \"      { \s -> TokenString s }

(disclaimer: I don't know Alex's syntax well, so the exact form will probably differ).
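
To sketch the whole idea anyway (untested; the $atom_char macro is my own guess, not something from your program), the rules section might look roughly like this:

$atom_char = $us_ascii_character # [\" \< \> \@ \. \: $white]

tokens :-

  $white+          ;
  \(.*\)           ;
  From             { \s -> TokenFrom }
  \:               { \s -> TokenColon }
  \" [^\"]* \"     { \s -> TokenString s }   -- quoted display name, quotes kept
  \<               { \s -> TokenLeftAngleBracket }
  \>               { \s -> TokenRightAngleBracket }
  \@               { \s -> TokenAtSign }
  \.               { \s -> TokenPeriod }
  $atom_char+      { \s -> TokenString s }   -- local part and domain labels

The important change is that the catch-all rule no longer includes the delimiter characters (quote, angle brackets, @, ., :, whitespace), so it can never match a longer prefix than the more specific rules. On your example header this should give TokenFrom, TokenColon, TokenString "\"John Doe\"", TokenLeftAngleBracket, TokenString "john", TokenAtSign, TokenString "doe", TokenPeriod, TokenString "org", TokenRightAngleBracket.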
