How to tokenize a string with an embedded string?

https://stackoverflow.com/questions/16570708

29-05-2022

Question

I am learning how to use the Haskell lexical analyzer tool called Alex¹.

I am trying to implement a lexical analyzer for this string (an email "From:" header):

From: "John Doe" <john@doe.org>

I want to break it up into this list of tokens:

[
  From,
  DisplayName "John Doe",
  Email,
  LocalPart "john",
  Domain "doe.org"
]

Below is my implementation. It works fine if the string doesn't contain a display name. That is, this works fine:

let s = "From: <john@doe.org>"
alexScanTokens s

However, when I include a display name, I get this error message:

[From*** Exception: lexical error

That is, this results in an error:

let s = "From: \"John Doe\" <john@doe.org>"
alexScanTokens s

I am guessing that this part of my Alex program is causing the error:

\"[a-zA-Z ]+\"      { \s -> DisplayName (init (tail s)) }

In Alex the left side is a regular expression:

\"[a-zA-Z ]+\"

and the right side is the action to be taken when a string is found that matches the regular expression:

{ \s -> DisplayName (init (tail s)) }

Any ideas on what the problem might be?

Here is my lexical analyzer program:

{
module Main (main) where
}

%wrapper "basic"

$digit = 0-9            -- digits
$alpha = [a-zA-Z]       -- alphabetic characters

tokens :-

  $white+                    ;
  From:                     { \s -> From }
  \"[a-zA-Z ]+\"            { \s -> DisplayName (init (tail s)) }
  \<                        { \s -> Email }
  [$alpha]+@                 { \s -> LocalPart (init s) }
  [$alpha\.]+>               { \s -> Domain (init s) }

{
-- Each action has type :: String -> Token

-- The token type:
data Token =
    From                               |
    DisplayName String                 |
    Email                              |
    LocalPart String                   |
    Domain String       
    deriving (Eq,Show)

main = do
  s <- getContents
  print (alexScanTokens s)
}

¹ The "Alex" lexical analyzer tool may be found at this URL: http://www.haskell.org/alex/doc/html/introduction.html

Answer

It's the space in "John Doe" that's causing trouble.

Alex ignores unescaped whitespace inside character sets, so [a-zA-Z ] means the same as [a-zA-Z], and the rule cannot match a display name that contains a space. To include the space, escape it with a backslash, e.g. [a-zA-Z\ ].
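As a sketch of the fix, the display-name rule from the question would change as follows. Only the character set differs; the rest of the lexer is assumed to stay as posted.

```haskell
  -- Space escaped so Alex treats it as a literal member of the set
  \"[a-zA-Z\ ]+\"           { \s -> DisplayName (init (tail s)) }
```

This set still accepts only ASCII letters and spaces, so a display name containing digits or punctuation would need a wider set.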

Also, I can't help but note that a lexer might be the wrong tool for this job. Consider writing a proper parser using e.g. Parsec.
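To illustrate the Parsec suggestion, here is a minimal sketch that parses the same header directly into a record. The FromHeader type, its field names, and the quoted helper are hypothetical names chosen for this example, and the grammar is deliberately as narrow as the lexer in the question (letters and dots only); it is not a full email address parser.

```haskell
module Main (main) where

import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical result type for this sketch; not part of the original post.
data FromHeader = FromHeader
  { displayName :: Maybe String  -- Nothing when the header has no quoted name
  , localPart   :: String
  , domain      :: String
  } deriving (Eq, Show)

-- A double-quoted display name such as "John Doe" (quotes are dropped).
quoted :: Parser String
quoted = between (char '"') (char '"') (many1 (noneOf "\""))

-- Parses: From: ["Display Name"] <local@domain>
fromHeader :: Parser FromHeader
fromHeader = do
  _    <- string "From:"
  spaces
  name <- optionMaybe (quoted <* spaces)
  _    <- char '<'
  lp   <- many1 letter
  _    <- char '@'
  dom  <- many1 (letter <|> char '.')
  _    <- char '>'
  return (FromHeader name lp dom)

main :: IO ()
main = do
  print (parse fromHeader "" "From: \"John Doe\" <john@doe.org>")
  print (parse fromHeader "" "From: <john@doe.org>")
```

Unlike the lexer, this approach needs no trailing @ or > inside a token, because the parser consumes the delimiters itself, and a malformed header yields a Left value with a position instead of a runtime exception.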

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow