How to capture a string without quote characters

https://stackoverflow.com/questions/8216412

05-03-2021

Question

I'm trying to capture quoted strings without the quotes. I have this terminal

%token <string> STRING

and this production

constant:
    | QUOTE STRING QUOTE { String($2) }

along with these lexer rules

| '\''       { QUOTE }
| [^ '\'']*  { STRING (lexeme lexbuf) } //final regex before eof

It seems to be interpreting everything leading up to a QUOTE as a single lexeme, which doesn't parse. So maybe my problem is elsewhere in the grammar--not sure. Am I going about this the right way? It was parsing fine before I tried to exclude quotes from strings.

Update

I think there may be some ambiguity with the following lexer rules

let name = alpha (alpha | digit | '_')*
let identifier = name ('.' name)*

The following rule is prior to STRING

| identifier    { ID (lexeme lexbuf) }

Is there any way to disambiguate these without including quotes in the STRING regex?

Solution

It's pretty normal to do semantic analysis in the lexer for constants like strings and numeric literals, so you might consider a lex rule for your string constants like

| '\'' [^ '\'']* '\'' 
    { STRING (let s = lexeme lexbuf in s.Substring(1, s.Length - 2)) }
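
The action above assumes the lexeme always starts and ends with a quote, which the regex guarantees. As a minimal sketch, the same stripping step can be pulled into a small F# helper that checks this before slicing. The name stripQuotes is hypothetical and not part of fslex; it could live in the lexer header.

```fsharp
// Hypothetical helper (not part of fslex): removes exactly one leading and
// one trailing single quote, and returns the input unchanged otherwise.
let stripQuotes (s: string) : string =
    if s.Length >= 2 && s.[0] = '\'' && s.[s.Length - 1] = '\'' then
        s.Substring(1, s.Length - 2)
    else
        s

// Example call with a lexeme as the rule above would produce it.
printfn "%s" (stripQuotes "'hello world'")
```

With such a helper the action would read { STRING (stripQuotes (lexeme lexbuf)) }.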

Other suggestions

You can keep the quotes in the lexeme and trim them in the parser.

Lexer:

let constant       = ("'" ([^ '\''])* "'")
...
| constant         { STRING(lexeme lexbuf) }

Parser:

%token <string> STRING

...
constant:
    | STRING { ($1).Trim([|'\''|]) }
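
One thing worth knowing about this variant: String.Trim removes every leading and trailing occurrence of the given characters, not just one. With the constant regex above that makes no difference, because the text between the delimiters cannot contain a quote. A short F# sketch of the behavior (the sample strings are made up for illustration):

```fsharp
// Characters to trim; '\'' is the escaped single-quote char literal.
let quote = [| '\'' |]

// A lexeme as the constant rule would produce it.
let trimmed = "'hello world'".Trim(quote)      // "hello world"

// The empty literal '' trims down to the empty string.
let emptyLiteral = "''".Trim(quote)            // ""

// Trim is greedy at both ends: all outer quotes are removed.
// The constant regex never yields such a lexeme; shown only for contrast.
let greedy = "''a''".Trim(quote)               // "a"
```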

Or, if you want the quotes to be separate tokens:

Lexer:

let name = alpha (alpha | digit | '_')*
let identifier = name ('.' name)*
...

| '\''       { QUOTE }
| identifier { ID (lexeme lexbuf) }
| _          { STRING (lexeme lexbuf) }

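Because identifier is matched before the catch-all rule, it claims any identifier-like run of characters inside the string. The token stream for one quoted string can therefore look like QUOTE ID STRING ID .. QUOTE, and the parser has to reassemble it: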

Parser:

constant:
     | QUOTE content QUOTE     { String($2) }

content:
     | ID content      { $1+$2 }
     | STRING content  { $1+$2 }
     | ID              { $1 }
     | STRING          { $1 }
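
To make the reassembly concrete, here is a minimal F# sketch that does by hand what the content production does: it concatenates the payloads of ID and STRING tokens until the closing QUOTE. The Token type and the functions collect and parseConstant are hypothetical stand-ins for what fsyacc would generate. Unlike the grammar above, this sketch also accepts an empty string between the quotes.

```fsharp
// Hypothetical token type mirroring the lexer rules above.
type Token =
    | QUOTE
    | ID of string
    | STRING of string

// Concatenates ID/STRING payloads until the closing QUOTE.
// Returns the collected text and the remaining tokens, or None if
// the input ends before a closing QUOTE is seen.
let rec collect (acc: string) (tokens: Token list) =
    match tokens with
    | ID s :: rest
    | STRING s :: rest -> collect (acc + s) rest
    | QUOTE :: rest -> Some (acc, rest)
    | [] -> None

// Parses QUOTE content QUOTE from the front of a token list.
let parseConstant (tokens: Token list) =
    match tokens with
    | QUOTE :: rest -> collect "" rest
    | _ -> None

// Token stream for the input 'foo bar', with the space lexed by the catch-all rule.
let result = parseConstant [ QUOTE; ID "foo"; STRING " "; ID "bar"; QUOTE ]
```

Here result is Some ("foo bar", []). Note that this only preserves spaces if the lexer does not skip whitespace before the catch-all rule gets a chance to match it.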

I had a similar problem. I capture quoted strings in the "lexic.l" file using lexer states, as described in my own self-answer.
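
The "lexic.l" file suggests flex, where states are start conditions. A comparable effect in fslex can be sketched with a second rule entry point that is only entered after an opening quote, so the identifier rule never competes with the string body. This is an assumption-laden fragment, not the approach from the linked answer: stringBody is a hypothetical rule name, and lexeme, identifier, ID and STRING are taken from the question.

```fsharp
// fslex specification fragment, not standalone F#.
rule token = parse
| '\''              { stringBody lexbuf }   // switch to the string "state"
| identifier        { ID (lexeme lexbuf) }

// Only reached after an opening quote has been consumed.
and stringBody = parse
| [^ '\'']* '\''    { let s = lexeme lexbuf in
                      STRING (s.Substring(0, s.Length - 1)) }   // drop the closing quote
| eof               { failwith "unterminated string literal" }
| _                 { failwith "unterminated string literal" }
```

With this layout the parser receives a single STRING token without quotes, so the production can simply be constant: STRING { String($1) }.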

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow