Jison / Flex: Trying to capture anything (.*) between two tokens but having problems

https://stackoverflow.com/questions/23399201

13-07-2023

Question

I'm currently working on a small DSL, not unlike RABL, and I'm struggling with the implementation of one of my rules. Before we get to the problem, I'll explain a bit about my syntax/grammar. In my language you can define properties, object/array blocks, or custom blocks (these are all used to build a JSON object/array). A "custom block" can contain either my standard expressions (property, object/array block, etc.) or some JavaScript. These expressions are written like this:

-- An object block
object @model

    -- A property node
    property some, key(name="value")

    -- A custom node
    object custom_obj as 
        property some(name="key")
    end

    -- A custom script node
    property full_name as (u)
        // This is JavaScript
        return u.first_name + ' ' + u.last_name;
    end
end

The problem I'm running into is with my custom script node. I'm having a really hard time defining the script token so that Jison can properly capture the stuff inside the block. In my lexer, I currently have...

# script_param is basically a regex to match "(some_ident)"
{script_param}  { this.begin('js'); return 'SCRIPT_PARAM'; }
<js>(.|\n|\r)*?"end" %{ 
    this.popState();
    yytext = yytext.substr(0, yyleng - 3).trim();
    return 'SCRIPT';
%}

That SCRIPT token will basically match anything after (u) up to (and including) the end token (which usually ends a block). I really dislike this because my usual block terminator (end) is actually part of the script token, which feels totally hacky to me. Unfortunately, I'm not able to find a better way to capture ANYTHING between (..) and end. I've tried writing a regex that captures anything that ends with a ";", but that poses problems when I have multiple script nodes in my dsl code. I've only been able to make this work by including the "end" keyword as part of my capture.

Here are the links to my grammar and lexer files.

I'd greatly appreciate any insight into solving my problem! If I didn't explain my problem clearly, let me know and I'll try my best to clarify! Many thanks in advance!!

I will also happily accept any advice as to how to clean up my grammar. I'm still fairly new at this stuff and feel like my stuff is a mess right now :)

Solution

It's easy enough to match a string up to but not including the first instance of end:

([^e]|e[^n]|en[^d])*

(And it doesn't even need non-greedy repetition.)

However, that's not what you want. The included JavaScript might include:

variables whose names happen to include the characters end (tendency)
comments (/* Take the values up to the end of the line */)
character strings (if (word == "end"))
and, indeed, the word end itself, which is not a reserved word in JavaScript.
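To make those cases concrete, here is a small sketch in plain JavaScript (run outside Jison) that applies an anchored copy of the pattern above to a few inputs. The helper name prefixBeforeEnd is made up for illustration. The last input also shows a corner case of the pattern as written: when an e sits directly in front of end, the e[^n] alternative swallows the first e of end and the match runs straight past it.

```javascript
// Anchored copy of the pattern from the answer: ([^e]|e[^n]|en[^d])*
const upToEnd = /^(?:[^e]|e[^n]|en[^d])*/;

// Hypothetical helper: returns whatever the pattern consumes from the start of `text`.
function prefixBeforeEnd(text) {
  return upToEnd.exec(text)[0];
}

// The easy case: stops right before the terminator.
const simple = prefixBeforeEnd("return 1;\nend");
// → "return 1;\n"

// An identifier that merely contains the letters e-n-d cuts the script short.
const identifier = prefixBeforeEnd("var tendency = 1;\nend");
// → "var t"

// So does a string literal.
const literal = prefixBeforeEnd('if (word == "end") return;\nend');
// → 'if (word == "'

// Corner case: an `e` directly before `end` makes the pattern overshoot.
const overshoot = prefixBeforeEnd("eend");
// → "eend"
```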

Really, the only clean solution is to be able to lex JavaScript. Fortunately, you don't have to do it precisely, because you're not interpreting it, but even so it is a bit of work. The most annoying part of lexing JavaScript, as with other similar languages, is deciding when / starts a regular expression and when it is just division; getting that right requires most of a JavaScript parser, particularly since it also interacts with automatic semicolon insertion.
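As a rough idea of what "not precisely" could look like, here is a sketch of a scanner that skips comments and quoted strings and reads identifiers whole, so that end is only seen when it stands alone as a word. It is an assumption-laden illustration, not a real JavaScript lexer: findBlockEnd is a hypothetical name, regular expression literals and template literals are deliberately not handled (that is exactly the hard part described above), and it treats every bare end, including obj.end, as the terminator.

```javascript
// Hypothetical helper: index of the first standalone `end` word that is
// outside comments and quoted strings, or -1 if there is none.
// Not handled (by assumption): regex literals, template literals.
function findBlockEnd(src, from = 0) {
  let i = from;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === "/" && next === "/") {
      // Line comment: skip to the end of the line.
      const nl = src.indexOf("\n", i);
      i = nl === -1 ? src.length : nl;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing */.
      const close = src.indexOf("*/", i + 2);
      i = close === -1 ? src.length : close + 2;
    } else if (ch === '"' || ch === "'") {
      // Quoted string: skip to the matching quote, stepping over escapes.
      i++;
      while (i < src.length && src[i] !== ch) {
        i += src[i] === "\\" ? 2 : 1;
      }
      i++;
    } else if (/[A-Za-z_$]/.test(ch)) {
      // Identifier or keyword: read the whole word before comparing it.
      let j = i + 1;
      while (j < src.length && /[\w$]/.test(src[j])) j++;
      if (src.slice(i, j) === "end") return i;
      i = j;
    } else {
      i++;
    }
  }
  return -1;
}

const sample =
  'var tendency = "end"; /* the end */ // end\nreturn tendency;\nend';
const endIndex = findBlockEnd(sample);
const scriptBody = sample.slice(0, endIndex).trim();
```

Here scriptBody holds everything before the final end, because the three earlier occurrences sit inside an identifier, a string, and comments.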

To deal with the fact that the included JavaScript might actually use a variable named end, you have a couple of choices, as far as I can see:

Document the fact that end is a reserved word.
Only recognize end when it appears outside of parentheses and in a place where a statement might start (not too difficult if you end up building enough of a JS parser to correctly identify regular expressions).
Only recognize end when it appears by itself on a line.

This last choice would really simplify your problem a lot, so you might want to think about it, although it's not really very elegant.
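A minimal sketch of that last option, again in plain JavaScript rather than as a Jison rule: the terminator is the first line containing nothing but end and optional spaces or tabs. The names END_LINE and splitAtEndLine are invented for this example, and translating the same line-anchored idea into the lexer is left open.

```javascript
// `end` alone on a line, allowing surrounding spaces or tabs.
// The `m` flag makes ^ and $ match at line boundaries.
const END_LINE = /^[ \t]*end[ \t]*$/m;

// Hypothetical helper: splits `src` at the first line that contains only `end`.
function splitAtEndLine(src) {
  const match = END_LINE.exec(src);
  if (match === null) return null; // no terminator found
  return {
    script: src.slice(0, match.index).trim(),
    rest: src.slice(match.index + match[0].length),
  };
}

const source =
  "// the end is near\n" +
  "return u.first_name + ' ' + u.last_name; // end\n" +
  "    end\n" +
  "end";
const parts = splitAtEndLine(source);
```

With this input, parts.script contains both JavaScript lines, including the comments that mention end, and parts.rest starts at the newline after the script's own terminator, leaving the outer end for the enclosing block.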

Licensed under: CC-BY-SA with attribution
