質問

I've been looking at parsing a key-value data format with ANTLR. Pretty straightforward, but the keys represent a hierarchy.

A simplified example of my input syntax:

/a/b/c=2
/a/b/d/e=3
/a/b/d/f=4

In my mind, this represents a tree structured as follows:

(a (b (= c 2) (d (= e 3) (= f 4))))

The nearest I can get is to use the following grammar:

/* Parser Rules */
start: (component NEWLINE?)* EOF -> (component)*;

component: FORWARD_SLASH ALPHA_STRING component -> ^(ALPHA_STRING component)
  | FORWARD_SLASH ALPHA_STRING EQUALS value -> ^(EQUALS ALPHA_STRING value);

value: ALPHA_STRING;

/* Lexer Rules */
NEWLINE : '\r'? '\n';
ALPHA_STRING : ('a'..'z'|'A'..'Z'|'0'..'9')+;
EQUALS : '=';
FORWARD_SLASH : '/';

Which produces:

(a (b (= c 2))) (a (b (d (= e 3)))) (a (b (d (= f 4))))

I'm not sure whether I'm asking too much from a generic tool such as ANTLR here, and this is as close I can get with this approach. That is, from here I consume the parts of the tree and create the data structure I want by hand.

So - can I produce the tree structure I want directly from a grammar? If so, how? If not, why not - is it a technical limitation in ANTLR or is it something more CS-y to do with the type of language involved?

役に立ちましたか?

解決

I'm not sure whether I'm asking too much from a generic tool such as ANTLR here...

I think you're asking too much of the token parser. For input /a/b/c=2, the token parser sees this:

FORWARD_SLASH ALPHA_STRING FORWARD_SLASH ALPHA_STRING FORWARD_SLASH ALPHA_STRING EQUALS ALPHA_STRING

The interesting stuff in this case is the text in the tokens themselves, and the token parser couldn't care less about it. You'd need to use manually-coded actions at a minimum to dig into the tokens, store them, organize them, and spit them out in the desired arrangement.

... That is, from here I consume the parts of the tree and create the data structure I want by hand.

You have the option of using one or more ANTLR tree parsers to aid you on your quest, but they too are concerned with token types and not token text. Ultimately I think you'd still have to code an action somewhere along the way.

Using your grammar and a custom tree grammar using the same token vocab, I was able to reduce this (with a root node for assistance):

(START (a (b (= c 2))) (a (b (d (= e 3)))) (a (b (d (= f 4)))))

to this:

(START (a (b (= c 2) (d (= e 3)))) (a (b (d (= f 4)))))

Not a bad start (and if you're interested I can post the tree grammar) but this required semantic predicates. ANTLR couldn't do this without some coding on my part.

So - can I produce the tree structure I want directly from a grammar? ... If not, why not - is it a technical limitation in ANTLR or is it something more CS-y to do with the type of language involved?

It's a technical limitation of sorts: After lexing, ANTLR proper (i.e., not the code you can inject) operates on tokens and not the text they may contain1. If text "a" mapped to token A and text "b" mapped to token B (and so on), a tree parser would give you some leverage that it couldn't now, but I think you'd still have to code some actions and/or semantic predicates to get what you want.


1 With the exception of being able to create a token with custom text, but that's not relevant to this problem.

他のヒント

What you can do is not to use AST, but define your own tree and operations. Then, you don't need to create a new tree every time when the rule "component" fires, but only append new nodes to it. I hope the idea is clear?

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top