Using ANTLR to parse JavaDoc comments
26-09-2019
Question
I'm trying to parse one particular (home-grown) JavaDoc tag in my JavaScript files, and I'm struggling to see how to achieve this. ANTLR complains about my grammar as noted below:
jsDocComment
: '/**' (importJsDocCommand | ~('*/'))* '*/' <== See note 1
;
importJsDocCommand
: '@import' gav
;
gav
: gavGroup ':' gavArtifact
-> ^(IMPORT gavGroup gavArtifact)
;
gavGroup
: gavIdentifier
;
gavArtifact
: gavIdentifier
;
gavIdentifier
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'.')* <== See note 2
;
Note 1: The following alternatives can never be matched: 1
Note 2: Decision can match input such as "'_'..'.'" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input
Here's what I'm trying to parse:
/** a */
/** @something */
/** @import com.jquery:jquery */
All three lines should parse successfully, with only the @import statement (along with its Maven group:artifact value) producing an AST node named "IMPORT".
Thanks for your assistance.
Solution 2
My solution to this problem was to use ANTLR's lexer on its own, without a parser, in filter mode so that it skips everything I'm not interested in. Here's what I came up with (it also looks for globally defined variables as well as imports):
lexer grammar ECMAScriptLexer;
options {filter=true;}
@lexer::header {
package com.classactionpl.mojo.javascript;
}
@members {
int scopeLevel = 0;
}
IMPORTDOC
: '/**' .* IMPORT .* (IMPORT)* '*/'
;
fragment
IMPORT
: '@import' WS groupId=GAVID ':' artifactId=GAVID
{System.out.println("found import: " + $groupId.text + ":" + $artifactId.text);}
;
fragment
GAVID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'-'|'0'..'9'|'.')*
;
COMMENT
: '/*' .* '*/'
;
SL_COMMENT
: '//' .* '\n'
;
ENTER_SCOPE
: '{' {++scopeLevel;}
;
EXIT_SCOPE
: '}' {--scopeLevel;}
;
WINDOW_VAR
: 'window.' name=ID WS? value=(';' | '=') ~('=')
{
System.out.println("found window var " + $name.text + " = " + ($value == ';'));
}
;
GLOBAL_VAR
: 'var' WS name=ID WS? value=(';' | '=') ~('=')
{
if (scopeLevel == 0) {
System.out.println("found global var " + $name.text + " = " + ($value == ';'));
}
}
;
fragment
ID : ('a'..'z'|'A'..'Z'|'$'|'_') ('a'..'z'|'A'..'Z'|'$'|'_'|'0'..'9')*
;
fragment
WS : (' '|'\t'|'\n')+
;
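As a quick cross-check of what the IMPORTDOC and IMPORT rules are expected to find, here is a minimal sketch that performs the same extraction with java.util.regex instead of ANTLR. It is not the generated lexer: the class name ImportDocScan and the method findImports are hypothetical, and the identifier pattern is only assumed to mirror the GAVID fragment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImportDocScan {
    // A doc comment, matched reluctantly so that it ends at the first "*/".
    private static final Pattern DOC_COMMENT =
            Pattern.compile("/\\*\\*(.*?)\\*/", Pattern.DOTALL);

    // Assumed to mirror GAVID: a letter or '_' followed by letters, digits, '_', '-' or '.'.
    private static final String GAVID = "[A-Za-z_][A-Za-z0-9_.-]*";

    // "@import", whitespace, then group:artifact.
    private static final Pattern IMPORT =
            Pattern.compile("@import\\s+(" + GAVID + "):(" + GAVID + ")");

    // Hypothetical helper: returns every "group:artifact" found inside doc comments.
    static List<String> findImports(String source) {
        List<String> found = new ArrayList<>();
        Matcher doc = DOC_COMMENT.matcher(source);
        while (doc.find()) {
            Matcher imp = IMPORT.matcher(doc.group(1));
            while (imp.find()) {
                found.add(imp.group(1) + ":" + imp.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "/** a */\n/** @something */\n/** @import com.jquery:jquery */\n";
        System.out.println(findImports(source)); // prints [com.jquery:jquery]
    }
}
```

Running it over the three sample comments from the question should report only the com.jquery:jquery import, which is the same result the filtering lexer is meant to produce.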
OTHER TIPS
Christopher Hunt wrote:
- Note 1: The following alternatives can never be matched: 1
~('*/') is incorrect: inside lexer rules you can only negate single characters, not multi-character literals (!). Moreover, in your snippet you're negating inside a parser rule, and in parser rules ~ negates tokens, not characters. For example:
parse : ~A;
foo : .;
A : 'A';
B : 'B';
C : 'C';
Here the parse rule does not match "any character other than 'A'"; it matches any token other than A, that is, either the B or the C token. Likewise, foo does not match any character, but matches any single token (any lexer rule).
Christopher Hunt wrote:
- Note 2: Decision can match input such as "'_'..'.'" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input
Two questions:
- did you post the entire grammar?
- are you trying to parse the entire JS file or are you just "filtering" JS files and pulling out the JavaDoc comments?
If it's the latter, there is a much easier way to do this using ANTLR (and I can give an explanation if this is the case).
EDIT
It's easiest to just add a new DocComment rule to the lexer and to place it just above the (existing) Comment rule:
DocComment
: '/**' (options {greedy=false;} : .)* '*/'
;
Comment
: '/*' (options {greedy=false;} : .)* '*/' {$channel=HIDDEN;}
;
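The (options {greedy=false;} : .)* subrule is what makes each comment rule stop at the first '*/' rather than running on to the last one in the file. As an analogy only (this is java.util.regex, not ANTLR, and the class name GreedyDemo and helper firstMatch are hypothetical), the difference between a greedy and a non-greedy loop can be sketched like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "/** first */ var x; /** second */";

        // Greedy loop: consumes as much as possible, so it ends at the LAST "*/".
        System.out.println(firstMatch("/\\*\\*.*\\*/", source));
        // prints /** first */ var x; /** second */

        // Non-greedy loop: ends at the FIRST "*/", like greedy=false in the rules above.
        System.out.println(firstMatch("/\\*\\*.*?\\*/", source));
        // prints /** first */
    }
}
```

With the greedy form, the code between two comments would be swallowed into a single token, which is why both DocComment and Comment use the non-greedy loop.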