Splitting by regex or ebnf

https://stackoverflow.com/questions/16533505

29-05-2022

Question

I've got a string like:

create Person +fname : String, +lname: String, -age:int;

Is it possible to split it using a regex or EBNF, so that every part matching something like [a-zA-Z0-9] (the parts we don't know in advance) is stored in an array?

In other words, by using this regexp:

^create [a-zA-Z][a-zA-Z0-9]* [s|b]?[+|[-]|=][a-zA-Z][a-zA-Z0-9]*[ ]?:[ ]?[a-zA-Z][a-zA-Z0-9]*(, [s|b]?[+|[-]|=][a-zA-Z][a-zA-Z0-9]*[ ]?:[ ]?[a-zA-Z][a-zA-Z0-9]*)*;

I want to obtain this array:

Person
+
fname
String
+
lname
String
-
age
int

Solution

You can try splitting it this way:

// Split on runs of whitespace, ':', ';' or ',',
// OR at the zero-width position right after a '+' or '-'.
String[] tokens = "create Person +fname : String, +lname: String, -age:int;"
        .split("[\\s:;,]+|(?<=[+\\-])");
for (String s : tokens) {
    System.out.println("=> " + s);
}

Output:

=> create
=> Person
=> +
=> fname
=> String
=> +
=> lname
=> String
=> -
=> age
=> int

As you can see, this puts create at the start of the array, so just start iterating from tokens[1].

You could try to add ^create\\s to the splitting rule, but that produces an empty string at the start of the tokens array, so it doesn't solve anything.
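
The two points above can be checked with a small sketch. The class name SplitDemo and the helper dropFirst are made up for illustration; the split rule is the one from the answer, and the second rule only adds the ^create\\s alternative in front of it.

```java
import java.util.Arrays;

class SplitDemo {
    static final String INPUT =
            "create Person +fname : String, +lname: String, -age:int;";

    // Rule from the answer: "create" remains the first token.
    static String[] splitPlain() {
        return INPUT.split("[\\s:;,]+|(?<=[+\\-])");
    }

    // Same rule with ^create\s added: the match at index 0 has positive
    // width, so split() keeps an empty string in front of "Person".
    static String[] splitWithCreate() {
        return INPUT.split("^create\\s|[\\s:;,]+|(?<=[+\\-])");
    }

    // Hypothetical helper: copy the array without its first element.
    static String[] dropFirst(String[] tokens) {
        return Arrays.copyOfRange(tokens, 1, tokens.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithCreate()));
        System.out.println(Arrays.toString(dropFirst(splitPlain())));
    }
}
```

The first printed array starts with an empty string followed by Person, while the second starts directly with Person, so skipping tokens[0] (or copying from index 1) is the simpler fix.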

OTHER TIPS

Regex is fine for lots of things, but sometimes you need a real lexer. JFlex is great; it can handle practically any tokenization task. If you need to go a little further and build a parse tree, JavaCC or ANTLR are good choices.
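
To give a rough idea of what a lexer adds over split, here is a minimal hand-written sketch using only java.util.regex. It is not JFlex and does not use any JFlex API. The class name TinyLexer, the token labels, and the group names are assumptions made for this illustration. The visibility markers +, - and = are taken from the regex in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyLexer {
    // One named group per token kind; whitespace is matched so it can be skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<ws>\\s+)"
            + "|(?<ident>[a-zA-Z][a-zA-Z0-9]*)"
            + "|(?<vis>[+\\-=])"
            + "|(?<punct>[:,;])");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Only accept a token that starts exactly at the current position.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                        "Unexpected character at index " + pos);
            }
            if (m.group("ident") != null) {
                tokens.add("IDENT:" + m.group());
            } else if (m.group("vis") != null) {
                tokens.add("VIS:" + m.group());
            } else if (m.group("punct") != null) {
                tokens.add("PUNCT:" + m.group());
            } // whitespace produces no token
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "create Person +fname : String, +lname: String, -age:int;";
        for (String t : tokenize(input)) {
            System.out.println(t);
        }
    }
}
```

Unlike split, this version labels each token and rejects characters it does not recognize, which is the kind of behavior a generated lexer would also give you, with far less manual work for larger grammars.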

Licensed under: CC-BY-SA with attribution
