Question

I have a string like this:

String s = "word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3";

or like this:

String s2 = "word3=PS2,p7,p8=q4,q5,q6|=PS3,p9=";

or like this:

String s3 = "=PS3=";

Formally, the string contains word definitions from a dictionary, separated by the "|" symbol.

Here:

  • word - the word in the dictionary (optional; absent in the second entry of s2 and in s3)

  • PS1, PS2, PS3 - part-of-speech tag (required)

  • p1, p2, ... - some parameters (optional)

  • q1, q2, q3, ... - some other parameters (also optional)

I want to build a regex that finds all occurrences of such strings in a text and gives me these groups:

  • group1 - word
  • group2 - part of speech tag
  • group3, group4, ... - parameters p
  • group(k), group(k+1), ... - the other parameters (q)

I don't care about the index of the group holding the last p parameter or the first q parameter. I only need to know that the first group is the word (may be null), the second group is the part of speech, and the remaining groups are the p and q parameters.

Right now I have this regex:

"([a-z]*)?=([A-Z]+)(,?[a-z]+)*=(,?[a-z]+)*"

But it doesn't work correctly: it gives me only the last p and q parameters. For s2:

  • group1 = word3 - OK
  • group2 = PS2 - OK
  • group3 = p8 - NOT OK (only last p-parameter)
  • group4 = q6 - NOT OK (also last q-parameter)

Could you help me?

UPDATE: The "=" character is only the separator between the p-parameters and the q-parameters. It isn't essential to my problem; you can treat the p-parameters and q-parameters as the same kind of parameter.

Example of real input:

String s = "bread=NOUN,plur,link=form|=VERB=";

Solution

You can't have a variable number of capture groups in a regex. In .NET each group can hold multiple captures, but not in Java: the regex engine stores only the last successful match for each group. The best you can do is match all the p- and q-parameters into two big groups and then split them up:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern1 = Pattern.compile(
    "([^|=,]*)" +                // Group 1: The word. Zero or more characters.
    "=([^|=,]*)" +               // Group 2: The part of speech.
    ",?([^|=,]*(?:,[^|=,]*)*)" + // Group 3: The p-params, comma-separated (may be empty)
    "=([^|=,]*(?:,[^|=,]*)*)"    // Group 4: The q-params, comma-separated (may be empty)
);
Matcher matcher = pattern1.matcher("word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3");
while (matcher.find()) {
  String word = matcher.group(1);          // "" when the word is absent
  String partOfSpeech = matcher.group(2);
  String pParamString = matcher.group(3);
  String qParamString = matcher.group(4);
  // "".split(",") returns [""], so handle empty parameter lists explicitly.
  String[] pParams = pParamString.isEmpty() ? new String[0] : pParamString.split(",");
  String[] qParams = qParamString.isEmpty() ? new String[0] : qParamString.split(",");
  // Do something with the above variables...
}

I used [^|=,]* to match any run of non-special characters, i.e. anything other than "|", "=" and ",".
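
Since the question's update says the p- and q-parameters need not be distinguished, the match-then-split idea can be wrapped in a small helper that returns one flat list per entry. This is only a sketch under that assumption: the class name DictionaryEntryParser and the methods splitParams and parse are made up for illustration, and the pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictionaryEntryParser {
    // Same pattern as in the answer: word, part of speech, p-params, q-params.
    private static final Pattern ENTRY = Pattern.compile(
        "([^|=,]*)=([^|=,]*),?([^|=,]*(?:,[^|=,]*)*)=([^|=,]*(?:,[^|=,]*)*)");

    /** Splits a comma-separated list; an empty string yields an empty list. */
    public static List<String> splitParams(String s) {
        return s.isEmpty()
            ? new ArrayList<>()
            : new ArrayList<>(Arrays.asList(s.split(",")));
    }

    /** Returns one list per entry: [word, partOfSpeech, param, param, ...]. */
    public static List<List<String>> parse(String text) {
        List<List<String>> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            List<String> entry = new ArrayList<>();
            entry.add(m.group(1));                  // word, "" when absent
            entry.add(m.group(2));                  // part of speech
            entry.addAll(splitParams(m.group(3)));  // p-params
            entry.addAll(splitParams(m.group(4)));  // q-params
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Prints [[bread, NOUN, plur, link, form], [, VERB]]
        System.out.println(parse("bread=NOUN,plur,link=form|=VERB="));
    }
}
```

The first two elements of each list play the role of group1 and group2 from the question; everything after them is a parameter. An absent word comes back as an empty string rather than null.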

OTHER TIPS

When I have problems like that I look at the modifiers on the quantifiers. You may want to try a different modifier on some of them, e.g. the possessive form:

(,?[a-z]+)*+

The difference is that the final zero-or-more quantifier now grabs as much as it can and never gives it back. Java quantifiers are already greedy by default, though, and this is just an example: I'm not at all sure that this particular modifier is what you need, and a modifier alone does not change the fact that a repeated group reports only its last match.
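
Whatever modifier is placed on the quantifier, a repeated capturing group in java.util.regex still reports only its final iteration. The following is a minimal sketch to check this; the class name LastCaptureDemo and the helper lastCapture are hypothetical, and the character class is widened to [a-z0-9] so that the question's sample parameters (p7, p8) match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    /** Returns what group 1 holds after matching the whole input. */
    public static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match for " + regex);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String input = "p7,p8";
        // Greedy quantifier on the group: prints ,p8
        System.out.println(lastCapture("(,?[a-z0-9]+)*", input));
        // Possessive quantifier on the group: prints ,p8
        System.out.println(lastCapture("(,?[a-z0-9]+)*+", input));
    }
}
```

Both variants leave only the last parameter in group 1, which is why the accepted approach captures the whole comma-separated list in one group and splits it afterwards.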

Licensed under: CC-BY-SA with attribution