Java regex to parse any number of Markdown-style links

https://stackoverflow.com/questions/23657575

22-07-2023

Question

I'm trying to parse a string for any occurrences of Markdown-style links, i.e. [text](link). I can get the first link in a string, but if the string contains multiple links I can't access the rest. Here is what I've tried (you can run it on ideone):

Pattern p;
try {
    p = Pattern.compile("[^\\[]*\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)(?:.*)");
} catch (PatternSyntaxException ex) {
    System.out.println(ex);
    throw ex;
}
Matcher m1 = p.matcher("Hello");
Matcher m2 = p.matcher("Hello [world](ladies)");
Matcher m3 = p.matcher("Well, [this](that) has [two](too many) keys.");
System.out.println("m1 matches: " + m1.matches());  // false
System.out.println("m2 matches: " + m2.matches());  // true
System.out.println("m3 matches: " + m3.matches());  // true
System.out.println("m2 text: " + m2.group("text")); // world
System.out.println("m2 link: " + m2.group("link")); // ladies
System.out.println("m3 text: " + m3.group("text")); // this
System.out.println("m3 link: " + m3.group("link")); // that
System.out.println("m3 end: " + m3.end());          // 44 - I want 18
System.out.println("m3 count: " + m3.groupCount()); // 2 - I want 4
System.out.println("m3 find: " + m3.find());        // false - I want true

I know I can't have repeating groups, but I figured find() would work; however, it does not behave as I expected. How can I modify my approach so that I can parse each link?

Solution

Can't you go through the matches one by one and do the next match from an index after the previous match? You can use this regex:

\[(?<text>[^\]]*)\]\((?<link>[^\)]*)\)

The find() method looks for the next part of the input that matches the pattern, so a match can be a substring of the entire string. Each call to find() returns the next match, starting where the previous one ended. matches(), by contrast, tries to match the entire string and fails if the whole string does not match. Use something like this:

while (m.find()) {
    String s = m.group(1);
    // s now holds the link text of the current match; m.group(2) holds the link target
}
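
To make the loop above concrete, here is a minimal, self-contained sketch that applies the regex from this answer to the question's third test string. The class name MarkdownLinks and the helper extractLinks are hypothetical names chosen for illustration; only Pattern, Matcher, find() and group() come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkdownLinks {
    // The regex from the answer: no leading or trailing ".*", so each
    // match covers exactly one [text](link) pair.
    private static final Pattern LINK =
        Pattern.compile("\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)");

    // Hypothetical helper: returns one {text, link} pair per link found.
    static List<String[]> extractLinks(String input) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        // Each find() call resumes scanning after the previous match.
        while (m.find()) {
            links.add(new String[] { m.group("text"), m.group("link") });
        }
        return links;
    }

    public static void main(String[] args) {
        String input = "Well, [this](that) has [two](too many) keys.";
        for (String[] link : extractLinks(input)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
        // prints:
        // this -> that
        // two -> too many
    }
}
```

Note that groupCount() still reports 2 here: it counts the groups in the pattern, not the number of matches, so the number of links has to be counted in the loop.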

Other tips

The regular expression I used to match what you need (without groups) is \[\w+\]\(.+\)

It is only meant to keep things simple. Basically, it does the following:

Match an opening square bracket: \[
Followed by one or more word characters: \w+
Then the closing square bracket: \]

This matches patterns like [blabla]

Then the same with parentheses:

Match an opening parenthesis: \(
Followed by one or more of any character: .+
Then the closing parenthesis: \)

So it matches (ble...ble...)

Now, if you want to store the matches in groups, you can add parentheses like this:

(\[\w+\])(\(.+\))

This way the text and the link are each captured in their own group.

Hope this helps.

I tried it on regexplanet.com and it works.

Update: workaround .*(\[\w+\])(\(.+\))*.*
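
One caveat about the grouped pattern (\[\w+\])(\(.+\)) when a line contains several links: .+ is greedy, so it runs to the last closing parenthesis on the line, and \w+ does not match link text containing spaces. The sketch below compares it with a hypothetical variant (not part of the original answer) that uses a negated character class and keeps the delimiters out of the groups. The class and helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLinkDemo {
    // Grouped pattern from the answer above: ".+" is greedy.
    static final Pattern GREEDY = Pattern.compile("(\\[\\w+\\])(\\(.+\\))");

    // Hypothetical variant: "[^)]+" stops at the first ")", and the
    // groups capture only the text and the link, without delimiters.
    static final Pattern BOUNDED = Pattern.compile("\\[(\\w+)\\]\\(([^)]+)\\)");

    // Counts how many times the pattern is found in the input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns group 2 of the first match, or null if nothing matches.
    static String firstLinkGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "Well, [this](that) has [two](too many) keys.";
        System.out.println(countMatches(GREEDY, input));    // 1
        System.out.println(firstLinkGroup(GREEDY, input));  // (that) has [two](too many)
        System.out.println(countMatches(BOUNDED, input));   // 2
        System.out.println(firstLinkGroup(BOUNDED, input)); // that
    }
}
```

A reluctant quantifier (.+?) would be another way to stop at the first closing parenthesis; the negated class is used here only because it matches the style of the accepted regex.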

Licensed under: CC-BY-SA with attribution
