Java: Regex to delete wiki markup of lists

https://stackoverflow.com/questions/19196892

30-06-2022

Question

I am reading a Wikipedia XML file, and I have to delete the list-item markup from it. For example, given the following string:

String text = ": definition list\n"
        + "** some list item\n"
        + "# another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

Here, I want to delete the leading *, # and :, but not the : in the lines that say Category. The output should look like:

String outtext = "definition list\n"
        + "some list item\n"
        + "another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

I am using the following code:

Pattern pattern = Pattern.compile("(^\\*+|#+|;|:)(.+)$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String outtext = matcher.group(0);
    outtext = outtext.replaceAll("(^\\*+|#+|;|:)\\s", "");
    return outtext;
}

This is not working. Can you please indicate how I should do it?

Solution

This should work:

text = text.replaceAll("(?m)^[*:#]+\\s*", "");

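The important part is (?m), which enables MULTILINE mode so that the anchors ^ and $ match at the start and end of every line rather than only at the start and end of the whole input. [*:#]+ then matches one or more list markers at the start of a line, and \s* removes the whitespace after them. Category lines are left alone because they start with [, so the : inside them is never at the start of a line. In the original pattern, ^ applies only to the first alternative (\*+), so #+, ; and : can match anywhere in the text, including the colon in [[Category:...]].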
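
To make the effect of the flag concrete, here is a minimal sketch that runs the same expression with and without (?m). The class name MultilineAnchorDemo, the method names and the shortened sample input are made up for illustration and are not part of the original question or answer.

```java
// Hypothetical demo class; names and sample input are illustrative only.
final class MultilineAnchorDemo {

    static String stripWithoutFlag(String text) {
        // Without (?m), ^ matches only at the very start of the input,
        // so only the first line can lose its marker.
        return text.replaceAll("^[*:#]+\\s*", "");
    }

    static String stripWithFlag(String text) {
        // With (?m), ^ also matches right after each line terminator,
        // so every line is checked for leading markers.
        return text.replaceAll("(?m)^[*:#]+\\s*", "");
    }

    public static void main(String[] args) {
        String text = ": definition list\n** some list item\n[[Category:1918 births]]";
        System.out.println(stripWithoutFlag(text));
        System.out.println("---");
        System.out.println(stripWithFlag(text));
    }
}
```

In the first result only ": " is removed and "** some list item" keeps its markers; in the second, both list lines are cleaned and the Category line is unchanged.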

OUTPUT:

definition list
some list item
another list item
[[Category:1918 births]]
[[Category:2005 deaths]]
[[Category:Scottish female singers]]
[[Category:Billy Cotton Band Show]]
[[Category:Deaths from Alzheimer's disease]]
[[Category:People from Glasgow]]
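
If the replacement is applied to many pages of a dump, String.replaceAll recompiles the expression on every call. One possible variant, sketched here under that assumption, compiles the pattern once and passes Pattern.MULTILINE as a flag instead of embedding (?m). The class name WikiListMarkup and the method stripListMarkers are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not taken from the original answer.
final class WikiListMarkup {

    // Same expression as in the answer, compiled once.
    // Pattern.MULTILINE is the flag equivalent of the inline (?m).
    private static final Pattern LIST_MARKERS =
            Pattern.compile("^[*:#]+\\s*", Pattern.MULTILINE);

    static String stripListMarkers(String text) {
        // Remove leading list markers (and the whitespace after them) on every line.
        return LIST_MARKERS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = ": definition list\n"
                + "** some list item\n"
                + "# another list item\n"
                + "[[Category:1918 births]]\n"
                + "[[Category:People from Glasgow]]";
        System.out.println(stripListMarkers(text));
    }
}
```

Two things to keep in mind with this sketch: \s also matches line breaks, so a line consisting only of markers would be joined with the next line (using [ \t]* instead would avoid that), and the ; marker from the question's pattern is not in the character class and would have to be added if definition terms should be stripped as well.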

Licensed under: CC-BY-SA with attribution
