Question

I have a string:

0000000000<table blalba>blaalb<tr>gfdg<td>kgdfkg</td></tr>fkkkkk</table>5555

I want to replace everything from <table to </table> with "", that is, delete it so that only 00000000005555 is displayed.

When it is on one line, it works:

chaineHtml = chaineHtml.replaceFirst("[^<title>](.*)[</title>$", "");

But the same with table fails.


Solution 2

Try this:

s = s.replaceAll("<table.+/table>", "");

OTHER TIPS

This regex should work:

html = html.replaceAll("(?is)<table.+?/table>", "");

Here (?is) turns on two flags: s (DOTALL) lets . match line terminators, so the match can span multiple lines, and i makes the match case-insensitive.
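A minimal sketch of what the two flags change, using a made-up class name (FlagDemo) and a made-up input with uppercase tags and line breaks. It is only an illustration of the pattern above, not a general HTML cleaner:

```java
final class FlagDemo {
    // Hypothetical helper: removes every table block, whatever the tag case,
    // even when the block spans several lines.
    static String stripTables(String html) {
        return html.replaceAll("(?is)<table.+?/table>", "");
    }

    public static void main(String[] args) {
        String html = "0000000000<TABLE blalba>\nblaalb<tr>gfdg</tr>\n</Table>5555";
        // Without the flags the tag case differs and '.' stops at the line break,
        // so nothing matches and the string comes back unchanged.
        System.out.println(html.replaceAll("<table.+?/table>", "").equals(html)); // true
        System.out.println(stripTables(html)); // 00000000005555
    }
}
```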

But I suggest not manipulating HTML with regex, as it is error-prone.

 [^<table>]

I don't think that means what you think it means.

It is not "a string not equal to <table>". Rather, it means "a single character that is not <, t, a, b, l, e or >". "[^...]" is called a negated character class.
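To see this in action, here is a small sketch (the class name CharClassDemo and the input are made up) that deletes every character matched by the class. Whatever is not one of the seven listed characters disappears, one character at a time:

```java
final class CharClassDemo {
    // Removes each single character that is NOT one of < t a b l e >.
    static String removeMatched(String input) {
        return input.replaceAll("[^<table>]", "");
    }

    public static void main(String[] args) {
        // x, y, z and / are outside the set, so they are the only characters removed.
        System.out.println(removeMatched("<table>xyz</table>")); // <table><table>
    }
}
```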

Change your regex to

 (.*?)<table>.*?</table>(.*?)

replace it with

$1$2

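and it will give you the result you wish, as long as the opening tag is literally <table> with no attributes and the markup sits on one line. For a tag with attributes, such as <table blalba>, use <table[^>]*> in place of <table>.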
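A sketch of a variant that copes with attributes on the opening tag, applied to the string from the question. The class name AttrDemo and the [^>]* addition are my own assumptions, not part of the answer above. The capture groups are dropped because replaceAll already leaves unmatched text in place:

```java
final class AttrDemo {
    // [^>]* skips any attributes between "<table" and the closing ">".
    static String stripTables(String s) {
        return s.replaceAll("<table[^>]*>.*?</table>", "");
    }

    // The pattern exactly as given above, for comparison.
    static String stripLiteral(String s) {
        return s.replaceAll("(.*?)<table>.*?</table>(.*?)", "$1$2");
    }

    public static void main(String[] args) {
        String s = "0000000000<table blalba>blaalb<tr>gfdg<td>kgdfkg</td></tr>fkkkkk</table>5555";
        System.out.println(stripTables(s));            // 00000000005555
        System.out.println(stripLiteral(s).equals(s)); // true: "<table>" never occurs literally
    }
}
```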


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The bottom section contains a list of online regex testers where you can try things out yourself. You may also want to check out the sections named "Character Classes" and, as mentioned by @anubhava, "General Information > Do not use regex to parse HTML".

Don't use regex if you are not familiar with its concepts!

There is a simple plain Java solution for your problem:

String begin = "<table";
String end = "</table>";
String s = "0000000001<table blalba>blaalb<tr>gfdg<td>kgdfkg</td></tr>fkkkkk</table>4555";
int tableIndex = s.indexOf(begin);

// Cut out one <table ...>...</table> block per iteration until none is left.
while (tableIndex > -1) {
    int tableEndIndex = s.indexOf(end, tableIndex);
    if (tableEndIndex == -1) {
        break; // opening tag without a closing tag: leave the rest untouched
    }
    s = s.substring(0, tableIndex) + s.substring(tableEndIndex + end.length());
    tableIndex = s.indexOf(begin);
}

String resultString = subjectString.replaceAll("<table.*?table>", "");

Explanation:

Match the characters “<table” literally «<table»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “table>” literally «table>»
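The lazy quantifier matters as soon as the input holds more than one table. The sketch below (class name and input are made up) compares the lazy pattern explained above with its greedy counterpart. Neither pattern crosses line breaks, since . does not match them by default:

```java
final class LazyDemo {
    // Lazy: stops at the first "table>" after each "<table".
    static String stripLazy(String s) {
        return s.replaceAll("<table.*?table>", "");
    }

    // Greedy: runs on to the last "table>" on the line.
    static String stripGreedy(String s) {
        return s.replaceAll("<table.*table>", "");
    }

    public static void main(String[] args) {
        String s = "A<table>1</table>B<table>2</table>C";
        System.out.println(stripLazy(s));   // ABC
        System.out.println(stripGreedy(s)); // AC (the B between the tables is lost)
    }
}
```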

Here is a neat solution I found somewhere: use the regex

[\s\S]

to match any character, including newlines. It works because every character is either a whitespace character (\s) or a non-whitespace character (\S). In your case that gives:

s = s.replaceAll("<table[\\s\\S]+/table>", "");

The double backslashes escape the backslash inside the Java string literal.

Another possibility is

(.|\n)

which matches either any character except a newline, or a newline. That gives:

s = s.replaceAll("<table(.|\n)+/table>", "");

For some reason, on my computer, with certain inputs, the (.|\n)+ regex recurses too deeply and throws a StackOverflowError:

Exception in thread "main" java.lang.StackOverflowError
    at java.lang.Character.codePointAt(Character.java:4668)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3693)

which doesn't happen when I use [\s\S]+ instead. I have no idea why though.
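A likely explanation, offered as an assumption rather than something confirmed in this thread: java.util.regex appears to match a repeated group that contains an alternation, such as (.|\n)+, recursively, so every repetition costs extra stack frames and a long input can exhaust the stack. A repeated character class such as [\s\S]+ is matched in a loop instead. The sketch below only exercises the character-class form on a long multi-line input. The class name, the row count and the lazy +? quantifier are my own choices:

```java
final class LongInputDemo {
    // Character-class form; the lazy +? stops at the first "/table>".
    static String stripTables(String s) {
        return s.replaceAll("<table[\\s\\S]+?/table>", "");
    }

    public static void main(String[] args) {
        // Build a table with many rows, each on its own line.
        StringBuilder sb = new StringBuilder("0000000000<table>\n");
        for (int i = 0; i < 100_000; i++) {
            sb.append("<tr><td>row</td></tr>\n");
        }
        sb.append("</table>5555");
        System.out.println(stripTables(sb.toString())); // 00000000005555
    }
}
```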

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow