Question

I am removing control characters from a string as I load and deserialise it. I do this with the following regex, which is fine:

\\p{C}

The issue is that part of the text is meant to have new lines in it. So what I need to do is remove all control characters unless they fall between <Text> and </Text>.

How can I do this with a regex?

Solution

You could use

replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1")

The idea is to match each Text element as a whole and leave it alone (replace it with itself). So if we encounter a \\p{C} on its own, we know it's not inside one.

Explanation:

  • (?s) activates "dot match all", so . will match newline as well
  • (<Text>.*?</Text>) captures the text node in the first group. We replace with the result of this capture through $1
  • If we match \\p{C}, this means we are not in a Text node. So we replace with $1, which is empty since (<Text>.*?</Text>) didn't match in the alternation.
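The alternation described above can be sketched as a small self-contained program. The class name ControlCharCleaner, the method name clean, and the sample input are made up for illustration; only the pattern and the "$1" replacement come from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class ControlCharCleaner {
    // Either a whole <Text>...</Text> element (captured as group 1),
    // or a single character from Unicode category C.
    private static final Pattern PATTERN =
            Pattern.compile("(?s)(<Text>.*?</Text>)|\\p{C}");

    static String clean(String input) {
        // "$1" puts the element back when group 1 matched. When the second
        // alternative matched instead, group 1 is unset and nothing is inserted.
        return PATTERN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Assumed sample input: control characters both outside and inside <Text>.
        String input = "a\u0001b\n<Text>line1\nline2</Text>\tc";
        String expected = "ab<Text>line1\nline2</Text>c";
        System.out.println(clean(input).equals(expected)); // prints "true"
    }
}
```

Note that the newline inside the Text element survives, while the newline and tab outside it are removed, because both fall into \\p{C}.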

Ideone illustration: http://ideone.com/xKZgsn

OTHER TIPS

You could use this regex:

/(?!<text[^>]*?>)(\p{C}+)(?![^<]*?<\/text>)/gi

But, as mentioned by @fge, it would be better to parse your input cleanly.

Here is a string I have to test regex patterns that remove control characters.

AAU?Aasddsaustw3h,kdf134dfswdesdfent?�sdfsadfa45678r?w3h,kdf134dfswdesdfawh,kdf134dfswdesdfsurew3h,kdf134dfswdesdfent??3asdfliit/123423defwecty ?�STasd?Pawh,kdf134dfswdesdfks?Hw3rsdfsd134dfswdet

It seems the regex pattern "[[:cntrl:]]" works well. string.replaceAll("[\u0000-\u001f]", "") only replaces some of them, and "\p{Cntrl}" only replaces an empty string after "wecty".

Can anyone tell me what those control characters are? I can replace them, but I could not figure out what they are. The Java online regex tester shows that 11 control characters matched: https://www.freeformatter.com/java-regex-tester.html#ad-output
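One way to find out which characters they are is to walk the string and print the code point and Unicode general category of everything that \p{C} would cover. The sketch below is only an illustration: the class name ControlCharInspector, its method names, and the sample input are hypothetical, and the original test string is not reproduced because its control characters did not survive copying. A possible reason a range like [\u0000-\u001f] catches only some of them is that category C also includes format, surrogate, private-use and unassigned code points, as well as control characters above U+001F.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for inspecting a string.
final class ControlCharInspector {
    // Returns one entry per code point in category C,
    // formatted as "index: U+XXXX category".
    static List<String> describe(String input) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String category = categoryName(Character.getType(cp));
            if (category != null) {
                found.add(String.format("%d: U+%04X %s", i, cp, category));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return found;
    }

    // Maps the five "Other" general categories to a label, anything else to null.
    static String categoryName(int type) {
        switch (type) {
            case Character.CONTROL:     return "Cc (control)";
            case Character.FORMAT:      return "Cf (format)";
            case Character.SURROGATE:   return "Cs (surrogate)";
            case Character.PRIVATE_USE: return "Co (private use)";
            case Character.UNASSIGNED:  return "Cn (unassigned)";
            default:                    return null;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: a C0 control character and a zero-width space.
        for (String line : describe("a\u0001b\u200Bc")) {
            System.out.println(line);
        }
    }
}
```

Running describe on the real input (read from its original source rather than pasted text) should list each offending character with its position.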

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow