Regex to remove control characters except in a certain tag

Question 1

You could use

replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1")

The idea is to match each Text element first and leave its contents alone (replace it with itself). So if the \\p{C} branch matches, we know the character is not inside a Text element.

Explanation:

(?s) activates "dot matches all" mode, so . matches newlines as well
(<Text>.*?</Text>) captures the text node in the first group. We replace with the result of this capture through $1
If we match \\p{C}, this means we are not in a Text node. So we replace with $1, which is empty since (<Text>.*?</Text>) didn't match in the alternation.
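
As a minimal, self-contained sketch of the above (the class name ControlCharStripper, the method name strip and the sample input are made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex itself comes from the answer above.
final class ControlCharStripper {
    // Branch 1 captures a whole <Text>...</Text> element in group 1.
    // Branch 2 matches a single character of Unicode category C (\p{C}).
    private static final Pattern PATTERN =
            Pattern.compile("(?s)(<Text>.*?</Text>)|\\p{C}");

    static String strip(String input) {
        // "$1" re-inserts the Text element when branch 1 matched and
        // inserts nothing when branch 2 matched (group 1 is unset then).
        return PATTERN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "a\u0001b<Text>c\u0002d</Text>e\u0003f";
        String output = strip(input);
        // U+0001 and U+0003 are removed, U+0002 inside <Text> is kept.
        System.out.println(output.equals("ab<Text>c\u0002d</Text>ef")); // prints true
    }
}
```

Note that \\p{C} also covers line breaks and tabs, so newlines outside Text elements are removed too.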

Ideone illustration: http://ideone.com/xKZgsn

Question 2

You could use this regex:

/(?!<text[^>]*?>)(\p{C}+)(?![^<]*?<\/text>)/gi

But, as mentioned by @fge, it would be better to parse your input properly.
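
For comparison, here is a rough Java port of that lookahead regex. This is a sketch under assumptions: the original is written in /.../gi delimiter syntax, so the translation to Pattern.CASE_INSENSITIVE plus replaceAll is mine, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical port of the /.../gi regex above to java.util.regex.
final class LookaroundStripper {
    // The "i" flag becomes CASE_INSENSITIVE; the "g" flag becomes replaceAll.
    private static final Pattern PATTERN = Pattern.compile(
            "(?!<text[^>]*?>)(\\p{C}+)(?![^<]*?</text>)",
            Pattern.CASE_INSENSITIVE);

    static String strip(String input) {
        // A run of \p{C} characters is removed only if it is NOT followed
        // by "</text>" with no "<" in between.
        return PATTERN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "a\u0001b<text>c\u0002d</text>e\u0003f";
        System.out.println(strip(input).equals("ab<text>c\u0002d</text>ef")); // prints true
    }
}
```

The second lookahead stops at the first "<", so a control character that is followed by a nested tag inside the text element would still be removed. That limitation is one reason a real parser is the safer choice.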

Question 3

Here is a string I use to test regex patterns that remove control characters.

AAU?Aasddsaustw3h,kdf134dfswdesdfent?�sdfsadfa45678r?w3h,kdf134dfswdesdfawh,kdf134dfswdesdfsurew3h,kdf134dfswdesdfent??3asdfliit/123423defwecty ?�STasd?Pawh,kdf134dfswdesdfks?Hw3rsdfsd134dfswdet

The regex pattern "[[:cntrl:]]" seems to work well. string.replaceAll("[\u0000-\u001f]", "") replaces only some of them. "\p{Cntrl}" only replaces what looks like an empty string after "wecty".

Can anyone tell me what these control characters are? I can replace them, but I could not figure out what they are. The Java online regex tester shows 11 control characters matched: https://www.freeformatter.com/java-regex-tester.html#ad-output
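
One way to see what a pattern actually matches is to list the code point of every match. The following is a minimal sketch; the class name MatchInspector and the method describeMatches are hypothetical, and the sample string is made up rather than taken from the test string above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper: reports where a regex matches and
// which code point starts each match.
final class MatchInspector {
    static List<String> describeMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            int codePoint = input.codePointAt(m.start());
            found.add(String.format("index %d: U+%04X", m.start(), codePoint));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "a\u0001b\u007Fc"; // made-up sample, not the string above
        // \p{Cntrl} covers U+0000-U+001F and U+007F by default.
        System.out.println(describeMatches("\\p{Cntrl}", sample));
        // The explicit range stops at U+001F, so U+007F is not reported.
        System.out.println(describeMatches("[\\u0000-\\u001f]", sample));
    }
}
```

Running each candidate pattern through such a helper on the real test string would show which code points each pattern catches and which it misses.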