Question

I need some help figuring out the regex for XML character references to control characters, in decimal or hex.

These sequences look like the following:

&#0;
&#31;
&#x0;
&#x1F;
&#x001F;

In other words, each one is an ampersand, followed by a pound sign (#), followed by an optional 'x' to denote hexadecimal mode, followed by 1 to 4 decimal (or hexadecimal) digits, followed by a semicolon.

I'm specifically trying to identify those sequences where they contain (inclusive) numbers from decimal 0 to 31, or hexadecimal 0 to 1F.

Can anyone figure out the regex for this?

Solution

If you use a zero-width lookahead assertion to restrict the number of digits, you can write the rest of the pattern without worrying about the length restriction. Try this:

&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);

Explanation:

(?=x?[0-9A-Fa-f]{1,4};) #Restricts the numeric portion to at most four digits, including leading zeroes. The trailing ; anchors the count; without it, the lookahead would succeed on the first four digits of a longer run.
0*                      #Consumes leading zeroes if there is no x.
[12]?\d                 #Allows decimal numbers 0 - 29, inclusive.
3[01]                   #Allows decimal 30 or 31.
x0*1?[0-9A-Fa-f]        #Allows hexadecimal 0 - 1F, inclusive, regardless of case or leading zeroes.

This pattern allows leading zeroes after the x, but the (?=x?[0-9A-Fa-f]{1,4};) part prevents them from occurring before an x: the lookahead only accepts an x immediately after the #, and everything between that point and the semicolon must be digits.
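A lookahead-based pattern like this is easiest to trust after running it against boundary cases. The following is a minimal sketch using Python's re module, not part of the original answer. It assumes the lookahead ends with the semicolon so that the four-digit limit covers the whole numeric part, and it writes \d as [0-9] because Python's \d also matches non-ASCII digits in str patterns. The names CONTROL_REF, SHOULD_MATCH and SHOULD_REJECT are made up for the example.

```python
import re

# Verbose rendering of the lookahead pattern (assumption: ';' inside the
# lookahead, so longer digit runs are rejected).
CONTROL_REF = re.compile(
    r"""
    &\#                        # literal ampersand and pound sign
    (?=x?[0-9A-Fa-f]{1,4};)    # optional x, then 1-4 digits, then the semicolon
    0*                         # leading zeroes of a decimal reference
    (?: [12]?[0-9]             # decimal 0-29
      | 3[01]                  # decimal 30-31
      | x0*1?[0-9A-Fa-f]       # hexadecimal 0-1F
    )
    ;
    """,
    re.VERBOSE,
)

# Hypothetical boundary cases chosen for illustration.
SHOULD_MATCH = ["&#0;", "&#9;", "&#31;", "&#0031;",
                "&#x0;", "&#x1F;", "&#x1f;", "&#x001F;"]
SHOULD_REJECT = ["&#32;", "&#x20;", "&#00031;", "&#x0001F;",
                 "&#0x1F;", "&#1F;", "&#;", "&#x;"]

for ref in SHOULD_MATCH + SHOULD_REJECT:
    verdict = "match" if CONTROL_REF.fullmatch(ref) else "no match"
    print(f"{ref:12} {verdict}")
```

To scan a whole document rather than a single token, CONTROL_REF.finditer(text) yields each matching reference along with its position.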

OTHER TIPS

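&#(0{0,2}[12]\d|0{0,3}\d|0{0,2}3[01]|x0{0,3}[0-9A-Fa-f]|x0{0,2}1[0-9A-Fa-f]);

It's not the most elegant, but it should work: each alternative spells out how many leading zeroes still fit within the four-digit limit, so single-digit values such as &#5; and &#xF; match as well as their zero-padded forms.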

Verified in RegexBuddy.

I think the following should work:

&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));

Here is a Rubular:
http://www.rubular.com/r/VEYx25Fdpj
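
Because the input space is small (1 to 4 digits in either base), a pattern like this can also be checked exhaustively instead of by spot tests. The sketch below is an illustration only: it compares the pattern above against a plain numeric test (value between 0 and 31) for every decimal and hexadecimal digit string of length 1 to 4. The helper names is_control_value and find_mismatches are hypothetical.

```python
import re
from itertools import product

# Pattern copied from the answer above.
PATTERN = re.compile(
    r"&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));"
)

def is_control_value(digits: str, is_hex: bool) -> bool:
    """Numeric oracle: True when the digits denote a value from 0 to 31."""
    return int(digits, 16 if is_hex else 10) <= 31

def find_mismatches() -> list:
    """Return every reference where the regex and the oracle disagree."""
    mismatches = []
    alphabets = ((False, "0123456789"), (True, "0123456789abcdefABCDEF"))
    for is_hex, alphabet in alphabets:
        for length in range(1, 5):
            for combo in product(alphabet, repeat=length):
                digits = "".join(combo)
                ref = "&#" + ("x" if is_hex else "") + digits + ";"
                matched = PATTERN.fullmatch(ref) is not None
                if matched != is_control_value(digits, is_hex):
                    mismatches.append(ref)
    return mismatches

if __name__ == "__main__":
    bad = find_mismatches()
    print("disagreements:", len(bad))
```

An empty result from find_mismatches() means the regex and the numeric test agree on every 1 to 4 digit reference; the same harness can be pointed at any of the other patterns by swapping PATTERN.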

Licensed under: CC-BY-SA with attribution