Looking for a regular expression including alphanumeric + “&” and “;”

https://stackoverflow.com/questions/152218

02-07-2019

Question

Here's the problem:

split=re.compile('\\W*')

This regular expression works fine for ordinary words, but there are occasions where I need it to treat words containing HTML entities, such as k&auml;ytt&auml;j&auml;, as a single word.

What should I add to the regex to include the & and ; characters?

Solution

You probably want to approach the problem in reverse, i.e. match runs of characters that are not whitespace:

[^ \t\n]*

Or you want to add the extra characters:

[a-zA-Z0-9&;]*

In case you want to match HTML entities, you should try something like:

(\w+|&\w+;)*
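To compare the three patterns above, here is a minimal sketch using re.findall. The sample text and variable names are made up for illustration. Two assumptions differ from the patterns as written: the quantifier is + instead of * so that findall does not return empty matches, and the group is non-capturing, (?:...), so that findall returns the whole match rather than the last captured group.

```python
import re

# Hypothetical sample text, chosen so the three patterns disagree.
text = "R&D, k&auml;ytt&auml;j&auml;!"

# 1. Runs of anything that is not a space, tab, or newline.
non_space = re.findall(r"[^ \t\n]+", text)

# 2. Runs of ASCII letters, digits, "&" and ";".
extended = re.findall(r"[a-zA-Z0-9&;]+", text)

# 3. Runs of word characters and complete "&name;" entities.
entity_words = re.findall(r"(?:\w+|&\w+;)+", text)

print(non_space)     # ['R&D,', 'k&auml;ytt&auml;j&auml;!']
print(extended)      # ['R&D', 'k&auml;ytt&auml;j&auml;']
print(entity_words)  # ['R', 'D', 'k&auml;ytt&auml;j&auml;']
```

The first pattern keeps trailing punctuation, the second accepts a stray "&" anywhere, and the third only accepts "&" when it starts something shaped like an entity.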

OTHER TIPS

I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:

(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+

This matches, at least once, either

- a word character (including “_”), or
- an HTML entity consisting of
  - the character “&”,
  - followed by either
    - the character “#” and then
      - the character “x” followed by at least one hexadecimal digit, or
      - at least one decimal digit, or
    - at least one letter (= named entity),
  - followed by a semicolon.
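As a rough sketch of how that expression behaves in Python, the snippet below runs it through re.findall. The name ENTITY_WORD and the sample text are hypothetical, and the groups are written as non-capturing, (?:...), which is an adjustment so that findall returns whole matches instead of group contents.

```python
import re

# Same structure as the expression above, with non-capturing groups.
ENTITY_WORD = re.compile(r"(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+")

# Hypothetical sample: a named entity word, a decimal and a hexadecimal
# character reference, and a bare "&" that is not part of any entity.
text = "k&auml;ytt&auml;j&auml; &#228; &#xE4; a&b"

tokens = ENTITY_WORD.findall(text)
print(tokens)
# ['k&auml;ytt&auml;j&auml;', '&#228;', '&#xE4;', 'a', 'b']
```

Note that [a-z]+ only covers lowercase entity names, so an entity such as &Auml; would not be matched as a unit by this pattern as written.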

/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.

You should make a character class that includes the extra characters. For example:

split=re.compile(r'[\w&;]+')

This should do the trick. For your information:

\w (lowercase w) matches word characters (letters, digits, and the underscore).
\W (capital W) is the negated class: it matches any character that is not a word character.
* matches zero or more times and + matches one or more times, so a pattern quantified with * also matches the empty string (even if there are no matching characters there).
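A quick way to see these differences is to try them in the interpreter. The following is only an illustrative sketch; the sample strings and variable names are made up.

```python
import re

sample = "foo_1 bar!"

# \w+ collects runs of word characters (the underscore counts).
words = re.findall(r"\w+", sample)       # ['foo_1', 'bar']

# \W+ collects the runs in between.
separators = re.findall(r"\W+", sample)  # [' ', '!']

# With *, a match succeeds even when nothing matches: it is empty.
star = re.match(r"\w*", "!!!").group()   # ''

# With +, at least one character is required, so there is no match.
plus = re.match(r"\w+", "!!!")           # None

# The character class from the answer keeps entities inside the word.
with_entities = re.findall(r"[\w&;]+", "k&auml;ytt&auml;j&auml; on")
print(with_entities)  # ['k&auml;ytt&auml;j&auml;', 'on']
```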

Looks like this did the trick:

split=re.compile('(\\W+&\\W+;)*')

Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.

Licensed under: CC-BY-SA with attribution
