Looking for a regular expression including alphanumeric + “&” and “;”
Question
Here's the problem:
split=re.compile('\\W*')
This regular expression works fine when dealing with regular words, but there are occasions where I need the expression to include words that contain HTML entities, like käyttäj&auml;.
What should I add to the regex to include the & and ; characters?
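To make the symptom concrete, here is a small sketch of what splitting on non-word characters does to such a word. The sample string is an assumption, and `\W+` is used in place of `\W*` because the way `split` treats empty matches differs between Python versions.

```python
import re

# Hypothetical sample word containing HTML entities.
word = "k&auml;ytt&auml;j&auml;"

# Splitting on runs of non-word characters treats '&' and ';' as separators,
# so every entity is cut out of the word.
pieces = re.compile(r"\W+").split(word)
print(pieces)  # ['k', 'auml', 'ytt', 'auml', 'j', 'auml', '']
```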
Solution
You probably want to approach the problem in reverse, i.e. match every run of characters that is not whitespace:
[^ \t\n]*
Or you want to add the extra characters:
[a-zA-Z0-9&;]*
In case you want to match HTML entities, you should try something like:
(\w+|&\w+;)*
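As a rough sketch, the three patterns above can be tried with `findall`. Two details here are my own adjustments rather than part of the answer: `+` replaces `*` so that no empty matches are returned, and the group is made non-capturing so that `findall` returns the whole token. The sample text is also an assumption.

```python
import re

# Hypothetical sample text containing HTML entities.
text = "k&auml;ytt&auml;j&auml; ja muut"

# Runs of non-whitespace characters.
non_space = re.compile(r"[^ \t\n]+")
# Runs of ASCII alphanumerics plus '&' and ';'.
extra_chars = re.compile(r"[a-zA-Z0-9&;]+")
# Runs of word characters and simple named entities such as '&auml;'.
entities = re.compile(r"(?:\w+|&\w+;)+")

tokens_non_space = non_space.findall(text)
tokens_extra = extra_chars.findall(text)
tokens_entities = entities.findall(text)

# All three give the same tokens for this input.
print(tokens_entities)  # ['k&auml;ytt&auml;j&auml;', 'ja', 'muut']
```

The patterns differ on other inputs: the first keeps punctuation attached to words, while the second accepts a stray `&` or `;` anywhere in a token.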
OTHER TIPS
I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:
(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+
This matches
- either a word character (including “_”), or
- an HTML entity consisting of
  - the character “&”, followed by either
    - the character “#”, followed by either
      - the character “x” followed by at least one hexadecimal digit, or
      - at least one decimal digit,
    - or at least one letter (= named entity),
  - and a semicolon,
- at least once.
/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.
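A minimal sketch of the entity-aware expression in use. The groups are rewritten as non-capturing (my change, so that `findall` returns whole matches), and the sample text is an assumption.

```python
import re

# Same structure as the expression above, with non-capturing groups.
token = re.compile(r"(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+")

# Hypothetical input: a named entity inside a word, a decimal and a
# hexadecimal character reference, and a bare '&' that is not an entity.
text = "k&auml;ytt&auml;j&auml; &#228; &#xE4; a&b"

tokens = token.findall(text)
print(tokens)  # ['k&auml;ytt&auml;j&auml;', '&#228;', '&#xE4;', 'a', 'b']
```

Unlike a plain `[\w&;]+` class, a bare `&` that does not start a well-formed entity is not absorbed into a token. Note that `[a-z]+` only covers lowercase entity names.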
You should make a character class that includes the extra characters. For example:
split=re.compile('[\w&;]+')
This should do the trick. For your information:
- \w (lowercase w) matches word characters (alphanumerics and the underscore);
- \W (capital W) is the negated character class, meaning it matches any non-word character;
- * matches zero or more times and + matches one or more times, so * will match anything (even if there are no characters there).
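The difference between these pieces can be checked with a short sketch; the sample string is an assumption.

```python
import re

text = "foo_bar, baz!"

# \w+ collects runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", text)
# \W+ collects runs of everything else.
non_word_runs = re.findall(r"\W+", text)

# '*' accepts zero characters, so it succeeds with an empty match at the
# start of the string, where '+' finds nothing to match.
star_match = re.match(r"\W*", text)
plus_match = re.match(r"\W+", text)

print(word_runs)           # ['foo_bar', 'baz']
print(non_word_runs)       # [', ', '!']
print(repr(star_match.group()))  # ''
print(plus_match)          # None
```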
Looks like this did the trick:
split=re.compile('(\\W+&\\W+;)*')
Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.
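One possible explanation, assuming the compiled pattern was used through its `split()` method as the variable name suggests: the suggested expressions describe the words themselves, while `split()` removes whatever the pattern matches and returns what lies in between. A hedged sketch with an assumed sample string:

```python
import re

# Hypothetical sample text.
text = "k&auml;ytt&auml;j&auml; on tuttu"

# A pattern that matches the tokens themselves.
token = re.compile(r"[\w&;]+")

# split() discards the matches and keeps only the separators between them.
pieces = token.split(text)
# findall() returns the matches, which is what a tester typically highlights.
words = token.findall(text)

# To keep using split(), match the separators instead (the negated class).
separator = re.compile(r"[^\w&;]+")
words_via_split = separator.split(text)

print(pieces)           # ['', ' ', ' ', '']
print(words)            # ['k&auml;ytt&auml;j&auml;', 'on', 'tuttu']
print(words_via_split)  # ['k&auml;ytt&auml;j&auml;', 'on', 'tuttu']
```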