Question

Here's the problem:

split=re.compile('\\W*')

This regular expression works fine when dealing with regular words, but there are occasions where I need the expression to include words like k&auml;ytt&auml;j&auml;.

What should I add to the regex to include the & and ; characters?


Solution

You probably want to approach the problem in reverse, i.e. match every run of characters that are not whitespace:

[^ \t\n]*

Or you can add the extra characters to the character class:

[a-zA-Z0-9&;]*

In case you want to match HTML entities, you should try something like:

(\w+|&\w+;)*
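A minimal sketch of how that last pattern could be used from Python, assuming the goal is to extract words rather than to split on separators. The names TOKEN and tokenize are made up for illustration. The group is made non-capturing and the outer * is changed to + so that findall returns whole, non-empty tokens:

```python
import re

# Assumption: we want whole tokens back, so the group is non-capturing
# and the outer quantifier is + (never matches the empty string).
TOKEN = re.compile(r'(?:\w+|&\w+;)+')

def tokenize(text):
    """Return runs of word characters and HTML-style entities."""
    return TOKEN.findall(text)

print(tokenize('k&auml;ytt&auml;j&auml; on sana'))
# ['k&auml;ytt&auml;j&auml;', 'on', 'sana']
```

With a capturing group, findall would return only the last repetition of the group instead of the full match.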

OTHER TIPS

I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:

(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+

This matches

  • either a word character (including “_”), or
  • an HTML entity consisting of
    • the character “&”,
      • the character “#”,
        • the character “x” followed by at least one hexadecimal digit, or
        • at least one decimal digit, or
      • at least one letter (= named entity),
    • a semicolon
  • at least once.
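The breakdown above can be tried out with a short sketch like the following. The names WORD and words are hypothetical, and the groups are written as non-capturing (?:...) only so that the whole match is easy to read back; the pattern is otherwise the one described:

```python
import re

# Same structure as the pattern above, with non-capturing groups.
WORD = re.compile(r'(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+')

def words(text):
    """Return every word, treating each HTML entity as part of the word."""
    return [m.group(0) for m in WORD.finditer(text)]

print(words('k&auml;ytt&auml;j&#228; &#xE4;iti & co'))
# ['k&auml;ytt&auml;j&#228;', '&#xE4;iti', 'co']
```

Note that a bare & is skipped, and that [a-z]+ only accepts lowercase entity names, so something like &Auml; would not be treated as an entity by this pattern.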

/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.

You should make a character class that includes the extra characters. For example:

split=re.compile(r'[\w&;]+')

This should do the trick. For your information

  • \w (lower case 'w') matches word characters (alphanumeric characters and the underscore)
  • \W (capital W) is the negated class (it matches any character that is not a word character)
  • * matches 0 or more times and + matches one or more times, so a pattern ending in * can also match the empty string (even if there are no characters there).
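A quick way to see these differences in practice. The sample strings and variable names here are made up for illustration:

```python
import re

text = 'foo, bar_1!'

word_runs = re.findall(r'\w+', text)      # runs of word characters
non_word_runs = re.findall(r'\W+', text)  # runs of everything else

star = re.match(r'\w*', '!!!')   # succeeds, but the match is empty
plus = re.match(r'\w+', '!!!')   # needs at least one character, so no match

print(word_runs)                  # ['foo', 'bar_1']
print(non_word_runs)              # [', ', '!']
print(repr(star.group(0)), plus)  # '' None
```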

Looks like this did the trick:

split=re.compile('(\\W+&\\W+;)*')

Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.
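One possible explanation, assuming the compiled pattern is later used through its .split() method (as the variable name suggests): most of the suggested patterns describe the words to keep, while split expects a pattern describing the separators to throw away. A tester that highlights matches will show the words, but split will cut them out. A small sketch with hypothetical names, showing both approaches giving the same result:

```python
import re

text = 'k&auml;ytt&auml;j&auml; on sana'

# Separator pattern: anything that is NOT a word character, & or ;
delimiter = re.compile(r'[^\w&;]+')
parts = [p for p in delimiter.split(text) if p]  # drop empty strings

# Token pattern: the words themselves, extracted with findall
token = re.compile(r'[\w&;]+')
tokens = token.findall(text)

print(parts)   # ['k&auml;ytt&auml;j&auml;', 'on', 'sana']
print(tokens)  # ['k&auml;ytt&auml;j&auml;', 'on', 'sana']
```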

Licensed under: CC-BY-SA with attribution