C# Regex - How to parse string for Swedish letters åäöÅÄÖ?
21-09-2019
Question
I'm trying to parse an HTML file for strings in this format:
<a href="/userinfo/userinfo.aspx?ID=305157" target="main">MyUsername</a> O22</td>
I want to retrieve "305157", "MyUsername" and the first letter of "O22" (which can be either T, K or O).
I'm using this regex: <a href="/userinfo/userinfo\.aspx\?ID=\d*" target="helgonmain">\w*</a> \w\d\d
and it works fine, as long as there aren't any åäöÅÄÖ characters where the "\w" is.
What should I do?
Solution
Firstly: DON'T USE REGULAR EXPRESSIONS TO PARSE HTML. USE AN HTML PARSER.
Secondly: if you really want to do this (and you don't) then instead of \w you could match any character apart from '<':
<a href="/userinfo/userinfo\.aspx\?ID=\d*" target="helgonmain">[^<]*</a> \w\d\d
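A minimal C# sketch of that pattern with capture groups added, so the three values the question asks for can be read out. The class name, the group names (id, name, kind) and the sample username are made up for illustration. The [TKO] restriction comes from the question's description rather than from the pattern above, and the sample line uses target="helgonmain" so that it matches the pattern as written.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserLinkParser
{
    // The pattern from the answer, with named groups added (names are illustrative).
    // [^<]* accepts any username characters, including åäöÅÄÖ, up to the closing tag.
    private static readonly Regex LinkPattern = new Regex(
        @"<a href=""/userinfo/userinfo\.aspx\?ID=(?<id>\d*)"" target=""helgonmain"">(?<name>[^<]*)</a> (?<kind>[TKO])\d\d");

    // Returns (id, name, kind) for the first match, or null when nothing matches.
    public static Tuple<string, string, char> Parse(string html)
    {
        Match m = LinkPattern.Match(html);
        if (!m.Success)
        {
            return null;
        }
        return Tuple.Create(
            m.Groups["id"].Value,
            m.Groups["name"].Value,
            m.Groups["kind"].Value[0]);
    }

    public static void Main()
    {
        // Hypothetical input line, modelled on the sample in the question.
        string line = "<a href=\"/userinfo/userinfo.aspx?ID=305157\" target=\"helgonmain\">Åsa_Öberg</a> O22</td>";
        Tuple<string, string, char> result = Parse(line);
        Console.WriteLine(result == null
            ? "no match"
            : $"{result.Item1} {result.Item2} {result.Item3}");
    }
}
```

Regex.Matches could be used instead of Match to collect every such link in a file; the caveat about parsing HTML with regular expressions still applies.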
OTHER TIPS
You can use a character class which specifically includes those things:
[\wåäöÅÄÖ]*
Or you can use the Unicode character class for letters:
\p{L}
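or a named Unicode block. In .NET the syntax is \p{IsBlockName}, and åäöÅÄÖ sit in the Latin-1 Supplement block rather than Basic Latin (which only covers ASCII):
\p{IsLatin-1Supplement}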
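A short C# sketch comparing the explicit character class with \p{L} on usernames containing Swedish letters. The class name, the anchored patterns and the sample names are assumptions for illustration. One thing it is meant to show: \p{L} covers letters only, so a username with digits or an underscore needs those added to the class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterClassDemo
{
    // Explicit class from the tip above: word characters plus the six Swedish letters.
    public const string ExplicitClass = @"^[\wåäöÅÄÖ]*$";

    // Unicode category L: letters only, so digits and '_' are not matched.
    public const string AnyLetter = @"^\p{L}*$";

    // Letters plus decimal digits and underscore (an assumption about valid usernames).
    public const string LetterDigitUnderscore = @"^[\p{L}\p{Nd}_]*$";

    // True when the whole input matches the given anchored pattern.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // "Björn" and "Åsa_82", written with escapes to avoid source-encoding surprises.
        foreach (string name in new[] { "Bj\u00F6rn", "\u00C5sa_82" })
        {
            Console.WriteLine(
                "{0}: explicit={1}, letters={2}, letters+digits={3}",
                name,
                Matches(ExplicitClass, name),
                Matches(AnyLetter, name),
                Matches(LetterDigitUnderscore, name));
        }
    }
}
```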
You can use \p{L} to match any 'letter', which will support all letters in all languages, as suggested in this SO question.
Or, you can simply replace \w* with [^<]*, to match all characters that are not the opening of an HTML tag.
But as said by others, parsing HTML using regex is a first step towards insanity...