\w does not match non-ASCII letters by default in Java; it only covers [a-zA-Z_0-9]. To match any Unicode letter in a regex, you can use \p{L}:
String pattern = "\\p{L}+(?=\\<)";
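As a rough sketch of how that pattern behaves, the snippet below collects every run of Unicode letters that is immediately followed by "<". The class name UnicodeWordDemo, the helper wordsBeforeTag and the sample input are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // \p{L}+ matches one or more Unicode letters; (?=\<) is a lookahead
    // requiring a '<' right after the match without consuming it.
    private static final Pattern WORD_BEFORE_TAG = Pattern.compile("\\p{L}+(?=\\<)");

    // Hypothetical helper: returns every letter run directly followed by '<'.
    static List<String> wordsBeforeTag(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_TAG.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute and \u00f6 is o-umlaut, written as escapes so the
        // source compiles the same way regardless of file encoding.
        System.out.println(wordsBeforeTag("h\u00e9llo<b>w\u00f6rld</b>"));
    }
}
```

With \w+ in place of \p{L}+, the first match on the same input would start after the accented letter, because the default \w stops at non-ASCII characters.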
That said, for this type of work I would recommend using an XML parser, since regular expressions are completely unsuitable for parsing HTML/XML, as described in this post.
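For comparison, here is a minimal sketch of the parser route using the JDK's built-in DOM API. It assumes the input is well-formed XML (real-world HTML usually is not, and would need an HTML-aware parser instead); the class name XmlTextDemo, the helper textOf and the element name used in main are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlTextDemo {
    // Hypothetical helper: returns the text content of every element
    // with the given tag name, in document order.
    static List<String> textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><name>h\u00e9llo</name><name>w\u00f6rld</name></root>";
        System.out.println(textOf(xml, "name"));
    }
}
```

The parser handles Unicode text, nesting and attributes on its own, so no character-class workaround is needed.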