Question

How do I convert a string to upper case with String.toUpperCase() while ignoring special characters like &nbsp; and all the others? The problem is that &nbsp; becomes &NBSP;, and the browser does not recognize it as a special HTML character.

I came up with this but it does not cover all special characters:

public static String toUpperCaseIgnoreHtmlSymbols(String str){
    if(str == null) return "";
    str = str.trim();
    str = str.replaceAll("(?i)&nbsp;","&#160;");
    str = str.replaceAll("&quot;","&#34;");
    str = str.replaceAll("&amp;","&#38;");
    //etc.
    str = str.toUpperCase();
    return str;
}

Solution

Are you only interested in skipping HTML entities, or do you also want to skip tags? What about chunks of JavaScript? URLs in links?

If you need to support that kind of stuff, you won't be able to avoid using a 'real' HTML parser instead of a regex. For example, parse the document using jsoup, manipulate the resulting Document, and convert it back to HTML:

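import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Parses the input as HTML, upper-cases the text nodes, and serializes it again.
private String upperCase(String str) {
    Document document = Jsoup.parse(str);
    upperCase(document.body());
    return document.html();
}

// Walks the node tree recursively; only text nodes are changed,
// so tag names and attribute values (such as href) are left alone.
private void upperCase(Node node) {
    if (node instanceof TextNode) {
        TextNode textnode = (TextNode) node;
        textnode.text(textnode.text().toUpperCase());
    }
    for (Node child : node.childNodes()) {
        upperCase(child);
    }
}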

Now:

upperCase("This is some <a href=\"http://arnout.engelen.eu\">text&nbsp;with&nbsp;entities</a>");

will produce:

<html>
  <head></head>
  <body>
    THIS IS SOME 
    <a href="http://arnout.engelen.eu">TEXT&nbsp;WITH&nbsp;ENTITIES</a>
  </body>
</html>

OTHER TIPS

You could split the string into groups with this regex:

(.+?)(&[^ ]+?;)

The first part matches text before the special character, the second part matches the special character.

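Once you have done that, you can convert only the first group to upper case, repeating for every match in the string. Note that any text after the last entity is not covered by a match and has to be upper-cased separately.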
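A minimal sketch of this idea in Java, under a few assumptions: instead of the two-group regex above, it matches only the entities themselves and upper-cases the text between matches, which also handles the text after the last entity. The class name, the method name and the entity pattern (named, decimal and hexadecimal forms) are illustrative choices, not part of the original answer, and the pattern is not a full HTML entity grammar.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityAwareUpperCase {

    // Assumed entity shape: &name; or &#123; or &#x1F;
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    public static String toUpperCaseKeepingEntities(String str) {
        if (str == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(str.length());
        Matcher m = ENTITY.matcher(str);
        int last = 0;
        while (m.find()) {
            // Upper-case the plain text that precedes this entity.
            out.append(str.substring(last, m.start()).toUpperCase(Locale.ROOT));
            // Copy the entity itself unchanged.
            out.append(m.group());
            last = m.end();
        }
        // Upper-case whatever follows the last entity.
        out.append(str.substring(last).toUpperCase(Locale.ROOT));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: TEXT&nbsp;WITH &quot;ENTITIES&quot; &amp; MORE
        System.out.println(
                toUpperCaseKeepingEntities("text&nbsp;with &quot;entities&quot; &amp; more"));
    }
}
```

Like any regex-based approach, this still upper-cases tag names and attribute values if the input contains markup, so it is only suitable for entity-encoded text without tags.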

I think you have the right idea, replacing all named entities with their numeric equivalents.

Here's the W3C's list of entities for HTML4: http://www.w3.org/TR/html4/sgml/entities.html

You could format that into a single two-column table without too much work. (Note that there are three tables at that link.) I'd do that, then read the table in, and you can easily convert from named to numeric and back.
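As a rough sketch of the table-driven approach, the following Java code replaces named entities with their decimal numeric equivalents before upper-casing. The class name and the CODE_POINTS map are hypothetical, and the map holds only a small hand-picked subset of the HTML4 list; a real version would load the full table, for example from a file.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedToNumericEntities {

    // Hypothetical subset of the HTML4 entity table (name -> code point).
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("nbsp", 160);
        CODE_POINTS.put("quot", 34);
        CODE_POINTS.put("amp", 38);
        CODE_POINTS.put("lt", 60);
        CODE_POINTS.put("gt", 62);
        CODE_POINTS.put("copy", 169);
    }

    // Entity names are case-sensitive, so the lookup is too.
    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String namedToNumeric(String str) {
        Matcher m = NAMED.matcher(str);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Names missing from the table are left untouched.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static String toUpperCaseIgnoreHtmlSymbols(String str) {
        if (str == null) {
            return "";
        }
        // Decimal references such as &#160; contain no letters, so upper-casing is safe.
        return namedToNumeric(str.trim()).toUpperCase(Locale.ROOT);
    }
}
```

With this table, toUpperCaseIgnoreHtmlSymbols("text&nbsp;with &quot;entities&quot;") returns TEXT&#160;WITH &#34;ENTITIES&#34;. Any named entity missing from the table would still be upper-cased and broken, which is why the full list matters.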

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow