Cleaning a String of HTML code and accents with Java

https://stackoverflow.com/questions/20449401

30-08-2022

Question

I need to strip accents and HTML accent entities from an HTML string. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.

This file contains words like Postulación, Ayudantías, Gestión and Árbol.

I found many snippets that use text.normalize and regular expressions to clean the String. They work well with short strings, but I am working with very long strings, and the same code does not work on them.

I am really lost here and I need help please!

This is the code I tried that didn't work:

Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the String)

I also used replaceAll to remove the HTML accent entities, but that is not working either:

string=string.replaceAll("&aacute;","a");
string=string.replaceAll("&eacute;","e");
string=string.replaceAll("&iacute;","i");
string=string.replaceAll("&oacute;","o");
string=string.replaceAll("&uacute;","u");
string=string.replaceAll("&ntilde;","n");

Edit: never mind, the replaceAll is working; I had written it wrong ("/&aacute; instead of "&aacute;).

Any help or ideas?

Solution

I think several options would work. I would suggest first using StringEscapeUtils.unescapeHtml4(String) from Apache Commons Lang to unescape your HTML entities, that is, to convert them into ordinary Unicode characters in a Java String. Then you could use an ASCIIFoldingFilter from Apache Lucene to fold those characters to their ASCII equivalents.
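As a rough illustration of those two steps (unescape, then fold to ASCII), here is a minimal sketch that uses only the JDK. It is not the library approach suggested above: the small hand-written entity map stands in for StringEscapeUtils.unescapeHtml4, and java.text.Normalizer stands in for ASCIIFoldingFilter. The class name AccentCleaner and its method names are made up for this example, and the entity table covers only a handful of entities.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class AccentCleaner {

    // Tiny illustrative entity table (an assumption for this sketch);
    // a real decoder such as StringEscapeUtils covers far more entities.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&aacute;", "\u00e1");
        ENTITIES.put("&eacute;", "\u00e9");
        ENTITIES.put("&iacute;", "\u00ed");
        ENTITIES.put("&oacute;", "\u00f3");
        ENTITIES.put("&uacute;", "\u00fa");
        ENTITIES.put("&ntilde;", "\u00f1");
        ENTITIES.put("&Aacute;", "\u00c1");
    }

    // Step 1: turn the named entities above into real characters.
    public static String unescapeKnownEntities(String html) {
        String result = html;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            // String.replace does a literal (non-regex) replacement.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Step 2: decompose accented letters (NFD) and drop the combining marks.
    public static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static String clean(String html) {
        return stripAccents(unescapeKnownEntities(html));
    }

    public static void main(String[] args) {
        // Accented literals are written as \\u escapes so the source file
        // encoding cannot turn them into "?" at compile time.
        System.out.println(clean("Postulaci&oacute;n Ayudant\u00edas, Gesti&oacute;n, &Aacute;rbol"));
        // prints: Postulacion Ayudantias, Gestion, Arbol
    }
}
```

Neither step depends on the length of the input, so if a long file still produces "?" characters, the encoding used to read or write the file is a likely place to look.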

Other suggestions

You need to distinguish between a whole HTML document, containing tags and so forth, and a plain string containing HTML-encoded data.

If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a Stack Overflow answer, since you basically need an HTML parser to navigate the data.

However, if you're just dealing with a string that's HTML-encoded, then you first need to decode it. There are lots of utilities to do so, such as the StringEscapeUtils class in the Apache Commons Lang library. See this question for an example.

Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex-encoded references (such as &#xE1; for á), and you would end up having to build a huge table to cover all the possible HTML entities.
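To make the point about hex-encoded items concrete, the following is a minimal JDK-only sketch that decodes numeric character references, both decimal (&#225;) and hexadecimal (&#xE1;), and then walks the result code point by code point, keeping only ASCII. The class name NumericEntityCleaner and its methods are hypothetical, and named entities such as &aacute; are deliberately left to a library decoder.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityCleaner {

    // Matches decimal (&#225;) and hexadecimal (&#xE1;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    public static String decodeNumericRefs(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Walk the decoded text code point by code point, keeping only ASCII.
    public static String toAscii(String text) {
        // NFD splits an accented letter into base letter + combining mark,
        // so the base letter survives the ASCII filter below.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder ascii = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                  .filter(cp -> cp < 128)
                  .forEach(ascii::appendCodePoint);
        return ascii.toString();
    }
}
```

For example, NumericEntityCleaner.toAscii(NumericEntityCleaner.decodeNumericRefs("Gesti&#xF3;n &#193;rbol")) yields "Gestion Arbol". Note that this filter simply drops characters with no ASCII base letter (ß, for instance), which may or may not be what you want.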

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow