Remove HTML tags from a String

https://stackoverflow.com/questions/240546

04-07-2019

Question

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","")

will work, but things like &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (that is, whatever the .*? in the regex matches will disappear).
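The two failure modes described in the question can be reproduced with the JDK alone. The following is only a sketch: the class name NaiveStripDemo and the sample strings are made up for illustration.

```java
// Illustration only: shows what the naive regex from the question does.
class NaiveStripDemo {
    static String naiveStrip(String s) {
        return s.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Real markup is removed as intended.
        System.out.println(naiveStrip("<b>bold</b>"));
        // Non-HTML text between two angle brackets is swallowed as well.
        System.out.println(naiveStrip("a < b and c > d"));
        // Entities are left untouched, so the result still contains "&amp;".
        System.out.println(naiveStrip("Tom &amp; Jerry"));
    }
}
```

The second call loses everything between the comparison operators, and the third leaves the entity undecoded, which are the two problems the question raises.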

Solution

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only, for example, <b>, <i> and <u>.

Other suggestions

If you're writing for Android you can do this...

android.text.Html.fromHtml(instruction).toString()

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-than signs and HTML-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into problems if the user enters something malformed, like <=>.

You can also check out JTidy, which will parse "dirty" HTML input and should give you a way to remove the tags while keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure you encode any remaining HTML special characters to keep your output safe.
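As a rough illustration of that last step, here is a minimal sketch of escaping the five characters that matter most in HTML text and attribute contexts. The class and method names are hypothetical, and a maintained library is preferable in real code because escaping is context-dependent.

```java
// Minimal sketch: HTML-escape whatever text remains after stripping tags.
// Covers only &, <, >, " and '; real code should prefer a maintained library.
class HtmlEscapeSketch {
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A leftover, half-stripped fragment is rendered harmless by escaping it.
        System.out.println(escapeHtml("<script x=\"1\"> & 'more'"));
    }
}
```

Because every input character is examined exactly once, an ampersand produced by the escaping itself is never escaped a second time.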

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

ref: Remove HTML tags from a file to extract only the TEXT

I think the simplest way to filter the HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

It is also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

On Android, try this:

String result = Html.fromHtml(html).toString();

HTML escaping is really hard to do right. I'd definitely suggest using library code for it, as it is a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

The accepted answer of simply doing Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

It removes line breaks from the text
It converts the text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is there because I need to use the output as plain text. If you only need HTML output, you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

You might want to replace <br/> and </p> tags with newlines before stripping the HTML, to prevent it from becoming an illegible mess, as Tim suggests.
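A minimal sketch of that idea with plain regular expressions follows. The helper name stripKeepingBreaks is hypothetical, and the tag-stripping regex is the simple one from earlier in the thread, so the same caveats about malformed input apply.

```java
// Sketch: turn <br> and </p> into newlines first, then drop every remaining tag.
class LineBreakPreservingStrip {
    static String stripKeepingBreaks(String html) {
        String withBreaks = html
                .replaceAll("(?i)<br\\s*/?>", "\n")   // <br>, <br/>, <br />
                .replaceAll("(?i)</p\\s*>", "\n");    // closing paragraph tags
        // Remove whatever tags are left and trim the trailing newline.
        return withBreaks.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripKeepingBreaks("<p>Line one</p><p>Line two<br/>Line three</p>"));
    }
}
```

The order matters: the newline substitution has to run before the generic tag removal, otherwise the break information is already gone.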

The only way I can think of to remove HTML tags while leaving non-HTML between angle brackets would be to check against a list of HTML tags. Something along these lines...

replaceAll("\\<[\\s]*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered sanitized.
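One way to read the "list of HTML tags" idea is a single pattern whose alternation names the tags to strip. The sketch below assumes a deliberately short tag list and a hypothetical class name; it is not a sanitizer, for the same reason given above.

```java
import java.util.regex.Pattern;

// Sketch: strip only tags whose name appears in a known list, leave other <...> text alone.
class KnownTagStripper {
    // Assumed, deliberately short list of tag names; extend it as needed.
    private static final Pattern KNOWN_TAGS = Pattern.compile(
            "</?\\s*(?:a|b|br|div|i|p|span|u)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripKnownTags(String s) {
        return KNOWN_TAGS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // <y> is not in the list, so it survives; <b> and <i> are removed.
        System.out.println(stripKnownTags("x <y> z and <b>bold</b> <i>it</i>"));
    }
}
```

The word boundary after the tag name keeps a name like "b" from matching the start of an unlisted tag such as "<body>".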

This should work -

use this

  text.replaceAll("<.*?>", " ")  -> This will replace all the HTML tags with a space.

and this

  text.replaceAll("&.*?;", "")  -> This will replace all the entities that start with "&" and end with ";", such as &nbsp;, &amp;, &gt; etc.
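Deleting entities outright also deletes the characters they stand for. As a hedged alternative, the sketch below decodes a handful of named entities instead. The class name is hypothetical, the entity list is far from complete, and mapping &nbsp; to an ordinary space is a simplification.

```java
// Sketch: decode a few common named entities instead of deleting them.
class EntityDecodeSketch {
    static String decodeBasicEntities(String s) {
        return s.replace("&nbsp;", " ")      // simplification: plain space, not U+00A0
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");      // last, so "&amp;lt;" is decoded only once
    }

    public static void main(String[] args) {
        String input = "Tom &amp; Jerry";
        // Deleting the entity loses the ampersand entirely.
        System.out.println(input.replaceAll("&.*?;", ""));
        // Decoding keeps it.
        System.out.println(decodeBasicEntities(input));
    }
}
```

Decoded output may contain < and > again, so it should be treated as plain text and escaped before being written back into HTML.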

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
 */
public class Html2Text2 implements ContentHandler {
    private StringBuffer sb;

    public Html2Text2() {
    }

    public void parse(String str) throws IOException, SAXException {
        XMLReader reader = new Parser();
        reader.setContentHandler(this);
        sb = new StringBuffer();
        reader.parse(new InputSource(new StringReader(str)));
    }

    public String getText() {
        return sb.toString();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (int idx = 0; idx < length; idx++) {
            sb.append(ch[idx + start]);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        // append only the reported range, not the whole buffer
        sb.append(ch, start, length);
    }

    // The methods below do not contribute to the text

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
    }

    @Override
    public void setDocumentLocator(Locator locator) {
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
    }

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
    }
}

I know this is old, but I was just working on a project that required me to filter HTML, and this worked fine:

noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");

Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger.getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;
    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1).equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("* ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append(" ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?"
                + "<p>this is a<br />mutiline paragraph</p>"
                + "<ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list <p>with</p> "
                + "<ul> <li>another</li> <li>list "
                + "<dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd> <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl> "
                + "<dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd> <dt>cvbcv</dt> <dt></dt> </dl> "
                + "<dl> <dt></dt> </dl></li> <li>cool</li> </ul> <p>stuff</p> </li> <li>cool</li></ol>"
                + "<p></p></body></html>";
        System.out.println(convert(html));
    }
}

Alternatively, you can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

Use Html.fromHtml

The HTML tags are

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>, <div align="…">, <em>, <font size="…" color="…" face="…">, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <i>, <p>, <small>, <strike>, <strong>, <sub>, <sup>, <tt>, <u>

According to Android's official documentation, any tags in the HTML will display as a generic replacement String which your program can then go through and replace with real strings.
The Html.fromHtml method takes an Html.TagHandler and an Html.ImageGetter as arguments, as well as the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

Another way is to use the com.google.gdata.util.common.html.HtmlToText class, like

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries I also get style information. However, I believe it would be effective for small, simple jobs.

It sounds like you want to go from HTML to plain text.
If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags out of the HTML file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

Here is another way to do it:

public static String removeHTML(String input) {
    StringBuilder s = new StringBuilder();
    boolean inTag = false;
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c == '<') {
            // everything from here to the next '>' is treated as a tag
            inTag = true;
        } else if (c == '>' && inTag) {
            inTag = false;
        } else if (!inTag) {
            s.append(c);
        }
    }
    return s.toString();
}

Here is one more variant of how to replace everything (HTML tags | HTML entities | runs of spaces in the HTML content)

content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.

You could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desirable in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());

One way to retain new-line information with JSoup is to precede all new-line tags with a dummy string, run JSoup, and then replace the dummy string with "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag : new String[]{"</p>", "<br/>", "</h1>", "</h2>", "</h3>", "</h4>", "</h5>", "</h6>", "</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK + tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

You can simply use Android's default HTML filter

public String htmlToStringFilter(String textToFilter) {
    return Html.fromHtml(textToFilter).toString();
}

The method above returns the HTML-filtered string for your input.

My 5 cents:

// replaces every "&amp;" with a plain "&"
String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {
    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}

To get formatted plain HTML text you can do this:

String BR_ESCAPED = "&lt;br/&gt;";
Elements el = Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED + BR_ESCAPED);
el.select("h1").append(BR_ESCAPED + BR_ESCAPED);
el.select("h2").append(BR_ESCAPED + BR_ESCAPED);
el.select("h3").append(BR_ESCAPED + BR_ESCAPED);
el.select("h4").append(BR_ESCAPED + BR_ESCAPED);
el.select("h5").append(BR_ESCAPED + BR_ESCAPED);
String nodeValue = el.text();
nodeValue = nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue = nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, change <br/> to \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");

classString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()

You can simply make a method with multiple replaceAll() calls, like

String RemoveTag(String html) {
    html = html.replaceAll("\\<.*?>", "");
    html = html.replaceAll("&nbsp;", "");
    html = html.replaceAll("&amp;", "");
    // ----------
    // ----------
    return html;
}

Use this link for the most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I use this method first to remove the junk, but not the very first line, i.e. replaceAll("\\<.*?>", ""). Later I use specific keywords to search for indexes, and then use the .substring(start, end) method to strip away the unnecessary parts. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.
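The keyword-and-substring step might look roughly like the sketch below. The class name, the between helper and the sample markup are all hypothetical; the original answer does not show its own code for this step.

```java
// Sketch: strip tags first, then cut out the text between two known keywords.
class KeywordExtractSketch {
    static String stripTags(String html) {
        // Replace each tag with a space so neighbouring words do not run together.
        return html.replaceAll("<.*?>", " ");
    }

    // Returns the trimmed text between the first startKey and the next endKey,
    // or an empty string if either keyword is missing.
    static String between(String text, String startKey, String endKey) {
        int start = text.indexOf(startKey);
        if (start < 0) {
            return "";
        }
        start += startKey.length();
        int end = text.indexOf(endKey, start);
        if (end < 0) {
            return "";
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Report</h1><p>Total: 42 items</p></body></html>";
        System.out.println(between(stripTags(html), "Total:", "items"));
    }
}
```

Checking both indexOf results before calling substring avoids a StringIndexOutOfBoundsException when a page does not contain the expected keywords.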

Remove HTML tags from a string. Sometimes we need to parse a string received in a response, such as an HttpResponse from the server.

So we need to parse it.

Here I will show how to remove HTML tags from a string.

// sample text with tags
string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";

// regex which matches tags
System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");

// replace all matches with an empty string
str = rx.Replace(str, "");

// now str contains the string without HTML tags

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow