Formatting text output of HTML with jsoup

https://stackoverflow.com/questions/19644633

01-07-2022

Problem

I have a document that contains HTML, and I want to convert it from HTML to plain text while keeping some formatting.

Example extract

<p>My simple paragragh</p>
<p>My paragragh with <a>Link</a></p>
<p>My paragragh with an <img/></p>

I can handle the simple case quite easily (though perhaps not efficiently) with:

StringBuilder sb = new StringBuilder();

for(Element element : doc.getAllElements()){
    if(element.tag().getName().equals("p")){
        sb.append(element.text());
        sb.append("\n\n");
    }
}

Is it possible to insert output for an inline element in the correct place, and if so, how? For example:

<p>My paragragh with <a>Link</a> in the middle</p>

would become:

My paragragh with (Location: http://mylink.com) in the middle

Solution

You can replace each link-tag with a TextNode:

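import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final String html = "<p>My simple paragragh</p>\n"
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";

Document doc = Jsoup.parse(html);

// Select all link-tags and replace them with TextNodes.
// Older jsoup releases need the two-argument form: new TextNode(text, "")
for( Element element : doc.select("a") )
{
    element.replaceWith(new TextNode("(Location: http://mylink.com)"));
}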


StringBuilder sb = new StringBuilder();

// Format as needed
for( Element element : doc.select("*") )
{
    // An alternative to the 'if'-statement
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}

System.out.println(sb);
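
The snippet above inserts the same fixed text for every link. A minimal sketch of a variant that reads the real target from each link's href attribute, and treats images the same way, could look like the following. The class name HtmlToText, the helper toPlainText, the "(Image: ...)" placeholder format, and the href/src values in the sample input are assumptions made for illustration and are not part of the original question. It also assumes a jsoup version that provides the single-argument TextNode constructor.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class HtmlToText {

    // Hypothetical helper: returns the text of every <p>, each followed by a blank line.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);

        // Replace each link that has an href with a text node built from that href.
        // Links without an href are left alone, so their own text is kept.
        for (Element link : doc.select("a[href]")) {
            link.replaceWith(new TextNode("(Location: " + link.attr("href") + ")"));
        }

        // Same idea for images; the "(Image: ...)" format is an assumption.
        for (Element img : doc.select("img[src]")) {
            img.replaceWith(new TextNode("(Image: " + img.attr("src") + ")"));
        }

        StringBuilder sb = new StringBuilder();
        // Selecting "p" directly avoids walking every element in the document.
        for (Element p : doc.select("p")) {
            sb.append(p.text()).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input with assumed href and src values.
        String html = "<p>My simple paragragh</p>\n"
                + "<p>My paragragh with <a href=\"http://mylink.com\">Link</a> in the middle</p>\n"
                + "<p>My paragragh with an <img src=\"pic.png\"/></p>";
        System.out.print(toPlainText(html));
    }
}
```

Because the replacement happens in the parsed tree before text() is called, the inserted text ends up at the position the inline element occupied, so the second paragraph comes out as "My paragragh with (Location: http://mylink.com) in the middle".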

License: CC-BY-SA with attribution

Not affiliated with Stack Overflow