How to get properly formatted text from HTML when tags don't have line breaks

https://stackoverflow.com/questions/21991542

15-10-2022

Question

I am trying to parse this sample HTML file with the Jsoup HTML parsing library.

 <html>
 <body>
 <p> this is sample text</p>
 <h1>this is heading sample</h1>
 <select name="car" size="1">
 <option  value="Ford">Ford</option><option  value="Chevy">Chevy</option><option selected value="Subaru">Subaru</option>
 </select>
 <p>this is second sample text</p>
 </body>
 </html>

When I extract only the text, I get the following:

this is sample text this is heading sample FordChevySubaru this is second sample text

There are no spaces or line breaks between the texts of the option tags.

Whereas if the HTML had been like this:

<html>
 <body>
 <p> this is sample text</p>
 <h1>this is heading sample</h1>
 <select name="car" size="1">
 <option  value="Ford">Ford</option>
 <option  value="Chevy">Chevy</option>
 <option selected value="Subaru">Subaru</option>
 </select>
 <p>this is second sample text</p>
 </body>
 </html>

then in this case the text looks like this:

this is sample text this is heading sample Ford Chevy Subaru this is second sample text

with proper spaces between the option texts. How do I get the second output from the first HTML file? That is, when there are no line breaks between the tags, how can I keep the strings from being concatenated?

I am using the following code in Java.

public static String extractText(File file) throws IOException {
    Document document = Jsoup.parse(file, null);
    Element body = document.body();
    String textOnly = body.text();
    return textOnly;
}

Solution

I think the only solution that meets your requirements is to traverse the DOM and collect the text nodes:

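import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public static String extractText(File file) throws IOException {
    StringBuilder sb = new StringBuilder();
    Document document = Jsoup.parse(file, null);
    // Visit every element and collect only its own (direct) text nodes.
    for (Element e : document.getAllElements()) {
        for (TextNode t : e.textNodes()) {
            String s = t.text().trim();
            if (!s.isEmpty()) { // skip whitespace-only nodes
                sb.append(s).append(' ');
            }
        }
    }
    return sb.toString().trim(); // drop the trailing separator
}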
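
As a different approach from the traversal above, it may be enough to pad the markup before it reaches the parser, so that two tags that touch (such as `</option><option ...>`) are separated by a space, just as they are separated by a line break in the second HTML sample in the question. The sketch below uses only the Java standard library; `TagSpacer` and `separateAdjacentTags` are hypothetical names made up for illustration and are not part of Jsoup. It is a rough sketch under assumptions: it also adds a space between touching inline tags such as `</b><i>`, and it does not special-case a `>` directly followed by `<` inside scripts or attribute values.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jsoup): separates touching tags with a space. */
final class TagSpacer {
    // Matches a '>' that is immediately followed by '<', i.e. two tags with nothing between them.
    private static final Pattern ADJACENT_TAGS = Pattern.compile(">(?=<)");

    private TagSpacer() {
        // static helper, not meant to be instantiated
    }

    /** Returns the markup with a single space inserted between directly adjacent tags. */
    static String separateAdjacentTags(String html) {
        return ADJACENT_TAGS.matcher(html).replaceAll("> ");
    }
}
```

The padded string can then be handed to the parser, for example `Jsoup.parse(TagSpacer.separateAdjacentTags(html)).body().text()`, where `html` is the file content read into a string. Based on the behaviour described in the question for the line-broken HTML, the option texts should then come out separated by spaces.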

Hope it helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow