I am trying to parse this sample HTML file with the jsoup HTML parsing library:
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option><option value="Chevy">Chevy</option><option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
When I extract only the text, I get the following:
this is sample text this is heading sample FordChevySubaru this is second sample text
There are no spaces or line breaks between the texts of the option tags.
Whereas if the HTML had been like this:
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option>
<option value="Chevy">Chevy</option>
<option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
then the extracted text is:
this is sample text this is heading sample Ford Chevy Subaru this is second sample text
with proper spaces between the texts of the option tags. How can I get the second output from the first HTML file? That is, when there are no line breaks between the tags, how can I keep the text of adjacent elements from being concatenated?
I am using the following code in Java.
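import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static String extractText(File file) throws IOException {
    // null charset: jsoup reads it from a meta tag or falls back to UTF-8
    Document document = Jsoup.parse(file, null);
    Element body = document.body();
    return body.text();
}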
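One workaround I can think of, offered only as a minimal sketch and not a confirmed fix, is to pre-process the raw markup so that directly adjacent tags are separated by a space before the string is handed to jsoup. Since the second sample above shows that whitespace between the option tags ends up as a single space in the extracted text, this should make the first file behave like the second. The class name TagSpacer and the method separateAdjacentTags are hypothetical names chosen for illustration, and the helper uses only the Java standard library:

```java
import java.util.regex.Pattern;

public class TagSpacer {
    // Matches a '>' that is immediately followed by '<',
    // i.e. two tags with nothing between them.
    private static final Pattern ADJACENT_TAGS = Pattern.compile(">(?=<)");

    /** Returns the markup with one space inserted between directly adjacent tags. */
    public static String separateAdjacentTags(String html) {
        return ADJACENT_TAGS.matcher(html).replaceAll("> ");
    }

    public static void main(String[] args) {
        String options = "<option value=\"Ford\">Ford</option>"
                + "<option value=\"Chevy\">Chevy</option>";
        // A space now separates the closing and the opening option tag.
        System.out.println(separateAdjacentTags(options));
    }
}
```

The processed string could then be passed to Jsoup.parse(String) and the text extracted with body().text() as before, which means reading the file into a string first and choosing its charset yourself. This is a blunt approach: it also puts a space between adjacent inline elements such as <b>a</b><i>b</i>, and it would alter any "><" sequence that appears inside a script or an attribute value, so it may not suit every document.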