Question

I want to use Jsoup to extract all text from an HTML page and return it as a single string without any HTML. The code I am using half works, but it joins the text of adjacent elements together, which breaks my keyword searches against the string.

This is the Java code I am using:

String resultText = scrapePage(htmldoc);

private String scrapePage(Document doc) {
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}

Run against the following HTML:

<html>
  <body>
    <h1>Title</h1>
    <p>here is para1</p>
    <p>here is para2</p>
  </body>
</html>

Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".

I don't want to split the document into more elements than necessary (for example, collecting the text of every h1, p, and so on), as there is such a wide range of tags I could be matching.

For example, data1data2 would come from:

<td>data1</td><td>data2</td>

Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
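To illustrate the whitespace clean-up and keyword check described above, here is a minimal sketch that uses only the Java standard library. The class and method names (KeywordText, normalize, containsWord) are hypothetical and not part of jsoup, and the sample strings are taken from the example in the question. It assumes keywords begin and end with letters or digits, since the check relies on regex word boundaries.

```java
import java.util.regex.Pattern;

public class KeywordText {
    // Collapse every run of whitespace (spaces, tabs, line breaks) to one space.
    public static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // True only if keyword occurs as a whole word, so "para1" does not match "para1here".
    // Assumption: keyword starts and ends with a word character (letter, digit, underscore).
    public static boolean containsWord(String text, String keyword) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String joined = "Titlehere is para1here is para2";
        String spaced = normalize("Title \n here is para1\n\n   here is para2 ");
        System.out.println(spaced);
        System.out.println(containsWord(joined, "para1"));
        System.out.println(containsWord(spaced, "para1"));
    }
}
```

The joined string fails the whole-word check for "para1", while the space-separated string passes, which is why a separator between elements matters for this kind of search.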


Solution

I don't have this issue using jsoup 1.7.3.

Here's the full code I used for testing:

final String html = "<html>\n"
        + "  <body>\n"
        + "    <h1>Title</h1>\n"
        + "    <p>here is para1</p>\n"
        + "    <p>here is para2</p>\n"
        + "  </body>\n"
        + "</html>";

Document doc = Jsoup.parse(html);

Element element = doc.select("html").first();

System.out.println(element.text());

And the output:

Title here is para1 here is para2

Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.

OTHER TIPS

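The previous answer is not right: it only works thanks to the "\n" line endings added to each line of the HTML string, but in reality you may not have a line break at the end of each HTML line. The following collects the text nodes of every element and joins them with a space instead: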

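import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

void example2text() throws IOException {
    String url = "http://www.example.com/";

    // Read the whole response body into one string ("\\A" matches the start of input).
    String out;
    try (Scanner scanner = new Scanner(new URL(url).openStream(), "UTF-8")) {
        out = scanner.useDelimiter("\\A").next();
    }
    Document doc = Jsoup.parse(out);

    // Append each element's own text nodes, separated by a single space.
    StringBuilder text = new StringBuilder();
    for (Element tag : doc.select("*")) {
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (!tagText.isEmpty()) {
                text.append(tagText).append(' ');
            }
        }
    }
    System.out.println(text.toString().trim());
}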

Based on this answer: https://stackoverflow.com/a/35798214/4642669

Licensed under: CC-BY-SA with attribution