Question

I am trying to extract the text and links from an html file. At the moment i can extract both easily using JSoup but i can only do it seperately.

Here is my code:

try {
          doc = (Document) Jsoup.parse(new File(input), "UTF-8");
          Elements paragraphs = ((Element) doc).select("td.text");

          for(Element p : paragraphs){
           // System.out.println(p.text()+ "\r\n" + "***********************************************************" + "\r\n");
            getGui().setTextVers(p.text()+ "\r\n" + "***********************************************************" + "\r\n");

          }
          Elements links = doc.getElementsByTag("a");
          for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            getGui().setTextVers("\n\n"+link.text() + ">\r\n" +linkHref + "\r\n");
          }
}

I have placed a .text class on the outer most td where there is text. what i would like to achieve is: When the program finds a td with the .text class it checks it for any links and extracts them from that section in order. So you would have:

Text

Link

Text

Link

I tried putting an inner for each loop into the first foreach loop but this only printed the full list of links for the page, can anyone help?

Was it helpful?

Solution

Try

Document doc = (Document) Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = ((Element) doc).select("td.text");

for (Element p : paragraphs) {
    System.out.println(p.text());
    Elements links =  p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top