Question

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using Jsoup, but only separately.

Here is my code:

try {
    // With org.jsoup.nodes.Document imported, no casts are needed here
    doc = Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = doc.select("td.text");

    // First pass: the text of every td.text cell
    for (Element p : paragraphs) {
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }

    // Second pass: every link in the whole document, regardless of cell
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
} catch (IOException e) {
    // Jsoup.parse(File, String) throws IOException if the file cannot be read
    e.printStackTrace();
}

I have placed a .text class on the outermost td wherever there is text. What I would like to achieve is this: when the program finds a td with the .text class, it checks that cell for links and extracts them from that section, in order. So you would have:

Text

Link

Text

Link

I tried nesting a second for-each loop inside the first one, but that only printed the full list of links for the page. Can anyone help?


Solution

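Try calling getElementsByTag("a") on each td.text element instead of on the whole document, so that only the links inside that cell are returned: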

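// Uses org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
// Jsoup.parse(File, String) throws IOException, so call this from a method that handles it
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");

for (Element p : paragraphs) {
    // Text of this cell
    System.out.println(p.text());
    // Links inside this cell only (p, not doc)
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}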
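
For anyone who wants to try this without an input file, here is a minimal self-contained sketch of the same idea. The class name CellLinks, the helper extract, the "TEXT:"/"LINK:" prefixes and the sample HTML are made up for illustration; only the Jsoup calls (Jsoup.parse, select, getElementsByTag, text, attr) come from the library, and it assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class CellLinks {

    // Hypothetical helper: for each td.text cell, emits the cell text
    // followed by the links found inside that same cell.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element cell : doc.select("td.text")) {
            out.add("TEXT: " + cell.text());
            // Scoped to this cell; doc.getElementsByTag("a") would
            // return every link on the page instead.
            for (Element link : cell.getElementsByTag("a")) {
                out.add("LINK: " + link.text() + " > " + link.attr("href"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample markup with two td.text cells, one link each
        String html = "<table><tr>"
                + "<td class=\"text\">First <a href=\"http://example.com/a\">A</a></td>"
                + "<td class=\"text\">Second <a href=\"http://example.com/b\">B</a></td>"
                + "</tr></table>";
        extract(html).forEach(System.out::println);
    }
}
```

Note that text() returns the combined text of the element and all of its children, so the anchor text of each link also appears in the cell's text line before it is listed again as a link.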
License: CC-BY-SA with attribution
Not affiliated with StackOverflow