Question

Here is the html I'm trying to parse:

<div class="entry">
    <img src="http://www.example.com/image.jpg" alt="Image Title">
    <p>Here is some text</p>
    <p>Here is some more text</p>
</div>

I want to get the text within the <p> elements into one ArrayList. I've tried using Jsoup for this.

Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");

for (Element desc : descs) {
    String text = desc.getElementsByTag("p").first().text();
    myArrayList.add(text);
}

But this doesn't work at all. I'm quite new to Jsoup but it seems it has its limitations. If I can get the text within <p> into one ArrayList using Jsoup, how can I accomplish that? If I must use some other means to parse the html, let me know.

I'm using a BufferedReader to read the html file one line at a time.


Solution

You could change your approach to the following:

Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");

for (Element pElem : pElems) {
    // text() returns the visible text; data() only covers <script>/<style> contents
    myArrayList.add(pElem.text());
}

OTHER TIPS

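Not sure why you are reading the HTML line by line. If each line is passed to Jsoup.parse on its own, the <div class="entry"> line and the <p> lines end up in separate documents, so desc.getElementsByTag("p") finds nothing and first() returns null. If you parse the whole HTML in one go, the code below works: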

String line = "<div class=\"entry\">" + 
                "<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" + 
                "<p>Here is some text</p>" + 
                "<p>Here is some more text</p>" + 
              "</div>";

Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");

List<String> myArrayList = new ArrayList<String>();

for (Element desc : descs) {
    Elements paragraphs = desc.getElementsByTag("p");
    for (Element paragraph : paragraphs) {
        myArrayList.add(paragraph.text());
    }
}
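
If the HTML lives in a file, one way to get it into a single string before parsing is sketched below. It uses only the standard library; the class name HtmlFileReader and the method readAll are made up for illustration, and UTF-8 is an assumption about the file's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class HtmlFileReader {

    // Reads every line of the file and joins them into one string,
    // so the parser later sees the <div> and its <p> children together.
    // (Hypothetical helper; UTF-8 is assumed.)
    static String readAll(String path) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java HtmlFileReader <file.html>");
            return;
        }
        String html = readAll(args[0]);
        // Hand the complete string to the parser once, for example:
        // Document doc = Jsoup.parse(html);
        System.out.println(html.length() + " characters read");
    }
}
```

The returned string can then take the place of line in the Jsoup.parse call above, so the document is parsed once instead of once per line.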

In your for-loop, select every <p> instead of only the first one, then iterate over the result:

Elements ps = desc.select("p");

(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))

Try this (it returns the text of the first <p> only):

Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();
Licensed under: CC-BY-SA with attribution