Using Java 6 and Jsoup 1.7.3, how can I parse this HTML where sibling text is not inside an element?

https://stackoverflow.com/questions/23461151

java
jsoup

15-07-2023

Question

Mainly, my question is how can I parse ...

<p>some text<br />
<br />
<strong>categorized: </strong>like this<br />
<br /></p>

... where I ultimately want key/value pairs such as "categorized" and "like this", using Java and Jsoup? My idea is to treat the <strong> tag as a delimiter that marks the key, and then grab the text that follows it as the value. Inconveniently, that text is not enclosed in a tag.

I think the challenge is that the "like this" part is not inside an element. It is a sibling node of the <strong> element, but CSS selectors only match elements, so I can't select it with Jsoup. I am not clear on how Node and Element relate in Jsoup, or how I can get both the element text "categorized" and its sibling text "like this" in a single call.
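For readers following along, here is a minimal sketch of how Jsoup exposes that distinction, using the short <p> snippet above. It assumes the HTML is parsed from a string rather than fetched, and the class name NodeVsElement and the helper describe are hypothetical names chosen for illustration. children() returns only elements, while childNodes() also returns the text nodes between them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeVsElement {

    // Hypothetical helper: labels a node as text, or returns its tag name.
    static String describe(Node node) {
        if (node instanceof TextNode) {
            return "text[" + ((TextNode) node).text().trim() + "]";
        }
        return node.nodeName();
    }

    public static void main(String[] args) {
        String html = "<p>some text<br /><br />"
                + "<strong>categorized: </strong>like this<br /><br /></p>";
        Element p = Jsoup.parse(html).select("p").first();

        // children() holds Element objects only; childNodes() holds every Node,
        // including the TextNode that carries "like this".
        System.out.println("children():   " + p.children().size());
        System.out.println("childNodes(): " + p.childNodes().size());
        for (Node child : p.childNodes()) {
            System.out.println("  " + describe(child));
        }
    }
}
```

Every Element is a Node, but a TextNode is not an Element, which is why a selector never returns it even though it is reachable as a sibling.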

In more detail: I have no control over the HTML structure, because I am collecting data from many Consumer Product Safety Commission (CPSC) web pages. The pages come in a few different formats, and one format in particular is hard to parse with Java and Jsoup.

<div class="archived">
  <p style="text-align: center;"><strong><span style="color: #ff0000;">Note: The hotline number and ...</span></strong></p>
  <h2 style="text-align: left;">CPSC, Elkay Manufacturing Co. Announces ...</h2>
  <p>WASHINGTON, D.C. - The U.S. Consumer Product Safety Commission ...<br />
    <br />
    <strong>Name of product:</strong> Elkay hot/cold bottled water coolers <br />
    <br />
    <br />
    <strong>Units:</strong> 145,000<br />
    <br />
    <strong>Description:</strong> These 115 volt hot/cold bottled water coolers ... <br />
  <p><img title="Picture of Recalled Water Cooler" src="/PageFiles/73998/04175.jpg" alt="Picture of Recalled Water Cooler" width="110" height="434" /></p>
</div>

That particular section of HTML is shortened, but it originates from http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/

String url = "http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();
Elements archived = doc.select("div.archived > *");
for(Element ele : archived) {
    //what goes here to get those key/value pairs?
}

Solution

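This isn't a complete answer, but it should get you most of the way there: select every <strong> inside div.archived and read the node that follows it with nextSibling().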

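import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();

// Treat every <strong> inside div.archived as a key
Elements archived = doc.select("div.archived strong");

for (Element element : archived) {
    System.out.println("KEY: " + element.text());
    // nextSibling() returns a Node, so it can be the bare text after the tag;
    // printing a Node shows its outer HTML
    System.out.println("VALUE: " + element.nextSibling());
}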

Output:

KEY: Firm's Hotline: (800) 303-5507
VALUE: <br />

KEY: Name of product:
VALUE:  Wall Plug Ethernet Bridge

KEY: Units:
VALUE:  About 53,500 units

KEY: Manufacturer:
VALUE:  NETGEAR Inc., of Santa Clara, Calif.

KEY: Hazard:
VALUE:  The plastic housing on these units can detach, posing a shock hazard. 

and so on...

As you can see, some cleanup is still needed, such as discarding the first KEY/VALUE pair, where the <strong> wraps the whole line and its next sibling is a <br />. Otherwise it should work. Good luck!
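One way to do that cleanup, as a minimal sketch rather than a finished solution: keep a pair only when the node after the <strong> is a TextNode with non-blank text. The class name RecallFields and the helper extract are hypothetical, the sample HTML is a shortened stand-in for the real page, and stripping the trailing colon from the key is an assumption about the desired output. The code avoids the diamond operator so it compiles on Java 6.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class RecallFields {

    // Hypothetical helper: maps each <strong> label to the text node after it.
    static Map<String, String> extract(Document doc) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (Element strong : doc.select("div.archived strong")) {
            Node next = strong.nextSibling();
            // Skip a <strong> followed by an element (e.g. <br />) or by nothing.
            if (!(next instanceof TextNode)) {
                continue;
            }
            String value = ((TextNode) next).text().trim();
            if (value.length() == 0) {
                continue; // whitespace-only text between tags
            }
            String key = strong.text().trim();
            // Assumption: the trailing colon is not wanted in the key.
            if (key.endsWith(":")) {
                key = key.substring(0, key.length() - 1).trim();
            }
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page; swap in Jsoup.connect(url).get().
        String html = "<div class=\"archived\">"
                + "<p><strong><span>Note: The hotline number and ...</span></strong></p>"
                + "<p>WASHINGTON, D.C. ...<br /><br />"
                + "<strong>Name of product:</strong> Elkay hot/cold bottled water coolers <br />"
                + "<br /><strong>Units:</strong> 145,000<br /></p></div>";
        Map<String, String> fields = extract(Jsoup.parse(html));
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

A LinkedHashMap keeps the fields in page order. If a page repeats a label, later values overwrite earlier ones, so a list of pairs may fit better in that case.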

Licensed under: CC-BY-SA with attribution
