How to get content in a header using XPath

https://stackoverflow.com/questions/18787315

28-06-2022

Question

I'm extracting content from a web page using Yahoo Pipes. For some reason, the developer placed the article content within <h2> tags and I'm having difficulty getting the content from there.

The content looks like this:

<div id="divid"><h2>
<p>Some content</p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>

When I use //div[@id='divid'] I can fetch the content of the whole <div> block, but when I try //div[@id='divid']//h2 or //div[@id='divid']//h2/text() I get nothing.

What am I doing wrong and how can I get the content between the <h2> tags correctly?

You may want to check the actual web page.

Solution

You probably just missed ticking the "Use HTML5 parser" option. Without it, the parser could not match //h2.
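
Separately from the parser option, note that in XPath, h2/text() selects only the text nodes that are direct children of <h2>. In the sample markup those are just whitespace, because the real text sits inside the <p> children. The sketch below illustrates this with Python's standard-library ElementTree, which supports only a subset of XPath. It assumes a well-formed version of the snippet from the question, wrapped in a hypothetical <body> element, and it does not reproduce how Yahoo Pipes parses the real page.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the markup in the question.
# The <body> wrapper is an assumption added so that ".//div" can match.
SAMPLE = """<body>
<div id="divid"><h2>
<p>Some content</p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>
</body>"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative, hence the leading ".".
h2 = root.find(".//div[@id='divid']//h2")

# Rough analogue of h2/text(): the text directly inside <h2>,
# which is only whitespace in this sample.
direct_text = (h2.text or "").strip()

# Rough analogue of h2/p/text(): the text of each <p> child.
paragraphs = [(p.text or "").strip() for p in h2.findall("p")]

# Rough analogue of h2//text(): every piece of text below <h2>.
all_text = " ".join(t.strip() for t in h2.itertext() if t.strip())

print(repr(direct_text))
print(paragraphs)
print(all_text)
```

In a full XPath engine, the equivalent expressions would be along the lines of //div[@id='divid']//h2//text() or //div[@id='divid']//h2/p.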

That page is quite a piece of work. The text is full of <span...> tags with inline styles. I created a sample pipe to make some sense out of the page:

http://pipes.yahoo.com/pipes/pipe.info?_id=cf46006f77bdac4a6e57785c78cd0b2b

Licensed under: CC-BY-SA with attribution