Question

I'm extracting content from a web page using Yahoo Pipes. For some reason, the developer placed the article content within <h2> tags and I'm having difficulty getting the content from there.

The content looks like this:

<div id="divid"><h2>
<p>Some content<p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>

When I use //div[@id='divid'] I can fetch the content of the whole <div> block, but when I try //div[@id='divid']//h2 or //div[@id='divid']//h2/text() I get nothing.

What am I doing wrong and how can I get the content between the <h2> tags correctly?

You may want to check the actual web page.


Solution

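The most likely cause is that the "Use HTML5 parser" option was not ticked. Without it, //h2 does not match anything on this page. A plausible explanation is that <p> inside <h2> is invalid nesting, so the default parser may repair the markup by moving the paragraphs out of the <h2>, leaving it empty.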

That page is quite a piece of work. The text is full of <span...> tags with inline styles. I created a sample pipe to make some sense out of the page:

http://pipes.yahoo.com/pipes/pipe.info?_id=cf46006f77bdac4a6e57785c78cd0b2b
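
For anyone who wants to check the idea outside Pipes, here is a minimal sketch in Python using only the standard library. It is not what the pipe above does; it just shows that a lenient, event-based parser can collect the text between the <h2> tags without caring about the invalid nesting. The names SAMPLE, H2TextCollector and h2_text are made up for this illustration, and the sample markup is the snippet from the question, not the real page.

```python
from html.parser import HTMLParser

# Snippet copied from the question (including the unclosed <p>); an assumption
# standing in for the real page, which is not reproduced here.
SAMPLE = """<div id="divid"><h2>
<p>Some content<p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>"""


class H2TextCollector(HTMLParser):
    """Collects text found between <h2> and </h2> inside <div id="divid">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the target div (counts nested divs)
        self.in_h2 = False   # True between <h2> and </h2>
        self.chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Start counting at the target div, then track any nested divs.
            if self.div_depth or dict(attrs).get("id") == "divid":
                self.div_depth += 1
        elif tag == "h2" and self.div_depth:
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Skip whitespace-only fragments between tags.
        if self.in_h2 and data.strip():
            self.chunks.append(data.strip())


def h2_text(html):
    """Return the text fragments inside the <h2> of the target div."""
    parser = H2TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    print(h2_text(SAMPLE))  # ['Some content', 'Some more content']
```

Because this parser only reports tags and text as they appear and never builds a repaired tree, the paragraphs stay "inside" the <h2> as far as the collector is concerned. A tree-building parser that fixes the nesting first may not give the same result.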

Licensed under: CC-BY-SA with attribution