Get title tag from html page using XPath?

https://stackoverflow.com/questions/15141910

16-03-2022

Question

I have two pages that I'm trying to extract the title tag from using an XPath query. This page works: http://www.hobbyfarms.com/farm-directory/category-home-and-barn-resources-1.aspx

This page doesn't: http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A

Here's my code:

$dom = new DOMDocument();
@$dom->loadHTMLFile($href);
$xpath = new DOMXPath($dom);

$titleNode = $xpath->query("//title");
foreach ($titleNode as $n) {
    $pageTitle = $n->nodeValue;
}

I've also tried this:

$xpath->query('//title')->item(0)->textContent

But it doesn't work for that URL either.

Does anyone see why this is happening, and hopefully have a solution?

Solution

The response from that URL is gzip-compressed, so it has to be decompressed before it is parsed. The following script works:

$href = 'http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A';
$dom = new DOMDocument();
// Fetch the raw (compressed) body and decompress it before parsing
$file = gzdecode(file_get_contents($href));
$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$titleNode = $xpath->query('//title');
// Dump the first matching <title> element
var_dump($titleNode->item(0));

(Note the use of the gzdecode function.)
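
Since only one of the two pages is compressed, it may help to decompress conditionally. The following is a minimal sketch, not part of the original answer: maybeGunzip and extractTitle are hypothetical helper names, and the check assumes a gzip body can be recognised by its leading magic bytes (0x1f 0x8b).

```php
<?php
// Hypothetical helper: decompress only when the body starts with the gzip magic bytes.
function maybeGunzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        // Suppress the warning for corrupt data and fall back to the raw body.
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

// Hypothetical helper: return the text of the first <title>, or null if there is none.
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null;
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}
```

With these in place, the same call should work for both URLs from the question: extractTitle(maybeGunzip(file_get_contents($href))).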

OTHER TIPS

The second page uses the XHTML namespace, so you have to use XPath expressions qualified with that namespace:

$xpath->registerNamespace("xhtml", "http://www.w3.org/1999/xhtml");
$titleNode = $xpath->query("//xhtml:title|//title");
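
To see what the namespace registration changes, here is a small sketch using an inline XHTML string (an assumed sample, not the actual page) parsed with loadXML. When a document is parsed as XML, elements under a default xmlns are namespaced, so an unqualified //title matches nothing while the qualified form does. Whether this applies to a page loaded with loadHTMLFile depends on how the parser treats the namespace, so treat it as an illustration rather than a diagnosis.

```php
<?php
// Assumed sample document, parsed as XML so the default namespace is honoured.
$xhtml = '<?xml version="1.0"?>'
    . '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>Barns</title></head><body/></html>';

$dom = new DOMDocument();
$dom->loadXML($xhtml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Number of matches for the unqualified name.
$plain = $xpath->query('//title')->length;

// The union covers both namespaced and non-namespaced documents.
$qualified = $xpath->query('//xhtml:title|//title');
$title = $qualified->length > 0 ? $qualified->item(0)->textContent : null;
```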

Licensed under: CC-BY-SA with attribution
