Reading Google Sitemap XML via PHP [duplicate]

https://stackoverflow.com/questions/8442387

12-03-2021

Question

I have an image website hosted by a company. They generate (and submit to Google) a sitemap for my site. I'm trying to read the XML so I can "do stuff" with the data in my sitemap (namely, hunt down missing captions and missing titles, and randomly post one of these entries on my site as "image of the day"). The format of the sitemap is as follows:

<url>
    <loc>http://www/link</loc>
    <image:image>
        <image:loc>http://www/img.jpg</image:loc>
        <image:caption>caption for the image here</image:caption>
        <image:title>title of image here</image:title>
    </image:image>
</url>

My issue is that I've been struggling to parse this data into something usable in PHP. I've tried simplexml_load_file, but that only seems to capture the <loc> element and ignores the whole <image:image> element. I tried ->xpath(), but that gives the same result. How do I get this into a usable format?

Footnote: the XML file is gzipped, so to access my sitemap I use the following URL format to "read" it:

$url = "compress.zlib://http://www/sitemap/0.xml.gz";

I don't know if this has any effect on the input.

Solution

A quick-and-dirty workaround is to strip the image: prefix from the tags before parsing:

$url = "compress.zlib://http://www/sitemap/0.xml.gz";
$xml = file_get_contents($url);

// Strip the "image:" prefix from tag names so SimpleXML sees plain elements.
$xml = preg_replace('/image:(.*?)>/i', '$1>', $xml);

print_r(simplexml_load_string($xml));

Other suggestions

For completeness, I replaced the print_r() with the following:

foreach (simplexml_load_string($xml) as $entry) {
    $loc = $entry->loc;
    $caption = $entry->image->caption;
    $title = $entry->image->title;

    // do stuff here
}

The right (if not pretty) way to retrieve nodes that belong to other namespaces is to ask SimpleXML for them explicitly.

Let's take the following sitemap XML:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>http://www.example.com/</loc>
        <image:image>
            <image:loc>http://www.example.com/img.jpg</image:loc>
            <image:caption>image caption</image:caption>
            <image:title>image title</image:title>
        </image:image>
    </url>
    <url>
        <loc>http://www.example.com/about.php</loc>
    </url>
</urlset>

Load the XML from a URL:

$sitemap = simplexml_load_file($sitemap_url);

If you do:

$ns = $sitemap->getNamespaces(true);
print_r($ns);

You will get the following array:

Array
(
    [] => http://www.sitemaps.org/schemas/sitemap/0.9
    [image] => http://www.google.com/schemas/sitemap-image/1.1
)

Let's take the first url node (in real code you would surely loop with foreach):

$url = $sitemap->url[0];

To read the 'image' nodes, you must use the children() method, passing the right namespace as its argument:

$child = $url->children($ns['image']);

or, even uglier, with the namespace URI hard-coded:

$child = $url->children('http://www.google.com/schemas/sitemap-image/1.1');

By doing a

print_r($child);

you will get:

SimpleXMLElement Object
(
    [image] => SimpleXMLElement Object
        (
            [loc] => http://www.example.com/img.jpg
            [caption] => image caption
            [title] => image title
        )

)

So you can use, for example:

$caption = $child->image->caption;
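
Putting the steps above together, a loop over every url node might look like the following sketch. The sample XML is inlined so the snippet runs on its own, and the names $entries, $missing and $pick are illustrative only; they are not part of the original answer. It assumes each url may carry zero or more image:image children.

```php
<?php
// Sample sitemap, inlined so the sketch is self-contained (assumption: it has
// the same shape as the example sitemap shown earlier).
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>http://www.example.com/</loc>
        <image:image>
            <image:loc>http://www.example.com/img.jpg</image:loc>
            <image:caption>image caption</image:caption>
            <image:title>image title</image:title>
        </image:image>
    </url>
    <url>
        <loc>http://www.example.com/about.php</loc>
    </url>
</urlset>
XML;

$sitemap = simplexml_load_string($xml);
$ns = $sitemap->getNamespaces(true);

$entries = [];
foreach ($sitemap->url as $url) {
    // Children in the image namespace; the inner loop simply does not run
    // for pages that have no image.
    foreach ($url->children($ns['image'])->image as $image) {
        $entries[] = [
            'page'    => (string) $url->loc,
            'image'   => (string) $image->loc,
            'caption' => trim((string) $image->caption),
            'title'   => trim((string) $image->title),
        ];
    }
}

// Entries whose caption or title is empty or absent.
$missing = array_filter($entries, function ($e) {
    return $e['caption'] === '' || $e['title'] === '';
});

// A random entry, e.g. for an "image of the day" (null if there are none).
$pick = $entries ? $entries[array_rand($entries)] : null;

print_r($entries);
```

Casting each node to string up front keeps the collected data as plain arrays, so it can be filtered, sorted or cached without holding on to SimpleXMLElement objects.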

Hope this helps. More information is available in this article: http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/

Parse as array!

According to the XML description at http://www.sitemaps.org/protocol.html, a sitemap is a simple tree that maps well onto an array.

You can use a 3-line XML reader:

$sitemap_array = json_decode(
   json_encode( simplexml_load_string($sitemap_xml) ),
   TRUE
);

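Then traverse it with, e.g., foreach ($sitemap_array['url'] as $r), and inspect the structure first with var_dump($sitemap_array). Note that the conversion only picks up elements in the default namespace, so the image:* elements appear in the array only if their prefix has been stripped beforehand (as in the preg_replace workaround above). See also the PHP manual section on object iteration (oop5.iterations).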
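
The sketch below applies the conversion to a small inlined sitemap and walks the result. The sample data and the $first variable are assumptions made for the example.

```php
<?php
// Sample sitemap, inlined so the sketch is self-contained (assumed data).
$sitemap_xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>http://www.example.com/</loc>
        <image:image>
            <image:caption>image caption</image:caption>
        </image:image>
    </url>
    <url>
        <loc>http://www.example.com/about.php</loc>
    </url>
</urlset>
XML;

$sitemap_array = json_decode(
    json_encode(simplexml_load_string($sitemap_xml)),
    true
);

// With two or more <url> elements, 'url' is a list of associative arrays.
foreach ($sitemap_array['url'] as $r) {
    echo $r['loc'], "\n";
}

// Prefixed image:* children are not part of the default-namespace view that
// json_encode() serializes, so no 'image' key is present here.
$first = $sitemap_array['url'][0];
var_dump(array_key_exists('image', $first));
```

One caveat with this approach: when the sitemap contains a single url element, 'url' becomes one associative array rather than a list, so the shape is worth checking before looping.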

PS: you can also preselect nodes with XPath in SimpleXML before converting.
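
A minimal sketch of such an XPath selection follows. The prefix 'sm' is an arbitrary name chosen here for the default sitemap namespace (XPath cannot address a default namespace without a prefix), and the sample data and variable names are assumptions for the example.

```php
<?php
// Sample sitemap, inlined so the sketch is self-contained (assumed data).
$sitemap_xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>http://www.example.com/</loc>
        <image:image>
            <image:loc>http://www.example.com/img.jpg</image:loc>
            <image:caption>image caption</image:caption>
            <image:title>image title</image:title>
        </image:image>
    </url>
    <url>
        <loc>http://www.example.com/about.php</loc>
    </url>
</urlset>
XML;

$sitemap = simplexml_load_string($sitemap_xml);

// Bind both namespaces to prefixes usable inside XPath expressions.
$sitemap->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$sitemap->registerXPathNamespace('image', 'http://www.google.com/schemas/sitemap-image/1.1');

// Every image caption in the sitemap, as plain strings.
$captions = array_map('strval', $sitemap->xpath('//image:image/image:caption'));

// Page URLs that have no image title at all.
$untitled = array_map(
    'strval',
    $sitemap->xpath('//sm:url[not(image:image/image:title)]/sm:loc')
);

print_r($captions);
print_r($untitled);
```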

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow