Question

I have a vector of links from which I would like to create a sitemap.xml file (file protocol is available from here: http://www.sitemaps.org/protocol.html)

I understand the sitemap.xml protocol (it is rather simple), but I'm not sure what is the smartest way to use the {XML} package for it.

A simple example:

 links <- c("http://r-statistics.com",
             "http://www.r-statistics.com/on/r/",
             "http://www.r-statistics.com/on/ubuntu/")

How can "links" be used to construct a sitemap.xml file?

Was it helpful?

Solution

Is something like this what you are looking for. (It uses the httr package to get the last modified bit and writes the XML directly with the very useful whisker package.)

require(whisker)
require(httr)
tpl <- '
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 {{#links}}
   <url>
      <loc>{{{loc}}}</loc>
      <lastmod>{{{lastmod}}}</lastmod>
      <changefreq>{{{changefreq}}}</changefreq>
      <priority>{{{priority}}}</priority>
   </url>
 {{/links}}
</urlset>
'

links <- c("http://r-statistics.com", "http://www.r-statistics.com/on/r/", "http://www.r-statistics.com/on/ubuntu/")


map_links <- function(l) {
  tmp <- GET(l)
  d <- tmp$headers[['last-modified']]

  list(loc=l,
       lastmod=format(as.Date(d,format="%a, %d %b %Y %H:%M:%S")),
       changefreq="monthly",
       priority="0.8")
}

links <- lapply(links, map_links)

cat(whisker.render(tpl))

OTHER TIPS

I could not use @jverzani's solution, because I wasn't able to create a valid xml file from the cat output. Thus I created an alternative.

## Input a data.frame with 4 columns: loc, lastmod, changefreq, and priority
## This data.frame is named sm in the code below

library(XML)
doc <- newXMLDoc()
root <- newXMLNode("urlset", doc = doc)
temp <- newXMLNamespace(root, "http://www.sitemaps.org/schemas/sitemap/0.9")
temp <- newXMLNamespace(root, "http://www.google.com/schemas/sitemap-image/1.1", "image")

for (i in 1:nrow(sm))
{
  urlNode <- newXMLNode("url", parent = root)
  newXMLNode("loc", sm$loc[i], parent = urlNode)
  newXMLNode("lastmod", sm$lastmod[i], parent = urlNode)
  newXMLNode("changefreq", sm$changefreq[i], parent = urlNode)
  newXMLNode("priority", sm$priority[i], parent = urlNode)
  rm(i, urlNode)
}

saveXML(doc, file="sitemap.xml")
rm(doc, root, temp)
browseURL("sitemap.xml")
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top