How to create a sitemap.xml file using R and the {XML} package?
30-06-2021
Question
I have a vector of links from which I would like to create a sitemap.xml file (the protocol is described here: http://www.sitemaps.org/protocol.html).
I understand the sitemap.xml protocol (it is rather simple), but I'm not sure what the smartest way to use the {XML} package for it is.
A simple example:
links <- c("http://r-statistics.com",
           "http://www.r-statistics.com/on/r/",
           "http://www.r-statistics.com/on/ubuntu/")
How can "links" be used to construct a sitemap.xml file?
Solution
Is something like this what you are looking for? (It uses the httr package to get the last-modified date and writes the XML directly with the very useful whisker package.)
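library(whisker)
library(httr)

# The XML declaration must be the very first thing in the output, so the
# template starts on the same line as the opening quote (no leading newline).
tpl <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{{#links}}
  <url>
    <loc>{{{loc}}}</loc>
    <lastmod>{{{lastmod}}}</lastmod>
    <changefreq>{{{changefreq}}}</changefreq>
    <priority>{{{priority}}}</priority>
  </url>
{{/links}}
</urlset>
'

links <- c("http://r-statistics.com",
           "http://www.r-statistics.com/on/r/",
           "http://www.r-statistics.com/on/ubuntu/")

# Build one list entry per URL; lastmod comes from the Last-Modified response
# header (this assumes the server sends one).
map_links <- function(l) {
  tmp <- GET(l)
  d <- tmp$headers[['last-modified']]
  list(loc = l,
       # %a and %b are locale-dependent; an English locale is assumed here
       lastmod = format(as.Date(d, format = "%a, %d %b %Y %H:%M:%S")),
       changefreq = "monthly",
       priority = "0.8")
}

links <- lapply(links, map_links)

# whisker.render() looks up `links` in the calling environment
cat(whisker.render(tpl))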
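To end up with a file rather than console output, the rendered string can be written with cat(..., file = ). The sketch below is an offline variant: it skips the HTTP requests and assumes a fixed, purely illustrative lastmod value, and the names urls, xml_text and out_file are hypothetical. The template begins directly with the XML declaration, since a blank line before `<?xml` makes the file ill-formed, which is one likely cause of invalid output.

```r
library(whisker)

# Template starts with the XML declaration on the very first line
tpl <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{{#links}}
<url>
<loc>{{{loc}}}</loc>
<lastmod>{{{lastmod}}}</lastmod>
</url>
{{/links}}
</urlset>
'

urls <- c("http://r-statistics.com",
          "http://www.r-statistics.com/on/r/",
          "http://www.r-statistics.com/on/ubuntu/")

# Assumption: a fixed illustrative date instead of the Last-Modified header
links <- lapply(urls, function(u) list(loc = u, lastmod = "2021-06-30"))

# Pass the data explicitly instead of relying on the calling environment
xml_text <- whisker.render(tpl, data = list(links = links))

# Write to a temporary file; use "sitemap.xml" for a real run
out_file <- file.path(tempdir(), "sitemap.xml")
cat(xml_text, file = out_file)

# Quick sanity check: the declaration should be the first line of the file
first_line <- readLines(out_file, n = 1)
print(first_line)
```

If the XML package is installed, parsing the written file with xmlParse(out_file) is a stricter check that it is well-formed.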
OTHER TIPS
I could not use @jverzani's solution because I wasn't able to create a valid XML file from the cat output, so I created an alternative.
## Input: a data.frame named sm with 4 columns: loc, lastmod, changefreq, and priority
library(XML)

doc  <- newXMLDoc()
root <- newXMLNode("urlset", doc = doc)

## Declare the sitemap namespace; the image namespace is only relevant if image entries are added
temp <- newXMLNamespace(root, "http://www.sitemaps.org/schemas/sitemap/0.9")
temp <- newXMLNamespace(root, "http://www.google.com/schemas/sitemap-image/1.1", "image")

## Add one <url> node per row of sm
for (i in seq_len(nrow(sm))) {
  urlNode <- newXMLNode("url", parent = root)
  newXMLNode("loc", sm$loc[i], parent = urlNode)
  newXMLNode("lastmod", sm$lastmod[i], parent = urlNode)
  newXMLNode("changefreq", sm$changefreq[i], parent = urlNode)
  newXMLNode("priority", sm$priority[i], parent = urlNode)
}

saveXML(doc, file = "sitemap.xml")
rm(doc, root, temp)
browseURL("sitemap.xml")
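The code above expects sm to exist already. A minimal sketch of building it from the question's links vector follows. It assumes the same changefreq and priority for every page (the values are borrowed from the other answer) and uses today's date as a placeholder lastmod. Both are assumptions for illustration only.

```r
links <- c("http://r-statistics.com",
           "http://www.r-statistics.com/on/r/",
           "http://www.r-statistics.com/on/ubuntu/")

sm <- data.frame(
  loc        = links,
  lastmod    = format(Sys.Date()),  # placeholder date in YYYY-MM-DD form
  changefreq = "monthly",           # assumed identical for all pages
  priority   = "0.8",               # assumed identical for all pages
  stringsAsFactors = FALSE          # keep columns as character, not factor
)

str(sm)
```

Keeping the columns as character vectors matters here, because each value is passed straight to newXMLNode() as the node's text.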
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow