How do I use an XPath query to get a list of character vectors in R?

Question

It seems that the objects returned by getNodeSet behave oddly. When you print() them, you get a nicely formatted string representation of the node, but when you call as.character() on them, you get an error.

A straightforward way to find out why is to look at the source of print.XMLInternalNode and see what it does.

> getAnywhere(print.XMLInternalNode)
A single object matching ‘print.XMLInternalNode’ was found
It was found in the following places
  registered S3 method for print from namespace XML
  namespace:XML
with value

function (x, ...)
{
    cat(as(x, "character"), "\n")
}
<environment: namespace:XML>

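Ah ha! print() converts the node with as(x, "character"), which goes through the S4-style coercion mechanism from the methods package. The XML package apparently defines the node-to-character conversion there rather than as an as.character() S3 method, which is why as.character() fails while as() works.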

So to get all of the results as character vectors, I'd do something like this:

> dat <- htmlTreeParse(url, useInternalNodes=TRUE)
> x <- getNodeSet(dat, "//tr/td/a")
> sapply(x, function(n) {as(n, "character")})
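For reference, here is a minimal self-contained sketch of the same approach. The inline HTML string and the variable names are made up for illustration so that it runs without a live URL; it also shows xmlValue and xmlGetAttr, which may be closer to what you want if you only need the link text or the href rather than the full markup.

```r
library(XML)

# Hypothetical inline HTML standing in for the real page.
html <- '<html><body><table>
<tr><td><a href="/a">Alpha</a></td></tr>
<tr><td><a href="/b">Beta</a></td></tr>
</table></body></html>'

# Parse into internal (C-level) nodes so XPath queries work.
doc <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//tr/td/a")

# Serialized markup of each node, using the same coercion print() uses.
markup <- sapply(nodes, function(n) as(n, "character"))

# Often the text content or an attribute is all that is needed.
labels <- sapply(nodes, xmlValue)
hrefs <- sapply(nodes, xmlGetAttr, "href")
```

Each of markup, labels, and hrefs should come back as a plain character vector with one element per matched node.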

As to the second part of your question, I would recommend not worrying about optimizing the XPath query right now. Just get your stuff working first. Once it all works, if it's fast enough, you're done. If it's not, profile your code to determine where the bottlenecks are. It may not even be the XPath that's slowing you down. I'm only guessing, but the time it takes to retrieve the page from the web server is probably the biggest portion of your execution time.
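If you do get to the profiling stage, a rough first pass is to time each step separately with system.time(). The sketch below uses a generated HTML string (an assumption, so it runs offline); with a real URL you would wrap the download and parse step the same way and compare.

```r
library(XML)

# Hypothetical test document: 1000 table rows, each containing one link.
row <- '<tr><td><a href="/x">x</a></td></tr>'
html <- paste0("<html><body><table>",
               paste(rep(row, 1000), collapse = ""),
               "</table></body></html>")

# Time each stage on its own.
t_parse <- system.time(doc <- htmlParse(html, asText = TRUE))
t_query <- system.time(nodes <- getNodeSet(doc, "//tr/td/a"))
t_convert <- system.time(out <- sapply(nodes, function(n) as(n, "character")))

# Elapsed wall-clock seconds per stage.
timings <- c(parse = t_parse[["elapsed"]],
             query = t_query[["elapsed"]],
             convert = t_convert[["elapsed"]])
print(timings)
```

Whichever stage dominates is the one worth optimizing; Rprof() gives a finer breakdown if these coarse timings are not enough.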