Question

I would like to use XPath to return a character vector of the link inside each anchor tag.

I can return the table of interest with

library(RCurl)
library(XML)
url <- "http://dps.alaska.gov/sorweb/aspx/sorcra1.aspx"
readHTMLTable(url, useInternalNodes = TRUE)[[3]]

but I also want to return the link in each anchor tag associated with the name. This is what I have so far:

dat <- htmlTreeParse(url, useInternalNodes = TRUE)
getNodeSet(dat, "//tr/td/a")

My output is a list of XML node objects instead of the character vector I want, and because my XPath is imperfect it also matches anchor tags outside my table.

So my question has two parts. How do I convert the elements returned by getNodeSet into character vectors containing the tag, and what is an efficient way to write the XPath query I need?

Solution

The objects returned by getNodeSet behave unexpectedly. When you print() them, you get a nicely formatted string representation of the node, but when you call as.character() on them, it fails with an error.

A straightforward way to find out what print() is doing is to look at the code of the function print.XMLInternalNode.

> getAnywhere(print.XMLInternalNode)
A single object matching ‘print.XMLInternalNode’ was found
It was found in the following places
  registered S3 method for print from namespace XML
  namespace:XML
with value

function (x, ...)
{
    cat(as(x, "character"), "\n")
}
<environment: namespace:XML>

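Ah ha! The print method converts the node with as(x, "character"), which is the S4-style coercion from the methods package. The conversion is not set up as the usual as.whatever() S3 method, which is why as.character() does not work on these objects.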

So to get all of the results as a character vector, I'd do something like this:

> dat <- htmlTreeParse(url, useInternalNodes = TRUE)
> x <- getNodeSet(dat, "//tr/td/a")
> sapply(x, function(n) as(n, "character"))
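If what you want is only the link itself rather than the whole tag, a minimal sketch along the same lines could pull out the href attribute with xmlGetAttr and the link text with xmlValue. The HTML string below is a made-up stand-in for the real page (the names and hrefs are hypothetical), parsed from text so that the example runs without network access.

```r
library(XML)

# Hypothetical stand-in for the real page; names and hrefs are invented.
html <- '<html><body>
<table>
<tr><td><a href="detail.aspx?id=1">DOE, JOHN</a></td><td>Anchorage</td></tr>
<tr><td><a href="detail.aspx?id=2">ROE, JANE</a></td><td>Juneau</td></tr>
</table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Same kind of query as above: every anchor directly inside a table cell.
nodes <- getNodeSet(doc, "//tr/td/a")

# Whole tag serialized as a string, one element per anchor.
tags <- sapply(nodes, function(n) as(n, "character"))

# Only the value of the href attribute of each anchor.
hrefs <- sapply(nodes, xmlGetAttr, "href")

# Only the text between <a> and </a>.
labels <- sapply(nodes, xmlValue)

print(hrefs)
print(labels)
```

Here hrefs and labels are plain character vectors, so they can be added as columns next to the table returned by readHTMLTable, provided the rows line up.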

As to the second part of your question, I would recommend not worrying about optimizing the XPath query right now. Just get your code working first. Once it all works, if it's fast enough, you're done. If it's not, start profiling your code to find the bottlenecks. It may not even be the XPath that's slowing you down (just guessing, but the time it takes to retrieve the page from the web server is probably the largest part of your execution time).
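Separately from speed, the extra anchors from other parts of the page can probably be avoided by scoping the query to one table. The sketch below assumes the table of interest is the third table element in document order, which is a guess based on the [[3]] in the question. The HTML is again a hypothetical stand-in for the real page, and system.time is used only to show how one might measure the query before deciding whether it needs tuning.

```r
library(XML)

# Hypothetical page with three tables; only the third holds the links we want.
html <- '<html><body>
<table><tr><td><a href="nav.aspx">Home</a></td></tr></table>
<table><tr><td><a href="help.aspx">Help</a></td></tr></table>
<table>
<tr><td><a href="detail.aspx?id=1">DOE, JOHN</a></td></tr>
<tr><td><a href="detail.aspx?id=2">ROE, JANE</a></td></tr>
</table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Unscoped query: matches anchors in every table on the page.
all_links <- xpathSApply(doc, "//tr/td/a", xmlGetAttr, "href")

# Scoped query: (//table)[3] selects the third table in document order,
# and the rest of the path only looks inside it.
third_links <- xpathSApply(doc, "(//table)[3]//tr/td/a", xmlGetAttr, "href")

# Time the scoped query to see whether it is worth optimizing at all.
elapsed <- system.time(
  xpathSApply(doc, "(//table)[3]//tr/td/a", xmlGetAttr, "href")
)

print(all_links)
print(third_links)
print(elapsed[["elapsed"]])
```

If the real table has an id or class attribute, selecting on that attribute would likely be more robust than a positional index, since the position changes whenever the page layout does.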

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow