Question

I'm scraping the following site: http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States

Let's say I'm interested in scraping the 4th President: I can see from the table that it's "James Madison". Using the Chrome browser, I can quickly identify the XPath (Inspect element, Copy XPath). That gives me: "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a". However, that does not work with R:

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
html <- htmlTreeParse(url, useInternalNodes = TRUE)
xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
xpathSApply(html, xpath, xmlValue)

This returns NULL. The correct XPath to use here is "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a". So my questions are:

  1. How can I change settings in R so that R sees the same XPath as my Chrome browser? I believe it's something to do with the HTTP user agent? This post asked a similar question, but the answer didn't provide enough detail.
  2. If this is not possible, how can I use the XML package to quickly identify the correct XPath that leads to "James Madison", i.e. "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a"?

Thanks!


Solution

It turns out there is no tbody tag in the HTML the server sends; it is added by the browser. So the XPath that Chrome suggests is wrong for the raw document.

library(httr)
# url as defined in the question
grepl("table", content(GET(url), as = "text"))
# [1] TRUE
grepl("tbody", content(GET(url), as = "text"))
# [1] FALSE

Note: This is in NO WAY a recommendation to use regular expressions to parse HTML!

The problem arises because browsers are designed to be relatively forgiving of improperly formatted HTML. If a tag is unambiguously missing, the browser adds it; for example, if you send a page without a body tag, it will still render, because the browser adds the tag to the DOM after loading the page. htmlParse(...) doesn't work that way: it merely loads and parses the server's response. The HTML 4 spec requires a tbody inside every table (though its tags may be omitted in the source), so the browser inserts one. See this post for an explanation.
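You can see the same thing without regular expressions by querying the parsed document directly (a sketch using the html object parsed in the question):

length(getNodeSet(html, "//table"))  # > 0: the tables are in the raw HTML
length(getNodeSet(html, "//tbody"))  # 0: no tbody nodes until a browser adds them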

So one way to deal with this in a "semi-automatic" way is:

xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
# If the parsed document has no tbody nodes, drop /tbody from the Chrome XPath
if (length(html["//tbody"]) == 0) xpath <- gsub("/tbody", "", xpath)
xpathSApply(html, xpath, xmlValue)
# [1] "James Madison"
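The same check can be wrapped in a small helper so any Chrome-copied XPath can be normalized (a sketch; fix_chrome_xpath is a name I made up, not part of the XML package):

# Strip /tbody from a Chrome-copied XPath whenever the parsed
# document contains no tbody nodes (i.e. the browser added them)
fix_chrome_xpath <- function(doc, xpath) {
  if (length(getNodeSet(doc, "//tbody")) == 0) {
    xpath <- gsub("/tbody", "", xpath, fixed = TRUE)
  }
  xpath
}

chrome_xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
xpathSApply(html, fix_chrome_xpath(html, chrome_xpath), xmlValue)
# [1] "James Madison"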

Other tips

I would recommend the selectr package, which lets you use CSS-style selectors instead of XPath (which can be a pain at times). Alternatively, since you are looking for a table, I would recommend the readHTMLTable function, which automatically scrapes all tables on the page.

library(XML)
library(selectr)

url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
doc <- htmlParse(url)
# querySelector() returns the first node matching the CSS selector
tab <- querySelector(doc, 'table.wikitable')
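And for the readHTMLTable route mentioned above, a minimal sketch (assuming the presidents table is the first table on the page; inspect the result to confirm):

library(XML)

url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
# readHTMLTable() returns a list with one data frame per <table> on the page
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
# Index 1 is an assumption -- check length(tables) and names(tables)
presidents <- tables[[1]]
head(presidents)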