Question

I'm scraping the following site: http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States

Let's say I'm interested in scraping the 4th President - I can see from the table that it's "James Madison". Using a Chrome browser, I can quickly identify the XPath (Inspect element, Copy XPath). That gives me: "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a". However, that does not work in R:

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
html <- htmlTreeParse(url,useInternalNodes=T)
xpath <- paste("//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a",sep="")
xpathSApply(html, xpath, xmlValue)

Returns NULL. The correct XPath to use here is "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a". So my questions are:

  1. How can I change settings in R so that R sees the same XPath as my Chrome browser? I believe it has something to do with the HTTP user agent? This post asked a similar question, but the answer didn't provide enough detail.
  2. If this is not possible, how can I use the XML package to quickly identify the correct XPath which leads to "James Madison"? i.e. "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a"

Thanks!


Solution

It turns out there is no tbody tag in the HTML the server sends; it is added by the browser when it builds the DOM. So the XPath that Chrome recommends is, strictly speaking, wrong for the document you actually downloaded. You can confirm this with httr:

library(httr)
# fetch the raw page text once, then check it for the tags in question
page <- content(GET(url), as = "text")
grepl("table", page)
# [1] TRUE
grepl("tbody", page)
# [1] FALSE

Note: this is in NO WAY a recommendation to use regular expressions to parse HTML!
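
A cleaner way to make the same check is against the tree that htmlTreeParse already built, rather than grepping the raw text. This is a minimal sketch using getNodeSet from the XML package on the html object defined above:

# count tbody and table nodes in the parsed document
length(getNodeSet(html, "//tbody"))
# expect 0 here, since the served page has no tbody
length(getNodeSet(html, "//table"))
# expect a positive count: the tables themselves are there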

The problem arises because browsers are designed to be relatively forgiving of improperly formatted HTML. If a required tag is unambiguously missing, the browser adds it to the DOM (for example, a page served without a body tag still renders, because the browser inserts one after loading the page). htmlParse(...) doesn't work that way: it simply parses the server response as delivered. The HTML 4 spec treats tbody as an implied element inside table, so the browser adds it even when the markup omits it. See this post for an explanation.
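
To see the parser's behaviour in isolation, here is a small sketch with a made-up HTML fragment (no network needed). htmlParse keeps the markup exactly as written, so the tr stays a direct child of table and no tbody appears:

library(XML)
# hypothetical fragment, written the way a server might send it (no tbody)
frag <- htmlParse("<table><tr><td>James Madison</td></tr></table>", asText = TRUE)
length(getNodeSet(frag, "//table/tbody"))
# expect 0: the parser does not invent a tbody
length(getNodeSet(frag, "//table/tr"))
# expect 1: the tr sits directly under table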

So one way to deal with this, in a "semi-automatic" way is:

# start from the path Chrome suggests, then drop "/tbody" if the parsed
# document doesn't actually contain a tbody node
xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
if (length(html["//tbody"]) == 0) xpath <- gsub("/tbody", "", xpath)
xpathSApply(html, xpath, xmlValue)
# [1] "James Madison"

OTHER TIPS

I would recommend the selectr package, which lets you use CSS-style selectors instead of XPath (which can be a pain at times). Alternatively, since you are looking for a table, the readHTMLTable function automatically scrapes every table on the page; a sketch of that follows the snippet below.

library(XML)
library(selectr)

url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
doc <- htmlParse(url)
# CSS selector instead of XPath: grab the first table with class "wikitable"
tab <- querySelector(doc, 'table.wikitable')
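
And here is a sketch of the readHTMLTable route mentioned above: it turns every table on the page into a data frame, so no XPath is needed at all. Which element of the resulting list is the presidents table, and how its columns are laid out, depend on the live page, so the indices below are assumptions to adjust after inspecting the output:

tabs <- readHTMLTable(doc)
# inspect the list of parsed tables to find the presidents table
str(tabs, max.level = 1)
# assuming the first table is the presidents table, its 4th row is the 4th president
tabs[[1]][4, ]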
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow