Question

I'm scraping the following site: http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States

Let's say I'm interested in scraping the 4th President: I can see from the table that it's "James Madison". Using the Chrome browser, I can quickly identify the XPath (Inspect element, Copy XPath). That gives me: "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a". However, that does not work with R:

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
html <- htmlTreeParse(url, useInternalNodes = TRUE)
xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
xpathSApply(html, xpath, xmlValue)

This returns NULL. The correct XPath to use here is "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a". So my questions are:

  1. How can I change settings in R so that R sees the same XPath as my Chrome browser? I believe it's something to do with the HTTP user agent? This post asked a similar question, but the answer didn't provide enough detail.
  2. If this is not possible, how can I use the XML package to quickly identify the correct XPath that leads to "James Madison", i.e. "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a"?

Thanks!


Solution

It turns out there is no tbody tag in the HTML the server sends; it is added by the browser. So the XPath that Chrome suggests is wrong for the raw document.

library(httr)
# url as defined in the question
grepl("table", content(GET(url), as = "text"))
# [1] TRUE
grepl("tbody", content(GET(url), as = "text"))
# [1] FALSE

Note: This is in NO WAY a recommendation to use regular expressions to parse HTML!

The problem arises because browsers are designed to be relatively forgiving of improperly formatted HTML. If a tag is unambiguously missing, the browser adds it; for example, if you send a page without a body tag, it will still render, because the browser adds the tag to the DOM after loading the page. htmlParse(...) doesn't work that way: it merely loads and parses the server's response. The HTML 4 spec requires a tbody inside every table (though its tags may be omitted in the source), so the browser inserts one. See this post for an explanation.
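You can see the same thing without regular expressions by querying the parsed document directly (a sketch using the html object parsed in the question):

length(getNodeSet(html, "//table"))  # > 0: the tables are in the raw HTML
length(getNodeSet(html, "//tbody"))  # 0: no tbody nodes until a browser adds them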

So one way to deal with this in a "semi-automatic" way is:

xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
# If the parsed document has no tbody nodes, drop /tbody from the Chrome XPath
if (length(html["//tbody"]) == 0) xpath <- gsub("/tbody", "", xpath)
xpathSApply(html, xpath, xmlValue)
# [1] "James Madison"
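The same check can be wrapped in a small helper so any Chrome-copied XPath can be normalized (a sketch; fix_chrome_xpath is a name I made up, not part of the XML package):

# Strip /tbody from a Chrome-copied XPath whenever the parsed
# document contains no tbody nodes (i.e. the browser added them)
fix_chrome_xpath <- function(doc, xpath) {
  if (length(getNodeSet(doc, "//tbody")) == 0) {
    xpath <- gsub("/tbody", "", xpath, fixed = TRUE)
  }
  xpath
}

chrome_xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
xpathSApply(html, fix_chrome_xpath(html, chrome_xpath), xmlValue)
# [1] "James Madison"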

Other tips

I would recommend the selectr package, which lets you use CSS-style selectors instead of XPath (which can be a pain at times). Alternatively, since you are looking for a table, I would recommend the readHTMLTable function, which automatically scrapes all tables on the page.

library(XML)
library(selectr)

url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
doc <- htmlParse(url)
# querySelector() returns the first node matching the CSS selector
tab <- querySelector(doc, 'table.wikitable')
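And for the readHTMLTable route mentioned above, a minimal sketch (assuming the presidents table is the first table on the page; inspect the result to confirm):

library(XML)

url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
# readHTMLTable() returns a list with one data frame per <table> on the page
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
# Index 1 is an assumption -- check length(tables) and names(tables)
presidents <- tables[[1]]
head(presidents)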