Question

I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.

How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.

library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))

for (i in 1:nrow(df)) {
    URL <- df$URL[i]
    # Exception handling
    Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
    if (inherits(Test, "try-error")) next
    HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
    Result <- xpathSApply(HTML, "//li", xmlValue)
    print(URL)
    print(Result[1])
}

Let's assume that the URL to be scraped is accessible at this step:

Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next

But then the URL stops working just before this step:

HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)

Then htmlTreeParse won't work, R will throw up a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped - how can I accomplish this?

Thanks

Solution

Try this:

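library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in seq_along(df)) {
  URL <- df[i]
  # GET() raises an R error if the connection itself fails, so guard it with try()
  response <- try(GET(URL), silent = TRUE)
  if (inherits(response, "try-error")) next
  # Skip pages that did not return HTTP 200
  if (status_code(response) != 200) next
  # Parse the body that was already downloaded instead of fetching the URL again
  HTML <- htmlTreeParse(content(response, as = "text"), asText = TRUE, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  # Skip pages with no <li> tags
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}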
# [1] "http://www.ask.com/"
# [1] "\n          \n              Answers          \n        "
# [1] "http://www.bing.com/"
# [1] "Images"

So there are potentially (at least) two things going on here: the HTTP request fails, or the response contains no <li> tags. The code uses GET(...) from the httr package to download the whole response once, checks its status code, and parses the body it already has rather than requesting the URL a second time. It also skips pages that contain no <li> tags.
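
Since the failures described in the question look transient, one option is to retry each URL a few times before skipping it. The following is only a sketch under stated assumptions: with_retry, fetch_items and scrape_all are hypothetical helper names (not part of httr or XML), and the retry count, pause and 10-second timeout are arbitrary choices.

```r
# Sketch only: with_retry(), fetch_items() and scrape_all() are hypothetical
# helpers written for this example; they are not part of httr or XML.

# Call f(url) up to max_tries times; return NULL if every attempt raises an error.
# Note: a NULL returned by f itself is also treated as a failed attempt.
with_retry <- function(f, url, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait)  # brief pause before retrying
  }
  NULL
}

# Fetch one page and return the text of its <li> elements.
# A dropped connection, a timeout, or a 4xx/5xx status becomes an R error,
# which with_retry() catches.
fetch_items <- function(url) {
  response <- httr::GET(url, httr::timeout(10))
  httr::stop_for_status(response)
  html <- XML::htmlTreeParse(httr::content(response, as = "text"),
                             asText = TRUE, useInternalNodes = TRUE)
  XML::xpathSApply(html, "//li", XML::xmlValue)
}

# Loop over the URLs, skipping any that still fail after the retries
# or that contain no <li> tags.
scrape_all <- function(urls, fetch = fetch_items, ...) {
  for (url in urls) {
    items <- with_retry(fetch, url, ...)
    if (is.null(items) || length(items) == 0) next
    print(url)
    print(items[1])
  }
}
```

Calling scrape_all(df) with the vector of URLs from above would then print the first <li> of every page that could be fetched. Wrapping the request and the parsing in a single function means any error in either step is caught in one place, so the loop never stops on a bad URL.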

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow