Question

G'day Everyone,

I have a very long list of place names (~15,000) that I want to use to look up wiki pages and extract data from them. Unfortunately, not all of the places have wiki pages, and when htmlParse() hits one of them it stops the function and returns an error:

    Error: failed to load HTTP resource

I can't go through and remove every place name that creates a non-existent URL, so I was wondering: is there a way to get the function to skip places that don't have a wiki page?

    # Town names to be used
    towns <- data.frame('recID' = c('G62', 'G63', 'G64', 'G65'), 
                    'state' = c('Queensland', 'South_Australia', 'Victoria', 'Western_Australia'),
                    'name'  = c('Balgal Beach', 'Balhannah', 'Ballan', 'Yunderup'),
                    'feature' = c('POPL', 'POPL', 'POPL', 'POPL'))

    towns$state <- as.character(towns$state)

    towns$name <- sub(' ', '_', as.character(towns$name))

    # Function that extracts data from the wiki pages
    wiki.tables <- function(towns)  {
      require(RJSONIO)
      require(XML)
      # Build the URLs as http://en.wikipedia.org/wiki/<name>,_<state>
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, function(x) htmlParse(x))
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
                     readHTMLTable)
      return(tabs)
    }

    # Now to run the function. Yunderup will produce a URL that 
    # doesn't exist. So this will result in the error.
    test <- wiki.tables(towns[,c('name', 'state')])

    # It works if I don't include the place that produces a non-existent URL.
    test <- wiki.tables(towns[1:3,c('name', 'state')])

Is there a way to identify these non-existent URLs and either skip them or remove them?
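
To make it concrete, the behaviour I'm after is something like this rough sketch, where a page that fails to load becomes NULL and gets dropped (assuming `u` is the vector of URLs built inside `wiki.tables()`):

    # Rough idea: wrap htmlParse() in tryCatch() so a page that fails to load
    # gives NULL instead of stopping the whole lapply()
    res <- lapply(u, function(x) tryCatch(htmlParse(x), error = function(e) NULL))
    res <- res[!sapply(res, is.null)]   # drop the towns whose pages failed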

Thanks for your help!

Cheers, Adam


Solution 2

Here's another option that uses the httr package. (BTW, you don't need RJSONIO.) Replace your wiki.tables(...) function with this:

    wiki.tables <- function(towns)  {
      require(httr)
      require(XML)
      # Fetch a URL and parse it only if the request came back with HTTP 200;
      # otherwise return NULL so the page can be dropped afterwards
      get.HTML <- function(url){
        resp <- GET(url)
        if (resp$status_code == 200) return(htmlParse(content(resp, type = "text")))
      }
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, get.HTML)
      res <- res[sapply(res, function(x) !is.null(x))]   # remove NULLs (missing pages)
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
                     readHTMLTable)
      return(tabs)
    }

This runs one GET request per URL and tests the status code. The disadvantage of url.exists(...) is that you have to query every URL twice: once to see if it exists, and again to get the data.
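
If it helps, here is the status-code check in isolation, as a minimal sketch; the second article title is invented purely to hit a missing page:

    library(httr)

    # An existing article should come back with status 200
    GET('http://en.wikipedia.org/wiki/Ballan,_Victoria')$status_code

    # An invented title should not, so get.HTML() returns NULL for it and the
    # !is.null() filter then drops that page
    GET('http://en.wikipedia.org/wiki/No_Such_Town,_Nowhere')$status_code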

Incidentally, when I tried your code, the Yunderup URL does in fact exist??

OTHER TIPS

You can use the `url.exists` function from `RCurl`:

    require(RCurl)
    u <- paste('http://en.wikipedia.org/wiki/',
               sep = '', towns[,'name'], ',_', towns[,'state'])
    > sapply(u, url.exists)
       http://en.wikipedia.org/wiki/Balgal_Beach,_Queensland
                                                        TRUE
     http://en.wikipedia.org/wiki/Balhannah,_South_Australia
                                                        TRUE
               http://en.wikipedia.org/wiki/Ballan,_Victoria
                                                        TRUE
    http://en.wikipedia.org/wiki/Yunderup,_Western_Australia
                                                        TRUE
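
The logical vector from `sapply` can then be used to drop the rows without a page before calling `wiki.tables(...)`, along these lines (just a sketch reusing the objects above):

    ok <- sapply(u, url.exists)                           # TRUE where the wiki page exists
    test <- wiki.tables(towns[ok, c('name', 'state')])    # query only those towns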
Licensed under: CC-BY-SA with attribution