Question

I have managed to scrape content for a single url, but am struggling to automate it for multiple urls.

Here is how it is done for a single page:

library(XML); library(data.table)
# 'url' is the page identifier taken from my csv (read in below)
theurl <- paste("http://google.com/", url, "/ul", sep = "")
convertUTF <- htmlParse(theurl, encoding = "UTF-8")
tables <- readHTMLTable(convertUTF)
# keep the table with the most rows
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
table <- tables[[which.max(n.rows)]]
TableData <- data.table(table)

Now I have a vector of urls and want to scrape each for the corresponding table:

Here, I read in data comprising multiple http links:

ur.l <- data.frame(read.csv(file.choose(), header=TRUE, fill=TRUE))

theurl <- matrix(NA, nrow = nrow(ur.l), ncol = 1)
for (i in 1:nrow(ur.l)) {
  url <- as.character(ur.l[i, 2])
}

Solution

Each of the three additional urls that you provided refers to a page that contains no tables, so it's not a particularly useful example dataset. However, a simple way to handle errors is with tryCatch. Below I've defined a function that reads the tables at url u, counts the rows in each, and returns the table with the most rows as a data.table.

You can then use sapply to apply this function to each url (or, in your case, each org ID, e.g. 36245119) in a vector.

library(XML); library(data.table)

scrape <- function(u) {
  tryCatch({
    # build the page url from the org ID and read every table on it
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"), 
                          encoding='utf-8')
    # keep the table with the most rows and return it as a data.table
    tab <- tabs[[which.max(sapply(tabs, nrow))]]
    data.table(tab)  
  }, error=function(e) e)  # on failure, return the error object itself
}

urls <- c('36245119', '46894853', '46892460', '46888721')
res <- sapply(urls, scrape)

Take a look at ?tryCatch if you want to improve the error handling; as written, the function simply returns the error object itself for any url that fails.
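For example, here is a minimal sketch of stricter handling (it reuses the finstat.sk url pattern from above; the name scrape2 is just for illustration): it returns NULL with a warning when a url fails, then drops the failed urls afterwards.

library(XML); library(data.table)

scrape2 <- function(u) {
  tryCatch({
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"),
                          encoding='utf-8')
    data.table(tabs[[which.max(sapply(tabs, nrow))]])
  }, error=function(e) {
    warning(sprintf("skipping %s: %s", u, conditionMessage(e)))
    NULL  # NULL marks a failed url
  })
}

res <- lapply(urls, scrape2)
names(res) <- urls
ok <- Filter(Negate(is.null), res)  # keep only the urls that produced a table

If the scraped tables share the same column layout, rbindlist(ok, idcol = "org") would stack them into a single data.table labelled by the url each row came from (assuming your version of data.table supports the idcol argument).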

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow