Question

I wonder whether someone would be so kind as to suggest improvements where possible:

library(RCurl)    # getURL
library(XML)      # htmlParse, xpathApply, xmlValue
library(stringr)  # str_trim

ContactAuto2 <- function(x) {
    tryCatch({
        tabs       <- getURL(paste("http://google.de/", x, sep = ""))
        xmltext    <- htmlParse(tabs, encoding = "UTF-8", asText = TRUE)
        xmltable   <- xpathApply(xmltext, "//adresse/text()[preceding-sibling::br]", xmlValue)
        val        <- gsub("\r\n", "", xmltable)
        val2       <- gsub("(\\D)([0-9])", "\\1 \\2", val)
        FinalValue <- str_trim(val2)
    }, error = function(e) cat())
}


Base <- lapply(url1, ContactAuto2)

If I use sapply(url1, ContactAuto2) there is a speed improvement, but the data layout is not good.

Is there a way to improve the speed here? Why is lapply so slow compared to sapply?

I would just like you to glance through and make suggestions, if any, since there is no fully workable example.


Solution

First and foremost, read ?Rprof. It's a speed profiler, and will give you a table of the run-times of every individual execution in your function, making it easy to see where the speed issues are.
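As a minimal sketch of how you would use it (the file name "profile.out" is just an example):

Rprof("profile.out")                 # start the profiler, writing samples to a file
Base <- lapply(url1, ContactAuto2)   # run the code you want to measure
Rprof(NULL)                          # stop profiling
summaryRprof("profile.out")          # table of time spent in each function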

Here's one (my first) suggestion.

tabs <- getURL(paste("http://google.de/",x,sep=""))

Google is a massive search engine, and depending on what you're searching for, "Googling" on every iteration may eat up a lot of time. Additionally, you're nesting paste inside a function call that downloads information. While it may not matter much in your function, nesting function calls almost always slows things down. Consider building the URLs with paste before calling getURL (see the sketch after the timings below). Also, I would use paste0 instead of paste in this situation.

system.time(replicate(1e6, paste('a', 'b', sep = '')))
## user  system elapsed 
## 5.864   0.000   5.679 
system.time(replicate(1e6, paste0('a', 'b')))
## user  system elapsed 
## 3.98    0.00    3.82 
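
For illustration only, here is a rough sketch of that change. It assumes url1 holds the query fragments you append to the base URL, and it keeps the rest of your function as posted:

## Assemble every URL once, up front, instead of pasting inside the download call
urls <- paste0("http://google.de/", url1)

ContactAuto3 <- function(u) {
    tryCatch({
        tabs     <- getURL(u)   # the URL is already built at this point
        xmltext  <- htmlParse(tabs, encoding = "UTF-8", asText = TRUE)
        xmltable <- xpathApply(xmltext, "//adresse/text()[preceding-sibling::br]", xmlValue)
        val      <- gsub("(\\D)([0-9])", "\\1 \\2", gsub("\r\n", "", xmltable))
        str_trim(val)            # return the cleaned value
    }, error = function(e) NULL) # return NULL if the request or parse fails
}

Base <- lapply(urls, ContactAuto3)

If I recall correctly, getURL also accepts a vector of URLs, so you could fetch everything in one call and parse afterwards, but that is worth benchmarking against your own data.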

I can't go much further than that without knowing what the xmltables look like. Can you please provide a few sample URLs for us to test your function on?

Licensed under: CC-BY-SA with attribution