Question

I wrote a function that uses RCurl to obtain the effective URL for a list of shortened-URL redirects (bit.ly, t.co, etc.) and to handle errors when the effective URL points to a document (PDFs tend to throw "Error in curlPerform... embedded nul in string").

I would like to make this function more efficient if possible (while keeping it in R). As written, the run time is prohibitively long for un-shortening a thousand or more URLs.

?getURI tells us that by default, getURI/getURL goes asynchronous when the length of the url vector is >1. But my performance seems totally linear, presumably because sapply turns the thing into one big for loop and the concurrency is lost.
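
For example, the concurrent path described in the help page is the one where getURI receives the whole vector in a single call, versus one call per element via sapply. A minimal illustration, using the testUrls vector defined below, that only shows the concurrency difference (not the redirect/error handling):

# one call with the whole vector: RCurl can run the requests concurrently
pages <- getURI(testUrls, async = TRUE)
# one getURI call per element: requests are issued sequentially
pages <- sapply(testUrls, getURI)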

Is there any way I can speed up these requests? Extra credit for fixing the "embedded nul" issue.

require(RCurl)

options(RCurlOptions = list(verbose = F, followlocation = T,
                        timeout = 500, autoreferer = T, nosignal = T,
                        useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))

# find successful location (or error msg) after any redirects
getEffectiveUrl <- function(url){ 
  c = getCurlHandle()
  h = basicHeaderGatherer()
  curlSetOpt( .opts = list(header=T, verbose=F), curl= c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI( url, curl=c, followlocation=T, 
                                    headerfunction = h$update, async=T),
                            error=function(e) e)  
  if(inherits(possibleError, "error")){
    effectiveUrl <- "ERROR_IN_PAGE" # fails on linked documents (PDFs etc.)
  } else { 
    headers <- h$value()
    names(headers) <- tolower(names(headers)) #sometimes cases change on header names?
    statusPrefix <- substr(headers[["status"]],1,1) #1st digit of http status
    if(statusPrefix=="2"){ # status = success
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else{ effectiveUrl <- paste(headers[["status"]] ,headers[["statusmessage"]]) } 
  }
  effectiveUrl
}

testUrls <- c("http://t.co/eivRJJaV4j","http://t.co/eFfVESXE2j","http://t.co/dLI6Q0EMb0",
              "http://www.google.com","http://1.uni.vi/01mvL","http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH","http://www.amazon.com","http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z") # 10th URL redirects to content w/ embedded nul
system.time(
  effectiveUrls <- sapply(X= testUrls, FUN=getEffectiveUrl, USE.NAMES=F)
) # takes 7-10 secs on my laptop

# does Vectorize help? 
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
) # nope, makes it worse

Solution

I had a bad experience with RCurl and async requests. R would completely freeze (no error message, and CPU and RAM did not spike) with as few as 20 concurrent requests.

I recommend switching to the curl package and using its curl_fetch_multi() function. In my case it could easily handle 50,000 JSON requests in one pool (with some division into sub-pools under the hood): https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests
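
Roughly, the pattern looks like this. It is only a sketch, not the original poster's code: resolveUrls is a made-up helper name, and the pool size, timeout and the header-only nobody = TRUE option are assumptions to adjust for your own use case. It shows the shape of a multi-interface version of the question's task: queue every URL on a pool, collect the effective URL or error in callbacks, then run the pool once.

require(curl)

# Sketch: resolve effective URLs concurrently with curl's multi interface
resolveUrls <- function(urls, total_con = 50){
  results <- setNames(vector("list", length(urls)), urls)
  pool <- new_pool(total_con = total_con)
  lapply(urls, function(u){
    # nobody = TRUE sends a HEAD-style request, so no document body is
    # downloaded at all (drop it if a server mishandles HEAD)
    h <- new_handle(followlocation = TRUE, timeout = 30, nobody = TRUE)
    curl_fetch_multi(u, handle = h, pool = pool,
      done = function(res){
        # res$url is the effective URL after redirects; any body stays a
        # raw vector, so embedded nuls never reach an R character string
        ok <- res$status_code >= 200 && res$status_code < 300
        results[[u]] <<- if(ok) res$url else paste("HTTP", res$status_code)
      },
      fail = function(msg){
        results[[u]] <<- paste("ERROR:", msg)
      })
  })
  multi_run(pool = pool)  # performs all queued requests concurrently
  unlist(results)
}

# effectiveUrls3 <- resolveUrls(testUrls)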

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow