Question

I'd like to ask about an issue I'm currently stuck on. When trying to scrape an HTML page (using RCurl), I encounter this error: "Error in curlMultiPerform(multiHandle): embedded nul in string". I've read a lot about this type of error and advice on how to deal with it (including advice from Duncan Temple Lang, the creator of the RCurl package). But even after applying his suggestion (as follows), I'm still getting the same error:

htmlPage <- rawToChar(getURLContent(url, followlocation = TRUE, binary = TRUE))
doc <- htmlParse(htmlPage, asText=TRUE)
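A common variant of that advice strips the embedded nul bytes from the raw vector before converting it to character (a sketch, assuming `url` holds the page address):

```r
library(RCurl)
library(XML)

# Fetch the page as a raw vector, remove any embedded nul bytes,
# then convert to character and parse.
rawPage <- getURLContent(url, followlocation = TRUE, binary = TRUE)
rawPage <- rawPage[rawPage != as.raw(0)]   # drop the nuls that trip rawToChar()
htmlPage <- rawToChar(rawPage)
doc <- htmlParse(htmlPage, asText = TRUE)
```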

Am I missing something? Any help will be much appreciated!


Edit:

There's also a second error I haven't mentioned in the original post. It occurs here:

data <- lapply(i <- 1:length(links),
               function(url) try(read.table(bzfile(links[i]),
                                            sep=",", row.names=NULL)))

The error: Error in bzfile(links[i]) : invalid 'description' argument.

'links' is a list of the files' full URLs, constructed as follows:

links <- lapply(filenames, function(x) paste(url, x, sep="/"))

By using links[i], I'm trying to refer to the current element of the links list in the ongoing iteration of lapply().
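For reference, lapply() can hand each element of the list to the function directly, which avoids manual index bookkeeping altogether (a sketch, assuming each element of links is a single URL string):

```r
# lapply() passes one element of `links` at a time as `u`, so bzfile()
# always receives a single character string for its 'description'.
data <- lapply(links, function(u) try(read.table(bzfile(u),
                                                 sep = ",", row.names = NULL)))
```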


Second Edit:

Currently I'm struggling with the following code. I've found several other places where people advise exactly the same approach, which makes me wonder why it doesn't work in my situation...

getData <- function(x) try(read.table(bzfile(x), sep = ",", row.names = NULL))
data <- lapply(seq_along(links), function(i) getData(links[[i]]))

Solution 2

I was able to figure out the causes of the issues described above myself. It took a lot of time and effort, but it was worth it - now I understand R lists and lapply() better.

Essentially, I made three major changes:

1) added textConnection() and readLines() to process CSV-like files:

conn <- gzcon(bzfile(file, open = "r"))
tConn <- textConnection(readLines(conn))
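Completing that pattern, the text connection can then be fed to read.table() like an ordinary CSV source, closing both connections afterwards (a sketch):

```r
# The text connection now behaves like a plain in-memory CSV file.
df <- read.table(tConn, sep = ",", row.names = NULL)
close(tConn)
close(conn)
```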

However, I've discovered some issues with this approach - see my other SO question: Extremely slow R code and hanging.

2) used the correct subsetting notation to refer to the appropriate element of the list inside the function(i) passed to lapply():

url <- links[[1]][i]

3) used the correct subsetting notation to pass the whole list to lapply():

data <- lapply(seq_along(links[[1]]), getData)
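Put together, the three changes amount to something like this (a sketch, not my exact code; getData() takes an index to match the lapply() call above):

```r
getData <- function(i) {
  conn  <- bzfile(links[[1]][i], open = "r")    # change 2: [[ then [ indexing
  tConn <- textConnection(readLines(conn))      # change 1: read via a text connection
  on.exit({ close(tConn); close(conn) })        # tidy up even if read.table() fails
  try(read.table(tConn, sep = ",", row.names = NULL))
}
data <- lapply(seq_along(links[[1]]), getData)  # change 3: iterate over links[[1]]
```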

Thanks to all who participated in and helped answering this question!

OTHER TIPS

Sasha,

try this

library(XML)
url <- "http://flossdata.syr.edu/data/fc/2013/2013-Dec/"
doc <- htmlParse(url)
ndx <- getNodeSet(doc,"//table")

It works like a charm.
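If the goal is to go on and extract the file listing, readHTMLTable() from the same XML package converts each table node to a data frame (a sketch; the "Name" column is an assumption about that page's layout):

```r
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
# The directory listing is typically the first table on such index pages.
filenames <- tables[[1]]$Name
```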

Good luck.

S.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow