XPath for each Document in R Corpus
31-05-2021
Question
I have a corpus, x, in R created from a directory using DirSource. Each document is a text file containing the full HTML of a related vBulletin forum webpage. Since it is a thread, each document has multiple separate posts that I want to capture with my XPath. The XPath seems to work, but I cannot put all my captured nodes back into the corpus.
If my corpus has 25 documents with an average of 4 posts each, then my new corpus should have 100 documents. I'm wondering whether I have to write a loop and build a new corpus.
Here is my messy work so far. Any source from a thread in www.vbulletin.org/forum/ is an example of the structure.
#for stepping through
xt <- x[[5]]
xpath <- "//div[contains(@id,'post_message')]"
getxpath <- function(xt,xpath){
  require(XML)
  #either parse
  doc <- htmlParse(file=xt)
  #doc <- htmlTreeParse(tolower(xt), asText = TRUE, useInternalNodes = TRUE)
  #don't know which to use
  #result <- xpathApply(doc,xpath,xmlValue)
  result <- xpathSApply(doc,xpath,xmlValue)
  #clean up
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=result))
  result <- c(result[1:length(result)])
  free(doc)
  #converts group of nodes into 1 data frame with numbers before separate posts
  #require(plyr)
  #xbythread <- ldply(.data=result,.fun=function(x){unlist(x)})
  #don't know what needs to be returned
  result <- Corpus(VectorSource(result))
  #result <- as.PlainTextDocument(result)
  return(result)
}
#call
x2 <- tm_map(x=x,FUN=getxpath,"//div[contains(@id,'post_message')]")
Solution
Figured it out a while ago. htmlParse needs isURL=TRUE.
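getxpath <- function(xt,xpath){
  require(XML);require(tm)
  #xt is the URL of the page to parse
  doc <- htmlParse(file=xt,isURL=TRUE)
  #one string per node matched by the XPath
  resultvector <- xpathSApply(doc,xpath,xmlValue)
  #collapse newlines, tabs and runs of whitespace into single spaces
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=resultvector))
  #release the parsed document
  free(doc)
  return(result)
}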
res <- getxpath("http://url.com/board.html","//xpath")
To process all the files, I use list.files to get the file list, Map (or clusterMap from the parallel package) with getxpath() to collect the results in a list, do.call to flatten that list into a single vector, and Corpus(VectorSource(res)) to turn the vector into a corpus with one document per post.
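The steps described above might look roughly like the following. This is only a sketch under some assumptions: the helper names extract_posts and build_post_corpus and the pattern argument are made up for illustration, the pages are assumed to be local files saved with a .txt extension, and htmlParse is called without isURL because the inputs here are file paths rather than URLs.

```r
library(XML)
library(tm)

# Hypothetical helper: returns one cleaned string per node matched by `xpath`
# in the HTML file at `path`.
extract_posts <- function(path, xpath) {
  doc <- htmlParse(file = path)
  on.exit(free(doc))  # release the parsed document when the function exits
  posts <- xpathSApply(doc, xpath, xmlValue)
  # xpathSApply gives an empty list when nothing matches, so coerce explicitly
  posts <- as.character(unlist(posts))
  # collapse all whitespace runs and trim the ends
  trimws(gsub("\\s+", " ", posts))
}

# Hypothetical wrapper: list the files, extract the posts from each one,
# flatten the per-file results, and build a corpus with one document per post.
build_post_corpus <- function(dir, xpath, pattern = "\\.txt$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  per_file <- Map(extract_posts, files, MoreArgs = list(xpath = xpath))
  posts <- do.call(c, unname(per_file))
  Corpus(VectorSource(posts))
}

# Example call (the directory name is a placeholder):
# x2 <- build_post_corpus("threads", "//div[contains(@id,'post_message')]")
```

With 25 thread files averaging 4 posts each, this should give a corpus of about 100 documents. For a large directory, Map could be swapped for clusterMap, provided the XML package is loaded on each worker.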