XPath for each Document in R Corpus

https://stackoverflow.com/questions/10144280

31-05-2021

Question

I have a corpus, x, in R created from a directory using DirSource. Each document is a text file containing the full HTML of a related vBulletin forum webpage. Since it is a thread, each document has multiple separate posts that I want to capture with my XPath. The XPath seems to work, but I cannot put all my captured nodes back into the corpus.

If my corpus has 25 documents that average 4 posts each, then my new corpus should have 100 documents. I'm wondering whether I have to write a loop and build a new corpus.

Here is my messy work so far. Any source from a thread in www.vbulletin.org/forum/ is an example of the structure.

#for stepping through
xt <- x[[5]]
xpath <- "//div[contains(@id,'post_message')]"

getxpath <- function(xt,xpath){
  require(XML)

  #either parse
  doc <- htmlParse(file=xt)
  #doc <- htmlTreeParse(tolower(xt), asText = TRUE, useInternalNodes = TRUE)

  #don't know which to use
  #result <- xpathApply(doc,xpath,xmlValue)
  result <- xpathSApply(doc,xpath,xmlValue)

  #clean up
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=result))

  result <- c(result[1:length(result)])

  free(doc)

  #converts group of nodes into 1 data frame with numbers before separate posts
  #require(plyr)
  #xbythread <- ldply(.data=result,.fun=function(x){unlist(x)})

  #don't know what needs to be returned
  result <- Corpus(VectorSource(result))
  #result <- as.PlainTextDocument(result)

  return(result)
}

#call
x2 <- tm_map(x=x,FUN=getxpath,"//div[contains(@id,'post_message')]")

Solution

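I figured this out a while ago: when the page is passed as a URL, htmlParse needs isURL=TRUE. getxpath now takes that URL directly and returns a character vector of cleaned posts rather than a corpus.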

getxpath <- function(xt,xpath){
  require(XML);require(tm)
  #parse the page at URL xt
  doc <- htmlParse(file=xt,isURL=TRUE)
  #text content of every node matching xpath
  resultvector <- xpathSApply(doc,xpath,xmlValue)
  #collapse runs of whitespace (including newlines and tabs) to a single space
  result <- gsub(pattern="\\s+",replacement=" ",x=resultvector)
  return(result)
}

res <- getxpath("http://url.com/board.html","//xpath")

To process all the files, I use list.files to get the file list, Map (or clusterMap) with getxpath() to collect the results in a list, do.call to flatten that list into a single vector, and Corpus(VectorSource(res)) to turn the vector into a corpus with one document per post.
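The steps described above could look roughly like the following. This is only a sketch under some assumptions: getxpath_file is a hypothetical variant of getxpath that reads local files (so it omits isURL=TRUE), and the two tiny HTML pages written to a temporary directory are stand-ins for real vBulletin thread sources.

```r
library(XML)
library(tm)

# Hypothetical local-file variant of getxpath() (an assumption, not the
# answer's exact function): parse one saved page and return its posts.
getxpath_file <- function(path, xpath) {
  doc <- htmlParse(file = path)
  posts <- xpathSApply(doc, xpath, xmlValue)
  # collapse runs of whitespace to a single space
  gsub(pattern = "\\s+", replacement = " ", x = posts)
}

# Stand-in thread pages, written to a temporary directory for illustration.
thread_dir <- tempfile("threads")
dir.create(thread_dir)
writeLines(
  "<html><body><div id='post_message_1'>First post</div><div id='post_message_2'>Second\n post</div></body></html>",
  file.path(thread_dir, "thread1.html")
)
writeLines(
  "<html><body><div id='post_message_3'>Third post</div></body></html>",
  file.path(thread_dir, "thread2.html")
)

xpath <- "//div[contains(@id,'post_message')]"

# 1. list the saved pages
files <- list.files(thread_dir, full.names = TRUE)

# 2. one character vector of posts per file, collected in a list
posts_by_file <- Map(function(f) getxpath_file(f, xpath), files)

# 3. flatten the list into a single vector, one element per post
posts <- do.call(c, unname(posts_by_file))

# 4. build a corpus with one document per post
corpus <- Corpus(VectorSource(posts))
```

Here posts has one element per forum post across all files, so the corpus has as many documents as there are posts rather than as many as there are threads. Swapping Map for parallel::clusterMap would spread the parsing across workers, at the cost of setting up a cluster first.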

Licensed under: CC-BY-SA with attribution
