Question

I have a list of URLs and have extracted the content as follows:

library(httr)
library(stringr)  # provides str_extract_all
link="http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
get.link=GET(link)
get.content=content(get.link,as="text")
extract.content=str_extract_all(get.content,"<p>(.*?)</p>")

This gives a "list of 1" containing the text. The length of the extracted content varies with the URL. I would like to bind the URL [link] to the content [extract.content], turn the result into a data frame, and then import that into a Corpus. My attempts fail; e.g. this does not work because the rows have different lengths:

all=data.frame(url.vec=c(link1,link2),text.vec=c(extract.content1,extract.content2))

Does anyone know how to combine a character vector with a list of character vectors?


Solution

I would do this using the XML package. You should avoid using regular expressions on HTML/XML documents; use XPath instead. Here is a small function that, given a link, creates the corpus.

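library(XML)
library(tm)
## build a corpus from the <p> elements of a page, one document per paragraph
create.corpus <- function(link){
  doc <- htmlParse(link)                    # parse the page at this URL
  parag <- xpathSApply(doc,'//p',xmlValue)  # text of every <p> node
  cc <- Corpus(VectorSource(parag))
  meta(cc,type='corpus','link') <- link     # keep the source URL as corpus metadata
  cc
}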
## call it 
cc <- create.corpus(link)

Inspecting the result:

 meta(cc,type='corpus')
# $create_date
# [1] "2014-01-03 17:40:50 GMT"
# 
# $creator
# [1] ""
# 
# $link
# [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"

> cc
# A corpus with 36 text documents
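
If a data frame is still wanted, as in the question, one option is a "long" layout with one row per paragraph and the URL repeated for each. The following is only a minimal sketch in base R under stated assumptions: the URLs and paragraph texts are made-up placeholders, and in practice `paragraphs` would hold the per-link result of the XPath extraction shown above.

```r
## Hypothetical inputs: two placeholder URLs and their extracted paragraphs.
## In real use the list could come from something like
##   lapply(links, function(l) xpathSApply(htmlParse(l), '//p', xmlValue))
links <- c("http://example.com/a", "http://example.com/b")
paragraphs <- list(
  c("first paragraph", "second paragraph", "third paragraph"),
  c("only paragraph")
)

## Repeat each URL once per paragraph so both columns have the same length.
all.df <- data.frame(
  url  = rep(links, times = lengths(paragraphs)),
  text = unlist(paragraphs),
  stringsAsFactors = FALSE
)

nrow(all.df)  # 4: three rows for the first link, one for the second
```

A link with no paragraphs contributes zero rows in this layout, so it disappears from the data frame. The `text` column can then be passed to `VectorSource()` just as `parag` is in `create.corpus`.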
Licensed under: CC-BY-SA with attribution