I would do this with the XML package. As a rule, avoid regular expressions when
working with HTML/XML documents and use XPath instead. Here is a small function
that, given a link, creates the corpus:
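library(XML)
library(tm)
create.corpus <- function(link){
  ## parse the page and extract the text of every <p> node
  doc <- htmlParse(link)
  parag <- xpathSApply(doc, '//p', xmlValue)
  ## build the corpus with one document per paragraph
  cc <- Corpus(VectorSource(parag))
  ## record the source URL as corpus-level metadata
  meta(cc, tag = 'link', type = 'corpus') <- link
  cc
}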
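## call it (link holds the URL of the page to parse)
link <- "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
cc <- create.corpus(link)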
Inspecting the result:
meta(cc,type='corpus')
# $create_date
# [1] "2014-01-03 17:40:50 GMT"
#
# $creator
# [1] ""
#
# $link
# [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
cc
# A corpus with 36 text documents
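If `//p` picks up too much (menus, footers and so on), the XPath expression can be narrowed instead of cleaning the text afterwards with regular expressions. A minimal sketch, assuming a made-up inline page and a hypothetical `article` class name; the real page's structure will differ, so inspect it first:

```r
library(XML)

# Hypothetical inline HTML standing in for a downloaded page
html <- '<html><body>
  <div class="nav"><p>Menu</p></div>
  <div class="article">
    <p>First paragraph.</p>
    <p>Second <b>bold</b> paragraph.</p>
  </div>
</body></html>'

# asText = TRUE tells htmlParse the argument is markup, not a file name or URL
doc <- htmlParse(html, asText = TRUE)

# Every <p> in the document, as in create.corpus
all.p <- xpathSApply(doc, '//p', xmlValue)

# Only the <p> nodes below the div whose class attribute is "article"
article.p <- xpathSApply(doc, '//div[@class="article"]//p', xmlValue)

print(all.p)
print(article.p)
```

Note that `xmlValue` returns the text of a node including its descendants, so inline markup such as `<b>` is flattened without any extra work.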