Question

I'm trying to do some text mining and word-cloud visualization on Spanish text. I have 9 different .txt files, but will post just one for reproduction.

"Nos los representantes del pueblo de la Nación ARGENTINA, reunidos en Congreso General Constituyente por voluntad y elección de las provincias que la componen, en cumplimiento de pactos preexistentes, con el objeto de constituir la unión nacional, afianzar la justicia, consolidar la paz interior, proveer la defensa común, promover el bienestar general, y asegurar los beneficios de la libertad, para nosotros, para nuestra posteridad, y para todos los hombres del mundo que quieran habitar en el suelo argentino: invocando la protección de Dios, fuente de toda razón y justicia: ordenamos, decretamos y establecemos esta Constitución, para la Nación ARGENTINA."

The text is saved as a .txt file. Below is my naive attempt to generate the term-document matrix with the correct encoding. When I inspect the result, the terms do not match the original file ("constitución" becomes "constitucif3n", for example). I'm new to text mining, and since the solution probably involves several interdependent adjustments, I figured I'd ask here rather than search for 4 hours. Thanks in advance.

# Generate term-document matrix
library(tm)

# Convert text to corpus and clean
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
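  # Wrap base functions in content_transformer() so tm (>= 0.6) keeps the documents intact
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))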
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("spanish"))
  return(corpus.tmp)
}

generateTDM <- function(path) {
  cor.tmp <- Corpus(DirSource(directory=path, encoding="ISO8859-1"))
  cor.cl <- cleanCorpus(cor.tmp)
  tdm.tmp <- TermDocumentMatrix(cor.cl)
  tdm.s <- removeSparseTerms(tdm.tmp, 0.7)
  return(tdm.s)
}

tdm <- generateTDM(pathname)
tdm.m <- as.matrix(tdm)

Solution

Answer: Make sure the original text file is saved with UTF-8 encoding. To do this, I had to change the saving preferences in TextEdit on the Mac. After that, everything worked seamlessly.
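
If re-saving each file by hand is inconvenient, the same conversion can probably be scripted in base R. The sketch below is only an illustration: `to_utf8` is a hypothetical helper, and it assumes the files are currently Latin-1 (ISO-8859-1), where "ó" is the single byte 0xF3. If the files use a different encoding, the `from` argument would need to change.

```r
# Hypothetical helper (not part of tm): re-encode one text file to UTF-8.
# Assumption: the input is Latin-1; pass another `from` value otherwise.
to_utf8 <- function(infile, outfile, from = "latin1") {
  # Read the file as raw bytes so no locale-dependent decoding happens
  bytes <- readBin(infile, what = "raw", n = file.info(infile)$size)
  txt <- rawToChar(bytes)
  # Convert the byte sequence from the assumed source encoding to UTF-8
  converted <- iconv(txt, from = from, to = "UTF-8")
  # Write the UTF-8 bytes back out unchanged
  writeBin(charToRaw(converted), outfile)
  invisible(outfile)
}

# Demo: build a small Latin-1 file containing "constitución" and convert it
latin1_bytes <- c(charToRaw("constituci"), as.raw(0xF3), charToRaw("n"))
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeBin(latin1_bytes, src)
to_utf8(src, dst)

# In UTF-8, "ó" is the two-byte sequence C3 B3
result_bytes <- readBin(dst, what = "raw", n = file.info(dst)$size)
print(result_bytes)
```

Once the files are UTF-8, the corpus would presumably be read with `DirSource(directory=path, encoding="UTF-8")` rather than `"ISO8859-1"`, so that the declared encoding matches the files on disk.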

Licensed under: CC-BY-SA with attribution