Question

I want to replace all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column holds the word to be matched and the second column holds its replacement.

I am stuck on the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map.

Please consider the following minimal working example (MWE):

library(tm)

docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))

dictionary <- data.frame(word = c('first', 'second', 'text'),
                      translation = c('primo', 'secondo', 'testo'))

translate <- function(text, dictionary) {
  # Would like to replace each word of text with corresponding word in dictionary
}

corp_translated <- tm_map (corp, translate)

inspect(corp_translated)

# Expected result

# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# [[1]]
# primo testo

# [[2]]
# secondo testo

Solution

I would suggest not using a data.frame for the dictionary: a named vector, the basic object in R, already works as a lookup table.

   dict <- c('primo', 'secondo', 'testo')
   names(dict) <- c('first', 'second', 'text')

Then to "translate" x, where x might be "second", you simply use:

   dict[[x]]

You don't even need a wrapper function.
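
The lookup above handles one word at a time. For whole strings such as the documents in the question, something along these lines might work. It is only a sketch: it assumes words are separated by single spaces and that words missing from the dictionary should be left unchanged. `translate_text` is a hypothetical helper name, not part of tm.

```r
# The question's data frame, converted to a named vector.
# as.character() guards against factor columns in older R versions.
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"))
dict <- setNames(as.character(dictionary$translation),
                 as.character(dictionary$word))

# Hypothetical helper: translate every space-separated word of each string
translate_text <- function(x, dict) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    hit <- words %in% names(dict)      # words that have a translation
    words[hit] <- dict[words[hit]]     # look them up by name
    paste(words, collapse = " ")       # unknown words pass through unchanged
  }, character(1))
}

translated <- translate_text(c("first text", "second text"), dict)
print(translated)
```

With a recent version of tm, a function like this would presumably be passed as `tm_map(corp, content_transformer(function(x) translate_text(x, dict)))`.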


If you want to translate in the opposite direction, use

   names(dict)[dict %in% x]

Or you can flip the dictionary

   dict.flip <- names(dict)
   names(dict.flip) <- dict
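
As a small illustration of the flipped dictionary, the sketch below builds it in one step with `setNames` and looks up a few words. It assumes every translation is unique; with duplicated translations only the first matching name is found.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

# Swap names and values: translations become the names
dict.flip <- setNames(names(dict), dict)

# Single-word lookup, as with the forward dictionary
back_one <- dict.flip[["secondo"]]

# Several words at once; unname() drops the lookup keys from the result
back_many <- unname(dict.flip[c("testo", "primo")])

print(back_one)
print(back_many)
```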

Other tips

In combination with the tm_map function of the tm package, you can use stri_replace_all_fixed from package stringi. For instance:

library(tm)
library(stringi)

docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))

word <- c('first', 'second', 'text')
tran <- c('primo', 'secondo', 'testo')

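# content_transformer() keeps the result a valid corpus in recent versions of tm
corp <- tm_map(corp, content_transformer(function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE)))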
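
Note that fixed matching replaces substrings, so "text" inside "context" would also be rewritten. If only whole words should be translated, one possible variant is to wrap each word in `\b` word boundaries and use `stri_replace_all_regex` instead. This is a sketch that assumes the dictionary words contain no regex metacharacters. Replacements are applied one after another, so a translation that equals a later dictionary word would be translated again.

```r
library(stringi)

word <- c("first", "second", "text")
tran <- c("primo", "secondo", "testo")

# \b anchors each pattern at word boundaries so only whole words match
pattern <- paste0("\\b", word, "\\b")

x <- c("first text", "second context")
whole_words <- stri_replace_all_regex(x, pattern, tran, vectorize_all = FALSE)
print(whole_words)
```

The same anonymous-function approach as above should apply when passing this to tm_map.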
Licensed under: CC-BY-SA with attribution