1) Question: How do you read and tokenize texts containing the letter 'я'? Answer: write your own tokenizer and use it. For example:
my_tokenizer <- function(x)
{
  # Convert to UTF-8, then split on runs of whitespace and/or punctuation.
  strsplit(iconv(x, to='UTF-8'), split='([[:space:]]|[[:punct:]])+', perl=FALSE)[[1]]
}
TDM <- TermDocumentMatrix(corpus,control=list(tokenize=my_tokenizer, weighting=weightTf, wordLengths = c(3,10)))
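To see what the tokenizer does before plugging it into TermDocumentMatrix, it can be run on a plain string in base R. This is only a sketch: the sample sentences are made up for illustration, and the Cyrillic result depends on your session locale, because iconv is called without `from` and so treats the input as being in the native encoding.

```r
# Same tokenizer as above, repeated so the snippet runs on its own.
my_tokenizer <- function(x)
{
  strsplit(iconv(x, to = "UTF-8"),
           split = "([[:space:]]|[[:punct:]])+", perl = FALSE)[[1]]
}

# Hypothetical sample sentence containing the letter 'я'.
# Assumes the session locale can represent Cyrillic text.
sample_text <- "Моя семья любит яблоки и ягоды."
tokens <- my_tokenizer(sample_text)
print(tokens)
print(Encoding(tokens))

# ASCII input behaves the same way in any locale:
# separators are dropped, and a trailing separator adds no empty token.
ascii_tokens <- my_tokenizer("one, two  three!")
print(ascii_tokens)
```

If the Cyrillic tokens come back as NA, iconv could not interpret the input in the current locale, which is the kind of encoding mismatch the second point below is about.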
2) Performance depends heavily on the behavior of the tolower function. This may be a bug, I don't know, but every time you call it you first have to convert your text to the native encoding with enc2native (unless your text is in English, of course).
doc.corpus <- Corpus(VectorSource(enc2native(textVector)))
Moreover, after all the preprocessing of your corpus you have to convert it again, because TermDocumentMatrix and many other functions in the tm package use tolower internally.
doc.corpus <- tm_map(doc.corpus, enc2native)
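A small base-R check can show what enc2native does to a string before tolower is applied. This is a sketch under assumptions: the word is an arbitrary example written with Unicode escapes, and the printed encodings and the lowercased result will differ between platforms and locales.

```r
# "ЯБЛОКО" written with Unicode escapes, so the literal does not
# depend on the encoding of the source file.
word_utf8 <- "\u042f\u0411\u041b\u041e\u041a\u041e"
print(Encoding(word_utf8))      # encoding mark of the literal

# Convert to the session's native encoding, as done before tolower above.
word_native <- enc2native(word_utf8)
print(Encoding(word_native))    # mark after conversion (locale dependent)

# Inspect the locale that tolower will work in.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info())

# Lowercasing the natively encoded string.
print(tolower(word_native))
```

On a session whose native encoding is already UTF-8 the conversion is effectively a no-op; the difference shows up mainly on systems with a single-byte native code page.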
So the full flow will look something like this:
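createCorp <- function(textVector)
{
  # Build the corpus from text converted to the native encoding.
  doc.corpus <- Corpus(VectorSource(enc2native(textVector)))
  # Note: in tm 0.6 and later, base functions such as tolower and enc2native
  # need to be wrapped, e.g. tm_map(doc.corpus, content_transformer(tolower)).
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("russian"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "russian")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  # Convert again, because TermDocumentMatrix calls tolower internally.
  return(tm_map(doc.corpus, enc2native))
}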
my_tokenizer <- function(x)
{
  strsplit(iconv(x, to='UTF-8'), split='([[:space:]]|[[:punct:]])+', perl=FALSE)[[1]]
}
TDM <- TermDocumentMatrix(corpus,control=list(tokenize=my_tokenizer, weighting=weightTf, wordLengths = c(3,10)))