R tm package: utf-8 text

Question 1

The problem comes from the default tokenizer. tm by default uses scan_tokenizer which it looses encoding(maybe you should contact the maintainer to add an encoding argument).

scan_tokenizer function (x) { scan(text = x, what = "character", quote = "", quiet = TRUE) }

One solution is to provide your own tokenizer to create the matrix terms. I am using strsplit:

scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))

Then you get the result well encoded:

findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"

Question 2

Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using :

Encoding(text)  <- "UTF-8"

Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), it'll throw an error.

text  <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."

Encoding(text)
# [1] "unknown"

Encoding(text)  <- "UTF-8"
# [1] "UTF-8"

ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))

content(ap.corpus[[1]])

ap.tdm <- TermDocumentMatrix(ap.corpus)

ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)

print(table(ap.d$freq))
# 1  2  3 
# 62  5  1 

print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"

It worked for me, hope it does for you too.