There's no code in your question, so it's not really suitable for this site. That said, here are some comments that might be useful; if you supply code, you'll get more specific and useful answers.
Yes, breaking the text into chunks is common and advisable; exact sizes are a matter of taste. It is often done within R, though I've also done it before building the corpus. You might also subset only nouns, as @holzben suggests (there's a sketch of that after the chunking code below). Here's some code for cutting a corpus into chunks:

```r
corpus_chunk <- function(x, corpus, n) {
  # convert corpus to list of character vectors
  message("converting corpus to list of vectors...")
  listofwords <- vector("list", length(corpus))
  for (i in seq_along(corpus)) {
    listofwords[[i]] <- corpus[[i]]
  }
  message("done")

  # divide each vector into chunks of n words
  # from http://stackoverflow.com/q/16232467/1036500
  # ceiling() avoids an empty trailing chunk when the word count
  # is an exact multiple of n
  f <- function(x) {
    y  <- unlist(strsplit(x, " "))
    ly <- length(y)
    split(y, gl(ceiling(ly / n), n, ly))
  }
  message("splitting documents into chunks...")
  # lapply (rather than sapply) guarantees a list even when every
  # document produces the same number of chunks
  listofnwords1 <- lapply(listofwords, f)
  listofnwords2 <- unlist(listofnwords1, recursive = FALSE)
  message("done")

  # append IDs to list items so we can get bibliographic data for each chunk
  lengths <- sapply(listofnwords1, length)
  names(listofnwords2) <- unlist(lapply(seq_along(lengths), function(i)
    rep(x$bibliodata$x[i], lengths[i])))
  names(listofnwords2) <- paste0(names(listofnwords2), "_",
                                 unlist(lapply(lengths, seq_len)))
  return(listofnwords2)
}
```
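To see what it returns, here's a quick toy run. This is just a sketch: the function only needs `corpus[[i]]` to yield a character string, so a plain list of documents stands in for a real corpus, and `x` is assumed to carry a `bibliodata` data frame with one ID per document, as the function expects.

```r
# toy input: corpus[[i]] must return a character string, and
# x$bibliodata$x must hold one ID per document (both assumptions
# match what corpus_chunk() expects)
corpus <- list("the quick brown fox jumps over the lazy dog",
               "pack my box with five dozen liquor jugs again")
x <- list(bibliodata = data.frame(x = c("doc1", "doc2"),
                                  stringsAsFactors = FALSE))

chunks <- corpus_chunk(x, corpus, n = 4)
names(chunks)
# "doc1_1" "doc1_2" "doc1_3" "doc2_1" "doc2_2" "doc2_3"
```

As for subsetting nouns, here's a sketch using openNLP's part-of-speech tagger. It assumes you have openNLP plus the openNLPmodels.en package installed; the annotator names come from openNLP's help pages.

```r
# keep only the nouns (Penn Treebank tags NN, NNS, NNP, NNPS)
library(NLP)
library(openNLP)  # also needs openNLPmodels.en for the English models

s <- as.String("The quick brown fox jumps over the lazy dog")
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator(),
                      Maxent_POS_Tag_Annotator()))
w <- subset(a, type == "word")
tags <- sapply(w$features, `[[`, "POS")
s[w][tags %in% c("NN", "NNS", "NNP", "NNPS")]
# "fox" "dog"
```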
Yes, you might make a start with some code and then come back with a more specific question. That's how you'll get the most out of this site.
For a basic introduction to text mining and topic modelling, see Matthew Jockers' book *Text Analysis with R for Students of Literature*.
If you're already a little familiar with MALLET, then try rmallet for topic modelling. There are lots of code snippets on the web that use it; here's one of mine.
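If it helps to see the moving parts in one place, here's a minimal sketch of the basic workflow with the mallet package (the CRAN release of rmallet). The `chunks` object is assumed to come from `corpus_chunk()` above, and the function and argument names follow the examples shipped with the package.

```r
# a minimal sketch, assuming 'chunks' is the named list returned by
# corpus_chunk() above
library(mallet)

documents <- sapply(chunks, paste, collapse = " ")  # one string per chunk

stoplist <- tempfile()  # mallet.import() wants a stopword file on disk
writeLines(c("the", "of", "and", "a"), stoplist)

instances <- mallet.import(id.array      = names(documents),
                           text.array    = unname(documents),
                           stoplist.file = stoplist)

topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(instances)
topic.model$train(200)  # 200 sampling iterations

# inspect the top 10 words in topic 1
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, topic.words[1, ], 10)
```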