agstudy's answer works great, but using it on a slow computer proved mildly problematic.
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total>0,]
toc()
4.859 sec elapsed
(this was done with a 4000x15000 dtm)
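The bottleneck appears to be applying sum() to a sparse matrix: apply() coerces its argument to a dense matrix before calling sum() on each row, which is expensive at this size.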
A document-term matrix created by the tm package contains the components i and j, which are the row and column indices of the entries stored in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document-term matrices, but may become significant with larger matrices.
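To see the idea on something small enough to inspect by hand, here is a minimal sketch on a toy matrix. It assumes the slam package is installed (tm stores its document-term matrices as slam simple_triplet_matrix objects); the matrix m, its dimnames, and the variable names keep and m.new are made up for illustration. The sort() call is a defensive extra of mine, in case the i indices are not already ordered.

```r
library(slam)  # provides simple_triplet_matrix and row_sums

# Toy 4 x 3 sparse matrix (hypothetical data).
# Row 3 has no stored entries, so the index 3 never appears in i.
m <- simple_triplet_matrix(
  i = c(1, 1, 2, 4),   # row indices of the stored entries
  j = c(1, 3, 2, 2),   # column indices of the stored entries
  v = c(2, 1, 5, 1),   # the stored (non-zero) values
  nrow = 4, ncol = 3,
  dimnames = list(Docs = paste0("doc", 1:4), Terms = c("a", "b", "c"))
)

# Rows that have at least one stored entry; sort() keeps the original order
# even if i happens to be unordered.
keep <- sort(unique(m$i))

# Subset the sparse matrix directly, without converting it to a dense matrix.
m.new <- m[keep, ]

dim(m.new)
rownames(m.new)
```

Note that this relies on the matrix storing only non-zero values. If a matrix carried explicit zeros in v, a row made up entirely of such zeros would still show up in i; in that case slam::row_sums(m) > 0 is a sparse-aware way to build the same row filter.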