Recursively walk a LARGE directory using Scala 2.8 continuations
25-05-2021
Question
Is it possible to recursively walk a directory using Scala continuations (introduced in 2.8)?
My directory contains millions of files, so I cannot use a Stream
because I will run out of memory. I am trying to write an Actor
dispatcher so that worker actors process the files in parallel.
Does anyone have an example?
Solution
If you want to stick with Java 1.6 (as opposed to FileVisitor
in 1.7), and your millions of files are spread across subdirectories rather than all in one directory, you can use an iterator like this:
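import java.io.File

// Lazily walks the tree under f, yielding every entry below f (not f itself).
class DirectoryIterator(f: File) extends Iterator[File] {
  // listFiles returns null if f is not a readable directory
  private[this] val fs = Option(f.listFiles).getOrElse(Array[File]())
  private[this] var i = -1
  // Iterator over the most recently returned subdirectory, if any
  private[this] var recurse: DirectoryIterator = null
  def hasNext: Boolean = {
    if (recurse != null && recurse.hasNext) true
    else (i+1 < fs.length)
  }
  def next(): File = {
    if (recurse != null && recurse.hasNext) recurse.next()
    else if (i+1 >= fs.length) {
      throw new java.util.NoSuchElementException("next on empty file iterator")
    }
    else {
      i += 1
      // Return the directory itself now; its contents follow on later calls
      if (fs(i).isDirectory) recurse = new DirectoryIterator(fs(i))
      fs(i)
    }
  }
}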
This requires that your filesystem has no loops (for example, a symbolic link pointing back to a parent directory). If it does have loops, you need to keep track of the directories you have visited in a set and avoid recursing into them again. (If you don't even want to hit files twice when they're linked from two different places, you have to put everything into a set, and then there's not much point in using an iterator instead of just reading all the file info into memory.)
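The loop-avoidance idea above could look roughly like the following. This is only a sketch and not part of the original answer: the class name LoopSafeDirectoryIterator is made up for illustration, it uses an explicit stack instead of nested iterators, and it assumes that File.getCanonicalPath resolves symbolic links on your platform. Only directories go into the set, so memory grows with the number of directories, not the number of files.

```scala
import java.io.File
import scala.collection.mutable

// Hypothetical loop-safe variant: remembers the canonical path of every
// directory it has entered and never lists the same directory twice.
class LoopSafeDirectoryIterator(root: File) extends Iterator[File] {
  // Canonical paths of directories already entered
  private[this] val visited = mutable.Set[String]()
  // Entries that have been discovered but not yet returned
  private[this] val pending = new java.util.ArrayDeque[File]()

  enter(root)

  // Queue the children of dir, unless dir was already entered before
  private[this] def enter(dir: File): Unit = {
    if (visited.add(dir.getCanonicalPath)) {
      // listFiles returns null if dir is not a readable directory
      val children = Option(dir.listFiles).getOrElse(Array.empty[File])
      children.foreach(c => pending.push(c))
    }
  }

  def hasNext: Boolean = !pending.isEmpty

  def next(): File = {
    if (pending.isEmpty) {
      throw new java.util.NoSuchElementException("next on empty file iterator")
    }
    val f = pending.pop()
    // A link back to a visited directory is still returned, but not entered
    if (f.isDirectory) enter(f)
    f
  }
}
```

Note that the visiting order differs from DirectoryIterator because of the stack, and that a directory reachable through two links is returned once per link but its contents are listed only once.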
OTHER TIPS
This is more questioning the question than an answer.
If your process is I/O bound, parallel processing may not improve your throughput much. In many cases, it will make it worse, by causing disk head thrashing. Before you do much along this line, see how busy the disk is. If it's already busy most of the time with a single thread, at most one more thread will be useful - and even that may be counterproductive.
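One way to check this on your own data is to time the same traversal with one worker and then with two. The sketch below is an assumption-laden illustration, not the asker's actor design: it swaps in a plain java.util.concurrent thread pool, and the names BoundedFileProcessing, timeProcessing, workers and maxInFlight are invented here. The semaphore caps how many files are queued or running at once, so the iterator is never drained into memory.

```scala
import java.io.File
import java.util.concurrent.{Executors, Semaphore, TimeUnit}

object BoundedFileProcessing {
  // Runs process on every file using `workers` threads and returns the
  // elapsed wall-clock time in milliseconds. At most `maxInFlight` files
  // are queued or running at any moment.
  def timeProcessing(files: Iterator[File], workers: Int, maxInFlight: Int)
                    (process: File => Unit): Long = {
    val pool = Executors.newFixedThreadPool(workers)
    val permits = new Semaphore(maxInFlight)
    val start = System.nanoTime()
    try {
      files.foreach { f =>
        permits.acquire() // blocks the producer while the workers are behind
        pool.execute(new Runnable {
          def run(): Unit = try process(f) finally permits.release()
        })
      }
    } finally {
      pool.shutdown()
      pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
    }
    (System.nanoTime() - start) / 1000000L
  }
}
```

For example, compare BoundedFileProcessing.timeProcessing(new DirectoryIterator(root), 1, 64)(work) against the same call with 2 workers, where root and work stand for your own directory and per-file function. Repeated runs over the same tree may be skewed by the operating system's file cache, so treat the numbers as a rough guide.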
What about using an Iterator?