Recursively walk a LARGE directory using Scala 2.8 continuations

https://stackoverflow.com/questions/9827181

25-05-2021
Question

Is it possible to recursively walk a directory using Scala continuations (introduced in 2.8)?

My directory contains millions of files, so I cannot use a Stream because I will get an out-of-memory. I am trying to write an Actor dispatch to have worker actors process the files in parallel.

Does anyone have an example?

Solution

If you want to stick with Java 1.6 (as opposed to FileVisitor in Java 1.7), and your millions of files are spread across subdirectories rather than all sitting in a single directory, you can use a lazy iterator that only lists one directory at a time:

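import java.io.File

// Lazily walks the tree under f, holding one directory listing per nesting level.
class DirectoryIterator(f: File) extends Iterator[File] {
  // listFiles returns null if f is not a readable directory
  private[this] val fs = Option(f.listFiles).getOrElse(Array[File]())
  private[this] var i = -1  // index of the last entry returned from fs
  private[this] var recurse: DirectoryIterator = null  // walker for the most recent subdirectory
  def hasNext: Boolean = {
    if (recurse != null && recurse.hasNext) true
    else (i+1 < fs.length)
  }
  def next(): File = {
    if (recurse != null && recurse.hasNext) recurse.next()
    else if (i+1 >= fs.length) {
      throw new java.util.NoSuchElementException("next on empty file iterator")
    }
    else {
      i += 1
      // Return the directory itself now; its contents follow on later calls.
      if (fs(i).isDirectory) recurse = new DirectoryIterator(fs(i))
      fs(i)
    }
  }
}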

This requires that your filesystem has no loops (for example, a symbolic link that points back at one of its own parent directories). If it does have loops, you need to keep track of the directories you have already visited in a set and avoid recursing into them again. (If you also don't want to hit a file twice when it is linked from two different places, you have to put everything into a set, and then there's not much point using an iterator instead of just reading all the file info into memory.)
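As a rough sketch of the "keep a set of visited directories" idea, here is one way it could look. The class name LoopSafeDirectoryIterator is made up for illustration, and comparing directories by getCanonicalPath is an assumption about how loops show up (symlinks resolving to an already-seen directory); it is not part of the original answer. It also uses an explicit stack instead of nested iterators.

```scala
import java.io.File
import scala.collection.mutable

// Hypothetical loop-safe variant: depth-first walk with an explicit stack,
// remembering the canonical path of every directory it has expanded.
class LoopSafeDirectoryIterator(root: File) extends Iterator[File] {
  // Directories already expanded; canonical paths make symlinked aliases compare equal.
  private[this] val visited = mutable.Set[String](root.getCanonicalPath)
  // One pending listing per directory on the current path from the root.
  private[this] var stack: List[Iterator[File]] = List(list(root))

  // listFiles returns null for unreadable or non-directory entries.
  private[this] def list(dir: File): Iterator[File] =
    Option(dir.listFiles).getOrElse(Array.empty[File]).iterator

  def hasNext: Boolean = {
    // Drop exhausted listings until one with entries remains.
    while (stack.nonEmpty && !stack.head.hasNext) stack = stack.tail
    stack.nonEmpty
  }

  def next(): File = {
    if (!hasNext) throw new java.util.NoSuchElementException("next on empty file iterator")
    val f = stack.head.next()
    // Descend only into directories not seen before (add returns false for repeats).
    if (f.isDirectory && visited.add(f.getCanonicalPath)) stack = list(f) :: stack
    f
  }
}
```

The set only grows with the number of directories, not the number of files, so it should stay manageable as long as the millions of files live in a much smaller number of directories.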

OTHER TIPS

This is more questioning the question than an answer.

If your process is I/O bound, parallel processing may not improve your throughput much. In many cases, it will make it worse, by causing disk head thrashing. Before you do much along this line, see how busy the disk is. If it's already busy most of the time with a single thread, at most one more thread will be useful - and even that may be counterproductive.
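If you do want to experiment with the thread count, something along these lines makes it a single parameter you can vary while watching the disk. This is only a sketch using plain java.util.concurrent rather than actors; BoundedWalk, processAll, threads and maxInFlight are hypothetical names, and the semaphore is there so the walker cannot queue up millions of pending tasks.

```scala
import java.io.File
import java.util.concurrent.{Executors, Semaphore, TimeUnit}

object BoundedWalk {
  // Hypothetical helper: runs `work` on each file with `threads` workers,
  // keeping at most `maxInFlight` tasks queued or running at any time.
  def processAll(files: Iterator[File], threads: Int, maxInFlight: Int)(work: File => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    val permits = new Semaphore(maxInFlight)
    try {
      for (f <- files) {
        permits.acquire() // blocks the walker while the workers are behind
        pool.execute(new Runnable {
          def run(): Unit = try work(f) finally permits.release()
        })
      }
    } finally {
      pool.shutdown()
      // Wait for the tasks that were already submitted to finish.
      pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
    }
  }
}
```

For example, BoundedWalk.processAll(someFileIterator, threads = 1, maxInFlight = 64)(f => println(f)) gives a single-threaded baseline; rerunning with threads = 2 shows whether a second worker actually helps on your disk.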

What about using an Iterator?

Licensed under: CC-BY-SA with attribution
