Question

Hello, I am looking for the fastest, but rather high-level, way to work with a large data collection. My task consists of two parts: read a lot of large files into memory, and then do some statistical calculations (the easiest structure to work with for this task is a random-access array).

My first approach was to use java.io.ByteArrayOutputStream, because it can resize its internal storage.

def packTo(buf: java.io.ByteArrayOutputStream, f: File): Unit = {
  try {
    val fs = new java.io.FileInputStream(f)
    try IOUtils.copy(fs, buf)   // Apache Commons IO
    finally fs.close()          // always release the file handle
  } catch {
    case e: java.io.FileNotFoundException => // skip missing files
  }
}

    val buf = new java.io.ByteArrayOutputStream()
    files foreach { f: File => packTo(buf, f) }
    println(buf.size())

    for (i <- 0 until buf.size()) {
      for (j <- 0 until buf.size()) {
        for (k <- 0 until buf.size()) {
          // println("i  " + i + "  " + buf(i));
          // Calculate something amazing using buf(i) buf(j) buf(k)
        }
      }
    }

    println("amazing = " + ???)

But ByteArrayOutputStream can't give me the data as a byte[]; toByteArray only returns a copy of it, and I cannot afford to hold two copies of the data in memory.
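If the goal is one byte[] per file with no intermediate copy, a possible alternative (assuming Java 7+ is available) is java.nio.file.Files.readAllBytes, which allocates the result array directly from the file size. A minimal sketch:

```scala
import java.nio.file.{Files, Paths}

object ReadWhole {
  // Read a file straight into a byte[] sized from the file length,
  // avoiding the extra copy that ByteArrayOutputStream.toByteArray makes.
  def readWhole(path: String): Array[Byte] =
    Files.readAllBytes(Paths.get(path))
}
```

Random access is then plain array indexing, e.g. `bytes(i)`.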


Solution

Have you tried scala-io? It should be as simple as Resource.fromFile(f).byteArray with it.

Other tips

Scala's built-in library already provides a nice API to do this:

io.Source.fromFile("/file/path").mkString.getBytes

However, it is often not a good idea to load a whole file as a byte array into memory (and note that going through mkString decodes the bytes with a charset and re-encodes them, so it is only safe for text files). Do make sure the largest possible file can still fit into your JVM heap.
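One rough way to act on that advice is to compare the file size against the memory the JVM can still make available before loading. This is only a heuristic sketch (the name fitsInHeap and the safety margin are my own, not from the answer):

```scala
import java.io.File

object HeapCheck {
  // Rough heuristic: a file "fits" if its length is below the memory
  // the JVM can still grow into (maxMemory minus memory already in use).
  def fitsInHeap(f: File): Boolean = {
    val rt        = Runtime.getRuntime
    val used      = rt.totalMemory - rt.freeMemory
    val available = rt.maxMemory - used
    f.length < available
  }
}
```

In practice you would also leave headroom for the statistics pass itself, not just the raw bytes.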

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow