Question

I'm working on a Java program where I'm reading from a file in dynamic, unknown blocks. That is, each block of data will not always be the same size and the size is determined as data is being read. For I/O I'm using a MappedByteBuffer (the file inputs are on the order of MB).

My goal:

  • Find an efficient way to store each complete block during the input phase so that I can process it.

My constraints:

  • I am reading one byte at a time from the buffer
  • My processing method takes a primitive byte array as input
  • Each block gets processed before the next block is read

What I've tried:

  • I played around with dynamic structures like Lists but they don't have backing arrays and the conversion time to a primitive array concerns me
  • I also thought about using a String to store each block and then getBytes() to get the byte[], but it's so slow
  • I considered reading the file multiple times: once to find the block size, and again to grab the relevant bytes

I am trying to find an approach that doesn't defeat the purpose of fast I/O. Any advice would be greatly appreciated.

Additional Info:

  • I'm using a rolling hash to decide where blocks should end

Here's a bit of pseudo-code:

circular_buffer[] = read first 128 bytes
rolling_hash = hash(circular_buffer[])
block_storage = ??? // this is the data structure I'd like to use
while file has more text
    b = next byte
    add b to block_storage
    add b to next index in circular_buffer (if reached end, start adding/overwriting front)
    shift rolling_hash one byte to the right
    if hash has a certain characteristic
        process block_storage as a byte[] //should contain entire block of data

As you can see, I'm reading one byte at a time, and storing/overwriting that one byte repeatedly. However, once I get to the processing stage, I want to be able to access all of the info in the block. There is no predetermined max size of a block either, so I can't pre-allocate.

Solution

It seems to me that you require a dynamically growing buffer. You can use the built-in ByteArrayOutputStream for that: it grows automatically to hold all data written to it. Use write(int b) for "add b to block_storage" and toByteArray() for "process block_storage as a byte[]". Note that toByteArray() returns a copy of the buffered bytes.
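
A minimal sketch of how the loop from the question might look with ByteArrayOutputStream. The names BlockReader, BoundaryDetector and readBlocks are made up for illustration, and the detector is only a stand-in for the rolling-hash check, which the question does not spell out. Since MappedByteBuffer extends ByteBuffer, a mapped file can be passed in directly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;

final class BlockReader {

    /** Hypothetical stand-in for the rolling-hash test; an implementation may keep state. */
    interface BoundaryDetector {
        boolean endsBlockAfter(byte b);
    }

    /** Reads one byte at a time and hands each complete block to the processor as a byte[]. */
    static void readBlocks(ByteBuffer in, BoundaryDetector detector, Consumer<byte[]> processor) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        while (in.hasRemaining()) {
            byte b = in.get();
            block.write(b); // write(int) stores the low-order 8 bits
            if (detector.endsBlockAfter(b)) {
                processor.accept(block.toByteArray()); // copy of the bytes collected so far
                block = new ByteArrayOutputStream();   // start the next block with a fresh stream
            }
        }
        // Assumption: trailing bytes after the last boundary also form a block.
        if (block.size() > 0) {
            processor.accept(block.toByteArray());
        }
    }
}
```

For example, BlockReader.readBlocks(mappedBuffer, b -> b == 0, this::process) would end a block after every zero byte; a real detector would update the rolling hash and test it instead.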

But take care: this stream grows without bound. Add a sanity check around it to avoid using up all memory (e.g. count the bytes written and throw an exception once the count exceeds a reasonable limit). Also drop the reference to the stream after consuming the block so the GC can free its memory. Closing a ByteArrayOutputStream has no effect, so close() is optional.
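
One way such a sanity check could look, as a sketch only. BoundedBlockBuffer, add and takeBlock are hypothetical names, and the limit is whatever block size you consider unreasonable for your data.

```java
import java.io.ByteArrayOutputStream;

final class BoundedBlockBuffer {
    private final int maxBytes; // assumed upper limit for a single block
    private ByteArrayOutputStream out = new ByteArrayOutputStream();

    BoundedBlockBuffer(int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /** Appends one byte, refusing to grow past the configured limit. */
    void add(byte b) {
        if (out.size() >= maxBytes) {
            throw new IllegalStateException("block exceeds " + maxBytes + " bytes");
        }
        out.write(b);
    }

    /** Returns the current block and drops the old stream so the GC can reclaim it. */
    byte[] takeBlock() {
        byte[] block = out.toByteArray();
        out = new ByteArrayOutputStream();
        return block;
    }
}
```

ByteArrayOutputStream.size() already tracks the number of bytes written, so no separate counter is needed.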

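Edit: As @marcman pointed out, the stream can also be reused by calling reset(), which discards the accumulated bytes but keeps the already allocated buffer.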
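
A small sketch of the reset() variant, again with made-up names (ReusableBlockBuffer, takeBlock) and an arbitrary initial capacity of 4096 bytes. The same stream serves every block, so the internal array only grows when a block is larger than any seen before.

```java
import java.io.ByteArrayOutputStream;

final class ReusableBlockBuffer {
    // Assumption: 4096 is just a starting capacity; the stream still grows on demand.
    private final ByteArrayOutputStream out = new ByteArrayOutputStream(4096);

    void add(byte b) {
        out.write(b);
    }

    int size() {
        return out.size();
    }

    /** Copies out the current block, then empties the stream for the next one. */
    byte[] takeBlock() {
        byte[] block = out.toByteArray(); // still one copy per block
        out.reset();                      // count goes back to 0, capacity is kept
        return block;
    }
}
```

The trade-off compared with replacing the stream: no reallocation per block, but the memory of the largest block seen so far stays reserved for as long as the buffer is referenced.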

Licensed under: CC-BY-SA with attribution