Question

I need to randomly access specific records in a text (ASCII) file and then read from there until a specific "stop sequence" (record delimiter) is found. The file contains multi-line records, each separated by the delimiter, and each record spans a different number of lines. This is a commonly known file format in the specific area of expertise and cannot be changed.

I want to index the file so I can quickly jump to a requested record.

In similar questions like

How to Access string in file by position in Java

and the links in them, the answers always reference the seek() method of various classes like RandomAccessFile. I know about that!

The issue I have is how to get the offset needed for seek(), i.e., how to index the file.

BufferedReader does not have a getFilePointer() method or any other way to get the current byte offset from the start of the file. RandomAccessFile has a readLine() method, but its performance is terrible; it is not usable at all for my case.

I need to read the file line by line and, each time the record delimiter is found, get the current byte offset. How can I achieve this?


Solution 2

After a lot of further googling and trial and error, I came up with a solution that simply wraps RandomAccessFile and exposes all its methods. The readLine() method, however, was much improved by taking the one from BufferedReader with minor adjustments; its performance is now identical.

The resulting class, OptimizedRandomAccessFile, buffers readLine() calls as long as no other method that requires or affects the position in the file is called. E.g., in:

OptimizedRandomAccessFile raf = new OptimizedRandomAccessFile(filePath, "r");
String line = raf.readLine();
int nextByte = raf.read();

nextByte will contain the first byte of the next line in the file.
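A minimal sketch of the same idea (the class and member names here are mine, not the actual OptimizedRandomAccessFile): keep an 8 KB buffer over the RandomAccessFile and track the logical byte offset ourselves, so readLine() is fast and getFilePointer() stays accurate. The sketch assumes a single-byte (ASCII) encoding, as in the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch, not the original class: buffered readLine()
// over a RandomAccessFile with an accurate logical file pointer.
class BufferedRaf {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private int bufLen = 0;   // number of valid bytes in buf
    private int bufPos = 0;   // next byte in buf to consume
    private long filePos = 0; // logical offset of the next unread byte

    BufferedRaf(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
    }

    long getFilePointer() { return filePos; }

    void seek(long pos) throws IOException {
        raf.seek(pos);
        bufLen = bufPos = 0; // invalidate the buffer
        filePos = pos;
    }

    private boolean fill() throws IOException {
        bufPos = 0;
        bufLen = raf.read(buf);
        return bufLen > 0;
    }

    // Reads one line; the '\n' (and a preceding '\r') are consumed
    // but stripped from the result. Returns null at end of file.
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        while (true) {
            if (bufPos >= bufLen && !fill()) {
                return sawAny ? sb.toString() : null;
            }
            sawAny = true;
            while (bufPos < bufLen) {
                byte b = buf[bufPos++];
                filePos++;
                if (b == '\n') return sb.toString();
                if (b != '\r') sb.append((char) b);
            }
        }
    }

    void close() throws IOException { raf.close(); }
}
```

Seeking simply invalidates the buffer, so readLine() after a seek() starts cleanly from the new position.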

The full code can be found on bitbucket.

Other tips

You can try to subclass BufferedReader to remember the read position, but you won't have seek functionality.

As you mentioned, a record can be multi-line, but all records are separated by a stop sequence. Given this, you can use RandomAccessFile like this:

  1. have a byte buffer byte b[] of let's say 8k in size (this is for performance reasons)

  2. read 8k from the file into this buffer and try to find the delimiter; if it is not found, read another 8k block, first appending the data already read to a StringBuilder or some other structure.

  3. when you find the delimiter, its position is given by the number of bytes processed since the last delimiter found (you need to do some simple math).

The tricky part is a record delimiter longer than one character, since it may straddle two blocks, but that should not be a big problem.
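The three steps above can be sketched as follows (the method name and the streaming match counter are my own choices; the simple restart logic assumes the delimiter has no self-overlapping prefix, and a full KMP matcher would cover the general case). The match counter carries over between blocks, so delimiters straddling a block boundary are found too:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch: scan the file in 8 KB blocks and record the byte offset of
// every occurrence of the delimiter.
class DelimiterIndexer {
    static List<Long> indexDelimiters(String path, byte[] delim) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] b = new byte[8192]; // step 1: the 8k buffer
            long filePos = 0; // file offset of b[0]
            int matched = 0;  // delimiter bytes matched so far (survives block boundaries)
            int n;
            while ((n = raf.read(b)) > 0) { // step 2: read block by block
                for (int i = 0; i < n; i++) {
                    if (b[i] == delim[matched]) {
                        if (++matched == delim.length) {
                            // step 3: start offset of this delimiter occurrence
                            offsets.add(filePos + i + 1 - delim.length);
                            matched = 0;
                        }
                    } else {
                        // restart; re-test the current byte as a first char
                        matched = (b[i] == delim[0]) ? 1 : 0;
                    }
                }
                filePos += n;
            }
        }
        return offsets;
    }
}
```

The returned offsets point at the first byte of each delimiter; record starts are then the byte after each delimiter (plus offset 0 for the first record).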

I would use the following sequence of java.io decorators:

   InputStreamReader    <-- reader, the top reader
   CountingInputStream  <-- cis, stores the position (from Google Guava)
   BufferedInputStream  <-- speeds up file reading
   FileInputStream

Then you read from this top reader by implementing a readLine() method that reads chars one by one until a line separator. I would not use BufferedReader, as it would spoil the current position by reading a full fixed-size buffer ahead.

So if I get the problem right, the algorithm is as simple as

  1. long lineStartPosition = cis.getCount();
  2. String s = readLine(reader);
  3. if(s.equals(DELIMITER)) { storeToIndex(lineStartPosition, recordData); }
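A self-contained sketch of that algorithm (CountingStream is a minimal stand-in for Guava's CountingInputStream; all names are illustrative). One caveat: an InputStreamReader on top can read ahead inside its decoder and overshoot the count, so for an ASCII file it is safer to read bytes directly from the counting stream:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal stand-in for Guava's CountingInputStream: counts every byte
// consumed from the underlying stream.
class CountingStream extends FilterInputStream {
    private long count = 0;

    CountingStream(InputStream in) { super(in); }

    long getCount() { return count; }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b != -1) count++;
        return b;
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) count += n;
        return n;
    }
}

class LineScanner {
    // Reads one '\n'-terminated line byte by byte (ASCII assumed);
    // strips '\r' and '\n'; returns null at end of file.
    static String readLine(CountingStream cis) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b = cis.read();
        if (b == -1) return null;
        while (b != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
            b = cis.read();
        }
        return sb.toString();
    }
}
```

Usage follows the three steps above: take lineStart = cis.getCount() before each readLine(), and when the line equals the delimiter, store the recorded positions in the index.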

You can read the whole data file, record where each delimiter is found, and save this metadata in a separate file. Then you can use the metadata to navigate through the data file (jump from one delimiter to the next). Each time the data file is modified, you will have to rescan it and regenerate the metadata.
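A sketch of such a metadata file (the file layout and names are my own): persist the record start offsets to a side file, reload them later, and seek straight to any record without rescanning the data file:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Illustrative index persistence: a count followed by one long offset
// per record.
class RecordIndex {
    static void save(List<Long> offsets, String indexPath) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(indexPath))) {
            out.writeInt(offsets.size());
            for (long off : offsets) out.writeLong(off);
        }
    }

    static List<Long> load(String indexPath) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(indexPath))) {
            int n = in.readInt();
            List<Long> offsets = new ArrayList<>(n);
            for (int i = 0; i < n; i++) offsets.add(in.readLong());
            return offsets;
        }
    }

    // Jump to a stored offset and read len bytes from there.
    static byte[] readAt(String dataPath, long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            raf.seek(offset);
            byte[] b = new byte[len];
            raf.readFully(b);
            return b;
        }
    }
}
```

In practice you would read from the stored offset until the next stop sequence rather than a fixed length; readAt() only demonstrates the seek.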

Licensed under: CC-BY-SA with attribution