Question

I need to randomly access specific records in a text (ASCII) file and then read from there until a specific "stop sequence" (record delimiter) is found. The file contains multi-line records, each separated by the delimiter, and each record spans a different number of lines. This is a commonly known file format in the specific area of expertise and cannot be changed.

I want to index the file so I can quickly jump to a requested record.

In similar questions like

How to Access string in file by position in Java

and the links in them, the answers always reference the seek() method of various classes like RandomAccessFile. I know about that!

The issue I have is how to get the offset needed for seek(), i.e., how to index the file.

BufferedReader does not have a getFilePointer() method or any other way to get the current byte offset from the start of the file. RandomAccessFile has a readLine() method, but its performance is terrible; it is not usable at all for my case.

I need to read the file line by line and, each time the record delimiter is found, get the current byte offset. How can I achieve this?


Solution 2

After a lot of further googling and trial and error, I came up with a solution that simply wraps RandomAccessFile and exposes all its methods. The readLine() method, however, was much improved by taking the one from BufferedReader with minor adjustments; its performance is now identical.

The resulting class, OptimizedRandomAccessFile, buffers readLine() calls as long as no other method that requires or affects the position in the file is called. E.g., in:

OptimizedRandomAccessFile raf = new OptimizedRandomAccessFile(filePath, "r");
String line = raf.readLine();
int nextByte = raf.read();

nextByte will contain the first byte of the next line in the file.
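A minimal sketch of the same idea (the class and member names here are mine, not the actual OptimizedRandomAccessFile): keep an 8 KB buffer over the RandomAccessFile and track the logical byte offset ourselves, so readLine() is fast and getFilePointer() stays accurate. The sketch assumes a single-byte (ASCII) encoding, as in the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch, not the original class: buffered readLine()
// over a RandomAccessFile with an accurate logical file pointer.
class BufferedRaf {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private int bufLen = 0;   // number of valid bytes in buf
    private int bufPos = 0;   // next byte in buf to consume
    private long filePos = 0; // logical offset of the next unread byte

    BufferedRaf(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
    }

    long getFilePointer() { return filePos; }

    void seek(long pos) throws IOException {
        raf.seek(pos);
        bufLen = bufPos = 0; // invalidate the buffer
        filePos = pos;
    }

    private boolean fill() throws IOException {
        bufPos = 0;
        bufLen = raf.read(buf);
        return bufLen > 0;
    }

    // Reads one line; the '\n' (and a preceding '\r') are consumed
    // but stripped from the result. Returns null at end of file.
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        while (true) {
            if (bufPos >= bufLen && !fill()) {
                return sawAny ? sb.toString() : null;
            }
            sawAny = true;
            while (bufPos < bufLen) {
                byte b = buf[bufPos++];
                filePos++;
                if (b == '\n') return sb.toString();
                if (b != '\r') sb.append((char) b);
            }
        }
    }

    void close() throws IOException { raf.close(); }
}
```

Seeking simply invalidates the buffer, so readLine() after a seek() starts cleanly from the new position.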

The full code can be found on bitbucket.

Other tips

You can try to subclass BufferedReader to remember the read position, but you won't have seek functionality.

As you mentioned, a record can be multi-line, but all records are separated by a stop sequence. Given this, you can use RandomAccessFile like this:

  1. have a byte buffer byte b[] of let's say 8k in size (this is for performance reasons)

  2. read 8k from the file into this buffer and try to find the delimiter; if it is not found, read another 8k block, first appending the data already read to a StringBuilder or some other structure.

  3. when you find the delimiter, its position is given by the number of bytes processed since the last delimiter found (you need to do some simple math).

The tricky part is a record delimiter longer than one character, since it may straddle two blocks, but that should not be a big problem.
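The three steps above can be sketched as follows (the method name and the streaming match counter are my own choices; the simple restart logic assumes the delimiter has no self-overlapping prefix, and a full KMP matcher would cover the general case). The match counter carries over between blocks, so delimiters straddling a block boundary are found too:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch: scan the file in 8 KB blocks and record the byte offset of
// every occurrence of the delimiter.
class DelimiterIndexer {
    static List<Long> indexDelimiters(String path, byte[] delim) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] b = new byte[8192]; // step 1: the 8k buffer
            long filePos = 0; // file offset of b[0]
            int matched = 0;  // delimiter bytes matched so far (survives block boundaries)
            int n;
            while ((n = raf.read(b)) > 0) { // step 2: read block by block
                for (int i = 0; i < n; i++) {
                    if (b[i] == delim[matched]) {
                        if (++matched == delim.length) {
                            // step 3: start offset of this delimiter occurrence
                            offsets.add(filePos + i + 1 - delim.length);
                            matched = 0;
                        }
                    } else {
                        // restart; re-test the current byte as a first char
                        matched = (b[i] == delim[0]) ? 1 : 0;
                    }
                }
                filePos += n;
            }
        }
        return offsets;
    }
}
```

The returned offsets point at the first byte of each delimiter; record starts are then the byte after each delimiter (plus offset 0 for the first record).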

I would use the following sequence of java.io decorators:

   InputStreamReader    <-- reader, the top reader
   CountingInputStream  <-- cis, stores the position (from Google Guava)
   BufferedInputStream  <-- speeds up file reading
   FileInputStream

Then you read from this top reader by implementing a readLine() method that reads chars one by one until a line separator. I would not use BufferedReader, as it would spoil the current position by reading a full fixed-size buffer ahead.

So if I get the problem right, the algorithm is as simple as

  1. long lineStartPosition = cis.getCount();
  2. String s = readLine(reader);
  3. if(s.equals(DELIMITER)) { storeToIndex(lineStartPosition, recordData); }
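A self-contained sketch of that algorithm (CountingStream is a minimal stand-in for Guava's CountingInputStream; all names are illustrative). One caveat: an InputStreamReader on top can read ahead inside its decoder and overshoot the count, so for an ASCII file it is safer to read bytes directly from the counting stream:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal stand-in for Guava's CountingInputStream: counts every byte
// consumed from the underlying stream.
class CountingStream extends FilterInputStream {
    private long count = 0;

    CountingStream(InputStream in) { super(in); }

    long getCount() { return count; }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b != -1) count++;
        return b;
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) count += n;
        return n;
    }
}

class LineScanner {
    // Reads one '\n'-terminated line byte by byte (ASCII assumed);
    // strips '\r' and '\n'; returns null at end of file.
    static String readLine(CountingStream cis) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b = cis.read();
        if (b == -1) return null;
        while (b != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
            b = cis.read();
        }
        return sb.toString();
    }
}
```

Usage follows the three steps above: take lineStart = cis.getCount() before each readLine(), and when the line equals the delimiter, store the recorded positions in the index.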

You can read the whole data file, record where each delimiter is found, and save this metadata in a separate file. Then you can use the metadata to navigate through the data file (jump from one delimiter to the next). Each time the data file is modified, you will have to rescan it and regenerate the metadata.
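A sketch of such a metadata file (the file layout and names are my own): persist the record start offsets to a side file, reload them later, and seek straight to any record without rescanning the data file:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Illustrative index persistence: a count followed by one long offset
// per record.
class RecordIndex {
    static void save(List<Long> offsets, String indexPath) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(indexPath))) {
            out.writeInt(offsets.size());
            for (long off : offsets) out.writeLong(off);
        }
    }

    static List<Long> load(String indexPath) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(indexPath))) {
            int n = in.readInt();
            List<Long> offsets = new ArrayList<>(n);
            for (int i = 0; i < n; i++) offsets.add(in.readLong());
            return offsets;
        }
    }

    // Jump to a stored offset and read len bytes from there.
    static byte[] readAt(String dataPath, long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            raf.seek(offset);
            byte[] b = new byte[len];
            raf.readFully(b);
            return b;
        }
    }
}
```

In practice you would read from the stored offset until the next stop sequence rather than a fixed length; readAt() only demonstrates the seek.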

Licensed under: CC-BY-SA with attribution