Question

I have a binary file that contains blocks of information (I'll refer to them as packets from here on). Each packet consists of a fixed-length header and a variable-length body; I have to determine the length of the body from the packet header itself. My task is to read these packets from the file and perform some operation on them. Currently I perform this task as follows:

  • Opening the file as a random access file and seeking to a specific, user-specified start position. Reading the first packet from this position and performing the specific operation.
  • Then, in a loop:
    • reading the next packet
    • performing my operation

This goes on until I hit the end-of-file marker.
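
Roughly, the current serial loop looks like this sketch (the header size, the way the body length is parsed, and the processing step are placeholders, since the actual packet format isn't shown):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SerialPacketReader {
        static final int HEADER_SIZE = 16; // placeholder: the real fixed header length

        public static void main(String[] args) throws IOException {
            long startOffset = Long.parseLong(args[1]); // user-specified start position
            try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
                file.seek(startOffset);
                byte[] header = new byte[HEADER_SIZE];
                while (file.getFilePointer() + HEADER_SIZE <= file.length()) {
                    file.readFully(header);
                    byte[] body = new byte[parseBodyLength(header)]; // body length comes from the header
                    file.readFully(body);
                    process(header, body);                           // the per-packet operation
                }
            }
        }

        static int parseBodyLength(byte[] header) { return 0; /* format-specific */ }
        static void process(byte[] header, byte[] body) { /* the actual work */ }
    }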

As you can guess, when the file is huge, reading and processing each packet serially is time-consuming. I want to parallelize this operation somehow: have the packet-generation step put packets into some blocking queue, and then retrieve packets from the queue and process them in parallel.

Can someone suggest how I might generate these packets in parallel?


Solution

You should have only one thread read the file, sequentially, since I'm assuming the file lies on a single drive. Reading the file is limited by your IO speed, so there's no point in parallelizing that on the CPU side. In fact, reading non-sequentially will significantly decrease your performance, since regular hard drives are designed for sequential IO. For each packet the IO thread reads, it should put that object into a thread-safe queue.

Now you can parallelize the processing of the packets. Create multiple threads and have each of them take packets from the queue. Each thread should do its processing and put the result into some "finished" queue.

Once the IO thread has finished reading in the file, a flag should be set so that the working threads stop once the queue is empty.
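
A minimal sketch of this layout in Java, using an ArrayBlockingQueue as the thread-safe queue and a poison-pill sentinel standing in for the "finished" flag (the header size, the body-length parsing, and the processing step are placeholders for your format):

    import java.io.RandomAccessFile;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ParallelPacketProcessor {
        record Packet(byte[] header, byte[] body) {}

        // Sentinel standing in for the "finished" flag: one is queued per worker.
        static final Packet POISON_PILL = new Packet(new byte[0], new byte[0]);

        public static void main(String[] args) throws Exception {
            int workers = Runtime.getRuntime().availableProcessors();
            BlockingQueue<Packet> queue = new ArrayBlockingQueue<>(1024); // bounded, thread-safe buffer

            // Worker threads: take packets from the queue and process them.
            Thread[] pool = new Thread[workers];
            for (int i = 0; i < workers; i++) {
                pool[i] = new Thread(() -> {
                    try {
                        for (Packet p = queue.take(); p != POISON_PILL; p = queue.take()) {
                            process(p); // the per-packet operation
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                pool[i].start();
            }

            // Single IO thread (here, the main thread) reads the file sequentially.
            try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
                byte[] header = new byte[16]; // placeholder fixed header size
                while (file.getFilePointer() + header.length <= file.length()) {
                    file.readFully(header);
                    byte[] body = new byte[parseBodyLength(header)];
                    file.readFully(body);
                    queue.put(new Packet(header.clone(), body)); // blocks if the buffer is full
                }
            }
            for (int i = 0; i < workers; i++) queue.put(POISON_PILL); // tell each worker to stop
            for (Thread t : pool) t.join();
        }

        static int parseBodyLength(byte[] header) { return 0; /* format-specific */ }
        static void process(Packet p) { /* the actual work */ }
    }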

OTHER TIPS

If you are using a disk with platters (i.e. not an SSD), then there is no point in having more than one thread read the file, since all you will do is thrash the disk, causing the disk arm to introduce millisecond delays. If you have an SSD, it's a different story and you could parallelise the reading.

Instead you should have one thread reading the data from the file and creating the packets, then doing the following:

  • wait on a shared semaphore 'A' (which has been initialised to some number that will be your 'max buffered packets' count)
  • lock a shared object
  • append the packet to a LinkedList
  • unlock the shared object
  • signal another shared semaphore 'B' (this one tracks the count of packets in the buffer)

Then you can have many other threads doing the following:

  • wait on the 'B' semaphore (to ensure there is a packet to be processed)
  • lock the shared object
  • do removeFirst() on the LinkedList and store the packet in a local variable
  • unlock the shared object
  • signal semaphore 'A' to allow another packet into the buffered packet list

This will ensure you are reading packets as fast as possible (from a platter disk) by reading them in one continuous sequence, and it will ensure that you are processing multiple packets at once without any polling.
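
Here is a sketch of that two-semaphore buffer in Java (the class and method names are mine; pass whatever 'max buffered packets' count suits your memory budget):

    import java.util.LinkedList;
    import java.util.concurrent.Semaphore;

    public class SemaphoreBuffer<T> {
        private final Semaphore a;                    // 'A': free slots in the buffer
        private final Semaphore b = new Semaphore(0); // 'B': packets waiting in the buffer
        private final LinkedList<T> buffer = new LinkedList<>();
        private final Object lock = new Object();     // the shared object

        public SemaphoreBuffer(int maxBufferedPackets) {
            this.a = new Semaphore(maxBufferedPackets);
        }

        // Called by the single reader thread.
        public void put(T packet) throws InterruptedException {
            a.acquire();                 // wait on 'A' until there is room
            synchronized (lock) {        // lock the shared object
                buffer.addLast(packet);  // append the packet to the LinkedList
            }
            b.release();                 // signal 'B': one more packet is available
        }

        // Called by each processing thread.
        public T take() throws InterruptedException {
            b.acquire();                 // wait on 'B' until a packet is available
            T packet;
            synchronized (lock) {        // lock the shared object
                packet = buffer.removeFirst();
            }
            a.release();                 // signal 'A': a buffer slot was freed
            return packet;
        }
    }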

I guess the known fast method for reading the file is using java.nio.MappedByteBuffer (memory-mapped IO).
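
A minimal sketch of reading the packets through a memory-mapped buffer (again with a placeholder header size and body-length parsing; note that a single mapping is limited to about 2 GB, so a very large file would have to be mapped in chunks):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedPacketReader {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
                // A single mapping cannot exceed Integer.MAX_VALUE bytes; map in chunks for larger files.
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] header = new byte[16]; // placeholder fixed header size
                while (buf.remaining() >= header.length) {
                    buf.get(header);
                    byte[] body = new byte[parseBodyLength(header)]; // body length from the header
                    buf.get(body);
                    process(header, body);
                }
            }
        }

        static int parseBodyLength(byte[] header) { return 0; /* format-specific */ }
        static void process(byte[] header, byte[] body) { /* the actual work */ }
    }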

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow