How to parse a huge file line by line, serialize & deserialize a huge object efficiently?

Question 1

It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.

I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col onto a single index. According to how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.

Writing a single huge int[] into memory should be pretty fast, surely no half an hour.

Because of the data amount, the above doesn't work as you can't have a 6e9 entries array. But you can use a couple of big arrays instead and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D-array).

Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.

Question 2

It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.

Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.

Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)

Question 3

There are several alternative Java file i/o libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300Mb per second with a 6-year old Mac. So for 4Gb you have under 15 seconds of read time. Of course my experience is that Mac IO channels are very good. YMMV if you have a cheap PC.

Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.

The implication is that parsing characters into the data you need is the bottleneck.

I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.

If that still isn't fast enough, you'd have to fall back to a native C extension.

Question 4

If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.

But there's another way - why is the file in ASCII? If it were in binary, you could just slurp it up.