Question

At my workplace we regularly have files with more than a million rows each. Even though the server has more than 10 GB of memory, with 8 GB allocated to the JVM, it sometimes hangs for a few moments and chokes the other tasks.

I profiled the code and found that while reading a file, memory use frequently rises into the gigabytes (1 GB to 3 GB) and then suddenly drops back to normal. It seems that these frequent memory swings hang my servers. Of course this was due to garbage collection.

Which API should I use to read the files for better performance?

Right now I am using BufferedReader(new FileReader(...)) to read these CSV files.

Process: How am I reading the file?

  1. I read the files line by line.
  2. Every line has a few columns. Based on their types, I parse them accordingly (the cost column as a double, the visit column as an int, the keyword column as a String, etc.).
  3. I put the eligible content (visit > 0) into a HashMap and finally clear that Map at the end of the task (roughly as sketched below).
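For illustration, a minimal sketch of the kind of loop described above might look like this (the column order and the map's value type are assumptions, not the actual code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvLoader {
    // Hypothetical column order: keyword, visit, cost
    public static Map<String, Double> load(String path) throws IOException {
        Map<String, Double> eligible = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                String keyword = cols[0];
                int visit = Integer.parseInt(cols[1]);
                double cost = Double.parseDouble(cols[2]);
                if (visit > 0) {                    // keep only eligible rows
                    eligible.put(keyword, cost);
                }
            }
        }
        return eligible;
    }
}
```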

Update

I read 30 or 31 files (one month's data) this way and store the eligible rows in a Map. Later this map is used to find some culprits in different tables, so reading the files and storing that data are both necessary. I have since switched the HashMap part to BerkeleyDB, but the issue at file-reading time is the same or even worse.


Solution

BufferedReader is one of the two best APIs to use for this. If you really had trouble with file reading, an alternative might be to use NIO to memory-map your files and then read the contents directly out of memory.
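For illustration only, here is a rough sketch of the memory-mapping route (the file name is hypothetical, and the byte-by-byte decoding assumes single-byte-encoded data):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("data.csv");  // hypothetical file name
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map the whole file into virtual memory (a single mapping is limited to 2 GB).
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            StringBuilder line = new StringBuilder();
            while (buffer.hasRemaining()) {
                char c = (char) buffer.get();   // assumes ASCII / Latin-1 data
                if (c == '\n') {
                    process(line.toString());
                    line.setLength(0);
                } else if (c != '\r') {
                    line.append(c);
                }
            }
            if (line.length() > 0) {
                process(line.toString());       // last line without a trailing newline
            }
        }
    }

    private static void process(String line) {
        // parse and filter the line here
    }
}
```

Whether this actually beats BufferedReader for simple sequential line parsing is far from guaranteed; its main advantage is avoiding an extra copy through a heap-side buffer.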

But your problem is not with the reader. Your problem is that every read operation creates a bunch of new objects, most likely in the stuff you do just after reading.

You should consider cleaning up your input processing with an eye on reducing the number and/or size of objects you create, or simply getting rid of objects more quickly once no longer needed. Would it be possible to process your file one line or chunk at a time rather than inhaling the whole thing into memory for processing?
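As one small, hedged example of trimming per-line allocations (the column order of keyword, visit, cost is an assumption): scanning for separators with indexOf avoids the String[] produced by split and only materialises the columns that are actually needed.

```java
public class LeanParse {
    // Assumed column order: keyword, visit, cost.
    // Only the visit column is turned into a substring; no String[] is created.
    static boolean isEligible(String line) {
        int firstComma = line.indexOf(',');
        int secondComma = line.indexOf(',', firstComma + 1);
        int visit = Integer.parseInt(line.substring(firstComma + 1, secondComma));
        return visit > 0;
    }
}
```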

Another possibility would be to fiddle with garbage collection. You have two mechanisms:

  • Explicitly call the garbage collector every once in a while, say every 10 seconds or every 1000 input lines or something. This will increase the amount of work done by the GC, but each collection will take less time, your memory won't swell as much, and so hopefully there will be less impact on the rest of the server (a minimal sketch follows this list).

  • Fiddle with the JVM's garbage collector options. These differ between JVMs, but java -X should give you some hints.
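A minimal sketch of the first option (the 10,000-line interval is arbitrary, and processLine is a placeholder):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PeriodicGcReader {
    public static void read(String path) throws IOException {
        long linesRead = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);                   // hypothetical per-line handler
                if (++linesRead % 10_000 == 0) {     // arbitrary interval; tune it
                    System.gc();                     // only a request; the JVM may ignore it
                }
            }
        }
    }

    private static void processLine(String line) {
        // parse / filter the line here
    }
}
```

Bear in mind that System.gc() is only a hint, so whether this helps at all depends on the collector in use.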

Update: Most promising approach:

Do you really need the whole dataset in memory at one time for processing?

OTHER TIPS

I profiled the code and found that while reading a file, memory use frequently rises into the gigabytes (1 GB to 3 GB) and then suddenly drops back to normal. It seems that these frequent memory swings hang my servers. Of course this was due to garbage collection.

Using BufferedReader(new FileReader(...)) won't cause that.

I suspect that the problem is that you are reading the lines/rows into an array or list, processing them and then discarding the array/list. This will cause the memory usage to increase and then decrease again. If this is the case, you can reduce memory usage by processing each line/row as you read it.
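If that is what is happening, one way to restructure it is to hand each row to a callback as it is read, so that no list of all rows ever exists (the Consumer-based shape here is just one possibility):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.function.Consumer;

public class StreamingCsvReader {
    // Hands each row to the caller immediately; no List of all rows is ever built,
    // so each line becomes garbage as soon as the consumer is done with it.
    public static void forEachRow(String path, Consumer<String[]> consumer) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                consumer.accept(line.split(","));
            }
        }
    }
}
```

A caller can then keep only the filtered or aggregated result it actually needs, for example `forEachRow("day01.csv", cols -> { ... })`.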

EDIT: We are agreed that the problem is about the space used to represent the file content in memory. An alternative to a huge in-memory hashtable is to go back to the old "sort merge" approach we used when computer memory was measured in kilobytes. (I'm assuming that the processing is dominated by a step where you are doing a lookup with keys K to get the associated row R.)

  1. If necessary, preprocess each of the input files so that they can be sorted on the key K.

  2. Use an efficient file sort utility to sort all of the input files into order on K. You want a utility that uses a classical merge sort algorithm: it will split each file into smaller chunks that can be sorted in memory, sort the chunks, write them to temporary files, and then merge the sorted temporary files. The UNIX / Linux sort utility is a good option.

  3. Read the sorted files in parallel, reading all rows that relate to each key value from all files, processing them and then stepping on to the next key value (a rough sketch of this step follows the list).
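A rough sketch of step 3, assuming the key is the first comma-separated column and the files have already been sorted on it:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// K-way merge over files that are already sorted on the key. Only the rows
// belonging to the current key are ever held in memory at the same time.
public class SortedFileMerge {

    private static class Cursor {
        final BufferedReader reader;
        String currentLine;

        Cursor(String path) throws IOException {
            reader = new BufferedReader(new FileReader(path));
            currentLine = reader.readLine();
        }

        String key() {
            // assumes well-formed lines with the key in the first column
            return currentLine.substring(0, currentLine.indexOf(','));
        }
    }

    public static void mergeProcess(List<String> sortedFiles) throws IOException {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(Comparator.comparing(Cursor::key));
        for (String path : sortedFiles) {
            Cursor cursor = new Cursor(path);
            if (cursor.currentLine != null) {
                queue.add(cursor);
            } else {
                cursor.reader.close();   // empty file
            }
        }
        while (!queue.isEmpty()) {
            String key = queue.peek().key();
            List<String> rowsForKey = new ArrayList<>();
            // Pull every row with this key, from every file, before moving on.
            while (!queue.isEmpty() && queue.peek().key().equals(key)) {
                Cursor cursor = queue.poll();
                rowsForKey.add(cursor.currentLine);
                cursor.currentLine = cursor.reader.readLine();
                if (cursor.currentLine != null) {
                    queue.add(cursor);
                } else {
                    cursor.reader.close();
                }
            }
            process(key, rowsForKey);   // hypothetical per-key processing
        }
    }

    private static void process(String key, List<String> rows) {
        // whatever per-key work is needed
    }
}
```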

Actually, I'm a bit surprised that using BerkeleyDB didn't help. However, if profiling tells you that most of the time is going into building the DB, you may be able to speed it up by sorting the input file (as above!) into ascending key order before you build the DB. (When creating a large file-based index, you get better performance if the entries are added in key order.)

Try using the following JVM options to tune the GC (and print some GC details):

-verbose:gc -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps