Question

I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first String is the file name, the second String is a word, and the Double is that word's frequency. The processing order is as follows:

  • read the first file
    • read the first line of the file
    • split the important tokens to a string array
    • copy the string array to my final map, incrementing word frequencies
  • repeat for all files

I'm using a BufferedReader. The problem is that after processing the first few files, the map becomes so big that performance gets very poor after a while. I would like to hear a solution for this. My idea is to create a size-limited map and, once the limit is reached, write it out to a file; do that until everything is processed, then merge all the maps at the end.
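For reference, here is a minimal sketch of the setup described above, assuming whitespace tokenization and my own class and method names:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class WordFrequencies {
    // Outer key: file name, inner key: word, value: occurrence count.
    static Map<String, Map<String, Double>> index = new HashMap<>();

    static void processFile(Path file) throws IOException {
        Map<String, Double> counts =
                index.computeIfAbsent(file.getFileName().toString(), k -> new HashMap<>());
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : line.split("\\s+")) {   // naive tokenizer, adjust as needed
                    if (!token.isEmpty()) {
                        counts.merge(token, 1.0, Double::sum);
                    }
                }
            }
        }
    }
}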


The solution

Let me rethink your problem:

Since you are trying to construct an inverted index:

  1. Use a Multimap rather than Map<String, Map<String, Integer>>

    Multimap<word, (frequency, fileName, something else tomorrow)>

  2. Now read one file at a time, construct its Multimap, and save it to disk (similar to Jon's answer).

  3. After reading x files, merge all the Multimaps together with putAll(multimap) if you really need one common map of all the values. A sketch follows this list.
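For illustration, a minimal sketch of that idea using Guava's ArrayListMultimap; the Posting value class and its fields are my own guess at what the pseudo-signature above is carrying:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;
import java.util.Map;

public class InvertedIndex {
    // Value object carrying everything we know about a word in one file.
    static class Posting {
        final String fileName;
        final int frequency;        // "something else tomorrow" can be added here later
        Posting(String fileName, int frequency) {
            this.fileName = fileName;
            this.frequency = frequency;
        }
    }

    // One multimap per file, built after counting that file's words...
    static Multimap<String, Posting> indexOneFile(String fileName, Map<String, Integer> counts) {
        Multimap<String, Posting> index = ArrayListMultimap.create();
        counts.forEach((word, freq) -> index.put(word, new Posting(fileName, freq)));
        return index;
    }

    // ...and merged into one common index after every x files.
    static void merge(Multimap<String, Posting> target, Multimap<String, Posting> partial) {
        target.putAll(partial);
    }
}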

Other tips

Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?

You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.

==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
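A minimal sketch of writing that single delimited results file, assuming the "word - count" layout shown above; class and method names are illustrative:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

public class ResultDumper {
    // Append one file's word counts to a single results file, separated by a header line.
    static void dump(Path resultsFile, String sourceFileName, Map<String, Integer> counts)
            throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(resultsFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write("==== " + sourceFileName + " ====");
            out.newLine();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getKey() + " - " + e.getValue());
                out.newLine();
            }
        }
    }
}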

By the way, why are you using double for the frequency? Surely it should be an integer value...

The time a hash map takes per operation shouldn't increase significantly as it grows. It's possible that your map is skewed by an unsuitable hashing function, or that it is simply getting too full. Unless you're using more RAM than the system can give you, you shouldn't have to break things up.

What I have seen with Java when running huge hash maps (or any collection) with lots of objects in memory is that the VM goes crazy running the garbage collector. It gets to the point where 90% of the time is spent on garbage collection, which takes a while and then finds that almost every object is still referenced.

I suggest profiling your application, and if it is the garbage collector, then increase the heap space and tune the collector. It will also help if you can approximate the needed size of your hash maps and allocate them sufficiently large up front (see the initialCapacity and loadFactor options in the constructor).
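A minimal sketch of that pre-sizing advice; the expected sizes and the JVM flag in the comment are assumptions to adapt to your own data:

import java.util.HashMap;
import java.util.Map;

public class PreSizedMaps {
    public static void main(String[] args) {
        int expectedFiles = 900;                 // known from the question
        int expectedWordsPerFile = 50_000;       // assumption: measure your real data
        float loadFactor = 0.75f;

        // Pre-size so the maps never need to rehash while they are being filled.
        Map<String, Map<String, Double>> index =
                new HashMap<>((int) (expectedFiles / loadFactor) + 1, loadFactor);
        Map<String, Double> perFile =
                new HashMap<>((int) (expectedWordsPerFile / loadFactor) + 1, loadFactor);

        // If profiling points at the garbage collector, also raise the heap,
        // e.g. run with: java -Xmx4g -verbose:gc PreSizedMaps
    }
}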

You could try using this library to improve your performance.

http://high-scale-lib.sourceforge.net/

It is similar to the Java collections API, but built for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
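A minimal sketch, assuming the NonBlockingHashMap class shipped with that library (a drop-in java.util.concurrent.ConcurrentMap implementation); the counting logic is illustrative:

import org.cliffc.high_scale_lib.NonBlockingHashMap;
import java.util.concurrent.ConcurrentMap;

public class HighScaleExample {
    // Drop-in replacement for HashMap when many threads touch the map at once.
    static ConcurrentMap<String, Double> counts = new NonBlockingHashMap<>();

    static void add(String word) {
        counts.merge(word, 1.0, Double::sum);   // merge() is inherited from ConcurrentMap
    }
}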

Here is an article with some more input:

http://www.javaspecialists.eu/archive/Issue193.html

Why not use a custom class,

public class CustomData {
    private String word;
    private double frequency;
    // Setters and getters
}

and use your map as

Map<String, List<CustomData>>  // keyed by file name

This way, at least, you will have only 900 keys in your map.
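For example, a minimal sketch of filling such a map, assuming the setters implied by the comment above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFileLists {
    // One entry per file (about 900 keys), each holding that file's word/frequency pairs.
    static Map<String, List<CustomData>> index = new HashMap<>();

    static void add(String fileName, String word, double frequency) {
        CustomData data = new CustomData();
        data.setWord(word);
        data.setFrequency(frequency);
        index.computeIfAbsent(fileName, k -> new ArrayList<>()).add(data);
    }
}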

-Ivar

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow