Question

I'm not used to working with really large datasets and I'm kind of stumped here.

I have the following code:

private static Set<String> extractWords(BufferedReader br) throws IOException {
    String strLine;
    String tempWord;
    Set<String> words = new HashSet<String>();
    Utils utils = new Utils();
    int articleCounter = 0;
    // one article per line of input
    while ((strLine = br.readLine()) != null) {
        if (utils.lineIsNotCommentOrLineChange(strLine)) {
            articleCounter++;
            System.out.println("Working article : " + utils.getArticleName(strLine) + " *** Article #" + articleCounter + " of 3.769.926");
            strLine = utils.removeURLs(strLine);
            strLine = utils.convertUnicode(strLine);
            // split on runs of non-word characters
            String[] temp = strLine.split("\\W+");
            for (int i = 0; i < temp.length; i++) {
                tempWord = temp[i].trim().toLowerCase();
                if (utils.validateWord(tempWord)) {
                    words.add(tempWord);
                    System.out.println("Added word " + tempWord + " to list");
                }
            }
        }
    }
    return words;
}

This basically reads a huge text file from the BufferedReader, where each line is the text of one article. I want to build a set of the unique words in this file, but there are 3.769.926 articles in there, so the word count is quite immense.

From what I understand about Sets, or specifically HashSets, this should be the man for the job, so to speak. Everything runs quite smoothly at first, but after 500.000 articles it starts slowing down a bit. When it reaches 700.000 it's slow enough that it basically stops for a second or two before going on again. There's a bottleneck here somewhere, and I can't see what it is.

Any ideas?

Solution

I believe the issue you may be facing is that a hash table (set or map) is backed by an array with a fixed number of slots. So your initial declaration gives you a table able to hold 16 entries (HashMap's default capacity). Putting aside things like load factors, once you try to put a 17th entry into the table, it has to grow to accommodate more entries and prevent collisions, so Java expands it for you.

This expansion involves creating a new table with 2 * previousSize slots and then copying over every old entry. So if you are constantly expanding, you eventually hit a point, say 524,288 entries, where the table has to grow again: it allocates a new table able to handle 1,048,576 entries, but only after copying over the entire previous table.
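
A rough sketch of that arithmetic (assuming HashMap's documented defaults of capacity 16 and load factor 0.75; the numbers are illustrative, not measured):

public class ResizeMath {
    public static void main(String[] args) {
        int capacity = 16;           // HashMap's default initial capacity
        double loadFactor = 0.75;    // HashMap's default load factor
        while (capacity <= 1_048_576) {
            // the insert that crosses capacity * loadFactor triggers
            // a full reallocation and copy of the table
            System.out.printf("capacity %,9d -> resize at ~%,9d entries%n",
                    capacity, (int) (capacity * loadFactor));
            capacity *= 2;
        }
    }
}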

If you don't mind the extra lookup time, you might consider using a TreeSet instead of a HashSet. Your lookups will now take logarithmic time, but a tree doesn't have a pre-allocated table and can grow gracefully. Either use this, or declare an initial capacity for your HashSet so it won't have to grow dynamically. Both options are sketched below.
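
A minimal sketch of both options (the 10 million figure is a placeholder guess at the final unique-word count, not something from the question):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WordSetChoices {
    // Option 1: a TreeSet grows node by node, so there is no table to
    // rehash, at the cost of O(log n) adds instead of amortized O(1).
    static Set<String> treeBacked() {
        return new TreeSet<String>();
    }

    // Option 2: keep the HashSet but pre-size it so it never resizes.
    static Set<String> preSized() {
        return new HashSet<String>(10_000_000); // placeholder estimate
    }
}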

OTHER TIPS

Honestly, for that sort of scale you are better off moving to a database. You can embed Apache Derby inside your application if you don't want to run a separate one.

Their indexing systems are optimised for this sort of scale, and while HashSet and friends will cope if you massage them right, you are better off using the right tool for the job. A sketch of the embedded-Derby approach follows.
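
A minimal sketch of that idea, assuming derby.jar is on the classpath; the database name, table name, and column size are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.sql.Statement;

public class WordStore {
    public static void main(String[] args) throws Exception {
        // Embedded Derby: the database lives in the ./words directory;
        // no separate server process is needed.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:derby:words;create=true")) {
            try (Statement st = conn.createStatement()) {
                // PRIMARY KEY gives both uniqueness and an index;
                // note this statement will fail on a second run once
                // the table already exists
                st.executeUpdate(
                    "CREATE TABLE words (word VARCHAR(100) PRIMARY KEY)");
            }
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO words VALUES (?)")) {
                for (String w : new String[] { "alpha", "beta", "alpha" }) {
                    try {
                        ps.setString(1, w);
                        ps.executeUpdate();
                    } catch (SQLIntegrityConstraintViolationException dup) {
                        // duplicate word: the constraint handles de-duplication
                    }
                }
            }
        }
    }
}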

As noted by TheSageMage, the HashSet implementation will constantly resize the underlying HashMap as the data grows. There are a couple of ways of getting around that: initial capacity and load factor. You can set both using the 2-arg constructor: HashSet(int, float). If you know the approximate number of words you are going to need, you can set the initial capacity to be bigger than that number. This will make smaller sets work a little slower, but will prevent the dramatic slow-down for larger ones.

The load factor is how full the map must get before the underlying table is resized and rehashed. Since rehashing is a relatively time-consuming operation for large maps, you may want to set the load factor to a large fraction, say 0.9. If your initial capacity is set so that you may exceed it but will never exceed twice that size, a large load factor guarantees that you rehash only once, and as late as possible.
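
Put together, that might look like this (the 8 million estimate is an assumption for illustration; tune it to your data):

import java.util.HashSet;
import java.util.Set;

class PreSizedWordSet {
    static Set<String> newWordSet() {
        int expectedWords = 8_000_000;  // assumed estimate, not from the question
        float loadFactor = 0.9f;        // rehash only when the table is 90% full
        // capacity must satisfy expectedWords <= capacity * loadFactor
        int initialCapacity = (int) (expectedWords / loadFactor) + 1;
        return new HashSet<String>(initialCapacity, loadFactor);
    }
}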

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow