Question

I'm not used to working with really large datasets and I'm kind of stumped here.

I have the following code:

private static Set<String> extractWords(BufferedReader br) throws IOException {
    String strLine;
    String tempWord;
    Set<String> words = new HashSet<String>();
    Utils utils = new Utils();
    int articleCounter = 0;
    // one article per line of input
    while ((strLine = br.readLine()) != null) {
        if (utils.lineIsNotCommentOrLineChange(strLine)) {
            articleCounter++;
            System.out.println("Working article : " + utils.getArticleName(strLine) + " *** Article #" + articleCounter + " of 3.769.926");
            strLine = utils.removeURLs(strLine);
            strLine = utils.convertUnicode(strLine);
            // split on runs of non-word characters
            String[] temp = strLine.split("\\W+");
            for (int i = 0; i < temp.length; i++) {
                tempWord = temp[i].trim().toLowerCase();
                if (utils.validateWord(tempWord)) {
                    words.add(tempWord);
                    System.out.println("Added word " + tempWord + " to list");
                }
            }
        }
    }
    return words;
}

This basically reads a huge text file from the BufferedReader, where each line is the text of one article. I want to build a set of the unique words in this file, but there are 3.769.926 articles in there, so the word count is quite immense.

From what I understand about Sets, or specifically HashSets, this should be the man for the job, so to speak. Everything runs quite smoothly at first, but after 500.000 articles it starts slowing down a bit. When it reaches 700.000 it's slow enough that it basically stops for a second or two before going on again. There's a bottleneck here somewhere, and I can't see what it is.

Any ideas?

Solution

I believe the issue you may be facing is that a hash table (set or map) is backed by an array with a fixed number of slots. So your initial declaration gives you a table able to hold 16 entries (HashMap's default capacity). Putting aside things like load factors, once you try to put a 17th entry into the table, it has to grow to accommodate more entries and prevent collisions, so Java expands it for you.

This expansion involves creating a new table with 2 * previousSize slots and then copying over every old entry. So if you are constantly expanding, you eventually hit a point, say 524,288 entries, where the table has to grow again: it allocates a new table able to handle 1,048,576 entries, but only after copying over the entire previous table.
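
A rough sketch of that arithmetic (assuming HashMap's documented defaults of capacity 16 and load factor 0.75; the numbers are illustrative, not measured):

public class ResizeMath {
    public static void main(String[] args) {
        int capacity = 16;           // HashMap's default initial capacity
        double loadFactor = 0.75;    // HashMap's default load factor
        while (capacity <= 1_048_576) {
            // the insert that crosses capacity * loadFactor triggers
            // a full reallocation and copy of the table
            System.out.printf("capacity %,9d -> resize at ~%,9d entries%n",
                    capacity, (int) (capacity * loadFactor));
            capacity *= 2;
        }
    }
}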

If you don't mind the extra lookup time, you might consider using a TreeSet instead of a HashSet. Your lookups will now take logarithmic time, but a tree doesn't have a pre-allocated table and can grow gracefully. Either use this, or declare an initial capacity for your HashSet so it won't have to grow dynamically. Both options are sketched below.
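
A minimal sketch of both options (the 10 million figure is a placeholder guess at the final unique-word count, not something from the question):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WordSetChoices {
    // Option 1: a TreeSet grows node by node, so there is no table to
    // rehash, at the cost of O(log n) adds instead of amortized O(1).
    static Set<String> treeBacked() {
        return new TreeSet<String>();
    }

    // Option 2: keep the HashSet but pre-size it so it never resizes.
    static Set<String> preSized() {
        return new HashSet<String>(10_000_000); // placeholder estimate
    }
}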

OTHER TIPS

Honestly, for that sort of scale you are better off moving to a database. You can embed Apache Derby inside your application if you don't want to run a separate one.

Their indexing systems are optimised for this sort of scale, and while HashSet and friends will cope if you massage them right, you are better off using the right tool for the job. A sketch of the embedded-Derby approach follows.
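
A minimal sketch of that idea, assuming derby.jar is on the classpath; the database name, table name, and column size are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.sql.Statement;

public class WordStore {
    public static void main(String[] args) throws Exception {
        // Embedded Derby: the database lives in the ./words directory;
        // no separate server process is needed.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:derby:words;create=true")) {
            try (Statement st = conn.createStatement()) {
                // PRIMARY KEY gives both uniqueness and an index;
                // note this statement will fail on a second run once
                // the table already exists
                st.executeUpdate(
                    "CREATE TABLE words (word VARCHAR(100) PRIMARY KEY)");
            }
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO words VALUES (?)")) {
                for (String w : new String[] { "alpha", "beta", "alpha" }) {
                    try {
                        ps.setString(1, w);
                        ps.executeUpdate();
                    } catch (SQLIntegrityConstraintViolationException dup) {
                        // duplicate word: the constraint handles de-duplication
                    }
                }
            }
        }
    }
}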

As noted by TheSageMage, the HashSet implementation will constantly resize the underlying HashMap as the data grows. There are a couple of ways of getting around that: initial capacity and load factor. You can set both using the 2-arg constructor: HashSet(int, float). If you know the approximate number of words you are going to need, you can set the initial capacity to be bigger than that number. This will make smaller sets work a little slower, but will prevent the dramatic slow-down for larger ones.

The load factor is how full the map must get before the underlying table is resized and rehashed. Since rehashing is a relatively time-consuming operation for large maps, you may want to set the load factor to a large fraction, say 0.9. If your initial capacity is set so that you may exceed it but will never exceed twice that size, a large load factor guarantees that you rehash only once, and as late as possible.
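
Put together, that might look like this (the 8 million estimate is an assumption for illustration; tune it to your data):

import java.util.HashSet;
import java.util.Set;

class PreSizedWordSet {
    static Set<String> newWordSet() {
        int expectedWords = 8_000_000;  // assumed estimate, not from the question
        float loadFactor = 0.9f;        // rehash only when the table is 90% full
        // capacity must satisfy expectedWords <= capacity * loadFactor
        int initialCapacity = (int) (expectedWords / loadFactor) + 1;
        return new HashSet<String>(initialCapacity, loadFactor);
    }
}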

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow