Question

Given a 1 GB (very large) file containing words (some repeated), we need to read the file and output how many times each word is repeated. Please let me know whether or not my solution is highly performant.

(For simplicity, let's assume we have already captured the words in an ArrayList<String>.)

I think the time complexity is O(n). Am I correct?

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {

    public static void main(String[] args) {
        List<String> al = new ArrayList<String>();
        al.add("math1");
        al.add("raj1");
        al.add("raj2");
        al.add("math");
        al.add("rj2");
        al.add("math");
        al.add("rj3");
        al.add("math2");
        al.add("rj1");
        al.add("is");

        Map<String, Integer> map = new HashMap<String, Integer>();

        // First pass: pre-populate the map with null counts.
        for (int i = 0; i < al.size(); i++) {
            String s = al.get(i);
            map.put(s, null);
        }

        // Second pass: count the occurrences of each word.
        for (int i = 0; i < al.size(); i++) {
            String s = al.get(i);
            if (map.get(s) == null) {
                map.put(s, 1);
            } else {
                int count = map.get(s);
                count = count + 1;
                map.put(s, count);
            }
        }

        System.out.println(map);
    }
}

Solution

Theoretically, since HashMap access is generally O(1), I guess your algorithm is O(n), but in reality it has several inefficiencies. Ideally you would iterate over the contents of the file just once, processing (i.e. counting) the words as you read them in. There's no need to store the entire file contents in memory (your ArrayList). You loop over the contents three times: once to read them, and then twice more in the two loops in your code above. In particular, the first loop in your code is completely unnecessary. Finally, your use of HashMap will be slower than needed, because its default size at construction is very small and it will have to grow internally a number of times, forcing a rebuild of the hash table each time. It is better to construct it with a size appropriate for what you expect it to hold, taking the load factor into account as well.
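For illustration, a minimal single-pass sketch might look like the following. The file name, the whitespace splitting, and the initial-capacity guess are assumptions for the example, not something given in the original question.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SinglePassWordCount {

    public static void main(String[] args) throws IOException {
        // Rough guess of the number of distinct words so the map rarely resizes
        // (hypothetical number; tune it for your data).
        int expectedDistinctWords = 500_000;
        Map<String, Integer> counts =
                new HashMap<String, Integer>(expectedDistinctWords);

        // Read the file line by line instead of loading 1 GB into memory.
        try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get("words.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumes words are separated by whitespace.
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        // Count in the same pass that reads the file.
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }

        counts.forEach((word, count) ->
                System.out.println(word + " -> " + count));
    }
}

This keeps everything to a single pass over the file and drops the null-priming loop entirely.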

OTHER TIPS

I think you could do better than using a HashMap.

Food for thought on the HashMap solution

Your answer is acceptable, but consider this: for simplicity's sake, let's assume you read the file one byte at a time into a StringBuffer until you hit a space, at which point you call toString() to convert the StringBuffer into a String. You then check whether the string is in the HashMap, and either it gets stored or its counter gets incremented.

The English dictionary included with Linux has about 400k words and is about 5 MB in size. So of the "1 GB" of text you read, we can guess that you'll only be storing about 5 MB of it in your HashMap. The rest of the file will be converted into strings that will need to be garbage collected after you're finished looking them up in your map. I could be wrong, but I believe the bytes will be iterated over again during the construction of the String, since the byte array needs to be copied internally, and again when calculating the hash code. So the solution may waste a fair amount of CPU cycles and force GC to occur often.
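To make that allocation churn concrete, here is a hedged sketch of the byte-at-a-time reading described above; the file name, the single-byte characters, and the whitespace delimiters are simplifying assumptions.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class ByteAtATimeCount {

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringBuilder current = new StringBuilder();

        try (InputStream in =
                     new BufferedInputStream(new FileInputStream("words.txt"))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == ' ' || b == '\n' || b == '\r') {
                    addWord(counts, current);
                } else {
                    current.append((char) b);   // assumes single-byte characters
                }
            }
            addWord(counts, current);           // flush the last word
        }
    }

    private static void addWord(Map<String, Integer> counts, StringBuilder sb) {
        if (sb.length() == 0) {
            return;
        }
        // Every word read produces a new String; for words already in the map
        // it becomes garbage right after the lookup, which is the churn
        // described above.
        String word = sb.toString();
        counts.merge(word, 1, Integer::sum);
        sb.setLength(0);
    }
}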

It's OK to point things like this out in your interview, even if it's the only solution you can think of.

I might consider using a custom radix tree or trie-like structure

Keep in mind how the insert method of a radix tree or trie works: it takes a stream of chars/bytes (usually a string) and compares each element against the current position in the tree. If the prefix exists, it just advances down the tree and the byte stream in lock step. When it hits a new suffix, it begins adding nodes to the tree. Once the end of the stream is reached, it marks that node as end-of-word (EOW). Now consider that we could do the same thing while reading a much larger stream, by resetting the current position to the root of the tree any time we hit a space.

If we wrote our own radix tree (or maybe a trie) whose nodes had end-of-word counters (instead of markers) and whose insert method read directly from the file, we could insert nodes into the tree one byte/char at a time until we read a space. At that point the insert method would increment the end-of-word counter (if it's an existing word) and reset the current position in the tree back to the root, then start inserting bytes/chars again. The way a radix tree works is to collapse the duplicated prefixes of words. For example:

The following file:

math1 raj1 raj2 math rj2 math rj3 

would be converted to:

(root)--math->(eow=2)
     |      |->1->(eow=1)
     |
     |--r--aj->1->(eow=1)
           |  |->2->(eow=1)
           |
           |--j->2->(eow=1)
              |->3->(eow=1)

The insertion time into a tree like this would be O(k), where k is the length of the longest word, but since we are inserting/comparing as we read each byte, we are no less efficient than just reading the file, which we have to do anyway.

Also, note that we would read bytes into a temporary variable on the stack, so the only time we need to allocate memory from the heap is when we encounter a new word (actually, a new suffix). Therefore, garbage collection wouldn't happen nearly as often, and the total memory used by a radix tree would be a lot smaller than that of a HashMap.
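As a rough illustration of the idea, here is a sketch of a plain trie with per-node counters (not a compressed radix tree) that counts words straight from a stream; the file name, ASCII input, and the HashMap-per-node children are assumptions made for brevity.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class TrieWordCount {

    // One node per character; 'count' plays the role of the end-of-word counter.
    static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        int count;
    }

    public static void main(String[] args) throws IOException {
        Node root = new Node();
        Node current = root;

        try (InputStream in =
                     new BufferedInputStream(new FileInputStream("words.txt"))) {
            int b;
            while ((b = in.read()) != -1) {
                if (Character.isWhitespace(b)) {
                    // Hit a separator: bump the counter and jump back to the root.
                    if (current != root) {
                        current.count++;
                        current = root;
                    }
                } else {
                    // Walk (or extend) the tree one character at a time.
                    current = current.children
                            .computeIfAbsent((char) b, c -> new Node());
                }
            }
            if (current != root) {
                current.count++;        // last word in the file
            }
        }

        print(root, new StringBuilder());
    }

    // Depth-first walk that reconstructs each word from the path to its node.
    static void print(Node node, StringBuilder prefix) {
        if (node.count > 0) {
            System.out.println(prefix + " -> " + node.count);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            print(e.getValue(), prefix);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}

Note that this uses a HashMap per node for simplicity; the memory savings the answer describes come from a compressed radix tree with more compact child storage.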

Have you considered using a MapReduce solution? If the dataset gets bigger, it would really be better to split it into pieces and count the words in parallel.
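This is not actual Hadoop MapReduce, but a minimal sketch of the same split-and-merge idea using Java parallel streams (the file name is a placeholder assumption):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelWordCount {

    public static void main(String[] args) throws IOException {
        // "Map" phase: lines are processed in parallel and split into words.
        // "Reduce" phase: the concurrent collector merges the per-thread counts.
        try (Stream<String> lines = Files.lines(Paths.get("words.txt"))) {
            Map<String, Long> counts = lines.parallel()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .filter(word -> !word.isEmpty())
                    .collect(Collectors.groupingByConcurrent(
                            Function.identity(), Collectors.counting()));

            counts.forEach((word, count) ->
                    System.out.println(word + " -> " + count));
        }
    }
}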

You should read through the file of words only once.

No need to put the nulls beforehand - you can do it within the main loop.

The complexity is indeed O(n) in both cases, but you want to make the constant factor as small as possible (asymptotically, n and 1000·n are both O(n), right? :) ).

To answer your question, first you need to understand how HashMap works. It consists of buckets, and every bucket is a linked list. If, due to hashing, another pair needs to occupy the same bucket, it will be added to the end of the linked list. So, if the map is heavily loaded, searching and inserting are no longer O(1), and the algorithm becomes inefficient. Moreover, if the map's load exceeds the predefined load factor (0.75 by default), the whole map will be rehashed.

This is an excerpt from the JavaDoc (http://download.oracle.com/javase/6/docs/api/java/util/HashMap.html):

The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

So I would recommend predefining the map capacity, guessing that every word is unique:

Map<String, Integer> map = new HashMap<String, Integer>(al.size());

Without that, your solution is not efficient enough, though it is still linear (roughly 3n, which is still O(n)), because due to the amortized cost of rehashing, inserting the elements will cost about 3n operations instead of n.
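If you also want to account for the default load factor mentioned in the JavaDoc excerpt above, a slightly more generous sizing avoids any rehash even when every word really is unique. This is just a sketch; 0.75 is HashMap's documented default load factor.

// Capacity chosen so that expectedEntries / capacity never exceeds the
// default load factor of 0.75, per the JavaDoc excerpt above.
int expectedEntries = al.size();
Map<String, Integer> map =
        new HashMap<String, Integer>((int) Math.ceil(expectedEntries / 0.75));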

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow