Question

So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates.

I put the lines in a TreeMap so that they are sorted, put "lines" into key and give "1" to their value. When there is a duplicate, the value of the line stacks up. Then I will have to see if there is any value that is not 1.

Here is my code:

    BufferedReader list = new BufferedReader( new FileReader( args[0] ) );
    String line;
    TreeMap<String,Integer> map  = new TreeMap<String,Integer>();

    while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line)) 
        {
            map.put(line, 0);
        }
        map.put(line, map.get(line) + 1);   
    }

    if ( !map.containsKey(1)  )
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
    list.close();
}

Question:

  1. Will the use of TreeMap speed up the process? Or using HashMap will have the same/faster speed?

  2. The output:

    Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer at java.lang.Integer.compareTo(Integer.java:52) at java.util.TreeMap.getEntry(TreeMap.java:346) at java.util.TreeMap.containsKey(TreeMap.java:227) at Lab10.main(Lab10.java:22)

which is if ( !map.containsKey(1) ) , but I don't know what went wrong.

Was it helpful?

Solution

The most efficient implementation really depends on your requirements.

From what you've written: So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates., I assume you're only looking to check whether there is a duplicate line.

In such case you don't need to count how many duplicates there are and using the HashSet and the old, good string hashing function might be good enough (or even better).

Here's the example:

boolean hasDuplicate = false;
Set<String> lines = new HashSet<String>();
while ( (line = list.readLine()) != null && !hasDuplicate )
    {
        if (lines.contains(line)) {
            hasDuplicate = true;
        }
        lines.add(line);
    }

    if (hasDuplicate){
        System.out.print("NOT UNIQUE");
    } else {
        System.out.print("UNIQUE");
    }
    list.close();
}

OTHER TIPS

This is a well known problem called Count-Distinct Problem There are various algorithms :

In Java you can use BitSet

The key in your map is String so you cannot put integer as a key. try

if ( !map.containsKey("" + 1)  )

If you are trying to find duplicate. Maybe you can do this:

boolean flag = false;
while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line)) 
        {
            map.put(line, 0);
        }
        else 
        {
            flag = true;
            break;
        }
    }

    if (flag )
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
    list.close();
}

Also since you are not using the value just the key, you can use HashSet instead.

since you are simply inserting line and occurrence. later you are retrieving one by one so no need of sorted map you can use HashMap.

and since key type is String so integer can not be passed.

i think you want know the line whose occurrence is one. so you can try:

if(map.get(line)!=1)

{

System.out.print("NOT UNIQUE");

}

else

{

System.out.print("UNIQUE");

}

All you need to know is that Set doesn't allow duplicates in Java. Which means if you have added an element into Set and trying to insert duplicate element again, it will not be allowed. In Java, you can use HashSet class to solve this problem. Just loop over array elements, insert them into HashSet using add() method and check return value. If add() returns false it means that element is not allowed in the Set and that is your duplicate. Here is the code sample to do this :

for (String name : names) {
 if (set.add(name) == false) {
    // your duplicate element
 }}

Complexity of this solution is O(n) because you are only going through array one time, but it also has space complexity of O(n) because of HashSet data structure, which contains your unique elements. So if an array contains 1 million elements, in worst case you would need an HashSet to store those 1 million elements.

Class cast Exception occurs because the datatypes are different. In case of TreeMap it doesn't support heterogeneous datatype.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top