Find the duplicated line in a large file Java

Question 1

The most efficient implementation really depends on your requirements.

From what you've written: So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates., I assume you're only looking to check whether there is a duplicate line.

In such case you don't need to count how many duplicates there are and using the HashSet and the old, good string hashing function might be good enough (or even better).

Here's the example:

boolean hasDuplicate = false;
Set<String> lines = new HashSet<String>();
while ( (line = list.readLine()) != null && !hasDuplicate )
    {
        if (lines.contains(line)) {
            hasDuplicate = true;
        }
        lines.add(line);
    }

    if (hasDuplicate){
        System.out.print("NOT UNIQUE");
    } else {
        System.out.print("UNIQUE");
    }
    list.close();
}

Question 2

This is a well known problem called Count-Distinct Problem There are various algorithms :

In Java you can use BitSet

Question 3

The key in your map is String so you cannot put integer as a key. try

if ( !map.containsKey("" + 1)  )

If you are trying to find duplicate. Maybe you can do this:

boolean flag = false;
while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line)) 
        {
            map.put(line, 0);
        }
        else 
        {
            flag = true;
            break;
        }
    }

    if (flag )
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
    list.close();
}

Also since you are not using the value just the key, you can use HashSet instead.

Question 4

since you are simply inserting line and occurrence. later you are retrieving one by one so no need of sorted map you can use HashMap.

and since key type is String so integer can not be passed.

i think you want know the line whose occurrence is one. so you can try:

if(map.get(line)!=1)

{

System.out.print("NOT UNIQUE");

}

else

{

System.out.print("UNIQUE");

}

Question 5

All you need to know is that Set doesn't allow duplicates in Java. Which means if you have added an element into Set and trying to insert duplicate element again, it will not be allowed. In Java, you can use HashSet class to solve this problem. Just loop over array elements, insert them into HashSet using add() method and check return value. If add() returns false it means that element is not allowed in the Set and that is your duplicate. Here is the code sample to do this :

for (String name : names) {
 if (set.add(name) == false) {
    // your duplicate element
 }}

Complexity of this solution is O(n) because you are only going through array one time, but it also has space complexity of O(n) because of HashSet data structure, which contains your unique elements. So if an array contains 1 million elements, in worst case you would need an HashSet to store those 1 million elements.

Question 6

Class cast Exception occurs because the datatypes are different. In case of TreeMap it doesn't support heterogeneous datatype.