Question

I've had a look through the previous postings and could not find one that answers this question. If possible, could you please point me in the right direction?

I am making a C# WPF duplicate file finder using MD5, and I am storing each file name and its MD5 hash in a 2D array. This was the quickest way I could think of to implement it, but I am having a problem with it.

The code below shows what I am trying to do:

public void fileList(string filename)
{
    string[,] fileLocationHash;
    string[] files = Directory.GetFiles(filename, "*.*", 
      SearchOption.AllDirectories);
    for (int i = 0; i < files.Length; i++)
    {
        FileStream file = new FileStream(files[i], FileMode.Open);
        MD5 md5 = new MD5CryptoServiceProvider();
        byte[] retVal = md5.ComputeHash(file);
        file.Close();

        StringBuilder sb = new StringBuilder();

        for (int x = 0; x < retVal.Length; x++)
        {
            sb.Append(retVal[x].ToString("x2"));
        }
        string fileHash = sb.ToString();
        // 2D array to compare hash and find duplicates
        fileLocationHash = new string[,]
        {
            {files[i], fileHash}
        };
        resultTextbox.Text = resultTextbox.Text
          .Insert(resultTextbox.CaretIndex, fileHash + Environment.NewLine);
        resultTextbox.Text = resultTextbox.Text
          .Insert(resultTextbox.CaretIndex, files[i] + " - ");
    }
}

I am having problems implementing a for loop to go through the fileHash section of the 2D array and find duplicates. I can't seem to figure out how to select the second part of the array. I assumed that the following would work:

 var duplicates = fileLocationHash[]
         .GroupBy(g => g).Where(w => w.Count() > 1).Select(s => s.Key);
 foreach (var d in duplicates);

But this shows an error with fileLocationHash[], and I can't understand how I would keep an index of the found files, which I will need in order to print out the name of the file from the other section of the 2D array.

Solution

It looks like you are trying to get a map from each MD5 hash to the list of files with that hash. It may be better to express that directly in the data structure:

var hashToFiles = new Dictionary<string, List<string>>();

Now, when processing a new file, you have hash + fileName, so you can check whether the hash is already in the map and either add a new entry or update the existing one:

if (!hashToFiles.ContainsKey(hash))
{
  // first file with this hash: add new entry
  hashToFiles.Add(hash, new List<string>{fileName});
}
else
{
  // hash already seen: append to the existing list
  hashToFiles[hash].Add(fileName);
}
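
The same add-or-update step can also be done with a single lookup using TryGetValue. This is only a sketch: the HashIndex class and AddFile method are hypothetical names used for illustration, not part of the original code.

```csharp
using System.Collections.Generic;

public static class HashIndex
{
    // Hypothetical helper: records fileName under its hash,
    // creating the list the first time a hash is seen.
    public static void AddFile(
        Dictionary<string, List<string>> hashToFiles,
        string hash,
        string fileName)
    {
        if (!hashToFiles.TryGetValue(hash, out var files))
        {
            // First file with this hash: start a new list.
            files = new List<string>();
            hashToFiles.Add(hash, files);
        }

        // In both cases the file name ends up in the list for its hash.
        files.Add(fileName);
    }
}
```

Inside the question's loop, this would presumably be called as HashIndex.AddFile(hashToFiles, fileHash, files[i]), with hashToFiles declared once before the loop.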

With the map built, the only thing left is to find the entries with more than one element:

var keyValueForDups = hashToFiles.Where(item => item.Value.Count > 1);
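
To show the duplicates, each remaining entry can be turned into a line of text. The sketch below assumes the Dictionary<string, List<string>> map described above. DuplicateReport and Describe are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateReport
{
    // Hypothetical helper: returns one line of text per group of
    // files that share the same hash, in the form "hash: file1, file2".
    public static List<string> Describe(
        Dictionary<string, List<string>> hashToFiles)
    {
        return hashToFiles
            .Where(item => item.Value.Count > 1)   // keep only duplicated hashes
            .Select(item => item.Key + ": " + string.Join(", ", item.Value))
            .ToList();
    }
}
```

In the WPF code from the question, the result could then be shown with something like resultTextbox.Text = string.Join(Environment.NewLine, DuplicateReport.Describe(hashToFiles));. Note that a Dictionary does not guarantee any particular order of entries.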

Notes:

  • SHA256 is generally a better choice than the outdated MD5, but MD5 is OK for your purposes
  • in your current code you recreate the array on every loop iteration, so it only ever holds the last file; use a list and append to it instead
  • use a custom class to hold the {hash, fileName} pair to make the code more readable
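
The notes above could be combined roughly as follows. This is a sketch under assumptions, not a definitive implementation: FileEntry, DuplicateFinder, HashFile, Scan and FindDuplicates are hypothetical names, and SHA256 is swapped in for the MD5 used in the question. The list is built once and appended to, and GroupBy (the approach attempted in the question) finds the duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical type holding one {hash, fileName} pair.
public sealed class FileEntry
{
    public FileEntry(string fileName, string hash)
    {
        FileName = fileName;
        Hash = hash;
    }

    public string FileName { get; }
    public string Hash { get; }
}

public static class DuplicateFinder
{
    // Hashes one file with SHA256 and returns the digest as lowercase hex.
    // The using blocks dispose the stream even if hashing throws.
    public static string HashFile(string path)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Builds the list once and appends one entry per file found.
    public static List<FileEntry> Scan(string directory)
    {
        var entries = new List<FileEntry>();
        foreach (string path in Directory.GetFiles(directory, "*.*", SearchOption.AllDirectories))
        {
            entries.Add(new FileEntry(path, HashFile(path)));
        }
        return entries;
    }

    // Groups entries by hash and keeps only groups with more than one file.
    // Each inner list holds the file names that share one hash.
    public static List<List<string>> FindDuplicates(IEnumerable<FileEntry> entries)
    {
        return entries
            .GroupBy(e => e.Hash)
            .Where(g => g.Count() > 1)
            .Select(g => g.Select(e => e.FileName).ToList())
            .ToList();
    }
}
```

Because each group keeps the file names alongside the shared hash, there is no need to track array indices to print the names of the duplicated files.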
Licensed under: CC-BY-SA with attribution