Question

I'm developing a program which can find the differences in files between two folders. I've made a method which traverses the folder structure of a given folder and builds a tree: each node contains a list of the files in that folder, and each node's children correspond to the subfolders of that folder.

Now the problem is to find the files present in one tree but not in the other. I have a method, "private List<MyFile> Diff(Node index1, Node index2)", which should do this. But the problem is the way I'm comparing the trees: it takes a huge amount of time. When each of the input nodes contains about 70,000 files, the Diff method takes about 3-5 minutes to complete.

I'm currently doing it this way:

private List<MyFile> Diff(Node index1, Node index2)
{
    List<MyFile> DifferentFiles = new List<MyFile>();

    List<MyFile> Index1Files = FindFiles(index1);
    List<MyFile> Index2Files = FindFiles(index2);

    List<MyFile> JoinedList = new List<MyFile>();
    JoinedList.AddRange(Index1Files);
    JoinedList.AddRange(Index2Files);
    List<MyFile> JoinedListCopy = new List<MyFile>();
    JoinedListCopy.AddRange(JoinedList);
    List<string> ChecksumList = new List<string>();

    // Whenever a checksum has been seen before, remove every file with that
    // checksum from the copy; whatever is left occurs only once overall.
    foreach (MyFile m in JoinedList)
    {
        if (ChecksumList.Contains(m.Checksum))
        {
            JoinedListCopy.RemoveAll(x => x.Checksum == m.Checksum);
        }
        else
        {
            ChecksumList.Add(m.Checksum);
        }
    }

    return JoinedListCopy;
}

And the Node class looks like this:

class Node
{
    private string _Dir;
    private Node _Parent;
    private List<Node> _Children;
    private List<MyFile> _Files;
}
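
For reference, MyFile and FindFiles are not shown above. A minimal sketch of what they might look like, assuming MyFile carries a path and a checksum and that Node exposes its files and children through properties (the names here are illustrative, not the real code):

    // Illustrative only -- the real MyFile may carry more metadata.
    class MyFile
    {
        public string Path { get; set; }
        public string Checksum { get; set; }
    }

    // Recursively flattens a node and all of its descendants into one file list.
    // Assumes Node exposes Files and Children properties over its private fields.
    private List<MyFile> FindFiles(Node node)
    {
        List<MyFile> result = new List<MyFile>(node.Files);
        foreach (Node child in node.Children)
        {
            result.AddRange(FindFiles(child));
        }
        return result;
    }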

Solution

Rather than doing lots of searching through List structures (which is slow, since each lookup is a linear scan), you can put all of the checksums into a HashSet, which can be searched much more efficiently.

private List<MyFile> Diff(Node index1, Node index2)
{
    var Index1Files = FindFiles(index1);
    var Index2Files = FindFiles(index2);

    // the checksums that are present in both trees
    var intersection = new HashSet<string>(Index1Files.Select(file => file.Checksum)
         .Intersect(Index2Files.Select(file => file.Checksum)));

    return Index1Files.Concat(Index2Files)
        .Where(file => !intersection.Contains(file.Checksum))
        .ToList();
}
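
HashSet lookups are effectively O(1), so the whole diff becomes roughly linear in the number of files instead of quadratic. The call site does not need to change; for example (BuildTree is a hypothetical name standing in for whatever builds the folder trees):

    // Hypothetical usage -- rootA and rootB are the roots of the two folder trees.
    Node rootA = BuildTree(@"C:\FolderA");
    Node rootB = BuildTree(@"C:\FolderB");

    List<MyFile> differentFiles = Diff(rootA, rootB);
    Console.WriteLine(differentFiles.Count + " files differ between the two folders.");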

OTHER TIPS

How about:

    public static IEnumerable<MyFile> FindUniqueFiles(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
    {
        HashSet<string> hash = new HashSet<string>();

        foreach (var file in index1.Concat(index2))
        {
            // Add returns false when the checksum is already in the set,
            // so a checksum that is seen a second time is removed again.
            if (!hash.Add(file.Checksum))
            {
                hash.Remove(file.Checksum);
            }
        }

        // Only checksums that were seen exactly once are still in the set.
        return index1.Concat(index2).Where(file => hash.Contains(file.Checksum));
    }

This works on the assumption that a single tree will not contain duplicate checksums; Servy's answer will work in all instances.
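
If a single tree can contain duplicate checksums, one way to keep the same single-pass flavour is to count occurrences instead of toggling set membership. A sketch along those lines (not from the original answer; it mirrors the original Diff semantics of keeping files whose checksum occurs exactly once across both lists):

    public static IEnumerable<MyFile> FindUniqueFilesByCount(IEnumerable<MyFile> index1, IEnumerable<MyFile> index2)
    {
        // Count how many times each checksum appears across both trees.
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (var file in index1.Concat(index2))
        {
            counts.TryGetValue(file.Checksum, out int n);
            counts[file.Checksum] = n + 1;
        }

        // Keep only the files whose checksum occurs exactly once overall.
        return index1.Concat(index2).Where(file => counts[file.Checksum] == 1);
    }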

Are you keeping the entire FileSystemObject for every element in the tree? If so, I would think your memory overhead would be gigantic. Why not just put the filename or checksum into a list and do the comparisons on that?

I can see that this is more than just a "distinct" operation: what you are really looking for is all instances that occur only once in the JoinedListCopy collection, not simply a list of all distinct instances in it.

Servy has a very good answer, but I would suggest a different approach, which utilizes some of LINQ's more interesting features, or at least I find them interesting.

var diff_Files = (from a in Index1Files
                  join b in Index2Files
                  on a.Checksum equals b.Checksum
                  where !(Index2Files.Contains(a) || Index1Files.Contains(b))
                  select a).ToList();

Another way to structure that "where", which might work better since the file instances might not actually be identical as far as object equality is concerned:

where !(Index2Files.Any(c => c.Checksum == a.Checksum) || Index1Files.Any(c => c.Checksum == b.Checksum))

This compares the individual checksums rather than the entire file object instances.

The basic strategy is essentially what you are already doing, just a bit more efficient: join the collections and filter them against each other so that you only get entries that are unique.

Another way to do this is to use the counting functions in LINQ:

var diff_Files = JoinedListCopy.Where(a => JoinedListCopy.Count(b => b.Checksum == a.Checksum) == 1).ToList();

Nested LINQ isn't always the most efficient thing in the world, but that should work fairly well: it gets all instances that occur only once. I actually like that approach best, since it has the least chance of messing something up, but the join I used first might be more efficient.
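
If that quadratic rescan becomes a problem at 70,000 files, the same "occurs exactly once" idea can be expressed with a single GroupBy pass (a sketch in the same spirit, not from the original answer):

    var diff_Files = JoinedListCopy
        .GroupBy(file => file.Checksum)      // bucket the files by checksum
        .Where(g => g.Count() == 1)          // keep checksums that occur only once
        .Select(g => g.First())              // unwrap the single file in each bucket
        .ToList();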

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow