Question

I have a large directory that contains only CS and Math material. It is over 16 GB in size. The file types are text, png, pdf, and chm. I currently have two branches: my brother's and mine. The initial files were the same, and I need to compare them. I have tried using Git, but the loading time is very long.

What is the best way to compare two big directories?

[Mixed Solution]

  1. Do a "ls -R > different_files" in both directories [1]
  2. "sdiff <(echo file1 | md5deep) <(echo file2 | md5deep)" [2]

What do you think? Any drawbacks?

[1] Thanks to Paul Tomblin. [2] Many thanks to all who replied!

Solution

How to compare 2 folders without pre-existing commands/products:

Simply create a program that scans each directory and creates a file hash of each file. It outputs a file with each relative file path and the file hash.

Run this program on both folders.

Then you simply compare the 2 output files to see if they are the same: load each one into a string and do a string comparison.

The hashing algorithm you use doesn't matter much for this purpose: MD5, SHA-1, CRC, and so on will all do. You could also include the file size in the output files to help reduce the chance of collisions.
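
As a rough illustration (my own sketch, not part of the original answer; it assumes GNU coreutils, and hashdir.sh is just a made-up name):

    #!/bin/sh
    # hashdir.sh -- print "hash size relative-path" for every file
    # under the directory given as the first argument.
    # Caveat: filenames containing newlines will break this simple loop.
    cd "$1" || exit 1
    find . -type f | LC_ALL=C sort | while IFS= read -r f; do
        hash=$(md5sum "$f" | cut -d' ' -f1)   # content hash (any algorithm would do)
        size=$(wc -c < "$f")                  # file size, as an extra guard against collisions
        printf '%s %s %s\n' "$hash" "$size" "$f"
    done

Run it on both folders and diff the results:

    sh hashdir.sh dir1 > out1
    sh hashdir.sh dir2 > out2
    diff out1 out2    # no output means the trees match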

How to compare 2 folders with pre-existing commands/products:

Now if you just want an existing program that does it, use diff -r on Unix-like systems, or WinDiff on Windows.
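
For example (dir1 and dir2 being placeholder names):

    diff -r dir1 dir2     # show the actual differences, file by file
    diff -rq dir1 dir2    # only report which files differ or exist on one side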

OTHER TIPS

Use FSlint. One of the options of the tool is "Duplicates". As per the description from the site: "One of the most commonly used features of FSlint is the ability to find duplicate files. The easiest way to remove lint from a hard drive is to discard any duplicate files that may exist. Often a computer user may not know that they have four, five, or more copies of the exact same song in their music collection under different names or directories. Any file type, whether it be music, photos, or work documents, can easily be copied and replicated on your computer. As the duplicates are collected, they eat away at the available hard drive space. The first menu option offered by FSlint allows you to find and remove these duplicate files."

Use md5deep to create recursive md5sum listings of every file in those directories.

You can then use a diff tool to compare the generated listings.
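
For example (a sketch assuming md5deep is installed; -r enables recursion, -l makes the paths relative, and sorting makes the listings comparable regardless of traversal order):

    (cd dir1 && md5deep -r -l . | sort) > dir1.md5
    (cd dir2 && md5deep -r -l . | sort) > dir2.md5
    diff dir1.md5 dir2.md5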

Are you just trying to discover what files are present in one that aren't in the other, and vice versa? A couple of suggestions:

  1. Do a "ls -R" in both directories, redirect to files, and diff the files.

  2. Do a "rsync -n" between them to see what rsync would have to copy if it were to be allowed to copy. (-n means don't do the rsync, just show you what it would do if you ran it without the -n)

I would do the diffing by comparing the output of md5sum * | sort from each directory.

That will point you to the files that are different or missing.
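
For example, in bash with process substitution (note that * only covers the top level of each directory; the find variant below handles subdirectories):

    diff <(cd dir1 && md5sum * | sort) <(cd dir2 && md5sum * | sort)

    # recursive variant
    diff <(cd dir1 && find . -type f -exec md5sum {} + | sort) \
         <(cd dir2 && find . -type f -exec md5sum {} + | sort)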

I know this question has already been answered. However, if you are not into writing such a tool yourself, there's a well-working open source project named tardiff, available on SourceForge, which does basically exactly what you want and even supports automated creation of patches (in tar format, obviously) to account for differences.

Hope this helps

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow