Finding duplicate Office file content with C#

https://stackoverflow.com/questions/19364090

30-06-2022

Question

I calculate a checksum for each file and compare it with the others to find duplicate files. For Office files, however, the SharePoint properties are included in the file, so two copies of the same document stored in different locations, for example, do not have the same checksum.

My idea is to open the file in a MemoryStream, unzip the content XML (word/document.xml for Word), and calculate the checksum from that, or use the CRC property exposed by my zip library. That way the document properties are excluded and only (part of) the content is hashed.

This works well for Word, but for Excel and PowerPoint the content of the document is spread across several files in a folder.

First, do you think this is the right approach? Second, how can I combine the CRC values of the individual files into a single CRC that represents the content folder?

For Word: /word
For Excel: /xl/worksheets
For PowerPoint: /ppt/slides

Solution

Using a CRC of the files with the SharePoint metadata stripped seems appropriate, as long as the CRC is long enough to make collisions statistically unlikely for the number of files you are indexing.

Why are you trying to combine them into a folder-based CRC, and how are you planning to combine them? If you are thinking of simply summing the CRCs to get a value for the folder, that sum is not guaranteed to be unique.
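If a single value per document is still wanted, one alternative to summing CRCs is to feed every content part, in a fixed order, into one hash. The following is only a minimal sketch under some assumptions: it uses System.IO.Compression instead of the asker's unnamed zip library, it swaps the CRC for SHA-256, and the names OfficeContentHash and Compute are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class OfficeContentHash
{
    // Hypothetical helper: hashes only the zip entries whose path starts with
    // contentPrefix, e.g. "word/", "xl/worksheets/" or "ppt/slides/".
    // Entries outside the prefix (such as document properties) are ignored.
    public static string Compute(Stream package, string contentPrefix)
    {
        using (var zip = new ZipArchive(package, ZipArchiveMode.Read, true))
        using (var sha = SHA256.Create())
        {
            // Sort by name so the result does not depend on the entry order in the zip.
            var parts = zip.Entries
                .Where(e => !e.FullName.EndsWith("/", StringComparison.Ordinal)
                         && e.FullName.StartsWith(contentPrefix, StringComparison.OrdinalIgnoreCase))
                .OrderBy(e => e.FullName, StringComparer.Ordinal);

            var buffer = new byte[81920];
            foreach (var part in parts)
            {
                // Include the part name so that moving content between parts changes the hash.
                byte[] name = Encoding.UTF8.GetBytes(part.FullName + "\n");
                sha.TransformBlock(name, 0, name.Length, null, 0);

                using (Stream content = part.Open())
                {
                    int read;
                    while ((read = content.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        sha.TransformBlock(buffer, 0, read, null, 0);
                    }
                }
            }

            sha.TransformFinalBlock(new byte[0], 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", "");
        }
    }
}
```

A call such as `OfficeContentHash.Compute(stream, "ppt/slides/")` would then yield one string per presentation. Unlike a sum of CRCs, the result changes when content moves from one part to another, because the part names and the order are part of the hashed data.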

I think it would be better to create a dictionary that uses the CRC as the key and then simply use Dictionary.ContainsKey for lookup and comparison,

or to watch for duplicate entries by calling Dictionary.Add and catching the ArgumentException that is thrown when the key already exists.
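The dictionary lookup can be sketched as follows. This is a minimal illustration, not a definitive implementation: DuplicateFinder and GroupByHash are hypothetical names, the checksum is assumed to have been computed elsewhere and passed in as a string, and TryGetValue is used as a variant of the ContainsKey check that needs only one lookup.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Hypothetical helper: takes (path, checksum) pairs and groups the paths
    // by checksum. Any group with more than one path is a set of duplicates.
    public static Dictionary<string, List<string>> GroupByHash(
        IEnumerable<KeyValuePair<string, string>> pathAndHash)
    {
        var byHash = new Dictionary<string, List<string>>(StringComparer.Ordinal);

        foreach (var item in pathAndHash)
        {
            List<string> paths;
            // Same effect as ContainsKey followed by the indexer, in one lookup.
            if (!byHash.TryGetValue(item.Value, out paths))
            {
                paths = new List<string>();
                byHash.Add(item.Value, paths);
            }
            paths.Add(item.Key);
        }

        return byHash;
    }
}
```

Checking for the key before adding avoids using exceptions for control flow, which matters when many duplicates are expected. Because equal checksums only indicate probable duplicates, a byte-by-byte comparison of the grouped files can confirm each match.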

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow