Question

I'm going to create a music library program, easy. Storing the information, easy.

I previously looked at another music library made in C#. Its author claimed that even if you move a file, on rediscovery the library will still know all the information about that file, retrieved from the database (XML, SQL).

More info on rediscovery: when you move files, you have to get the music library to rediscover them because its current information, such as the file path, is wrong. On rediscovery it will find the file, check it against the database, and update any information.

I thought this was impossible, until now. If you hash a file and use that hash as the key, you can always check a file against it to make sure it is the same one.

Please correct me if I'm wrong and confirm what I'm saying is true (that is the question).

  • File path isn't used in hashing the file. (I don't know how to hash)
  • Rehash after every ID3 tag write (changing the file changes the hash?)
  • Using the hash as a key/ID means that if the file is moved, it can still be matched to the information stored about it
  • Once the information is read out of the XML file (if we're using XML as a database), storing it in a dictionary is the quickest and best way to have the contents in memory

It is a question, it needs an answer, and it's about C#. I'm using C#, that's why it's specific. I'm doing background research and just wanted some expert opinion on what I've stated.


Solution

Answering your questions

  • file path should not be used when computing the hash, and neither should the filename or extension

  • rehashing after each ID3 tag write would solve your problem provided that all changes occur in your application

  • hash can safely be used as a key for your purposes (see below)

  • probably yes, a dictionary keyed by the hash sounds reasonable, if I understand you correctly
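
As a rough illustration of the points above, here is a minimal C# sketch of hashing a file's contents and using the hash as a dictionary key. SHA-256 is my choice, not something the question prescribes, and the names Track, LibrarySketch, HashFile and Rediscover are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical record type; the fields are illustrative only.
public sealed class Track
{
    public string FilePath { get; set; } = "";
    public string Title { get; set; } = "";
}

public static class LibrarySketch
{
    // Hash the file's bytes only. The path, name and extension never enter the hash.
    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // On rediscovery: look the file up by content hash and refresh its stored path.
    public static bool Rediscover(Dictionary<string, Track> library, string foundPath)
    {
        string key = HashFile(foundPath);
        if (library.TryGetValue(key, out var track))
        {
            track.FilePath = foundPath; // same content, new location
            return true;
        }
        return false; // unknown content: treat it as a new file
    }
}
```

Because this hashes the whole file, a tag edit made outside the application changes the key, so the old record would no longer be found.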

Possibility of repeated hash value

Depending on the hashing function you choose, if you go looking for another file with the same hash, you might find or generate one in a year, a millennium, a billion years, or not before the end of the world.

It's all a matter of probabilities. Check details of each hashing function to learn how low the probability of finding another file with the same hash is.
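
As a rough worked estimate (my own addition, assuming the hash behaves like a uniform random function with b output bits and the library holds n distinct files), the birthday approximation gives the chance of any accidental collision:

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}},
\qquad
n = 10^{6},\ b = 128:\quad \frac{10^{12}}{2^{129}} \approx 1.5 \times 10^{-27}
```

This only covers accidental collisions. Deliberately crafted collisions are a separate matter and are known to be feasible for MD5.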

Problem of changed tags in mp3 files

While this may be a problem, what you can do is hash only the part of the file that is not the ID3 tag. Tags take up a very small percentage of the file size: an ID3v1 tag is the last 128 bytes of the file, while an ID3v2 tag usually sits at the start.

So apply the hashing function only to the part of the file that will not be changing. Just skip the tag bytes at the start and/or end of the file when hashing.
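
A minimal sketch of that idea in C#, under stated assumptions: it skips an ID3v2 tag at the start (10-byte header with a synchsafe size) and an ID3v1 tag in the last 128 bytes, and ignores any other tag formats. AudioHashSketch and HashAudioOnly are hypothetical names, and SHA-256 is my choice.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class AudioHashSketch
{
    public static string HashAudioOnly(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            long start = 0;
            long end = stream.Length;

            // ID3v2 header: "ID3", version (2 bytes), flags (1 byte), size (4 synchsafe bytes).
            var header = new byte[10];
            if (ReadFully(stream, header, 10) == 10 &&
                header[0] == 'I' && header[1] == 'D' && header[2] == '3')
            {
                int size = (header[6] & 0x7F) << 21 | (header[7] & 0x7F) << 14
                         | (header[8] & 0x7F) << 7 | (header[9] & 0x7F);
                start = 10 + size;
                if ((header[5] & 0x10) != 0) start += 10; // optional footer (ID3v2.4)
            }

            // ID3v1: fixed 128-byte block at the very end, starting with "TAG".
            if (end - start >= 128)
            {
                var tail = new byte[3];
                stream.Seek(end - 128, SeekOrigin.Begin);
                if (ReadFully(stream, tail, 3) == 3 &&
                    tail[0] == 'T' && tail[1] == 'A' && tail[2] == 'G')
                {
                    end -= 128;
                }
            }
            if (start > end) start = end; // malformed tag size: hash an empty range

            // Hash only the bytes in [start, end).
            stream.Seek(start, SeekOrigin.Begin);
            using (var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
            {
                var buffer = new byte[81920];
                long remaining = end - start;
                while (remaining > 0)
                {
                    int read = stream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                    if (read <= 0) break;
                    hasher.AppendData(buffer, 0, read);
                    remaining -= read;
                }
                return BitConverter.ToString(hasher.GetHashAndReset()).Replace("-", "");
            }
        }
    }

    // Reads up to count bytes, tolerating short reads; returns how many were read.
    private static int ReadFully(Stream s, byte[] buf, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = s.Read(buf, total, count - total);
            if (n <= 0) break;
            total += n;
        }
        return total;
    }
}
```

With this approach the key survives tag edits made by other programs, as long as they only touch the tag regions handled here.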

OTHER TIPS

Yes, if you hash the file contents, then even if the file moves somewhere else, it will still result in the same hash when you do it again. So yes, you can totally identify files based on their content’s hash value (this is what Git does for example). As for creating a hash of a file, there are several questions that will tell you how to do it, for example this one.

Note though that due to ID3 tags and stuff, your files are not immutable, so hashing on the file contents might not be the best idea after all. If you change the tags of a file, its hash will change, resulting in a new file (at least for your application). Of course, if you change the tags within your application, then you can easily keep track of those changes and update the old record to use the new hash. The same idea could be applied to identifying the file based on its path too (if you move it within your application, you could just update its path in the database as well). The problem though is that both these actions are likely to happen outside of your application.

So both identification methods (hash of file contents, or file path) are somewhat flawed, but there is no real alternative for identifying the file.

Hashing will work for you. It basically creates a checksum based on all bytes in the file. A good hash gives each file a signature that is unique for practical purposes (you are more likely to win the lottery five times in a row than to find two different files with the same hash).

Problem is you need to read the entire file to calculate the hash. This might hurt performance a bit.

So on rediscovery you might want to first check whether the file size matches a stored one. If not, there is no need to read the entire file and calculate the hash. But you need to store the file size as well as the hash for that.
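
A minimal sketch of that size check in C#, assuming a hypothetical in-memory index from file size to the hashes stored with that size. SizeCheckSketch, FindKnownHash and knownBySize are illustrative names, and SHA-256 is my choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class SizeCheckSketch
{
    // Returns the stored hash that matches the file, or null if the file is unknown.
    public static string FindKnownHash(Dictionary<long, List<string>> knownBySize, string path)
    {
        long size = new FileInfo(path).Length; // cheap: file system metadata only
        if (!knownBySize.TryGetValue(size, out var candidates))
            return null; // no stored file has this size, so skip the expensive hash

        string hash = HashFile(path); // reads the whole file
        return candidates.Contains(hash) ? hash : null;
    }

    private static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }
}
```

The size check only rules files out. Two different files can share a size, so a size match still has to be confirmed by the hash.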

Some info on hashing (using the MD5 method)

http://www.fastsum.com/support/md5-checksum-utility-faq/md5-hash.php

Licensed under: CC-BY-SA with attribution