Question

Recently I was looking for a program that would run as a daemon and find files that have the same size/type, check whether they're actually identical, and if they are, turn both into hard links to a single copy. And I started wondering why operating systems don't do this automatically.

I thought maybe it's because it would be time-consuming, but it wouldn't need to re-check as long as no new files were added outside of the cache directory, and checking the size first would rapidly cut down the search space. Then I thought maybe it's because it doesn't come up very often; but if that were the case then I would expect game consoles to do this, because most games use the same stock sound effects package, for instance, but they don't. Having two games from one series takes the same amount of space as simply summing the two sizes, even though tons of assets would be reused.
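
For concreteness, what I have in mind is roughly the sketch below (only a sketch, not a finished daemon): group files by size, hash only the groups with more than one member, then hard-link duplicates to a single copy.

    import hashlib
    import os
    from collections import defaultdict

    def sha256(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def dedup(root):
        # Pass 1: group by size, which is cheap and cuts the search space.
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

        # Pass 2: hash only the candidates, then hard-link the duplicates.
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            by_hash = defaultdict(list)
            for path in paths:
                by_hash[sha256(path)].append(path)
            for original, *duplicates in by_hash.values():
                for dup in duplicates:
                    os.remove(dup)          # drop the redundant copy...
                    os.link(original, dup)  # ...and point its name at the original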

Or take a system like YouTube: they check videos against other videos when looking for copyright violations, but they don't seem to store two identical videos only once, considering how mirroring a video can prevent it being taken off the site (e.g. when 'youtube vs the users' kept being mirrored, they took it out of the search results rather than continuing to take the copies off the site).

So, what's the reason the system doesn't compress things this way?

Solution

It's called deduplication.

Some filesystems do it (like ZFS), some block-level storage systems do it (like NetApp), some backup systems do it (rsnapshot), and source code management systems do it (Git, bzr, fossil).

It's not so rare; it's just that until recently it was an expensive option for generic filesystems.

Note that it's not a good idea to do it as you suggest (hard links) for general-use volumes, since editing one 'copy' would edit the other one too; you would have to take care of breaking the link first. Some applications never "edit" files in place: on each "save" a new file is created and then renamed to replace the original. In those cases, yes, it would be reasonably safe to keep hard links; but do you want to audit every application you use on your files? Much easier to keep separate copies separate.
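
For example, before an in-place edit a program could check the link count and give the file its own private copy first (a rough sketch; break_hardlink is a made-up helper, not a standard function):

    import os
    import shutil

    def break_hardlink(path):
        # If other names point at the same inode, copy the data out and
        # atomically swap the copy in; the other names keep the old data.
        if os.stat(path).st_nlink > 1:
            tmp = path + ".tmp"
            shutil.copy2(path, tmp)
            os.replace(tmp, path)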

Other tips

There are filesystems that do this, btrfs or ZFS for example, and not (just) for whole files but also for individual extents.

Dropbox also does this (or at least used to). Uploading a large file that another user already uploaded takes only a small amount of time, because it isn't actually uploaded. The client sends a hash of the file to the server, and when the server already knows about the file, it will tell the client to stop uploading.
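
The idea, roughly, looks like this (a sketch of the concept only, not Dropbox's actual protocol or API):

    import hashlib

    known_blobs = {}  # server side: content hash -> stored data

    def upload(path):
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        if digest in known_blobs:
            return digest            # server already has it: nothing to transfer
        known_blobs[digest] = data   # otherwise do the real upload
        return digest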

The problem is, by doing this in the background you are changing the mutability semantics of the system in a way that people just wouldn't expect.

Consider the following workflow:

  1. I create a wonderful piece of art in myasciidrawing.txt.
  2. I decide that I want to create a similar piece of art, so I copy myasciidrawing.txt to awesomeasciidrawing.txt and start editing it.
  3. Some time later, happy with my creation I save it.
  4. Later I go back to look at myasciidrawing.txt and find that it has the contents of awesomeasciidrawing.txt and I've lost the original!

What happened is that between steps 2 and 3 the wonderful space-saving background deduplication routine identified that myasciidrawing.txt and awesomeasciidrawing.txt had the same contents, deduplicated and linked them, so that when I saved awesomeasciidrawing.txt it overwrote myasciidrawing.txt too!

Worse than that, whether myasciidrawing.txt and awesomeasciidrawing.txt have the same contents after step 3 depends on what software you are using to edit.

If you are using software which edits the original in place, the link will mean that both appear to be edited at the same time. If the software renames the old file to '.bak' and then writes a new file with the same name, then myasciidrawing.txt and awesomeasciidrawing.txt.bak will both contain the original drawing, but awesomeasciidrawing.txt will point to the updated contents.
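
A small demonstration of the difference, using the file names from the example above (run it in an empty directory):

    import os

    def setup():
        for name in ("myasciidrawing.txt", "awesomeasciidrawing.txt",
                     "awesomeasciidrawing.txt.bak"):
            if os.path.exists(name):
                os.remove(name)
        with open("myasciidrawing.txt", "w") as f:
            f.write("original art")
        os.link("myasciidrawing.txt", "awesomeasciidrawing.txt")  # simulate dedup

    # Case 1: the editor writes the file in place -> both names change.
    setup()
    with open("awesomeasciidrawing.txt", "w") as f:
        f.write("new art")
    print(open("myasciidrawing.txt").read())   # "new art": the original is gone

    # Case 2: the editor renames the old file to .bak and writes a fresh file.
    setup()
    os.rename("awesomeasciidrawing.txt", "awesomeasciidrawing.txt.bak")
    with open("awesomeasciidrawing.txt", "w") as f:   # brand new inode
        f.write("new art")
    print(open("myasciidrawing.txt").read())   # "original art", shared with the .bak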

This is one of the reasons that deduplicating filesystems tend to use copy-on-write semantics, since any deduplicated data is, by definition, shared data.
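
For instance, on Linux filesystems that support it (btrfs, XFS), a "copy" can be made as a reflink, so that both files share the same extents until one of them is written to. This is the mechanism behind cp --reflink; a minimal sketch, assuming Linux:

    import fcntl

    FICLONE = 0x40049409  # the Linux FICLONE ioctl, i.e. _IOW(0x94, 9, int)

    def reflink_copy(src, dst):
        # dst shares src's extents; later writes to either side get their own blocks.
        with open(src, "rb") as s, open(dst, "wb") as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())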

I'm going to assume you mean resource-type files here, e.g. images, songs, sounds, etc. If you're talking about sharing executable code, this path has already been well trodden: Windows has had its DLL Hell, whilst Unix gets round this by compiling an all-singing, all-dancing executable.

The main problem with resource-type files, to my mind, comes with editing. Say you have a photo and you want to make a copy to clip it, correct it, enhance it, etc. Clearly you wouldn't want the edit to also update the source, as the raw information would be lost.

There is some mileage in applications that would do such a job - to control photo libraries & music collections etc. But I don't see the value in building this into the OS as standard.

Bear in mind also that some operating systems don't have the facility to create linked files as elegantly as Unix does.

Licensed under: CC-BY-SA with attribution