What's the name of this hashing/caching/versioning algorithm?

https://stackoverflow.com/questions/5842815

27-10-2019
|

Question

I saw it in a presentation a few weeks back, tried to implement it, failed and forgot about it. But now I wanna know how it works =)

It's a way of efficiently transfering/storing data. It would work in any language. This is what (I think) it does:

You have 1 very big file (eg entire javascript collection of a website).

Split it in blocks of 48 bytes
Hash every block of 48 bytes (eg. MD5)
Split the list of blocks on hashes that end with 0x00
The big blocks (>= 1 hash) should now be different sizes. Some very big, some very small.
Glue the blocks between those hashes (again: very different sizes of actual data)
Hash those blocks
Now you have a list of hashes that represent the current version of the big file

The idea is that when a piece of code changes in the big file, only 1 or 2 hashes change. With the new file, you do all those above steps and you only upload/download the parts (blocks, identifieable by its hash) that have actyally changed. Depending on how much code was changed and on the size of the blocks surrounding that code, you'll never need to download more than 4 blocks. (Instead of the whole file.) The other end of the communication would then replace the original blocks (same algorithm, same functionality) with the new blocks.

Sound familiar? They mentioned a name, but couldn't find anything on it. When I tried to build it, it just didn't work, because if you don't change exactly 48 bytes [1], ALL the hashes after that change [2] are different...

If someonw knows the name: great. If someone could explain it also: perfect!

UPDATE
I found the presentation it was in. It was mentioned (and is used) in a new product "Silo": http://research.microsoft.com/apps/pubs/default.aspx?id=131524 Related: http://channel9.msdn.com/Events/MIX/MIX11/RES04 (So it actually was Microsoft research! Neat!)

From the first link:

A Silo-enabled page uses this local storage as an LBFS-style chunkstore.

In the second link (a video), the good stuff starts at 6:30. Now I've seen it twice... I still don't get it =)

Keywords are Delta encoding and Rabin fingerprints.

Solution

This sounds ... kind of ... like how remote differential compression works;

In the Low Bandwidth File System (LBFS) [24], an RDC protocol is used to optimize the communication between a sender and a recipient by having both sides subdivide all of their files into chunks and compute strong checksums, or signatures, for each chunk. When a client needs to access or copy a file from the server, the latter first transmits the list of signatures for that file to the client, which determines which of its old chunks may be used to reconstruct the new file, and requests the missing chunks. The key to this protocol is that the files are divided independently on the client and server, by determining chunk boundaries from data features.

PDF http://research.microsoft.com/apps/pubs/default.aspx?id=64692

OTHER TIPS

You can solve the "changes which aren't a multiple of the block size" problem using rolling hashes. This is what rsync uses to transfer only changed parts of a file.

That sounds very much like shingling.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow