Question

I have been reading "Version Control with Git" by J. Loeliger and M. McCullough and I found the following explanations of Git's internal structure and packed files, respectively:

"Git's internal database efficiently stores every version of every file - not their difference - as files go from one revision to the next. Because Git uses the hash of a file's complete content as the name for that file, it must operate on each complete copy of the file. It cannot base its work or its object store entries on only part of the file's content nor on the difference between two revisions of that file."

"To create a packed file, Git first locates files whose content is very similar and stores the complete content for one of them. It then computes the difference, or deltas, between similar files and stores just the difference."

Now they seem contradictory to me, with the first paragraph being wrong, because Git does store deltas of blobs (the deltas being blobs themselves). So why would the authors explain it this way? Or could someone bridge the gap between those two paragraphs? It seems to me that Git does fine with packed files without having the full snapshot. I have an example here from git-scm.com.

Solution

The two paragraphs are talking about different layers of the system.

Git is based on an object database, where the only objects are commits, trees, blobs and tags. These are the objects that users can work with, and none of them represents a change as such: patches and diffs are all generated on demand.
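
To make the "hash of the complete content" point concrete, here is a minimal Python sketch of how a blob gets its name. It assumes the classic SHA-1 object format (a `blob <size>` header, a NUL byte, then the content); the function name `blob_id` is made up for illustration and is not part of Git.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID a blob with this exact content would get.

    Assumes the SHA-1 object format: b"blob <size>\\0" followed by the content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two revisions of a file that differ by one character get unrelated names,
# because the name is derived from the whole content, not from a change.
v1 = b"hello world\n"
v2 = b"hello world!\n"
print(blob_id(v1))
print(blob_id(v2))
```

If you have Git at hand, the printed IDs should agree with what `git hash-object` reports for files with the same bytes. There is no way to compute such a name from a diff alone, which is why the object model deals in complete files.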

Git does use delta encoding to pack objects together for storage, but this is essentially an implementation detail of the storage system and wire protocol, not part of the fundamental model of how Git works. It is entirely possible for Git to work without doing delta encoding for storage (and this is exactly how it started out), or for a different implementation to store the objects using an incompatible encoding. Notably, the way that deltas are stored often bears no resemblance to the changes that you actually see in diffs: for example, the deltas are computed over the byte sequences of the objects, not over lines. These deltas are all abstracted away, and you have to engage in some hacking to see them at all.
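
As a rough illustration of what a byte-level delta looks like, here is a Python sketch that describes one byte string as "copy this range from the base" and "insert these literal bytes" instructions. This is not Git's actual binary encoding or its algorithm for finding matches: `difflib.SequenceMatcher` stands in for the match search, and `make_delta` / `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # (offset in base, length)
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal new bytes
        # "delete": base bytes absent from target need no instruction
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full target from the base plus the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
assert apply_delta(base, delta) == target
```

Note that the instructions work on byte offsets and pay no attention to line boundaries, so they look nothing like the line-oriented diff a user would see for the same change.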

So the point the authors were trying to make is that Git's fundamental modes of operation are all based on complete files, and that operations such as git log -p are in fact calculating the diffs on the fly, not simply showing what is stored. They are honest enough to point out that the on-disk storage may involve storing deltas, but these are a low-level concept.
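
The "diffs on the fly" idea can be sketched in Python as follows. Git has its own diff machinery; here the standard library's `difflib` stands in for it, and `diff_snapshots` and the file name `notes.txt` are invented for the example. The point is only that the inputs are two complete snapshots and the patch is derived from them on request.

```python
import difflib

def diff_snapshots(old: str, new: str, path: str) -> str:
    """Compute a unified diff from two complete versions of a file."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    ))

# Both revisions are available in full; no stored patch is consulted.
old = "alpha\nbeta\ngamma\n"
new = "alpha\nBETA\ngamma\n"
print(diff_snapshots(old, new, "notes.txt"))
```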

The rules for pack files include that any one pack file must be self-contained: that is, if an object is stored in a pack file as a delta, the base object must also be stored in the same pack file. Deltas can be chained together up to a limit, but you can always get an object out of a pack file without having to go outside of it. When Git internally needs an object from the pack, it applies the deltas to produce the full object; it generally won't operate on the deltified representation at all. (AFAIK the only exception is when copying the object into another pack, where deltas may be copied as-is.)
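
A toy model of that lookup, in Python, might look like the sketch below. The dictionary layout, the entry tags, the IDs `v1`/`v2`/`v3` and the `MAX_DEPTH` value are all assumptions made for illustration; a real pack file is a binary format with an index, not a dictionary.

```python
MAX_DEPTH = 50  # illustrative cap on chain length for this sketch

def resolve(pack: dict, object_id: str, depth: int = 0) -> bytes:
    """Return the full content of an object, applying any chain of deltas."""
    if depth > MAX_DEPTH:
        raise ValueError("delta chain too long")
    entry = pack[object_id]  # a KeyError here would mean the pack is not self-contained
    if entry[0] == "full":
        return entry[1]
    _, base_id, ops = entry
    base = resolve(pack, base_id, depth + 1)  # the base lives in the same pack
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[1] + op[2]]  # (offset, length) within the base
        else:
            out += op[1]                      # literal bytes
    return bytes(out)

pack = {
    "v1": ("full", b"hello world\n"),
    "v2": ("delta", "v1", [("copy", 0, 11), ("insert", b"!\n")]),
    "v3": ("delta", "v2", [("insert", b"oh, "), ("copy", 0, 13)]),
}
print(resolve(pack, "v3"))
```

Callers of `resolve` only ever receive complete content, which mirrors the way the rest of Git sees whole objects regardless of how the pack stored them.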

Licensed under: CC-BY-SA with attribution