Question

I'm currently learning Git and have an issue understanding how Git retrieves files from a branch when you check it out.

From what I understand, a branch is just a pointer to a commit. So I believed that when you check out a branch, Git backtracks through all the commits, from parent to parent, starting from the commit the pointer is at. But I don't understand how it chooses a parent for a commit that has multiple parents, as in a merge.

For example:

[image: commit graph diagram]

Assuming I want to check out master, how does Git know that at point C5 it should go to C4 and not C8? Or have I totally misunderstood this?

When you check out a branch, how does Git know what files to put in your working tree?

Solution

Git is different from most other version control systems (VCS).

Most VCS-es store "deltas" of various forms. For instance, if the tip-most commit in the entire repository is C9, as identified by master, and you extract that, you might get all the files in the repository as is. If you extract C5 (an earlier commit than C9), though, the system starts with all the latest files, and then C5 says "undo this, undo that, undo the other thing"; the version-control system undoes those changes, and that gets you the state as of commit C5.
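
To make the delta approach concrete, here is a toy sketch of a "reverse delta" store. It is not modeled on any particular VCS; the names apply_undo, extract, latest and undo_chain are made up for illustration. The point is only that reaching an old revision means walking a chain of deltas, one step at a time.

```python
# Toy reverse-delta store: NOT how any particular VCS is implemented.
# The newest revision is stored in full; each older revision is stored
# only as the changes needed to step back from the next-newer one.

def apply_undo(files, undo):
    """Return a copy of `files` with one undo-delta applied."""
    result = dict(files)
    for path, old_content in undo.items():
        if old_content is None:
            result.pop(path, None)      # file did not exist in the older revision
        else:
            result[path] = old_content  # restore the older content
    return result

def extract(latest, undo_chain, steps_back):
    """Rebuild the revision `steps_back` steps before the latest one."""
    files = latest
    for undo in undo_chain[:steps_back]:
        files = apply_undo(files, undo)
    return files

latest = {"foo.c": "v3", "README": "hello"}
undo_chain = [
    {"foo.c": "v2"},                   # newest -> one step back
    {"foo.c": "v1", "README": None},   # one step back -> two steps back
]

print(extract(latest, undo_chain, 0))  # the newest revision, no deltas applied
print(extract(latest, undo_chain, 2))  # two deltas applied in sequence
```

In a scheme like this, the question "which parent do I follow?" would matter, because the result depends on the path of deltas taken.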

Again, git does not do this.

Instead, git's repository stores what git calls "objects". There are four types of objects: "commits", "annotated tags", "trees", and "blobs". We'll ignore annotated tags (they are not needed for this purpose) and just consider the other three.

Each object has a unique, 160-bit name: the SHA-1 hash of the object's contents plus a short header giving its type and size, usually written as 40 hexadecimal digits. Git assumes that no two different objects in the repository will ever compute the same SHA-1 (if they do, git explodes messily; but this has never happened). (But note that the same object, e.g., the same foo.c file in many commits, has one single unique SHA-1.)
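
As a rough illustration of how such a name is derived, here is a minimal Python sketch, assuming a SHA-1 repository. The function name git_object_id is my own; the layout it hashes (type, a space, the content length in decimal, a NUL byte, then the content) is the one git uses for objects.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 of '<type> <size>' + NUL, followed by the raw content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Identical content always yields the identical name,
# no matter how many commits refer to it.
print(git_object_id("blob", b""))   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("blob", b"test content\n"))
```

You can compare the result against git hash-object on a file with the same bytes.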

A commit object looks like this:

$ git cat-file -p 5f95c9f850b19b368c43ae399cc831b17a26a5ac
tree 972825cf23ba10bc49e81289f628e06ad44044ff
parent 9c8ce7397bac108f83d77dfd96786edb28937511
author Junio C Hamano <gitster@pobox.com> 1392406504 -0800
committer Junio C Hamano <gitster@pobox.com> 1392406504 -0800

Git 1.9.0

Signed-off-by: Junio C Hamano <gitster@pobox.com>

That is, it has a tree, a list of parents, an author-and-date, a committer-and-date, and a text message. That's all it has, too. Each parent line gives the SHA-1 of one parent commit; a root commit has no parents, and a merge has multiple parents, but most commits have just one parent, which is what gives you the arrows in the diagram you posted.
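
If it helps to see that structure as data, here is a small sketch that parses the pretty-printed text shown above. parse_commit is a hypothetical helper, not part of git, and it deliberately ignores less common headers (signatures, encoding and so on).

```python
def parse_commit(text: str) -> dict:
    """Split pretty-printed commit text into header fields and message."""
    header, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # zero, one, or many parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value               # exactly one tree per commit
    return commit

sample = """tree 972825cf23ba10bc49e81289f628e06ad44044ff
parent 9c8ce7397bac108f83d77dfd96786edb28937511
author Junio C Hamano <gitster@pobox.com> 1392406504 -0800
committer Junio C Hamano <gitster@pobox.com> 1392406504 -0800

Git 1.9.0

Signed-off-by: Junio C Hamano <gitster@pobox.com>
"""

commit = parse_commit(sample)
print(commit["tree"])
print(commit["parents"])
```

Note that however many parent lines there are, there is only ever one tree line.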

A tree object looks like this:

$ git cat-file -p 972825cf23ba10bc49e81289f628e06ad44044ff
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes
100644 blob b5f9defed37c43b2c6075d7065c8cbae2b1797e1    .gitignore
100644 blob 11057cbcdf4c9f814189bdbf0a17980825da194c    .mailmap
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING
040000 tree 47fca99809b19aeac94aed024d64e6e6d759207d    Documentation
100755 blob 2b97352dd3b113b46bbd53248315ab91f0a9356b    GIT-VERSION-GEN
[snip lots more]

The tree gives you the top-level directory that goes with that commit. Most tree entries are blobs; subdirectories are more trees. The mode of a blob gives you the executable bit (these look like Unix file modes, but git really uses only the one executable bit, so the mode of an ordinary file is always 100644 or 100755). There are a few more modes for special cases (e.g., symlinks) but we can ignore them for now. In any case, each entry has yet another unique SHA-1, which is how git finds the next item (sub-tree or blob).
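
A sketch of reading such a listing, assuming the "mode type sha name" layout printed above. TreeEntry and parse_tree_listing are names I made up, and the parser does not handle the quoting git applies to unusual file names.

```python
from typing import NamedTuple

class TreeEntry(NamedTuple):
    mode: str
    kind: str   # "blob" or "tree" in the listing above
    sha: str
    name: str

def parse_tree_listing(text: str) -> list:
    """Parse lines shaped like '<mode> <type> <sha> <name>'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=3 keeps any spaces inside the file name intact
        mode, kind, sha, name = line.split(None, 3)
        entries.append(TreeEntry(mode, kind, sha, name))
    return entries

listing = """\
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes
040000 tree 47fca99809b19aeac94aed024d64e6e6d759207d    Documentation
100755 blob 2b97352dd3b113b46bbd53248315ab91f0a9356b    GIT-VERSION-GEN
"""

entries = parse_tree_listing(listing)
for entry in entries:
    note = "executable" if entry.mode == "100755" else ""
    print(entry.kind, entry.name, note)
```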

Each blob object contains the actual file contents (the file's name lives in the tree entry that points to it). For instance, the blob for GIT-VERSION-GEN is the git version generator script:

$ git cat-file -p 2b97352dd3b113b46bbd53248315ab91f0a9356b
#!/bin/sh

GVF=GIT-VERSION-FILE
DEF_VER=v1.9.0
[snip]

So, to extract a commit, git needs only to:

  1. translate a symbolic name like HEAD or master to the commit's SHA-1
  2. extract the commit object to find the top-level tree
  3. extract the top-level tree object to find all the files and sub-trees
  4. for each file, extract the file object; and for each sub-tree, recursively extract that tree and its objects.
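
The four steps above can be sketched with plain dictionaries standing in for git's ref and object storage. Everything here (refs, objects, read_tree, checkout, the short ids) is a made-up stand-in, not git's real code or data. The commit is a merge with two parents, and neither parent is even present in the store, yet the extraction still works because the parents are never consulted.

```python
# Hypothetical in-memory stand-ins for git's refs and object database.
refs = {"master": "c5"}
objects = {
    # A merge commit: two parents, but still exactly one tree.
    "c5": {"type": "commit", "tree": "t1", "parents": ["c4", "c8"]},
    "t1": {"type": "tree", "entries": {"README": "b1", "src": "t2"}},
    "t2": {"type": "tree", "entries": {"main.c": "b2"}},
    "b1": {"type": "blob", "data": b"hello\n"},
    "b2": {"type": "blob", "data": b"int main(void) { return 0; }\n"},
}

def read_tree(tree_id, prefix=""):
    """Steps 3 and 4: walk a tree recursively, collecting path -> contents."""
    files = {}
    for name, child_id in objects[tree_id]["entries"].items():
        child = objects[child_id]
        path = prefix + name
        if child["type"] == "tree":
            files.update(read_tree(child_id, path + "/"))
        else:
            files[path] = child["data"]
    return files

def checkout(name):
    commit = objects[refs[name]]        # steps 1 and 2: name -> commit
    return read_tree(commit["tree"])    # the "parents" field is never read

for path, data in sorted(checkout("master").items()):
    print(path, len(data))
```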

(Git objects are stored compressed, and are eventually further compressed into "pack files" which do use deltas, but in a very different way from other VCS-es. Git is not limited to delta-compressing a file foo.c against a previous version of foo.c; it can delta-compress trees against each other, for instance, or some C code against some documentation. The exact pack file format has undergone several revisions as well: if some future version has an even better way to compress things, the pack format can be updated from version 4 to version 5, for instance. In any case, "loose" objects are just zlib-compressed rather than delta-compressed. This makes accessing and updating them quite fast. Pack files are used for more-static items (files that have not been modified) and for network transmission. They are built during git gc, and also on push and fetch operations, which use a variant called a "thin" pack when possible.)
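
For the loose-object case, a rough sketch of the round trip, assuming a SHA-1 repository with the usual layout where the first two hex digits of the name form a directory under objects. encode_loose and decode_loose are hypothetical helper names, and the sketch works purely in memory rather than touching a real repository.

```python
import hashlib
import zlib

def encode_loose(obj_type: str, content: bytes):
    """Return (relative path, zlib-compressed bytes) for a loose object."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    path = f"objects/{object_id[:2]}/{object_id[2:]}"
    return path, zlib.compress(store)

def decode_loose(compressed: bytes):
    """Inverse of encode_loose: return (type, content)."""
    store = zlib.decompress(compressed)
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content

path, data = encode_loose("blob", b"test content\n")
print(path)
print(decode_loose(data))
```

No other object is needed to decode one, which is part of why reading a loose object is cheap.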

For more of the git "plumbing" commands that allow you to read and write individual objects, see the Pro Git book (a reminder prompted by gatkin's answer).

OTHER TIPS

Git stores a complete snapshot of all tracked files on every commit, not just a diff. In addition to the parent commit ID, C9 (and every commit) has a tree ID. You can see this with

git log --pretty=format:%T HEAD -1

That command prints the SHA-1 hash of the tree, and if you then run git show on that hash you'll get a listing of the top folder in your project, which is the start of the tree. Internally, the tree object has pointers to other objects: blobs for the files and other trees for subfolders.

See chapter 9 of Pro Git for details.

Git is unlike most other version control systems. It does not rely on diffs between revisions to re-create the files in your repository. Subversion, for example, usually needs to visit the parent commits and their associated diffs to re-create a file; git does not.

In other words, at any time, all git needs is access to one commit (and the tree and blobs it points to) to be able to re-create every file as of that commit.

Therefore, it does not matter whether a commit has one or more parents.

Licensed under: CC-BY-SA with attribution