Can git pre-receive hooks evaluate the incoming commit?

Question

(Note: see also Understanding git rev-list, to understand how the code below works.)

You need to use the SHA-1 IDs being supplied on standard input:

while read oldsha newsha refname; do
    ... testing code goes here ...
done

The "testing code" then needs to look at one or more of the three items, depending on the tests to be performed.

The value in $oldsha will be 40 0s if the reference name $refname is being proposed to be created. That is, $refname (typically something like refs/heads/master or refs/tags/v1.2, but any name in refs/ can appear: refs/notes/commits, for instance) does not exist now in the receiving repository, but will exist and will point to $newsha if you allow the change.

The value in $newsha will be 40 0s if the reference name $refname is being proposed to be deleted. That is, $refname does exist now and points to object $oldsha; if you allow the change, that reference-name will be deleted.

The values of both will be nonzero if the reference name $refname is being proposed to be updated, i.e., it currently points to git object $oldsha, and if you allow the change, it will point to new object $newsha instead.
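
The three cases can be told apart with a small helper. This is only a minimal sketch: the function name classify_update is made up for illustration, and it assumes a repository that uses 40-digit SHA-1 IDs.

```shell
NULL_SHA1=0000000000000000000000000000000000000000  # 40 zeros (SHA-1 assumed)

# Hypothetical helper: print "create", "delete", or "update"
# for one "oldsha newsha refname" directive.
#   $1 = oldsha, $2 = newsha
classify_update() {
    case $1,$2 in
    "$NULL_SHA1",*) echo create;;
    *,"$NULL_SHA1") echo delete;;
    *)              echo update;;
    esac
}
```

Inside the read loop it would be called as classify_update "$oldsha" "$newsha".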

If you just run git log or git show, git uses the SHA-1 it finds by running git rev-parse HEAD. In a typical receiving repository, HEAD is a symbolic reference pointing to refs/heads/master (the file HEAD literally contains the string ref: refs/heads/master), so you will see the top-most commit on branch master (as you observed).
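
To look at the incoming commit instead, pass its ID explicitly. A small sketch, where show_incoming is a made-up name and the format string is just an example:

```shell
# Hypothetical helper: summarize the commit a push proposes for a ref.
#   $1 = newsha from the hook's stdin (must not be the all-zeros ID)
# Without the explicit revision argument, git log would start from HEAD.
show_incoming() {
    git log -1 --pretty=format:'%H %an %s' "$1"
}
```

This works inside pre-receive because the pushed objects are already available to git commands run by the hook, even though no reference points to them yet.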

You need to look specifically at any new objects coming in. How do you know which new objects are coming in? That depends on what's happening to the supplied $refname, and possibly other refnames as well.

If the refname is to be deleted, nothing new is coming in. Whether any underlying git objects will be deleted (garbage collected) depends on whether that refname is the "last" reference to those objects. For instance, suppose the entire standard input sequence consists of two directives:

delete refs/heads/foo
delete refs/tags/v1.1

Suppose further that refs/heads/foo (branch foo) points to commit F in this commit-graph diagram, and tag v1.1 points to annotated tag G:

A - B - C - D   <-- refs/heads/master
      \
        E - F   <-- refs/heads/foo
             \
              G <-- refs/tags/v1.1

Deleting branch foo is "safe" in that no commits will go away because annotated tag G will retain them, via the v1.1 tag.

Deleting tag v1.1 is "safe"(ish) in that no commits will go away because branch foo will retain them, via the refs/heads/foo reference. (The annotated tag object itself will go away. It's up to you whether to allow this.)

However, deleting both is not safe: commits E and F will become unreachable and will be collected. (It's up to you whether to allow this anyway.)

On the other hand, it's possible that along with those two directives, stdin contains a third directive:

create refs/heads/foo2 pointing to commit H, with commit H pointing to commit G as its parent [Edit: on re-reading this now, I notice the glaring assumption that G is a commit object rather than a tag object. If we assume G is a commit object the rest of the below is correct, but the above becomes at least a little wrong. However, the general idea—that the DAG is protected by having external references—is still right, and this should mostly make sense.]

in which case deletion of foo is safe after all, as the new branch foo2 will retain commit H which will retain commit G.

Doing a complete analysis is tricky; it's often better to just do a piecewise analysis that allows "safe" operations (whatever you decide these are), and force users to push updates piecewise in a "safe" manner (create branch foo2 first, and only then delete branch foo as a separate push, for instance).
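
One such piecewise check, for a single deletion, is to ask which commits are reachable from the ref being deleted but from no other ref. This is a minimal sketch, assuming the hook runs before any ref has changed (which is the case for pre-receive); the name commits_lost_by_delete is hypothetical.

```shell
# Hypothetical helper: list commits that only the ref being deleted keeps alive.
#   $1 = oldsha (current value of the ref), $2 = refname being deleted
commits_lost_by_delete() {
    # Every existing ref except the one being deleted.
    others=$(git for-each-ref --format='%(refname)' | grep -vxF -- "$2")
    # $others is deliberately unquoted: ref names cannot contain
    # whitespace or glob characters, so it splits into one word per ref.
    git rev-list "$1" --not $others
}
```

Empty output means the deletion is "safe" in the sense used above. The check looks at one directive at a time, so it will not notice that deleting both foo and v1.1 in the same push makes E and F unreachable; that is the reason for forcing such pushes to be split.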

If you only want to look at new commits, then, for each reference update:

If it's a delete, allow it (or use other rules).
If it's a create or a modification, find commit objects it makes reachable that were not reachable before, and examine those commits.

In most "normal" pre-receive hooks you'd use the methods outlined below, but we have an alternative for this particular task.

There's a short-cut method for modifications that handles the most common, and usually most interesting, cases. Suppose someone proposes updating refs/heads/foo from 1234567... to 9876543.... It's possible that some objects in the range already existed, e.g., perhaps 1234567 is the ID of commit C and 9876543 is the ID of commit E:

A - B - C           <-- refs/heads/foo
          \
            D - E   <-- refs/heads/bar

in which case the commits to examine are D and E. The same holds if commits D and E have just been uploaded but have no references yet, i.e., the proposed update is to add D and E and the graph currently looks like this:

A - B - C           <-- refs/heads/foo
          \
            D - E   [no reference yet]

In either case, a simple:

git rev-list $oldsha..$newsha

produces the object IDs you should look at.

For new references, there's no short-cut. For instance, suppose we have the same five commits shown above, with the same refs/heads/foo but no refs/heads/bar, and the actual proposal is "create refs/heads/bar pointing to E". In this case, we should again look at commits D and E, but there's no obvious way to know about D.

The non-obvious way, which only works in some cases, is to find objects that will be reachable given the proposed creation, that are not currently reachable at all:

git rev-list $newsha --not --all

In this particular case, this will again produce the IDs for D and E.
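
The two commands can be combined into one helper that picks the right form for each directive. This is a sketch only: new_commits_for is a hypothetical name, and NULL_SHA1 assumes 40-digit SHA-1 IDs.

```shell
NULL_SHA1=0000000000000000000000000000000000000000  # 40 zeros (SHA-1 assumed)

# Hypothetical helper: list the commits a single directive would add.
#   $1 = oldsha, $2 = newsha
new_commits_for() {
    if [ "$2" = "$NULL_SHA1" ]; then
        return 0                         # delete: nothing new is coming in
    elif [ "$1" = "$NULL_SHA1" ]; then
        git rev-list "$2" --not --all    # create: not reachable from any ref
    else
        git rev-list "$1..$2"            # update: the short-cut
    fi
}
```

Note that the two forms can give different answers for an update: $oldsha..$newsha includes commits that are already reachable from some other ref (as in the refs/heads/bar picture above), while --not --all leaves them out.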

Now let's consider your particular case, where you want to look at all commits that are being proposed-to-be-added. Here's a way to handle this one.

For all proposed updates:

If this one is a delete, we have some deletes.
If this one is a create or update, we have some new commits; accumulate the new SHA.

If we have some deletes and we have accumulated some SHAs, reject the attempt: it's too hard. Make the user separate out the operations.

Otherwise, if we have no accumulated SHAs, we must just have deletes (or maybe nothing at all—should not happen, but harmless); allow this (exit 0).

Otherwise we must have some new SHA-1 values.

Using the proposed new SHAs as starting points, find all git objects that would be reachable, excluding all objects that are currently reachable under any name. These are all the new objects.

For each one that is a commit, examine it to see if it's forbidden. If so, reject the entire operation (even if some parts could succeed); as before, it's too hard to figure out, so make the user separate out the "good" operations from the "bad" ones.

If we get this far, everything is OK; permit the entire update.

In code form:

#!/bin/sh
# (untested)
NULL_SHA1="0000000000000000000000000000000000000000" # 40 0's
new_list=
any_deleted=false
# each stdin line is one proposed ref change: "oldsha newsha refname"
while read -r oldsha newsha refname; do
    case $oldsha,$newsha in
    *,$NULL_SHA1) # it's a delete
        any_deleted=true;;
    $NULL_SHA1,*) # it's a create
        new_list="$new_list $newsha";;
    *,*) # it's an update
        new_list="$new_list $newsha";;
    esac
done
$any_deleted && [ -n "$new_list" ] && {
    echo 'error: you are deleting some refs and creating/updating others'
    echo 'please split your push into separate operations'
    exit 1
}
[ -z "$new_list" ] && exit 0

# look at all new objects, and verify them
# let's write the verifier function, including a check_banned function...
check_banned() {
    if [ "$1" = root ]; then
        echo "################################################################"
        echo "Commits from $1 are not allowed"
        echo ... rest of message ...
        exit 1
    fi
}
check_commit() {
    # check both the author name (%an) and the committer name (%cn)
    check_banned "$(git log -1 --pretty=format:%an "$1")"
    check_banned "$(git log -1 --pretty=format:%cn "$1")"
}

# $new_list is deliberately unquoted so that it splits into one argument per SHA.
# Without --objects, rev-list prints only commits; the type check is a safeguard.
# The loop runs in a subshell, so the "exit 1" in check_banned only ends that
# subshell; the trailing "|| exit 1" makes the hook itself fail explicitly.
git rev-list $new_list --not --all |
while read -r sha1; do
    objtype=$(git cat-file -t "$sha1")
    case $objtype in
    commit) check_commit "$sha1";;
    esac
done || exit 1