Question

We have recently started using git and had a nasty problem when someone committed a large (~1.5GB) file, which then caused git to crash on various 32-bit OSes. This seems to be a known bug (git mmaps files into memory, which doesn't work if it can't get enough contiguous space), and it isn't going to get fixed any time soon.

The easy (for us) solution would be to get git to reject any commits larger than 100MB or so, but I can't figure out a way to do that.

EDIT: The problem comes from accidental submission of a large file, in this case a large dump of program output. The aim is to avoid accidental submission, because if a developer does accidentally commit a large file, getting it back out of the repository costs an afternoon in which no-one can do any work, and everyone then has to fix up all of their local branches.


Solution

When exactly did the problem occur? When they committed the file originally, or when it got pushed elsewhere? If you have a staging repo that everyone pushes to, you could implement an update hook to scan changing refs for large files, along with any other checks (permissions and so on).

Very rough and ready example:

git --no-pager log --pretty=oneline --name-status $2..$3 -- | \
  perl -MGit -lne 'if (/^[0-9a-f]{40}/) { ($rev, $message) = split(/\s+/, $_, 2) }
     else { ($action, $file) = split(/\s+/, $_, 2); next unless $action eq "A"; 
       $filesize = Git::command_oneline("cat-file", "-s", "$rev:$file");
       print "$rev added $file ($filesize bytes)"; die "$file too big" if ($filesize > 1024*1024*1024) }';

(just goes to show, everything can be done with a Perl one-liner, although it might take multiple lines ;))

Called in the way that $GIT_DIR/hooks/update is called (the arguments are ref-name, old-rev, new-rev; e.g. "refs/heads/master master~2 master"), this will show the files that were added and abort if any of them is too big.
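
If you prefer plain shell to Perl, here is a rough sketch of the same check as a complete $GIT_DIR/hooks/update script. The 100MB limit comes from the question, it only looks at added files (like the one-liner above), and it assumes the pushed ref already exists; an all-zero old-rev would need extra handling, as the pre-receive script further down shows.

#!/usr/bin/env bash
# Sketch of an update hook doing the same check in plain shell.
# update hooks are called with: $1 = ref name, $2 = old rev, $3 = new rev.
refname=$1 oldrev=$2 newrev=$3

maxbytes=$((100 * 1024 * 1024))   # ~100MB, as suggested in the question

# Walk each pushed commit and every file it adds.
for commit in $(git rev-list "$oldrev..$newrev"); do
    git diff-tree -r --no-commit-id --diff-filter=A "$commit" |
    while read omode nmode osha nsha status path; do
        # skip anything that is not a plain blob (e.g. submodule entries)
        size=$(git cat-file -s "$nsha" 2>/dev/null) || continue
        if [ "$size" -gt "$maxbytes" ]; then
            echo "update hook: $path added in $commit is $size bytes (limit $maxbytes)" >&2
            exit 1
        fi
    done || exit 1   # the inner loop runs in a subshell; propagate its failure
done
exit 0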

Note that I'd say that if you're going to police this sort of thing, you need a centralised point at which to do it. If you trust your team to just exchange changes with each other, you should trust them to learn that adding giant binary files is a bad thing.

OTHER TIPS

You can distribute a pre-commit hook that blocks commits containing oversized files. On central repositories you can have a pre-receive hook that rejects large blobs by analyzing the received data and refusing to let it become referenced. The data will still be received, but since you reject the ref updates, all of the new objects remain unreferenced and will eventually be picked up and dropped by git gc.

I don't have a script for you though.
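
As a purely illustrative starting point, a minimal pre-commit hook along those lines could look something like this (the 100MB limit is taken from the question; everything else here is an assumption, not an existing script):

#!/usr/bin/env bash
# Illustrative pre-commit hook: refuse to commit any staged blob over the limit.
maxbytes=$((100 * 1024 * 1024))

# Diff the index against HEAD, or against the empty tree for the very first commit.
if git rev-parse --verify HEAD >/dev/null 2>&1; then
    against=HEAD
else
    against=$(git hash-object -t tree /dev/null)
fi

git diff --cached --name-only --diff-filter=AM "$against" |
while read -r path; do
    # ":path" names the blob staged in the index; skip non-blobs such as submodules
    size=$(git cat-file -s ":$path" 2>/dev/null) || continue
    if [ "$size" -gt "$maxbytes" ]; then
        echo "pre-commit: $path is $size bytes (limit $maxbytes); refusing to commit" >&2
        exit 1
    fi
done || exit 1   # propagate the refusal out of the pipeline subshell
exit 0

Bear in mind that pre-commit hooks live in each developer's clone and can be bypassed with --no-verify, which is why the server-side checks above and below are the ones you can actually rely on.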

If you have control over your committers' toolchain, it may be straightforward to modify git commit so that it performs a reasonableness test on the file size prior to the "real" commit. Since such a change in the core would burden all git users on every commit, and the alternative strategy of "banish anyone who would commit a 1.5GB change" has an appealing simplicity, I suspect such a test will never be accepted in the core. I suggest you weigh the burden of maintaining a local fork of git -- nannygit -- against the burden of repairing a crashed git following an overambitious commit.

I must admit I am curious about how a 1.5 GB commit came to be. Are video files involved?

Here is my solution. I must admit it doesn't look like others I have seen, but to me it makes the most sense. It only checks the inbound commits. It detects when a new file is too large and when an existing file grows too large. It is a pre-receive hook. It skips tag refs rather than checking them.

#!/usr/bin/env bash
#
# pre-receive hook: git-receive-pack runs this after the pushed pack has been
# received but before any refs are updated.  For each ref being updated it
# reads a line from stdin of the form
#   <oldrev> <newrev> <refname>
# For example:
#   aa453216d1b3e49e7f6f98441fa56946ddcd6a20 68f7abf4e6f922807889f52bc043ecd31b79f814 refs/heads/master
#
# See contrib/hooks/ in the git source for sample hooks.
#

set -e

let max=1024*1024                 # maximum allowed blob size, in bytes
count=0
echo "Checking file sizes..."
while read oldrev newrev refname
do
#   echo $oldrev $newrev $refname
    # skip the size check for tag refs
    if [[ ${refname} =~ ^refs/tags/ ]]
    then
        continue
    fi

    # skip deleted refs (the new rev is all zeros)
    if [[ ${newrev} =~ ^[0]+$ ]]
    then
        continue
    fi

    # build the exclusion list for rev-list: either the old tip of this ref,
    # or (for a brand-new ref) every existing branch head
    if [[ ! ${oldrev} =~ ^[0]+$ ]]
    then
        excludes=( "^${oldrev}" )
    else
        excludes=( $(git for-each-ref --format '^%(refname:short)' refs/heads/) )
    fi
#   echo "excludes " ${excludes}
    commits=$(git rev-list $newrev "${excludes[@]}")
    for commit in ${commits};
    do
#       echo "commit " ${commit}
        # get a list of the blob changes in this commit, recursing into subdirectories
        rawdiff=$(git diff-tree -r --no-commit-id ${commit})
        while read oldmode newmode oldsha newsha code fname
        do
#           echo "reading " ${oldmode} ${newmode} ${oldsha} ${newsha} ${code} ${fname}
            # if diff-tree returns anything, new sha is not all 0's, and it is a file (blob)
            if [[ "${newsha}" != "" ]] && [[ ! ${newsha} =~ ^[0]+$ ]] && [[ $(git cat-file -t ${newsha}) == "blob" ]]
            then
                echo -n "${fname} "
                newsize=$(git cat-file -s ${newsha})
                if (( ${newsize} > ${max} ))
                then
                    echo " size ${newsize}B > ${max}B"
                    let "count+=1"
                else
                    echo "ok"
                fi
            fi
        done <<< "${rawdiff}"
    done
done

exit ${count}
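
To put the script into service, save it as hooks/pre-receive inside the central (bare) repository and make it executable; the source filename and repository path below are only placeholders:

# example installation on a bare central repository
cp size-check.sh /srv/git/project.git/hooks/pre-receive
chmod +x /srv/git/project.git/hooks/pre-receive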