Question

I'm in the unfortunate situation of having to store some binary files in Git.

However, I can choose how the data is stored on disk and in Git (in our own format, which only the build system needs to read).

I'd like to avoid talking specifics too much, since I don't think it's that important, but for context: these are many icon files, though the same question would apply to many small sound files or 3D models too.

Converting these files into one large image will be a build step, so the individual images can be stored however we like in Git.

  • Binary, compressed (e.g., PNG for images, FLAC for sound)
  • Binary, uncompressed (e.g., PPM for images, uncompressed WAV for sound)
  • ASCII representation of binary data (e.g., MIME encoding, XPM for images)

Let's assume some files will change occasionally, so it would be nice to avoid storing a whole new binary blob for every small change to a pixel.

I'm interested to know:

  • Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?
  • Does git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?
  • I would assume storing many small binary files is less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

All things considered, what are the best options for avoiding a large Git repository (as edits are made to the binary files), assuming binary files can't be avoided completely?


Solution

Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?

All of them. All blobs (indeed, all objects in the repo) are stored "intact" (more or less) whenever they are "loose objects". The only thing done with them is to give them a header and compress them with deflate compression.
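You can watch this happen with a hypothetical icon.png (the file name is only an illustration); the sketch below uses nothing but plumbing commands that ship with Git:

    # assumes icon.png exists inside a Git work tree
    git add icon.png                      # writes a loose object under .git/objects/
    sha=$(git hash-object icon.png)       # the SHA-1 that names the blob
    git cat-file -t "$sha"                # -> blob
    git cat-file -s "$sha"                # -> original (uncompressed) size in bytes
    ls .git/objects/"${sha:0:2}"/         # the loose object: zlib-deflated "blob <size>" header plus data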

At the same time, though, loose objects are eventually combined into "packs". Git does delta compression on files in packs: see "Is the git binary diff algorithm (delta storage) standardized?". Based on the answers there, you'd be much better off not "pre-compressing" the binaries, so that the pack-file delta algorithm can find long strings of matching binary data.
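You can check how well the delta search worked after a repack; a quick sketch (the pack file name on your machine will differ):

    git gc                                                 # roll loose objects into a pack
    git verify-pack -v .git/objects/pack/pack-*.idx | head -n 20
    # each line: SHA-1, type, size, size-in-pack, offset;
    # deltified entries carry two extra fields: delta-chain depth and the base object's SHA-1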

Does git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?

I have not tried it, but the overall implication is that the answer should be "yes".
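A rough way to test it yourself, assuming icon.ppm and icon.png stand in for your uncompressed and compressed candidates: commit one version, make a small edit, repack, and compare how much the pack grows for each format:

    git init delta-test && cd delta-test
    cp /path/to/icon.ppm .                    # uncompressed candidate; repeat the test with icon.png
    git add icon.ppm && git commit -m "v1"
    # ...edit a few pixels in the image, then:
    git add icon.ppm && git commit -m "v2"
    git gc                                    # force a delta search across the two versions
    git count-objects -v                      # "size-pack" is the packed repository size in KiB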

I would assume storing many small binary files is less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

Certainly all files that are completely unchanged will be stored with a lot of "de-duplication" instantly, as their SHA-1 checksums will be identical across all commits, so that each tree names the very same blob in the repository. If foo.icon is the same across thousands of commits, there's just the one blob (whatever the SHA-1 for foo.icon turns out to be) stored.
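You can see that de-duplication directly; in this sketch foo.icon is just the example name from above, and copy.icon is a hypothetical byte-identical duplicate:

    cp foo.icon copy.icon                     # identical content under a second path
    git add foo.icon copy.icon
    git ls-files -s foo.icon copy.icon        # both index entries name the same blob SHA-1
    git count-objects -v                      # the blob is stored once, so the object count rises by only one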


I'd recommend experimenting a bit: create some dummy test repos with the proposed binaries, make the proposed changes, and see how big the repos are before and after running git gc to re-pack the loose objects. Note that there are a lot of tunables; in particular, you might want to experiment with the window, depth, and window-memory settings (which can be set on the command line or in git config entries).
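As a sketch of those knobs (the values here are arbitrary starting points, not recommendations):

    git config pack.window 50                 # delta-base candidates tried per object (default 10)
    git config pack.depth 50                  # maximum delta-chain length (default 50)
    git config pack.windowMemory 256m         # memory cap for the delta window (0 = unlimited)
    git gc                                    # repack with the settings above
    git count-objects -v                      # compare "size-pack" before and after
    # the same knobs are available as one-off options: git repack -a -d -f --window=50 --depth=50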

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow