How to remove old versions of media files from a git repository

https://stackoverflow.com/questions/6358476

28-10-2019

Question

I have a Git repository with several huge media files (images and audio files). Several versions of these media files have been committed to the repo in succession. The files are successively refined versions of the same assets, and they have the same name.

I want to keep only the latest version in the Git repository, because it is becoming too big.
What is the simplest way to do this?
How can I propagate these changes correctly to the upstream repository?

Solution

I have a script (GitHub gist here) to remove a selection of unwanted folders from the entire history of a git repo, or to delete all but the latest version of a folder.

It's hard-coded to assume that all git repositories are in ~/repos, but that's easy to change. It should also be easy to adapt to work with individual files.
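The gist itself is not reproduced here. As a rough sketch of the "all but the latest version" idea, assuming an approach based on git filter-branch (the gist may well work differently), the following builds a throwaway repository and strips a hypothetical media/ folder from every commit, then re-adds the newest copy. The folder and file names are made up for the demo.

```shell
#!/bin/sh
# Sketch only: the demo repo, the folder name "media" and the file names are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's advisory pause

demo=$(mktemp -d)
keep=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Two commits that each store a different version of media/song.txt.
mkdir media
echo "v1" > media/song.txt
echo "code" > main.txt
git add .
git commit -q -m "first"
echo "v2" > media/song.txt
git commit -q -am "second"

# 1. Save the latest version outside the repository.
cp -R media "$keep/media"

# 2. Drop media/ from the index of every commit reachable from HEAD.
git filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch media' HEAD >/dev/null 2>&1

# 3. Put the latest version back as a single new commit.
rm -rf media
cp -R "$keep/media" media
git add media
git commit -q -m "Re-add latest media"

# Only the re-add commit should touch media/ now.
remaining=$(git log --oneline HEAD -- media | wc -l | tr -d ' ')
echo "commits touching media/: $remaining"
```

The old blobs are still kept alive by refs/original/ and the reflog after this, so the repository only shrinks once those are expired and git gc has run.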

OTHER TIPS

Old thread but in case someone else stumbles along here…

GitHub & Bitbucket both recommend using BFG Repo-Cleaner.

See:
GitHub: Remove Sensitive Data
Bitbucket: Reduce Repository Size & Bitbucket: Maintaining a Git Repository

Example that strips blobs over 1 megabyte, as well as JPGs, PNGs and MP3s, from history while leaving the versions in HEAD alone:

# First get the latest bfg.jar, then:
$ git clone --mirror git://example.com/some-big-repo.git
$ java -jar bfg.jar --strip-blobs-bigger-than 1M --delete-files '*.{jpg,png,mp3}' some-big-repo.git
$ cd some-big-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push

Note: now that you've pushed the updated revs, the remote repository should also run its git gc, otherwise you won't see the size reduction there. (see e.g. https://stackoverflow.com/a/28782154/3419541)

Finally, re-clone the repository to be sure that you don't accidentally re-commit the old media file blobs.
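To see why the reflog expire and git gc step matters, here is a small local experiment. It is only a sketch on a throwaway repository (the branch name tmp and the file big.bin are hypothetical): a large blob is committed on a side branch, the branch is deleted, and the on-disk object size is measured before and after the cleanup commands.

```shell
#!/bin/sh
# Sketch only: throwaway repo; "tmp" and "big.bin" are hypothetical names.
set -e

# Total size of loose plus packed objects, in KiB.
repo_kib() {
    git count-objects -v | awk '/^size(-pack)?:/ { s += $2 } END { print s }'
}

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "code" > main.txt
git add main.txt
git commit -q -m "initial"

# Commit 1 MiB of incompressible data on a side branch, then delete the branch.
git checkout -q -b tmp
dd if=/dev/urandom of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin
git commit -q -m "big file"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q -
git branch -q -D tmp

# No branch points at the blob any more, but the reflog still keeps it alive.
before=$(repo_kib)

git reflog expire --expire=now --all
git gc -q --prune=now --aggressive

after=$(repo_kib)
echo "before: ${before} KiB, after: ${after} KiB"
```

Deleting the reference alone does not free the space; the object only disappears once nothing (including the reflog) points at it and gc has pruned it.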

Check the section on 'Removing Objects' in the chapter Maintenance and Data Recovery in the ProGit book. It walks through the steps for removing objects from a git repo. Be warned, though, that the procedure is destructive.
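Before removing anything, it helps to know which objects are actually large. The sketch below is not the book's exact procedure; it is an alternative using git rev-list and git cat-file to list every blob in history with its size, run on a throwaway repository with hypothetical file names.

```shell
#!/bin/sh
# Sketch only: demo repo with hypothetical files "notes.txt" and "big.bin".
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'small\n' > notes.txt
dd if=/dev/zero of=big.bin bs=1024 count=64 2>/dev/null
git add .
git commit -q -m "add files"

# Deleting the file in a later commit does not remove it from history.
git rm -q big.bin
git commit -q -m "remove big file"

# Print "<size in bytes> <path>" for every blob reachable from any ref,
# sort numerically and keep the largest entry.
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -n | tail -1)
echo "largest blob: $largest"
```

Replacing tail -1 with tail -20 gives a short list of candidates to strip. Paths containing spaces would need a more careful awk expression than this one.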

As mentioned already, you will be rewriting history here, so collaborators (if any) will have to move their local work onto the rewritten history with git rebase.
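What that rebase might look like for a collaborator is sketched below. Everything here is a local simulation under assumed names (alice rewrites history, bob has one unpushed commit, big.bin is the file being stripped); the rewrite itself uses git filter-branch purely for the demo.

```shell
#!/bin/sh
# Sketch only: "alice", "bob", "upstream.git" and "big.bin" are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q --bare upstream.git

# Alice publishes the original history, including a big file.
git clone -q upstream.git alice 2>/dev/null
cd alice
git config user.email "alice@example.com"
git config user.name "Alice"
echo "code" > main.txt
echo "v1" > big.bin
git add .
git commit -q -m "initial"
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# Bob clones and commits local work on top of the old history.
cd "$work"
git clone -q upstream.git bob
cd bob
git config user.email "bob@example.com"
git config user.name "Bob"
echo "bob's work" > bob.txt
git add bob.txt
git commit -q -m "bob's change"
old_base=$(git rev-parse "origin/$branch")   # upstream tip Bob built on

# Alice strips big.bin from history and force-pushes.
cd "$work/alice"
git filter-branch -f --index-filter \
    'git rm -q --cached --ignore-unmatch big.bin' HEAD >/dev/null 2>&1
git push -q --force origin "$branch"

# Bob fetches the rewritten history and replays only his own commit onto it.
cd "$work/bob"
git fetch -q origin
git rebase -q --onto "origin/$branch" "$old_base" "$branch"
```

The key point is the three-argument form of rebase: only the commits after the old upstream tip are replayed, so the stripped file does not sneak back in. Collaborators with no local work can simply re-clone instead.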

As for stripping a particular file from history, GitHub has a nice walkthrough.

For a solution going forward, you should look at putting the binary files in a submodule.

Git's submodule support allows a repository to contain, as a subdirectory, a checkout of an external project. Submodules maintain their own identity; the submodule support just stores the submodule repository location and commit ID, so other developers who clone the containing project ("superproject") can easily clone all the submodules at the same revision. Partial checkouts of the superproject are possible: you can tell Git to clone none, some or all of the submodules.

https://git-scm.com/docs/git-submodule

https://git-scm.com/book/en/v2/Git-Tools-Submodules
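As a minimal sketch of the submodule setup, assuming the media lives in its own repository (here a hypothetical local one called media-assets, mounted at media/ in the main project):

```shell
#!/bin/sh
# Sketch only: "media-assets", "project" and "media" are hypothetical names.
set -e

work=$(mktemp -d)
cd "$work"

# A separate repository that holds only the binary assets.
git init -q media-assets
cd media-assets
git config user.email "demo@example.com"
git config user.name "Demo"
echo "v1" > song.txt
git add .
git commit -q -m "media v1"

# The main project, which tracks code only.
cd "$work"
git init -q project
cd project
git config user.email "demo@example.com"
git config user.name "Demo"
echo "code" > main.txt
git add .
git commit -q -m "code"

# Record the media repository as a submodule at ./media.
# protocol.file.allow=always is needed only because this demo uses a local
# path as the submodule URL; a normal https or ssh URL does not need it.
git -c protocol.file.allow=always submodule --quiet add "$work/media-assets" media
git commit -q -m "Track media as a submodule"

# The superproject stores just a commit ID for media/ (mode 160000).
mode=$(git ls-tree HEAD media | cut -d' ' -f1)
echo "media entry mode: $mode"

# Others can fetch code and media together with --recurse-submodules.
git -c protocol.file.allow=always clone -q --recurse-submodules \
    "$work/project" "$work/copy"
```

The main repository stays small because each of its commits only records which media commit it expects. The media repository itself still grows with every new version, so this separates the problem rather than removing it.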

As far as I know, this can't be done, because in git, every commit depends on the contents of the entire history up to that point. So the only way to get rid of the old, big files would be to "replay" the entire commit history (preferably with the same commit timestamps and authors), omitting the big files. Note that this will produce an entirely separate commit history.

This is obviously not a very viable approach, so the lesson is probably "don't use git to version huge binary files". Instead, you could perhaps have a separate (ignored) folder for the files and use a separate system to version control them.
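The "separate (ignored) folder" idea can be sketched as follows, assuming a hypothetical media/ folder that some other tool versions or syncs. Git is simply told to ignore it.

```shell
#!/bin/sh
# Sketch only: the folder "media" and the file names are hypothetical.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Media lives in the working tree but is excluded from version control.
mkdir media
echo "binary stand-in" > media/track.bin
echo "media/" > .gitignore

echo "code" > main.txt
git add .
git commit -q -m "code only"

# Show what git actually tracks; nothing under media/ should be listed.
tracked=$(git ls-files)
echo "$tracked"
```

git check-ignore -v media/track.bin shows which .gitignore rule is responsible if a file is unexpectedly skipped.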

Licensed under: CC-BY-SA with attribution