Question

I have a repository that has grown very large because of a number of large blobs that were checked in years ago. They have been removed in subsequent revisions and aren't needed any longer, so I should be able to purge any reference to them now.

I have seen some references to using git filter-branch but using this command seems dangerous and kludgy, so I tried this:

git checkout --orphan new-master
git rm -rf --cached *
git merge --squash master
git branch -D master
git gc --prune=now

Shouldn't this mean that anything that has been created and subsequently deleted at any point in the history is permanently dropped?

For some reason, it doesn't seem to work - the size is more or less the same.

Any suggestions?


Solution

Sorry, but you have to rewrite history to do this, and git filter-branch is the built-in tool for that.

You should try testing it out in a separate clone of your repository if you're feeling nervous. Just remember that git filter-branch backs up the original refs for you (under refs/original/) when you do this, so your clone will not shrink, and may even grow, until those backups and the old reflog entries are removed and garbage collected.
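As a rough illustration of that workflow, here is a minimal sketch that runs entirely in a throwaway repository. The file name big.bin, the temporary directory, and the demo identity are all made up for the example; in a real clone you would substitute the path of the blob you want gone, and you should check the git filter-branch documentation before running anything like this on real history.

```shell
#!/bin/sh
set -eu

# Build a throwaway repository so nothing real is touched.
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a large, incompressible blob (hypothetical big.bin), then delete it.
head -c 2000000 /dev/urandom > big.bin
echo "hello" > README
git add big.bin README
git commit -q -m "add big blob"
git rm -q big.bin
git commit -q -m "remove big blob"

# Rewrite every commit on every ref, dropping big.bin from each tree.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm --cached --ignore-unmatch -q big.bin' \
    --prune-empty -- --all

# filter-branch keeps backups under refs/original/. Delete them and expire
# the reflog; only then can gc actually drop the blob.
git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

# Report what is left: commits that mention big.bin, and the pack size in KiB.
remaining=$(git log --all --oneline -- big.bin | wc -l | tr -d ' ')
size_kib=$(git count-objects -v | awk '/^size-pack:/ { print $2 }')
echo "commits touching big.bin: $remaining, pack size: ${size_kib} KiB"
```

Everyone else who has cloned the repository would still hold the old history, so a rewrite like this normally has to be followed by a forced push and fresh clones.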

I would check out GitHub's useful page on this.

Also, if you'll excuse my shameless plug, I have been working on a Ruby gem recently that provides some basic metrics about large files in both your history and your working copy. It's still under active development but it works and hopefully you might find it useful.

Edit: Why your approach doesn't work

First of all, git is a distributed revision control system, which means that all the branches and history are copied locally when you clone. Consequently, you can run git checkout <commit-sha> for any commit in the repository's history to see exactly what the repository looked like at that point in the past.
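Because the whole history is local, a file deleted long ago is still sitting in the object store and can be found with plain git commands. The following is a small sketch, using a throwaway repository and a hypothetical big.bin, of one way to list the largest blob reachable from any ref:

```shell
#!/bin/sh
set -eu

# Throwaway repository with one large file that is later deleted.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

head -c 500000 /dev/urandom > big.bin
echo "small" > notes.txt
git add big.bin notes.txt
git commit -q -m "add files"
git rm -q big.bin
git commit -q -m "remove big.bin"

# Walk every object reachable from any ref, keep only blobs, and sort by size.
# %(rest) echoes the path that rev-list printed after each object id.
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -rn | head -n 1)
echo "largest blob in history: $largest"
```

Even though big.bin no longer exists in the working copy, it still shows up here because the first commit references it.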

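Creating a new branch does not free you from the history of the repository; branches are just pointers to commits. Your orphan new-master branch does start a fresh history, but deleting master only removes one pointer: the old commits, and the large blobs they reference, are still reachable from the reflog and from any tags or remote-tracking branches (such as origin/master) that point into the old history. As long as something references them, git will not discard them. The small decrease in size will probably have been down to git getting slightly better compression from the garbage collection.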

When you ran git gc --prune=now, you were only removing unreachable objects, i.e. objects that no branch, tag, or reflog entry refers to; everything still referenced was simply repacked. A packfile is where git stores objects in compressed form in order to reduce the size of your repository. You can find more information here.
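To see this behaviour for yourself, here is a sketch that replays a simplified variant of the steps from the question in a throwaway repository (it commits the index left by the orphan checkout instead of using a squash merge, and big.bin is again a made-up file). It checks whether the blob still exists after the first gc, and again after the reflog has been expired:

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# History: add a large blob, then replace it with a small file.
head -c 2000000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big blob"
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
echo "kept" > README
git add README
git commit -q -m "remove big blob"
old=$(git rev-parse --abbrev-ref HEAD)   # master or main, depending on config

# Start an unrelated history from the current files and delete the old branch.
git checkout -q --orphan new-master
git commit -q -m "fresh start"
git branch -q -D "$old"
git gc -q --prune=now

# The HEAD reflog still points at the old commits, so the blob survives.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Once the reflog entries are expired, nothing references the blob any more.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi

echo "after first gc: $before, after expiring the reflog: $after"
```

In a real clone there are usually further references (tags, remote-tracking branches, stashes) that keep old objects alive, so expiring the reflog alone may not be enough there.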

It's a lot to take on board if you're a git newcomer but I've tried to give a high-level overview. I would explore the excellent git documentation and get ready to bust out that git filter-branch command to truly make a dent in reducing your repository's size.

Licensed under: CC-BY-SA with attribution