Question

We're running a central git repository (gforge) that everyone pulls from and pushes to. Unfortunately, some inept co-workers decided that pushing several 10-100MB jar files into the repo was a good idea. As a consequence, the server we all rely on has run out of disk space.

We only realised this when it was too late, after most people had already pulled the huge new history. If the problem hadn't been pushed we could simply have rebased to snip out those huge commits, but now that everyone has pulled, what is the best way to remove that commit (or rebase to remove just the large files) without causing chaos when everyone pulls from and pushes to the repo?

It's supposed to be a small repo for scripts, but is now about 700MB in size :-(

Solution

Check out https://help.github.com/articles/remove-sensitive-data . The article is about removing sensitive data from your Git repository, but the same approach works just as well for removing large files from your commits.

Other suggestions

The easiest way to avoid chaos is to give the server more disk.

This is a tough one. Removing the files requires removing them from the history too, which can only be done with git filter-branch. For example, this command would remove <file> from the entire history:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch <file>' \
--prune-empty --tag-name-filter cat -- --all

The problem is that this rewrites commit SHA-1s, meaning everyone on the team will need to reset to the new branch version or risk some serious headaches. That's all fine and good if no one has work in progress and you all use topic branches. If you're more centralized, your team is large, or many people keep dirty working directories while they work, there's no way to do this without a little chaos and discord, and you could spend quite a while getting everyone's local repository back into a correct state.

That said, git filter-branch is probably the best solution. Just make sure you have a plan, your team understands it, and everyone backs up their local repository in case some vital work in progress gets lost or munged.

One possible plan would be:

  1. Get the team to generate patches of their work in progress, something like git diff > ~/my_wip.
  2. Get the team to generate patches for their committed but unshared work: git format-patch <branch>
  3. Run git filter-branch. Make sure the team knows not to pull while this is happening.
  4. Have the team issue git fetch && git reset --hard origin/<branch> or have them clone the repository afresh.
  5. Apply their previously committed work with git am <patch>.
  6. Apply their work in progress with git apply, e.g. git apply ~/my_wip.
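The whole plan, from a single developer's point of view, can be collected into a small shell helper. This is a rough sketch, not a drop-in script: the function name, the branch name master, the remote name origin, and the patch locations under $HOME are all assumptions. Everything here runs after the server-side filter-branch (step 3) has finished; the saves just have to happen before the hard reset.

```shell
# Hypothetical recovery helper for one developer after the history rewrite.
# Branch name, remote name, and patch paths are assumptions -- adjust to taste.
recover_after_rewrite() {
    branch=${1:-master}
    git diff > "$HOME/my_wip.patch"                        # 1. save uncommitted work
    git format-patch "origin/$branch" -o "$HOME/unpushed"  # 2. save unshared commits
    git fetch origin                                       # 4. fetch rewritten history
    git checkout "$branch"
    git reset --hard "origin/$branch"                      #    discard the old history
    if ls "$HOME"/unpushed/*.patch >/dev/null 2>&1; then
        git am "$HOME"/unpushed/*.patch                    # 5. replay unshared commits
    fi
    if [ -s "$HOME/my_wip.patch" ]; then
        git apply "$HOME/my_wip.patch"                     # 6. restore work in progress
    fi
}
```

Alternatively, step 4's clone-afresh variant skips the reset entirely; the saved patches apply the same way in the fresh clone.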

In addition to the other answers, you may want to consider adding some pre-emptive protection against future giant jar files, in the form of a pre-receive hook in the repo that forbids users (or at least "non-admin users") from pushing very large files, or files named *.jar, or whatever seems best.

We've done this sort of thing before, including forbidding specific commit IDs because of certain users who just couldn't get the hang of "save your work on a temp branch, reset and pull, and re-apply your work, minus the giant file".

Note that the pre-receive hook runs in a rather interesting context: the files have actually been uploaded, it's just that the references (usually branch heads) have not actually changed yet. You can prevent the branch heads from changing but you'll still be using (temporary, until gc'ed) disk space and network bandwidth.
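As a sketch of such a hook: the 10 MB limit, the blanket ban on *.jar, and the check_push helper name below are all illustrative assumptions, not policy from the original answer.

```shell
#!/bin/sh
# Hypothetical pre-receive hook sketch. Tune MAX_BYTES and the *.jar
# rule to whatever policy fits your server.
MAX_BYTES=$((10 * 1024 * 1024))
ZERO=0000000000000000000000000000000000000000

check_push() {
    while read oldrev newrev refname; do
        # Branch deletion: nothing to inspect.
        [ "$newrev" = "$ZERO" ] && continue
        # New branch: inspect its whole history; otherwise just the new commits.
        if [ "$oldrev" = "$ZERO" ]; then
            range=$newrev
        else
            range="$oldrev..$newrev"
        fi
        for commit in $(git rev-list "$range"); do
            # Check every file added or modified by this commit.
            git diff-tree -r --no-commit-id --diff-filter=AM "$commit" |
            while read mode1 mode2 sha1 sha2 status path; do
                case $path in
                *.jar)
                    echo "rejected: $path in $commit (*.jar files are forbidden)" >&2
                    exit 1
                    ;;
                esac
                if [ "$(git cat-file -s "$sha2")" -gt "$MAX_BYTES" ]; then
                    echo "rejected: $path in $commit exceeds $MAX_BYTES bytes" >&2
                    exit 1
                fi
            done || return 1   # the inner loop runs in a subshell; propagate failure
        done
    done
    return 0
}

# When installed as hooks/pre-receive, git feeds "<old> <new> <refname>"
# lines on stdin; uncomment the call to run as the actual hook:
# check_push
```

Exiting non-zero from a pre-receive hook makes git refuse the whole push, which is exactly the behavior described above: the objects have already landed on disk, but the branch heads never move.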

Use filter-branch!

git filter-branch --tree-filter 'find . -name "*.jar" -exec rm {} \;'

Then purge the commits that have become empty with:

git filter-branch -f --prune-empty -- --all
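One caveat: filter-branch by itself doesn't free any disk space, because the pre-rewrite commits remain reachable through the refs/original/* backup refs and the reflog. A cleanup sketch (the helper name scrub_rewrite_backups is made up; run it inside the rewritten repository, and only once you're sure the rewrite is good, since it destroys the backups):

```shell
# Hypothetical cleanup helper: drop filter-branch's backup refs and all
# reflog entries pinning the old objects, then repack to free the space.
scrub_rewrite_backups() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read ref; do
        git update-ref -d "$ref"          # delete refs/original/* backups
    done
    git reflog expire --expire=now --all  # stop the reflog pinning old commits
    git gc --prune=now --aggressive       # repack, dropping unreachable objects
}
```

On the central server everyone pulls from, an alternative is to push the rewritten branches into a fresh bare clone, which never contained the old objects in the first place.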

GForge guy here. Even though this is primarily a git question, I'd like to offer two things:

  1. Starting in GForge 6.3, site admins can identify projects that are using too much disk space, as well as old and orphaned projects. This might help you avoid full-disk situations, especially if you have lots of separate teams and projects.
  2. Implementing git hooks (SCM hooks in general) is easy to do in GForge. Site administrators can configure any number of hook commands, and project-level people can then select which hooks they want for their project. Adding a hook that prevents certain types (or sizes) of files would be a good fit for this feature.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow