How to identify and potentially remove big binary commits inside an SVN repository?

https://stackoverflow.com/questions/2176803

svn
fsfs

24-09-2019
|

Question

I am working with an SVN repository that is over 3 years old, contains over 6,100 commits and is over 1.5 GB in size. I want to reduce the size of the SVN repository (I'm not talking about the size of a full SVN export - I mean the full repository as it would exist on the server) before moving it to a new server.

The current repository contains the source code for all of our software projects but it also contains relatively large binary files of no significance such as:

Full installers for a number of 3rd party tools.
.jpg & .png files (which are unmodified exports of PSDs that live in the same folder).
Bin and Obj folders (which are then 'svn ignored' the next commit).
Resharper directories.

A number of these large files have been 'SVN deleted' since they were added, creating a further problem of identifing the biggest offenders.

I want to either:

Create a new SVN repository that contains only the code for all of the software projects - it is really important that the copied files maintain their SVN history from the old repository.
Remove the large binary commits and files from the existing repository.

Are either of these possible?

Solution

You will have to use svnadmin dump to get a dump file of your current repository and possibly svndumpfilter to process the dump file. You can also manually modify the dumpfile as long as you're carefull.

It's probably not going to be a quick and easy job, but it can be done. I've done something similar, only to a much smaller repository. I had a repo with about 150 revisions that took about 600MB.

Make a dump from your current repository, make the necessary changes and try to load the modified dumpfile in a new repository. Then check the new repository to make sure everything is still making sense (History is still correct, no weird changes in paths, ...).

OTHER TIPS

Otherside is right about svnadmin dump, etc. Something like this will get you a rough pointer to revisions that added lots of data to your repo, and are candidates for svndumpfilter:

for r in `svn log -q | grep ^r | cut -d ' ' -f 1 | tr -d r`; do
   echo "revision $r is " `svn diff -c $r | wc -c` " bytes";
done

You could also try something like this to find revisions that added files with a particular extension (here, .jpg):

svn log -vq | egrep "^r|\.jpg$" | grep -B 1 "\.jpg$"

If you deleted files from the repository using "SVN Delete", you didn't actually deleted the files. This would be the beauty of the SVN. Once a file is added to the repository, it is there forever (unless using dump & load). Upon "deleting" the files, you actually create a new revision that marks the deletion, but the files continue to exist in previous revisions.

I've done some dump & load, but to a much much bigger repository. Around 60,000 (!!!) revisions. It took time but at the end, after careful loading, the repository is again built.

Your only way is to list the revisions that the files were added, modified and deleted. Then dump the revisions in between, and load them in the right order. BE AWARE, there is no room for mistakes. If you make a mistake, you will have to start over. Dump & load from the start.

My suggestion, if the large files are such a problem, consider creating a newly fresh repository with no history. Keep the old one for history comparison, and start working from fresh.

Good Luck.

If you just need to find the offending commits and you have access to the server hosting the repository: look for large files in db/revs subdirectory of the repository (assuming it uses fsfs format).

Isn't this just a different problem, with an extra step? I.e. you need to locate files that you consider to be large and binary, and then check if they are indeed managed by SVN or have been built locally (or imported from the parallel asset system, if it's already in place).

So, just find the files, then do svn info on them to find out if they're part of the repository.

Just a small thought, you say that the current state of the repository (the current HEAD) is good, i.e. the large binary files have been svn delete'ed in the past. Therefore your issue is purely the size of the repository?

I know you said you would like to keep all the commit history, but as an option, you could do two dumps, one for the whole revision history, and one for the current HEAD revision.

If you put the full dump on to a DVD for example you would have the data available if you ever needed it, but you could then delete the whole repository and svn load the revision dump, leaving you with a small clean repository.

it is also possible to dump from a specific revision onwards, rather than just the head, so for example you could keep the last 3 months of revisions and dump everything older on to a DVD....

Elaborating on Otherside's answer, here's what specifically worked for me:

svnadmin create new-repo
svnadmin dump old-repo | svndumpfilter exclude --pattern '*.exe' '*.jpg' '*.png' | svnadmin load new-repo

You might be able to exclude your Obj and Bin directories by adding them to the svndumpfilter command – I didn't try it.

Also, Subversion's fsfs-stats program (new in Subversion 1.8, replaced by in 1.9 by svnfsfs stats) might be useful for quantifying the file types and specific files that are filling up your repository.

This might be useful for comparing the repositories afterward:

colordiff -u <(svn log -v file:///.../old-repo ) <(svn log -v file:///.../new-repo)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow