Separating previously conjoined code into multiple git repositories

https://stackoverflow.com/questions/10258054

02-06-2021

Question

This question sounds similar to many posed here, but it's obnoxiously different.

I have a git repository that was once an svn repository (which was itself once a cvs repository). It contains history going back to about 1999.

The time has come to split this one repository into several different repositories, preserving all of this rich history. However, the structure of the repository has changed frequently. All current projects came from a base project, which grew to a few projects, which shrank to two projects, and then grew again. Code has been moved around but was never duplicated; it has now all found a final resting place in one of several mature projects.

This makes splitting the repositories very hard if I want to preserve the history. git filter-branch seems like the right tool, but every recipe I have found hacks off parts of the repository and truncates their history along with them.

EDIT ADDED To clarify, here's a small example, pretending I'm in the root of the repository. Let's say the repository looks like this:

foo/
    bar/
        file.txt
    baz/

Now let's say I edit the contents of file.txt. Then I rename it to newfile.txt. Then I edit the contents again. Then I move this file out of bar/ and into baz/. My repository now looks like this:

foo/
    bar/
    baz/
        newfile.txt

Ok, now let's say I want to split baz/ out into its own repository. Using git filter-branch or using git subtree split will lose all commit messages and history for newfile.txt back when it was inside bar/ and when it was named file.txt.

I understand that checking out a historical revision might be crazy; it might reference something called ../bar/ or it might reference an invalid directory that doesn't exist and fail spectacularly. I don't care as long as I can look at the file contents at any particular revision.
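
For anyone who wants to experiment, the example above can be reproduced in a throwaway repository. This is only a sketch: the temporary directory, the commit messages and the identity are made up for illustration. It counts how many commits a path-limited view of foo/baz sees, which is the history a subdirectory-based split keeps.

```shell
#!/bin/sh
set -eu

# Build the toy repository in a temporary directory (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

mkdir -p foo/bar foo/baz
echo one > foo/bar/file.txt
git add -A
git commit -q -m "add file.txt"

echo two >> foo/bar/file.txt
git commit -q -am "edit file.txt"

git mv foo/bar/file.txt foo/bar/newfile.txt
git commit -q -m "rename to newfile.txt"

echo three >> foo/bar/newfile.txt
git commit -q -am "edit newfile.txt"

git mv foo/bar/newfile.txt foo/baz/newfile.txt
git commit -q -m "move into baz/"

# Compare the full history with the commits that touch foo/baz directly.
total=$(git rev-list --count HEAD)
in_baz=$(git rev-list --count HEAD -- foo/baz)
echo "commits in total: $total, commits touching foo/baz: $in_baz"
```

Only the final move touches foo/baz, so the four earlier commits are exactly the ones a split on that directory leaves behind.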

END EDIT

It seems like there are two paths for what I want to do:

1. Clone the repository N times, keep only the folders I want in each clone (by git rm-ing the other folders), and somehow hack off any revisions that do not eventually lead to files that are in HEAD. I realize this has a few negative side effects, in that checking out old revisions will not give a meaningful code base; I don't care. To do this I'd need a way to get all historical paths of the files that exist in HEAD, which I could do with an ugly script.
2. Build some sort of history index of what the repository looked like at each revision. Use a tree filter to chop off files that aren't matched in their respective revision. Then delete the files that don't appear in, or descend from, files in HEAD.
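
The "ugly script" for collecting every path a surviving file has ever had could look roughly like the sketch below. It assumes it is run from the root of the repository, that rename detection is good enough for the files involved, and that path names contain no unusual characters; historical_paths is a made-up name.

```shell
# Print every path that any file currently in HEAD has had over its lifetime.
# Relies on `git log --follow`, which tracks one file at a time across renames.
historical_paths() {
    git ls-tree -r --name-only HEAD |
    while IFS= read -r f; do
        # --pretty=format: suppresses the commit header, leaving only names
        git log --follow --pretty=format: --name-only -- "$f"
    done |
    sed '/^$/d' |
    LC_ALL=C sort -u
}
```

Running historical_paths in the example repository would be expected to list the file under foo/bar/ with both of its names as well as under foo/baz/. It runs one git log per file, so it will be slow on a large tree.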

Is it possible to find all files that don't appear in HEAD and remove any history pertaining to them? I don't care about resurrecting files that have been long deleted, and this seems to be at the crux of my issue.

Alternative solutions would also be appreciated. I'm relatively new to git, so I'm probably missing something obvious.

Solution

I ended up having to do this in several stages.

First, I got a list of all the file paths that were ever found in the repository:

git log --pretty=format: --name-only --diff-filter=A | sort -u
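
If the list of files is too long to scan by eye, one option is to reduce it to the directories that ever held a file. The helper below is a sketch with a made-up name (unique_dirs), not part of git; it only strips the last path component from each line.

```shell
# Reduce a list of file paths (one per line on stdin) to the unique
# directories that contained them. Top-level files are reported as ".".
unique_dirs() {
    sed -e '/^$/d' \
        -e 's|^[^/]*$|.|' \
        -e 's|/[^/]*$||' |
    LC_ALL=C sort -u
}

# Example with a few hand-written paths:
printf '%s\n' foo/bar/file.txt foo/bar/newfile.txt foo/baz/newfile.txt README |
    unique_dirs
```

On a real repository the input would come from the git log command above, piped into unique_dirs.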

Using that, I was able to determine where the files I wanted to keep had resided at one point or another. In my case, they had resided in four separate directories in the repository throughout their lifetimes. I used this information to manually create a regex, such as (?:^foo|^bar/baz|^qux/(?:moo|woof)). This matches the directories I wanted to keep.

I then wrote a Perl script (filter.pl) that reads a sorted list of directories and prints the ones that can be deleted, sparing the paths I wanted AND any parent directories that contain them.

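#!/usr/bin/perl
# filter.pl: read a sorted list of directories on STDIN and print the ones
# that can be deleted without touching any path matched by the regex.
use strict;
use warnings;
use Path::Class;

die "usage: filter.pl REGEX\n" if @ARGV < 1;

my $regex = qr/$ARGV[0]/;
my (@want, @remove);
my ($last, $lastrm);

while (my $d = <STDIN>) {
    chomp $d;
    if ($d =~ $regex) {
        # Record only the topmost wanted directory; its children are implied.
        if (!defined $last || !dir($last)->subsumes(dir($d))) {
            $last = $d;
            push @want, $d;
        }
    } elsif (!defined $last || !dir($last)->subsumes(dir($d))) {
        # Not wanted and not inside the most recent wanted directory.
        push @remove, $d;
    }
}

foreach my $rm (@remove) {
    my $no_rm = 0;
    if (defined $lastrm && dir($lastrm)->subsumes(dir($rm))) {
        # Already covered by a parent directory printed earlier.
        $no_rm++;
    } else {
        # Spare any directory that is an ancestor of a wanted path.
        foreach my $keep (@want) {
            if (dir($rm)->subsumes(dir($keep))) {
                $no_rm++;
                last;
            }
        }
    }
    if ($no_rm == 0) {
        print "$rm\n";
        $lastrm = $rm;
    }
}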

Finally, I ran git filter-branch with an index filter that pipes each commit's directory list through this script and removes whatever it prints, which leaves only the paths I wanted.

git filter-branch --prune-empty --index-filter '
    git ls-tree -d -r -t --name-only --full-tree "$GIT_COMMIT" |
    sort | /path/to/filter.pl "(?:regex|of|paths)" |
    xargs -n 50 git rm -rf --cached --ignore-unmatch' -- --all

The sort is necessary because the Perl script expects each parent directory to arrive before its children.
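
One follow-up that may be useful: git filter-branch keeps the pre-rewrite refs under refs/original/, so the old objects stay in the repository until those refs are gone. The git filter-branch manual describes a cleanup along the lines of the sketch below. The function name is made up, and it should only be run after the rewritten history has been checked, because it throws the backup away.

```shell
# Remove the backup refs left by git filter-branch, then expire reflogs
# and prune the objects that are no longer reachable.
drop_filter_branch_backups() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while IFS= read -r ref; do
        git update-ref -d "$ref"
    done
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
}
```

Cloning the rewritten repository into a fresh directory is a gentler alternative, since the clone does not copy refs/original/.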

I hope this helps someone, as it took me many, many hours to come up with this. :)

Other suggestions

You should look into installing and using git subtree (https://github.com/apenwarr/git-subtree). It handles splitting repos and preserving history well.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow