Question

I plan to use LibGit2/LibGit2Sharp, and hence Git, in an unorthodox manner, and I am asking anyone familiar with the API to confirm that what I propose will work in theory. :)

Scenario

Only the master branch will exist in the repository. A large number of directories containing large binary and non-binary files will be tracked and committed. Most of the binary files will change between commits. The repository should contain no more than 10 commits because of disk space limitations (the disk fills up quite often now).

What the API does not provide is a function that truncates the commit history from a specified CommitId back to the initial commit of the master branch and deletes any Git objects that would be left dangling as a result.

I have tested the ReferenceCollection.RewriteHistory method, and I can use it to remove the parents from a commit. This gives me a new commit history that starts at CommitId and runs up to HEAD. But that still leaves all of the old commits and any references or blobs that are unique to those commits. My plan right now is to simply clean up these dangling Git objects myself. Does anyone see any problems with this approach or have a better one?


Solution

"But that still leaves all of the old commits and any references or blobs that are unique to those commits. My plan right now is to simply clean up these dangling GIT objects myself."

While rewriting the history of the repository, LibGit2Sharp takes care not to discard the original references: it keeps a backup of each rewritten reference. The namespace under which these backups are stored is, by default, refs/original. This can be changed through the RewriteHistoryOptions parameter.

In order to remove old commits, trees and blobs, one would first have to remove those references. This can be achieved with the following code:

foreach (var reference in repo.Refs.FromGlob("refs/original/*"))
{
    repo.Refs.Remove(reference);
}
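
The backup namespace can also be chosen when the rewrite is started, which makes the clean-up glob explicit. The following is a minimal sketch, assuming that RewriteHistoryOptions exposes a BackupRefsNamespace property alongside CommitParentsRewriter; the class name HistoryTruncation, the method MakeRoot and the namespace value "refs/truncated/" are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using LibGit2Sharp;

public static class HistoryTruncation
{
    // Hypothetical namespace for the backup references; any unused prefix should do.
    private const string BackupNamespace = "refs/truncated/";

    // Turns the commit identified by newRootSha into a parentless commit,
    // then removes the backup references created by the rewrite.
    public static void MakeRoot(string repositoryPath, string newRootSha)
    {
        using (var repo = new Repository(repositoryPath))
        {
            Commit newRoot = repo.Lookup<Commit>(newRootSha);
            if (newRoot == null)
            {
                throw new ArgumentException("No commit found for " + newRootSha);
            }

            var options = new RewriteHistoryOptions
            {
                // Assumption: backups of rewritten references go under this prefix.
                BackupRefsNamespace = BackupNamespace,
                // Only the targeted commit is passed below, so it is the only
                // commit whose parents are replaced with an empty list.
                CommitParentsRewriter = c => Enumerable.Empty<Commit>()
            };
            repo.Refs.RewriteHistory(options, newRoot);

            // Materialize the list first so that removing references
            // does not interfere with the enumeration.
            List<Reference> backups = repo.Refs.FromGlob(BackupNamespace + "*").ToList();
            foreach (Reference backup in backups)
            {
                repo.Refs.Remove(backup);
            }
        }
    }
}
```

After this runs, the old commits are no longer referenced by the backup references, but their objects are still present in the object database until they are purged.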

The next step would be to purge the now-dangling Git objects. However, this cannot be done through LibGit2Sharp (yet). One option would be to shell out to git and run the following command:

git gc --aggressive

This will reduce the size of your repository in a way that is very effective, but also destructive and non-recoverable.
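
Shelling out from C# could look like the sketch below. It assumes a git executable is available on the PATH; the class GitMaintenance and its method names are hypothetical. Two details go beyond the single command shown above: by default git gc keeps unreachable loose objects that are younger than its prune grace period, and reflog entries, if the repository has any, keep old commits reachable. The sketch therefore expires the reflogs first and passes --prune=now.

```csharp
using System;
using System.Diagnostics;

public static class GitMaintenance
{
    // Runs "git <arguments>" in the given working directory and throws if git
    // reports a failure. Assumes git can be found on the PATH.
    public static void RunGit(string workingDirectory, string arguments)
    {
        var startInfo = new ProcessStartInfo("git", arguments)
        {
            WorkingDirectory = workingDirectory,
            UseShellExecute = false
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "git " + arguments + " exited with code " + process.ExitCode);
            }
        }
    }

    // Drops reflog entries, then repacks and prunes every unreachable object.
    // This is not recoverable: anything no longer referenced is gone afterwards.
    public static void PurgeUnreachableObjects(string repositoryPath)
    {
        RunGit(repositoryPath, "reflog expire --expire=now --all");
        RunGit(repositoryPath, "gc --prune=now --aggressive");
    }
}
```

A call such as GitMaintenance.PurgeUnreachableObjects(repositoryPath) would be made only after the backup references have been removed, and after the Repository instance has been disposed so that no file handles are held open.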

"Does anyone see any problems with this approach or have a better one?"

Your approach looks valid.

Update

"Does anyone see any problems with this approach or have a better one?"

If the limit is the disk size, another option would be to use a tool like git-annex or git-bin to store large binary files outside of the git repository. See this SO question to get some different views on the subject and potential drawbacks (deployment, lock-in, ...).

"I will try the RewriteHistoryOptions and foreach code that you provided. However, for now it looks like File.Delete on dangling git objects for me."

Beware, this may be a bumpy road:

  • Git stores objects in two formats: loose (one file on disk per object) or packed (one file on disk containing many objects). Removing objects from a pack file tends to be a bit complex, as it requires rewriting the pack file.
  • On Windows, entries in the .git\objects folder are usually read-only files. File.Delete can't remove them in this state. You'd have to unset the read-only attribute first, for instance with a call to File.SetAttributes(path, FileAttributes.Normal);.
  • Although you may be able to identify which commits have been rewritten, determining which Trees and Blobs are dangling/unreachable may turn into quite a complex task.
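
The first two points can be illustrated with a small sketch that uses only System.IO. The class LooseObjectFiles and its methods are hypothetical helpers, not part of LibGit2Sharp. The sketch assumes the usual loose object layout, where the first two characters of the SHA name a directory under objects and the remaining characters name the file. It does not decide whether an object is still reachable; that remains the caller's responsibility.

```csharp
using System.IO;

public static class LooseObjectFiles
{
    // Builds the path where a loose object with the given SHA would be stored.
    // gitDirectory is the path of the .git folder itself.
    public static string GetLooseObjectPath(string gitDirectory, string sha)
    {
        return Path.Combine(gitDirectory, "objects", sha.Substring(0, 2), sha.Substring(2));
    }

    // Deletes the loose object file if there is one. Returns false when no such
    // file exists, which is the case when the object is stored in a pack file
    // or is not in the repository at all.
    public static bool TryDelete(string gitDirectory, string sha)
    {
        string path = GetLooseObjectPath(gitDirectory, sha);
        if (!File.Exists(path))
        {
            return false;
        }

        // Clear the read-only attribute first, otherwise File.Delete fails on Windows.
        File.SetAttributes(path, FileAttributes.Normal);
        File.Delete(path);
        return true;
    }
}
```

A false return value is the signal that plain file deletion is not enough for that object and that a repack, for example through git gc, would be needed instead.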

OTHER TIPS

Following the suggestions above, here is the preliminary (still being tested) C# code that I came up with. It truncates the master branch at a specific SHA, creating a new initial commit, and it also removes all dangling references and blobs.

using System;
using System.Collections.Generic;
using System.IO;
using LibGit2Sharp;

public class RepositoryUtility
{
    public RepositoryUtility()
    {
    }
    public String[] GetPaths(Commit commit)
    {
        List<String> paths = new List<string>();
        RecursivelyGetPaths(paths, commit.Tree);
        return paths.ToArray();
    }
    private void RecursivelyGetPaths(List<String> paths, Tree tree)
    {
        foreach (TreeEntry te in tree)
        {
            paths.Add(te.Path);
            if (te.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyGetPaths(paths, te.Target as Tree);
            }
        }
    }
    public void TruncateCommits(String repositoryPath, Int32 maximumCommitCount)
    {
        //dispose the repository when done so that its native handles are released
        using (IRepository repository = new Repository(repositoryPath))
        {
            Int32 count = 0;
            string newInitialCommitSHA = null;
            //walk from the head towards the initial commit, counting commits
            foreach (Commit masterCommit in repository.Head.Commits)
            {
                count++;
                if (count == maximumCommitCount)
                {
                    newInitialCommitSHA = masterCommit.Sha;
                }
            }
            //there must be parent commits to the commit we want to set as the new initial commit
            if (count > maximumCommitCount)
            {
                TruncateCommits(repository, repositoryPath, newInitialCommitSHA);
            }
        }
    }
    private void RecursivelyCheckTreeItems(Tree tree,Dictionary<String, TreeEntry> treeItems, Dictionary<String, GitObject> gitObjectDeleteList)
    {
        foreach (TreeEntry treeEntry in tree)
        {
            //if the tree or blob is not used by any commit that is being kept then add it to the deletion list
            if (!treeItems.ContainsKey(treeEntry.Target.Sha))
            {
                if (!gitObjectDeleteList.ContainsKey(treeEntry.Target.Sha))
                {
                    gitObjectDeleteList.Add(treeEntry.Target.Sha, treeEntry.Target);
                }
            }
            if (treeEntry.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyCheckTreeItems(treeEntry.Target as Tree, treeItems, gitObjectDeleteList);
            }
        }
    }
    private void RecursivelyAddTreeItems(Dictionary<String, TreeEntry> treeItems, Tree tree)
    {
        foreach (TreeEntry treeEntry in tree)
        {
            //check for existence because if a file is renamed it can exist under a tree multiple times with the same SHA
            if (!treeItems.ContainsKey(treeEntry.Target.Sha))
            {
                treeItems.Add(treeEntry.Target.Sha, treeEntry);
            }
            if (treeEntry.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyAddTreeItems(treeItems, treeEntry.Target as Tree);
            }
        }
    }
    private void TruncateCommits(IRepository repository, String repositoryPath, string newInitialCommitSHA)
    {
        //the repository is opened and disposed by the caller
        Dictionary<String, TreeEntry> treeItems = new Dictionary<string, TreeEntry>();
        Commit selectedCommit = null;
        Dictionary<String, GitObject> gitObjectDeleteList = new Dictionary<String, GitObject>();
        //loop through the commits starting at the head and moving towards the initial commit
        foreach (Commit masterCommit in repository.Head.Commits)
        {
            //if non-null then we have already passed the commit where we want the truncation to occur
            if (selectedCommit != null)
            {
                //this commit is older than the truncation point, so add it to our deletion list
                gitObjectDeleteList.Add(masterCommit.Sha, masterCommit);
                //check the trees and blobs of this commit to see if they should be deleted
                RecursivelyCheckTreeItems(masterCommit.Tree, treeItems, gitObjectDeleteList);
            }
            else
            {
                //have we found the commit that we want to be the initial commit
                if (String.Equals(masterCommit.Sha, newInitialCommitSHA, StringComparison.OrdinalIgnoreCase))
                {
                    selectedCommit = masterCommit;
                }
                //this commit is the new initial commit or a newer one, so record the tree entries that need to be kept
                RecursivelyAddTreeItems(treeItems, masterCommit.Tree);
            }
        }

        //this function simply clears out the parents of the new initial commit
        Func<Commit, IEnumerable<Commit>> rewriter = (c) => { return new Commit[0]; };
        //perform the rewrite
        repository.Refs.RewriteHistory(new RewriteHistoryOptions() { CommitParentsRewriter = rewriter }, selectedCommit);

        //clean up the backup references now under refs/original and remove the commits that they point to
        foreach (var reference in repository.Refs.FromGlob("refs/original/*"))
        {
            repository.Refs.Remove(reference);
            //skip branch reference on file deletion
            if (reference.CanonicalName.IndexOf("master", 0, StringComparison.OrdinalIgnoreCase) == -1)
            {
                //delete the loose object from the file system
                DeleteGitBlob(repositoryPath, reference.TargetIdentifier);
            }
        }
        //now remove any tags that reference commits that are going to be deleted in the next step
        foreach (var reference in repository.Refs.FromGlob("refs/tags/*"))
        {
            if (gitObjectDeleteList.ContainsKey(reference.TargetIdentifier))
            {
                repository.Refs.Remove(reference);
            }
        }
        //remove the collected commits, trees and blobs from the Git object database
        foreach (KeyValuePair<String, GitObject> kvp in gitObjectDeleteList)
        {
            //delete the loose object from the file system
            DeleteGitBlob(repositoryPath, kvp.Value.Sha);
        }
    }

    //deletes a single loose object file; objects stored in pack files are not touched
    private void DeleteGitBlob(String repositoryPath, String blobSHA)
    {
        String shaDirName = System.IO.Path.Combine(System.IO.Path.Combine(repositoryPath, ".git\\objects"), blobSHA.Substring(0, 2));
        String shaFileName = System.IO.Path.Combine(shaDirName, blobSHA.Substring(2));
        //if the directory exists
        if (System.IO.Directory.Exists(shaDirName))
        {
            //get the files in the directory
            String[] directoryFiles = System.IO.Directory.GetFiles(shaDirName);
            foreach (String directoryFile in directoryFiles)
            {
                //if we found the file to delete
                if (String.Equals(shaFileName, directoryFile, StringComparison.OrdinalIgnoreCase))
                {
                    //if readonly set the file to RW
                    FileInfo fi = new FileInfo(shaFileName);
                    if (fi.IsReadOnly)
                    {
                        fi.IsReadOnly = false;
                    }
                    //delete the file
                    File.Delete(shaFileName);
                    //eliminate the directory if only one file existed 
                    if (directoryFiles.Length == 1)
                    {
                        System.IO.Directory.Delete(shaDirName);
                    }
                }
            }
        }
    }
}

Thanks for all of your help. It is sincerely appreciated. Please note I edited this code from the original because it did not take into account directories.

Licensed under: CC-BY-SA with attribution