Git notes performance and alternatives

https://stackoverflow.com/questions/22468146

16-06-2023

Question

We are using git at work for a large team (>100 developers) and I am writing different scripts to provide git statistics to management.

One of the statistics management wants is when a commit was actually pushed to the repository. They don't really care about the author date or committer date, because what matters is when the commit was pushed and therefore picked up by the CI server. So I had to implement something like a push date. Just for completeness (not to advertise myself :)), here is my blog post describing the details.

Basically, I use custom git notes to record when each commit was actually pushed to the remote repository.

Let's consider a simple task: provide a list of all commits between A (exclusive) and B (inclusive) and output the commit hash, commit message and push date.

I can do something like:

git log A..B --notes=push-date --format="<begin>%H<separator>%s<separator>%N<end>"

And then parse the output accordingly. This is still quite slow, though. I also don't like string parsing and would prefer a strongly typed approach.
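For illustration, the parsing step could be sketched as follows in Python. This is only a sketch under assumptions: the placeholder separators are taken to be the ASCII unit and record separator bytes (which git can emit through the %x1f and %x1e hex escapes in a format string), and the names GIT_FORMAT, CommitRecord and parse_log are hypothetical. The function works on a plain string, so it can be tried without a repository.

```python
from dataclasses import dataclass
from typing import List

FIELD_SEP = "\x1f"   # ASCII unit separator, assumed to be emitted via %x1f
RECORD_SEP = "\x1e"  # ASCII record separator, assumed to be emitted via %x1e

# Hypothetical format string that would produce the layout parsed below,
# e.g. git log A..B --notes=push-date --format=<GIT_FORMAT>
GIT_FORMAT = "%H%x1f%s%x1f%N%x1e"


@dataclass
class CommitRecord:
    sha: str
    subject: str
    push_date: str  # raw note text; empty when the commit has no note


def parse_log(output: str) -> List[CommitRecord]:
    """Split the log output into typed records, one per commit."""
    records = []
    for chunk in output.split(RECORD_SEP):
        chunk = chunk.strip("\n")  # drop the newline git prints between entries
        if not chunk:
            continue
        sha, subject, note = chunk.split(FIELD_SEP, 2)
        records.append(CommitRecord(sha, subject, note.strip()))
    return records


# Made-up sample output: the first commit has a note, the second has none.
sample = (
    "a1" + FIELD_SEP + "Fix build" + FIELD_SEP + "2014-03-18T10:00:00\n"
    + RECORD_SEP + "\n"
    + "b2" + FIELD_SEP + "Add tests" + FIELD_SEP + RECORD_SEP + "\n"
)
parsed = parse_log(sample)
```

Control characters are used as separators here because they are unlikely to appear in a commit subject, which avoids the ambiguity of a printable separator.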

So to solve performance issues and get rid of parsing I decided to use LibGit2Sharp library.

As long as we don't touch notes it works pretty fast, but as soon as I try to retrieve notes it becomes very, very slow:

# PowerShell script
$pushDateNote = $commit.Notes | Where-Object -FilterScript { $_.Namespace -eq "push-date" }
$pushDate = [DateTime]::Parse($pushDateNote.Message)

For comparison: without notes, results for 200 commits are returned in about 2 seconds. With notes, the time goes up to 2 minutes.

I've checked that the bottleneck is looking up a note by commit. It seems that git itself doesn't keep a map between commits and notes, so it has to search through all the notes every time. I've just checked: we have 188921 commits in the repository, so the situation will most likely get even worse. So my solution is not scalable at all.

So my question: am I doing it wrong? Maybe git is not the right tool to store its own metadata efficiently? I am now thinking of moving all the metadata into an external database such as MSSQL, but I'd rather keep everything in one place. Alternatively, I was thinking of keeping the whole map between commits and their push dates serialized as a single note attached to one object.

For example, using the magic hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (see: Is git's semi-secret empty tree object reliable, and why is there not a symbolic name for it?):

git notes add 4b825dc642cb6eb9a060e54bf8d69288fbee4904 -m serialized-data
$serializedData = git notes show 4b825dc642cb6eb9a060e54bf8d69288fbee4904

This would retrieve the data only once, so there would be no lookup issues. But it adds overhead to serialize and deserialize the data, and it just doesn't look right to me.
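To make the serialize-deserialize overhead concrete, the single-note idea could be sketched like this in Python. JSON is an assumption (the serialization format is not specified above), and the function names are hypothetical. The resulting string would be what gets stored with git notes add and read back with git notes show.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def serialize_push_dates(push_dates: Dict[str, datetime]) -> str:
    """Turn a commit-SHA -> push-date map into one JSON string."""
    as_text = {sha: when.isoformat() for sha, when in push_dates.items()}
    return json.dumps(as_text, sort_keys=True)


def deserialize_push_dates(payload: str) -> Dict[str, datetime]:
    """Rebuild the commit-SHA -> push-date map from the JSON string."""
    return {
        sha: datetime.fromisoformat(text)
        for sha, text in json.loads(payload).items()
    }


# Made-up example data; real keys would be full 40-character commit hashes.
push_dates = {
    "c1": datetime(2014, 3, 17, 9, 0, tzinfo=timezone.utc),
    "c3": datetime(2014, 3, 18, 10, 30, tzinfo=timezone.utc),
}
payload = serialize_push_dates(push_dates)
restored = deserialize_push_dates(payload)
```

With this layout, every push has to rewrite the whole payload, so the cost grows with the number of commits recorded.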

Please share your thoughts.

Solution

Accessing the notes from the Commit object makes libgit2 access the notes tree at each iteration of the loop. A more efficient way to do it is to:

first, load the list of commits you are interested in (you are already doing that apparently)
then load all the notes associated with the push-date namespace only once
and eventually perform a join between those two lists

note: this will add some more pressure from a memory perspective, but it should be faster.
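The three steps above amount to a hash join. A library-independent sketch in Python, using plain strings in place of LibGit2Sharp objects (all names and data here are hypothetical), might look like this:

```python
from typing import Dict, Iterable, List, Optional, Tuple


def index_notes(notes: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build a commit-SHA -> note-message map in a single pass over the notes."""
    return {target_sha: message for target_sha, message in notes}


def join_commits_with_notes(
    commit_shas: Iterable[str], notes_by_sha: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each commit with its note, or None when it has no note."""
    return [(sha, notes_by_sha.get(sha)) for sha in commit_shas]


# Made-up data: c2 has no note, and the note on "zz" is outside the range.
commit_shas = ["c3", "c2", "c1"]
notes = [
    ("c1", "2014-03-17T09:00:00"),
    ("c3", "2014-03-18T10:30:00"),
    ("zz", "2014-01-01T00:00:00"),
]

notes_by_sha = index_notes(notes)  # loaded once, not once per commit
joined = join_commits_with_notes(commit_shas, notes_by_sha)
```

Each lookup in the dictionary takes constant time on average, so the total work is proportional to the number of notes plus the number of commits, at the price of holding all notes of the namespace in memory.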

This can be done in C# with the following code:

using System.Diagnostics;
using System.Linq;
using LibGit2Sharp;

using (var repo = new Repository("your_repo_path"))
{
    var notes = repo.Notes["push-date"];
    var commits = repo.Commits.QueryBy(
        new CommitFilter {Since = "1234567", Until = "89abcde"});

    var pairs = from commit in commits
        from note in notes
        where note.TargetObjectId == commit.Id
        select new {Commit = commit, Note = note};

    foreach (var pair in pairs)
    {
        // Print the note text rather than the Note object itself
        Debug.WriteLine(pair.Commit.Sha + " : " + pair.Note.Message);
    }
}

This will output the commits which have a note associated in the push-date namespace.

note: if you are using the QueryBy syntax to retrieve the list of commits, please be aware that the commit specified as Until will be excluded from the list (like A in git log A..B)

In order to also show the commits which have no notes associated in the push-date namespace, you can use the following LINQ query:

var pairs2 = from commit in commits
             join note in notes on commit.Id equals note.TargetObjectId into gj
             from subnote in gj.DefaultIfEmpty()
             select new { Commit = commit, Note = subnote };

Other tips

You can always consider using alternatives to 'git notes'. See: https://www.tikalk.com/posts/2015/11/12/yet-another-way-to-implement-commit-metadata/

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow