Question

I have a repository with more than 10,000 entries. I don't need to handle renamed files. What would be the best approach to count the number of changes made to a file?

My idea was to iterate over all commits and compare the file's target.sha in each commit with the one in its parent commit. If the SHA is the same, the file was not changed. If the SHA is different, the file was changed, meaning this is a new version.

foreach (Commit c in repository.Commits)
{
    // DO THE WORK
}

This takes some time, but it is the fastest I could get so far.

Maybe someone has a better idea?


Solution

The approach you describe is basically as fast as you're going to get. What's left are optimisations specific to your implementation, but since you haven't posted your code, we can't comment on those.
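For reference, here is a minimal sketch of the approach described in the question, assuming LibGit2Sharp (which the `Commit` and `repository.Commits` names suggest). The class name `FileChangeCounter`, the method `CountChanges`, and the choice to compare against the first parent only are assumptions made for illustration, not something taken from the question.

```csharp
using System.Linq;
using LibGit2Sharp;

static class FileChangeCounter
{
    // Counts the commits reachable from HEAD in which the object stored at
    // `path` differs from the one in the commit's first parent.
    // Assumption: merge commits are compared against their first parent only.
    // Adding or deleting the file also counts as a change here.
    public static int CountChanges(Repository repo, string path)
    {
        int changes = 0;
        foreach (Commit c in repo.Commits)
        {
            // Commit[path] returns null when the path does not exist in that commit.
            string current = c[path]?.Target.Sha;

            // A root commit has no parent, so `previous` stays null.
            Commit parent = c.Parents.FirstOrDefault();
            string previous = parent?[path]?.Target.Sha;

            if (current != previous)
            {
                changes++;
            }
        }
        return changes;
    }
}
```

It could be called as `FileChangeCounter.CountChanges(new Repository(repoPath), "src/Program.cs")`, where both arguments are placeholders. Paths are relative to the repository root and use forward slashes.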

It could be worth comparing the trees that lead to the file instead of only the file to save a few allocations and marshalling costs; but you won't really get better algorithmically than comparing the tree entries.
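One way to read the tree-comparison suggestion is sketched below, again assuming LibGit2Sharp. The walk descends one path segment at a time and stops as soon as two tree entries have the same SHA, because an identical subtree cannot contain a changed file. The names `TreeWalk` and `PathChanged` are hypothetical, and whether this beats a plain `commit[path]` lookup in practice would need measuring.

```csharp
using LibGit2Sharp;

static class TreeWalk
{
    // Returns true when the object at `path` differs between `commit` and `parent`.
    public static bool PathChanged(Commit commit, Commit parent, string path)
    {
        // Root commit: the file counts as changed if it exists at all.
        if (parent is null) return !(commit[path] is null);

        Tree a = commit.Tree, b = parent.Tree;

        // Identical root trees mean nothing changed anywhere in the commit.
        if (a.Sha == b.Sha) return false;

        foreach (string segment in path.Split('/'))
        {
            TreeEntry ea = a[segment], eb = b[segment];
            if (ea is null || eb is null) break;   // missing on one side: decide below

            // Same SHA at this level: everything underneath is identical.
            if (ea.Target.Sha == eb.Target.Sha) return false;

            // Descend into the next directory level.
            a = ea.Target as Tree;
            b = eb.Target as Tree;
            if (a is null || b is null) break;     // reached a non-tree entry
        }

        // The short cut could not decide, so fall back to the full path lookup.
        return commit[path]?.Target.Sha != parent[path]?.Target.Sha;
    }
}
```

The early exits are the point of the design: for a commit that only touched another directory, the loop usually returns after comparing one or two SHAs.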

OTHER TIPS

This would actually be your best bet. It's the same approach Git itself takes, so it would take a lot of work to do it any better, faster, or as reliably. Switching to a different hashing algorithm such as MD5 is unlikely to help if all you care about is counting the commits where changes were made: Git has already computed and stored the SHA of every version of the file, so the comparison only reads existing IDs, whereas re-hashing would mean loading the file contents for every commit.

NOTE: Theoretically any hash can collide, but that only becomes a concern for incredibly large data sets, and the stored SHAs should suffice for your needs.

Licensed under: CC-BY-SA with attribution