Question

I am trying to extract (source code line, author label) pairs from git repositories. The easiest way to do that is with git blame. The problem is that git blame attributes a line to the last committer who touched it, whether that committer merely re-indented the code or really changed it. Do you know a better method?

Or maybe, before trying to solve the problem, I should first check how many source lines are associated with multiple authors. If the percentage is small, there is no need to worry about it. But I find that even counting them is difficult. For a commit with a single parent, how can we know that the commit changed a line rather than deleted a line and added a line? For a commit with two parents (like a merge), how should I combine the diff results from the two branches?

Thanks

Solution

Overview

The question rests on a fundamental misunderstanding of how Git works. Git does not commit patches or diffs; it commits trees and blobs, although packfiles certainly do some sort of deltification. Most of what you see as commit history (which lines a commit added, removed, or changed) is calculated at run-time with some flavor of diff.

In other words, if your diff tools can do what you want, so can Git.

git-blame

The git-blame command won't do what you want because, as the man page says:

Annotates each line in the given file with information from the revision which last modified the line.

In other words, it's strictly line-oriented.

git-log

You can get close to what you want with git-log. For example:

# Show diffs with indifference to whitespace changes (e.g. indenting).
git log --patch --ignore-space-change

# Just ignore whitespace altogether.
git log --patch --ignore-all-space

# Show deletions with [- -] and additions with {+ +}.
git log --patch --word-diff=plain

# Custom diff format where ~ denotes newlines.
git log --patch --word-diff=porcelain

The porcelain format is intended for text processing, but it's very non-intuitive from a visual point of view. However, it is well-documented in man 1 git-diff for your programming pleasure.
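
As a rough illustration of what processing the porcelain format might look like, here is a minimal Python sketch. It assumes the layout described in the git-diff man page: inside a hunk, each run starts with a space, + or -, and a ~ on a line of its own marks a newline in the input. The names parse_word_diff and summarize_lines are made up for this example and are not part of Git.

```python
from typing import Iterator, List, Tuple

# A run is (kind, text), where kind is " " (unchanged), "+" (added) or "-" (removed).
Run = Tuple[str, str]


def parse_word_diff(text: str) -> Iterator[List[Run]]:
    """Yield the list of runs for each source line found inside diff hunks."""
    in_hunk = False
    runs: List[Run] = []
    for raw in text.split("\n"):
        if raw.startswith("@@"):
            # A hunk header starts a new hunk; flush anything left over.
            if runs:
                yield runs
            runs = []
            in_hunk = True
        elif not in_hunk:
            # File headers such as "--- a/x" and "+++ b/x" are skipped here.
            continue
        elif raw == "~":
            # A lone tilde closes the current source line.
            yield runs
            runs = []
        elif raw[:1] in (" ", "+", "-"):
            runs.append((raw[0], raw[1:]))
        elif raw.startswith("\\"):
            # "\ No newline at end of file" marker.
            continue
        else:
            # Anything else (e.g. "diff --git", a blank line) ends the hunk.
            if runs:
                yield runs
            runs = []
            in_hunk = False
    if runs:
        yield runs


def summarize_lines(text: str) -> List[Tuple[str, bool]]:
    """Return (new text of the line, whether any word was added or removed)."""
    result = []
    for runs in parse_word_diff(text):
        new_text = "".join(part for kind, part in runs if kind != "-")
        changed = any(kind != " " for kind, _ in runs)
        result.append((new_text, changed))
    return result
```

Feeding it the output of git log --patch --word-diff=porcelain gives one entry per line in each hunk. Note that a line made up only of removed runs is reported with an empty new text, so a caller that cares about deleted lines would need to treat that case separately.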

The downside is that you will have to get your author information from the author or committer name recorded in each commit (the values that GIT_AUTHOR_NAME and GIT_COMMITTER_NAME supply at commit time), rather than having Git decorate each line for you.
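
One possible way to recover the author yourself is to have git log print a machine-readable header for each commit and pair every added line with it. The Python sketch below is only an example under stated assumptions: the %x01 marker byte, the tab-separated header layout, and all function names are choices made here for illustration. The placeholders %H, %an and %cn (commit hash, author name, committer name) are standard git log format placeholders.

```python
import subprocess
from typing import Iterator, List, Tuple

# Assumed header layout: <0x01><hash><TAB><author name><TAB><committer name>
HEADER_MARK = "\x01"
LOG_FORMAT = "--format=%x01%H%x09%an%x09%cn"


def git_log_command(path: str) -> List[str]:
    """Build a git log call that ignores whitespace-amount changes."""
    return ["git", "log", "--patch", "--ignore-space-change", LOG_FORMAT, "--", path]


def run_git_log(repo: str, path: str) -> str:
    """Run the command inside a repository and return its text output."""
    done = subprocess.run(git_log_command(path), cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def added_lines_with_author(log_text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (added line, author name, commit hash) for every '+' line in a hunk."""
    sha = author = ""
    in_hunk = False
    for raw in log_text.split("\n"):
        if raw.startswith(HEADER_MARK):
            # Start of a new commit: remember who wrote it.
            sha, author, _committer = raw[1:].split("\t", 2)
            in_hunk = False
        elif raw.startswith("@@"):
            in_hunk = True
        elif in_hunk and raw.startswith("+"):
            yield raw[1:], author, sha
        elif in_hunk and raw[:1] not in (" ", "-", "\\"):
            # "diff --git ..." or a blank line ends the hunk, so the
            # "+++ b/file" header that follows is never taken for an added line.
            in_hunk = False
```

A call such as list(added_lines_with_author(run_git_log(".", "some/file.c"))) would then list every added line with its author, newest commit first; some/file.c is a placeholder path. By default git log shows no patch for merge commits, so this sketch sidesteps the two-parent case rather than solving it.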

Licensed under: CC-BY-SA with attribution