Why squash git commits for pull requests?

https://softwareengineering.stackexchange.com/questions/263164

06-10-2020

Question

Why does every serious GitHub repo I submit pull requests to want me to squash my commits into a single commit?

I thought the git log was there so you could inspect all your history and see exactly what changes happened where, but squashing it pulls it out of the history and lumps it all into one commit. What is the point?

This also seems to go against the "commit early and commit often" mantra.

Solution

So that you have a clear, concise git history that documents the changes made and the reasons why.

For example a typical 'unsquashed' git log for me might look like the following:

7hgf8978g9... Added new slideshow feature, JIRA # 848394839
85493g2458... Fixed slideshow display issue in ie
gh354354gh... wip, done for the week
789fdfffdf... minor alignment issue
787g8fgf78... hotfix for #5849564648
9080gf6567... implemented feature # 65896859
gh34839843... minor fix (typo) for 3rd test

What a mess!

Whereas a more carefully managed and merged git log with a little additional focus on the messages for this might look like:

7hgf8978g9... 8483948393 Added new slideshow feature
787g8fgf78... 5849564648 Hotfix for android display issue
9080gf6567... 6589685988 Implemented pop-up to select language

I think you can see the point of squashing commits generally, and the same principle applies to pull requests: readability of the history. You may also be adding to a commit log that already has hundreds or even thousands of commits, and squashing helps keep the growing history short and concise.

You want to commit early and often. It's a best practice for many reasons. I find that this leads me to frequently have commits that are "wip" (work-in-progress) or "part A done" or "typo, minor fix", where I am using git to give me working points that I can go back to if the code that follows isn't working out. However, I do not need or want that history as part of the final git history, so I may squash my commits. See the notes below as to what this means on a development branch vs. master.
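As a rough sketch of what such a squash can look like on the command line, the following builds a throwaway repository, makes three disposable commits on a branch, and collapses them into one. The repository, the branch name `slideshow` and the file `app.txt` are all made up for the illustration. It uses `git reset --soft` because that runs without an editor; `git rebase -i` is the more common interactive way to get a similar result.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Work on a feature branch, committing early and often with throwaway messages
git checkout -q -b slideshow
for msg in "wip" "typo, minor fix" "part A done"; do
    echo "$msg" >> app.txt
    git commit -q -am "$msg"
done
before=$(git rev-list --count master..slideshow)

# Squash: move the branch back to where it forked from master,
# keeping every change staged, then record one well-described commit
git reset -q --soft "$(git merge-base master slideshow)"
git commit -q -m "8483948393 Added new slideshow feature"
after=$(git rev-list --count master..slideshow)

echo "commits on branch before squash: $before, after: $after"
git log --oneline master..slideshow
```

The file contents are identical before and after; only the way the history is recorded changes.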

If there are major milestones that represent distinct working stages, it's still OK to have more than one commit per feature/task/bug. However, this can often highlight the fact that the ticket under development is 'too big' and needs to be broken down into smaller pieces that can stand alone, for example:

8754390gf87... Implement feature switches

seems like "1 piece of work". Either they exist or they don't! It doesn't seem to make sense to break it out. However, experience has shown me that (depending on organizational size and complexity) a more granular path might be:

fgfd7897899... Add field to database, add indexes and a trigger for the dw group
9458947548g... Add 'backend' code in controller for when id is passed in url.
6256ac24426... Add 'backend' code to make field available for views.
402c476edf6... Add feature to UI

Small pieces mean easier code reviews, easier unit testing, better opportunity to QA, better alignment with the Single Responsibility Principle, etc.

For the practicalities of when to actually do such squashes, there are basically two distinct stages, each with its own workflow:

your development, e.g. pull requests
your work added to the mainline branch, e.g. master

During your development you commit 'early and often' and with quick 'disposable' messages. You may wish to squash here sometimes, e.g. squashing in wip and todo message commits. It is ok, within the branch, to retain multiple commits that represent distinct steps you made in development. Most of the squashes you choose to do should be within these feature branches, while they are being developed and before merge to master.
When adding to the mainline branch you want the commits to be concise and correctly formatted according to the existing mainline history. This might include the ticket tracker system ID, e.g. JIRA, as shown in the examples. Squashing doesn't really apply here unless you want to 'roll up' several distinct commits on master. Normally you don't.

Using --no-ff when merging to master always creates a merge commit, even when a fast-forward would be possible, and preserves the branch's individual commits in the history. Some organizations consider this to be a best practice. See more at https://stackoverflow.com/q/9069061/631619. You will also see the practical effect in git log: a --no-ff merge commit will be the latest commit, at HEAD (right after the merge), whereas without --no-ff the branch's commits may appear further down in the history, depending on dates and other commits.

OTHER TIPS

Because often the person pulling a PR cares about the net effect of the commits ("added feature X"), not about the "base templates, bugfix function X, add function Y, fixed typos in comments, adjusted data scaling parameters, hashmap performs better than list"... level of detail.

If you think that your 16 commits are best represented by 2 commits rather than 1 ("Added feature X", "re-factored Z to use X"), then it is probably fine to propose a PR with 2 commits, but it might be better to propose 2 separate PRs in that case (if the repo still insists on single-commit PRs).

This doesn't go against the "commit early and commit often" mantra: in your own repo, while you are developing, you still have the granular details, so you have minimal chance of losing work, and other people can review/pull/propose PRs against your work while the new PR is being developed.

The main reason from what I can see is as follows:

The GitHub UI for merging pull requests currently (Oct 2015) does not allow you to edit the first line of the commit message, forcing it to be Merge pull request #123 from joebloggs/fix-snafoo
The GitHub UI for browsing the commit history currently does not allow you to view the history of the branch from the --first-parent point of view
The GitHub UI for looking at the blame on a file currently does not allow you to view the blame of the file with the --first-parent point of view (note that this was only fixed in Git 2.6.2, so we could forgive GitHub for not having that available)

Combine these three limitations and unsquashed commits end up looking ugly once they are merged through the GitHub UI.

Your history with squashed commits will look something like

1256556316... Merge pull request #423 from jrandom/add-slideshows
7hgf8978g9... Added new slideshow feature
56556316ad... Merge pull request #324 from ahacker/fix-android-display
787g8fgf78... Hotfix for android display issue
f56556316e... Merge pull request #28 from somwhere/select-lang-popup
9080gf6567... Implemented pop-up to select language

Whereas without squashed commits the history will look something like

1256556316... Merge pull request #423 from jrandom/add-slideshows
7hgf8978g9... Added new slideshow feature, JIRA # 848394839
85493g2458... Fixed slideshow display issue in ie
gh354354gh... wip, done for the week
789fdfffdf... minor alignment issue
56556316ad... Merge pull request #324 from ahacker/fix-android-display
787g8fgf78... hotfix for #5849564648
f56556316e... Merge pull request #28 from somwhere/select-lang-popup
9080gf6567... implemented feature # 65896859
gh34839843... minor fix (typo) for 3rd test

When you have a lot of commits in a PR, tracing where a change came in can become a bit of a nightmare if you restrict yourself to using the GitHub UI.

For example, you find a null pointer being de-referenced somewhere in a file... so you say "who did this, and when? what release versions are affected?". Then you wander over to the blame view in the GitHub UI and you see that the line was changed in 789fdfffdf... "oh but wait a second, that line was just having its indent changed to fit in with the rest of the code", so now you need to navigate to the tree state for that file in the parent commit and re-visit the blame page... eventually you find the commit... it's a commit from 6 months ago... "oh **** this could be affecting users for 6 months" you say... ah but wait, that commit was actually in a Pull Request and was only merged yesterday and nobody has cut a release yet... "Damn you people for merging commits without squashing history" is the cry that can usually be heard after about 2 or 3 code archeology expeditions via the GitHub UI

Now let us consider how this works if you use the Git command line (and super-awesome 2.6.2 which has the fix for git blame --first-parent)

If you were using the Git command line, you would be able to control the merge commit message completely and thus the merge commit could have a nice summary line.

So our commit history would look like

$ git log
1256556316... #423 Added new slideshow feature
7hgf8978g9... Added new slideshow feature, JIRA # 848394839
85493g2458... Fixed slideshow display issue in ie
gh354354gh... wip, done for the week
789fdfffdf... minor alignment issue
56556316ad... #324 Hotfix for android display issue
787g8fgf78... hotfix for #5849564648
f56556316e... #28 Implemented pop-up to select language
9080gf6567... implemented feature # 65896859
gh34839843... minor fix (typo) for 3rd test

But we can also do

$ git log --first-parent
1256556316... #423 Added new slideshow feature
56556316ad... #324 Hotfix for android display issue
f56556316e... #28 Implemented pop-up to select language

(In other words: the Git CLI lets you have your cake and eat it too)
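The two views can be reproduced with a short sketch. Everything here is invented for the demonstration: a scratch repository, a helper function `merge_pr`, and branches named after the examples in this answer. Each "pull request" gets two throwaway commits and is merged with `--no-ff` and a hand-written summary line.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Hypothetical helper: $1 = branch name, $2 = summary line for the merge commit
merge_pr() {
    git checkout -q -b "$1" master
    echo "$1 wip" >> app.txt
    git commit -q -am "wip"
    echo "$1 done" >> app.txt
    git commit -q -am "minor fix"
    git checkout -q master
    git merge -q --no-ff -m "$2" "$1"
}

merge_pr select-lang-popup   "#28 Implemented pop-up to select language"
merge_pr fix-android-display "#324 Hotfix for android display issue"
merge_pr add-slideshows      "#423 Added new slideshow feature"

echo "--- full history ---"
git log --format='%h %s'
echo "--- mainline view (--first-parent) ---"
git log --first-parent --format='%h %s'

all_commits=$(git rev-list --count HEAD)
mainline_commits=$(git rev-list --count --first-parent HEAD)
echo "all commits: $all_commits, first-parent commits: $mainline_commits"
```

The full log still contains every "wip" and "minor fix", while the `--first-parent` view shows only the merge summaries plus the initial commit.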

Now when we hit the null pointer issue... well we just use git blame --first-parent -w dodgy-file.c and we are immediately given the exact commit where the null pointer de-reference was introduced to the master branch ignoring simple whitespace changes.
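Here is a minimal sketch of that blame scenario, assuming Git 2.6.2 or later. The contents of `dodgy-file.c` and the branch name are made up: one commit on the branch introduces the suspicious line, a second commit only re-indents it, and the branch is then merged with `--no-ff`. The commit blamed for line 2 is compared across three invocations.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.com"

printf 'int *p = lookup();\n' > dodgy-file.c
git add dodgy-file.c
git commit -q -m "Initial commit"

git checkout -q -b add-slideshows
printf 'int *p = lookup();\nint v = *p;\n' > dodgy-file.c
git commit -q -am "Added new slideshow feature"      # introduces the dereference
intro_commit=$(git rev-parse HEAD)

printf 'int *p = lookup();\n    int v = *p;\n' > dodgy-file.c
git commit -q -am "minor alignment issue"            # whitespace-only change
reindent_commit=$(git rev-parse HEAD)

git checkout -q master
git merge -q --no-ff -m "#423 Added new slideshow feature" add-slideshows
merge_commit=$(git rev-parse HEAD)

# In --porcelain output the first field of the first line is the blamed commit
blamed() { head -n 1 | cut -d ' ' -f 1; }

plain=$(git blame --porcelain -L 2,2 dodgy-file.c | blamed)
ignore_ws=$(git blame --porcelain -L 2,2 -w dodgy-file.c | blamed)
mainline=$(git blame --porcelain -L 2,2 -w --first-parent dodgy-file.c | blamed)

echo "plain blame:         $(git log -1 --format=%s "$plain")"
echo "blame -w:            $(git log -1 --format=%s "$ignore_ws")"
echo "blame -w --first-p.: $(git log -1 --format=%s "$mainline")"
```

Plain blame lands on the re-indent, `-w` finds the commit that wrote the line, and `--first-parent` points at the merge, which is when the line actually reached master.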

Of course if you are doing merges using the GitHub UI then git log --first-parent is really crappy thanks to GitHub forcing the first line of the merge commit message:

1256556316... Merge pull request #423 from jrandom/add-slideshows
56556316ad... Merge pull request #324 from ahacker/fix-android-display
f56556316e... Merge pull request #28 from somwhere/select-lang-popup

So to cut a long story short:

The GitHub UI (Oct 2015) has a number of shortcomings in how it merges pull requests, how it presents the commit history and how it attributes blame information. The current best way to hack around these defects in the GitHub UI is to ask people to squash their commits before merging.

The Git CLI doesn't have these issues and you can easily choose which view you want to see so that you can both discover the reason why a particular change was made that way (by looking at the history of the unsquashed commits) as well as see the effectively squashed commits.

Post Script

The final reason often cited for squashing commits is to make backporting easier... if you only have one commit to back port (i.e. the squashed commit) then it is easy to cherry pick...

Well if you are looking at the git history with git log --first-parent then you can just cherry pick the merge commits. Most people get confused cherry picking merge commits because you have to specify the -m N option but if you got the commit from git log --first-parent then you know that it is the first parent that you want to follow so it will be git cherry-pick -m 1 ...
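A small sketch of such a backport, in a scratch repository with an invented maintenance branch called `release-1.x`: the fix is developed in two commits, merged to master with `--no-ff`, and the merge commit is then cherry-picked onto the release branch with `-m 1`.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"
git branch release-1.x                      # hypothetical maintenance branch

git checkout -q -b fix-android-display
echo "fix part 1" >> app.txt
git commit -q -am "wip"
echo "fix part 2" >> app.txt
git commit -q -am "hotfix for #5849564648"

git checkout -q master
git merge -q --no-ff -m "#324 Hotfix for android display issue" fix-android-display
merge_commit=$(git rev-parse HEAD)

# Backport: -m 1 means "take the diff of the merge relative to its first parent",
# i.e. everything the branch brought into master, applied as one ordinary commit
git checkout -q release-1.x
git cherry-pick -m 1 "$merge_commit" >/dev/null

git log --oneline release-1.x
```

The release branch ends up with a single new commit carrying the merge's summary line and the combined change of both branch commits.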

I would consider whether the code is going public or not.

When projects stay private:

I would recommend not squashing, so that you can see how the sausage was made. If you use good, small commits, tools like git bisect are super handy: people can quickly pinpoint the commit that introduced a regression and see why you made the change (because of the commit message).
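To illustrate why small commits help here, this sketch builds a scratch history of six small commits, one of which introduces a made-up regression (a line reading `BUG` in `app.txt`), and lets `git bisect run` find it. The test command is a stand-in for a real test suite: it exits non-zero whenever the marker is present.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "start" > app.txt
git add app.txt
git commit -q -m "Initial commit"
good_commit=$(git rev-parse HEAD)

# Six small commits; the fourth one (hypothetically) introduces the regression
for i in 1 2 3 4 5 6; do
    if [ "$i" -eq 4 ]; then
        echo "BUG" >> app.txt
    else
        echo "change $i" >> app.txt
    fi
    git commit -q -am "Small change $i"
    if [ "$i" -eq 4 ]; then bug_commit=$(git rev-parse HEAD); fi
done

# Known bad = HEAD, known good = the initial commit; git checks out midpoints
# and runs the command at each one (exit 0 = good, exit 1 = bad)
git bisect start HEAD "$good_commit" >/dev/null
git bisect run sh -c '! grep -q BUG app.txt' >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

git log -1 --format='first bad commit: %h %s' "$first_bad"
```

Had the six commits been squashed into one, bisect could only report that single large commit as the culprit.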

When projects go public:

Squash everything, because some commits can contain security leaks. For example, a password that has been committed and then removed again.
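The effect can be seen in a sketch like the following, where the secret (`hunter2`), the files and the branch name are all invented. A password is committed on a feature branch and removed again in the next commit, and the branch is then brought into master with `git merge --squash`. Note the limits of this: the unsquashed branch still contains the secret, so it must not be published, and a password that has already been pushed anywhere should be treated as compromised and changed.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "user=app" > settings.txt
echo "base" > app.txt
git add settings.txt app.txt
git commit -q -m "Initial commit"

git checkout -q -b feature
echo "password=hunter2" >> settings.txt      # made-up secret, committed by mistake
echo "new feature" >> app.txt
git commit -q -am "wip"
echo "user=app" > settings.txt               # secret removed again
git commit -q -am "remove password"

# The secret is gone from the files but still visible in the branch history
leaks_in_branch=$(git log -p feature | grep -c "hunter2" || true)

# Squash-merge: only the net change is recorded on master
git checkout -q master
git merge -q --squash feature
git commit -q -m "Added new feature"
leaks_in_master=$(git log -p master | grep -c "hunter2" || true)

echo "diff lines mentioning the secret on feature: $leaks_in_branch"
echo "diff lines mentioning the secret on master:  $leaks_in_master"
```

Only master is clean here; the `feature` branch and the local reflog still reference the original commits.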

I agree with the sentiments expressed in other answers in this thread about showing a clear and concise history of features added and bugs fixed. I did, however, want to address another aspect which your question alluded to but did not explicitly state. Part of the hangup you may have about some of git's methods of working is that git allows you to rewrite history, which seems strange when you come to git after using other forms of source control where such actions are impossible. Furthermore, it goes against the generally accepted principle of source control that once something is committed/checked in, you should be able to revert to that state no matter what changes you make after that commit. As you imply in your question, this is one of those situations.

I think git is a great version control system; however, to understand it you must understand some of the implementation details and design decisions behind it, and as a result it has a steeper learning curve. Keep in mind that git was intended to be a distributed version control system; that helps to explain why the designers allow git history to be rewritten, with squashed commits being one example of this.

Because of the perspective ... The best practice is to have a single commit per single <<issue-management-system>> issue if possible.

You could have as many commits as you want in your own feature branch / repo, but that is your history, relevant to what you are doing now from YOUR perspective ... it is not the HISTORY for the whole TEAM / PROJECT or application from their perspective, to be kept several months from now ...

So whenever you would like to commit a bug fix or feature to a common repo (this example uses the develop branch), you could do it as follows:

how-to squash your feature branch quickly before merging it into develop

    # record the current branch and back it up (timestamp with seconds precision,
    # so a second backup made within the same minute does not collide)
    curr_branch=$(git rev-parse --abbrev-ref HEAD)
    git branch "$curr_branch--$(date "+%Y%m%d_%H%M%S")"
    git branch -a | grep -F -- "$curr_branch" | sort -r

    # squash all your changes at once: move the branch back to where it forked
    # from develop, keeping every change in the working tree
    git reset "$(git merge-base develop "$curr_branch")"

    # check the modified files to add
    git status

    # add the modified files dir by dir, or file by file, or all as shown
    git add --all

    # check once again (your old commits should no longer be listed)
    git log --format='%h %ai %an %m%m %s' | less

    # add the single message of your commit for the stuff you did
    git commit -m "<<MY-ISSUE-ID>>: add my very important feature"

    # check once again
    git log --format='%h %ai %an %m%m %s' | less

    # make a backup once again
    git branch "$curr_branch--$(date "+%Y%m%d_%H%M%S")"
    git branch -a | grep -F -- "$curr_branch" | sort -r

    # compare the old backup with the new backup, should not have any differences
    git diff <<old-backup>>..<<new-backup-branch>>

    # the history was rewritten, so you have to force-push your feature branch
    # (--force-with-lease refuses if someone else pushed to it in the meantime)
    git push --force-with-lease

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange