Question

I work as a release manager in a company with 15 internal developers and 10 external developers.

Setup

  • Code base is a heavily modified Magento (PHP)
  • Test suite is almost non-existent but is improving every day
  • Development is done on each developer's laptop with Docker containers
  • We have 3 environments
    • preproduction
    • staging
    • production
  • We do 2 preproduction releases per day
  • We do 1 production release every week

Branching model / code review process

  • All features / fix branches are created from master
  • When a piece of work is ready or has new commits, it is pushed to its feature branch on origin
  • A merge request is made to preproduction branch
  • The person in charge of the review handles the MR, resolves conflicts etc.
  • An automated process makes 2 deliveries per day to our preprod environment
  • The work is then available in preproduction for people from the business teams to validate
  • The validation / feedback process can take up to 6 months
  • Once a feature is validated, a merge request is made to the master branch, where I take care of the merges (developers are now supposed to rebase their feature branch onto master before opening the MR)
  • All merges are made to master, then master is delivered to our staging environment 2 days before production delivery

Main issues

  • Over time, the preproduction branch drifts far, far, far from master, i.e. a 6,000-commit difference
  • Business teams validate work on a code base that is not at all what will be deployed to production
  • Validation process is very slow, so we have to keep WIP developments for a long time
  • We have numerous regressions, where a WIP development breaks another feature under test
  • Conflicts are resolved multiple times by different people (the main reviewer when merging into preproduction, the developer when rebasing onto master before opening an MR to master)

I am seeking help on ways to improve this whole process and everybody's life.

Any insight would be greatly appreciated!

Solution

When I read through your question, everything sounded reasonable and even enviably agile – right up to the point where you said you have to wait 6 months (that is over 120 business days!) until a production-ready feature has been approved for deployment?!??!

Your problem is not the branching model.

Your root cause is that sluggish feedback cycle.

The “main issues” you've identified are problematic, but they primarily seem to be symptoms of the delayed feedback.

The cost of waiting

All this waiting causes your process to be wasteful – in the sense of stacks of $$$ slowly evaporating while contributions sit around in limbo:

  • The long delays increase the merging effort. New conflicts have to be resolved. The devs have to re-contextualize themselves with this code. Do these merges frequently introduce new bugs (costly to fix this late in the process)? How much does all of that add to the time needed to ship a feature? I'd expect the added cost from these delays to be in the range of 10% – 50% or maybe even more.

  • Each feature probably has some value to your users. Until it is delivered, this value is missing. What is the value of this feature over a period of five months? That is the value your organization is forfeiting by having to wait 6 months instead of one month for feature delivery.

  • These delays are likely impacting employee satisfaction. Most people like to make an impact. It's hard to see that impact when a small change needs half a year to go live. And all that code sitting in forgotten branches – how much effort was spent there that didn't lead to any value?

Possible reasons

It is now your job to find out why the feedback process takes so long. You will have to talk with the business teams, learn about their expectations and constraints. Possible reasons:

  • Are the initial requirements too vague, leading to multiple change–feedback cycles before the feature is validated?
  • Are new requirements bolted on to features in the feedback process? “Can't you just quickly add X, that's much quicker than requesting a new feature…”
  • Do the business teams not assign a suitable priority to feedback, and let these tasks wait for multiple weeks? Then why do they need these features?
  • Do the business teams use pre-production versions for their daily work, and are therefore not under any pressure to move these features to production?
  • Are features requested at a higher rate than can be validated, and validation happens in a FIFO order?
  • Does feature validation of each change actually involve an exhaustive QA process that covers much more than the changed functionality?

By the way, I strongly recommend a physical Kanban board to visualize the flow of features through your development pipeline: a big whiteboard with coloured post-it notes for each feature/branch. That makes it easy to communicate the scale of these delays. You can also add a “swim lane” for each team.

Mitigation strategies

Depending on your circumstances and the reasons for these delays, there are a variety of strategies you can try:

  • Each feature needs a single contact point from the business teams. They should be available for any questions from your devs, and they should approve the finished features. This shouldn't be a manager, but a subject-matter expert.

  • Add testing/QA roles to your team. They can spot quality issues and regressions before the changes are shown to the business teams. Automated unit tests are no replacement for a good QA team.

  • Manage the number of in-progress features per business team. If a business team has a multi-week backlog of features waiting for their validation, communicate the cost of these delays and ask them to complete those validations first – you have more than enough requests to handle from the other teams.

  • Ask the business teams to validate smaller features first. This makes the process feel much faster, though it delays slow features even more.

  • Offer deadlines: “We can deploy that feature to production by the end of the month if you approve it by the 23rd”.

  • If a feature takes a very long time to be validated, something fundamental seems to be wrong. Consider declaring the implementation a prototype, and start the feature development process anew – this time, with a better idea of what is actually needed.

  • Move to a slower release cycle. Your process sounds wonderfully agile, but the truth is: it's not. At least not yet.

    How could a release cycle work? You collect features for each release. For one release period, the devs work on these features. Then the business teams have time for one release period to validate the changes, while the devs work on the features for the next release. Towards the end of each period, the feedback is collected and the features are prepared for deployment. If the feedback process delays a feature, it can be integrated into the next release.

    The advantage here is that the rebasing doesn't have to happen continuously, but only around each release. While there will still be conflicts, you no longer suffer from them all the time. The preproduction version offered for feedback would also include all features scheduled for that release, thus allowing you to detect earlier when features break each other.

What the devs can do

Something the developers can try is using feature toggles. Each feature should be merged as soon as possible so that conflicts are detected early. However, that code will not run when deployed, as it is protected by a feature toggle. This is a configuration variable that allows the feature to be enabled or disabled at runtime. This also allows use cases like enabling the feature just for a single test user while it stays invisible to everyone else. Once the feature has been proven in deployment, the toggle can be removed so that it is always active.
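
As a rough illustration, here is a minimal sketch of what such a toggle could look like in plain PHP. The FeatureToggles class, the flag names and the user-whitelist convention are all invented for this example – a real Magento setup would more likely hang this off its own configuration system:

```php
<?php
// Hypothetical toggle helper (not part of Magento): flags live in configuration,
// so features can be switched on or off without redeploying code.
class FeatureToggles
{
    /** @var array flag name => true/false, or a list of user ids */
    private $flags;

    public function __construct(array $flags)
    {
        $this->flags = $flags;
    }

    public function isEnabled($feature, $userId = null)
    {
        $flag = isset($this->flags[$feature]) ? $this->flags[$feature] : false;

        if (is_bool($flag)) {
            return $flag;                                // globally on or off
        }
        // Otherwise the flag is a whitelist of user ids, e.g. array('1234').
        return $userId !== null && in_array($userId, $flag, true);
    }
}

// Usage at a call site: the new code ships with every deployment,
// but only runs where the toggle says so.
$toggles = new FeatureToggles(array(
    'new_checkout_flow' => false,         // merged and deployed, but dormant
    'loyalty_points'    => array('1234'), // visible only to a single test user
));

$currentUserId = '1234';                  // would come from the session in real code
if ($toggles->isEnabled('new_checkout_flow', $currentUserId)) {
    // new, still-under-validation code path
} else {
    // existing, production-proven code path
}
```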

However, feature toggles are not a silver bullet. You get fewer merge conflicts and can do large refactorings more easily, but you now have a massively more complex code base – how do all these flags interact? You can't test all combinations. This is only viable if each feature is very cohesive and doesn't need far-reaching changes throughout the code base. This is easier if the system was designed with extensibility in mind, e.g. via prudent use of some design patterns.

Since the devs are producing features faster than they can be deployed, they could spend some time on speculative, high-risk, high-reward experiments: things like rewriting some part of the code base. Or go for a unit-test blitz, where you try to get as many files to 80% statement coverage as possible within a week. Or do some training. Or make some time available for refactoring, which will allow new features to be integrated more easily. All of that will pay off in the long run, and will take the pressure off the business teams for a while.

Conclusion

While you could change your branching model, such changes would just be cosmetic. The real problem seems to be the delays in the feedback process. You will have to work with the business teams to find a way to speed these up. Once these delays are reduced to a tolerable time frame, you can return to the other opportunities for process improvement that you've identified.

OTHER TIPS

I feel your pain.

Try not to branch and merge. Instead, alter your codebase so that it is modular, then rely on build tooling to produce the correct artifact for each target environment by selecting which modules constitute the final build.

So Project X is not live yet, but it is checked into the same codebase as BAU (the business-as-usual work), and the Project X team are working on it. The BAU build does not add the Project X module to the final artifact, but the Project X build does.
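
As a very rough sketch of the idea – the build.php script, the manifest layout and the module names are invented for illustration, and a real Magento setup would more likely drive this through Composer packages or the module enable/disable list:

```php
<?php
// build.php (hypothetical): assemble an artifact from a per-target module list.
// Both builds come from the same codebase; only the manifest differs.
$manifests = array(
    'bau'       => array('Core', 'Checkout', 'Catalog'),             // business as usual
    'project-x' => array('Core', 'Checkout', 'Catalog', 'ProjectX'), // includes the new module
);

$target = isset($argv[1]) ? $argv[1] : 'bau';
if (!isset($manifests[$target])) {
    fwrite(STDERR, "Unknown build target: {$target}\n");
    exit(1);
}

$artifactDir = __DIR__ . "/build/{$target}";
if (!is_dir($artifactDir)) {
    mkdir($artifactDir, 0775, true);
}

foreach ($manifests[$target] as $module) {
    $source = __DIR__ . "/modules/{$module}";
    // In a real pipeline this would be a Composer require, an rsync, or a
    // Magento module-enable step; a plain copy keeps the sketch short.
    shell_exec(sprintf('cp -r %s %s/', escapeshellarg($source), escapeshellarg($artifactDir)));
    echo "Included module: {$module}\n";
}

echo "Built '{$target}' artifact in {$artifactDir}\n";
```

Running `php build.php bau` then produces the artifact without Project X, while `php build.php project-x` produces the one the Project X team tests against.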

If two teams need to work on exactly the same module at the same time, then either defer that work to a single team and let them manage the complexity, create a copy of that module and let the team owning the copy handle rebasing, or use feature switches. In practice you'll find that it's actually quite rare for two teams to work on exactly the same code regularly. Most work that is not BAU will be on new functionality.

Using the SCM to drive the content of the build, instead of using the build tooling, is a trap we've all been in for many years.

"Business teams validate work on a code base that is not at all what will be deployed to production"

This is actually your whole problem; solve it and the other problems go away. Anyone with half a brain and a bit of IT experience can tell you that this is very bad because of the unnecessary risk it brings with it. Usually a few bad production events are enough to rattle common sense into the business. (Note: I am not suggesting you do something unethical.)

If I were you, I would lobby real hard for a single-line, single-package code promotion path, backed with a few cost projections of what happens if it goes right and what happens if it goes wrong.

Companies like the one you described love process, and will buy in on just about any process improvement. If you have had any production deployments that went bad and affected a bunch of users, correcting the code promotion path under the umbrella of continuous process improvement should get you some results, and if you do get them, do not let the momentum evaporate.

Another thought to show how ridiculous the current approach is would be to sequentially version every deployable build or package that gets pushed to preprod, and to assign each prod deployment the version of the highest-numbered preprod build it contains. Then create or publish a schedule of what version is delivered to preprod and what is delivered to prod, and make sure it goes to each group that does the testing. Seeing the version numbers jumping all over the place may help you get some positive change.
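
If you want to automate that numbering, something as small as the following script, run by your delivery pipeline, would do. It is only a sketch: the VERSION counter file and the CSV manifest are made-up conventions, not existing tooling:

```php
<?php
// tag_build.php (hypothetical): give every delivered build a sequential number
// and record where it went, so the preprod-vs-prod version gap becomes visible.
$counterFile  = __DIR__ . '/VERSION';            // last assigned build number
$manifestFile = __DIR__ . '/build_history.csv';  // build number, date, target

$previous = is_file($counterFile) ? (int) file_get_contents($counterFile) : 0;
$current  = $previous + 1;
$target   = isset($argv[1]) ? $argv[1] : 'preprod';   // 'preprod' or 'prod'

file_put_contents($counterFile, (string) $current);
file_put_contents(
    $manifestFile,
    sprintf("%d,%s,%s\n", $current, date('Y-m-d H:i'), $target),
    FILE_APPEND
);

echo "Build #{$current} recorded for {$target}\n";
```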

You can also publish a "features awaiting test" report with the dates (and versions) of everything that went to preprod, who is responsible for testing it, how many days it has been waiting, and maybe some coarse measure of the business value of deploying the feature. Some may respond by testing sooner just to avoid being the worst on the list. Color-coding the slowest or longest-waiting items and putting them at the top may work as additional "motivation".
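
A sketch of what generating such a report could look like; the rows below are invented placeholders, and in practice the data would come from your issue tracker or the build manifest above:

```php
<?php
// report.php (hypothetical): list features awaiting validation, longest-waiting first.
$awaiting = array(
    // feature, preprod version, in preprod since, responsible team -- placeholder data
    array('feature' => 'Loyalty points',    'version' => 1412, 'since' => '2016-11-02', 'owner' => 'Sales'),
    array('feature' => 'New checkout flow', 'version' => 1398, 'since' => '2016-08-15', 'owner' => 'E-commerce'),
);

$today = new DateTime('today');
foreach ($awaiting as &$row) {
    $row['days'] = $today->diff(new DateTime($row['since']))->days;
}
unset($row);

// Longest-waiting items first, so they end up at the top of the published report.
usort($awaiting, function (array $a, array $b) {
    return $b['days'] - $a['days'];
});

foreach ($awaiting as $row) {
    printf(
        "%-20s v%d  waiting %3d days  (owner: %s)\n",
        $row['feature'], $row['version'], $row['days'], $row['owner']
    );
}
```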
