What are the strategies on a large software program (200 engineers) to get people to fix environments when people are measured on features?

softwareengineering.stackexchange https://softwareengineering.stackexchange.com/questions/348049

Question

I work on a large software program in financial services (15 project managers, 20 technical leads, 15 environments, 150 people on the technical side).

We do lots of bank-wide integration to hundreds of systems (insurance, bill payment, mutual funds, tax reporting, equities trading, etc.), which are frequently down in the development environments. Whilst there is an environment team, they're basically system administrators who need the assistance of a Java Lead to identify the root cause of an issue and fix it.

In smaller-scale teams I'd worked on before (an investment fund system), a single PM would own a set of environments and would be responsible for removing blockages for a particular feature all the way to production.

In this larger programme, the project managers have a pattern of wiggling out of this responsibility. There are no points for them in fixing environments. In addition, tech leads get slammed for working on something that is not shipping features.

The test managers throw up their hands because the testers can't log in to a system 1/3 of the time.

Now the way I've phrased the question may lead you to the answer "well, change the way people are measured! Duh!" If you can articulate a concrete way for people to be measured that creates this incentive, I'm keen to hear it. Unfortunately things are not that simple. Project managers commit to shipping software, and a kanban board shows everyone's utilisation in terms of stories shipped, so there is an incentive to maximise utilisation and story points shipped.

Now you can make use of the information radiator and show all the stories as blocked, but the answer that comes back in that situation is "make it someone else's problem," instead of "take the time to find the solution, and fix it so that it never happens again."

Another argument is that this sort of thing sorts itself out: the person who feels the pain needs to spend the time to fix the problem. Interestingly enough, this doesn't stop them from being slapped by the PM for letting their utilisation go down.

I'm considering taking it to the top of the programme and offering a simple 'washing-up roster' system that puts one PM/tech lead pair on fixing the environments one day per fortnight. The feedback I've had on this is that the programme head finds it convenient to ignore these issues, or to fire off a short-term operational order ("you - make this go away!") instead of thinking strategically about the problem. Taking it to the top is basically playing with fire.

When I take it to the test manager to ask him to talk to the head of the programme, he says, "The head of the programme is an operational, not a strategic, thinker. He's not interested in a systemic or medium-term fix."

My current idea is to get agreement from the test manager on the burn-rate costs associated with environments that are down, and then link these to our automated availability reports. (We have reports that show graphs of different parts of the system being up and down.) This way we can have an argument about the cost of fixing vs not fixing. The problem is that this invites a hostile reaction, because it relies on making people look bad financially.
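
For what it's worth, the arithmetic behind that argument is simple enough to script against the availability reports. Below is a minimal sketch, assuming the reports can be exported as (system, outage start, outage end) records; the number of people blocked and the loaded cost per hour are placeholder assumptions to be agreed with the test manager, not real programme figures.

```python
# Rough burn-rate estimate from exported availability data.
# All numbers below are placeholders, not actual programme figures.
from datetime import datetime

BLOCKED_PEOPLE = 30           # assumed headcount idled by a dev-environment outage
COST_PER_PERSON_HOUR = 100.0  # assumed loaded hourly cost

outages = [
    # (system, outage start, outage end) -- illustrative records only
    ("integration-web", datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 12, 30)),
    ("primary-db", datetime(2021, 1, 5, 14, 0), datetime(2021, 1, 5, 15, 45)),
]

total_hours = sum((end - start).total_seconds() / 3600 for _, start, end in outages)
burn = total_hours * BLOCKED_PEOPLE * COST_PER_PERSON_HOUR
print(f"Downtime: {total_hours:.1f} h, estimated burn: {burn:,.0f}")
```

Even rough numbers like these turn "the environments are always down" into a cost figure, which is the form the cost-of-fixing-vs-not-fixing argument needs to take.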

My question is: What are the strategies on a large software program (200 engineers) to get people to fix environments when people are measured on features?

EDIT: Thanks, the feedback so far has been enormously constructive and helpful. The question was raised about what an environment issue is. These include but are not limited to:

  • The primary integration and customer web server is out of memory
  • The primary integration web server hasn't loaded its caches
  • The primary integration web server has failed startup and is not showing a login page
  • The primary integration web server is out of sync with the Tivoli access management system
  • One of the many satellite systems is down (emails, statements, fees, equities trading, user setup, end of day)
  • The primary database is down or running slowly

The broader point is that the rate of change on the system is high enough that these are more likely to be new issues than the same issues cropping up again.

Someone helpful has suggested systematising these and measuring their occurrence. I have started several 'lightweight runbook' initiatives listing issues and root causes on a wiki. The more poisonous PMs see this as a utilisation failure.
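
To make the measuring half of that concrete: a scheduled probe of each dependency that logs pass/fail is enough to turn "the environment is always down" into an occurrence count per subsystem. A minimal sketch is below; the check names and URLs are made-up placeholders, not our actual endpoints.

```python
# Minimal environment health check, run on a schedule (e.g. cron) so outage
# frequency per subsystem can be counted rather than argued about.
# The endpoint names and URLs are placeholders for illustration.
import datetime
import json
import urllib.request

CHECKS = {
    "integration-web-login": "https://dev-integration.example.internal/login",
    "statements-service": "https://dev-statements.example.internal/health",
}

def probe(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

record = {
    "checked_at": datetime.datetime.utcnow().isoformat(),
    "results": {name: probe(url) for name, url in CHECKS.items()},
}

# Append to a log that the availability reports (or the wiki runbooks) can aggregate.
with open("env-health.log", "a") as fh:
    fh.write(json.dumps(record) + "\n")
```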

Someone helpful asked about the definition of done. At present this is defined as the software passing a DEV/QA test (i.e. prior to the SIT, UAT, and performance test phases).


Solution

It depends on what kind of issues you are facing. "Fixing the environment" is somewhat vague, and the lack of precision in describing the problem might in itself be a reason it is hard to get solved. If the problem is unclear, it is also unclear who is responsible for fixing it.

You have to break the perceived problems into concretely described issues with steps to reproduce and so on. Then you get the issues prioritized and scheduled like all other development tasks. If an issue causes extensive downtime for QA, it should be easy to get the fix prioritized, since downtime is pretty costly. (At least if management is halfway rational; if not, then your organization has management problems which are outside the scope of this forum.)

If the downtime is due to software releases frequently introducing bugs, then you have to redefine your "definition of done". A feature which causes the development environment to crash is not "done".

Looking at your examples, it seems QA is (or should be!) your friend here. If the development environment is down or unresponsive for whatever reason, then a feature should not be accepted by QA. If development is feature-driven, then everyone has an incentive to get these issues fixed, since a feature is not considered delivered before QA accepts it.

Two of the points demand special consideration:

  • The database is slow. If the database is functional but slow, it is not obvious whether QA should accept a feature. Here you will have to define acceptance criteria for the performance of the system, e.g. "the user should see the response screen within 2 seconds of pressing the OK button" (see the sketch after this list).

  • External systems are down. Well, you don't have any control over that. You might have to look into SLAs to see if you can force them to fix their issues. If it is a recurring problem, you will have to find alternatives or make your system more fault tolerant.
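
As an illustration of the first point, an acceptance criterion like the 2-second rule can be checked automatically as part of the QA gate rather than argued about per feature. The sketch below is only indicative; the URL and threshold are assumptions, and in practice such a check would live in whatever test suite defines "done".

```python
# Sketch of an automated check for a "response within 2 seconds" acceptance
# criterion. The URL and threshold are illustrative assumptions.
import time
import urllib.request

URL = "https://dev-integration.example.internal/api/confirm"  # assumed endpoint
THRESHOLD_SECONDS = 2.0

start = time.monotonic()
with urllib.request.urlopen(URL, timeout=10) as resp:
    resp.read()
elapsed = time.monotonic() - start

assert elapsed <= THRESHOLD_SECONDS, (
    f"Response took {elapsed:.2f}s (limit {THRESHOLD_SECONDS}s)"
)
print(f"OK: {elapsed:.2f}s")
```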

Other tips

The best strategy I have seen is an old-fashioned one.

Hire an Ops department and make it their job to keep the servers up.

Sure, this doesn't work in a startup, where devs have to, and want to, do everything. But it works very well in a large company with large systems, where you want to hire 'unit of work' devs who just write code.

The alternative seems to end up with bored senior devs whose entire job becomes diagnosing and fixing dev/test environments.

Licensed under: CC-BY-SA with attribution