Question

The company I work for notifies its customers (90%+ of them non-profits) ahead of time that there may be a brief outage at 10pm and that editing on their websites will be suspended for an hour or possibly longer. The window falls on a weeknight, when it should not impact customers or their site visitors.

There is usually no scheduled maintenance during this time. However, we did have a mildly severe IT outage one month ago on a crucial day and another one a couple of weeks before that.

Because of a separate IT problem this week on a test environment (non-customer impacting), it was necessary to push this week's scheduled 10pm maintenance back a week.

The PR team was FURIOUS about the scheduled maintenance getting moved. They feel it is a black eye and that it looks unprofessional in front of customers and possibly prospects. We feel that it's a very normal part of the business and that it inconveniences no one.

Why would PR feel that way and what could we have done differently?


Solution

You have a couple of problems here but I believe the majority can be fixed with statistics and good communication.

Why are you experiencing so many outages?

This is probably PR's biggest worry. There have been two outages outside the maintenance windows, and they've probably had a rough time because of this.

The first thing I'd recommend is doing Root Cause Analysis on these problems and making them public. I'm not talking about forcing anyone to apologise, I'm talking about a simple:

  • What went wrong?
  • What happened?
  • How are you going to stop it happening again?

These should include a list of actions which are followed up on, with progress reported back to all the departments.

Doing this will help show the business that you're learning from issues and are not going to attempt to hide from them.

The reason I strongly advise this is that PR may feel like they have to "smooth things over" after Dev mistakes. That's probably not entirely fair, but by showing that you're actively working to improve you'll get them much more on board when you go asking for downtime.

Why is your maintenance window at 10pm?

Most businesses no longer do overnight deployments... why? Because if anything does go wrong they are less able to respond.

Most websites experience higher levels of traffic in the evenings and at weekends. Conveniently, this means that the quieter periods are when your developers are working!

10pm is obviously a business decision. That doesn't mean it's correct... perhaps you actually get less traffic at 10:30am on a Tuesday?
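One way to test that question against data is to count requests per weekday and hour and see which slots are quietest. The following is a minimal sketch, assuming you can extract one timestamp per request from your access logs; the function names `requests_per_slot` and `quietest_slots` are made up for illustration, and the log parsing itself is left out because it depends on your server.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def requests_per_slot(timestamps):
    """Count requests for each (weekday, hour) slot.

    Every one of the 168 weekly slots starts at zero, so a slot with
    no traffic at all still shows up as a candidate.
    """
    counts = Counter({(day, hour): 0 for day in range(7) for hour in range(24)})
    for ts in timestamps:
        counts[(ts.weekday(), ts.hour)] += 1
    return counts


def quietest_slots(timestamps, n=3):
    """Return the n slots with the fewest requests as (day, hour, count)."""
    counts = requests_per_slot(timestamps)
    # Sort by request count first, then by slot so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [(DAYS[day], hour, count) for (day, hour), count in ranked[:n]]


# Hypothetical usage: `parsed_log_times` would be a list of datetime
# objects taken from your own access logs, in your customers' time zone.
# for day, hour, count in quietest_slots(parsed_log_times):
#     print(f"{day} {hour:02d}:00  {count} requests")
```

Run it over several weeks of logs rather than one, since a single week can be skewed by a campaign or a holiday, and bring the resulting table to the conversation with PR.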

Communication

Why would PR feel that way?

Had they already notified the customers when you moved the downtime? Did they have to resend the email? Were they consulted or informed about the new time?

If PR are notifying your clients that there may be an outage and then there's downtime at a completely different time they may be worried (rightly so) that it will simply look like another outage.

My suggestion would be to set up a process for feeding back to PR (I'm assuming here that they are the ones doing the client communication). Let them know exactly which maintenance windows will be used and what for, and if you need additional ones, when you would like them. Close these gaps: don't make them send emails unnecessarily, and communicate clearly so they know when you do need them to do a little client communication.

I would also seriously consider holding pages with a message such as "We're upgrading our site". They look a lot more professional than a 500 error!
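As a rough illustration of a holding page, the sketch below answers every request with HTTP 503 (Service Unavailable) and a `Retry-After` header, which tells browsers and crawlers that the downtime is temporary. It uses only Python's standard library. The page text, the one-hour `Retry-After` value and the `make_server` helper are assumptions for the example; in practice you would more likely configure the same behaviour in your load balancer or web server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the window lasts about an hour, so clients are told to
# retry after 3600 seconds.
RETRY_AFTER_SECONDS = 3600

HOLDING_PAGE = (
    b"<!doctype html><html><head><title>Scheduled maintenance</title></head>"
    b"<body><h1>We're upgrading our site</h1>"
    b"<p>We expect to be back within the hour. Thanks for your patience.</p>"
    b"</body></html>"
)


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Serve the same holding page for every path and method."""

    def _send_holding_page(self, include_body=True):
        self.send_response(503)  # temporary outage, not a server crash
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(HOLDING_PAGE)))
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.end_headers()
        if include_body:
            self.wfile.write(HOLDING_PAGE)

    def do_GET(self):
        self._send_holding_page()

    def do_POST(self):
        self._send_holding_page()

    def do_HEAD(self):
        self._send_holding_page(include_body=False)


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the holding-page server."""
    return HTTPServer((host, port), MaintenanceHandler)
```

Calling `make_server().serve_forever()` starts it. Visitors then see a deliberate message rather than an error, which also makes moved or extended maintenance look planned rather than like another outage.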

TLDR

Your actions:

  • Go through your outages (including the test one as it took resource away from a live release) and find out why they happened, and put processes in place to make sure they won't happen again. Communicate these findings to the business. This isn't a blame game, it's about minimising customer impact in the future.
  • Establish a system to make sure you don't advertise downtime you're not going to use and that you do have it when you require it. It may be that a window is the best way to achieve this, it may not be.
  • Look at your stats: is 10pm really the best time to have downtime?

The next time you want a scheduled maintenance window I would suggest you go with:

"We would like to run some DB maintenance. Doing this at 10pm would be risky because most of our team will be out of the office and we'd have to pay overtime. Our site traffic logs indicate that Wednesday morning is a quiet time, so I'd like to propose scheduling some downtime then instead."

If PR push back, ask them to explain the impact they envisage, given that most of your clients' benefactors will be at work. Don't be aggressive: you have done your best to find the optimum time, so ask them what you didn't consider.

It does sound to me like these guys are on your side, but you do need to be aware of each other's efforts to do what's right for your customers.

Licensed under: CC-BY-SA with attribution