SaaS Customer Success: Best Practices for Unplanned Outages

So I got an email recently about best practices for dealing with unplanned outages from a SaaS Customer Success standpoint.

I’ve attempted to answer the question in a meaningful way, but I am the first to acknowledge that there is a lot more to it than just what I talked about in this post.

That said, I think my answer provides a great way to think about Customer Success and unplanned outages by taking full advantage of the SaaS business model architecture.

Check it out…

Here’s David’s question:

Lincoln, I’m hoping you can help me with a quick question on SaaS best practices. I lead the support structure for a company whose applications are all provided in a SaaS environment and I’m looking to create clear guidelines for when we should notify customers/users when there are unplanned outages.

Meaning if the application is down x minutes we do nothing, if it’s down y minutes we email all users, etc. Any thoughts you have on this would be helpful. – David

Here’s my answer:

David, that might be a quick question to ask, but it’s not a quick question to answer.

Trust Trumps Everything

The main thing to remember is that Trust is Paramount in SaaS.

In B2B Software-as-a-Service, trust is absolutely king.

Trust that your product does what it says it will do.

Trust that you won’t lose data.

Trust that you’ll keep their data secure.

Trust that you won’t rip them off, steal their credit card numbers, or share their IP out the back door with China.

And trust begins and ends with communication.

That means you have to decide not to hide from your customers and be transparent with them to the level they need and/or expect.

Don’t Hide from Customers

Outage. Hiccup. Glitch. Issue. Problem. Downtime. Crash.

Whatever you want to call it… if it is something other than the norm and it affects customer-facing areas of your app, you can’t hide from it.

Yes, if backend processes bounce but that has no ill effects on customers and the customers wouldn’t even know it happened, that might be okay to not talk about.

But when there’s an issue – and especially if customers experienced it – you can’t hide.

Situation Impact

Okay, so when there is a problem, there are several ways to handle it, but here’s why it’s difficult to give a general answer to what to do (IMHO)…

…the reaction depends not so much on the amount of time of the outage… but the impact.

In the early days of a SaaS company’s life, it’s very likely that your app could go down completely for minutes at a time (maybe even days!) and no one would notice.

However, those that are impacted could suffer greatly if the app is a core part of their business.

On the other hand, at scale, a 1 second “blip” in just one “area” of your app could literally affect millions of transactions.

But those “transactions” might be low-value, non-core-business functions that the users – or your system – will just try again. No harm, no foul.

So to say “it depends” on how to react is an understatement.

That’s outside of SLAs, which are an interesting thing in the SaaS world, especially for vendors who’ve built their apps on top of layers of infrastructure and services that they don’t own, don’t control, and can’t even physically locate.

Given that, I’m not going to get into SLAs here except to say that how you react to an “outage” in a general sense and how you react to – and even define – an outage when SLAs are involved are two very different things.

Remember this…

People Use your App

I know sometimes it’s hard to remember this, but your customers – even your free(loading) users – are people.

Because of this fact, a good rule of thumb is to remember that while the problem with your app affected millions of transactions and thousands or millions of individual users and customers, each customer was affected individually.

This means that your users don’t cry for the masses that lost their work and they certainly don’t care about how this is affecting you and how that’s affecting your relationship with your cat.

They care about themselves, even if they’re generally good, caring, giving people in real life.

It’s good to remember that “they” aren’t your customer base…

“They” aren’t your user base…

They are individual people that trust(ed) you and have expectations of a certain level of service.

Of course their expectations around the level of service to expect or what is realistic could very well be incorrect, but that’s your fault for mismanaging those expectations in the first place.

Anyway…

Even though the fan – and the room – is now covered in the fallout from this mess, and it might be panic-inducing to have thousands or millions (or even just tens) of unhappy users out there frustrated with why they can’t finish their work or communicate with their customers or post their next blog post or check on their campaign…

…and even though they’re out there complaining on Twitter and posting “y u no work?” memes on Reddit…

… you need to keep in mind that each one of them is experiencing the outage individually, not as a cohort of unhappy customers.

Who cares, Lincoln? Why is that important? Context.

Context is King

First I have to say that Jason Lemkin, former CEO of EchoSign, has a great post on scaling your Customer Success team which should shed some light on how you might reach out to your various customer types, so I won’t go into that detail here.

But suffice it to say that when you reach out to your customers and users – people (as we’ve established) – you need to keep in mind that you’re reaching out to a human being experiencing an issue and feeling quite helpless (as we all have when a cloud service just disappears).

They’re angry – probably pissed – but not for the self-centered reason you think (this is where context comes in):

They’re embarrassed in front of a customer for using your system to store the artwork that their customer cannot access now
They’re embarrassed in front of co-workers because they championed your system in their company over other competitive products – they invested social capital with you – and now this failure is blowing back on them
They’re frightened that they’ve lost work and remember that “work” means sunk costs, time and materials, ideas they’ll forget, hours they’ve invested, etc.
They fear the worst… catastrophic data loss, hackers from a tiny country they’ve never heard of outside of, well, news about hackers

In other words, they’re feeling something about this situation that isn’t strictly technical or – honestly – even about your app.

Just like marketing, this isn’t about you or your SaaS app… it’s about your customer.

They aren’t upset that they can’t get in and see your super-slick UI elements.
They aren’t kicking their trashcan across the office because they can’t see your elegant sentiment-analysis algorithm at work.
They aren’t yelling obscenities at their cat because they can’t access your super-cool implementation of Node.js and CouchDB…

Get the point?

They’re all people, so treat them that way.

Rage is a Gauge

The irony is that the more passionate and urgent the reaction to your outage, the more important your SaaS app probably is to your customers, at least in a B2B environment.

So while you’re in the midst of this mess, keep in mind that those horrible emails, phone calls, and tweets you’re getting indicate that you’ve found an audience. It’s not all bad, even though it probably sucks right then.

In fact, if you’re a startup wondering if you have reached Product / Market Fit, don’t survey your users and ask if they’d be unhappy if your service went away…

… unplug your service on Monday morning and see what the reaction is.

If the response is crickets, you probably aren’t there yet.

Disclaimer: Don’t do that. (But it would prove the point) (But don’t do that).

So back to the question at hand…

How do you React to an Unplanned Outage?

Don’t.

Don’t react.

That’s not how SaaS providers think… you’re better than that!

Get Proactive.

Remember, you’re a SaaS vendor… you have visibility into what’s going on in your system in ways legacy software vendors simply did (do) not.

In the old days – I’m tellin’ my stories, kids, so pay attention – when software was installed at a customer location (yes, this actually used to happen), support would only know about a problem when the customer called.

Often, the customer only called after trying everything they could possibly do on their end to fix it first because talking to “support” was such an awful, brain-squenching experience.

When they finally did contact the vendor’s support team, and after the requisite “did you reboot the server?” question – then they might get started actually helping the customer fix the issue. Maybe.

But you’re better than that, right?

You’re a SaaS vendor and you should be able to see what your customers are doing and interacting with inside of your app (at a high-level at least… maybe not exactly what data they’re working with). If you don’t have this visibility, that’s a problem far beyond the context of this post and means that you’re not taking full advantage of the SaaS business model architecture.

Being a SaaS vendor means that you have the ability to be proactive and not wait until people contact you with a problem.

And this is especially true when you notice something that is affecting a large swath of users across all of the layers and “areas” of your application.

What to Say

What level of detail do you provide? Whatever is necessary.

What level of transparency is required? Whatever your customers need.

What might your customers be thinking? Put yourself in their shoes.

You have to know your customers.

If your customers are software developers, DevOps, or other technical folks, you might need to provide some low-level details… and doing so might even endear you to that crowd.

But if your customers are SMBs or department managers in larger companies, they probably don’t need such a level of detail… and in fact, providing that might alienate them.

If it’s something out of your control – an integration partner’s API was down, your cloud infrastructure provider went away briefly, etc. – don’t just pass the blame, but explain that the problem was out of your hands…

…and then tell your customers (in the level of detail congruent with the audience) how you’re going to fix this problem so it doesn’t happen again.

I say it is 100% legitimate to blame Amazon Web Services when they go down and take your service with ’em…

… however, it’s incompetent to blame them a second time for the same exact issue!

Ultimately, the depth and frequency of your transparency and updates about an outage depend entirely upon who your customers are and your importance to their business.

Once you know what level of detail to provide, you need to…

Get Proactive!

I hate to even give tactical advice here because it varies so much, but this might get you thinking about what is possible as a SaaS vendor.

Here’s some super-generic advice for a minor “blip” (whatever that means in your situation).

Since this is SaaS we should be able to see who was using the system when the problem occurred, which means you should be able to notify them – probably via email or other outside-the-app means – if there’s even a possibility they were affected.
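To make that a little more concrete, here’s a minimal sketch in Python of how you might pull the list of people to notify from your session data. Everything in it is an assumption for illustration: the Session record, its fields, and the users_active_during function are hypothetical names, not part of any particular product or library, and your own data will look different.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Session:
    """Hypothetical session record; your own schema will differ."""
    user_id: str
    email: str
    started_at: datetime
    ended_at: Optional[datetime]  # None means the session is still open


def users_active_during(
    sessions: Iterable[Session],
    outage_start: datetime,
    outage_end: datetime,
) -> Set[str]:
    """Return the emails of users whose session overlapped the outage window."""
    affected = set()
    for s in sessions:
        # Treat a still-open session as running through the end of the outage.
        session_end = s.ended_at or outage_end
        # Two intervals overlap when each one starts before the other ends.
        if s.started_at <= outage_end and session_end >= outage_start:
            affected.add(s.email)
    return affected
```

The sketch deliberately errs on the side of inclusion: anyone whose session touched the window at all gets a message, since the point is to reach people if there’s even a possibility they were affected.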

Easy enough, but what if you know that, while 500 users were logged in at the time of the “blip,” the issue would have kept people from logging in or accessing your app at all?

Well… that very much depends.

If the time was 12AM Pacific on a Saturday and those logged-in users were from the handful of customers in Europe, you might be able to safely assume that few people in the US – where the vast majority of your customers reside – were affected.

However, if the time was 8AM Pacific on a non-holiday Monday… you might have only had 500 logged in and active at the time, but could have had 10k people trying to access your service right then (possibly contributing to excess downtime… inadvertent DDoS attack!).

It should be easy to pull up some historical data to see that from 8:00 to 8:03AM on a non-holiday Monday, your system goes from 500 users to 10,500 active users… which means you need to not just reach out to those who were logged in and potentially directly affected, but probably to everyone.
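Here’s a rough sketch of that comparison in Python, using the numbers from the example above. The function names, the idea of taking the median of past weeks, and the ratio threshold of 2.0 are all my own assumptions for illustration; pick whatever measure and cutoff make sense for your traffic.

```python
from statistics import median
from typing import Sequence


def expected_active_users(historical_counts: Sequence[int]) -> int:
    """Typical number of active users for this time slot, based on past weeks."""
    if not historical_counts:
        return 0
    return int(median(historical_counts))


def notification_audience(
    logged_in_now: int,
    historical_counts: Sequence[int],
    ratio: float = 2.0,  # assumed threshold; tune for your own situation
) -> str:
    """Decide whether to notify only logged-in users or everyone.

    If history says far more people normally show up in this window than
    were logged in when the problem started, assume many of them tried
    and failed to get in, and reach out to everyone.
    """
    expected = expected_active_users(historical_counts)
    if expected > logged_in_now * ratio:
        return "everyone"
    return "logged_in_users"


# Monday 8:00 to 8:03AM: 500 logged in, but past Mondays reached about 10,500.
monday = notification_audience(500, [10200, 10500, 10800])

# Saturday 12AM: 500 logged in, and past Saturdays look about the same.
saturday = notification_audience(500, [480, 510, 530])
```

In this sketch the Monday case comes back as "everyone" and the Saturday case as "logged_in_users", which matches the reasoning above.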

If it’s a prolonged outage – and you don’t know when it’ll be back up or you do but it will be a while (whatever a while is) – reach out to everyone and perhaps point them to a status blog or Twitter account where they can get updates on the outage.
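If it helps to see the whole line of thinking in one place, here’s a small Python sketch of the triage. The Outage record, its four yes/no fields, and the who_to_notify function are hypothetical and only meant to get you thinking; real situations won’t reduce this neatly to a handful of flags.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    NOBODY = "nobody"
    AFFECTED_USERS = "affected users"
    EVERYONE = "everyone"


@dataclass(frozen=True)
class Outage:
    """Hypothetical summary of an incident; the fields are illustrative."""
    customer_facing: bool  # did customers see or feel it?
    blocked_access: bool   # could people not log in or reach the app at all?
    peak_period: bool      # was it a time when many users normally arrive?
    prolonged: bool        # still down, or no known recovery time


def who_to_notify(outage: Outage) -> Audience:
    """Map an incident summary to the audience you should reach out to."""
    if not outage.customer_facing:
        # A backend bounce nobody noticed might be okay to not talk about.
        return Audience.NOBODY
    if outage.prolonged:
        # Reach out to everyone and point them to a status blog or feed.
        return Audience.EVERYONE
    if outage.blocked_access and outage.peak_period:
        # Many more people than were logged in probably tried to get in.
        return Audience.EVERYONE
    # A short blip: notify those who were using the system at the time.
    return Audience.AFFECTED_USERS
```

The order of the checks is the design choice here: whether customers felt it comes first, then how long it lasted, and only then who happened to be logged in.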

Again, it is 100% unique to your situation, but hopefully that gets you thinking in the right direction.

I hope this helps a bit… even if it doesn’t directly answer the question.