Preparing for and learning from data loss with postmortems

Your first order of business on April 5, 2022? Run a script meant to delete a specific legacy app from your customers’ instances of your service. The script was peer-reviewed and cross-checked, so while your heart still races a bit knowing you’re dealing with production environments, you feel confident in what your team has built. You hit Enter.

Less than 10 minutes later, a customer sends in a support ticket claiming they can’t access any of their data, not just the specific legacy app you wanted to delete. As you realize your script deleted not just a single legacy app but nearly 900 instances of your service, representing 775 individual customers, this quickly becomes an all-hands-on-deck effort led by an incident management team.

A complete restoration takes days of 24/7 effort, but even then, the work isn’t done—your team still has to take accountability by writing a detailed postmortem.

If you’re working at an organization like Atlassian, which dealt with this exact issue in its infamous 2022 outage, that’s a task you can do with aplomb. Atlassian is well respected not only for developing some of the most sophisticated incident management processes in tech, but also for sharing much of what it knows with others.

You’re never ready for an incident or outage, especially those involving data loss, but are you even less ready to describe what actually happened?

What is a postmortem?

A postmortem is a structured document, produced after a significant incident or outage affecting customers or other end users, that analyzes what happened, why it happened, and how you’re going to prevent it from happening again.

That’s an essential step because too many talented people at sophisticated companies bias themselves into thinking a specific problem simply could not exist, which makes implementing a solution proactively seem like a waste of time.

Of this phenomenon, Dan Luu writes, “One of the things I find to be curious about these failure modes is that when I talked about what I found with other folks, at least one person told me that each process issue I found was obvious. But these ‘obvious’ things still cause a lot of failures. In one case, someone told me that what I was telling them was obvious at pretty much the same time their company was having a global outage of a multi-billion dollar service, caused by the exact thing we were talking about. Just because something is obvious doesn’t mean it’s being done.”

A postmortem is a “call to arms” for continuous improvement—a promise that someone has analyzed what went wrong and that through transparency and accountability, you can admit fault and strengthen your data protection measures. It’s a way to live up to your values, retain worried customers, and maybe even showcase some of your team’s problem-solving prowess.

… or it could simply be a regulatory requirement for your industry.

Examples of high-quality postmortems

Want to get a feel for the best-in-class postmortems to set as your goalposts before you prepare? Any of these should offer some inspiration on what to shoot for in the end:

Atlassian: Post-Incident Review on the Atlassian April 2022 outage (referenced in the introduction)
GitLab: Postmortem of database outage of January 31
Tarsnap: 2023-07-02 — 2023-07-03 Tarsnap outage post-mortem
Cloudflare: Thanksgiving 2023 security incident
Roblox: Roblox Return to Service 10/28-10/31 2021

Luu also maintains a long list of postmortems on GitHub. We can’t guarantee their quality, but you should find a few gems amongst the many unfortunate incidents.

How to best prepare for the inevitable

The postmortem happens only after an outage, security incident, or instance of data loss, but you shouldn’t start preparing in that chaotic moment—that’s when you must be actively fixing the issue or ensuring everyone is documenting their steps toward remediation for later. What can you do beforehand to set your future miserable self up for success?

Conduct mock incidents to generate familiarity with a stressful situation

Think of this as testing on a local development server rather than in production: you rehearse where mistakes are cheap. At Rewind, we routinely conduct tabletop exercises (TTX) to help stakeholders understand where our disaster recovery plans stand and where we could make improvements.

You could extend this exercise by having your team write postmortems on the TTX, detailing what they did in the fictional scenario to resolve the issue and mitigate damage so that everyone understands the expectations of their role.

Ensure your observability and logging tools are in tip-top shape

Without data, you’ll have real trouble on both sides of the postmortem. First, good luck with root cause analysis when you lack information about events that are often ephemeral or need only to strike once. Second, your postmortem will inevitably feel lacking if you can’t explain how you discovered the issue or worked out the proper fix.

One point of nuance here—if your availability relies heavily on third parties, like SaaS apps or cloud providers, you should take extra measures to monitor their health and performance. Observability platforms like Datadog and Splunk have features to help you repeatedly query their APIs or endpoints to know exactly when and in what fashion they might fail.
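To make that kind of check concrete, here is a minimal sketch of a probe that queries a vendor’s health endpoint and records when and how it failed. It only illustrates the idea and is not how Datadog or Splunk implement their checks. The `ProbeResult` type, the `probe` and `http_status` functions, and the example URL are all hypothetical names invented for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ProbeResult:
    checked_at: datetime   # when the probe ran (UTC)
    ok: bool               # True if the endpoint answered with HTTP 2xx
    latency_ms: float      # round-trip time in milliseconds
    detail: str            # status code or error message


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Run one health check and capture when and how it failed, if it did."""
    started = time.monotonic()
    checked_at = datetime.now(timezone.utc)
    try:
        status = fetch(url)
        ok = 200 <= status < 300
        detail = f"HTTP {status}"
    except Exception as err:  # timeouts, DNS failures, refused connections
        ok = False
        detail = f"{type(err).__name__}: {err}"
    latency_ms = (time.monotonic() - started) * 1000
    return ProbeResult(checked_at, ok, latency_ms, detail)
```

Running `probe("https://status.example-vendor.com/health")` on a schedule (the URL is a placeholder) and storing every `ProbeResult` would give you a timestamped record of a third party’s behavior to draw on during root cause analysis. The `fetch` parameter exists so the probe can be exercised in a mock incident without touching the network.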

Have a legitimate status page for customers

Search engine provider Kagi recently experienced nearly seven hours of downtime due to an unfortunate confluence of hardware updates and an external cyberattack. The chief complaint from those who responded on Hacker News? Kagi wasn’t transparent about what was happening.

User @muhammadusman wrote, “I was one of the users that went and reported this issue on Discord. I love Kagi but I was a bit disappointed to see that their status page showed everything was up and running. I think that made me a bit uneasy and it shows their status pages are not given priority during incidents that are affecting real users.”

Just remember that your status page becomes your first source of truth during outages. It can’t be comprehensive—leave the details for your public postmortem document—but it sets the scene for how transparent and accountable your company culture truly is.

Ensure you have proper backup and recovery tooling and procedures

The only way to make an incident or outage worse is to realize that it also deleted mission-critical data for your processes or your customers’ usage of your platform and you don’t have a clear path to restoration. That’s where a situation that might take a few hours to resolve becomes a grueling multi-day expedition—where customers are also breathing down your neck and threatening to migrate whatever data is left elsewhere.

For your SaaS data, there’s always Rewind!

Of course, your incident response or disaster recovery planning doesn’t stop there. We’ve already published plenty of resources on both here on the blog.

What can you do when you’re in the midst of a postmortem?

Now you’re familiar with what a postmortem is and how you can prepare your organization with the culture and technology to handle the stressful situation—but what about creating your postmortem analysis and document with confidence and clarity?

Halt other work to accommodate the process

Postmortems only work with a certain degree of freshness. If you let the analysis and writing slip for a few days, or even a week or more, the chances that you’ll be able to meaningfully piece together all the parts of your story fall dramatically.

Instead, your culture must recognize the importance of a postmortem, and your software development process must be able to survive a few days’ worth of quiet.

Refuse to apply blame

Almost all SRE and DevOps handbooks agree that the best postmortems acknowledge that availability is an all-hands-on-deck effort, not the responsibility of a single point person or team. According to Atlassian, this “blameless” style emphasizes “improving performance moving forward” by assuming the good intentions of everyone involved.

One big benefit of going blameless, aside from pushing your culture toward collaboration and continuous improvement, is that people are more likely to speak up to report an incident or ideate solutions rather than stay quiet due to fear of retribution.

Others disagree that blameless postmortems are a good goal or even possible. J. Paul Reed, a consultant who helps tech companies with release engineering, argues you should aim for “blame-aware” postmortems. He writes, “They collectively acknowledge the human tendency to blame, they allow for a productive form of its expression, and they constantly refocus the postmortem’s attention past it.”

Whichever path you take, continuously remind yourself and others that outages are a team sport.

Highlight good practices

On the flip side, you should be ready and willing to recognize your peers publicly if their quick thinking during the incident helped detect it sooner or mitigate its severity. Championing someone’s good work does far more for a culture of continuous improvement than even the most constructive criticism.

Screenshot and include relevant observability data

Following our “pre-postmortem” recommendation, you should have collected and stored robust observability data. That’s essential for root cause analysis, but it’s also handy for illustrating both the severity of the incident and how your team creatively searched for answers.

Roblox did a fantastic job of this, showing the comprehensiveness of their “normal” observability dashboard for a Consul cluster.

They then followed up with nuanced and specific visualizations that their engineers recognized as clues toward a proper remediation.

Emphasize specificity whenever possible, then illustrate what you can’t

Readers of your completed postmortem will want to know not just what your team saw, but when and how they responded. Whether over email, Slack, or a dedicated incident management platform, all actions your team took related to the incident are likely timestamped and stored away for future reference.

As you rebuild the sequence of events, bring forth as much of the detail from these stored events as possible. Screenshots of actual tickets or conversations are ideal, even if you need to do some redaction, but at the very least, be specific about times, intents, and incidents of collaboration.
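If those timestamped events live in several tools, one way to rebuild the sequence is to normalize every timestamp to UTC and merge the sources into a single chronological list. The sketch below assumes you have already exported events from each tool. The `Event` type, the source labels, and the function names are hypothetical, and timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime    # timezone-aware timestamp, normalized to UTC
    source: str     # e.g. "slack", "ticket", "pager" (hypothetical labels)
    summary: str


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a UTC offset and normalize it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {timestamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from every source into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: list[Event]) -> str:
    """Format the timeline as plain text, one line per event."""
    return "\n".join(
        f"{event.at:%Y-%m-%d %H:%M} UTC [{event.source}] {event.summary}"
        for event in timeline
    )
```

Rejecting timestamps without an offset is a deliberate choice here: a timeline that quietly mixes local times from different offices is worse than one with a visible gap you have to go back and fill.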

In its postmortem of a 14-day outage that left hundreds of customers without access to their cloud-based products, Atlassian’s team included very specific details about when actions occurred, who was responsible, and how information was propagated internally. They even created a comprehensive timeline to illustrate the process and demonstrate the best of both worlds.

List your proposed changes comprehensively

A postmortem isn’t complete without details about how you’re going to prevent the incident from happening again. You must illustrate that you’ve learned from your mistakes, as that’s the only way to rebuild trust with an inherently skeptical audience.

Break up your action items, starting with immediate changes to processes, systems, and tools. For example, if you will implement more robust and automated backups of the SaaS data your application depends on, highlight your new vendor(s) and explain how their systems will help you prevent an outcome like customer data loss in the future. Next, showcase long-term preventative measures—think cultural shifts, educational efforts, or migrations to new infrastructure that are necessary but can’t, due to complexity, be done tomorrow.

Assign ownership for action items

Piling a bunch of Jira tickets on unsuspecting engineers might feel like finger-pointing and punishment, but if no one is responsible for showing up, no one will. We’re all talking about tech here, not health emergencies or crimes, but the bystander effect plays a role in postmortems, too.

Specific people and teams need to be given assignments and held accountable to them at a later date—while also recognizing they will need extra time to complete their works in progress.
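A lightweight way to keep that accountability visible is to treat action items as data and regularly check for ones that have no owner, no follow-up date, or a date that has passed. The following is only a sketch under assumed conventions. The `ActionItem` fields and the `needs_attention` function are hypothetical and not tied to Jira or any other tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]   # a person or team; None means nobody owns it yet
    due: Optional[date]    # agreed follow-up date; None means not yet scheduled
    done: bool = False


def needs_attention(items: list[ActionItem], today: date) -> list[str]:
    """Return one line per open item, reporting the first problem found:
    no owner, then no due date, then overdue."""
    problems = []
    for item in items:
        if item.done:
            continue
        if item.owner is None:
            problems.append(f"{item.title}: no owner")
        elif item.due is None:
            problems.append(f"{item.title}: no due date")
        elif item.due < today:
            problems.append(f"{item.title}: overdue, owned by {item.owner}")
    return problems
```

Reviewing the output of a check like this in a recurring meeting keeps the follow-up about the work rather than the person, and an empty list is an easy signal that the postmortem’s promises are on track.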

GitLab did an incredible job with ownership in the aftermath of their infamous data loss incident. They have been a famously transparent company for years, but they also created a meta issue for all other tasks and projects related to improving their infrastructure to prevent further data loss issues, allowing anyone—both internal engineers and customers—to track progress or even jump in.

Communicate the postmortem findings to stakeholders

Aim for public disclosure of your postmortem, as that’s the best way to reach all your current and future customers with transparency. Depending on the nature of the incident, like a zero-day cyberattack, you may need two “editions” of your postmortem: one for public disclosure with crucial information removed, and one for customers or regulators who will want every detail.

A postmortem isn’t the place for product marketing-esque messaging, but you do need to think carefully about your language at all times—to prevent any last-minute miscommunications, don’t:

Minimize the impact of an outage or data loss incident by reiterating that only “3-5% of our customers” were affected—that doesn’t reduce the perceived severity of your incident, and devalues customers who were affected.
Chalk it all up to coincidence, as the Kagi folks did. Incidents might feel like they only happened because of a Final Destination-esque series of unfortunate events, but if you ask “why” at least five times, you’ll almost certainly come to a different conclusion.
Shrug off blame to vendors, dependencies, or third parties—you still chose to rely on them, so you’re still responsible. If a vendor experiences an outage and your system falls like the next domino in line, that incident also becomes yours.

As one Hacker News user said in response to a Cloudflare postmortem, “Interesting choice to spend the bulk of the article publicly shifting blame to a vendor by name and speculating on their root cause. … I understand and support explaining what triggered the event and giving a bit of context, but the focus on your postmortem needs to be on your incident, not your vendor’s.”

What’s next?

Postmortems are a necessary part of an inherently painful process.

But every moment, from the panic that sets in after seeing stable metrics crash through to the debugging and remediation process and all the way through your analysis of what went wrong, offers invaluable opportunities for growth, learning, and strengthening your team’s resilience.

As the dust settles on your infrastructure, a well-executed postmortem becomes a new source of pride—things don’t always go to plan, but your team has the talent and gusto to solve problems and get operations back on track. If you’re looking for positive outcomes of an outage, you can’t do much better than reassuring customers that you take incidents seriously and are continuously working on both technology and teamwork to prevent them from happening again.

In their postmortem, Atlassian’s team writes, “As we move forward from this incident and re-evaluate our internal processes, we want to recognize that people don’t cause incidents. Rather, systems allow for mistakes to be made.”

How you improve those systems says a great deal about your team’s character. You’re not just recovering from an incident, but building a trustworthy organization, which in turn, if you play your cards right, might even become a new competitive advantage.

Joel Hans

Joel Hans writes copy and marketing content that energizes startups with the technical and strategic storytelling they need to win developer trust. Learn more about how he helps clients like ngrok, CNCF, Rewind, and others at commitcopy.com.
