Edition 3 - The Playbook

LeadReads: Going beyond test coverage - The Playbook

Hi,

As promised, this edition gives you the playbook for increasing build stability and velocity. Collecting DORA metrics is a prerequisite for this. If you’re already using Jira, I’d recommend using Compass to automate the collection and reporting of key metrics. You’ll need additional automation for metrics like Change Failure Rate (CFR); just bake it into your release pipeline.

Before we get into the details, here’s a quick recap of the last article:

Alright, now let’s do a deep dive.

Identify the failure rate at each stage.

Identifying the failure rate at each stage, coupled with battle-tested metrics for ensuring smooth delivery, acts as an early disaster detection system. It doesn’t take a genius to figure out that a 40% failure rate in development will not lead to a successful production release.

Aim for these metrics.

  1. Dev Releases: Less than 10% failure rate

  2. Stable Staging Releases: Less than 5% failure rate

  3. Production Releases: Less than 1% failure rate
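To make these targets concrete, here is a minimal sketch of how you might compute the failure rate per stage and flag the stages that miss their target. The stage names, the `(stage, succeeded)` record shape, and the sample data are assumptions for illustration; in practice you would feed in whatever your release pipeline already records.

```python
from collections import Counter

# Targets from the list above: each stage should stay BELOW this failure rate.
TARGETS = {"dev": 0.10, "staging": 0.05, "production": 0.01}


def failure_rates(releases):
    """Return {stage: failed / total} for a list of (stage, succeeded) pairs."""
    total = Counter()
    failed = Counter()
    for stage, succeeded in releases:
        total[stage] += 1
        if not succeeded:
            failed[stage] += 1
    return {stage: failed[stage] / total[stage] for stage in total}


def stages_over_target(releases, targets=TARGETS):
    """Return the stages whose failure rate is not below its target."""
    rates = failure_rates(releases)
    return sorted(s for s, r in rates.items() if s in targets and r >= targets[s])


# Hypothetical sample: 10 dev releases (4 failed), 25 staging releases (1 failed).
sample = (
    [("dev", False)] * 4 + [("dev", True)] * 6
    + [("staging", False)] * 1 + [("staging", True)] * 24
)
print(failure_rates(sample))
print(stages_over_target(sample))
```

With this sample, dev sits at a 40% failure rate and gets flagged, while staging at 4% passes. That is the early warning you want before anything reaches production.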

Deriving Insights from Failure

A failure in your system is the worst thing that can happen. Everything that you’ve done until that point has been to ensure that the users are able to meet their objectives. A system failure prevents them from meeting that objective, i.e., the business failed to achieve its objective. Dramatic, isn’t it? It should be. Failures need to be treated as such. It can’t just be another bug in your sprint that will be tackled next.

Root-causing issues, categorising them in buckets, and documenting them to be later referenced and studied have served us very well in the past, and I’m sharing my secret sauce with you to ease your journey.

  • Root Cause Analysis Document: An RCA document is the equivalent of a doctor’s diagnosis. It helps identify the underlying causes of failures, providing a path to permanent resolution rather than temporary fixes. It fosters learning from mistakes and taking preventive measures for the future. Attached is a sample RCA document for your reference.

  • Categorising Failures with 'Buckets': Assigning failure categories, or 'buckets', streamlines diagnosis and resolution. They're like labeled drawers where we put specific types of issues. When a problem arises, we know exactly which 'drawer' to open, saving time and effort. Tags like scale, regression, integration, security, and manual intervention serve as good starting points for any product.
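Once RCAs are tagged, a simple tally shows which bucket is hurting you most. Here is a minimal sketch, assuming each RCA is stored as a record with a `buckets` list; the record shape and the sample titles are hypothetical, and only the tag names come from the list above.

```python
from collections import Counter

# Starting tags suggested above; extend the set for your product.
BUCKETS = {"scale", "regression", "integration", "security", "manual intervention"}


def bucket_counts(rcas):
    """Count RCAs per bucket, most frequent first. Unknown tags raise an error."""
    counts = Counter()
    for rca in rcas:
        for tag in rca["buckets"]:
            if tag not in BUCKETS:
                raise ValueError(f"unknown bucket: {tag!r}")
            counts[tag] += 1
    return counts.most_common()


# Hypothetical RCA records for illustration.
rcas = [
    {"title": "Checkout timeout under load", "buckets": ["scale"]},
    {"title": "Login broken after refactor", "buckets": ["regression"]},
    {"title": "Payment webhook mismatch", "buckets": ["integration", "regression"]},
]
print(bucket_counts(rcas))
```

Rejecting unknown tags keeps the buckets from drifting into dozens of one-off labels, which would defeat the purpose of the drawers.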

💡 Use Notion for the documentation, since its search covers both page titles and page content.

Solving for the Class of Issues

RCAs and failure buckets, if not coupled with action, will lead to blame games. Use the learnings from the failure buckets and the RCA to zero in on the right automation technique to prevent this class of issue. Sticking to a single kind of automation (for example, unit testing) and expecting all of your problems to be solved isn’t practical.

Automation is not limited to your testing strategy. It includes everything from setting up pre-commit hooks on developers’ machines to ensuring that you have visibility into your systems, coupled with the ability to deploy on demand so that the Mean Time to Recovery stays under an hour.
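To check yourself against the one-hour goal, Mean Time to Recovery can be computed as the average gap between when a failure was detected and when it was resolved. Here is a minimal sketch, assuming you can export incidents as `(detected, resolved)` timestamp pairs; the timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: one recovered in 30 minutes, one in 70 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 10)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr, mttr < timedelta(hours=1))
```

Here the average works out to 50 minutes, which is inside the target even though one incident took longer than an hour. It is worth tracking the worst case alongside the mean.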

💡 Use a well-rounded automation strategy for better release stability.

If you need help setting up your automation or you’d like to see it in action, you know where to find me.

Deliver features 50% faster using Feature Flags.

Progressively release features to a small subset of users.

A/B test different feature strategies with users.

Time feature releases to coincide with marketing campaigns, without any code changes.

Reply to this email today to find out how!