Unified Backup Strategy: Why Immutability Is the Cheap Part

Across the postmortems I have collected over the years, one pattern shows up again and again: the backups existed, the dashboards were green, and the data we pulled back was already encrypted. The same shape, different companies. Snapshots present, retention honored, restore "successful," and the restored bytes carrying the attacker's payload because the breach predated the recovery point we trusted. I logged enough of these to stop treating "the backup ran" as a signal of anything. The Motability Operations rebuild on AWS is interesting to me precisely because it answers the question every one of those postmortems exposed: can you prove the copy you restore is clean before you mount it.

Motability Operations runs the UK's Motability Scheme for roughly 900,000 disabled customers. As their AWS footprint grew across multiple accounts and multiple AWS Organizations, they could not guarantee recovery from a verified, untampered state during a cyberattack. Their existing setup had no way to isolate recovery points from production, no consistent policy across managed and customer-managed accounts, and no clean answer for Financial Conduct Authority auditors.

So they rebuilt the whole thing on AWS Backup. The interesting part of their story is not that they chose immutability. It is the order they did things in, and that order is the argument I want to make here.

The reflex when people read "ransomware-resilient backup" is to fixate on the lock: Vault Lock, compliance mode, write-once-read-many, done. I think that reflex is backwards. Immutability is the cheapest, most mechanical step in a unified backup strategy. The expensive, load-bearing work is isolation and proof of recovery. Get those wrong and a locked vault just guarantees you keep a pristine copy of data you can never trust.

A Locked Vault Without Isolation Is a Locked Box of Maybe

Compliance mode on an AWS Backup vault does exactly one thing well: it stops deletion. Once the grace period closes, no one, including root, including AWS itself, can shorten retention or remove a recovery point inside its window. That is real and worth having. But deletion resistance answers only half the threat. The other half is whether the bytes inside the vault are clean, and a lock says nothing about that.
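To make the mechanics concrete, here is a minimal sketch of how the two lock modes differ at the API level, assuming boto3's put_backup_vault_lock_configuration call. The vault names and retention windows are hypothetical and not taken from Motability's setup. The builder is plain Python, so it can be inspected without AWS credentials; only apply_vault_lock talks to AWS.

```python
from typing import Optional


def vault_lock_request(
    vault_name: str,
    min_retention_days: int,
    max_retention_days: int,
    changeable_for_days: Optional[int] = None,
) -> dict:
    """Build keyword arguments for put_backup_vault_lock_configuration.

    Omitting changeable_for_days leaves the lock in governance mode.
    Supplying it starts a grace period; once that expires the lock is in
    compliance mode and can no longer be changed or removed.
    """
    if min_retention_days > max_retention_days:
        raise ValueError("min retention cannot exceed max retention")
    request = {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
    }
    if changeable_for_days is not None:
        if changeable_for_days < 3:
            raise ValueError("the grace period must be at least 3 days")
        request["ChangeableForDays"] = changeable_for_days
    return request


def apply_vault_lock(request: dict) -> None:
    """Send the request with boto3 (needs AWS credentials; not run here)."""
    import boto3  # imported lazily so the builder has no dependencies

    boto3.client("backup").put_backup_vault_lock_configuration(**request)


# Hypothetical vault names and retention windows, for illustration only.
governance = vault_lock_request("ops-vault", 7, 35)
compliance = vault_lock_request("locked-vault", 30, 365, changeable_for_days=3)
print(governance)
print(compliance)
```

Notice what the request does not contain: nothing in it describes the contents of the vault. The lock is a retention setting and nothing more, which is the whole argument of this section.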

This is why Motability's two-tiered design matters more than the lock that sits on top of it. Tier 1 lives in each workload account: AWS-managed keys, governance mode, fast operational restores that the team can manage with normal IAM privileges. Tier 2 lives in a separate central backup account as a logically air-gapped vault, locked in compliance mode, with customer-managed keys at one CMK per platform. Workload backups are copied up into Tier 2; a breach in a workload account cannot reach across to alter or delete the central copy because the account boundary plus the lock removes the path. The separation, not the cryptography, is what survives a full account compromise.

The blast-radius math is what sells it. In a flat design, one stolen admin credential reaches every recovery point. In this design, the same credential reaches Tier 1 and stops at the account wall. That is the difference between a bad week and an unrecoverable quarter.

| Tier | Lives in | Lock mode | Keys | Its job |
| --- | --- | --- | --- | --- |
| Tier 1 | Workload account | Governance | AWS-managed | Fast day-to-day restores |
| Tier 2 | Central backup account | Compliance | CMK, one per platform | Last-resort clean copy after compromise |

There is a real cost to this rigidity, and I would rather name it than pretend it away. Compliance mode means you cannot shorten retention to save money once the lock takes effect, and you cannot quietly delete a misconfigured plan. If you seal a Tier 2 vault with the wrong retention or a botched key policy, you live with it for the full window. That is the deliberate trade: you give up operational flexibility in Tier 2 precisely so an attacker gives up theirs too.

Five Accounts Exist So a Breach Cannot Travel

Motability's architecture splits the work across five account types. The split is a containment design, and reading it as bureaucratic tidiness misses the point entirely.

The delegated management account defines backup policy centrally and hosts the dashboard that shows job health across the estate. Workload accounts hold Tier 1 vaults for quick restores. The central backup account holds the air-gapped Tier 2 vaults. A separate forensic account verifies integrity, with the Tier 2 vault shared to it over AWS Resource Access Manager. And a recovery account is stood up on demand during an incident, where the last clean backup is restored within the team's recovery-time objectives.

The point of five boundaries is least privilege with no overlap. Workload teams reach only their own workload and recovery accounts. Backup administrators alone touch the central account. The forensic account reads copies but cannot write production. So a credential stolen anywhere in this topology unlocks one room and gets stuck at the door. I built the flat version of this once, where the backup role had reach into everything "for convenience," and convenience is exactly the word an attacker uses too.
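The containment claim is easy to check on paper. Below is a toy reachability model, a sketch only: the role names are hypothetical, and the access map is my reading of the split described above, not Motability's actual IAM. It asks one question: if a role is stolen, which accounts holding recovery points can it touch?

```python
# Account types from the five-account split described above.
ALL_ACCOUNTS = {
    "delegated-management",
    "workload",
    "central-backup",
    "forensic",
    "recovery",
}

# Account types that hold recovery points an attacker would want to destroy.
VAULT_ACCOUNTS = {"workload", "central-backup"}

# Hypothetical roles mapped to the account types they can act in.
TIERED_ACCESS = {
    "workload-team": {"workload", "recovery"},
    "backup-admin": {"central-backup"},
    "forensic-analyst": {"forensic"},
    "policy-admin": {"delegated-management"},
}

# The "convenient" flat design: every role reaches every account.
FLAT_ACCESS = {role: set(ALL_ACCOUNTS) for role in TIERED_ACCESS}


def blast_radius(access: dict, stolen_role: str) -> list:
    """Vault-holding accounts reachable with one stolen role."""
    return sorted(access[stolen_role] & VAULT_ACCOUNTS)


def roles_spanning_both_tiers(access: dict) -> list:
    """Roles that can reach Tier 1 and Tier 2 at once (should be empty)."""
    return sorted(r for r, accts in access.items() if VAULT_ACCOUNTS <= accts)


print(blast_radius(FLAT_ACCESS, "workload-team"))
print(blast_radius(TIERED_ACCESS, "workload-team"))
print(roles_spanning_both_tiers(TIERED_ACCESS))
```

A check like roles_spanning_both_tiers is worth running against a real access inventory on a schedule, because "for convenience" grants tend to accumulate one exception at a time.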

One operational edge worth flagging from the broader AWS Backup behavior: backups run one concurrent job per resource, and additional requests queue. During a mass-recovery event that queuing is real friction, and it argues for testing failover under load rather than assuming the architecture diagram scales linearly.

Verified Recovery Is the Metric, Not Recovery Time

Here is where I will plant my flag against a lot of the disaster-recovery orthodoxy I grew up on. For most of my career the headline number was RTO: how fast can you bring it back. Speed still matters, but speed restoring the wrong data is worse than slowness, because it reinfects the environment you just rebuilt. The metric that actually protects you is whether you can prove a given recovery point is clean before you mount it.

Motability wires this in as an event-driven loop. When a backup copy lands in the central account, EventBridge triggers AWS Backup restore testing into the forensic account, and Elastio, an APN partner, analyzes the restored data for tampering and ransomware artifacts. CloudWatch carries the monitoring and retry logging. The output is not "the file exists." It is "this recovery point restored and passed inspection," which is the only assurance worth carrying into an incident.
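The trigger half of that loop can be sketched as an EventBridge rule pattern plus a small Lambda-style handler. This is an assumption-laden illustration, not Motability's code: the vault ARN is hypothetical, the detail field names (state, destinationBackupVaultArn, destinationRecoveryPointArn) are what I expect on an AWS Backup "Copy Job State Change" event and should be checked against a real event, and start_verification is a hypothetical hook standing in for whatever starts the restore test and scan.

```python
from typing import Callable, Optional

# Hypothetical ARN for the central Tier 2 vault.
CENTRAL_VAULT_ARN = "arn:aws:backup:eu-west-2:111122223333:backup-vault:central-tier2"

# Assumed rule pattern: completed copy jobs that land in the central vault.
EVENT_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Copy Job State Change"],
    "detail": {
        "state": ["COMPLETED"],
        "destinationBackupVaultArn": [CENTRAL_VAULT_ARN],
    },
}


def recovery_point_to_verify(event: dict) -> Optional[str]:
    """Return the new recovery point ARN, or None if the event is not ours."""
    if event.get("source") != "aws.backup":
        return None
    if event.get("detail-type") != "Copy Job State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("state") != "COMPLETED":
        return None
    if detail.get("destinationBackupVaultArn") != CENTRAL_VAULT_ARN:
        return None
    return detail.get("destinationRecoveryPointArn")


def make_handler(start_verification: Callable[[str], None]):
    """Wrap a verification hook (hypothetical) in a Lambda-shaped handler."""

    def handler(event: dict, context: object = None) -> dict:
        arn = recovery_point_to_verify(event)
        if arn is None:
            return {"action": "ignored"}
        start_verification(arn)
        return {"action": "verification-started", "recoveryPointArn": arn}

    return handler


SAMPLE_EVENT = {
    "source": "aws.backup",
    "detail-type": "Copy Job State Change",
    "detail": {
        "state": "COMPLETED",
        "destinationBackupVaultArn": CENTRAL_VAULT_ARN,
        "destinationRecoveryPointArn": "arn:aws:backup:eu-west-2:111122223333:recovery-point:example",
    },
}

started = []
handler = make_handler(started.append)
print(handler(SAMPLE_EVENT))
```

The handler re-checks the same conditions as the rule pattern on purpose. If someone later loosens the rule, a failed or misrouted copy still cannot start a verification run.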

This reflects a genuine 2026 shift in how the field talks about recovery: from raw speed toward verified outcomes, where a successful restore test counts for more than a low RTO on a slide. I would push it further. A recovery point you have not test-restored is an untested assumption, and untested assumptions are where 100-day recovery cycles come from. The work to validate is not free, computationally or operationally, but the alternative is discovering your assumption was wrong during the incident itself.

If you want a way to interrogate your own backup story, run it through this table. Each row is something to check, what a passing answer looks like, and why a wrong answer changes the call.

| What to check | A good answer | Why it changes the call |
| --- | --- | --- |
| When restore tests fire | On every new recovery point, triggered as it lands | A schedule that trails your backups leaves a window of unverified points you might grab in a panic |
| Where integrity and ransomware scanning runs | An isolated account, never production | Scanning in production risks detonating the artifact you are hunting for |
| What happens to a recovery point that fails | Blocked or quarantined so it cannot be selected | An unflagged bad point is exactly what a stressed operator picks during an outage |
| How you find the last clean copy | A recorded lookup of the last proven-clean point | If it is an archaeology project, you burn recovery hours reconstructing trust under pressure |
| How often you rehearse the full path | On a regular cadence, before any real incident | The first real test of restore-into-isolated-account should not coincide with the breach |
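The third and fourth rows of that table come down to bookkeeping, and a minimal in-memory sketch shows the shape. The class and method names are mine and purely illustrative; a real ledger would live in a durable store inside the forensic or central account. The idea is that a recovery point is selectable only if it has a recorded pass, and the last clean point is a lookup rather than an investigation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class VerificationLedger:
    # Recovery point ID -> (creation time, passed inspection?)
    results: Dict[str, Tuple[datetime, bool]] = field(default_factory=dict)

    def record(self, point_id: str, created_at: datetime, passed: bool) -> None:
        """Store the outcome of a restore test and scan."""
        self.results[point_id] = (created_at, passed)

    def is_selectable(self, point_id: str) -> bool:
        """Only points with a recorded pass may be restored from."""
        entry = self.results.get(point_id)
        return entry is not None and entry[1]

    def last_clean(self) -> Optional[str]:
        """Newest recovery point that passed inspection, if any."""
        clean = [
            (created, point_id)
            for point_id, (created, passed) in self.results.items()
            if passed
        ]
        return max(clean)[1] if clean else None

    def unverified(self, known_points: List[str]) -> List[str]:
        """Points that exist in the vault but have no recorded result."""
        return [p for p in known_points if p not in self.results]


ledger = VerificationLedger()
ledger.record("rp-mon", datetime(2025, 1, 6), True)
ledger.record("rp-tue", datetime(2025, 1, 7), True)
ledger.record("rp-wed", datetime(2025, 1, 8), False)  # failed the scan

print(ledger.last_clean())  # rp-tue
print(ledger.is_selectable("rp-wed"))  # False
print(ledger.unverified(["rp-mon", "rp-tue", "rp-wed", "rp-thu"]))  # ['rp-thu']
```

Note that an unverified point and a failed point are treated the same way by is_selectable. During an outage, "we have not checked" should block a restore exactly as firmly as "we checked and it failed."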

Tag-Based Policy Has One Failure Mode You Must Engineer Around

The piece that makes a unified strategy scale is tag-based enforcement: the delegated account pushes backup plans across the organization, and any resource carrying the mandatory tags inherits protection automatically. No per-account hand-configuration, no drift between teams. This is the right pattern, and Motability credits it with cutting the manual effort of onboarding new workloads.

But automation by tag has a quiet hole, and pretending otherwise is how teams get burned. An untagged resource is an unprotected resource. It sits outside the policy and outside the audit scope until something notices it, which during fast deployment can be never. The fix is to treat the tag as a hard deployment gate rather than post-provisioning cleanup. Enforce it where infrastructure is defined, reject untagged resources at creation through service control policies, and make "no backup tag" fail the build the same way a missing required field would. The compliance dashboard is then telling you the truth, because there is nothing silently outside it.
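Here is a sketch of both halves of that gate, under stated assumptions. The tag keys are hypothetical, and the resource dictionaries stand in for whatever your infrastructure-as-code tool emits when you export a plan. The SCP builder uses the Null condition on aws:RequestTag, a common pattern for denying creation without a tag, but whether a given action honors request tags varies by service, so treat the output as a starting point to verify rather than a finished policy.

```python
from typing import Dict, Iterable, List

# Hypothetical mandatory tag keys that backup plans select on.
REQUIRED_TAGS = ("backup-tier", "platform")


def missing_backup_tags(resources: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each offending resource address to the tags it lacks."""
    problems: Dict[str, List[str]] = {}
    for resource in resources:
        tags = resource.get("tags") or {}
        missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
        if missing:
            problems[resource["address"]] = missing
    return problems


def gate(resources: Iterable[dict]) -> int:
    """Return a process exit code: non-zero fails the build."""
    problems = missing_backup_tags(resources)
    for address, missing in problems.items():
        print(f"FAIL {address}: missing tags {', '.join(missing)}")
    return 1 if problems else 0


def deny_untagged_create_scp(action: str, resource_arn: str, tag_key: str) -> dict:
    """Sketch of an SCP that denies a create call lacking the tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateWithoutBackupTag",
                "Effect": "Deny",
                "Action": action,
                "Resource": resource_arn,
                # Null == "true" matches requests where the tag is absent.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


# Hypothetical planned resources, for illustration only.
planned = [
    {"address": "db.orders", "tags": {"backup-tier": "gold", "platform": "payments"}},
    {"address": "volume.scratch", "tags": {"platform": "payments"}},
    {"address": "db.legacy", "tags": None},
]

exit_code = gate(planned)
policy = deny_untagged_create_scp(
    "ec2:CreateVolume", "arn:aws:ec2:*:*:volume/*", "backup-tier"
)
```

The build gate catches the mistake early and cheaply. The SCP is the backstop for anything created outside the pipeline. I would want both, because each one covers the other's blind spot.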

About

I am Alex Kumar, a Senior Platform Engineer and infrastructure architect at Rabata.io, an S3-compatible object storage provider built for AI/ML teams and cost-conscious enterprises. I work from Toronto. Most of what I know about backups came from carrying a pager: running storage and disaster recovery at scale long enough to have lost data, recovered it, and a few times failed to recover it at all.

The rule I repeat in every architecture review is short. You never assume backups work; you test the restore. Everything here is the same advice I give my own teams, which is why I keep returning to verification over speed and isolation over locks. My writing tends to land on Kubernetes storage, DR strategy, and making infrastructure cost legible, because the things that page you at 3 a.m. are almost always one of those three.

Conclusion

Strip Motability's rebuild down to its thesis and it comes out plain: they did not buy ransomware resilience by turning on a lock. They earned it by isolating recovery points across account boundaries, by proving every backup restores clean before they would ever need it, and by closing the tagging gap so nothing slips outside the policy.

That is the position I have defended throughout. The lock is the easy 10 percent. The isolation and the verification are the 90 percent that decides whether you recover, and most teams underspend on them because they do not demo as cleanly as "immutable." If you run multi-account cloud and your backup story stops at retention rules, you have built the cheap part and skipped the expensive one.

So the test is the one Motability could not pass at the outset: can you prove, today, that your last backup would restore clean. If you cannot, that is the work in front of you, and no amount of compliance-mode locking stands in for it.

Frequently Asked Questions

Is a vault locked in compliance mode enough to protect backups from ransomware?

No. Compliance mode stops deletion, even by root, but it says nothing about whether the bytes inside are clean. You also need account isolation so a breach cannot reach the copy and forensic restore testing so you know the copy is uninfected. The lock is necessary but it is the cheapest layer, not the whole defense.

Why use two backup tiers instead of one locked vault?

The two tiers separate jobs that conflict. Tier 1 in each workload account stays in governance mode for fast, flexible day-to-day restores. Tier 2 in a separate central account is air-gapped and locked in compliance mode as the last clean copy. The account boundary means a compromised workload credential cannot reach or alter the central tier.

Which should come first, fast recovery or verified recovery?

Verified recovery first. Restoring fast from a compromised recovery point reinfects the environment you just rebuilt, which is worse than a slower clean restore. Wire automated restore testing and integrity scanning into the backup loop so you always know your last proven-clean point, then optimize speed on top of that foundation.

What is the main failure mode of tag-based backup policy?

Untagged resources. Automation only protects what carries the mandatory tags, so anything deployed without them sits outside both protection and audit scope, often silently. Enforce the tag at creation through service control policies and infrastructure-as-code so an untagged resource fails the build rather than slipping into production unprotected.

Why split backup functions across five account types?

To contain blast radius. Splitting delegated management, workload, central backup, forensic, and recovery functions into separate accounts means a stolen credential unlocks one role, not the entire estate. Backup administrators alone reach the central vault, workload teams reach only their own accounts, and the forensic account can read copies without touching production.