GuardDuty Backup Quarantine: Tag the Recovery Point, Not the Runbook
On paper this reads like a routine feature drop: AWS bolts malware scanning onto Backup, ships a delegated-admin console, calls it ransomware resilience. Easy to skim past as one more checkbox in the security console. The stakes underneath it are not routine at all, and the reason has nothing to do with the scanner.
Here is the failure mode Amazon's storage team set out to close in their June 2026 walkthrough on orchestrating automated response for GuardDuty Malware Protection for AWS Backup. The premise is uncomfortable and correct: once ransomware is in your environment, your backups stop being recovery points. They become, in the source's exact phrasing, "artifacts of the compromise." Restoring one does not recover the environment; it re-establishes the attacker's foothold and reverses every containment step you took.
I have lived this failure. A staging restore once pulled a three-week-old recovery point, brought up the database, and the host immediately started beaconing to an address our threat feed already knew. The snapshot was clean the day it was taken; the malware arrived later, sat dormant, and rode every nightly job into the vault. We had backed up the infection for nineteen days and called it a recovery plan.
GuardDuty Malware Protection for AWS Backup, generally available since November 2025, scans recovery points as part of the backup lifecycle. Detection has never been the hard part of this problem. The hard part is stopping a human under pressure from restoring a flagged backup.
I want to argue something the AWS post implies but does not state plainly: the entire value of this architecture lives in one tag and one deny statement. Everything else is plumbing. Get the tag wrong, get the policy condition wrong, and you have built an expensive logging pipeline that quarantines nothing.
Detection without enforcement is a better-labeled log entry
The source uses that line, "merely a better-labeled log entry," and it lands hardest of anything in the document. In a clean tabletop exercise, with time to think, the gap between "GuardDuty found malware" and "nobody restored the bad backup" is closeable by a runbook. In a real incident, with three teams still scoping blast radius and someone senior asking how fast we can be back up, it is not. Somebody restores the most recent backup because it is the most recent, and the timeline notes the discovery only afterward.
So the design removes the human decision entirely. When GuardDuty completes a scan and finds threats, an EventBridge rule matches the `THREATS_FOUND` event and invokes a Lambda function. That function writes an immutable marker onto the infected recovery point. An organization-level service control policy then refuses any restore of a resource carrying that marker. The detection-to-prevention coupling is mechanical rather than procedural, and during an incident that is the only kind that holds.
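A minimal sketch of that tagging function, under stated assumptions: the `recoveryPointArn` field name is hypothetical (inspect a real event in your account before relying on it), and the function names are illustrative rather than taken from the source. The `tag_resource` call is the boto3 AWS Backup client method.

```python
"""Sketch of the tagging Lambda: mark an infected recovery point."""

TAG_KEY = "ScanResult"    # must match the SCP condition key exactly
TAG_VALUE = "INFECTED"

# ASSUMPTION: the event's detail block exposes the recovery point ARN under
# this field name. It is a placeholder, not a documented field.
ARN_FIELD = "recoveryPointArn"


def tag_infected_recovery_point(event, backup_client):
    """Tag the recovery point named in a THREATS_FOUND event.

    Returns the ARN that was tagged, or None when the event is not a
    completed scan that found threats.
    """
    detail = event.get("detail", {})
    if detail.get("scanResultStatus") != "THREATS_FOUND":
        return None
    if detail.get("state") != "COMPLETED":
        return None

    arn = detail[ARN_FIELD]  # a KeyError here is deliberate: fail loudly
    backup_client.tag_resource(ResourceArn=arn, Tags={TAG_KEY: TAG_VALUE})
    return arn


def handler(event, context):
    """Lambda entry point. boto3 is imported lazily so the logic above can
    be exercised locally with a fake client and no AWS credentials."""
    import boto3

    return tag_infected_recovery_point(event, boto3.client("backup"))
```

Passing the client in as an argument keeps the decision logic testable in isolation, which matters for a function whose silent failure leaves an infected backup restorable.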
This is also where I have to correct an earlier rewrite of this source, because the errors it introduced are not cosmetic. They would break the control on contact with production.
The tag and the policy: get these two strings exactly right
The marker the source applies is `ScanResult: INFECTED`, and both the sample Lambda code and the SCP condition key on it. An earlier rewrite of this material drifted the key to `ScanStatus`, and that single substitution is fatal: a policy that conditions on `aws:ResourceTag/ScanStatus` while Lambda writes `ScanResult` denies nothing, because the condition never matches a real tag. The restore sails through. The single most consequential detail here is that the tag key in your Lambda and the tag key in your SCP condition must be the same string, character for character.
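One way to catch that drift before deployment is a small check in CI. The following is a sketch, not part of the source: the function names are illustrative, and it assumes the policy is available as a JSON string whose conditions use `...:ResourceTag/<key>` condition keys.

```python
import json

LAMBDA_TAG_KEY = "ScanResult"  # the key the tagging function writes


def tag_keys_in_policy(policy):
    """Collect every tag key referenced by a ResourceTag condition."""
    keys = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        for operator_block in statement.get("Condition", {}).values():
            for condition_key in operator_block:
                _, sep, tag_key = condition_key.partition(":ResourceTag/")
                if sep:
                    keys.add(tag_key)
    return keys


def policy_matches_lambda(policy_json, lambda_tag_key=LAMBDA_TAG_KEY):
    """True only if the policy conditions on exactly the Lambda's tag key."""
    return tag_keys_in_policy(json.loads(policy_json)) == {lambda_tag_key}
```

A policy that conditions on `ScanStatus` fails this check immediately, instead of failing open during an incident.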
The second correction matters just as much. The "marker cannot be stripped" guarantee does not come from an IAM action. The source SCP denies tag removal with two statements: `backup:UntagResource` for AWS Backup recovery points and `ec2:DeleteTags` for the EC2 and EBS resources. There is no `iam:UntagResource` in this design. Naming it would leave the actual tag-removal paths wide open, because an attacker (or a well-meaning operator clearing a "false positive") could pull the marker and restore freely. The right move is to deny the service-specific untag actions; a generic identity action does not cover them.
Here is the policy surface as the source actually defines it. This is the table I would put in front of a reviewer before anyone attaches anything at the root.
| Denied action | Condition key | What it protects |
|---|---|---|
| `backup:StartRestoreJob` | `aws:ResourceTag/ScanResult` = INFECTED | Blocks restoring a flagged recovery point |
| `backup:UntagResource` | `aws:ResourceTag/ScanResult` = INFECTED | Stops the marker being pulled off a Backup recovery point |
| `ec2:DeleteTags` | `ec2:ResourceTag/ScanResult` = INFECTED | Stops the marker being pulled off EC2/EBS resources |
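The table above can be rendered as a policy document along these lines. This is a sketch built from the three rows only: the `Sid` values and the `"Resource": "*"` scope are my assumptions, and the source's actual SCP may differ in structure.

```python
import json

TAG_KEY = "ScanResult"
TAG_VALUE = "INFECTED"


def deny(sid, action, condition_key):
    """Build one Deny statement that applies only to INFECTED-tagged resources."""
    return {
        "Sid": sid,                      # hypothetical identifiers
        "Effect": "Deny",
        "Action": action,
        "Resource": "*",                 # assumption: no narrower scoping
        "Condition": {"StringEquals": {condition_key: TAG_VALUE}},
    }


def build_quarantine_scp():
    """Assemble the three denials listed in the table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            deny("DenyRestoreOfInfected", "backup:StartRestoreJob",
                 f"aws:ResourceTag/{TAG_KEY}"),
            deny("DenyBackupUntagOfInfected", "backup:UntagResource",
                 f"aws:ResourceTag/{TAG_KEY}"),
            deny("DenyEc2UntagOfInfected", "ec2:DeleteTags",
                 f"ec2:ResourceTag/{TAG_KEY}"),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_quarantine_scp(), indent=2))
```

Generating the document from a single `TAG_KEY` constant means the key cannot drift between statements.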
One limitation the source spells out, and I will repeat it: an SCP does not scan anything. It enforces a rule against tags that already exist. Without the Lambda upstream applying the marker, the policy denies nothing, because there is nothing to match. The tagging automation and the policy are a pair. Deploy one without the other and you have half a control that fails silently in the direction of "restore allowed."
A four-phase rollout, and the two phases people skip
AWS recommends a phased rollout for organizations managing hundreds of accounts, and the ordering is doing real work. Phase 1 is foundation: deploy the IAM roles to every member account via CloudFormation StackSets, enable GuardDuty Malware Protection in the delegated administrator account, configure Security Hub aggregation, and create the organizational backup policy and the SCP. The source notes Phase 1 has zero impact on existing backup workflows, since nothing is scanning yet and nothing changes. That is exactly why it is tempting to rush it, and exactly why you should not. Every silent failure later traces back to a role that was not present in some account when Lambda tried to tag.
Phase 2 pilots on Tier 0 workloads with test resources, using EICAR test files to simulate an infection, and validates the full loop end to end: backup created, scan run, finding generated, Lambda tags the recovery point, SCP blocks the restore. The EventBridge pattern matches `source: aws.backup`, `detail-type: Scan Job State Change`, with `scanResultStatus: THREATS_FOUND` and `state: COMPLETED`.
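Written out as an EventBridge pattern, those four fields might look like the dictionary below. Placing `scanResultStatus` and `state` under `detail` is my assumption, based on how EventBridge events are normally shaped. The `matches` helper is a deliberately simplified local stand-in for EventBridge matching, useful for checking sample events in a pilot; it handles exact values only.

```python
SCAN_THREATS_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Scan Job State Change"],
    "detail": {
        "scanResultStatus": ["THREATS_FOUND"],
        "state": ["COMPLETED"],
    },
}


def matches(pattern, event):
    """Simplified EventBridge-style matching: exact values only.

    A list in the pattern means "the event value must equal one of these";
    a dict means "descend into the nested object". Prefix, numeric, and
    other advanced operators are intentionally not handled.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Extra fields in the event are ignored, so a real event carrying more detail still matches as long as the four required values are present.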
My own first pilot was not clean, and the reason is worth your time. The Lambda role was missing the Backup tagging permission in two of the pilot accounts, so the finding generated, the function fired, and the tag never landed. The recovery point stayed restorable while the dashboard cheerfully showed a detection. A detection that does not result in a tag is the most dangerous state in this whole system, because it looks like success.
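One guard against that state is to read the tag back after writing it and fail loudly if it is absent. The sketch below assumes a boto3 AWS Backup client is passed in; `tag_resource` and `list_tags` are real client methods, while the function and exception names are mine. A missing permission would normally surface as an exception from `tag_resource` itself, so the read-back is a second line of defence for cases where that error is swallowed or goes unmonitored.

```python
TAG_KEY = "ScanResult"
TAG_VALUE = "INFECTED"


class TagNotAppliedError(RuntimeError):
    """Raised when the quarantine tag is not present after tagging."""


def read_tags(backup_client, arn):
    """Return all tags on a resource, following pagination tokens."""
    tags, token = {}, None
    while True:
        kwargs = {"ResourceArn": arn}
        if token:
            kwargs["NextToken"] = token
        page = backup_client.list_tags(**kwargs)
        tags.update(page.get("Tags", {}))
        token = page.get("NextToken")
        if not token:
            return tags


def tag_and_verify(backup_client, arn):
    """Apply the quarantine tag, then confirm it actually landed."""
    backup_client.tag_resource(ResourceArn=arn, Tags={TAG_KEY: TAG_VALUE})
    if read_tags(backup_client, arn).get(TAG_KEY) != TAG_VALUE:
        raise TagNotAppliedError(f"{TAG_KEY} tag missing on {arn}")
    return arn
```

Raising instead of returning a status lets the Lambda invocation fail visibly, so an alarm on function errors catches the account where the tag did not land.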
Phase 3 scales to the remaining accounts via StackSets and starts monitoring scan cost against actual change rates. Phase 4 matures the operation: integrate findings into posture dashboards, add "find the last clean backup" as a standard incident-response step, copy infected recovery points to an isolated forensics account, and replicate clean recovery points across Regions for disaster recovery. The pilot is the phase that earns the rest. Without it, your first real `THREATS_FOUND` event becomes your first integration test, and it lands during an active incident.
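The "find the last clean backup" step can be sketched as a filter over recovery points. The record shape here (`created`, `tags`) is hypothetical and stands in for whatever your inventory query returns. Note the limitation: absence of the `INFECTED` tag only means a recovery point was not flagged, so a stricter version would also require positive evidence of a completed clean scan.

```python
from datetime import datetime


def last_clean_recovery_point(recovery_points):
    """Return the newest recovery point not tagged ScanResult=INFECTED.

    Each item is a dict with a 'created' datetime and a 'tags' dict
    (hypothetical shape). Returns None when nothing qualifies.
    """
    candidates = [
        rp for rp in recovery_points
        if rp.get("tags", {}).get("ScanResult") != "INFECTED"
    ]
    return max(candidates, key=lambda rp: rp["created"], default=None)


if __name__ == "__main__":
    points = [
        {"id": "rp-1", "created": datetime(2026, 1, 1), "tags": {}},
        {"id": "rp-2", "created": datetime(2026, 1, 2),
         "tags": {"ScanResult": "INFECTED"}},
    ]
    print(last_clean_recovery_point(points)["id"])  # prints rp-1
```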
Scan economics: match frequency to the data's volatility
The cost story is real and I will not inflate it. In February 2025 AWS cut the price of GuardDuty Malware Protection for S3 by 85 percent. Read that figure precisely: it reduced the price, and it says nothing about how well the scanner catches malware. Post-reduction, S3 scanning runs about $0.09 per GB scanned plus $0.215 per 1,000 objects evaluated; the source's worked example puts 350 GB and 4,000 objects at roughly $32.36 a month.
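The worked example is easy to reproduce, which is useful when you want to plug in your own volumes. The two rates are the ones quoted above; the function name is mine.

```python
PRICE_PER_GB = 0.09              # USD per GB scanned
PRICE_PER_1000_OBJECTS = 0.215   # USD per 1,000 objects evaluated


def monthly_s3_scan_cost(gb_scanned, objects_evaluated):
    """Monthly S3 malware-scan cost in USD at the quoted rates."""
    data_cost = gb_scanned * PRICE_PER_GB
    object_cost = (objects_evaluated / 1000) * PRICE_PER_1000_OBJECTS
    return data_cost + object_cost


if __name__ == "__main__":
    # 350 GB -> $31.50, 4,000 objects -> $0.86
    print(f"${monthly_s3_scan_cost(350, 4000):.2f}")  # prints $32.36
```

At these volumes the per-GB charge dominates; the per-object charge only becomes significant for buckets with very large numbers of small objects.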
Separately, GuardDuty Malware Protection for EC2 includes a 30-day free trial for GuardDuty-initiated scans when it is enabled alongside the Foundational Threat Detection plan. Use the trial to validate scanning logic against real workloads before you commit to a frequency.
Cheaper scanning does not mean scan everything at full frequency. The source's tiering guidance is where storage people and security people actually have to negotiate, so it is the part worth internalizing. For Tier 0 systems such as databases, financial platforms, and core infrastructure, the tolerance for an infected restore is zero, so the source prescribes an incremental scan after every backup plus a monthly full scan.
AWS Backup captures changed blocks, and GuardDuty can scan only the new or changed blocks or objects, so the per-backup pass stays cheap. The monthly full scan is the safety net for dormant, signature-evasive malware that incremental passes miss. As an engineering call, constant monitoring earns its cost on high-change, high-stakes data, and it is largely wasted on static archives that change once a quarter. Match scan frequency to how volatile and how critical the data actually is.
Bind workloads to the right plan with tag-based assignment. A `BackupTier: 0` tag maps a resource to the Tier 0 plan automatically, so a new resource inherits the right cadence the moment it is created, with no manual plan edits.
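For a per-account backup plan, the tag-based assignment can be expressed as a backup selection. This sketch builds the `BackupSelection` structure that the boto3 `create_backup_selection` call accepts. The selection name and role ARN are placeholders, and an organization-level backup policy uses its own JSON syntax that is not shown here.

```python
def tier_selection(tier, iam_role_arn):
    """Build a BackupSelection that picks up resources tagged BackupTier=<tier>."""
    return {
        "SelectionName": f"tier-{tier}-by-tag",   # hypothetical name
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "BackupTier",
                "ConditionValue": str(tier),
            }
        ],
    }


# Usage, assuming `backup` is a boto3 Backup client and `plan_id` is the
# ID of the Tier 0 plan (both supplied by you):
#   backup.create_backup_selection(
#       BackupPlanId=plan_id,
#       BackupSelection=tier_selection(0, role_arn),
#   )
```

Because the selection matches on the tag and not on a resource list, a newly created resource tagged `BackupTier: 0` is picked up without editing the plan.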
About
I am Alex Kumar, a Senior Platform Engineer and Infrastructure Architect at Rabata.io, an S3-compatible object storage provider, working remotely from Toronto. Most of my day is Kubernetes storage and disaster recovery: StatefulSets and CSI drivers, Velero backups landing in object storage, and the quarterly drills where we actually fail over and measure the RTO instead of assuming it.
I run a postmortem-Monday habit, so my instincts on this topic come from real restores that went sideways rather than from a whiteboard. The lesson that stuck: a backup only counts as a recovery point once you can prove it is clean, and "we have backups" means nothing until you have rehearsed the restore against an infected snapshot. That is also why I trust deny statements more than documentation here. A policy holds whether or not the on-call human is calm, and during an incident you should assume they are not.
Conclusion
This architecture is genuinely good, and the reason it works is unglamorous: it converts an incident-time human decision into a deploy-time policy. GuardDuty finds the malware, Lambda writes `ScanResult: INFECTED`, and the SCP refuses the restore across every member account before anyone can override it under pressure.
The whole thing balances on details that do not survive paraphrase. The tag key has to match between Lambda and policy, the untag denials have to name `backup:UntagResource` and `ec2:DeleteTags`, and the tagging automation has to be deployed and permissioned in every account, or the policy guards an empty set. Build it in phases, prove the loop in the pilot with EICAR files, and tie your scan frequency to how often the data changes instead of scanning everything because it got cheaper.
Bottom line: if you remember nothing else, remember that the control is only as strong as the tag underneath it. Confirm the tag lands in every account, confirm both sides spell `ScanResult` identically, and you have a quarantine that does not depend on anyone making the right call at the worst possible moment.
Frequently Asked Questions
Which tag key must the SCP condition on?
It must condition on the same key the Lambda function writes, which the AWS source defines as `ScanResult`, used as `aws:ResourceTag/ScanResult` equals `INFECTED`. If the policy keys on a different string than Lambda applies, the condition never matches a real tag and the restore is never denied, so the entire quarantine control silently fails open.
Does the SCP scan backups for malware?
No. The SCP only enforces a deny rule against tags that already exist on a resource. The scanning and tagging are done by GuardDuty Malware Protection and the EventBridge-triggered Lambda function. Without that automation deployed and permissioned, the policy has nothing to match and blocks nothing, so the two pieces must always ship together.
How often should backups be scanned?
The AWS guidance is an incremental scan after every backup plus a monthly full scan for Tier 0 systems like databases and financial platforms, where tolerance for an infected restore is zero. Lower-tier or static data does not justify constant scanning, so match frequency to how often the data actually changes and how costly a bad restore would be.
What did the 85 percent figure from February 2025 actually reduce?
It was an 85 percent reduction in the price of GuardDuty Malware Protection for S3, not a reduction in malware or scan volume. After the cut, S3 scanning costs about $0.09 per GB and $0.215 per 1,000 objects, putting a 350 GB, 4,000 object workload at roughly $32.36 per month, which makes continuous scanning of large data lakes economically viable.
What can make the quarantine fail silently after a detection?
A missing IAM permission on the tagging Lambda in some member accounts. The finding generates and the function fires, but the tag never lands on the recovery point, so the SCP has nothing to deny and the infected backup stays restorable. Validate the full loop per account in the pilot phase, because a detection without a resulting tag looks like success on the dashboard.