Tagging Policy Enforcement with AWS Config

Untagged resources directly degrade FinOps maturity: they break cost allocation, obscure ownership, and invalidate budget forecasting. AWS Config is the continuous-evaluation mechanism that closes the gap left by pre-deployment checks — it watches every configuration item as it changes, evaluates it against a custom rule, and emits a compliance verdict you can route to automated remediation. This page covers the specific control plane: the config:PutEvaluations contract, the synchronous Lambda execution window, the IAM boundaries a custom rule needs, and the EventBridge path that turns a NON_COMPLIANT verdict into an idempotent tag fix. It is the real-time enforcement half of the broader resource tagging and validation pipelines architecture — the half that catches drift the moment it happens rather than on the next batch sweep.

Architecture Context & Data-Flow Position

This component sits at the validation stage of the four-stage pipeline (acquisition → normalization → allocation → persistence) defined in the parent resource tagging and validation pipelines reference. Infrastructure-as-Code scanners such as OPA, Checkov, and cfn-lint enforce tags before deployment, but they cannot account for manual console changes, third-party marketplace deployments, or legacy drift. AWS Config is the post-provisioning safety net: it acquires resource state from the AWS configuration recorder, normalizes it into a configuration item, and hands that to your rule for a strict verdict.

The evaluation cycle is deterministic:

resource state change → configuration recorder snapshot → Config rule invocation → Lambda evaluation → PutEvaluations verdict → EventBridge compliance-change event → asynchronous remediation.

The custom rule receives a synchronous payload and must answer within a hard window. The fields you actually parse out of that payload, and the verdict vocabulary you must answer with, are summarized below.

Config payload field	Meaning	Type	Constraint
`invokingEvent`	JSON string wrapping the configuration item	string	Must be `json.loads`-ed before use
`invokingEvent.configurationItem.resourceType`	CloudFormation-style type, e.g. `AWS::EC2::Instance`	string	Used to apply per-type exclusions
`invokingEvent.configurationItem.resourceId`	Provider resource identifier	string	The `ComplianceResourceId` you report on
`invokingEvent.configurationItem.tags`	Current tag map	object (nullable)	Can be empty during the propagation window
`invokingEvent.configurationItem.configurationItemStatus`	Lifecycle state, e.g. `ResourceDiscovered`, `OK`, `ResourceDeleted`	string	Drives `INSUFFICIENT_DATA` vs verdict
`resultToken`	Opaque token binding the verdict to this invocation	string	Required by `PutEvaluations`; expires
`ruleParameters`	JSON string of rule-level config	string	Where the required-tag set is injected

The verdict vocabulary is COMPLIANT, NON_COMPLIANT, NOT_APPLICABLE, and INSUFFICIENT_DATA. Only the first three can be submitted: PutEvaluations rejects INSUFFICIENT_DATA, so a rule produces that state by submitting no evaluation for the resource. Choosing the right outcome, especially deferring during the eventual-consistency window, is what separates a quiet, trustworthy pipeline from one that floods the remediation queue with false positives.

Core Implementation Patterns

1. Delivery channel and least-privilege IAM

Config writes evaluation history to an S3 delivery channel and optionally publishes change notifications to SNS. The bucket policy must restrict s3:PutObject to the config.amazonaws.com service principal and the remediation account, enable SSE-KMS, and carry lifecycle rules so the configuration-history objects do not become their own cost-allocation problem — exactly the kind of internal automation spend the governance boundaries in setting up FinOps governance boundaries in Terraform carve out of showback.
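The bucket-policy requirement above can be sketched as a small Python builder whose output could be passed to a put_bucket_policy call or rendered into IaC. This is a minimal sketch, not the definitive policy: the function name, bucket name, and account ID are placeholders, the statement layout follows the shape AWS documents for Config delivery, and the remediation-account grant, SSE-KMS, and lifecycle rules are assumed to be configured separately.

```python
from typing import Any, Dict


def config_delivery_bucket_policy(bucket: str, account_id: str) -> Dict[str, Any]:
    """Build a bucket policy that lets only AWS Config deliver objects.

    `bucket` and `account_id` are caller-supplied placeholders.
    """
    principal = {"Service": "config.amazonaws.com"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AWSConfigBucketPermissionsCheck",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetBucketAcl",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringEquals": {"AWS:SourceAccount": account_id}},
            },
            {
                "Sid": "AWSConfigBucketDelivery",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:PutObject",
                # Config only ever writes under AWSLogs/<account>/Config/.
                "Resource": f"arn:aws:s3:::{bucket}/AWSLogs/{account_id}/Config/*",
                "Condition": {
                    "StringEquals": {
                        "s3:x-amz-acl": "bucket-owner-full-control",
                        "AWS:SourceAccount": account_id,
                    }
                },
            },
        ],
    }
```

Scoping the PutObject resource to the AWSLogs prefix keeps the service principal from writing anywhere else in the bucket.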

The IAM role attached to the custom rule is the single most important blast-radius control. Grant only what the evaluation and remediation paths use, and where an action supports it, scope Resource or a condition to specific types and regions rather than leaving a bare *:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ConfigEvaluationWrite",
      "Effect": "Allow",
      "Action": ["config:PutEvaluations"],
      "Resource": "*"
    },
    {
      "Sid": "ReadResourceState",
      "Effect": "Allow",
      "Action": ["ec2:DescribeTags", "rds:ListTagsForResource", "tag:GetResources"],
      "Resource": "*"
    },
    {
      "Sid": "RemediationTagWrite",
      "Effect": "Allow",
      "Action": ["tag:TagResources"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"aws:RequestedRegion": ["us-east-1", "eu-west-1"]}
      }
    },
    {
      "Sid": "Telemetry",
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/lambda/tag-enforcement-*"
    }
  ]
}

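config:PutEvaluations cannot be resource-scoped (the API has no resource ARN), so its Resource stays *; constrain the statement with an aws:RequestedRegion condition like the one on RemediationTagWrite, and keep the role dedicated to this function. tag:TagResources is also not sufficient on its own: the caller needs the underlying service's tagging permission as well (for example ec2:CreateTags or rds:AddTagsToResource), which is where per-type scoping belongs. Never attach a wildcard tag:* in production: a compromised evaluation Lambda with broad tagging rights can silently rewrite ownership metadata across the estate.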
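One way to keep wildcard actions such as tag:* out of the role is a pre-merge check over the policy document. The sketch below is an illustrative linter, not an AWS tool; the function name find_wildcard_actions is hypothetical, and it only inspects Allow statements for actions ending in an asterisk.

```python
from typing import Any, Dict, List


def find_wildcard_actions(policy: Dict[str, Any]) -> List[str]:
    """Return 'Sid: action' strings for every allowed action ending in '*'."""
    findings: List[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        for action in actions:
            if action.endswith("*"):
                findings.append(f"{stmt.get('Sid', '<no-sid>')}: {action}")
    return findings
```

Run against the policy above it should return an empty list; a statement granting tag:* would be reported with its Sid so a CI job can fail the change.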

2. Policy schema design

Before deploying evaluation logic, externalize the policy. A versioned manifest mapping resource types to required tags makes the rule data-driven and lets you change policy without redeploying code:

tagging_policy:
  version: "1.2"
  defaults:
    required: ["CostCenter", "Environment", "Owner", "ManagedBy"]
    fallback:
      ManagedBy: "finops-automation"
  exclusions:
    resource_types:
      - "AWS::EC2::SpotInstance"
      - "AWS::Logs::LogStream"
      - "AWS::CloudFormation::Stack"
      - "AWS::ElasticLoadBalancingV2::TargetGroup"
    tags:
      - "aws:cloudformation:stack-name"
      - "aws:autoscaling:groupName"

Explicitly exclude immutable or ephemeral types that auto-manage tags or lack a tagging API, and ignore aws:-prefixed system tags when computing missing keys. Store the manifest in Systems Manager Parameter Store or an S3 object so a policy bump is a parameter update, not a deploy. Version it in Git with semantic versioning — the required set is a contract, and the same canonical model feeds the cross-provider normalization described in cross-cloud cost allocation strategies.
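To show how a data-driven rule might consume the manifest, here is a minimal sketch that resolves the missing keys for one resource. It assumes the manifest has already been loaded into a dict (loading from Parameter Store or S3 is out of scope), and the names MANIFEST and missing_tags are hypothetical.

```python
from typing import Any, Dict, Optional, Set

# Inline copy of part of the manifest above, as it would look once parsed.
MANIFEST: Dict[str, Any] = {
    "tagging_policy": {
        "version": "1.2",
        "defaults": {
            "required": ["CostCenter", "Environment", "Owner", "ManagedBy"],
            "fallback": {"ManagedBy": "finops-automation"},
        },
        "exclusions": {
            "resource_types": ["AWS::EC2::SpotInstance", "AWS::Logs::LogStream"],
            "tags": ["aws:cloudformation:stack-name", "aws:autoscaling:groupName"],
        },
    }
}


def missing_tags(
    manifest: Dict[str, Any], resource_type: str, tags: Dict[str, str]
) -> Optional[Set[str]]:
    """Return the missing required keys, or None when the type is excluded."""
    policy = manifest["tagging_policy"]
    exclusions = policy.get("exclusions", {})
    if resource_type in exclusions.get("resource_types", []):
        return None
    ignored = set(exclusions.get("tags", []))
    # System tags never count toward compliance, listed or not.
    present = {k for k in tags if k not in ignored and not k.startswith("aws:")}
    return set(policy["defaults"]["required"]) - present
```

Returning None for excluded types keeps "not applicable" distinct from "nothing missing", which maps cleanly onto the NOT_APPLICABLE and COMPLIANT verdicts.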

3. Custom-rule runtime and the synchronous window

Config invokes the rule’s Lambda with the payload described above and expects a PutEvaluations call against the supplied resultToken. The function runs inside a tight synchronous budget — treat 5 seconds as the effective ceiling for the evaluation itself. That means no per-resource Describe* round-trips inside the handler: validate against the tag snapshot Config already provided, and offload any heavy remediation to an asynchronous target. If the token expires or the function times out, Config marks the rule itself as errored, which silently blinds the control.
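A handler can make that budget explicit with a small deadline guard. The sketch below is one possible shape, assuming the 5-second working ceiling used on this page; the Deadline class and the 500 ms safety margin are hypothetical, while get_remaining_time_in_millis is the method real Lambda context objects expose.

```python
import time
from typing import Any, Callable, Optional

EVALUATION_BUDGET_MS = 5_000  # working ceiling used on this page
SAFETY_MARGIN_MS = 500        # hypothetical reserve for the PutEvaluations call


class Deadline:
    """Tracks how much of the evaluation budget is left."""

    def __init__(
        self,
        context: Optional[Any] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._start = clock()
        self._context = context

    def remaining_ms(self) -> float:
        elapsed_ms = (self._clock() - self._start) * 1000
        left = EVALUATION_BUDGET_MS - elapsed_ms
        if self._context is not None:
            # Never assume more time than the Lambda runtime itself reports.
            left = min(left, self._context.get_remaining_time_in_millis())
        return left

    def can_continue(self) -> bool:
        """True while there is still room beyond the safety margin."""
        return self.remaining_ms() > SAFETY_MARGIN_MS
```

Creating the Deadline at the top of the handler and checking can_continue() before any optional work gives the function a chance to submit its verdict before the budget runs out.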

4. EventBridge routing and idempotent remediation

Evaluation does not fix drift; it only labels it. Wire an EventBridge rule on detail-type: "Config Rules Compliance Change" filtered to newEvaluationResult.complianceType: NON_COMPLIANT, and target a dedicated remediation Lambda:

{
  "source": ["aws.config"],
  "detail-type": ["Config Rules Compliance Change"],
  "detail": {
    "messageType": ["ComplianceChangeNotification"],
    "newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}
  }
}
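On the receiving side, the remediation target has to pull the resource out of the matched event. The following is a minimal sketch assuming the standard detail fields of a Config Rules Compliance Change event (awsAccountId, awsRegion, resourceType, resourceId, configRuleName); the RemediationTarget type and both function names are hypothetical, and the ARN builder covers EC2 instances only as an illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RemediationTarget:
    account_id: str
    region: str
    resource_type: str
    resource_id: str
    rule_name: str


def parse_compliance_change(event: Dict[str, Any]) -> Optional[RemediationTarget]:
    """Return a target for NON_COMPLIANT events, None for anything else."""
    if event.get("detail-type") != "Config Rules Compliance Change":
        return None
    detail = event.get("detail") or {}
    result = detail.get("newEvaluationResult") or {}
    if result.get("complianceType") != "NON_COMPLIANT":
        return None
    return RemediationTarget(
        account_id=detail.get("awsAccountId", ""),
        region=detail.get("awsRegion", ""),
        resource_type=detail.get("resourceType", ""),
        resource_id=detail.get("resourceId", ""),
        rule_name=detail.get("configRuleName", ""),
    )


def ec2_instance_arn(target: RemediationTarget, partition: str = "aws") -> Optional[str]:
    """Build an ARN for EC2 instances only; other types need their own mapping."""
    if target.resource_type != "AWS::EC2::Instance":
        return None
    return (
        f"arn:{partition}:ec2:{target.region}:{target.account_id}"
        f":instance/{target.resource_id}"
    )
```

The event carries a bare resource ID rather than an ARN, so some type-specific mapping like this is needed before calling an ARN-based tagging API.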

Idempotency is non-negotiable: the remediation function reads current tag state immediately before mutating and applies only the missing keys, so a replayed event finds nothing to add and becomes a no-op. tag:TagResources takes no client token; the read-before-write guard is what provides the idempotency. Cross-account targets are reached through sts:AssumeRole, and tag:TagResources calls are wrapped in exponential backoff for the same rate-limit reasons detailed in handling billing API rate limits and retries.
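The two behaviours in this paragraph, fill-only tag writes and backoff around the tagging call, can be sketched independently of boto3. This is an illustrative sketch: tags_to_apply and with_backoff are hypothetical helpers, the caller supplies the predicate that recognises a throttling error, and the delays are arbitrary defaults rather than tuned values.

```python
import random
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def tags_to_apply(current: Dict[str, str], fallback: Dict[str, str]) -> Dict[str, str]:
    """Only keys absent from the resource; existing values are never touched."""
    return {k: v for k, v in fallback.items() if k not in current}


def with_backoff(
    call: Callable[[], T],
    is_throttle: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with exponential backoff and full jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise ValueError("max_attempts must be at least 1")
```

A remediation function would compute tags_to_apply first and return early when the result is empty, so replays never reach the rate-limited API at all.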

Production-Grade Python Ingestion Engine

The module below is a complete, self-contained custom-rule evaluation engine. It parses the Config payload into a typed dataclass, validates against an injected required-tag set, computes the verdict with eventual-consistency handling, and submits it through PutEvaluations with adaptive retries and structured JSON logging. The __main__ guard lets you replay a recorded event locally before deploying. The remediation half follows the same idempotent shape.

import json
import logging
import os
import sys
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

import boto3
from botocore.config import Config as BotoConfig
from botocore.exceptions import ClientError

# Structured JSON logging so CloudWatch Insights can query verdict outcomes.
logger = logging.getLogger("tag_enforcement")
logger.setLevel(logging.INFO)
_handler = logging.StreamHandler(sys.stdout)
_handler.setFormatter(logging.Formatter("%(message)s"))
if not logger.handlers:
    logger.addHandler(_handler)

# Adaptive retry mode backs off automatically on Config/EC2 throttling. The
# timeouts bound each attempt; retries can still push a failing call past 5 s.
_BOTO_CONFIG = BotoConfig(
    retries={"max_attempts": 3, "mode": "adaptive"},
    connect_timeout=2,
    read_timeout=3,
)
_config_client = boto3.client("config", config=_BOTO_CONFIG)

# Policy injected via ruleParameters or environment, never hard-coded.
REQUIRED_TAGS = set(
    os.getenv("REQUIRED_TAGS", "CostCenter,Environment,Owner,ManagedBy").split(",")
)
EXCLUDED_TYPES = set(
    os.getenv("EXCLUDED_TYPES", "AWS::EC2::SpotInstance,AWS::Logs::LogStream").split(",")
)
SYSTEM_TAG_PREFIXES = ("aws:",)

COMPLIANT = "COMPLIANT"
NON_COMPLIANT = "NON_COMPLIANT"
NOT_APPLICABLE = "NOT_APPLICABLE"
INSUFFICIENT_DATA = "INSUFFICIENT_DATA"


def log(event: str, **fields: Any) -> None:
    """Emit a single structured JSON log line."""
    logger.info(json.dumps({"event": event, **fields}))


@dataclass
class ConfigurationItem:
    """The slice of the Config payload the evaluator actually needs."""

    resource_type: str
    resource_id: str
    status: str
    tags: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def from_invoking_event(cls, invoking_event: Dict[str, Any]) -> "ConfigurationItem":
        item = invoking_event.get("configurationItem") or {}
        return cls(
            resource_type=item.get("resourceType", ""),
            resource_id=item.get("resourceId", ""),
            status=item.get("configurationItemStatus", ""),
            tags=item.get("tags") or {},
        )

    def user_tag_keys(self) -> set:
        """Tag keys excluding AWS-managed system tags."""
        return {
            k for k in self.tags
            if not any(k.startswith(p) for p in SYSTEM_TAG_PREFIXES)
        }


@dataclass
class Verdict:
    compliance_type: str
    annotation: str


def evaluate(item: ConfigurationItem, required: set) -> Verdict:
    """Pure verdict logic — deterministic and unit-testable in isolation."""
    if item.resource_type in EXCLUDED_TYPES:
        return Verdict(NOT_APPLICABLE, "Resource type excluded by policy schema")

    # Eventual consistency: Config can fire before tags propagate. Defer rather
    # than report a false NON_COMPLIANT that would trip remediation.
    if not item.user_tag_keys() and item.status == "ResourceDiscovered":
        return Verdict(INSUFFICIENT_DATA, "Tags not yet propagated; deferring verdict")

    missing = required - item.user_tag_keys()
    if not missing:
        return Verdict(COMPLIANT, "All required tags present")
    return Verdict(NON_COMPLIANT, f"Missing tags: {', '.join(sorted(missing))}")


def submit_evaluation(result_token: str, item: ConfigurationItem, verdict: Verdict) -> Dict[str, Any]:
    """Report the verdict back to AWS Config against the invocation token."""
    return _config_client.put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item.resource_type or "AWS::Config::ResourceCompliance",
                "ComplianceResourceId": item.resource_id,
                "ComplianceType": verdict.compliance_type,
                "Annotation": verdict.annotation[:256],  # API caps annotation length
                "OrderingTimestamp": datetime.now(timezone.utc),
            }
        ],
        ResultToken=result_token,
        # TestMode validates the call without persisting any result.
        TestMode=result_token == "TESTMODE",
    )


def _required_tags_from_event(event: Dict[str, Any]) -> set:
    """Prefer per-rule ruleParameters over the function-wide default."""
    raw = event.get("ruleParameters")
    if not raw:
        return REQUIRED_TAGS
    try:
        params = json.loads(raw)
        injected = params.get("requiredTags")
        if injected:
            return {t.strip() for t in injected.split(",") if t.strip()}
    except (json.JSONDecodeError, AttributeError):
        log("rule_parameters_parse_failed", raw=str(raw))
    return REQUIRED_TAGS


def lambda_handler(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, Any]:
    """AWS Config custom-rule entry point."""
    try:
        invoking_event = json.loads(event.get("invokingEvent", "{}"))
        item = ConfigurationItem.from_invoking_event(invoking_event)
        result_token = event.get("resultToken", "")
        required = _required_tags_from_event(event)

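        # Deleted resources are no longer billable; never report on them.
        if item.status == "ResourceDeleted":
            verdict = Verdict(NOT_APPLICABLE, "Resource deleted")
        else:
            verdict = evaluate(item, required)

        # PutEvaluations rejects INSUFFICIENT_DATA, so a deferred verdict is
        # expressed by submitting nothing; the resource stays unevaluated.
        if verdict.compliance_type == INSUFFICIENT_DATA:
            log("evaluation_deferred", resource_id=item.resource_id, annotation=verdict.annotation)
            return {"statusCode": 200, "compliance": verdict.compliance_type}

        submit_evaluation(result_token, item, verdict)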
        log(
            "evaluation_submitted",
            resource_id=item.resource_id,
            resource_type=item.resource_type,
            compliance=verdict.compliance_type,
            annotation=verdict.annotation,
        )
        return {"statusCode": 200, "compliance": verdict.compliance_type}

    except ClientError as exc:
        # Fail open: log loudly but do not crash the rule's evaluation loop.
        log("config_api_error", error=str(exc))
        return {"statusCode": 500, "error": str(exc)}
    except Exception as exc:  # noqa: BLE001 — surface unexpected failures to Config health metrics
        log("unexpected_evaluation_failure", error=str(exc))
        raise


def remediate(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, Any]:
    """EventBridge target: idempotently apply fallback tags to a NON_COMPLIANT resource."""
    detail = event.get("detail", {})
    ci = detail.get("configurationItem", {}) or detail.get("newEvaluationResult", {})
    resource_id = ci.get("resourceId") or detail.get("resourceId", "")
    fallback = {"ManagedBy": os.getenv("FALLBACK_OWNER", "finops-automation")}

    # The Tagging API addresses resources by ARN only. Compliance-change events
    # often carry a bare ID (e.g. i-0abc...), so skip rather than send a bad call.
    if not resource_id.startswith("arn:"):
        log("remediation_skipped_no_arn", resource_id=resource_id)
        return {"applied": 0}

    tagging = boto3.client("resourcegroupstaggingapi", config=_BOTO_CONFIG)
    # Read current state first so we only fill gaps and never clobber human/Terraform tags.
    current = tagging.get_resources(ResourceARNList=[resource_id])
    existing_keys = set()
    for mapping in current.get("ResourceTagMappingList", []):
        existing_keys |= {t["Key"] for t in mapping.get("Tags", [])}

    to_apply = {k: v for k, v in fallback.items() if k not in existing_keys}
    if not to_apply:
        log("remediation_noop", resource_id=resource_id)
        return {"applied": 0}

    response = tagging.tag_resources(ResourceARNList=[resource_id], Tags=to_apply)
    # tag_resources reports per-resource failures in the body instead of raising.
    failed = response.get("FailedResourcesMap") or {}
    if failed:
        log("remediation_failed", resource_id=resource_id, failures=failed)
        return {"applied": 0}
    log("remediation_applied", resource_id=resource_id, tags=list(to_apply))
    return {"applied": len(to_apply)}


if __name__ == "__main__":
    # Local replay: feed a recorded Config event on stdin and print the verdict.
    sample = json.load(sys.stdin) if not sys.stdin.isatty() else {
        "invokingEvent": json.dumps({
            "configurationItem": {
                "resourceType": "AWS::EC2::Instance",
                "resourceId": "i-0abc123def456",
                "configurationItemStatus": "OK",
                "tags": {"Environment": "prod", "Owner": "team-data"},
            }
        }),
        "resultToken": "TESTMODE",
        "ruleParameters": json.dumps({"requiredTags": "CostCenter,Environment,Owner"}),
    }
    parsed = ConfigurationItem.from_invoking_event(json.loads(sample["invokingEvent"]))
    print(evaluate(parsed, _required_tags_from_event(sample)))

PutEvaluations accepts a TestMode flag that validates the call without persisting any result, and TESTMODE is the conventional placeholder token for that mode. The __main__ replay goes one step further and never calls the API at all: it prints the pure verdict from evaluate instead.

Schema Reference Table

The evaluator collapses AWS-specific Config fields onto the same canonical compliance model used across every provider adapter, so a verdict from Config, GCP Cloud Asset Inventory, or Azure Resource Graph is structurally identical downstream.

Config / provider field	Normalized field	Type	Notes
`configurationItem.resourceId`	`resource_id`	string	Primary key for the compliance verdict and remediation target
`configurationItem.resourceType`	`resource_type`	string	Drives per-type exclusions; `AWS::*` namespace
`configurationItem.tags`	`tags{}`	map	User tags only after stripping `aws:` system prefixes
`configurationItem.configurationItemStatus`	`lifecycle_state`	enum	Maps `ResourceDiscovered` → defer, `ResourceDeleted` → skip
`newEvaluationResult.complianceType`	`verdict`	enum	One of `COMPLIANT` / `NON_COMPLIANT` / `NOT_APPLICABLE` / `INSUFFICIENT_DATA`
`newEvaluationResult.annotation`	`verdict_reason`	string	Human-readable detail; capped at 256 chars by the API
`resultToken`	`invocation_token`	string	Binds verdict to invocation; expires, never persisted
`awsAccountId`	`account_id`	string	Partition key for showback and unallocated-cost routing
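The mapping in the table can be expressed as a small typed record. This is a sketch of one possible canonical shape, not the pipeline's actual model: the ComplianceRecord class, its from_config constructor, and the lifecycle_action property are hypothetical names that follow the normalized field names in the table.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Lifecycle handling from the table: discovered resources defer, deleted ones skip.
_LIFECYCLE_ACTIONS = {"ResourceDiscovered": "defer", "ResourceDeleted": "skip"}


@dataclass(frozen=True)
class ComplianceRecord:
    account_id: str
    resource_id: str
    resource_type: str
    lifecycle_state: str
    verdict: str
    verdict_reason: str
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def lifecycle_action(self) -> str:
        return _LIFECYCLE_ACTIONS.get(self.lifecycle_state, "evaluate")

    @classmethod
    def from_config(
        cls, item: Dict[str, Any], verdict: str, reason: str
    ) -> "ComplianceRecord":
        raw_tags = item.get("tags") or {}
        return cls(
            account_id=item.get("awsAccountId", ""),
            resource_id=item.get("resourceId", ""),
            resource_type=item.get("resourceType", ""),
            lifecycle_state=item.get("configurationItemStatus", ""),
            verdict=verdict,
            verdict_reason=reason[:256],  # same cap the API applies to annotations
            # Keep user tags only; aws:-prefixed system tags are stripped.
            tags={k: v for k, v in raw_tags.items() if not k.startswith("aws:")},
        )
```

The resultToken is deliberately absent from the record, matching the table's note that it is never persisted.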

Operational Considerations

Evaluation frequency. Set MaximumExecutionFrequency to Six_Hours for baseline drift detection and rely on ConfigurationItemChangeNotification for real-time enforcement. Avoid One_Hour unless a compliance framework demands it — it multiplies Lambda invocations and Config evaluation charges for little additional coverage.
Synchronous window. The custom-rule Lambda must answer within roughly 5 seconds; budget for it by validating against the supplied snapshot and never issuing per-resource Describe* calls inside the handler.
PutEvaluations limits. A single call accepts up to 100 evaluation results, and the resultToken is single-use and time-bound, so submit once per invocation. Separately, Config keeps configuration history for 7 years by default; the retention period is configurable.
Tagging API throttling. Remediation through tag:TagResources and the Resource Groups Tagging API is rate-limited; wrap calls in adaptive backoff and cap per-account concurrency, mirroring the quota patterns in handling billing API rate limits and retries.
Eventual consistency window. A freshly created resource can surface a ResourceDiscovered event seconds before its tags propagate; treating that case as INSUFFICIENT_DATA (no evaluation submitted) defers the verdict to the next evaluation instead of firing premature remediation.
Cost attribution. Config records and rule evaluations are billed per item and per evaluation. Tag the enforcement pipeline’s own resources with CostCenter and Owner, and exclude internal automation accounts from showback so the control plane does not pollute team-level reports — the same boundary logic as setting up FinOps governance boundaries in Terraform.
Monitoring hooks. Alarm on the AWS/Config CloudWatch metrics EvaluationsFailed and on a rising NON_COMPLIANT count, and emit a custom metric on remediation no-op vs applied ratio to catch a policy change that suddenly marks the whole estate non-compliant.
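The no-op versus applied ratio from the monitoring item can be emitted without an extra API call by logging a CloudWatch Embedded Metric Format record, which Lambda turns into metrics from stdout. The sketch below assumes that approach; the namespace, metric names, and function name are hypothetical.

```python
import json
import time
from typing import Optional

NAMESPACE = "FinOps/TagEnforcement"  # hypothetical namespace


def remediation_ratio_metric(
    applied: int, noop: int, account_id: str, now_ms: Optional[int] = None
) -> str:
    """Return one EMF log line carrying the applied ratio for an account."""
    total = applied + noop
    ratio = applied / total if total else 0.0
    record = {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": NAMESPACE,
                    "Dimensions": [["AccountId"]],
                    "Metrics": [
                        {"Name": "RemediationAppliedRatio", "Unit": "None"},
                        {"Name": "RemediationTotal", "Unit": "Count"},
                    ],
                }
            ],
        },
        # Dimension and metric values live at the top level of the record.
        "AccountId": account_id,
        "RemediationAppliedRatio": ratio,
        "RemediationTotal": total,
    }
    return json.dumps(record)
```

Printing the returned line from the remediation function is enough; an alarm on a sudden jump in RemediationAppliedRatio would flag a policy change that marks much of the estate non-compliant.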

Troubleshooting

Rule shows INSUFFICIENT_DATA for every resource. Root cause: the configuration recorder is not recording the affected resource types, so tags is always empty. Detection: the Config console shows the rule as evaluated but no resources in scope. Remediation: confirm the recorder’s recordingGroup includes the types (or allSupported: true) and that the delivery channel is healthy before debugging the rule code.

Evaluations silently stop appearing after a deploy. Root cause: the resultToken expired because the Lambda exceeded the synchronous window, or the IAM role lost config:PutEvaluations. Detection: EvaluationsFailed climbs while the function logs success. Remediation: profile the handler under the 5-second budget, remove any in-handler Describe* calls, and re-attach the ConfigEvaluationWrite statement.

Remediation rewrites tags an operator just set. Root cause: the remediation function blanket-applied fallback tags instead of reading current state first. Detection: an audit trail shows ManagedBy flipping back to finops-automation minutes after a manual fix. Remediation: gate on get_resources and apply only keys absent from existing_keys, exactly as the remediate function above does.

NON_COMPLIANT floods the remediation queue after a policy update. Root cause: a new required tag was promoted straight to production, instantly marking the whole estate non-compliant. Detection: a step change in the NON_COMPLIANT metric correlated with a manifest version bump. Remediation: validate manifest changes against a staging rule first, roll out the new required set behind ruleParameters, and stage remediation with a dry-run flag before enabling writes.
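Before promoting a new required set, the size of the resulting flood can be estimated offline from an inventory of current tag keys. This is a minimal sketch under that assumption; newly_non_compliant is a hypothetical helper, and the estate mapping of resource ID to tag keys would come from whatever inventory export is available.

```python
from typing import Dict, List, Set


def newly_non_compliant(
    estate: Dict[str, Set[str]],
    old_required: Set[str],
    new_required: Set[str],
) -> List[str]:
    """Resource IDs that satisfy the old required set but fail the new one."""
    return sorted(
        resource_id
        for resource_id, keys in estate.items()
        if old_required <= keys and not new_required <= keys
    )
```

For example, adding DataClass to a policy that previously required only Owner would list every resource that has Owner but no DataClass, which gives a concrete number to review before enabling remediation writes.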

Cross-account remediation fails with AccessDenied. Root cause: the remediation role cannot assume the target account’s tagging role. Detection: sts:AssumeRole errors in the remediation logs for specific accounts. Remediation: confirm the trust policy in each member account permits the central role and that tag:TagResources is granted there, scoped by aws:RequestedRegion.

Frequently Asked Questions

Why return INSUFFICIENT_DATA instead of NON_COMPLIANT when tags are missing?

Config is eventually consistent: a resource often fires a ResourceDiscovered event before its tags propagate. Reporting NON_COMPLIANT immediately would trigger remediation and alert noise on a resource that is actually compliant a second later. Deferring as INSUFFICIENT_DATA, which in practice means submitting no evaluation because PutEvaluations does not accept that value, leaves the verdict to the next evaluation and eliminates that false-positive class.

Should I enforce tags with AWS Config or with IaC scanners like Checkov?

Both, at different stages. IaC scanners (OPA, Checkov, cfn-lint) block non-compliant definitions before deployment, but they cannot see console changes, marketplace deployments, or legacy drift. AWS Config is the continuous post-provisioning net that catches everything created or mutated outside the governed pipeline.

How do I stop remediation from overwriting tags set by Terraform or a human?

Read current tag state immediately before mutating and apply only the missing keys with tag:TagResources; never blanket-overwrite. Because the write set is computed from what is absent, an already-tagged resource is a no-op on every replay, so the function can run repeatedly without churn.

What is the hard limit on a Config custom-rule evaluation?

Treat 5 seconds as the effective synchronous ceiling for the evaluation, and submit at most 100 results per PutEvaluations call against a single-use resultToken. Heavy work belongs on an asynchronous EventBridge target, not inside the rule handler.

How does this map onto GCP and Azure for multi-cloud parity?

Normalize all three onto one canonical verdict model. GCP uses Organization Policy constraints plus Cloud Asset Inventory feeds for continuous evaluation; Azure uses Azure Policy initiatives with a deny or modify effect and Resource Graph queries for drift detection. Each provider validates against the same shared JSON Schema so allocation stays consistent regardless of provisioning surface.

Related

Resource Tagging & Validation Pipelines: the parent reference defining the acquisition→normalization→allocation→persistence model this AWS Config rule implements at the validation stage.
Cross-Cloud Cost Allocation Strategies: how compliant verdicts feed allocation and how untagged spend is routed to an unallocated owner across providers.
Handling Billing API Rate Limits & Retries: the backoff and quota patterns behind the remediation engine’s adaptive retry config.
Setting Up FinOps Governance Boundaries in Terraform: codifying the IAM boundaries, delivery channel, and showback exclusions this enforcement pipeline depends on.
FinOps Architecture & Billing Fundamentals: the broader discipline this enforcement control plane supports, from cost categories to amortization.
