Reserved Instance Coverage vs Utilization Metrics: Reconciling Two Divergent Cost Explorer APIs

Reserved Instance (RI) Coverage and Utilization are the two headline efficiency numbers on every FinOps dashboard, and they are routinely conflated into a single broken metric. Coverage measures the percentage of eligible compute usage that a purchased reservation protects; Utilization measures the percentage of purchased reservation capacity that is actually consumed. This page addresses the drift that appears when you merge the two numbers from AWS Cost Explorer. They arrive through two different API endpoints, GetReservationCoverage and GetReservationUtilization, with different denominators, different default GroupBy dimensions, and independent pagination, so a naive join produces phantom overcommitment, double-counted coverage, and false optimization alerts.

The page details a deterministic Python reconciliation pipeline that pulls both feeds, normalizes their composite keys, and emits a single audit-ready row per (period, family, account), so the coverage and utilization figures line up against the invoice. It is the metric layer that sits directly downstream of Reserved Instance Mapping Logic inside the broader FinOps Architecture & Billing Fundamentals pipeline.

Root Cause & Failure Modes

The drift is not a bug in your code — it is structural in how AWS exposes reservation telemetry. The two endpoints compute fundamentally different ratios:

GetReservationCoverage returns both OnDemand and Reserved usage in the denominator: coverage = ReservedHours / (OnDemandHours + ReservedHours) * 100. It answers “how much of my eligible footprint is on a reservation?”
GetReservationUtilization isolates only the purchased pool: utilization = UsedHours / TotalPurchasedHours * 100. It answers “how much of what I bought am I actually burning?”
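
As a worked illustration of the two formulas, the sketch below computes both ratios from made-up hour counts. The helper names coverage_pct and utilization_pct and all of the numbers are assumptions for illustration, not values from any real account. Holding the used hours fixed while changing the purchased pool moves utilization and leaves coverage untouched.

```python
from decimal import Decimal


def coverage_pct(reserved_hours: Decimal, on_demand_hours: Decimal) -> Decimal:
    """ReservedHours / (OnDemandHours + ReservedHours) * 100; 0 when nothing ran."""
    base = on_demand_hours + reserved_hours
    return reserved_hours / base * 100 if base else Decimal("0")


def utilization_pct(used_hours: Decimal, purchased_hours: Decimal) -> Decimal:
    """UsedHours / TotalPurchasedHours * 100; 0 when nothing was purchased."""
    return used_hours / purchased_hours * 100 if purchased_hours else Decimal("0")


# Hypothetical day: 600 reservation-covered hours and 200 on-demand hours.
cov = coverage_pct(Decimal("600"), Decimal("200"))            # equals 75 percent

# The same 600 used hours measured against two different purchased pools.
util_tight = utilization_pct(Decimal("600"), Decimal("600"))  # equals 100 percent
util_loose = utilization_pct(Decimal("600"), Decimal("800"))  # equals 75 percent

print(cov, util_tight, util_loose)
```

The guard on a zero denominator mirrors the cov_check and util_check helpers in the module further down, so an empty period reports 0 rather than raising.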

Because the numerator and denominator differ, you cannot derive one from the other, and merging their raw rows breaks in three quantifiable ways at scale:

Dimensional mismatch. Coverage groups naturally by INSTANCE_TYPE_FAMILY, REGION, TENANCY, and PURCHASE_OPTION; utilization is commonly split by LINKED_ACCOUNT and SUBSCRIPTION. A direct pandas.merge on mismatched keys either explodes rows (Cartesian fan-out) or drops them silently. You must project both feeds onto one composite key before joining.
Pagination and rate limits. Both endpoints cap each response and require NextPageToken iteration; the Cost Explorer surface throttles at roughly 5 transactions per second per account and bills $0.01 per paginated request. Unhandled pagination truncates consolidated-billing views, and an unbounded retry loop triggers cascading ThrottlingException errors. The same constraint is covered in Handling Billing API Rate Limits & Retries.
Time-granularity drift. DAILY granularity hides intra-day cross-account RI sharing, so coverage looks artificially smooth; HOURLY granularity multiplies API call volume and pushes you straight into the throttle. The reconciliation must pin one granularity and re-ingest the trailing finalization window, because the last 24–72 hours of vendor-reported figures are provisional.
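
To make the dimensional-mismatch point concrete, here is a minimal standard-library sketch of projecting both feeds onto one composite key before joining. The row shapes, the extra region and subscription_id dimensions, and the helper names (make_row, project, outer_join) are assumptions for illustration; a real pipeline would do the same thing with pandas groupby and merge.

```python
from collections import defaultdict
from decimal import Decimal

KEY = ("start_date", "end_date", "instance_family", "linked_account_id")


def make_row(family, account, **extra):
    """Build a hypothetical one-day feed row with extra, feed-specific fields."""
    return {"start_date": "2024-05-01", "end_date": "2024-05-02",
            "instance_family": family, "linked_account_id": account, **extra}


def project(rows, metrics):
    """Collapse rows onto KEY, summing the named hour metrics as Decimal."""
    out = defaultdict(lambda: {m: Decimal("0") for m in metrics})
    for r in rows:
        k = tuple(r[f] for f in KEY)
        for m in metrics:
            out[k][m] += Decimal(str(r[m]))
    return dict(out)


def outer_join(left, right, left_metrics, right_metrics):
    """Union of keys; a metric missing on one side is filled with Decimal zero."""
    joined = {}
    for k in sorted(set(left) | set(right)):
        rec = {m: Decimal("0") for m in (*left_metrics, *right_metrics)}
        rec.update(left.get(k, {}))
        rec.update(right.get(k, {}))
        joined[k] = rec
    return joined


# Coverage arrives split by region, utilization by subscription.
coverage_rows = [make_row("m5", "111", region="us-east-1", reserved_hours="20"),
                 make_row("m5", "111", region="eu-west-1", reserved_hours="10")]
utilization_rows = [make_row("m5", "111", subscription_id="sub-a", used_hours="18"),
                    make_row("m5", "111", subscription_id="sub-b", used_hours="12"),
                    make_row("c5", "222", subscription_id="sub-c", used_hours="24")]

cov = project(coverage_rows, ["reserved_hours"])
util = project(utilization_rows, ["used_hours"])
joined = outer_join(cov, util, ["reserved_hours"], ["used_hours"])

# Row conservation: the join yields exactly the union of the two key sets.
assert len(joined) == len(cov) + len(util) - len(set(cov) & set(util))
```

Joining the raw rows on family and account alone would pair each of the two m5 coverage rows with each of the two m5 utilization rows and yield four m5 rows; projecting first yields one.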

The most damaging failure mode is phantom overcommitment. When reserved_hours > total_purchased_hours for a family, it looks like you bought too little, but the usual cause is cross-account RI sharing where LinkedAccountId was not propagated, so the same reservation is counted in two accounts. Acting on that signal by buying more reservations burns budget on capacity you already own.
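
One way to triage such rows before a human audit is to roll the hours up across accounts. The sketch below is a heuristic of my own, not something the Cost Explorer API provides: it assumes that if a family's reserved hours fit inside its purchased hours once all accounts are summed for the period, the per-account excess is more likely a sharing artifact. The function name triage_overcommitment and both labels are hypothetical, and neither label authorizes a purchase.

```python
from collections import defaultdict
from decimal import Decimal


def triage_overcommitment(rows):
    """Label each overcommitted row 'sharing_suspected' or 'needs_audit'."""
    # Sum reserved and purchased hours per (period, family) across all accounts.
    totals = defaultdict(lambda: [Decimal("0"), Decimal("0")])
    for r in rows:
        bucket = totals[(r["start_date"], r["instance_family"])]
        bucket[0] += r["reserved_hours"]
        bucket[1] += r["total_purchased_hours"]

    labels = {}
    for r in rows:
        if r["reserved_hours"] <= r["total_purchased_hours"]:
            continue  # not overcommitted at the account level
        reserved, purchased = totals[(r["start_date"], r["instance_family"])]
        key = (r["start_date"], r["instance_family"], r["linked_account_id"])
        labels[key] = "sharing_suspected" if reserved <= purchased else "needs_audit"
    return labels


def row(family, account, reserved, purchased):
    """Hypothetical reconciled row for a single day."""
    return {"start_date": "2024-05-01", "instance_family": family,
            "linked_account_id": account,
            "reserved_hours": Decimal(reserved),
            "total_purchased_hours": Decimal(purchased)}


rows = [
    row("m5", "111", "24", "48"),  # purchasing account with spare capacity
    row("m5", "222", "24", "0"),   # consumes the shared reservation
    row("c5", "111", "30", "24"),  # excess remains even after the roll-up
]
labels = triage_overcommitment(rows)
print(labels)
```

Both outcomes still go to the manual audit; the label only orders the queue.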

Production Pipeline Architecture

The reconciliation runs as a four-phase execution model, each phase a stateless, idempotent stage so a retry replaces a window rather than appending to it:

Extract: paginate GetReservationCoverage and GetReservationUtilization independently, with capped exponential backoff on ThrottlingException. This is the acquisition handoff from the AWS Cost Explorer Architecture surface.
Normalize: flatten the nested CoveragesByTime / UtilizationsByTime → Groups payloads into two flat frames keyed on the same four-field composite key (start_date, end_date, instance_family, linked_account_id), coercing every metric to a typed decimal.
Reconcile: outer-join the two frames on the composite key (never an inner join, which drops rows that exist in only one feed), recompute coverage and utilization from raw hours, and flag rows where the recomputed value diverges from the API-reported value or where overcommitment appears.
Persist: write one Parquet partition per YYYY/MM/, keyed for idempotent upsert, so downstream cross-cloud cost allocation strategies and chargeback consume a single reconciled source of truth.
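
The "retry replaces a window rather than appending to it" property of the Persist phase can be sketched with an atomic overwrite. To stay dependency-free, this example substitutes a local directory for S3 and CSV for Parquet; both substitutions, along with the names partition_path and persist, are assumptions for illustration only.

```python
import csv
import os
import tempfile
from pathlib import Path


def partition_path(root: Path, start_date: str) -> Path:
    """Map an ISO start date (YYYY-MM-DD) to root/YYYY/MM/recon.csv."""
    return root / start_date[:4] / start_date[5:7] / "recon.csv"


def persist(root: Path, start_date: str, rows, fieldnames) -> Path:
    """Write the whole partition to a temp file, then swap it into place."""
    target = partition_path(root, start_date)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp, target)  # overwrite: a rerun never appends
    return target


with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    fields = ["instance_family", "coverage_pct"]
    rows = [{"instance_family": "m5", "coverage_pct": "75.00"}]
    persist(root, "2024-05-01", rows, fields)
    target = persist(root, "2024-05-01", rows, fields)  # simulated retry
    line_count = len(target.read_text().splitlines())   # header + one row
    partition_parts = target.parts[-3:]
    leftovers = sorted(p.name for p in target.parent.iterdir())
```

Because each run rewrites the full partition, re-reconciling a provisional window simply replaces the earlier file.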

The covered-hours and purchased-hours that feed phases 2–3 are produced upstream by Reserved Instance Mapping Logic; this page turns that mapping output into the two reader-facing efficiency metrics. For multi-account orgs, the LinkedAccountId normalization here depends on the account taxonomy described in How to Structure AWS Cost Categories for Multi-Account Orgs.

Step-by-Step Python Implementation

The module below is self-contained and runnable. It paginates both endpoints with capped backoff, normalizes both feeds onto one composite key, performs a deterministic outer join, recomputes both metrics from raw hours, flags anomalies, and exposes a --dry-run CLI flag (the dry_run argument of run) that skips persistence. Hour and percentage fields are parsed as Decimal to keep ledger math exact.

import argparse
import logging
import time
from decimal import Decimal
from typing import Dict, List

import boto3
import pandas as pd
from botocore.config import Config
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("ri_reconciliation")

KEY = ["start_date", "end_date", "instance_family", "linked_account_id"]


class RIReconciler:
    def __init__(self, region: str = "us-east-1", max_retries: int = 5):
        self.max_retries = max_retries
        self.ce = boto3.client(
            "ce",
            region_name=region,
            config=Config(retries={"max_attempts": 8, "mode": "adaptive"}),
        )

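    def _paginate(self, method: str, kwargs: Dict) -> List[Dict]:
        """Iterate NextPageToken with capped exponential backoff on throttling."""
        # Cost Explorer paginates with NextPageToken (not NextToken). Work on a
        # copy so the token is never written back into the caller's dict.
        params, pages, attempt = dict(kwargs), [], 0
        # Cost Explorer reports rate limiting as LimitExceededException;
        # ThrottlingException is retried as well.
        throttle_codes = {"ThrottlingException", "LimitExceededException"}
        while True:
            try:
                resp = getattr(self.ce, method)(**params)
            except ClientError as exc:
                if exc.response["Error"]["Code"] not in throttle_codes:
                    raise
                attempt += 1
                if attempt > self.max_retries:
                    logger.error("throttled past %d retries; aborting", self.max_retries)
                    raise
                wait = min(2 ** attempt, 30)  # exponential backoff capped at 30s
                logger.warning("throttled; backoff %ds (%d/%d)", wait, attempt, self.max_retries)
                time.sleep(wait)
                continue
            pages.append(resp)
            attempt = 0  # reset the retry counter after each successful page
            token = resp.get("NextPageToken")
            if not token:
                return pages
            params["NextPageToken"] = token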

    @staticmethod
    def _dec(metric: Dict, key: str) -> Decimal:
        return Decimal(str(metric.get(key, {}).get("Value", "0")))

    def fetch_coverage(self, start: str, end: str) -> pd.DataFrame:
        logger.info("coverage %s -> %s", start, end)
        pages = self._paginate("get_reservation_coverage", {
            "TimePeriod": {"Start": start, "End": end},
            "Granularity": "DAILY",
            "GroupBy": [{"Type": "DIMENSION", "Key": "INSTANCE_TYPE_FAMILY"}],
        })
        rows = []
        for page in pages:
            for block in page.get("CoveragesByTime", []):
                period = block["TimePeriod"]
                for grp in block.get("Groups", []):
                    cov = grp.get("Coverage", {}).get("CoverageHours", {})
                    rows.append({
                        "start_date": period["Start"], "end_date": period["End"],
                        "instance_family": grp["Attributes"].get("instanceType", "unknown"),
                        "linked_account_id": grp["Attributes"].get("linkedAccount", "consolidated"),
                        "coverage_pct": Decimal(str(cov.get("CoverageHoursPercentage", "0"))),
                        "on_demand_hours": Decimal(str(cov.get("OnDemandHours", "0"))),
                        "reserved_hours": Decimal(str(cov.get("ReservedHours", "0"))),
                    })
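        # Name the columns explicitly so an empty feed still carries KEY and merges cleanly.
        return pd.DataFrame(rows, columns=KEY + ["coverage_pct", "on_demand_hours", "reserved_hours"])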

    def fetch_utilization(self, start: str, end: str) -> pd.DataFrame:
        logger.info("utilization %s -> %s", start, end)
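        # NOTE: the Cost Explorer API reference documents SUBSCRIPTION_ID as the
        # GroupBy key for this endpoint; confirm that INSTANCE_TYPE_FAMILY is
        # accepted in your environment before relying on this call.
        pages = self._paginate("get_reservation_utilization", {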
            "TimePeriod": {"Start": start, "End": end},
            "Granularity": "DAILY",
            "GroupBy": [{"Type": "DIMENSION", "Key": "INSTANCE_TYPE_FAMILY"}],
        })
        rows = []
        for page in pages:
            for block in page.get("UtilizationsByTime", []):
                period = block["TimePeriod"]
                for grp in block.get("Groups", []):
                    util = grp.get("Utilization", {})
                    rows.append({
                        "start_date": period["Start"], "end_date": period["End"],
                        "instance_family": grp.get("Value", "unknown"),
                        "linked_account_id": grp.get("Attributes", {}).get("linkedAccount", "consolidated"),
                        "utilization_pct": Decimal(str(util.get("UtilizationPercentage", "0"))),
                        "used_hours": Decimal(str(util.get("TotalActualHours", "0"))),
                        "total_purchased_hours": Decimal(str(util.get("PurchasedHours", "0"))),
                    })
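        # Name the columns explicitly so an empty feed still carries KEY and merges cleanly.
        return pd.DataFrame(rows, columns=KEY + ["utilization_pct", "used_hours", "total_purchased_hours"])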

    def reconcile(self, cov: pd.DataFrame, util: pd.DataFrame) -> pd.DataFrame:
        """Outer-join on the composite key; recompute and flag divergence."""
        if cov.empty and util.empty:
            return pd.DataFrame()
        df = pd.merge(cov, util, on=KEY, how="outer")
        for col in ["coverage_pct", "on_demand_hours", "reserved_hours",
                    "utilization_pct", "used_hours", "total_purchased_hours"]:
            if col not in df.columns:
                df[col] = Decimal("0")
            df[col] = df[col].apply(lambda v: v if isinstance(v, Decimal) else Decimal("0"))

        def cov_check(r):
            base = r.on_demand_hours + r.reserved_hours
            return (r.reserved_hours / base * 100) if base else Decimal("0")

        def util_check(r):
            return (r.used_hours / r.total_purchased_hours * 100) if r.total_purchased_hours else Decimal("0")

        df["coverage_drift"] = (df["coverage_pct"] - df.apply(cov_check, axis=1)).abs()
        df["utilization_drift"] = (df["utilization_pct"] - df.apply(util_check, axis=1)).abs()
        df["is_overcommitted"] = df["reserved_hours"] > df["total_purchased_hours"]
        logger.info("rows=%d overcommitted=%d", len(df), int(df["is_overcommitted"].sum()))
        return df


def run(start: str, end: str, dry_run: bool = False) -> pd.DataFrame:
    r = RIReconciler()
    result = r.reconcile(r.fetch_coverage(start, end), r.fetch_utilization(start, end))
    if dry_run:
        logger.info("dry-run: skipping persistence")
    elif not result.empty:
        result.to_parquet(f"s3://finops-metrics/ri/{start[:7].replace('-', '/')}/recon.parquet")
    return result


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--start", required=True)
    p.add_argument("--end", required=True)
    p.add_argument("--dry-run", action="store_true")
    args = p.parse_args()
    run(args.start, args.end, dry_run=args.dry_run)

Verification & Testing

Correctness here means the reconciled metrics match what AWS would invoice, so verify against the API’s own reported percentages before trusting the pipeline:

Drift assertion. After a run, assert coverage_drift and utilization_drift stay below a tolerance (e.g. 0.5 percentage points). A recomputed value that diverges from the API-reported coverage_pct/utilization_pct by more than rounding error points to a key normalization bug or a partial-hour proration mismatch.
Row-conservation check. The outer join must never produce more rows than len(coverage) + len(utilization) minus the matched intersection. A larger count signals a Cartesian fan-out from a non-unique composite key — usually linked_account_id missing on one feed.
Dry-run parity. Run with --dry-run against a finalized month and diff the returned frame against a known-good fixture (checksum the sorted (KEY + metrics) tuples). Because the pipeline is deterministic, two runs over the same finalized window must be byte-identical.
Overcommitment audit. Every row where is_overcommitted is true should be manually reconciled against the consolidated-billing sharing flags before any purchase decision — confirm it is real shortfall, not unpropagated LinkedAccountId.
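
The drift assertion and the dry-run parity check can be sketched as two small helpers over plain dict rows. The names drift_violations and frame_checksum, the 0.5 tolerance, and the sample rows are hypothetical; with a DataFrame you would pass df.to_dict("records").

```python
import hashlib
from decimal import Decimal

TOLERANCE = Decimal("0.5")  # percentage points; an assumed threshold


def drift_violations(rows, tolerance=TOLERANCE):
    """Return rows whose recomputed metrics diverge from the API-reported ones."""
    return [r for r in rows
            if r["coverage_drift"] > tolerance or r["utilization_drift"] > tolerance]


def frame_checksum(rows, columns):
    """SHA-256 over the sorted column tuples, so row order does not matter."""
    canon = sorted(tuple(str(r[c]) for c in columns) for r in rows)
    digest = hashlib.sha256()
    for tup in canon:
        digest.update("\x1f".join(tup).encode("utf-8"))  # unit separator between fields
        digest.update(b"\n")
    return digest.hexdigest()


rows = [
    {"instance_family": "m5", "coverage_pct": Decimal("75.00"),
     "coverage_drift": Decimal("0.01"), "utilization_drift": Decimal("0")},
    {"instance_family": "c5", "coverage_pct": Decimal("40.00"),
     "coverage_drift": Decimal("1.20"), "utilization_drift": Decimal("0")},
]

violations = drift_violations(rows)
cols = ["instance_family", "coverage_pct"]
checksum_a = frame_checksum(rows, cols)
checksum_b = frame_checksum(list(reversed(rows)), cols)
print(len(violations), checksum_a == checksum_b)
```

The checksum compares string forms, so Decimal("75") and Decimal("75.00") hash differently; that is intentional for a byte-identical parity check but worth knowing when building the fixture.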

Common Pitfalls Checklist

Joining on a mismatched key. Project both feeds onto the same composite key first; never merge raw coverage rows against raw utilization rows. Fix: standardize on KEY = [start_date, end_date, instance_family, linked_account_id].
Inner-joining the two feeds. An inner join silently drops families that appear in only one endpoint. Fix: always how="outer" and fill missing metrics with Decimal("0").
Parsing percentages as float. IEEE-754 drift compounds across thousands of rows into reconciliation gaps. Fix: parse every metric and hour as Decimal.
Treating the trailing 72 hours as final. Vendor figures are provisional until billing finalizes. Fix: re-ingest and re-reconcile the trailing window every run; only freeze partitions older than the finalization lag.
Reacting to phantom overcommitment. reserved_hours > total_purchased_hours usually means a sharing-scope propagation gap, not a real shortfall. Fix: flag for audit, reconcile against Reserved Instance Mapping Logic, never auto-purchase on it.
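
For the float pitfall in particular, a tiny demonstration shows why the module parses everything as Decimal. The value 0.1 and the count of ten are arbitrary illustrative choices.

```python
from decimal import Decimal

fraction = "0.1"  # e.g. a tenth of an hour, as the API would return it in a string
count = 10

# Binary floats cannot represent 0.1 exactly, so the sum misses 1.0.
float_total = sum(float(fraction) for _ in range(count))

# Decimal keeps the base-10 value exact.
decimal_total = sum((Decimal(fraction) for _ in range(count)), Decimal("0"))

print(float_total == 1.0)               # False
print(decimal_total == Decimal("1.0"))  # True
```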

Frequently Asked Questions

Can I derive RI Utilization from Coverage, or vice versa?

No. Coverage’s denominator is your full eligible footprint (OnDemandHours + ReservedHours), while Utilization’s denominator is only the purchased pool (TotalPurchasedHours). On-demand hours and purchased hours vary independently, so knowing one ratio does not determine the other; you must pull both endpoints and join them on a shared composite key.

Why does my reservation show coverage above what I purchased?

That is phantom overcommitment: reserved_hours exceeds total_purchased_hours because cross-account RI sharing counted the same reservation in two accounts without propagating LinkedAccountId. Flag the row for audit and reconcile against the sharing scope rather than buying more capacity you already own.

DAILY or HOURLY granularity for reconciliation?

Pin DAILY for routine reporting — HOURLY multiplies request volume and pushes you into the ~5 TPS Cost Explorer throttle while billing $0.01 per paginated call. Use HOURLY only for targeted investigation of intra-day cross-account sharing, and always re-ingest the trailing finalization window because recent figures are provisional.
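
Re-ingesting the trailing finalization window comes down to a little date arithmetic. The sketch below assumes a three-day lag (the upper end of the 24–72 hour range mentioned earlier) and an exclusive end date, which matches the Cost Explorer TimePeriod convention; the names reingest_window and is_frozen are hypothetical.

```python
from datetime import date, timedelta


def reingest_window(today: date, lag_days: int = 3):
    """Return (start, end) ISO dates for the provisional window; end is exclusive."""
    start = today - timedelta(days=lag_days)
    return start.isoformat(), today.isoformat()


def is_frozen(partition_month: str, today: date, lag_days: int = 3) -> bool:
    """A YYYY-MM partition is frozen once its last day is older than the lag."""
    year, month = map(int, partition_month.split("-"))
    # First day of the following month, rolling the year over after December.
    first_of_next = date(year + (1 if month == 12 else 0), month % 12 + 1, 1)
    return first_of_next <= today - timedelta(days=lag_days)


window = reingest_window(date(2024, 6, 2))
may_frozen_early = is_frozen("2024-05", date(2024, 6, 2))
may_frozen_later = is_frozen("2024-05", date(2024, 6, 4))
print(window, may_frozen_early, may_frozen_later)
```

A scheduler could call reingest_window on every run and skip any partition for which is_frozen returns True.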

Should I inner-join or outer-join the two feeds?

Always outer-join. An inner join drops any instance family that appears in only one endpoint — common when a family has utilization but zero current coverage, or vice versa. Outer-join on the composite key and fill missing metrics with Decimal zero so no family silently disappears from the dashboard.

Related

Reserved Instance Mapping Logic — the upstream engine that produces the covered-hours and purchased-hours this page turns into coverage and utilization metrics.
Cross-Cloud Cost Allocation Strategies — the allocation layer that consumes these reconciled metrics to attribute commitment discounts to consuming teams.
AWS Cost Explorer Architecture — the acquisition surface that exposes GetReservationCoverage and GetReservationUtilization.
How to Structure AWS Cost Categories for Multi-Account Orgs — the account taxonomy that makes LinkedAccountId normalization reliable across consolidated billing.
Handling Billing API Rate Limits & Retries — the backoff and pagination patterns the extraction phase depends on.
