Azure Cost API Pagination and Deduplication Guide

The specific bottleneck this page solves is overlapping nextLink windows: the Azure Cost Management Query endpoint returns the same metered row across two adjacent pages, and a naive paginator writes both. At enterprise scope — cross-subscription or management-group queries against Azure Cost Management API Integration — this inflates reserved-instance commitments, double-books marketplace purchases, and produces reconciliation drift that no downstream GROUP BY can quietly absorb. The fix is not bigger machines or more retries; it is a deterministic pagination guard plus a cryptographic composite key applied before rows reach storage. This guide ships that pattern as a self-contained async paginator and closes with the production failure modes that defeat the obvious implementations.

Root Cause & Failure Modes

Azure Cost Management does not guarantee strict monotonic pagination. The Query API caps a single page at roughly 1,000 rows and hands back a properties.nextLink cursor for the remainder, but the underlying billing aggregation jobs run asynchronously and are only finalized 24–72 hours after usage. When a query spans multiple subscriptions or a management group, three failure modes emerge:

Cursor reset. If a backing aggregation job completes mid-traversal, the nextLink cursor can re-point at a window you have already yielded. The same meterId row surfaces on page 4 and again on page 7.
Boundary overlap from timezone drift. Filtering on properties/usageStart and properties/usageEnd without normalizing to properties/usageDate introduces UTC-to-local skew, so a single day’s consumption straddles two daily buckets and appears twice.
Provisional vs finalized rows. Re-ingesting a recent window pulls provisional figures that are later revised. Append-only writes leave both the stale and corrected rows in the table.
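
The boundary-overlap case is easy to reproduce in isolation. The sketch below is illustrative only: the helper name `utc_usage_date` and the sample timestamps are assumptions, not the exact formats Azure returns. It shows one instant landing in two daily buckets when the date is read before conversion to UTC, and in one bucket after normalization.

```python
from datetime import datetime, timezone


def utc_usage_date(timestamp: str) -> str:
    """Return the UTC calendar date (YYYY-MM-DD) for an ISO 8601 timestamp.

    Hypothetical helper: assumes the timestamp carries an explicit offset.
    """
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry an explicit UTC offset")
    return parsed.astimezone(timezone.utc).date().isoformat()


# The same instant written with two different offsets.
local_form = "2026-06-01T22:30:00-05:00"  # late evening, local time
utc_form = "2026-06-02T03:30:00+00:00"    # the same instant in UTC

# Bucketing on the literal date prefix splits one instant across two days.
naive_buckets = {local_form[:10], utc_form[:10]}

# Normalizing first collapses both spellings into a single daily bucket.
normalized_buckets = {utc_usage_date(local_form), utc_usage_date(utc_form)}

print(sorted(naive_buckets))
print(sorted(normalized_buckets))
```

Rejecting offset-less timestamps is a deliberate choice in this sketch: guessing a zone silently would reintroduce the skew it is meant to remove.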

The instinct is to clean this up downstream with UPSERT, window functions, or a nightly dedup job. That works numerically but pushes the cost onto every query, delays cost visibility past the reconciliation deadline, and still leaves a window where dashboards are wrong. Deduplication belongs at ingestion, in-stream, keyed on immutable dimensions — the same idempotency contract enforced across Async Billing Data Processing Patterns.

Production Pipeline Architecture

The execution model is four phases, each with a single responsibility, so a fault in one never corrupts the others:

Fetch — POST the query body for the first page; GET each nextLink verbatim with no body. These are semantically different calls: the POST carries the full payload and returns page one plus the cursor; each GET URL encodes the server-side cursor. Rate-limit handling (429 + Retry-After) wraps both, mirroring the matrix in Handling Billing API Rate Limits & Retries.
Cycle-guard — every cursor is recorded in a seen_cursors set before it is followed. A repeat cursor signals a reset loop and breaks traversal deterministically instead of paging forever.
Hash-dedup — each row is reduced to a SHA-256 digest over its immutable dimensions (SubscriptionId, MeterId, UsageDate, InstanceId, Quantity). A seen_hashes set drops repeats in O(1) before they leave the process.
Idempotent sink — unique rows batch into a MERGE/INSERT OVERWRITE keyed on the same dimensions, so late-arriving Azure credit corrections overwrite stale rows rather than inflating totals.
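
The idempotent-sink phase can be sketched with SQLite standing in for the warehouse. This is an assumption-laden illustration, not the article's sink: the table `cost_rows`, its columns, and the helper `merge_rows` are hypothetical, and SQLite's `INSERT ... ON CONFLICT DO UPDATE` is used as a stand-in for a warehouse MERGE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cost_rows (
        dedup_key    TEXT PRIMARY KEY,
        usage_date   TEXT NOT NULL,
        pre_tax_cost TEXT NOT NULL  -- decimal kept as text to avoid float drift
    )
    """
)


def merge_rows(rows):
    """Upsert (dedup_key, usage_date, pre_tax_cost) tuples; last write wins."""
    with conn:  # commit the batch atomically, roll back on error
        conn.executemany(
            """
            INSERT INTO cost_rows (dedup_key, usage_date, pre_tax_cost)
            VALUES (?, ?, ?)
            ON CONFLICT(dedup_key) DO UPDATE SET
                usage_date = excluded.usage_date,
                pre_tax_cost = excluded.pre_tax_cost
            """,
            rows,
        )


# Initial load, followed by a late correction to the first row.
merge_rows([("key-1", "2026-06-01", "10.00"), ("key-2", "2026-06-01", "4.50")])
merge_rows([("key-1", "2026-06-01", "9.20")])

stored = dict(conn.execute("SELECT dedup_key, pre_tax_cost FROM cost_rows").fetchall())
print(stored)
```

Because the key excludes cost, the correction lands on the existing row and the table still holds two rows after the second batch.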

The composite key is the load-bearing decision. For single-node ingestion a native set of hex digests holds millions of hashes in modest memory; for distributed workers, offload the registry to a Redis-backed cache and expire hashes older than the 30–45 day reconciliation window. The daily rollup that consumes these clean rows is detailed in Time-Series Aggregation for Daily Cloud Cost Tracking.
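
The expiry behaviour of a shared registry can be exercised without a cache server. The class below is a hypothetical in-memory stand-in (`ExpiringHashRegistry` and its methods are not from any library); it mimics "record if absent, with a TTL" semantics and takes an injectable clock so the 45-day window can be tested deterministically.

```python
import time
from typing import Callable, Dict


class ExpiringHashRegistry:
    """In-memory stand-in for a shared hash registry with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def add_if_new(self, digest: str) -> bool:
        """Return True if digest was unseen (or expired) and is now recorded."""
        now = self._clock()
        expiry = self._expires_at.get(digest)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[digest] = now + self._ttl
        return True

    def purge_expired(self) -> int:
        """Drop expired entries and return how many were removed."""
        now = self._clock()
        stale = [d for d, expiry in self._expires_at.items() if expiry <= now]
        for digest in stale:
            del self._expires_at[digest]
        return len(stale)


# Fake clock so the example does not depend on wall time.
fake_now = [0.0]
registry = ExpiringHashRegistry(ttl_seconds=45 * 86400, clock=lambda: fake_now[0])

first = registry.add_if_new("abc")         # unseen: recorded
repeat = registry.add_if_new("abc")        # inside the window: rejected
fake_now[0] = 46 * 86400                   # jump past the 45-day window
purged = registry.purge_expired()
after_expiry = registry.add_if_new("abc")  # expired entries count as new again
```

With Redis, the equivalent of `add_if_new` is usually a single `SET` with the `NX` and `EX` options, which makes the check-and-record step atomic across workers.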

Step-by-Step Python Implementation

The module below combines async pagination, 429 backoff, cursor cycle detection, and streaming deduplication. It is import-complete and runnable against a real subscription with a valid bearer token.

import asyncio
import hashlib
import logging
from typing import Any, AsyncGenerator, Dict, List, Optional, Set

import httpx

logger = logging.getLogger(__name__)


class AzureCostPaginator:
    """Async paginator for the Azure Cost Management Query API.

    The Query endpoint uses POST for the initial request and GET for nextLink
    pages. This class handles both, plus rate-limit backoff, cursor cycle
    detection, and in-stream SHA-256 deduplication.
    """

    def __init__(self, token: str, base_url: str = "https://management.azure.com"):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            timeout=httpx.Timeout(60.0, connect=10.0),
            limits=httpx.Limits(max_connections=10),
        )
        self._seen_cursors: Set[str] = set()

    async def aclose(self) -> None:
        await self.client.aclose()

    async def _request_with_backoff(
        self,
        method: str,
        url: str,
        body: Optional[Dict[str, Any]] = None,
        max_retries: int = 5,
    ) -> Dict[str, Any]:
        retries = 0
        while retries < max_retries:
            try:
                if method.upper() == "POST":
                    resp = await self.client.post(url, json=body)
                else:
                    resp = await self.client.get(url)
            except httpx.RequestError as exc:
                # Transport-level failure (DNS, connect, read timeout): back off and retry.
                logger.error("Request failed: %s", exc)
                retries += 1
                await asyncio.sleep(min(2 ** retries, 30))
                continue

            if resp.status_code == 429:
                # Retry-After may be delta-seconds or an HTTP-date; fall back to
                # exponential backoff when it is absent or not a plain number.
                try:
                    retry_after = float(resp.headers.get("Retry-After", ""))
                except ValueError:
                    retry_after = float(2 ** retries)
                logger.warning("Rate limited. Backing off for %.2fs", retry_after)
                await asyncio.sleep(retry_after)
                retries += 1
                continue
            # Any other non-2xx status is not retried and propagates to the caller.
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError(f"Max retries ({max_retries}) exceeded for {url}")

    async def paginate(
        self, query_url: str, query_body: Dict[str, Any]
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Yield raw pages: POST the first, GET each nextLink verbatim."""
        payload = await self._request_with_backoff("POST", query_url, body=query_body)
        yield payload

        while True:
            next_link = payload.get("properties", {}).get("nextLink")
            if not next_link:
                break
            if next_link in self._seen_cursors:
                logger.warning("Cycle detected in pagination at %s. Breaking.", next_link)
                break
            self._seen_cursors.add(next_link)
            payload = await self._request_with_backoff("GET", next_link)
            yield payload

    @staticmethod
    def generate_record_hash(columns: List[str], row: List[Any]) -> str:
        """Derive a deduplication hash from immutable, column-named dimensions."""
        row_dict = dict(zip(columns, row))
        composite = (
            f"{row_dict.get('SubscriptionId', '')}|"
            f"{row_dict.get('MeterId', '')}|"
            f"{row_dict.get('UsageDate', '')}|"
            f"{row_dict.get('InstanceId', '')}|"
            f"{row_dict.get('Quantity', 0)}"
        )
        return hashlib.sha256(composite.encode("utf-8")).hexdigest()

    async def stream_deduplicated(
        self, query_url: str, query_body: Dict[str, Any]
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Yield unique, deduplicated row dicts from every page."""
        seen_hashes: Set[str] = set()
        async for page in self.paginate(query_url, query_body):
            props = page.get("properties", {})
            columns = [col["name"] for col in props.get("columns", [])]
            for row in props.get("rows", []):
                record_hash = self.generate_record_hash(columns, row)
                if record_hash not in seen_hashes:
                    seen_hashes.add(record_hash)
                    yield dict(zip(columns, row))
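
The pipe-joined key above has three soft spots: a missing column silently hashes as an empty string, `4` and `4.0` hash differently, and a value containing `|` can shift field boundaries. One possible hardening is sketched below. It is a variant under stated assumptions rather than the method used in the class: `KEY_COLUMNS`, `canonical_value`, and `record_hash` are hypothetical names, and the JSON-array encoding is a swapped-in technique.

```python
import hashlib
import json
from decimal import Decimal
from typing import Any, Dict, Sequence

KEY_COLUMNS = ("SubscriptionId", "MeterId", "UsageDate", "InstanceId", "Quantity")


def canonical_value(value: Any) -> str:
    """Render a value so numerically equal inputs produce the same text."""
    if isinstance(value, bool) or value is None:
        return str(value)
    if isinstance(value, (int, float, Decimal)):
        # Go through str() so 4, 4.0 and Decimal("4.00") all normalize to "4".
        return format(Decimal(str(value)).normalize(), "f")
    return str(value)


def record_hash(columns: Sequence[str], row: Sequence[Any]) -> str:
    """Hash the key columns; fail loudly if the query did not return them."""
    row_dict: Dict[str, Any] = dict(zip(columns, row))
    missing = [name for name in KEY_COLUMNS if name not in row_dict]
    if missing:
        raise KeyError(f"key columns missing from response: {missing}")
    # A JSON array keeps field boundaries unambiguous even if a value contains "|".
    payload = json.dumps([canonical_value(row_dict[name]) for name in KEY_COLUMNS])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cols = list(KEY_COLUMNS)
same = record_hash(cols, ["s", "m", "2026-06-01", "vm", 4]) == record_hash(
    cols, ["s", "m", "2026-06-01", "vm", 4.0]
)
print(same)
```

Raising on a missing column trades convenience for safety: a query that omits the key dimensions stops the run instead of collapsing unrelated rows into one digest.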

Execution Pattern

The generator yields clean row dicts one at a time, enabling backpressure-friendly streaming into a warehouse. Batch before writing so the sink commits in idempotent chunks:

async def write_to_warehouse(records: List[Dict[str, Any]]) -> None:
    """Stub: replace with a MERGE/INSERT OVERWRITE keyed on the dedup dimensions."""
    logger.info("Writing %d records to warehouse.", len(records))


async def consume_billing_stream(
    query_url: str,
    query_body: Dict[str, Any],
    token: str,
    batch_size: int = 5000,
) -> None:
    paginator = AzureCostPaginator(token=token)
    buffer: List[Dict[str, Any]] = []
    try:
        async for record in paginator.stream_deduplicated(query_url, query_body):
            buffer.append(record)
            if len(buffer) >= batch_size:
                await write_to_warehouse(buffer)
                buffer.clear()
        if buffer:
            await write_to_warehouse(buffer)
    finally:
        await paginator.aclose()


async def main() -> None:
    import os

    subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
    token = os.environ["AZURE_BEARER_TOKEN"]  # Obtain via DefaultAzureCredential

    query_url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/providers/Microsoft.CostManagement/query?api-version=2023-11-01"
    )
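    # NOTE: generate_record_hash reads SubscriptionId, MeterId, UsageDate,
    # InstanceId, and Quantity by column name. This sample groups only by
    # ResourceGroup, so most of those columns are absent from the response and
    # rows from the same day would hash identically. Request the key columns
    # your scope exposes, or adapt the key, before relying on the dedup filter.
    query_body = {
        "type": "ActualCost",
        "timeframe": "BillingMonthToDate",
        "dataset": {
            "granularity": "Daily",
            "aggregation": {"totalCost": {"name": "PreTaxCost", "function": "Sum"}},
            "grouping": [{"type": "Dimension", "name": "ResourceGroup"}],
        },
    }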
    await consume_billing_stream(query_url, query_body, token=token)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(main())

Verification & Testing

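Correctness here means zero duplicate rows for a known input, so assert it directly rather than eyeballing totals. A full test drives the dedup logic with synthetic pages that deliberately overlap and a cursor that loops. Start with the two hash invariants the filter depends on: identical dimensions must hash identically, and a changed dimension must not.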

def test_dedup_drops_overlapping_rows() -> None:
    columns = ["SubscriptionId", "MeterId", "UsageDate", "InstanceId", "Quantity"]
    row = ["sub-1", "meter-a", "2026-06-01", "vm-1", 4]
    h1 = AzureCostPaginator.generate_record_hash(columns, row)
    h2 = AzureCostPaginator.generate_record_hash(columns, list(row))  # same dims
    assert h1 == h2, "identical dimensions must hash identically"

    changed = ["sub-1", "meter-a", "2026-06-01", "vm-1", 5]  # quantity differs
    assert AzureCostPaginator.generate_record_hash(columns, changed) != h1


if __name__ == "__main__":
    test_dedup_drops_overlapping_rows()
    print("dedup invariants hold")
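
The overlap and loop cases can also be exercised end to end without a network. The sketch below is self-contained and does not use the paginator class: `FAKE_PAGES`, `row_hash`, and `walk` are hypothetical stand-ins that apply the same cycle-guard and hash-filter idea to an in-memory fixture.

```python
import hashlib
from typing import Any, Dict, Iterator, List

# Hypothetical fixture: cursor -> page. Page "p2" repeats a row from "p1",
# and "p3" points back at "p2", simulating a cursor reset loop.
FAKE_PAGES: Dict[str, Dict[str, Any]] = {
    "p1": {
        "rows": [["sub-1", "meter-a", "2026-06-01", "vm-1", 4],
                 ["sub-1", "meter-b", "2026-06-01", "vm-2", 1]],
        "nextLink": "p2",
    },
    "p2": {
        "rows": [["sub-1", "meter-b", "2026-06-01", "vm-2", 1],  # overlap with p1
                 ["sub-1", "meter-c", "2026-06-01", "vm-3", 7]],
        "nextLink": "p3",
    },
    "p3": {
        "rows": [["sub-1", "meter-d", "2026-06-01", "vm-4", 2]],
        "nextLink": "p2",  # loops back to an already-visited cursor
    },
}


def row_hash(row: List[Any]) -> str:
    """SHA-256 over the pipe-joined row values."""
    return hashlib.sha256("|".join(str(v) for v in row).encode("utf-8")).hexdigest()


def walk(start: str) -> Iterator[List[Any]]:
    """Follow cursors with a cycle guard and yield each unique row once."""
    seen_cursors = set()
    seen_hashes = set()
    cursor = start
    while cursor and cursor not in seen_cursors:
        seen_cursors.add(cursor)
        page = FAKE_PAGES[cursor]
        for row in page["rows"]:
            digest = row_hash(row)
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                yield row
        cursor = page.get("nextLink")


rows = list(walk("p1"))
print([row[1] for row in rows])
```

The traversal terminates when `p3` points back at `p2`, and the repeated `meter-b` row is emitted only once.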

Operationally, verify three signals after each run: (1) the emitted row count is not an exact multiple of the page size — a clean multiple usually means a dropped nextLink; (2) SELECT dedup_key, COUNT(*) ... HAVING COUNT(*) > 1 returns zero rows in the sink; and (3) the post-dedup Sum(PreTaxCost) reconciles to the Azure invoice to the cent (store cost as Decimal, never float). For high-stakes loads, run a dry-run that paginates and hashes but skips the warehouse write, logging the duplicate-drop count — a non-zero count confirms overlap is actually occurring and being caught.
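
Signals (2) and (3) can be scripted. The sketch below uses an in-memory SQLite table as a stand-in for the sink; the table name `cost_rows`, the sample values, and `invoice_total` are hypothetical. It also shows why the cost column should be summed as Decimal.

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cost_rows (dedup_key TEXT, pre_tax_cost TEXT)")
conn.executemany(
    "INSERT INTO cost_rows VALUES (?, ?)",
    [("k1", "0.10"), ("k2", "0.20"), ("k3", "0.30")],
)

# Signal 2: any key that appears more than once is a dedup failure.
duplicates = conn.execute(
    "SELECT dedup_key, COUNT(*) FROM cost_rows "
    "GROUP BY dedup_key HAVING COUNT(*) > 1"
).fetchall()

# Signal 3: sum as Decimal so the total can be compared to the cent.
costs = [row[0] for row in conn.execute("SELECT pre_tax_cost FROM cost_rows")]
decimal_total = sum((Decimal(c) for c in costs), Decimal("0"))
float_total = sum(float(c) for c in costs)

invoice_total = Decimal("0.60")  # hypothetical invoice figure

print(duplicates, decimal_total == invoice_total, float_total == 0.6)
```

The float total misses 0.6 by a rounding error even on three rows, which is the drift the Decimal rule is meant to prevent at invoice scale.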

Common Pitfalls Checklist

Resending the body on nextLink GETs. Azure ignores the body and may reset the cursor. Fix: follow nextLink verbatim with no payload.
Hashing mutable fields. Including PreTaxCost or Currency in the key makes a revised row look new. Fix: hash only immutable dimensions; let the sink MERGE handle value changes.
Unpinned api-version. A drifting version reorders columns and shifts your projection. Fix: pin the version and resolve fields through the column name→index map, never fixed offsets.
Unbounded hash set on multi-TB exports. Memory grows until the worker OOMs. Fix: cap the window to the 30–45 day reconciliation period or offload to Redis with TTL.
Parallelizing within one scope. Concurrent calls share a per-scope rate budget and trigger sustained 429s. Fix: serialize per-scope behind a lease and scale out across subscriptions.
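
The last pitfall, serializing within a scope while scaling out across scopes, can be sketched with one `asyncio.Lock` per scope. This is an in-process stand-in for the lease the checklist mentions, not a distributed lease: `fetch_page`, the scope names, and the latency are hypothetical.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List

# Locks are created lazily, on first use inside the running event loop.
scope_locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
in_flight: Dict[str, int] = defaultdict(int)
peak: Dict[str, int] = defaultdict(int)


async def fetch_page(scope: str, page: int) -> str:
    """Stand-in for one Query call; records how many calls overlap per scope."""
    async with scope_locks[scope]:      # at most one call at a time per scope
        in_flight[scope] += 1
        peak[scope] = max(peak[scope], in_flight[scope])
        await asyncio.sleep(0.01)       # simulated network latency
        in_flight[scope] -= 1
    return f"{scope}:{page}"


async def run() -> List[str]:
    scopes = ["sub-a", "sub-b", "sub-c"]
    tasks = [fetch_page(scope, page) for scope in scopes for page in range(3)]
    # Different scopes proceed concurrently; pages within a scope queue up.
    return await asyncio.gather(*tasks)


results = asyncio.run(run())
print(dict(peak))
```

Across multiple worker processes the lock would need to be replaced by a shared lease, since an in-process lock only protects a single event loop.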

Frequently Asked Questions

Why does the same Azure billing row appear on two pages?

Billing aggregation jobs run asynchronously and finalize 24–72 hours after usage. If a job completes mid-traversal, the nextLink cursor can re-point at an already-yielded window, so the same meterId row surfaces twice; timezone skew at day boundaries can duplicate a row as well. The cycle-detection guard stops a looping cursor, and the in-stream hash filter drops the repeated rows.

Should the dedup key include PreTaxCost?

No. The key must be built only from immutable dimensions (SubscriptionId, MeterId, UsageDate, InstanceId, Quantity). Cost values are revised by Azure credit corrections; if they are in the hash, a corrected row reads as a brand-new record and you double-count.

Is a Python set enough, or do I need Redis?

A native set of SHA-256 hex digests handles millions of rows on a single node. Move to a Redis-backed registry only for distributed workers or multi-terabyte exports, and expire entries past the 30–45 day reconciliation window to bound memory.

Why deduplicate in-stream instead of with a database UPSERT?

Downstream dedup pushes compute onto every query, delays cost visibility past the reconciliation deadline, and leaves a window where dashboards double-count. In-stream hashing drops duplicates in O(1) before they ever reach storage.

How does this differ from rate-limit retries?

Retries solve transient throttling — surviving a 429. Deduplication solves semantic duplication — the same row legitimately returned twice. They are orthogonal layers; this page assumes the retry layer is already in place around every call.

Related

Azure Cost Management API Integration: the parent integration whose nextLink pagination and dedup contract this guide implements in depth.
Cloud Billing Data Ingestion & Parsing — the pipeline whose idempotent-ingestion guarantee this pattern upholds for Azure.
Handling Billing API Rate Limits & Retries — the Retry-After and backoff layer that wraps every fetch in the paginator.
Async Billing Data Processing Patterns — the queue topology and cursor-state contract that scale this paginator across workers.
Time-Series Aggregation for Daily Cloud Cost Tracking — the daily rollup that consumes the deduplicated rows this page produces.

