GCP BigQuery Billing Export Sync

Google Cloud’s detailed billing export is the GCP equivalent of an authoritative spend ledger: per-SKU, resource-level rows with usage amounts, credits, currency conversion, and project labels that no console report exposes. Unlike AWS, GCP does not drop files in a bucket. Cloud Billing streams cost rows natively into a BigQuery table (gcp_billing_export_resource_v1_<BILLING_ACCOUNT_ID>) that Google owns and continuously restates.

This page covers the GCP acquisition stage of the billing pipeline. The engineering problem is not parsing a file; it is syncing a vendor-managed, append-and-restate export table into your own curated, query-optimized FinOps table without double-counting late-arriving rows. It implements the GCP branch of the four-stage model defined in Cloud Billing Data Ingestion & Parsing: acquisition, normalization, allocation, persistence. The export table is partitioned on _PARTITIONTIME (load time), not on usage_start_time, and Google rewrites rows for up to ~30 days as credits and adjustments settle, so a naive append filtered on WHERE DATE(usage_start_time) = yesterday misses late-arriving rows and, if re-run, silently duplicates restated cost.

Architecture Context & Data-Flow Position

BigQuery export sync is the first stage of the GCP pipeline. Once detailed export is enabled (see GCP Billing Export Configuration for the account-level setup), Cloud Billing writes rows into the export dataset on a rolling, eventually-consistent cadence — typically within hours, but restated rows for credits, taxes, and adjustments can land days later under a new export_time. Because the raw export table is Google-managed, immutable to you, and partitioned by ingestion time rather than usage time, you cannot query it directly for FinOps reporting without scanning the whole table. The sync engine therefore reads only the rows whose export_time exceeds a stored watermark, computes which usage_start_time day-partitions those rows touch, and atomically replaces exactly those partitions in a curated table partitioned by DATE(usage_start_time) and clustered on service.id, project.id, and sku.id. Everything downstream — tag-based allocation, commitment amortization, anomaly detection — assumes this stage delivered an exactly-once, partition-stable record set.

Figure: the GCP export sync stage. Cloud Billing streams into a Google-managed, restatement-prone table; the engine rewinds the export_time watermark, selects the touched usage_start_time partitions, and atomically replaces them under DML backoff into a curated table pruned for downstream allocation.

The export schema is the contract. These nested fields drive every downstream decision:

Export field	Type	Purpose	Constraint
`usage_start_time`	TIMESTAMP	When usage occurred	Partition key of the curated table, not of the source
`export_time`	TIMESTAMP	When Google wrote/restated the row	The incremental watermark column
`cost`	FLOAT64	Pre-credit cost in `currency`	Net cost requires adding `credits.amount`
`credits`	REPEATED RECORD	Discounts, promotions, SUDs, CUDs	Negative amounts; must be summed for net cost
`service.id` / `sku.id`	STRING	Service and SKU identifiers	Clustering keys for query pruning
`project.id`	STRING	Consuming project	Primary showback dimension; nullable for account-level charges
`invoice.month`	STRING (`YYYYMM`)	Invoice period the row finalizes into	Reconciliation key against the published invoice

This is the partition-pruning counterpart to the manifest-driven discovery used by the sibling AWS CUR to Data Lake Pipeline and the cursor pagination used by Azure Cost Management API Integration. The deep treatment of watermark tracking, partition-level overwrite, and job deduplication lives in Incremental Sync Strategies for GCP Billing Exports.
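
To make the cost and credits rows of the field table above concrete, the sketch below computes net cost for one export row held as a plain Python dict. The row shape mirrors the export's nested fields, but the function name net_cost and the sample values are invented for illustration and are not part of any Google library.

```python
from typing import Any, Dict


def net_cost(row: Dict[str, Any]) -> float:
    """Return pre-credit cost plus the (negative) credit amounts for one row."""
    credits = row.get("credits") or []  # REPEATED RECORD; may be empty or missing
    return row["cost"] + sum(credit["amount"] for credit in credits)


# Hypothetical row shaped like a detailed-export record (values invented).
example_row = {
    "cost": 10.0,
    "credits": [
        {"name": "Sustained use discount", "amount": -2.5},
        {"name": "Promotion", "amount": -0.5},
    ],
}

print(net_cost(example_row))  # 7.0
```

Reporting the cost column alone for this row would overstate spend by 3.0, the summed value of the two credits.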

Core Implementation Patterns

1. Least-Privilege IAM

The sync runs as a dedicated service account with the narrowest viable grants. It needs roles/bigquery.dataViewer on the export dataset (read-only — you never mutate Google’s table), roles/bigquery.dataEditor on the curated dataset, and roles/bigquery.jobUser on the project to run query jobs. roles/billing.viewer is useful only for cross-validating totals via the Cloud Billing API. In regulated environments, place both datasets inside a VPC Service Controls perimeter so export data cannot egress. Never grant roles/bigquery.admin; the sync creates no datasets at runtime.

from google.cloud import bigquery

# Authenticate via Workload Identity / attached service account — never a JSON key file.
# The account carries dataViewer on export, dataEditor on curated, jobUser on the project.
client = bigquery.Client(project="finops-prod")
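
One way to keep the grants honest is to audit them against the role list from the paragraph above. The following is a minimal sketch that works on plain dicts rather than calling IAM: the scope labels, the audit_grants helper, and the sample grants are all hypothetical, and fetching the real bindings is left out.

```python
from typing import Dict, Set

# Roles named in the text. The scope labels are illustrative keys,
# not IAM resource names.
EXPECTED: Dict[str, Set[str]] = {
    "export_dataset": {"roles/bigquery.dataViewer"},
    "curated_dataset": {"roles/bigquery.dataEditor"},
    "project": {"roles/bigquery.jobUser"},
}
OPTIONAL: Dict[str, Set[str]] = {"billing_account": {"roles/billing.viewer"}}
FORBIDDEN: Set[str] = {"roles/bigquery.admin"}


def audit_grants(granted: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Compare granted roles per scope with the least-privilege set."""
    report: Dict[str, Set[str]] = {"missing": set(), "excess": set(), "forbidden": set()}
    for scope in sorted(set(EXPECTED) | set(granted)):
        want = EXPECTED.get(scope, set())
        allowed = want | OPTIONAL.get(scope, set())
        have = granted.get(scope, set())
        report["missing"] |= {f"{scope}:{role}" for role in want - have}
        for role in have - allowed:
            bucket = "forbidden" if role in FORBIDDEN else "excess"
            report[bucket].add(f"{scope}:{role}")
    return report


# Hypothetical grants: correct except for an admin role on the project.
granted = {
    "export_dataset": {"roles/bigquery.dataViewer"},
    "curated_dataset": {"roles/bigquery.dataEditor"},
    "project": {"roles/bigquery.jobUser", "roles/bigquery.admin"},
}
print(audit_grants(granted)["forbidden"])  # {'project:roles/bigquery.admin'}
```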

2. Partition and Clustering Strategy

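The source export table is partitioned by ingestion time, which is useless for usage-period reporting. The curated table must be partitioned by DATE(usage_start_time) so downstream WHERE clauses prune to the days they need, and clustered on the highest-cardinality query dimensions. BigQuery accepts only top-level, non-repeated columns as clustering keys, so the nested service.id, project.id, and sku.id must be projected into flattened service_id, project_id, and sku_id columns (the normalized names in the Schema Reference Table) for the CLUSTER BY clause to be accepted. Create it once with DDL: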

CREATE TABLE IF NOT EXISTS `finops-prod.finops_billing.billing_curated`
PARTITION BY DATE(usage_start_time)
CLUSTER BY service_id, project_id, sku_id
AS SELECT * FROM `finops-prod.billing_export.gcp_billing_export_resource_v1_0X0X0X`
WHERE FALSE;  -- schema-only clone; rows arrive through the sync

Partition pruning is what keeps BigQuery’s $6.25/TiB on-demand scan charge bounded — a 90-day report should scan ~90 partitions, not the whole multi-terabyte history.
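
The arithmetic behind that claim can be sketched in a few lines. The rate is the one quoted above (check current pricing for your region); the table size of roughly 2 GiB per usage-day partition over three years is an invented assumption, as is the helper name scan_cost_usd.

```python
TIB = 2 ** 40
ON_DEMAND_USD_PER_TIB = 6.25  # rate quoted in the text; verify for your region


def scan_cost_usd(bytes_scanned: int, usd_per_tib: float = ON_DEMAND_USD_PER_TIB) -> float:
    """On-demand cost of a query that scans the given number of bytes."""
    return bytes_scanned / TIB * usd_per_tib


# Hypothetical curated table: ~2 GiB per usage-day partition, 3 years of history.
bytes_per_day = 2 * 2 ** 30
pruned = scan_cost_usd(90 * bytes_per_day)        # 90-day report with pruning
full = scan_cost_usd(3 * 365 * bytes_per_day)     # same report scanning everything

print(f"{pruned:.2f} vs {full:.2f}")  # 1.10 vs 13.37
```

Under these assumptions the unpruned query costs about twelve times more per run, and the gap widens as history accumulates.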

3. Watermark-Driven Incremental Selection

Full-table reloads are computationally and financially prohibitive at export scale. Track the maximum export_time already synced into the curated table, then select only newer rows — but rewind the watermark by a buffer (default 35 days) because Google restates old usage days under a fresh export_time. The affected usage partitions, not the raw rows, define the replacement set:

-- The set of usage-day partitions touched by rows written since the watermark.
-- @watermark is the rewound value: MAX(export_time) already synced minus the restatement buffer.
SELECT DISTINCT DATE(usage_start_time) AS usage_date
FROM `finops-prod.billing_export.gcp_billing_export_resource_v1_0X0X0X`
WHERE export_time > @watermark;
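
The same selection can be sketched in plain Python to show what the rewind changes. This is an in-memory illustration, not a BigQuery call: the helper names and the sample rows are invented, and dates are taken in UTC.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def affected_usage_dates(
    rows: Iterable[Dict[str, Any]],
    watermark: datetime,
    restatement_days: int = 35,
) -> List[date]:
    """Usage days touched by rows exported after the rewound watermark."""
    floor = watermark - timedelta(days=restatement_days)
    touched = {r["usage_start_time"].date() for r in rows if r["export_time"] > floor}
    return sorted(touched)


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical export rows (only the two timestamps matter here).
sample_rows = [
    {"usage_start_time": utc(2024, 3, 1), "export_time": utc(2024, 3, 2)},    # settled long ago
    {"usage_start_time": utc(2024, 4, 28), "export_time": utc(2024, 5, 1)},   # inside the rewind buffer
    {"usage_start_time": utc(2024, 4, 10), "export_time": utc(2024, 5, 22)},  # old day, restated
    {"usage_start_time": utc(2024, 5, 21), "export_time": utc(2024, 5, 22)},  # fresh usage
]

dates = affected_usage_dates(sample_rows, watermark=utc(2024, 5, 20))
print(dates)
# [datetime.date(2024, 4, 10), datetime.date(2024, 4, 28), datetime.date(2024, 5, 21)]
```

With restatement_days=0 the 2024-04-28 partition drops out of the result, which is the case the buffer exists to cover.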

4. Idempotent Partition Replacement

Billing export rows carry no natural primary key, so a row-level MERGE cannot deduplicate restatements. The deterministic pattern is partition replacement inside a multi-statement transaction: delete the affected usage_start_time partitions from the curated table, then re-insert them in full from the export. Re-running the same window against an unchanged export reproduces identical partition contents, which is what makes the load idempotent.

BEGIN TRANSACTION;
  DELETE FROM `finops-prod.finops_billing.billing_curated`
  WHERE DATE(usage_start_time) IN UNNEST(@affected_dates);

  INSERT INTO `finops-prod.finops_billing.billing_curated`
  SELECT * FROM `finops-prod.billing_export.gcp_billing_export_resource_v1_0X0X0X`
  WHERE DATE(usage_start_time) IN UNNEST(@affected_dates);
COMMIT TRANSACTION;
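
A toy in-memory model shows why replacement is idempotent where a bare append is not. Lists of dicts stand in for the two tables; the function names and rows are hypothetical, and nothing here talks to BigQuery.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def naive_append(curated: List[Row], export: List[Row], dates: Set[str]) -> List[Row]:
    """Bare INSERT: adds the selected export rows on top of whatever exists."""
    return curated + [r for r in export if r["usage_date"] in dates]


def replace_partitions(curated: List[Row], export: List[Row], dates: Set[str]) -> List[Row]:
    """DELETE + INSERT: drop the affected days, then reload them in full."""
    kept = [r for r in curated if r["usage_date"] not in dates]
    return kept + [r for r in export if r["usage_date"] in dates]


def total(rows: List[Row]) -> float:
    return sum(r["cost"] for r in rows)


# Hypothetical export contents and replacement window.
export = [
    {"usage_date": "2024-05-01", "cost": 10.0},
    {"usage_date": "2024-05-02", "cost": 5.0},
]
window = {"2024-05-01", "2024-05-02"}

replaced_once = replace_partitions([], export, window)
replaced_twice = replace_partitions(replaced_once, export, window)
appended_twice = naive_append(naive_append([], export, window), export, window)

print(total(replaced_twice), total(appended_twice))  # 15.0 30.0
```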

Production-Grade Python Ingestion Engine

The module below ties the patterns together into one self-contained sync orchestrator. It reads the export_time watermark from the curated table, rewinds by a restatement buffer, computes the affected usage_start_time partitions, and atomically replaces them in a single BigQuery transaction. A retry decorator handles transient InternalServerError/TooManyRequests job failures; structured logging and a dataclass config keep it operable. Dependencies: google-cloud-bigquery>=3.11.0.

import logging
import random
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from functools import wraps
from typing import List, Optional

from google.cloud import bigquery
from google.api_core.exceptions import (
    GoogleAPICallError,
    InternalServerError,
    NotFound,
    ServiceUnavailable,
    TooManyRequests,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("gcp.billing.sync")

# Transient BigQuery job errors that warrant a backoff-and-retry.
TRANSIENT = (InternalServerError, ServiceUnavailable, TooManyRequests)


def with_backoff(max_attempts: int = 5):
    """Retry decorator: exponential backoff with jitter on transient BigQuery errors."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TRANSIENT as exc:
                    if attempt == max_attempts:
                        raise
                    delay = min(2 ** attempt + random.uniform(0, 1), 30)
                    logger.warning(
                        "transient error (%s); retry %d/%d in %.1fs",
                        type(exc).__name__, attempt, max_attempts, delay,
                    )
                    time.sleep(delay)
            raise RuntimeError("unreachable")
        return wrapper
    return decorator


@dataclass(frozen=True)
class SyncConfig:
    """All identifiers and tuning knobs for one billing-account sync."""

    project: str
    export_table: str          # billing_export.gcp_billing_export_resource_v1_0X0X0X
    curated_table: str         # finops_billing.billing_curated
    restatement_days: int = 35  # how far to rewind the watermark for late restatements
    epoch: datetime = datetime(2020, 1, 1, tzinfo=timezone.utc)

    @property
    def export_ref(self) -> str:
        return f"`{self.project}.{self.export_table}`"

    @property
    def curated_ref(self) -> str:
        return f"`{self.project}.{self.curated_table}`"


class BillingExportSync:
    """Idempotently sync Google's native BigQuery billing export into a curated table."""

    def __init__(self, cfg: SyncConfig) -> None:
        self.cfg = cfg
        self.client = bigquery.Client(project=cfg.project)

    @with_backoff()
    def _query(self, sql: str, params: Optional[list] = None) -> bigquery.table.RowIterator:
        job_config = bigquery.QueryJobConfig(query_parameters=params or [])
        return self.client.query(sql, job_config=job_config).result()

    def read_watermark(self) -> datetime:
        """Highest export_time already synced; falls back to epoch on first run."""
        sql = f"SELECT MAX(export_time) AS wm FROM {self.cfg.curated_ref}"
        try:
            row = next(iter(self._query(sql)), None)
        except NotFound:
            logger.info("curated table missing; initialising watermark to epoch")
            return self.cfg.epoch
        if row is None or row["wm"] is None:
            logger.info("curated table empty; initialising watermark to epoch")
            return self.cfg.epoch
        return row["wm"]

    def affected_partitions(self, watermark: datetime) -> List[str]:
        """Usage-day partitions touched by rows written since the rewound watermark."""
        floor = watermark - timedelta(days=self.cfg.restatement_days)
        sql = f"""
            SELECT DISTINCT FORMAT_DATE('%Y-%m-%d', DATE(usage_start_time)) AS d
            FROM {self.cfg.export_ref}
            WHERE export_time > @floor
            ORDER BY d
        """
        params = [bigquery.ScalarQueryParameter("floor", "TIMESTAMP", floor)]
        dates = [r["d"] for r in self._query(sql, params)]
        logger.info("watermark=%s floor=%s -> %d affected partitions",
                    watermark.isoformat(), floor.isoformat(), len(dates))
        return dates

    @with_backoff()
    def replace_partitions(self, dates: List[str]) -> int:
        """Atomically delete and reload the affected usage-day partitions."""
        sql = f"""
            BEGIN TRANSACTION;
              DELETE FROM {self.cfg.curated_ref}
              WHERE DATE(usage_start_time) IN UNNEST(@dates);

              INSERT INTO {self.cfg.curated_ref}
              SELECT * FROM {self.cfg.export_ref}
              WHERE DATE(usage_start_time) IN UNNEST(@dates);
            COMMIT TRANSACTION;
        """
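        # DATE-typed array so IN UNNEST(@dates) compares DATE with DATE; ISO 'YYYY-MM-DD' strings are accepted.
        params = [bigquery.ArrayQueryParameter("dates", "DATE", dates)]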
        job = self.client.query(sql, job_config=bigquery.QueryJobConfig(query_parameters=params))
        job.result()  # blocks until the transaction commits
        rows = job.num_dml_affected_rows or 0
        logger.info("replaced %d partition(s); %d rows affected", len(dates), rows)
        return rows

    def run(self) -> None:
        logger.info("starting GCP billing export sync for %s", self.cfg.export_table)
        watermark = self.read_watermark()
        dates = self.affected_partitions(watermark)
        if not dates:
            logger.info("no new export rows since watermark; nothing to sync")
            return
        self.replace_partitions(dates)
        logger.info("sync complete")


def main() -> None:
    cfg = SyncConfig(
        project="finops-prod",
        export_table="billing_export.gcp_billing_export_resource_v1_0X0X0X",
        curated_table="finops_billing.billing_curated",
    )
    BillingExportSync(cfg).run()


if __name__ == "__main__":
    main()
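
When sizing a scheduler timeout around this engine, it helps to know how long the retry decorator can sleep before it gives up. The sketch below reproduces only the delay formula from with_backoff above, min(2 ** attempt + uniform(0, 1), 30), as a pure function; backoff_bounds is a hypothetical helper and does not import or call the engine.

```python
from typing import List, Tuple


def backoff_bounds(
    max_attempts: int = 5, cap: float = 30.0, jitter: float = 1.0
) -> List[Tuple[float, float]]:
    """(min, max) seconds slept after each failed attempt except the last.

    The final attempt re-raises instead of sleeping, so there are
    max_attempts - 1 sleeps in the worst case.
    """
    bounds = []
    for attempt in range(1, max_attempts):
        low = min(2 ** attempt, cap)            # jitter draws 0
        high = min(2 ** attempt + jitter, cap)  # jitter draws its maximum
        bounds.append((low, high))
    return bounds


bounds = backoff_bounds()
print(bounds)  # [(2, 3.0), (4, 5.0), (8, 9.0), (16, 17.0)]
print(sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds))  # 30 34.0
```

With the defaults, a call that fails on every attempt spends between 30 and 34 seconds sleeping, on top of the time the five job attempts themselves take.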

Schema Reference Table

The detailed export carries dozens of nested columns; this is the minimal allocation-grade subset flattened onto the canonical model the rest of the pipeline consumes. Keep this map versioned in source control so a newly added export field is a code review, not a production surprise.

Export source field	Normalized field	Type	Notes
`usage_start_time`	`usage_start`	timestamp	Curated partition key; the usage-period anchor, not load time
`export_time`	`export_time`	timestamp	Incremental watermark; restated rows reappear with a newer value
`project.id`	`project_id`	string (nullable)	Primary showback dimension; null for billing-account-level charges
`service.id` / `service.description`	`service_id` / `service`	string	Cluster on `service_id` for query pruning
`sku.id` / `sku.description`	`sku_id` / `sku`	string	Finest-grained pricing unit
`cost`	`cost`	decimal	Pre-credit; `ROUND(cost, 6)` to match invoice precision
`credits` (REPEATED)	`credits_total`	decimal	`SUM(credits.amount)`; negative — add to `cost` for net cost
`usage.amount` / `usage.unit`	`usage_amount` / `usage_unit`	decimal / string	Raw consumption for unit-economics models
`labels` (REPEATED key/value)	`labels{}`	map	Flattened user labels; feeds tag-validation gating
`invoice.month`	`invoice_month`	string (`YYYYMM`)	Reconciliation key against the published invoice
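
As a rough illustration of the mapping in the table above, the sketch below flattens one nested export row into the normalized field names. It operates on a plain dict, uses Decimal with six places for the money columns as one possible reading of the "decimal" type, and all sample values (including the service and SKU identifiers) are invented.

```python
from datetime import datetime, timezone
from decimal import Decimal
from typing import Any, Dict

SIX_PLACES = Decimal("0.000001")


def _money(value: float) -> Decimal:
    """Convert a FLOAT64 amount to a Decimal rounded to six places."""
    return Decimal(str(value)).quantize(SIX_PLACES)


def normalize(row: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one nested export row onto the normalized field names."""
    project = row.get("project") or {}   # null for account-level charges
    credits = row.get("credits") or []
    labels = row.get("labels") or []
    return {
        "usage_start": row["usage_start_time"],
        "export_time": row["export_time"],
        "project_id": project.get("id"),
        "service_id": row["service"]["id"],
        "service": row["service"]["description"],
        "sku_id": row["sku"]["id"],
        "sku": row["sku"]["description"],
        "cost": _money(row["cost"]),
        "credits_total": sum((_money(c["amount"]) for c in credits), Decimal("0")),
        "usage_amount": Decimal(str(row["usage"]["amount"])),
        "usage_unit": row["usage"]["unit"],
        "labels": {item["key"]: item["value"] for item in labels},
        "invoice_month": row["invoice"]["month"],
    }


# Hypothetical account-level row (project is null); identifiers are made up.
sample_row = {
    "usage_start_time": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "export_time": datetime(2024, 5, 2, tzinfo=timezone.utc),
    "project": None,
    "service": {"id": "SVC-EXAMPLE", "description": "Example Service"},
    "sku": {"id": "SKU-EXAMPLE", "description": "Example SKU"},
    "cost": 1.2345678,
    "credits": [{"name": "Promotion", "amount": -0.25}],
    "usage": {"amount": 3600.0, "unit": "seconds"},
    "labels": [{"key": "team", "value": "data"}],
    "invoice": {"month": "202405"},
}

record = normalize(sample_row)
print(record["cost"] + record["credits_total"])  # 0.984568
```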

Operational Considerations

Restatement window. Google updates the export for credits, taxes, and adjustments for up to ~30 days after usage. The 35-day watermark rewind covers this with a margin; shrinking it to “yesterday only” is the single most common cause of under- or over-counted historical months.
Net cost is not cost. A frequent reporting bug is summing cost and ignoring credits. Net cost is cost + SUM(credits.amount) (credits are negative). Sustained-use and committed-use discounts only appear in the credits array.
Cost is FLOAT64. BigQuery stores cost as a float, so SUM(cost) accumulates rounding drift across millions of rows. Group by invoice.month and ROUND(cost, 6) when reconciling to the published invoice to the cent.
Currency normalization. Multi-currency billing accounts emit rows in local currency with a currency_conversion_rate; multiply before aggregating across currencies, or pin every report to a single settlement currency.
Scan cost and quotas. On-demand BigQuery charges ~$6.25/TiB scanned. Load, copy, and destination-table query jobs are limited to a default 1,500 table modifications per table per day; DML statements count toward that total but are not themselves rejected once it is reached. Partition replacement keeps each run to a handful of partitions; the with_backoff decorator absorbs the transient TooManyRequests you will eventually hit at scale (see Handling Billing API Rate Limits & Retries).
Monitoring hooks. Emit metrics for watermark lag (now − max export_time), partitions replaced per run, and rows affected; alarm when no rows arrive for 48 hours, which signals a disabled export or broken billing-account hierarchy rather than a quiet spend day.
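
The monitoring hooks in the last item can be sketched as a small pure function. The metric names, the run_metrics helper, and the sample timestamps are assumptions for illustration; wiring the result into an actual metrics backend is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

STALE_AFTER = timedelta(hours=48)  # alarm threshold from the text


def run_metrics(
    max_export_time: datetime,
    partitions_replaced: int,
    rows_affected: int,
    now: datetime,
) -> Dict[str, Any]:
    """Build the per-run metrics: watermark lag, volume, and a staleness flag."""
    lag = now - max_export_time
    return {
        "watermark_lag_seconds": lag.total_seconds(),
        "partitions_replaced": partitions_replaced,
        "rows_affected": rows_affected,
        "export_stale": lag > STALE_AFTER,
    }


# Hypothetical run: the newest export row is 50 hours old.
now = datetime(2024, 5, 22, 12, 0, tzinfo=timezone.utc)
metrics = run_metrics(now - timedelta(hours=50), 0, 0, now)
print(metrics["export_stale"])  # True
```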

Troubleshooting

Month-to-date total drifts above the GCP invoice after a re-run. Root cause: rows appended without partition replacement, so a restated usage day was loaded twice. Detection: curated SUM(cost) for a closed month exceeds the export table’s SUM(cost) for the same invoice.month. Remediation: route every load through the DELETE+INSERT transaction keyed on usage_start_time partitions; never use bare INSERT against the curated table.
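
The detection step above amounts to comparing per-month totals from the two tables. A minimal sketch, assuming the totals have already been fetched into dicts keyed by invoice month (the helper name, tolerance, and figures are invented):

```python
from typing import Dict


def overcounted_months(
    curated_totals: Dict[str, float],
    export_totals: Dict[str, float],
    tolerance: float = 0.01,
) -> Dict[str, float]:
    """Invoice months where the curated SUM(cost) exceeds the export's."""
    drift = {}
    for month, curated in curated_totals.items():
        excess = curated - export_totals.get(month, 0.0)
        if excess > tolerance:
            drift[month] = round(excess, 2)
    return drift


# Hypothetical totals: May was loaded twice into the curated table.
curated_totals = {"202404": 1200.00, "202405": 980.50}
export_totals = {"202404": 1200.00, "202405": 490.25}

print(overcounted_months(curated_totals, export_totals))  # {'202405': 490.25}
```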

Last few days of cost look too low and keep changing. Root cause: the trailing partitions are still inside the restatement window and have not received all credits/adjustments yet. Detection: recent days rise on each subsequent sync. Remediation: this is expected — mark the trailing 30 days “provisional” in dashboards and rely on the watermark rewind to converge them; do not treat early values as final.
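
Marking trailing days as provisional can be a one-line date comparison. The sketch below assumes a 30-day window as described above; is_provisional is a hypothetical helper a dashboard layer might call.

```python
from datetime import date


def is_provisional(usage_date: date, today: date, window_days: int = 30) -> bool:
    """True while a usage day is still inside the restatement window."""
    return (today - usage_date).days < window_days


today = date(2024, 6, 30)
print(is_provisional(date(2024, 6, 15), today))  # True
print(is_provisional(date(2024, 5, 31), today))  # False
```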

Sync misses restated rows from three weeks ago. Root cause: the watermark was advanced to MAX(export_time) without rewinding, so older partitions Google later restated were never re-selected. Detection: a credit applied to an old month never appears in curated data. Remediation: always subtract restatement_days from the watermark before computing affected partitions, as the engine does.

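NotFound: Dataset/table not found on first run. Root cause: the curated table does not exist yet. read_watermark tolerates this and falls back to the epoch, but the DELETE+INSERT transaction then targets a missing table. Detection: the first invocation logs “curated table missing” and then raises NotFound from replace_partitions. Remediation: create the curated table once with the partition/cluster DDL above; the next run reads an empty table, falls back to the epoch watermark, and backfills.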

Query scans terabytes for a one-day report. Root cause: downstream models query the raw export table (partitioned by load time) or a curated table missing PARTITION BY DATE(usage_start_time). Detection: the BigQuery “bytes processed” estimate is orders of magnitude above the day’s data. Remediation: query the curated table and ensure every report filters on DATE(usage_start_time) so the optimizer prunes partitions.

Frequently Asked Questions

Why sync into a curated table instead of querying the export directly?

The native export table is owned by Google, partitioned by ingestion time rather than usage time, and continuously restated. Querying it directly means full or near-full scans and non-deterministic results mid-restatement. A curated table partitioned by DATE(usage_start_time) gives you partition pruning, stable historical reads, and a place to attach the normalized dimensional model the rest of the pipeline expects.

How do I handle the 30-day restatement window?

Rewind the export_time watermark by a buffer (35 days here) before computing which usage_start_time partitions to reload, then replace those partitions in full. Because replacement is idempotent, reprocessing a day Google has restated overwrites the old values exactly, with no duplicates.

Should I use the BigQuery export or stream from Pub/Sub?

Use the BigQuery export for the granular, finalized, resource-level historical record that allocation and amortization need — it is the cheapest path to full-fidelity data. Reserve Pub/Sub-style streaming for near-real-time budget alerting where minutes of latency matter more than completeness, accepting that streamed figures are provisional until the export restates them.

Why does net cost differ from the cost column?

cost is the pre-credit amount. Sustained-use discounts, committed-use discounts, promotions, and free-tier credits live in the repeated credits array as negative amounts. Net cost is cost + SUM(credits.amount); reporting cost alone overstates spend by the value of every discount applied.

Related

Cloud Billing Data Ingestion & Parsing — the parent reference defining the four-stage acquisition→normalization→allocation→persistence model this GCP sync implements.
Incremental Sync Strategies for GCP Billing Exports — the deep dive on watermark tracking, partition-level overwrite, and job deduplication behind this engine.
GCP Billing Export Configuration — the account-level setup that enables the detailed export this page consumes.
AWS CUR to Data Lake Pipeline — the AWS-equivalent acquisition stage, using manifest discovery instead of partition pruning.
Handling Billing API Rate Limits & Retries — the retry, backoff, and quota patterns behind the engine’s with_backoff decorator.
