feat(datum-platform): add analyze-gcp-spend skill and cost-analyst agent#10

Draft
drewr wants to merge 3 commits into main from feat/analyze-gcp-spend

drewr commented Apr 25, 2026

Summary

  • Adds analyze-gcp-spend skill (3 files) covering BigQuery billing queries, live cluster state commands, and a full report template with mermaid trend charts
  • Adds cost-analyst agent that the weekly scheduler invokes
  • Bumps datum-platform plugin to 1.7.0

What it does

Generates a weekly GCP spend report covering all datum-cloud infrastructure (staging and production). When run during the first 7 days of a month, it produces a full prior-month retrospective; in all other weeks it produces a month-to-date (MTD) snapshot. The skill queries BigQuery billing exports for service, SKU, and storage breakdowns, pulls live node pool and PVC inventory, computes a trailing 4-month trend, and generates mermaid xychart-beta charts. The output is a PR to datum-cloud/engineering that adds reports/gcp-spend/YYYY-MM-DD-gcp-spend.md.
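
For reviewers, the cadence rule above can be sketched roughly as follows. This is a minimal illustration, not the contents of SKILL.md: the function names are hypothetical, and treating the run date as the inclusive end of the MTD window is an assumption (billing exports may lag by a day or more).

```python
from datetime import date, timedelta


def report_mode(run_date: date) -> str:
    """Return 'full-month' during the first 7 days of a month, else 'mtd'."""
    return "full-month" if run_date.day <= 7 else "mtd"


def report_window(run_date: date) -> tuple[date, date]:
    """Return the inclusive (start, end) dates the report would cover."""
    first_of_month = run_date.replace(day=1)
    if report_mode(run_date) == "full-month":
        # Full retrospective: the whole calendar month before run_date.
        last_of_prior = first_of_month - timedelta(days=1)
        return last_of_prior.replace(day=1), last_of_prior
    # MTD snapshot: assumed here to run from the 1st through the run date.
    return first_of_month, run_date


def report_path(run_date: date) -> str:
    """Build the report path in the reports/gcp-spend/ naming scheme."""
    return f"reports/gcp-spend/{run_date.isoformat()}-gcp-spend.md"
```

Under these assumptions, a run on 2026-05-03 would cover all of April 2026, while a run on 2026-04-25 would cover April 1 through April 25.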

Test plan

  • Review SKILL.md cadence logic (day ≤ 7 produces the full-month report; any later day produces MTD)
  • Review BigQuery SQL in queries.md against actual billing export schema
  • Verify that the mermaid chart stubs in report-format.md render correctly in GitHub preview
  • Confirm GCP project IDs (datum-cloud-prod, datum-cloud-staging) and cluster names match actual infra
  • Run cost-analyst agent manually with a date in the first 7 days to validate full-month mode
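
To help with the chart review, here is a rough sketch of how a trailing-4-month xychart-beta body could be produced from monthly totals. It is not the stub from report-format.md: the helper names, the "USD" axis label, and the 10% headroom on the y-axis are all assumptions, and any totals passed in would come from the billing queries rather than from this code.

```python
from datetime import date


def trailing_months(last: date, n: int = 4) -> list[str]:
    """Return n 'YYYY-MM' labels ending at last's month, oldest first."""
    labels = []
    year, month = last.year, last.month
    for _ in range(n):
        labels.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # roll back across a year boundary
            year, month = year - 1, 12
    return labels[::-1]


def render_trend_chart(title: str, labels: list[str], totals_usd: list[float]) -> str:
    """Render the body of a mermaid xychart-beta bar chart (without the fence)."""
    if not labels or len(labels) != len(totals_usd):
        raise ValueError("labels and totals_usd must be non-empty and equal length")
    x_axis = ", ".join(f'"{label}"' for label in labels)
    values = ", ".join(f"{total:.0f}" for total in totals_usd)
    upper = max(totals_usd) * 1.1  # assumed headroom so bars don't touch the top
    return "\n".join(
        [
            "xychart-beta",
            f'    title "{title}"',
            f"    x-axis [{x_axis}]",
            f'    y-axis "USD" 0 --> {upper:.0f}',
            f"    bar [{values}]",
        ]
    )
```

The returned string is meant to be wrapped in a mermaid code fence in the report; pasting it into a GitHub preview is a quick way to check rendering.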

drewr and others added 2 commits April 25, 2026 16:47
Weekly GCP cost report covering datum-cloud staging and production. Queries
BigQuery billing exports and live cluster state (node pools, PVCs, Cloud SQL),
generates mermaid xychart-beta trend charts for the trailing 4 months, and
files a PR to datum-cloud/engineering. Includes cadence logic to produce a
full prior-month retrospective when run in the first 7 days of a month.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cluster names datum-prod/datum-staging do not exist; correct to
infrastructure-control-plane-{prod,staging} in SKILL.md and queries.md
(node pool and PVC inventory commands).

Add a Preflight Checks section to SKILL.md that validates BigQuery
billing export access before proceeding. BigQuery is a hard requirement —
storage and Cloud SQL costs are invisible without it, causing the report
to understate spend by $2,000–$3,000/month. The skill now requires
stopping and surfacing the access error rather than publishing an
estimate-based report as authoritative.

drewr commented Apr 25, 2026

Pushed two fixes based on a real run today:

1. Wrong cluster names: datum-prod and datum-staging don't exist. Corrected to infrastructure-control-plane-prod / infrastructure-control-plane-staging in both SKILL.md and queries.md (node pool and PVC inventory commands).

2. Missing preflight checks: the skill had no guard against BigQuery being inaccessible. Without billing export access, storage and Cloud SQL costs are invisible, which caused a $5,800/month understatement (~54% off) on today's run. Added a Preflight Checks section that requires verifying BQ access before proceeding and makes clear that the skill should halt and surface the error rather than publish an estimate as authoritative.
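
The halt-on-failure behavior in item 2 amounts to something like the sketch below. This only illustrates the control flow; it is not the skill's actual wording or code. PreflightError, preflight_bigquery, and build_report are hypothetical names, and the probe is assumed to be any cheap callable that raises when the billing export cannot be read.

```python
from typing import Callable


class PreflightError(RuntimeError):
    """Raised when a hard requirement for the report is not met."""


def preflight_bigquery(probe: Callable[[], object]) -> None:
    """Run the access probe and convert any failure into a hard stop."""
    try:
        probe()
    except Exception as exc:
        # Surface the underlying access error instead of continuing.
        raise PreflightError(
            f"BigQuery billing export is not accessible: {exc}"
        ) from exc


def build_report(probe: Callable[[], object], generate: Callable[[], str]) -> str:
    """Generate the report only after the preflight check passes."""
    preflight_bigquery(probe)  # no estimate-based fallback on failure
    return generate()
```

The point of the design is that there is no fallback branch: if the probe fails, nothing is generated, so an estimate cannot be published as if it were authoritative.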

@drewr drewr requested a review from scotwells May 1, 2026 12:16