Skip to content

ROX-35269: clean up managed clusters before destroying infra-pr GKE clusters#1868

Draft
stehessel wants to merge 1 commit into
masterfrom
ROX-35269/clean-up-clusters-on-destroy
Draft

ROX-35269: clean up managed clusters before destroying infra-pr GKE clusters#1868
stehessel wants to merge 1 commit into
masterfrom
ROX-35269/clean-up-clusters-on-destroy

Conversation

@stehessel

Copy link
Copy Markdown
Contributor

When a GKE cluster running a test infra-server is destroyed, any test clusters it manages are orphaned because their Argo destroy phases never run. Add a pre-destroy cleanup step that connects to the target cluster, stops all active Argo workflows (triggering their onExit handlers), and waits for cloud resource cleanup to complete before tearing down the cluster.

The cleanup is gated by a new has-infra-server workflow parameter (default false) so it only runs on clusters that host an infra-server instance. PR.yaml passes has-infra-server=true for infra-pr clusters.

@stehessel stehessel force-pushed the ROX-35269/clean-up-clusters-on-destroy branch from f14e05e to 76e7632 Compare June 23, 2026 13:47
@stehessel stehessel changed the title feat: clean up managed clusters before destroying infra-pr GKE clusters ROX-35269: clean up managed clusters before destroying infra-pr GKE clusters Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@stehessel, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 21 minutes and 1 second. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0ef1bc4a-785b-4f9c-9dd3-0a1e241270af

📥 Commits

Reviewing files that changed from the base of the PR and between 187f4f7 and ca22df4.

📒 Files selected for processing (5)
  • .github/workflows/PR.yaml
  • Makefile
  • chart/infra-server/static/flavors.yaml
  • chart/infra-server/static/workflow-gke-default.yaml
  • chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml
📝 Walkthrough

Walkthrough

The PR adds a new cleanup-infra-clusters Argo WorkflowTemplate, wires gke-default workflows to call it during stop when has-infra-server is true, adds the has-infra-server flavor parameter, and extends argo-workflow-lint to include workflow templates.

Changes

Managed cluster cleanup on infra-server destroy

Layer / File(s) Summary
cleanup-infra-clusters WorkflowTemplate
chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml
Adds the cleanup-infra-clusters workflow template, including optional kubeconfig handling, listing of active managed workflows, stop calls for matching workflows, and polling until completion or timeout.
gke-default wiring and lint coverage
chart/infra-server/static/flavors.yaml, chart/infra-server/static/workflow-gke-default.yaml, Makefile
Adds has-infra-server to the gke-default flavor and workflow, sets the workflow namespace and default parameter, inserts the conditional cleanup step in stop, marks kubeconfig optional, and expands argo-workflow-lint to include workflow templates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main change: pre-destroy cleanup for managed infra-pr GKE clusters.
Description check ✅ Passed The description accurately describes the added pre-destroy cleanup step, workflow stopping, and the new gating parameter.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ROX-35269/clean-up-clusters-on-destroy

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@rhacs-bot

Copy link
Copy Markdown
Contributor

A single node development cluster (infra-pr-1868) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server: to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1868 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🔓 You must go to to export an . Your token from the prod infra instance will not work with dev environments.

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter then the development cluster instance. Otherwise they will not be destroyed when the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make helm-deploy

Logs

Logs for the development infra depending on your @redhat.com authuser:

Or:

kubectl -n infra logs -l app=infra-server --tail=1 -f

@stehessel stehessel force-pushed the ROX-35269/clean-up-clusters-on-destroy branch 2 times, most recently from a8be563 to 7f65e1d Compare June 24, 2026 21:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml (1)

20-20: 🔒 Security & Privacy | 🔵 Trivial

Pin the Argo CLI image to a fixed tag or digest. quay.io/argoproj/argocli:latest makes this destroy-time cleanup path change without a chart update; use a specific release that matches the Argo Workflows version you deploy, or add an explicit version value instead of relying on latest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml`
at line 20, The cleanup workflow template is using an unpinned Argo CLI image,
which makes the cleanup behavior drift unexpectedly. Update the image reference
in the cleanup template to use a fixed tag or digest, ideally matching the
deployed Argo Workflows version, and wire it through an explicit chart value if
needed instead of relying on the argocli:latest reference.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@chart/infra-server/static/workflow-gke-default.yaml`:
- Around line 24-27: The default value for has-infra-server in
workflow-gke-default.yaml is still set to "true", which causes cleanup to run
for every gke-default workflow instead of only infra-server clusters. Update the
workflow default to "false" and ensure the infra-pr path is the only place that
passes "true", keeping the behavior aligned with the has-infra-server handling
in the workflow/flavor config.
- Around line 47-50: The cleanup-infra-clusters step is missing the required
kubeconfig artifact, so it can skip cleanup. Update the workflow step that uses
templateRef cleanup-infra-clusters to pass
workflow.outputs.artifacts.global-kubeconfig into its inputs/artifacts, using
the cleanup-infra-clusters and global-kubeconfig symbols so the template
receives the kubeconfig it expects.

In `@chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml`:
- Around line 33-61: The cleanup script in the workflow template currently hides
Argo failures and can exit successfully even when managed workflows are still
active. Update get_active_workflows and the argo stop loop so command failures
are not swallowed, and make the timeout path in the cleanup step fail instead of
falling through with success. Keep the failure signal in this script so the
caller can use continueOn to proceed with destroy while still detecting cleanup
problems.

---

Nitpick comments:
In `@chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml`:
- Line 20: The cleanup workflow template is using an unpinned Argo CLI image,
which makes the cleanup behavior drift unexpectedly. Update the image reference
in the cleanup template to use a fixed tag or digest, ideally matching the
deployed Argo Workflows version, and wire it through an explicit chart value if
needed instead of relying on the argocli:latest reference.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 74c32da0-d7d0-4bec-8bb0-f41d328e964e

📥 Commits

Reviewing files that changed from the base of the PR and between 21104ff and 7f65e1d.

📒 Files selected for processing (4)
  • Makefile
  • chart/infra-server/static/flavors.yaml
  • chart/infra-server/static/workflow-gke-default.yaml
  • chart/infra-server/templates/workflowtemplates/cleanup-infra-clusters.yaml

Comment thread chart/infra-server/static/workflow-gke-default.yaml Outdated
Comment thread chart/infra-server/static/workflow-gke-default.yaml
…lusters

When a GKE cluster running a test infra-server is destroyed, any test
clusters it manages are orphaned because their Argo destroy phases never
run. Add a pre-destroy cleanup step that connects to the target cluster,
stops all active Argo workflows (triggering their onExit handlers), and
waits for cloud resource cleanup to complete before tearing down the
cluster.

The cleanup is gated by a new has-infra-server workflow parameter
(default false) so it only runs on clusters that host an infra-server
instance. PR.yaml passes has-infra-server=true for infra-pr clusters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stehessel stehessel force-pushed the ROX-35269/clean-up-clusters-on-destroy branch from 742a5bc to ca22df4 Compare June 25, 2026 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants