Skip to content

feat: wire validated-but-unused config (feature 965 backlog)#23

Merged
mattmattox merged 11 commits into
mainfrom
feat/wire-validated-config-965
Jun 25, 2026
Merged

feat: wire validated-but-unused config (feature 965 backlog)#23
mattmattox merged 11 commits into
mainfrom
feat/wire-validated-config-965

Conversation

@mattmattox

@mattmattox mattmattox commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Drains the feature-965 "config validated but never wired" backlog PLUS the remediation-wiring epic discovered mid-work (TaskForge project 41). Stacked on #22 (feat/ipv6-dualstack) — review/merge that first; base retargets to main once #22 lands. Full go test ./... -short green (first try; the long-standing freePort flake is fixed here).

Feature 965 (7 tasks)

  • #15214 Multi-strategy remediation dispatch (first-success-wins) + detector fallback.
  • #15215 Implemented the correlator InjectProblemPattern stub (was a no-op) + first package tests.
  • #15216 Enforce MaxRemediationsPerMinute (x/time/rate token bucket).
  • #15217 Structured slog logging from config (pkg/logger + [LEVEL] prefix bridge; zero-churn over 374 sites).
  • #15218 FallbackOnUnreachable lease tests + cluster-wide remediation-storm warnings.
  • #15219 test-ci Makefile gate (unit + integration).
  • #15220 SQLite WAL + synchronous=NORMAL for controller storage.

Remediation wiring epic — #19263 (high) + closes #17222

Discovered that no built-in remediators were registered at all (Remediate(strategy) failed for every type). Fixed across 3 phases:

  • Phase 1: RegisterBuiltinRemediators for safe strategies (systemd-restart, custom-script); per-strategy params threaded via Problem.Metadata; remediators read with config fallback.
  • Phase 2: RegisterClusterRemediators (fail-closed without a k8s client) + new node-reboot (cordon→drain→reboot, dry-run-safe, skips DaemonSet/mirror/self, injected runner) and pod-delete (metadata-targeted, guarded) remediators; in-cluster clientset built in main.
  • Phase 3: network operations (flush-dns/restart-interface/reset-routing/flush-ipv6-route) added to the strategy enum + registered (interface via metadata) → #17222 closed.

All 8 remediation strategies are now registered and reachable.

Test infra

  • #19260 Eliminated the freePort TOCTOU bind flake (exporter exposes BoundPort(); tests bind ephemeral + read back). Verified -count=20 + -race.

Per project policy, not auto-merged.

…ask #15214)

Add RemediatorRegistry.RemediateWithStrategies(ctx, []string, problem):
tries strategy types in order, stops on first success, returns wrapped
last error if all fail; reuses Remediate so circuit-breaker/rate-limit/
cooldown/dry-run all apply. detector.evaluateRemediation builds the
ordered list from remCfg.Strategies (each nested .Strategy) when present,
else falls back to [remCfg.Strategy] (backward compatible). Tests cover
first-fail-then-success ordering, short-circuit, all-fail, and fallback.

Note: real config types are Strategy string + Strategies
[]MonitorRemediationConfig (task premise said []string); per-strategy
params don't flow through Remediate today, so dispatch is by type string.
…k #15215)

Replace the InjectProblemPattern no-op stub: store injected patterns
(mutex-guarded, name-keyed/idempotent) and feed them into
detectCommonCauseCorrelation alongside built-ins (deduped by name) via
the same findNodesWithAllProblems/MinNodesForCorrelation/confidence
logic. Signature now returns error (no existing callers): empty name/
problems rejected. Add correlator_test.go (first tests for the package):
detection thresholds, confidence, GetStats, injection validation, and
injected-pattern participation. Race-clean.

Note: task referenced GetCorrelationSummary/AddPattern which don't exist;
real equivalents are GetStats/GetActiveCorrelations.
…k #15216)

MaxRemediationsPerMinute was parsed/defaulted(2)/validated but never
enforced. Add a perMinuteBucket *rate.Limiter (golang.org/x/time/rate)
on RemediatorRegistry via SetMaxRemediationsPerMinute (matching the
Set* idiom; 0 = unlimited), wired in main from config. Enforced in
Remediate AFTER the non-consuming per-hour window check so a per-hour
rejection never burns a per-minute token and a rejected call isn't
counted as executed. Updated the stale config comment. Tests: N succeed
/ N+1 rejected within a minute, and unlimited at 0. go mod tidy promoted
x/time (+modernc/sqlite) to direct, dropped unused gogo/protobuf.
New pkg/logger.Init(cfg) builds a slog JSON/text handler at the configured
level writing to stdout/stderr/file (consuming LogOutput/LogFile, which
were validated but unused), sets slog default, and bridges the stdlib log
package via an io.Writer that strips [LEVEL] prefixes -> slog levels — so
all 374 existing log.Printf sites honor the configured format/destination
with zero churn. Wired in main after config validation (warn-and-continue
on error). Converted 7 detector hot paths to native structured slog with
attributes. Updated stale config comment. Tests: JSON parseable, text,
file output, prefix-bridge level mapping, level filtering.

Also poll-for-stat fix to 3 pre-existing flaky remediation tests exposed
by the slog timing change (production logic unchanged).
…nings (Task #15218)

Add lease_client_test.go covering RequestLease: reachable+grants,
reachable+denies, unreachable+fallback=true (FAKE approval — test
documents the cluster-wide remediation storm this enables), and
unreachable+fallback=false (error). Add prominent storm-risk WARNING
above the FallbackOnUnreachable config field and in the helm values
comment (value stays false). Committed only my hunks (pre-existing image.tag bump excluded).
…219)

Add a test-ci target that runs the fast unit suite (-short) AND the
integration suite (go test ./test/integration/... no -short, 10m timeout)
so coverage gaps like TestLeaseCoordinationFlow and
TestCorrelationDetectionFlow (test/integration/controller/) are exercised
in one gate. Build-tagged integration tests (kind dual-stack) still need
-tags=integration and are not run here. Added to .PHONY.
Apply PRAGMA journal_mode=WAL and synchronous=NORMAL in SQLiteStorage.
Initialize (after ping) so concurrent RequestLease reads don't block the
single writer and writes are faster while staying crash-safe in WAL mode.
Connection pool was already pinned (SetMaxOpenConns/IdleConns(1),
ConnMaxLifetime(0)). Add TestSQLiteStorage_WALAndSynchronous (file-backed
temp DB) asserting journal_mode=wal, synchronous=1 (NORMAL), and
MaxOpenConnections=1.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce2e6ff2fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pkg/logger/logger.go
Comment on lines +161 to +162
if !strings.HasPrefix(trimmed, "[") {
return slog.LevelInfo, line

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve fatal log messages at high log levels

No-prefix standard-log writes are downgraded to INFO here, but several post-logger.Init fatal paths in cmd/node-doctor/main.go use log.Fatalf("Failed ...") without a [FATAL]/[ERROR] prefix. When operators configure logLevel: warn, error, or fatal, slog filters those INFO records and the daemon can exit without emitting the startup failure reason.

Useful? React with 👍 / 👎.

remediationErr = err
return err
}
if err := r.checkPerMinuteRate(); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Consume minute tokens only for executable remediations

The per-minute limiter consumes a token before controller lease acquisition and before resolving/checking the remediator, so non-executed attempts such as a lease denial, unreachable controller, unknown strategy, or CanRemediate cooldown still drain the bucket. With maxRemediationsPerMinute: 1, one denied lease can block the next actually-approved remediation for up to a minute even though no remediation ran.

Useful? React with 👍 / 👎.

…rategy config (Task #19263 phase 1)

Built-in remediators were never registered, so registry.Remediate failed
for every strategy. Add RegisterBuiltinRemediators (wired in main) for the
two SAFE strategies: systemd-restart->SystemdRemediator, custom-script->
CustomRemediator. Thread per-strategy params via Problem.Metadata
(service/scriptPath/args): buildStrategyList now returns the ordered
[]MonitorRemediationConfig and the detector loops them (first-success-wins
preserved), encoding each strategy's params into the dispatched Problem.
Systemd/Custom remediators read params from metadata with config fallback;
CustomRemediator keeps abs-path/no-.. validation on the metadata path.
node-reboot/pod-delete intentionally NOT registered (destructive; phase 2).
Tests: registration, per-strategy metadata threading, dry-run dispatch,
scriptPath validation. detector race-clean.
… #19263 phase 2)

Add RegisterClusterRemediators(registry,cfg,client,nodeName,self...) wiring
the two destructive strategies, fail-closed: registers node-reboot +
pod-delete ONLY when a k8s client and nodeName are present (else logs and
registers nothing). main builds an in-cluster clientset (non-fatal on
error) + passes downward-API POD_NAME/POD_NAMESPACE for self-skip.

NodeRebootRemediator: cordon -> drain (evict, skipping DaemonSet/mirror/
self pods) -> reboot via injected runner (default systemctl reboot);
dry-run returns before ANY mutation; cordon failure aborts reboot; bounded
per-phase timeouts; 30m cooldown. PodDeleteRemediator: deletes only the
pod named in Problem.Metadata (namespace/pod), refuses mirror/self, dry-run
log-only. Tested with fake clientset incl. dry-run-does-not-execute,
drain ordering/skips, and fail-closed registration.
…222 (Task #19263 phase 3)

Add flush-dns/restart-interface/reset-routing/flush-ipv6-route to
validRemediationStrategies (+ restart-interface requires an interface) and
register all four in RegisterBuiltinRemediators as NetworkRemediator
closures (per-op). flush-ipv6-route is now registered + reachable, closing
#17222. NetworkRemediator resolves the target interface from
Problem.Metadata["interface"] (new MonitorRemediationConfig.Interface
field threaded by the detector) with config fallback; AllowMetadataInterface
lets the registered restart-interface singleton construct without a fixed
NIC. Tests: 6 safe strategies registered, dry-run + executor dispatch of
flush-ipv6-route, interface-from-metadata, missing-interface error.
Remove the freePort helper (bind :0 -> read port -> close -> rebind race).
The exporter now captures its real bound address in Start (from the
eagerly-bound listener) and exposes BoundAddr()/BoundPort(); tests
construct an ephemeral-port exporter (Port 0, 9100 default applied only on
the production path) and read the actual port back instead of
pre-allocating. Migrated all freePort call sites. Verified: package
-count=20 and -race -count=5 pass with zero bind flakes; full -short
suite stable.
@mattmattox mattmattox changed the base branch from feat/ipv6-dualstack to main June 25, 2026 14:00
@mattmattox mattmattox merged commit c10c072 into main Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant