fix(repo): platform reliability hardening — agents, config, ui, forecast #380

Description

@w7-mgfcode

Summary

Five reliability defects sit flat and unlabeled in the backlog with no shared scope owner:

- #335: agent fallback failures surface as generic errors with no actionable detail.
- #334: validate_model_identifier accepts doubled provider prefixes such as google-gla:google-gla:gemini-…, which then 404 at the provider. This is a confirmed root cause of one #335 failure leg.
- #332: the Showcase page white-screens on LAN HTTP because crypto.randomUUID is undefined outside secure contexts (frontend/src/components/demo/RunHistoryStrip.tsx:75).
- #268: ModelFamily lives in app/features/forecasting/schemas.py, forcing documented lazy-import workarounds across the registry boundary (6 import sites).
- #237: a model_exogenous re-forecast returns a 0.0 delta regardless of price assumptions. Root cause is unverified; the wiring looks correct at app/features/scenarios/feature_frame.py:155.

Baseline: .flow/state.md (2026-06-11), brainstorm Round 3.
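To make the #334 defect concrete, here is a minimal sketch of a doubled-prefix check. It assumes identifiers follow a colon-separated `provider:model` shape, as the example in the summary suggests. The helper names `has_doubled_provider_prefix` and `reject_doubled_prefix` are hypothetical and are not the signature of the real validate_model_identifier.

```python
def has_doubled_provider_prefix(identifier: str) -> bool:
    """Return True when the first two colon-separated segments repeat.

    Assumption: identifiers look like "provider:model". At least three
    segments are required so a plain "name:name" pair is not flagged.
    """
    parts = identifier.split(":")
    return len(parts) >= 3 and parts[0] != "" and parts[0] == parts[1]


def reject_doubled_prefix(identifier: str) -> str:
    """Hypothetical guard: raise early instead of letting the provider 404."""
    if has_doubled_provider_prefix(identifier):
        provider = identifier.split(":", 1)[0]
        raise ValueError(
            f"doubled provider prefix {provider!r} in model identifier {identifier!r}"
        )
    return identifier
```

Model names that themselves contain a colon pass through unflagged as long as the first two segments differ, so the check stays narrow. Whether the real validator should reject or silently normalize the doubled prefix is a design decision for the #334 epic.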

Approach

Pure hardening inside existing slices: no new router, no new slice, no new runtime dependency, and no schema change except what #268's relocation requires (import-path moves only, no DB migration). Each epic lands as an ordinary fix/ or refactor/ branch through the standard validation gates. #334 lands first (Foundation) because it removes one class of #335 failures and changes the failure surface that the #335 error-classification work tests against. #237 is gated investigate-first: a reproduction test decides wiring-bug vs zero-learned-elasticity before any fix is committed. Explicitly NOT changing: observability stack (none, by design), agent approval surface (agent_require_approval untouched), pgvector/FastAPI/SQLAlchemy versions.
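The investigate-first gate for #237 can be pictured as a two-step check: did the price assumption reach the feature rows at all, and if it did, does the prediction move? The sketch below uses a toy linear model and plain dict rows. `classify_zero_delta`, `make_linear_model`, and the `price` key are all hypothetical stand-ins, not the project's forecasting or scenario APIs.

```python
from typing import Callable, Sequence

Row = dict[str, float]


def make_linear_model(price_coef: float) -> Callable[[Sequence[Row]], list[float]]:
    """Toy stand-in for a fitted model: forecast = 10 + price_coef * price."""
    def predict(rows: Sequence[Row]) -> list[float]:
        return [10.0 + price_coef * row["price"] for row in rows]
    return predict


def classify_zero_delta(
    predict: Callable[[Sequence[Row]], list[float]],
    baseline_rows: Sequence[Row],
    scenario_rows: Sequence[Row],
    price_key: str = "price",
    tol: float = 1e-9,
) -> str:
    """Return 'wiring-bug', 'zero-elasticity', or 'responsive'."""
    # Step 1: if the scenario price never reached the rows, the wiring is at fault.
    price_changed = any(
        abs(b[price_key] - s[price_key]) > tol
        for b, s in zip(baseline_rows, scenario_rows)
    )
    if not price_changed:
        return "wiring-bug"
    # Step 2: the price changed, so a zero delta points at the learned coefficient.
    delta = sum(predict(scenario_rows)) - sum(predict(baseline_rows))
    return "zero-elasticity" if abs(delta) <= tol else "responsive"
```

A real reproduction test would call the actual re-forecast path instead of the toy model, but the verdict logic is the same: only a "wiring-bug" result implies a code fix, while "zero-elasticity" points toward the documented verdict and seeder follow-up mentioned under Risks.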

Decomposition

Out of scope (explicit)

Success criteria

Risks

| Risk | Mitigation |
| --- | --- |
| #268 reintroduces the alembic cold-boot import cycle | CI migration-check + a cold-boot import test in the epic's PR; documented cross-slice pattern (docs/_base/ARCHITECTURE.md) is the spec |
| #237 root cause is modeling (zero learned elasticity), not code | Investigate-first gate: repro test decides scope before any fix commits; epic may close with a documented verdict + seeder follow-up issue |
| Fixing #334 changes the failure surface #335 tests against | Phase ordering: E1 Foundation lands before E2 starts its test matrix |
| LAN-HTTP behavior not covered by CI | Documented manual dogfood step (agent-browser/webapp-testing) in E3's acceptance; vitest covers the fallback util logic |
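A cold-boot import test of the kind named for #268 could be sketched as follows. It imports the target module in a fresh interpreter, so modules already cached in the test process cannot mask an import cycle. The helper name `imports_cleanly` is hypothetical, and the module paths to check are left to the epic's PR.

```python
import subprocess
import sys


def imports_cleanly(module: str, timeout: float = 60.0) -> tuple[bool, str]:
    """Import `module` in a fresh interpreter; return (ok, stderr).

    A new process starts with an empty sys.modules cache, which is what
    makes this a cold-boot check rather than an in-process import.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr
```

In a pytest suite this would typically be parametrized over the modules on both sides of the registry boundary, asserting that the first element is True and printing the captured stderr on failure.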

Tracking

Metadata

Assignees

No one assigned

Labels

fix (Bug fix), flow (flow: command-suite work), umbrella (Umbrella initiative, scope owner)
