fix(repo): platform reliability hardening — agents, config, ui, forecast

## Summary
Five reliability defects sit flat and unlabeled in the backlog with no shared scope owner:

- **#335**: agent fallback failures surface as generic errors with no actionable detail.
- **#334**: `validate_model_identifier` accepts doubled provider prefixes like `google-gla:google-gla:gemini-…`, which then 404 at the provider. This is a confirmed root cause of one #335 failure leg.
- **#332**: the Showcase page white-screens on LAN HTTP because `crypto.randomUUID` is undefined outside secure contexts (`frontend/src/components/demo/RunHistoryStrip.tsx:75`).
- **#268**: `ModelFamily` lives in `app/features/forecasting/schemas.py`, forcing documented lazy-import workarounds across the registry boundary (6 import sites).
- **#237**: a `model_exogenous` re-forecast returns a 0.0 delta regardless of price assumptions. Root cause is unverified; the wiring looks correct at `app/features/scenarios/feature_frame.py:155`.

Baseline: `.flow/state.md` (2026-06-11), brainstorm Round 3.
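
To make the #334 defect concrete, here is a minimal sketch of a check that rejects a doubled provider prefix. It is not the project's `validate_model_identifier`: the function name `check_model_identifier`, the `KNOWN_PROVIDERS` set, and the sample model names are all hypothetical. The sketch assumes a model name may legitimately contain a colon (for example a tag suffix), so it only rejects names whose first segment is itself a known provider.

```python
# Hypothetical provider list for illustration; the real set lives in the project.
KNOWN_PROVIDERS = frozenset({"google-gla", "openai", "anthropic", "ollama"})


def check_model_identifier(model_id: str) -> str:
    """Return model_id if it looks like 'provider:model_name', else raise ValueError."""
    provider, sep, model_name = model_id.partition(":")
    if not sep or not provider or not model_name:
        raise ValueError(f"expected 'provider:model_name', got {model_id!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r} in {model_id!r}")

    # A doubled prefix shows up as a provider name leading the model_name part.
    first_segment = model_name.split(":", 1)[0]
    if first_segment in KNOWN_PROVIDERS:
        raise ValueError(
            f"model name {model_name!r} starts with provider prefix "
            f"{first_segment!r}; remove the duplicated prefix"
        )
    return model_id
```

Under these assumptions, `check_model_identifier("google-gla:google-gla:some-model")` raises, while `check_model_identifier("ollama:some-model:8b")` passes because `some-model` is not a provider name.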

## Approach
Pure hardening inside existing slices — no new router, no new slice, no new runtime dependency, no schema change except what #268's relocation requires (import-path moves only, no DB migration). Each epic lands as an ordinary `fix/`/`refactor/` branch through the standard validation gates. #334 lands first (Foundation) because it removes one class of #335 failures and changes the failure surface the #335 error-classification work tests against. #237 is gated investigate-first: a reproduction test decides wiring-bug vs zero-learned-elasticity before any fix is committed. Explicitly NOT changing: observability stack (none, by design), agent approval surface (`agent_require_approval` untouched), pgvector/FastAPI/SQLAlchemy versions.
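
The investigate-first gate for #237 can be pictured as a two-step check: first ask whether the price assumption changes the feature frame at all, then ask whether the model output moves. The sketch below uses a toy linear forecaster, not the project's `model_exogenous` path; `build_features`, `toy_forecast`, and `diagnose` are hypothetical names introduced only for illustration.

```python
import math


def build_features(base_price: float, price_multiplier: float) -> dict[str, float]:
    """Toy feature frame: a single price feature scaled by the scenario assumption."""
    return {"price": base_price * price_multiplier}


def toy_forecast(features: dict[str, float], price_coef: float, intercept: float = 100.0) -> float:
    """Toy linear model standing in for a trained forecaster."""
    return intercept + price_coef * features["price"]


def diagnose(
    baseline_features: dict[str, float],
    scenario_features: dict[str, float],
    baseline_pred: float,
    scenario_pred: float,
) -> str:
    """Classify a zero-delta re-forecast into one of the two suspected causes."""
    if baseline_features == scenario_features:
        # The assumption never reached the feature frame.
        return "wiring-bug"
    if math.isclose(baseline_pred, scenario_pred):
        # Features moved but the model ignored them.
        return "zero-elasticity"
    return "price-sensitive"
```

With a price coefficient of `0.0` the toy model yields `"zero-elasticity"` for any price change, which mirrors the modeling explanation; identical feature frames yield `"wiring-bug"` regardless of the coefficient.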

## Decomposition
- [x] **E1 — Foundation** (blocks E2–E6): #334 — reject doubled provider prefixes in `validate_model_identifier` (config + core settings paths) with regression tests — linked as sub-issue
- [x] **E2 — Parallel**: #335 — surface agent fallback-model failures with actionable, classified details (404/429/auth) over REST + WebSocket error events — linked as sub-issue
- [x] **E3 — Parallel**: #332 — safe-UUID fallback util for non-secure contexts; Showcase survives LAN HTTP end-to-end — linked as sub-issue
- [x] **E4 — Parallel**: #268 — move `ModelFamily` to `app/shared/`, retire the lazy-import workarounds at the 6 mapped import sites, prove alembic cold-boot clean — linked as sub-issue
- [x] **E5 — Parallel**: #237 — investigate-first: reproduction test discriminating wiring-bug vs zero-elasticity, then fix per verdict — linked as sub-issue
- [x] **E6 — Release gate** (closes after Foundation + all Parallel): full-gate verification — showcase_rich dogfood green, all five flat issues closed, umbrella closed — #387
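
As an illustration of the classification E2 describes, the sketch below maps provider HTTP status codes to the three named failure kinds and builds one error payload covering every failed leg. The `FailureKind` enum, the payload shape, and the hint strings are assumptions for illustration only, not the project's actual REST or WebSocket event schema.

```python
from enum import Enum
from typing import Optional


class FailureKind(str, Enum):
    MODEL_NOT_FOUND = "model_not_found"
    QUOTA = "quota_exceeded"
    AUTH = "auth_failed"
    UNKNOWN = "unknown"


# Hypothetical operator-facing hints, one per failure kind.
HINTS = {
    FailureKind.MODEL_NOT_FOUND: "Check the model identifier in the AI config.",
    FailureKind.QUOTA: "Provider quota or rate limit hit; retry later or switch model.",
    FailureKind.AUTH: "Check the provider API key.",
    FailureKind.UNKNOWN: "See server logs for the provider response.",
}


def classify_status(status_code: Optional[int]) -> FailureKind:
    """Map a provider HTTP status to a failure kind (None means no HTTP response)."""
    if status_code == 404:
        return FailureKind.MODEL_NOT_FOUND
    if status_code == 429:
        return FailureKind.QUOTA
    if status_code in (401, 403):
        return FailureKind.AUTH
    return FailureKind.UNKNOWN


def build_error_payload(legs: list[tuple[str, Optional[int]]]) -> dict:
    """Build one error event from (model_id, status_code) pairs, one per failed attempt."""
    attempts = []
    for model_id, status_code in legs:
        kind = classify_status(status_code)
        attempts.append({"model": model_id, "kind": kind.value, "hint": HINTS[kind]})
    return {"type": "error", "error": "all_models_failed", "attempts": attempts}
```

Reporting every leg, rather than only the last exception, is what lets the caller see that the primary failed with a 404 while the fallback hit a quota limit.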

## Out of scope (explicit)
- Ollama streaming chat 400 on `/agents/stream` — reason: already fixed and merged (PR #343, issue #342).
- `_llm_key_present()` skipping Ollama in the demo pipeline — reason: already fixed and merged (PR #341, issue #340).
- HITL `pending_action` loss on model misbehavior/retry exhaustion — reason: already fixed and merged (PRs #337/#345, issues #336/#344/#346).
- `feature_frame_version` V≥3 clamp in ops/registry — reason: already fixed and merged (PR #339, issue #338).
- LLM retry/circuit-breaker middleware + Prometheus metrics — reason: violates the no-external-observability / single-host principle (`docs/_base/ARCHITECTURE.md` § Observability, `product-vision.md`).

## Success criteria
- [x] `PATCH /config/ai` and `Settings` boot reject any model id whose model_name contains a provider prefix (`a:b:c`) with a clear 422 / startup error; unit tests cover both validation paths (#334 closed)
- [x] When primary and fallback models both fail, REST chat and `/agents/stream` emit a classified, actionable error (model-not-found / quota / auth) instead of a generic failure; route test covers ≥2 failure legs (#335 closed)
- [x] `/showcase` completes a full run over plain-HTTP LAN origin without white-screen; UUID fallback util has a vitest (#332 closed)
- [x] `ModelFamily` imports resolve from `app/shared/`; zero lazy-import NOTE comments reference the forecasting↔registry ModelFamily cycle; `alembic upgrade head` cold-boots clean in CI migration-check (#268 closed)
- [x] A committed reproduction test demonstrates whether price assumptions reach the `model_exogenous` re-forecast; the verdict (wiring fix or elasticity/seeder explanation) is implemented or documented in the issue (#237 closed)
- [x] All five validation gates green on every epic PR; `showcase_rich` demo pipeline runs green end-to-end after E6

## Risks
| Risk | Mitigation |
|------|------------|
| #268 reintroduces the alembic cold-boot import cycle | CI migration-check + a cold-boot import test in the epic's PR; documented cross-slice pattern (`docs/_base/ARCHITECTURE.md`) is the spec |
| #237 root cause is modeling (zero learned elasticity), not code | Investigate-first gate: repro test decides scope before any fix commits; epic may close with a documented verdict + seeder follow-up issue |
| Fixing #334 changes the failure surface #335 tests against | Phase ordering: E1 Foundation lands before E2 starts its test matrix |
| LAN-HTTP behavior not covered by CI | Documented manual dogfood step (agent-browser/webapp-testing) in E3's acceptance; vitest covers the fallback util logic |
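
One way to approximate the cold-boot import test named in the #268 mitigation is to import the target module in a fresh interpreter, so an already-populated `sys.modules` in the test process cannot mask an import cycle. This is a minimal sketch; the helper name `imports_cleanly` is hypothetical, and it complements rather than replaces the CI migration-check.

```python
import subprocess
import sys


def imports_cleanly(module: str) -> bool:
    """Return True if `module` imports without error in a brand-new interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

In the epic's PR this would be called with the relocated module path and the registry and alembic entry modules; those paths are left out here because the final location under `app/shared/` is not specified.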

## Tracking
- Source of truth: `docs/flow-pack-methodology.md` + working state `.flow/state.md` (Round 3, `.flow/brainstorm-log.md`)
- Milestone: reliability-hardening
- Dogfood note: this umbrella is the E5 second-initiative evidence for flow-pack umbrella #368 / epic #375
- **One-pass confidence: 8/10** (all 5 ship items are open issues with file-level grounding from 3-agent research; −1 #237 root cause unverified, −1 #268 cold-boot cycle history)
