From fdf72c10e34bb48594101150609bc6c44b05e2c4 Mon Sep 17 00:00:00 2001 From: Gabor Szabo Date: Fri, 12 Jun 2026 12:56:08 +0200 Subject: [PATCH 1/2] docs(repo): track reliability E6 prp (#387) --- PRPs/PRP-reliability-E6-release-gate.md | 551 ++++++++++++++++++++++++ 1 file changed, 551 insertions(+) create mode 100644 PRPs/PRP-reliability-E6-release-gate.md diff --git a/PRPs/PRP-reliability-E6-release-gate.md b/PRPs/PRP-reliability-E6-release-gate.md new file mode 100644 index 00000000..3cc5344d --- /dev/null +++ b/PRPs/PRP-reliability-E6-release-gate.md @@ -0,0 +1,551 @@ +name: "PRP reliability-E6 — release gate: showcase_rich dogfood + per-epic spot checks + umbrella close-out" +description: | + Issue #387 (epic E6 of umbrella #380, milestone reliability-hardening). + Release-gate epic: NO new production code. The deliverable is executed + verification — a green end-to-end showcase_rich dogfood run on a fresh stack, + one live spot check per closed reliability epic (E1 #334, E2 #335, E3 #332, + E4 #268, E5 #237), all five validation gates green on dev, evidence recorded + on #387, and umbrella #380 closed. If any check fails, the gate STOPS and + files a fix issue — it never fixes forward inside this epic. + +--- + +## Goal + +Prove that the five reliability fixes hold **as one system** on `dev`, not just as +isolated epic PRs, then close the reliability-hardening umbrella: + +1. **Fresh-stack showcase_rich dogfood** — `docker compose down -v` → up → migrate, + then a `/showcase` run with `scenario=showcase_rich` + **Re-seed first**: all + 24 steps / 10 phases green (PRP-41 layout). Provider-dependent steps may ⏭️ skip + or ⚠️ warn per `docs/_base/RUNBOOKS.md`; the pipeline must still end green + (`pipeline_complete`, no ❌ step). +2. **Five per-epic spot checks** on `dev` — each is a committed regression test + re-run PLUS (where meaningful) a live HTTP probe against the running stack. +3. 
**All five validation gates green** on `dev` (ruff check, ruff format --check, + mypy --strict, pyright --strict, pytest unit). +4. **Close-out** — evidence comment on #387, tick every satisfied checkbox on + #380 (the live body has drifted — see Known Gotchas), close #380 with a + close-out comment linking the evidence, close #387. + +**End state**: #387 and #380 are CLOSED with linked evidence; `dev` is demonstrably +green end-to-end; this PRP file is committed as `docs(repo)` (the E1–E5 precedent). + +## Why + +- "showcase_rich demo pipeline runs green end-to-end after E6" is the **last open + success criterion** on umbrella #380 — every other epic (#334, #335, #332, #268, + #237) is closed as of 2026-06-12. Nothing verifies their *combined* behavior yet. +- The five fixes interact: E1 changed the failure surface E2 classifies; E5's seeder + coupling changes the data every showcase step trains on; E4 moved an import the + alembic cold-boot path exercises; E3 only manifests in a real browser over LAN HTTP. + An isolated-PR-green ≠ system-green. +- The umbrella is also the flow-pack dogfood evidence (#368/#375) — a clean, + evidence-linked close-out is part of the methodology being proven. + +## What + +A verification campaign, not a feature. No `app/` or `frontend/` source change is +in scope. The only repo change this PRP itself produces is the PRP file +(`PRPs/PRP-reliability-E6-release-gate.md`) committed as +`docs(repo): track reliability E6 prp (#387)`. + +### Success Criteria (mirror of #387 exit criteria) + +- [ ] Fresh stack rebuilt: `docker compose down -v && docker compose up -d && + uv run alembic upgrade head` exits clean (this is ALSO the E4 cold-boot proof). +- [ ] showcase_rich dogfood green end-to-end via the `/showcase` page loaded over a + **plain-HTTP LAN origin** (covers E3 simultaneously) — evidence: final step + summary + screenshot; no white-screen, no ❌ step. 
+- [ ] E1 #334 spot check passes: doubled provider prefix → 422 (live PATCH + tests). +- [ ] E2 #335 spot check passes: exhausted fallback → 502 `AGENT_FALLBACK_EXHAUSTED` + with classified `failures[]` (committed route test on fresh DB + optional live probe). +- [ ] E3 #332 spot check passes: LAN-HTTP page load completes a run without + white-screen; `safeRandomUUID` vitest green. +- [ ] E4 #268 spot check passes: `ModelFamily` imports from + `app.shared.model_taxonomy`; zero lazy-import NOTEs reference the old + registry↔forecasting cycle; alembic cold-boot clean (from the fresh-stack step). +- [ ] E5 #237 spot check passes: seeded grain → train `regression` → price-cut + simulate → `method == "model_exogenous"` and `units_delta != 0.0` + (committed integration test + optional live curl chain). +- [ ] All five validation gates green on `dev`. +- [ ] Evidence comment posted on #387; all satisfied checkboxes ticked on #380; + #380 closed with close-out comment; #387 closed. + +## All Needed Context + +### Documentation & References + +```yaml +# ── The gate's contract ────────────────────────────────────────────────────── +- issue: "#387 — gh issue view 387" + why: The epic's sub-task list and exit criteria this PRP encodes verbatim. + +- issue: "#380 — gh issue view 380" + why: Umbrella. Success-criteria checklist — the LAST unchecked item + ("showcase_rich demo pipeline runs green end-to-end after E6") gets ticked + here; then the issue is closed with a close-out comment. + +# ── showcase_rich pipeline (what 'green' means) ───────────────────────────── +- file: app/features/demo/pipeline.py + why: "_phase_table() (~line 2464) is the step registry. 
showcase_rich = 24 steps / + 10 phases: data(7: precheck, reset, seed, status, features, phase2_enrichment, + historical_backfill), modeling(2: train, v2_train), decision(5: backtest, + register, champion_compat_compare, stale_alias_trigger, safer_promote_flow), + portfolio(1: batch_preset), planning(2: scenario_simulate_and_save, + multi_plan_compare), knowledge(3: embedding_provider_probe, rag_index_subset, + rag_retrieve_probe), verify(1), agents(1: agent_hitl_flow), ops(1: + ops_snapshot), cleanup(1). READ-ONLY." + +- file: docs/_base/RUNBOOKS.md + why: "'Showcase page (/showcase) pipeline fails at step X' — items 1–27 are the + per-step diagnosis table. Defines which skips/warns are ACCEPTABLE on a green + run (see Known Gotchas below). Consult before treating any non-✅ as failure." + +- file: docs/_base/API_CONTRACTS.md + why: "WS /demo/stream contract — start frame, StepEvent shape, pipeline_complete + fields (winner_model_type, winner_wape, winning_run_id, alias, wall_clock_s, + v2_run_id). The headless fallback path drives this directly." + +- file: frontend/src/pages/showcase.tsx + why: "UI controls and their request mapping (~line 110-115): + start({ seed: 42, skip_seed: !reseed, reset: resetDb, scenario }). + 'Re-seed first' checkbox → skip_seed=false. 'Reset database' → reset=true. + ScenarioPicker carries demo_minimal | showcase_rich | sparse." + +# ── E1 spot-check surface (#334) ───────────────────────────────────────────── +- file: app/core/config.py + why: "validate_model_identifier (line 20) — rejects nested provider prefix + ('google-gla:google-gla:…') with the 'Did you mean' ValueError; ollama + multi-colon tags stay valid. Settings.agent_default_model (192) / + agent_fallback_model (193) field_validator at line 231. READ-ONLY." + +- file: app/features/config/tests/test_routes.py + why: "test_patch_rejects_doubled_provider_prefix (line 120) — the live-route 422 + regression test to re-run." 
+ +- file: app/features/config/tests/test_schemas.py + why: "test_rejects_doubled_provider_prefix (55), test_rejects_mixed_provider_prefix + (60), test_rejects_doubled_prefix_via_model_validate (134)." + +- file: app/features/agents/tests/test_config_validation.py + why: "test_doubled_prefix_rejected_at_settings_boot (line 41) — the Settings-boot + validation path." + +# ── E2 spot-check surface (#335) ───────────────────────────────────────────── +- file: app/core/exceptions.py + why: "AgentFallbackExhaustedError → 502 problem+json, code=AGENT_FALLBACK_EXHAUSTED, + type=…/errors/agent-fallback-exhausted, failures[] extension (line ~272)." + +- file: app/features/agents/service.py + why: "chat fallback-exhausted path (~line 316) and stream path (~line 717, + error_type='fallback_exhausted', recoverable=true). READ-ONLY." + +- file: app/features/agents/tests/test_routes.py + why: "TestChatRoutes (integration-marked) :: + test_chat_fallback_exhausted_returns_502_problem_json (line 167) — asserts 502, + code, two classified failures (model_not_found + quota_exhausted), secret + scrubbing. This is the committed ≥2-failure-leg proof; re-run it." + +# ── E3 spot-check surface (#332) ───────────────────────────────────────────── +- file: frontend/src/lib/uuid-utils.ts + why: "safeRandomUUID — crypto.randomUUID → getRandomValues-v4 → Math.random-v4 + fallback chain." + +- file: frontend/src/lib/uuid-utils.test.ts + why: "vitest incl. the explicit 'LAN-HTTP shape' case (randomUUID undefined). + Run: cd frontend && pnpm test --run src/lib/uuid-utils.test.ts" + +- file: frontend/eslint.config.js + why: "no-restricted-properties guard (~lines 30-44) banning raw crypto.randomUUID." + +# ── E4 spot-check surface (#268) ───────────────────────────────────────────── +- file: app/shared/model_taxonomy.py + why: "Exports ModelFamily (str Enum: BASELINE/TREE/ADDITIVE) + model_family_for + + _MODEL_FAMILY_MAP. Module docstring documents the resolved cycle. READ-ONLY." 
+ +- file: docs/_base/ARCHITECTURE.md + why: "'Cross-slice read-only import pattern' section — records #268 as RESOLVED; + the ONLY legitimately remaining lazy pair is forecasting↔jobs." + +# ── E5 spot-check surface (#237) ───────────────────────────────────────────── +- file: app/features/scenarios/tests/test_routes_integration.py + why: "TestModelExogenousOnSeededData::test_seeded_train_simulate_price_cut_moves_demand + (line 480) — THE committed end-to-end proof: seeded elastic grain → train + regression → simulate -20% price cut → method=='model_exogenous' && + units_delta != 0.0. Re-run it on the fresh DB. Also shows the exact live-curl + request bodies (train: lines 486-496, simulate: lines 503-516)." + +- file: PRPs/PRP-reliability-E5-model-exogenous-price-inertia.md + why: "The E5 verdict + fix narrative — context for interpreting a failure here + (seeder coupling flag RetailPatternConfig.price_sales_coupling=True)." + +# ── Close-out mechanics ────────────────────────────────────────────────────── +- file: .claude/rules/umbrella-issue.md + why: "Write discipline for gh mutations: dry-run echo → idempotent check → + approval gate → confirm. Applies to the #380 body edit + closes." + +- file: .claude/rules/output-formatting.md + why: "Evidence comment format: emoji status indicators, box separators, ≤40 lines." 
+``` + +### Current Codebase tree (verification-relevant subset) + +```bash +app/core/config.py # validate_model_identifier (E1) +app/core/exceptions.py # AGENT_FALLBACK_EXHAUSTED (E2) +app/shared/model_taxonomy.py # ModelFamily home (E4) +app/features/demo/pipeline.py # _phase_table — 24-step showcase_rich registry +app/features/config/tests/ # E1 regression tests +app/features/agents/tests/ # E2 route test (integration), E1 boot test +app/features/scenarios/tests/test_routes_integration.py # E5 e2e test (integration) +frontend/src/lib/uuid-utils.{ts,test.ts} # E3 +frontend/src/pages/showcase.tsx # dogfood entry point +scripts/run_demo.py # legacy CLI pipeline (NOT the dogfood target) +``` + +### Desired Codebase tree + +```bash +PRPs/PRP-reliability-E6-release-gate.md # this file — the ONLY tracked change +# No app/, frontend/, alembic/, or docs/_base/ source change is in scope. +``` + +### Known Gotchas & Environment Quirks + +```python +# ── STOP RULE (governs the whole epic) ─────────────────────────────────────── +# If ANY spot check or the dogfood FAILS: capture evidence (response body / +# screenshot / log excerpt), open a NEW fix issue referencing #380 + the failed +# epic issue, comment the failure on #387, and STOP. The release gate never +# fixes forward — a fix is a new branch/PR through the normal flow, and the +# gate re-runs after it merges. + +# ── Fresh stack / processes ────────────────────────────────────────────────── +# GOTCHA: a stale uvicorn from a prior session can hold :8123 — curl then hits +# OLD code while you think you're testing dev. Before starting the backend: +# lsof -iTCP:8123 -sTCP:LISTEN # kill any stale PID first +# GOTCHA: `docker compose down -v` ERASES the DB incl. RAG corpus and app_config +# runtime overrides (agent model settings revert to .env values on next boot). +# That's desired here (clean gate), but means: re-check GET /config/ai after boot. 
+# GOTCHA: run the BACKEND AS LOCAL UVICORN (uv run uvicorn app.main:app --port +# 8123), NOT the compose backend container — model artifacts must land on the +# host filesystem for verify/feature-metadata steps, and the docker-compose.yml +# default brings up Postgres only on :5433 anyway. +# GOTCHA: pnpm 11 depsStatusCheck can stall `pnpm dev` — start Vite directly: +# cd frontend && ./node_modules/.bin/vite --host 0.0.0.0 + +# ── Dogfood / browser ──────────────────────────────────────────────────────── +# CRITICAL (E3): crypto.randomUUID is undefined only in NON-SECURE contexts. +# http://localhost:5173 IS a secure context — it cannot reproduce #332. Load the +# page via a real LAN IP: http://$(hostname -I | awk '{print $1}'):5173/showcase. +# frontend/.env VITE_API_BASE_URL=http://localhost:8123 still works when browsing +# from this same host (the browser resolves localhost locally), and the backend +# CORS dev regex already allows 10.x/192.168.x/172.16-31.x origins. +# GOTCHA: Playwright MCP and `playwright install` both fail on this host. Use +# native Python Playwright with executable_path="/snap/bin/chromium", or the +# agent-browser skill. Verify the chromium path exists before relying on it. 
+# ACCEPTABLE NON-GREEN STEPS on showcase_rich (RUNBOOKS items 9-26): per #387, +# "provider-dependent steps may ⏭️ skip per RUNBOOKS, pipeline still green": +# - agent_hitl_flow ⏭️ — no key for agent_default_model provider / approval +# timeout / model didn't call save_scenario (known recurring skip on this host) +# - rag_index_subset / rag_retrieve_probe ⏭️ — embedding provider unreachable +# or rejected credentials (#329); embedding_provider_probe ✅ even when +# reachable=False +# - verify ⏭️ — expected on a prophet_like (V2) winner: artifact roots differ +# - champion_compat_compare / safer_promote_flow ⏭️ — missing V2 run or V1 +# baseline (should NOT happen with Re-seed first ticked — investigate if hit) +# - batch_preset ⚠️ — 90 s poll timeout on a loaded laptop (non-fatal) +# - ops_snapshot ⚠️ — /ops/* unavailable (warn, never fail) +# ANY ❌ step = gate failure → STOP RULE. +# GOTCHA: only one pipeline runs at a time (module asyncio.Lock); a second start +# gets one `error` event / POST gets 409. Stop button releases the lock in ~5 s. +# Wall-clock: budget ~5-10 min for showcase_rich on this laptop; per-step HTTP +# timeout is 120 s, batch poll 90 s, HITL approval 90 s. + +# ── Spot-check mechanics ───────────────────────────────────────────────────── +# E2 (integration test): TestChatRoutes is @pytest.mark.integration — needs the +# compose Postgres up + migrations applied. Run TARGETED tests, NOT the full +# integration suite: the full suite is known to pollute shared DB state mid-run +# (destructive seeder tests) and produce false negatives. Run the E2 + E5 tests +# individually, E5 BEFORE anything that mutates seeded data, or on a fresh DB. +# E2 (optional live probe): PATCH /config/ai persists overrides to app_config +# AND applies live. 
To provoke real exhaustion: GET /config/ai (record current +# agent_default_model/agent_fallback_model), PATCH both to +# "ollama:nonexistent-model-e6" (valid format — passes E1 validation; Ollama at +# localhost:11434 returns 404 → reason="model_not_found"), create session, chat, +# expect 502; then PATCH the recorded values BACK. NEVER leave the override in +# place — it would break the showcase agent step on the next run. +# E5 (live curl variant): the /scenarios/* run_id is the ARTIFACT KEY parsed +# from TrainResponse.model_path ("model_{key}.joblib" → stem minus "model_"), +# NOT the registry model_run.run_id. Different ID spaces. +# E5 (live curl variant): the seeder does NOT reset Postgres ID sequences — +# discover real store/product IDs + date window via GET /dimensions/stores, +# GET /dimensions/products, and the seeded calendar range; never assume id=1. +# (Fresh `down -v` stack makes IDs 1-based again, but discover anyway.) +# E4: the ONLY remaining lazy-import NOTE in app/ must be the forecasting↔jobs +# pair (app/features/forecasting/service.py:~1050). Anything mentioning a +# ModelFamily / registry↔forecasting cycle = E4 regression → STOP RULE. + +# ── Validation gates / frontend ────────────────────────────────────────────── +# GOTCHA: `pnpm tsc --noEmit` is VACUOUS here (solution-style tsconfig checks 0 +# files) and `tsc -b` has known pre-existing failures on dev — frontend +# type-check is NOT one of this gate's five criteria. Frontend evidence = the +# uuid-utils vitest + the browser dogfood. +# The five gates (#387 wording): ruff check, ruff format --check, mypy app/, +# pyright app/, pytest -m "not integration". +# GOTCHA: app/core/tests/test_config.py settings tests can fail if they pick up +# the local .env — known issue, fixed via Settings(_env_file=None) in the tests +# already; if a gate failure looks like .env-bleed, see RUNBOOKS before STOPping. 
+ +# ── GitHub close-out ───────────────────────────────────────────────────────── +# Write discipline (.claude/rules/umbrella-issue.md): echo each gh mutation +# before running it. +# DRIFT WARNING (verified 2026-06-12): #380's live body has ALL 12 checkboxes +# unticked — the five per-epic success criteria were never ticked when E1-E5 +# closed, and the E6 Decomposition line still says "not yet created". Closing +# the umbrella with unticked boxes contradicts umbrella-issue.md ("checkbox list +# an outside reviewer uses as the close-or-not decision"). So: tick EVERY +# satisfied box (5 success criteria + 5 E1-E5 decomposition lines + the final +# showcase_rich criterion + the E6 line), and update the E6 line's "not yet +# created" → "#387". Do NOT pattern-match checkbox text literally — the live +# body contains backticks (`showcase_rich`) the issue text elsewhere omits; +# fetch with `gh issue view 380 --json body`, edit the markdown, push back via +# `gh issue edit 380 --body-file`. Preserve everything else byte-identical — +# the body carries an HTML provenance comment. +# Close order: evidence comment on #387 → tick #380 → close #380 (comment links +# #387 evidence) → close #387 last (it's the epic doing the closing). +``` + +## Implementation Blueprint + +### Data models and structure + +None. This epic ships zero schemas, zero migrations, zero source changes. + +### List of tasks in execution order + +```yaml +Task 0 — Preflight: + VERIFY branch: git switch dev && git pull → clean, up to date with origin/dev. + VERIFY no stale server: lsof -iTCP:8123 -sTCP:LISTEN → kill stale PIDs. + VERIFY chromium for dogfood: ls /snap/bin/chromium (else plan agent-browser skill). + RECORD: git rev-parse HEAD → the SHA all evidence refers to. 
+ +Task 1 — Fresh stack (E4 cold-boot proof rides along): + RUN: docker compose down -v + RUN: docker compose up -d # Postgres+pgvector on :5433 + RUN: uv run alembic upgrade head # MUST exit 0 on the EMPTY db — E4 evidence + RUN: uv run python scripts/check_db.py # connectivity sanity + START backend: uv run uvicorn app.main:app --port 8123 (background, log to file) + VERIFY: curl -s http://localhost:8123/health → {"status":"ok"} + START frontend: cd frontend && ./node_modules/.bin/vite --host 0.0.0.0 (background) + VERIFY: curl -sI http://localhost:5173 → 200. + +Task 2 — showcase_rich dogfood over LAN origin (primary deliverable; covers E3): + DISCOVER LAN IP: hostname -I | awk '{print $1}' + DRIVE BROWSER (native Python Playwright, executable_path=/snap/bin/chromium): + - goto http://:5173/showcase # NON-secure context — E3 surface + - assert page renders (no white-screen), zero console errors mentioning + randomUUID / crypto + - select scenario "showcase_rich"; tick "Re-seed first" + (→ {seed:42, skip_seed:false, reset:false, scenario:"showcase_rich"}) + - click Run; poll up to ~10 min for the completion banner + - if the HITL step card shows an Approve button within its 90 s window, + click it (a ⏭️ skip on agent_hitl_flow is acceptable per RUNBOOKS 23-25) + - capture: full-page screenshot + the per-step status list (24 rows) + ASSERT: pipeline green — every step ✅/⏭️/⚠️ per the acceptable-list in Known + Gotchas; zero ❌. Record winner_model_type / winner_wape / v2_run_id from the + summary if surfaced. + FALLBACK (only if browser automation is unusable): drive WS /demo/stream + headlessly with start frame {"seed":42,"reset":false,"skip_seed":false, + "scenario":"showcase_rich"}, assert pipeline_complete + zero fail events — + THEN still do a LAN-origin page load + one demo_minimal UI run for E3. + ON ANY ❌ STEP: STOP RULE (RUNBOOKS items 1-27 give the diagnosis per step). 
+ +Task 3 — E1 #334 spot check (doubled provider prefix → 422): + LIVE: curl -s -o /dev/null -w '%{http_code}' -X PATCH \ + http://localhost:8123/config/ai -H 'Content-Type: application/json' \ + -d '{"agent_default_model":"google-gla:google-gla:gemini-2.0-flash"}' + → expect 422; re-run without -o to capture the problem+json body + (RFC 7807, mentions nested provider prefix). NOTE: a 422 means nothing + was persisted — no restore needed. + TESTS: uv run pytest \ + app/features/config/tests/test_schemas.py \ + app/features/config/tests/test_routes.py::TestUpdateAIConfig \ + "app/features/agents/tests/test_config_validation.py::TestModelIdentifierValidation::test_doubled_prefix_rejected_at_settings_boot" \ + -v -k "doubled or mixed or prefix" + +Task 4 — E2 #335 spot check (fallback exhaustion classified): + TEST (the committed ≥2-leg proof; integration-marked, fresh DB is up): + uv run pytest "app/features/agents/tests/test_routes.py::TestChatRoutes::test_chat_fallback_exhausted_returns_502_problem_json" -v -m integration + OPTIONAL LIVE PROBE (only if Ollama responds on localhost:11434): + - GET /config/ai → record agent_default_model + agent_fallback_model + - PATCH /config/ai {"agent_default_model":"ollama:nonexistent-model-e6", + "agent_fallback_model":"ollama:nonexistent-model-e6"} + - POST /agents/sessions {"agent_type":"experiment"} → session_id + - POST /agents/sessions/{id}/chat {"message":"hello"} + → expect 502 application/problem+json, code=AGENT_FALLBACK_EXHAUSTED, + failures[] with reason model_not_found, no secret values in body + - DELETE the session; PATCH /config/ai back to the recorded values; GET to + confirm restore. (MANDATORY restore — see Known Gotchas.) 
+ +Task 5 — E4 #268 spot check (taxonomy home + no stale cycle NOTEs): + RUN: uv run python -c "from app.shared.model_taxonomy import ModelFamily, model_family_for; print(model_family_for('regression'), model_family_for('prophet_like'), model_family_for('naive'))" + → "ModelFamily.TREE ModelFamily.ADDITIVE ModelFamily.BASELINE" + RUN: grep -rn "ModelFamily" app/ --include="*.py" | grep -v "model_taxonomy" \ + | grep -iE "lazy|cycle|circular|NOTE" → MUST be empty + RUN: grep -rn "NOTE" app/ --include="*.py" | grep -iE "lazy|cycle|circular" + → ONLY the forecasting↔jobs pair (app/features/forecasting/service.py). + EVIDENCE: alembic cold-boot already proven in Task 1 (upgrade head on empty DB). + +Task 6 — E5 #237 spot check (price cut moves model_exogenous demand): + TEST (the committed e2e proof; run BEFORE anything further mutates seed data — + Task 2's run is fine, the test seeds its own isolated grain and cleans up): + uv run pytest "app/features/scenarios/tests/test_routes_integration.py::TestModelExogenousOnSeededData::test_seeded_train_simulate_price_cut_moves_demand" -v -m integration + OPTIONAL LIVE CURL CHAIN (mirrors the test, against the showcase-seeded data): + - GET /dimensions/stores + /dimensions/products → pick a real (store_id, + product_id) with sales (never assume id=1) + - POST /forecasting/train {"store_id":S,"product_id":P, + "train_start_date":"","train_end_date":"", + "config":{"model_type":"regression"}} → 200; model_path + - run_id = basename(model_path) minus "model_" prefix minus ".joblib" + - POST /scenarios/simulate {"run_id":run_id,"horizon":14,"assumptions": + {"price":{"change_pct":-0.20,"start_date":"","end_date":""}}} + → 200, method=="model_exogenous", units_delta != 0.0 + +Task 7 — Five validation gates on dev: + RUN: uv run ruff check . && uv run ruff format --check . 
+ RUN: uv run mypy app/ && uv run pyright app/ + RUN: uv run pytest -v -m "not integration" + PLUS frontend E3 unit evidence: cd frontend && pnpm test --run src/lib/uuid-utils.test.ts + ALL must pass. A failure here on untouched dev = regression → STOP RULE. + +Task 8 — Evidence + close-out (gh write discipline: echo each command first): + COMMIT this PRP file FIRST (before any close): branch docs/reliability-e6-prp + off dev, `docs(repo): track reliability E6 prp (#387)`, PR into dev (E5 + precedent: commit 82300eb). NOTE: the PR needs 1 approving review + CI — + it will NOT merge autonomously; opening it is enough to proceed, the merge + lands through the normal flow. + COMMENT on #387: evidence block per .claude/rules/output-formatting.md — + HEAD SHA, fresh-stack proof, dogfood result table (24 steps with ✅/⏭️/⚠️ and + skip reasons), the five spot-check results with the exact commands run, + gate results, screenshot attached or path referenced. + EDIT #380 body (see DRIFT WARNING in Known Gotchas): tick ALL satisfied + checkboxes — the 5 per-epic success criteria, the 5 E1-E5 Decomposition + lines, the E6 Decomposition line (updating "not yet created" → "#387"), and + the final "...showcase_rich demo pipeline runs green end-to-end after E6" + criterion. Preserve everything else byte-identical. + CLOSE #380: gh issue close 380 --comment "" + CLOSE #387: gh issue close 387 --comment "" + +Task 9 — Teardown: + STOP the background uvicorn + vite processes started in Task 1. + LEAVE the seeded DB in place (operator-visible artefacts are fine post-gate). 
+``` + +### Integration Points + +```yaml +GITHUB: + - issue #387: evidence comment + close + - issue #380: body checkbox tick + close-out comment + close + - PR: docs(repo) commit of this PRP file into dev + +RUNTIME (no code integration — consumers only): + - docker compose Postgres :5433, local uvicorn :8123, Vite :5173 (LAN-bound) + - Ollama localhost:11434 (optional, E2 live probe + agent/knowledge steps) +``` + +## Validation Loop + +### Level 1 — environment sanity (before anything else) + +```bash +git -C . status --short && git rev-parse --abbrev-ref HEAD # dev, clean +lsof -iTCP:8123 -sTCP:LISTEN # must be empty +docker compose ps # postgres healthy +curl -s http://localhost:8123/health # {"status":"ok"} after Task 1 +``` + +### Level 2 — targeted regression tests (the per-epic committed proofs) + +```bash +# E1 +uv run pytest app/features/config/tests/ app/features/agents/tests/test_config_validation.py -v -k "doubled or mixed or prefix" +# E2 (integration — fresh DB) +uv run pytest "app/features/agents/tests/test_routes.py::TestChatRoutes::test_chat_fallback_exhausted_returns_502_problem_json" -v -m integration +# E3 +cd frontend && pnpm test --run src/lib/uuid-utils.test.ts && cd .. +# E4 +uv run python -c "from app.shared.model_taxonomy import ModelFamily, model_family_for; print(model_family_for('regression'))" +# E5 (integration — self-seeding, self-cleaning) +uv run pytest "app/features/scenarios/tests/test_routes_integration.py::TestModelExogenousOnSeededData::test_seeded_train_simulate_price_cut_moves_demand" -v -m integration +``` + +### Level 3 — live system (dogfood + probes) + +```bash +# Dogfood: browser at http://:5173/showcase, scenario=showcase_rich, +# Re-seed first ticked → green pipeline, screenshot captured. (Task 2.) 
+ +# E1 live: +curl -s -X PATCH http://localhost:8123/config/ai -H 'Content-Type: application/json' \ + -d '{"agent_default_model":"google-gla:google-gla:gemini-2.0-flash"}' | head -c 400 +# → 422 problem+json mentioning the nested provider prefix + +# E5 live: train→simulate chain per Task 6 (IDs discovered, never assumed). +``` + +### Level 4 — repo gates + +```bash +uv run ruff check . && uv run ruff format --check . +uv run mypy app/ && uv run pyright app/ +uv run pytest -v -m "not integration" +``` + +## Final validation Checklist + +- [ ] Fresh stack: `down -v` → `up -d` → `alembic upgrade head` clean (E4 cold-boot) +- [ ] showcase_rich dogfood: 24 steps / 10 phases, zero ❌, over plain-HTTP LAN + origin, screenshot + step table captured (E3 white-screen proof included) +- [ ] E1: live PATCH → 422; doubled/mixed-prefix tests green +- [ ] E2: `test_chat_fallback_exhausted_returns_502_problem_json` green + (+ optional live 502 probe, config RESTORED afterwards) +- [ ] E3: uuid-utils vitest green; LAN page load clean +- [ ] E4: taxonomy import one-liner correct; zero stale cycle NOTEs + (only forecasting↔jobs remains) +- [ ] E5: `test_seeded_train_simulate_price_cut_moves_demand` green + (+ optional live chain: method=model_exogenous, units_delta != 0.0) +- [ ] Five gates green: ruff, format, mypy, pyright, unit pytest +- [ ] Evidence comment on #387; #380 checkbox ticked; #380 closed; #387 closed +- [ ] This PRP committed via `docs(repo): track reliability E6 prp (#387)` +- [ ] Background servers stopped; no config overrides left in app_config + +--- + +## Anti-Patterns to Avoid + +- ❌ Don't fix forward inside the gate — a failed check files a new issue and STOPS +- ❌ Don't treat a RUNBOOKS-sanctioned ⏭️/⚠️ as failure — but don't hand-wave a ❌ either +- ❌ Don't verify E3 on localhost — it's a secure context; #332 only manifests on LAN IP +- ❌ Don't run the FULL integration suite as a gate — known shared-state pollution; + run the targeted tests listed above +- ❌ 
Don't leave `ollama:nonexistent-model-e6` (or any probe override) in app_config +- ❌ Don't assume store/product IDs or date windows — discover via /dimensions/* +- ❌ Don't rewrite #380's body beyond ticking satisfied checkboxes + the E6 line update +- ❌ Don't `gh pr merge --merge` anything dev→main here — this epic ends at `dev`; + the release cut is a separate decision (stop-and-ask gate) + +## Confidence Score: 8.5/10 + +One-pass success likelihood is high: every spot check maps to a committed, +named regression test plus an exact live command; the dogfood path, acceptable +skip list, and environment traps (stale uvicorn, LAN secure-context, ID +discovery, config restore) are all pinned with file:line grounding. Residual +risk (−1.5): the showcase_rich browser run has non-deterministic legs +(agent_hitl_flow, provider reachability, batch timing on a loaded laptop) that +may force a re-run or RUNBOOKS triage, and host browser automation has a known +fragile setup (snap chromium path). From 62a2463cde67ef5142a7167d21e041f5c4da2669 Mon Sep 17 00:00:00 2001 From: Gabor Szabo Date: Fri, 12 Jun 2026 12:59:56 +0200 Subject: [PATCH 2/2] docs(repo): address review wording nits on e6 prp (#387) --- PRPs/PRP-reliability-E6-release-gate.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/PRPs/PRP-reliability-E6-release-gate.md b/PRPs/PRP-reliability-E6-release-gate.md index 3cc5344d..834c9033 100644 --- a/PRPs/PRP-reliability-E6-release-gate.md +++ b/PRPs/PRP-reliability-E6-release-gate.md @@ -1,8 +1,8 @@ name: "PRP reliability-E6 — release gate: showcase_rich dogfood + per-epic spot checks + umbrella close-out" description: | Issue #387 (epic E6 of umbrella #380, milestone reliability-hardening). - Release-gate epic: NO new production code. The deliverable is executed - verification — a green end-to-end showcase_rich dogfood run on a fresh stack, + Release-gate epic: NO new production code. 
The deliverable is an executed + verification: a green end-to-end showcase_rich dogfood run on a fresh stack, one live spot check per closed reliability epic (E1 #334, E2 #335, E3 #332, E4 #268, E5 #237), all five validation gates green on dev, evidence recorded on #387, and umbrella #380 closed. If any check fails, the gate STOPS and @@ -442,7 +442,7 @@ Task 8 — Evidence + close-out (gh write discipline: echo each command first): Task 9 — Teardown: STOP the background uvicorn + vite processes started in Task 1. - LEAVE the seeded DB in place (operator-visible artefacts are fine post-gate). + LEAVE the seeded DB in place (operator-visible artifacts are fine post-gate). ``` ### Integration Points
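Two rules in the PRP above are mechanical enough to sketch in code: Task 6 derives the `/scenarios/*` `run_id` from `TrainResponse.model_path` (`"model_{key}.joblib"`, stem minus `"model_"`), and Task 2 calls the pipeline green only when no step is ❌. The following is a minimal sketch under stated assumptions, not code from the repository: the helper names `artifact_key_from_model_path` and `gate_verdict`, the POSIX-style path handling, and the dict-of-glyphs input shape are all hypothetical.

```python
from pathlib import PurePosixPath

# Status glyphs as used in this PRP's step table.
PASS, SKIP, WARN, FAIL = "✅", "⏭️", "⚠️", "❌"


def artifact_key_from_model_path(model_path: str) -> str:
    """Derive the /scenarios run_id (artifact key) from a model_path.

    Assumes a POSIX-style path whose file name has the shape
    "model_{key}.joblib", as Task 6 describes. Any other shape is
    rejected rather than guessed at.
    """
    name = PurePosixPath(model_path).name
    prefix, suffix = "model_", ".joblib"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected artifact name: {name!r}")
    key = name[len(prefix):-len(suffix)]
    if not key:
        raise ValueError(f"empty artifact key in: {name!r}")
    return key


def gate_verdict(step_statuses: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (green, failed_steps); green only when no step is FAIL.

    This checks the "zero ❌" rule only. Whether a given skip or warn
    is RUNBOOKS-sanctioned still needs a human (or a separate check).
    """
    failed = [step for step, status in step_statuses.items() if status == FAIL]
    return (not failed, failed)
```

For example, `artifact_key_from_model_path("artifacts/models/model_abc123.joblib")` (a made-up path) returns `"abc123"`, and a status map containing a single ❌ step yields `(False, [<that step>])`, which is the point at which the STOP RULE applies.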