fix(agents): constrain experiment read-only queries by w7-mgfcode · Pull Request #348 · w7-mgfcode/ForecastLabAI

w7-mgfcode · 2026-06-01T02:10:40Z

Root cause

A manual Chat UI test against the Experiment Agent ("List the most recent model runs and tell me which has the lowest WAPE.") derailed into an unrelated what-if proposal ("Proposed what-if for store 123 / product 456 … Cut price 15% …").

Trace (session 0bba65bc…, ollama:llama3.1:8b primary / ollama:qwen3:8b fallback):

The model correctly called tool_list_runs({"status":"success"}) and got the run data including WAPE.
It then emitted malformed structured output missing ExperimentReport.summary (a required field).
PydanticAI issued an output-format validation retry.
On the retry the weak 8B model derailed and called tool_propose_scenario with hallucinated store_id=123/product_id=456, then summarized that proposal.

So the model selected the right tool but failed the structured-output step, and on the validation retry grabbed an unrelated scenario tool. Two gaps enabled it: no guard telling the model to keep read-only intents on read tools (and to treat a validation retry as a reformat, not a new action), and tool_propose_scenario happily accepting non-existent entity IDs.

Fix summary

Generalized read-only intent guard (READ_ONLY_INTENT_GUARD in agents/base.py, embedded in EXPERIMENT_SYSTEM_PROMPT). It instructs the model that for read-only intents — listing/ranking stores or products, top products, sales/revenue/units summaries, forecast summaries, model-run & metric comparisons (WAPE/MAE/RMSE), registry aliases & deployment status, backtest metrics, RAG/document questions — it must:
- use only read-only tools (tool_list_runs, tool_get_run, tool_compare_runs, tool_compare_backtest_results);
- never call tool_propose_scenario, tool_save_scenario, tool_create_alias, tool_archive_run, or tool_run_backtest unless the user explicitly asks to create/save/promote/archive/run/experiment;
- on an output-format validation retry, not call new tools — only reformat the prior result into a valid ExperimentReport with a concise summary;
- ask a clarifying question for an ambiguous ranking ("Top by revenue, units sold, forecasted demand, or model error?");
- state plainly when no read-only tool exists for a metric instead of inventing data;
- never invent store_id/product_id/run_id.
Secondary validation gap fixed: propose_scenario now probes the store/product dimension tables and returns a non-persistable validation error ({"valid": false, "persistable": false, "missing": [...]}) for a non-existent grain — the hallucinated 123/456 case can never return a normal, save-able proposal. Read-only; persists nothing.

No new tool surfaces were added, no mutation surfaces widened, the HITL approval gates and the #344 pending_action salvage are untouched, and no public API contracts changed.

Read-only tool coverage findings

Already available to the experiment agent (read-only): tool_list_runs, tool_get_run, tool_compare_runs (incl. WAPE/MAE/RMSE via metrics), tool_compare_backtest_results. The original WAPE query was fully answerable with these — the failure was structured-output formatting, not missing tools.

Missing read-only coverage (guard handles by stating the limitation / asking clarification — not adding tools, per scope):

registry aliases / deployment status — no list-aliases tool exposed to the agent
product/store rankings ("top products") — no analytics/dimensions tool exposed
sales / revenue / units summaries — none exposed
forecast summaries / highest forecasted demand — none exposed

These are candidates for a future additive read-only tool PR (see follow-up).

Tests added

app/features/agents/tests/test_read_only_guard.py — deterministic, no live model calls:
- guard is embedded in the experiment prompt and delivered in the system prompt to the model (captured via a stub FunctionModel);
- guard names every prohibited tool and the allowed read tools;
- guard forbids new tool calls on a validation retry (incl. the exact summary: Field required trigger);
- guard requires the "top products" clarification question and prohibits invented IDs;
- parametrized coverage over the exact WAPE prompt + "top products" + "highest forecasted demand" + "current deployment alias" + revenue/units + backtest metrics.
app/features/scenarios/tests/test_agent_tools.py — propose_scenario entity validation: existing grain succeeds (unit, mocked DB), non-existent rejected, 123/456 covered explicitly, missing-product-only rejected, and an integration test asserting a non-existent grain is rejected and persists no scenario_plan row. Existing integration propose tests now seed the grain via a new existing_grain fixture.

Validation results

ruff check . → All checks passed
ruff format --check . → 334 files already formatted
mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files (forecasting/models.py, forecasting/persistence.py, registry/service.py); none in changed files
pyright app/ → same pre-existing optional-extra import errors only; 0 errors in changed files
pytest -m "not integration" → 1690 passed, 12 skipped
pytest -m "not integration" app/features/agents app/features/scenarios → 218 passed

Notes

No live model/agent calls were required to reproduce or to test — the investigation used the persisted DB message history, and all new tests are deterministic.
Integration tests for the scenarios slice were not executed here (they need docker compose up Postgres + seeded dimensions); they are structured to pass in CI.

Known limitations / follow-up

The guard is a prompt-level mitigation; on very weak local models it reduces but cannot fully eliminate derailment. Pairing with a stronger model for structured-output analytical queries is recommended (out of scope here).
Missing read-only tools (aliases/deployment, product/store rankings, sales/forecast summaries) — recommend a future additive read-only tool PR so the agent can answer those directly instead of stating a limitation.

Closes #347

sourcery-ai

Sorry @w7-mgfcode, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

coderabbitai · 2026-06-01T02:10:46Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec626987-e867-4b0b-a963-3ae5dc62285d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/agents-read-only-query-guard

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

w7-mgfcode added 2 commits June 1, 2026 04:09

fix(agents): constrain read-only experiment queries (#347)

ba6da82

fix(agents): validate scenario proposal entity ids (#347)

0e05ec2

sourcery-ai Bot reviewed Jun 1, 2026

View reviewed changes

w7-mgfcode merged commit 79c33c0 into dev Jun 1, 2026
8 checks passed

w7-mgfcode deleted the fix/agents-read-only-query-guard branch June 12, 2026 15:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): constrain experiment read-only queries#348

fix(agents): constrain experiment read-only queries#348
w7-mgfcode merged 2 commits into
devfrom
fix/agents-read-only-query-guard

w7-mgfcode commented Jun 1, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

coderabbitai Bot commented Jun 1, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

w7-mgfcode commented Jun 1, 2026

Root cause

Fix summary

Read-only tool coverage findings

Tests added

Validation results

Notes

Known limitations / follow-up

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 1, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant