Skip to content

fix(agents): constrain experiment read-only queries#348

Merged
w7-mgfcode merged 2 commits into
devfrom
fix/agents-read-only-query-guard
Jun 1, 2026
Merged

fix(agents): constrain experiment read-only queries#348
w7-mgfcode merged 2 commits into
devfrom
fix/agents-read-only-query-guard

Conversation

@w7-mgfcode

Copy link
Copy Markdown
Owner

Root cause

A manual Chat UI test against the Experiment Agent ("List the most recent model runs and tell me which has the lowest WAPE.") derailed into an unrelated what-if proposal ("Proposed what-if for store 123 / product 456 … Cut price 15% …").

Trace (session 0bba65bc…, ollama:llama3.1:8b primary / ollama:qwen3:8b fallback):

  1. The model correctly called tool_list_runs({"status":"success"}) and got the run data including WAPE.
  2. It then emitted malformed structured output missing ExperimentReport.summary (a required field).
  3. PydanticAI issued an output-format validation retry.
  4. On the retry the weak 8B model derailed and called tool_propose_scenario with hallucinated store_id=123/product_id=456, then summarized that proposal.

So the model selected the right tool but failed the structured-output step, and on the validation retry grabbed an unrelated scenario tool. Two gaps enabled it: no guard telling the model to keep read-only intents on read tools (and to treat a validation retry as a reformat, not a new action), and tool_propose_scenario happily accepting non-existent entity IDs.

Fix summary

  • Generalized read-only intent guard (READ_ONLY_INTENT_GUARD in agents/base.py, embedded in EXPERIMENT_SYSTEM_PROMPT). It instructs the model that for read-only intents — listing/ranking stores or products, top products, sales/revenue/units summaries, forecast summaries, model-run & metric comparisons (WAPE/MAE/RMSE), registry aliases & deployment status, backtest metrics, RAG/document questions — it must:
    • use only read-only tools (tool_list_runs, tool_get_run, tool_compare_runs, tool_compare_backtest_results);
    • never call tool_propose_scenario, tool_save_scenario, tool_create_alias, tool_archive_run, or tool_run_backtest unless the user explicitly asks to create/save/promote/archive/run/experiment;
    • on an output-format validation retry, not call new tools — only reformat the prior result into a valid ExperimentReport with a concise summary;
    • ask a clarifying question for an ambiguous ranking ("Top by revenue, units sold, forecasted demand, or model error?");
    • state plainly when no read-only tool exists for a metric instead of inventing data;
    • never invent store_id/product_id/run_id.
  • Secondary validation gap fixed: propose_scenario now probes the store/product dimension tables and returns a non-persistable validation error ({"valid": false, "persistable": false, "missing": [...]}) for a non-existent grain — the hallucinated 123/456 case can never return a normal, save-able proposal. Read-only; persists nothing.

No new tool surfaces were added, no mutation surfaces widened, the HITL approval gates and the #344 pending_action salvage are untouched, and no public API contracts changed.

Read-only tool coverage findings

Already available to the experiment agent (read-only): tool_list_runs, tool_get_run, tool_compare_runs (incl. WAPE/MAE/RMSE via metrics), tool_compare_backtest_results. The original WAPE query was fully answerable with these — the failure was structured-output formatting, not missing tools.

Missing read-only coverage (guard handles by stating the limitation / asking clarification — not adding tools, per scope):

  • registry aliases / deployment status — no list-aliases tool exposed to the agent
  • product/store rankings ("top products") — no analytics/dimensions tool exposed
  • sales / revenue / units summaries — none exposed
  • forecast summaries / highest forecasted demand — none exposed

These are candidates for a future additive read-only tool PR (see follow-up).

Tests added

  • app/features/agents/tests/test_read_only_guard.py — deterministic, no live model calls:
    • guard is embedded in the experiment prompt and delivered in the system prompt to the model (captured via a stub FunctionModel);
    • guard names every prohibited tool and the allowed read tools;
    • guard forbids new tool calls on a validation retry (incl. the exact summary: Field required trigger);
    • guard requires the "top products" clarification question and prohibits invented IDs;
    • parametrized coverage over the exact WAPE prompt + "top products" + "highest forecasted demand" + "current deployment alias" + revenue/units + backtest metrics.
  • app/features/scenarios/tests/test_agent_tools.pypropose_scenario entity validation: existing grain succeeds (unit, mocked DB), non-existent rejected, 123/456 covered explicitly, missing-product-only rejected, and an integration test asserting a non-existent grain is rejected and persists no scenario_plan row. Existing integration propose tests now seed the grain via a new existing_grain fixture.

Validation results

  • ruff check . → All checks passed
  • ruff format --check . → 334 files already formatted
  • mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files (forecasting/models.py, forecasting/persistence.py, registry/service.py); none in changed files
  • pyright app/ → same pre-existing optional-extra import errors only; 0 errors in changed files
  • pytest -m "not integration"1690 passed, 12 skipped
  • pytest -m "not integration" app/features/agents app/features/scenarios218 passed

Notes

  • No live model/agent calls were required to reproduce or to test — the investigation used the persisted DB message history, and all new tests are deterministic.
  • Integration tests for the scenarios slice were not executed here (they need docker compose up Postgres + seeded dimensions); they are structured to pass in CI.

Known limitations / follow-up

  • The guard is a prompt-level mitigation; on very weak local models it reduces but cannot fully eliminate derailment. Pairing with a stronger model for structured-output analytical queries is recommended (out of scope here).
  • Missing read-only tools (aliases/deployment, product/store rankings, sales/forecast summaries) — recommend a future additive read-only tool PR so the agent can answer those directly instead of stating a limitation.

Closes #347

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @w7-mgfcode, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec626987-e867-4b0b-a963-3ae5dc62285d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/agents-read-only-query-guard

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant