fix(agents): constrain experiment read-only queries#348
Conversation
There was a problem hiding this comment.
Sorry @w7-mgfcode, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Root cause
A manual Chat UI test against the Experiment Agent ("List the most recent model runs and tell me which has the lowest WAPE.") derailed into an unrelated what-if proposal ("Proposed what-if for store 123 / product 456 … Cut price 15% …").
Trace (session
0bba65bc…,ollama:llama3.1:8bprimary /ollama:qwen3:8bfallback):tool_list_runs({"status":"success"})and got the run data including WAPE.ExperimentReport.summary(a required field).tool_propose_scenariowith hallucinatedstore_id=123/product_id=456, then summarized that proposal.So the model selected the right tool but failed the structured-output step, and on the validation retry grabbed an unrelated scenario tool. Two gaps enabled it: no guard telling the model to keep read-only intents on read tools (and to treat a validation retry as a reformat, not a new action), and
tool_propose_scenariohappily accepting non-existent entity IDs.Fix summary
READ_ONLY_INTENT_GUARDinagents/base.py, embedded inEXPERIMENT_SYSTEM_PROMPT). It instructs the model that for read-only intents — listing/ranking stores or products, top products, sales/revenue/units summaries, forecast summaries, model-run & metric comparisons (WAPE/MAE/RMSE), registry aliases & deployment status, backtest metrics, RAG/document questions — it must:tool_list_runs,tool_get_run,tool_compare_runs,tool_compare_backtest_results);tool_propose_scenario,tool_save_scenario,tool_create_alias,tool_archive_run, ortool_run_backtestunless the user explicitly asks to create/save/promote/archive/run/experiment;ExperimentReportwith a concisesummary;store_id/product_id/run_id.propose_scenarionow probes thestore/productdimension tables and returns a non-persistable validation error ({"valid": false, "persistable": false, "missing": [...]}) for a non-existent grain — the hallucinated123/456case can never return a normal, save-able proposal. Read-only; persists nothing.No new tool surfaces were added, no mutation surfaces widened, the HITL approval gates and the
#344pending_actionsalvage are untouched, and no public API contracts changed.Read-only tool coverage findings
Already available to the experiment agent (read-only):
tool_list_runs,tool_get_run,tool_compare_runs(incl. WAPE/MAE/RMSE via metrics),tool_compare_backtest_results. The original WAPE query was fully answerable with these — the failure was structured-output formatting, not missing tools.Missing read-only coverage (guard handles by stating the limitation / asking clarification — not adding tools, per scope):
These are candidates for a future additive read-only tool PR (see follow-up).
Tests added
app/features/agents/tests/test_read_only_guard.py— deterministic, no live model calls:FunctionModel);summary: Field requiredtrigger);app/features/scenarios/tests/test_agent_tools.py—propose_scenarioentity validation: existing grain succeeds (unit, mocked DB), non-existent rejected,123/456covered explicitly, missing-product-only rejected, and an integration test asserting a non-existent grain is rejected and persists noscenario_planrow. Existing integration propose tests now seed the grain via a newexisting_grainfixture.Validation results
ruff check .→ All checks passedruff format --check .→ 334 files already formattedmypy app/→ only the pre-existingxgboost/lightgbmoptional-extra import errors in untouched files (forecasting/models.py,forecasting/persistence.py,registry/service.py); none in changed filespyright app/→ same pre-existing optional-extra import errors only; 0 errors in changed filespytest -m "not integration"→ 1690 passed, 12 skippedpytest -m "not integration" app/features/agents app/features/scenarios→ 218 passedNotes
docker compose upPostgres + seeded dimensions); they are structured to pass in CI.Known limitations / follow-up
Closes #347