fix(agents): salvage experiment answer when weak model fails structured output #351

Description

@w7-mgfcode

Summary

Follow-up to #347/#348 and #349/#350. The Experiment Agent still fails read-only queries on local Ollama models. A full capture_run_messages trace shows that the root cause is not a tool loop the model chooses to enter. It is an incompatibility between structured output (PromptedOutput(ExperimentReport)) and weak 8B models, which no prompt change can fix.

Trace evidence (live model ollama:llama3.1:8b)

msg[1] TOOLCALL tool_list_runs {"status":"success"}   ← real tool call
msg[2] TOOLRETURN → 10 runs incl. WAPE                ← data obtained
msg[3] TEXT: {"runs":[...],"total":10,"page":1}       ← model returns RAW tool data as its "answer"
msg[4] RETRY: summary: Field required                 ← not an ExperimentReport
msg[5] TOOLCALL tool_list_runs {} ... loops 3x → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior
  • The model calls the tool and gets the data, but emits the raw tool-result shape ({"runs":[...]}) as its final output instead of ExperimentReport{summary}. PromptedOutput rejects it (missing summary), and it loops to retry-exhaustion.
  • qwen3:8b is worse: it emits the tool call itself as text ({"tool":"tool_list_runs","arguments":{...}}) and never makes a real call.
  • Neither prompt-based mitigation (the #348 guard, the #349 one-pass rule) can fix this: the model isn't choosing to loop; it can't produce the required structured output while juggling tools.
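The rejection in msg[3]/msg[4] can be illustrated with a minimal stdlib sketch. The ExperimentReport below is a hypothetical stand-in that models only the summary field named in the trace, and parse_report is an illustrative helper, not the project's or the library's actual validator.

```python
import json
from dataclasses import dataclass


@dataclass
class ExperimentReport:
    # Hypothetical stand-in: the trace only tells us that `summary` is required.
    summary: str


def parse_report(raw_text: str) -> ExperimentReport:
    """Parse model text into a report; raise ValueError if `summary` is missing."""
    data = json.loads(raw_text)
    if not isinstance(data, dict) or "summary" not in data:
        # Mirrors the retry message seen in msg[4] of the trace.
        raise ValueError("summary: Field required")
    return ExperimentReport(summary=str(data["summary"]))


# The weak model echoes the tool-result shape instead of a report.
raw_tool_shape = '{"runs": [], "total": 10, "page": 1}'

try:
    parse_report(raw_tool_shape)
    rejected = False
except ValueError:
    rejected = True
```

Every retry that returns the same tool-shaped JSON fails the same check, which is why the run ends in retry exhaustion rather than converging.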

Fix

Add a service-layer graceful finalizer fallback. When an agent run raises UnexpectedModelBehavior ("Exceeded maximum output retries") and there is no pending HITL action to salvage, extract the tool results already obtained during the run (via capture_run_messages) and make ONE tool-less, plain-text (output_type=str) follow-up call to the same model to answer the user's question from that data. Weak models can produce plain text, so this converts the structured-output failure into a correct answer, on Ollama, without a cloud model.
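A minimal sketch of the fallback flow, under stated assumptions: the messages are duck-typed objects whose parts carry part_kind == "tool-return" and a content attribute (which is how I understand pydantic_ai's tool-return parts to look), and the follow-up model call is injected as a plain callable. The names extract_tool_payloads, salvage_answer, GENERIC_ERROR and fake_model are hypothetical, and the real handlers are presumably async.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List

# Hypothetical wording; the real generic error message is not given in this issue.
GENERIC_ERROR = "The agent could not produce an answer. Please try again."


def extract_tool_payloads(messages: Iterable[Any]) -> List[str]:
    """Collect the content of every tool-return part as a string."""
    payloads: List[str] = []
    for message in messages:
        for part in getattr(message, "parts", []):
            if getattr(part, "part_kind", None) == "tool-return":
                content = part.content
                payloads.append(
                    content if isinstance(content, str)
                    else json.dumps(content, default=str)
                )
    return payloads


def salvage_answer(
    question: str,
    messages: Iterable[Any],
    ask_plain_text: Callable[[str], str],
) -> str:
    """Make one tool-less, plain-text follow-up call if tool data was captured."""
    payloads = extract_tool_payloads(messages)
    if not payloads:
        # Nothing to salvage: fall back to the generic error.
        return GENERIC_ERROR
    prompt = (
        "Answer the question using only the tool results below.\n"
        f"Question: {question}\n"
        "Tool results:\n" + "\n".join(payloads)
    )
    return ask_plain_text(prompt)


# Demo with stand-in messages and a fake model (no live model call).
messages = [
    SimpleNamespace(parts=[SimpleNamespace(part_kind="tool-call", content=None)]),
    SimpleNamespace(parts=[SimpleNamespace(
        part_kind="tool-return",
        content={"runs": [{"id": 1, "wape": 0.12}], "total": 1},
    )]),
]
prompts_sent: List[str] = []


def fake_model(prompt: str) -> str:
    prompts_sent.append(prompt)
    return "There is 1 run."


answer = salvage_answer("How many runs succeeded?", messages, fake_model)
```

In the service, ask_plain_text would wrap the single output_type=str call to the same model, and messages would come from the capture_run_messages context active when UnexpectedModelBehavior was raised. Injecting the callable keeps the handler testable without a live model.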

Acceptance

  • Finalizer fallback wired into both chat() and stream_chat() misbehavior handlers.
  • Deterministic unit tests for tool-payload extraction + handler behavior (returns salvaged answer when data exists; generic error when none). No live model calls in tests.

Labels: bug
