Summary
Follow-up to #347/#348 and #349/#350. The Experiment Agent still fails read-only queries on local Ollama models. A full capture_run_messages trace shows that the root cause is not a tool loop the model chooses: it is an incompatibility between structured output (PromptedOutput(ExperimentReport)) and weak 8B models, which no prompt change can fix.
Trace evidence (live model ollama:llama3.1:8b)
msg[1] TOOLCALL tool_list_runs {"status":"success"} ← real tool call
msg[2] TOOLRETURN → 10 runs incl. WAPE ← data obtained
msg[3] TEXT: {"runs":[...],"total":10,"page":1} ← model returns RAW tool data as its "answer"
msg[4] RETRY: summary: Field required ← not an ExperimentReport
msg[5] TOOLCALL tool_list_runs {} ... loops 3x → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior
The model calls the tool and gets the data, but then emits the raw tool-result shape ({"runs":[...]}) as its final output instead of ExperimentReport{summary}. PromptedOutput rejects that output because the required summary field is missing, and the run keeps retrying until the output-retry limit (3) is exhausted.
qwen3:8b is worse: it emits the tool call itself as text ({"tool":"tool_list_runs","arguments":{...}}) and never makes a real call.
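The two failure modes in the trace can be told apart mechanically. The sketch below is a stdlib-only stand-in for the schema check, not the project's code: it assumes only that ExperimentReport requires a string `summary` field (as the retry message "summary: Field required" suggests). The function name `classify_final_output` and its labels are hypothetical, and the sample strings are shortened placeholders rather than real model output.

```python
import json


def classify_final_output(text: str) -> str:
    """Label a model's final text by the failure mode it matches, if any."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(data, dict):
        return "not_an_object"
    if isinstance(data.get("summary"), str):
        return "valid_report"        # carries the required field
    if "tool" in data and "arguments" in data:
        return "tool_call_as_text"   # pattern reported for qwen3:8b
    if "runs" in data:
        return "raw_tool_data"       # pattern reported for llama3.1:8b
    return "missing_summary"


# Placeholder payloads shaped like the outputs described above.
llama_output = '{"runs": [], "total": 10, "page": 1}'
qwen_output = '{"tool": "tool_list_runs", "arguments": {"status": "success"}}'
report_output = '{"summary": "Ten successful runs were found."}'

for sample in (llama_output, qwen_output, report_output):
    print(classify_final_output(sample))
```

A classifier like this is only useful for diagnostics and logging. It does not make either failing output acceptable to the structured-output validator.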
Fix
Service-layer graceful finalizer fallback: when an agent run raises UnexpectedModelBehavior ("Exceeded maximum output retries") and there is no pending HITL action to salvage, extract the tool results already obtained during the run (via capture_run_messages) and make ONE tool-less, plain-text (output_type=str) follow-up call to the same model to answer the user's question from that data. Weak models can reliably produce plain text, so this turns the structured-output failure into an answer grounded in the retrieved data, on Ollama, without a cloud model.
Keeps PromptedOutput(ExperimentReport) as the happy path (cloud models unaffected).
No tools on the finalizer → cannot loop; str output → cannot fail schema validation.
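The fallback described above can be sketched without any live model. This is a minimal illustration under stated assumptions, not the PR's implementation: `FakeToolReturn` and `FakeMessage` are hypothetical stand-ins that mimic the `parts`, `part_kind`, `tool_name` and `content` attributes of captured run messages, and `ask_plain_text` is a hypothetical callable standing in for the single tool-less, output_type=str call to the same model.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable

GENERIC_ERROR = "The model could not produce a valid answer. Please try again."


@dataclass
class FakeToolReturn:
    """Hypothetical stand-in for a captured tool-return message part."""
    tool_name: str
    content: Any
    part_kind: str = "tool-return"


@dataclass
class FakeMessage:
    """Hypothetical stand-in for one captured run message."""
    parts: list = field(default_factory=list)


def extract_tool_payloads(messages: list) -> list[tuple[str, Any]]:
    """Collect (tool_name, content) for every tool result seen in the run."""
    payloads = []
    for message in messages:
        for part in getattr(message, "parts", []):
            if getattr(part, "part_kind", None) == "tool-return":
                payloads.append((part.tool_name, part.content))
    return payloads


def finalize_from_tool_data(
    question: str,
    messages: list,
    ask_plain_text: Callable[[str], str],
) -> str:
    """Answer from already-fetched tool data with one plain-text call."""
    payloads = extract_tool_payloads(messages)
    if not payloads:
        # Nothing to salvage: report a generic error instead of guessing.
        return GENERIC_ERROR
    data = "\n".join(
        f"{name}: {json.dumps(content, default=str)}"
        for name, content in payloads
    )
    prompt = (
        "Answer the question using only the tool results below.\n"
        f"Question: {question}\n"
        f"Tool results:\n{data}"
    )
    # No tools and no schema on this call, so it cannot loop or fail validation.
    return ask_plain_text(prompt)
```

Passing the model call in as a plain callable keeps the handler deterministic under test: a fake callable can record the prompt and return a canned string, so no live model is needed.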
Acceptance
Finalizer fallback wired into both chat() and stream_chat() misbehavior handlers.
Deterministic unit tests for tool-payload extraction + handler behavior (returns salvaged answer when data exists; generic error when none). No live model calls in tests.