fix(agents): salvage experiment answer when weak model fails structured output

## Summary

Follow-up to #347/#348 and #349/#350. The Experiment Agent still fails read-only queries on local Ollama models. A full `capture_run_messages` trace shows that the **true root cause** is not a tool loop the model *chooses*: it is **incompatibility between structured output (`PromptedOutput(ExperimentReport)`) and weak 8B models**, which no prompt change can fix.

## Trace evidence (live model `ollama:llama3.1:8b`)

```
msg[1] TOOLCALL tool_list_runs {"status":"success"}   ← real tool call
msg[2] TOOLRETURN → 10 runs incl. WAPE                ← data obtained
msg[3] TEXT: {"runs":[...],"total":10,"page":1}       ← model returns RAW tool data as its "answer"
msg[4] RETRY: summary: Field required                 ← not an ExperimentReport
msg[5] TOOLCALL tool_list_runs {} ... loops 3x → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior
```

- The model calls the tool and gets the data, but emits the **raw tool-result shape** (`{"runs":[...]}`) as its final output instead of `ExperimentReport{summary}`. `PromptedOutput` validation rejects it (missing `summary`), and the run repeats the same pattern until output retries are exhausted.
- `qwen3:8b` is worse: it emits the tool call itself **as text** (`{"tool":"tool_list_runs","arguments":{...}}`) and never makes a real call.
- Neither prompt-based mitigation (#348 guard, #349 one-pass rule) can fix this: the model isn't choosing to loop; it can't produce the required structured output while juggling tools.
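
To make the failure in `msg[3]`/`msg[4]` concrete, here is a minimal standard-library sketch of the validation step. It is not the project's code: `ExperimentReport` here is a stand-in with only the `summary` field (the only field the trace confirms), `validate_report` and `is_valid` are hypothetical helpers, and the `runs` list is left empty because the trace elides its contents.

```python
from dataclasses import dataclass


@dataclass
class ExperimentReport:
    """Stand-in for the real output model; only `summary` is known from the trace."""
    summary: str


def validate_report(payload: dict) -> ExperimentReport:
    """Mimic schema validation: reject payloads that lack the required field."""
    if "summary" not in payload:
        raise ValueError("summary: Field required")
    return ExperimentReport(summary=str(payload["summary"]))


def is_valid(payload: dict) -> bool:
    """Return True when the payload would be accepted as an ExperimentReport."""
    try:
        validate_report(payload)
        return True
    except ValueError:
        return False


# Shape of msg[3]: the raw tool result echoed back, with no `summary` key.
raw_tool_shape = {"runs": [], "total": 10, "page": 1}

print(is_valid(raw_tool_shape))               # rejected, as in msg[4]
print(is_valid({"summary": "10 runs found"}))  # the shape the schema expects
```

The data needed for a good answer is already present in the rejected payload, which is what makes salvaging it worthwhile.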

## Fix

Service-layer **graceful finalizer fallback**: when an agent run raises `UnexpectedModelBehavior` ("Exceeded maximum output retries") and there is no pending HITL action to salvage, extract the tool results already obtained during the run (via `capture_run_messages`), then make ONE **tool-less, plain-text** (`output_type=str`) follow-up call to the same model to answer the user's question from that data. Plain text is what weak models *can* produce, so this converts the structured-output failure into a correct answer on Ollama, without a cloud model.

- Keeps `PromptedOutput(ExperimentReport)` as the happy path (cloud models unaffected).
- No tools on the finalizer → cannot loop; `str` output → cannot fail schema validation.
- HITL approval salvage (#344) takes precedence and is untouched.
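
The fallback flow described above can be sketched as follows. This is an illustration under stated assumptions, not the actual service code: `ToolReturn` is a hypothetical stand-in for the tool-return parts found in the captured messages, `ask_plain_text` is a hypothetical callable wrapping the tool-less `output_type=str` model call, and the prompt wording is invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class ToolReturn:
    """Hypothetical stand-in for a tool-return part in the captured messages."""
    tool_name: str
    content: Any


def extract_tool_payloads(parts: Iterable[object]) -> List[ToolReturn]:
    """Keep only tool results; ignore text, retry prompts and tool calls."""
    return [p for p in parts if isinstance(p, ToolReturn)]


def build_finalizer_prompt(question: str, payloads: List[ToolReturn]) -> str:
    """Render the question plus the already-fetched data as one plain-text prompt."""
    blocks = [
        f"[{p.tool_name}]\n{json.dumps(p.content, default=str)}" for p in payloads
    ]
    return (
        "Answer the question using only the data below. Reply in plain text.\n\n"
        f"Question: {question}\n\nData:\n" + "\n\n".join(blocks)
    )


def finalize(
    question: str,
    parts: Iterable[object],
    ask_plain_text: Callable[[str], str],
) -> Optional[str]:
    """Make one tool-less follow-up call; None means there was nothing to salvage."""
    payloads = extract_tool_payloads(parts)
    if not payloads:
        return None
    return ask_plain_text(build_finalizer_prompt(question, payloads))
```

Passing the model call in as a callable keeps the extraction and prompt-building logic free of any model dependency, so it can be exercised without a live Ollama instance.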

## Acceptance

- Finalizer fallback wired into both `chat()` and `stream_chat()` misbehavior handlers.
- Deterministic unit tests for tool-payload extraction + handler behavior (returns salvaged answer when data exists; generic error when none). No live model calls in tests.
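
A minimal sketch of what the deterministic handler tests could look like, assuming a simplified handler signature. `handle_misbehavior`, `StubFinalizer`, and the `GENERIC_ERROR` wording are all hypothetical names for illustration; the stub replaces the model call so nothing touches a live model.

```python
from typing import Callable, List

GENERIC_ERROR = "The agent could not complete this request."  # hypothetical wording


def handle_misbehavior(
    tool_payloads: List[str], finalizer: Callable[[List[str]], str]
) -> str:
    """Return a salvaged answer when tool data exists, else a generic error."""
    if not tool_payloads:
        return GENERIC_ERROR
    return finalizer(tool_payloads)


class StubFinalizer:
    """Deterministic replacement for the model call; records what it was given."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: List[List[str]] = []

    def __call__(self, payloads: List[str]) -> str:
        self.calls.append(payloads)
        return self.reply


def test_salvages_answer_when_data_exists() -> None:
    stub = StubFinalizer("10 runs found")
    assert handle_misbehavior(['{"total": 10}'], stub) == "10 runs found"
    assert len(stub.calls) == 1  # exactly one follow-up call


def test_generic_error_when_no_data() -> None:
    stub = StubFinalizer("unused")
    assert handle_misbehavior([], stub) == GENERIC_ERROR
    assert stub.calls == []  # the model is never called without data
```

Both functions follow the `test_` naming convention, so a runner such as pytest would collect them as written.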

