
fix(agents): surface fallback model failures with actionable details #335

Description

@w7-mgfcode

Sub-issue of #380 (umbrella: fix(repo): platform reliability hardening — agents, config, ui, forecast). Parallel after Foundation (E1 #334).

Purpose

Surface agent fallback-model failures with actionable, classified details (model-not-found / quota / auth) in REST chat error responses and in /agents/stream WebSocket error events, instead of a generic failure message.

Sub-tasks

To be decomposed via issue-to-subtasks when this epic is picked up.


Summary

When every model in the FallbackModel chain fails, the frontend shows only a generic message: "Stream error: All models from FallbackModel failed (2 sub-exceptions)". The backend logs already contain the actionable per-model causes. These should be surfaced (safely) to the API/WebSocket client so that failures are diagnosable from the UI.

Observed

Frontend (experiment agent, streaming chat):

Error: Stream error: All models from FallbackModel failed (2 sub-exceptions)

Backend agents.websocket_stream_error log held the real causes:

  • Primary: ModelHTTPError 404: invalid model name (models/google-gla:gemini-3-flash-preview is not found).
  • Fallback: ModelHTTPError 429: quota exhausted (RESOURCE_EXHAUSTED, free-tier generate_content_free_tier_requests limit 20/day for gemini-2.5-flash).

Expected

The WebSocket error event (and the equivalent REST chat error) should preserve a safe summary of each fallback sub-failure, e.g. per-model { model_name, status_code, reason }:

  • 404 → "model not found / invalid model name"
  • 429 → "quota/rate limit exhausted"
  • 401/403 → "authentication/permission error"

So the UI can distinguish a model-name problem from a quota problem without a maintainer reading container logs.
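
A minimal sketch of the classification described above, not the project's actual implementation. It assumes each sub-exception exposes `status_code` and `model_name` attributes and that the group exposes `.exceptions`; the names `SubFailure`, `classify`, `summarize`, and the "unclassified model failure" fallback text are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Reason strings copied from the list above; anything else falls through
# to a generic label (that fallback text is an assumption).
REASONS = {
    404: "model not found / invalid model name",
    429: "quota/rate limit exhausted",
    401: "authentication/permission error",
    403: "authentication/permission error",
}


@dataclass(frozen=True)
class SubFailure:
    model_name: str
    status_code: Optional[int]
    reason: str


def classify(status_code: Optional[int]) -> str:
    """Map an HTTP status code to a fixed, secret-free reason string."""
    return REASONS.get(status_code, "unclassified model failure")


def summarize(group: BaseException) -> list:
    """Build one safe summary dict per sub-exception in the group.

    Duck-typed on purpose: reads `.exceptions` on the group and
    `.model_name` / `.status_code` on each member, with fallbacks
    when an attribute is missing or has an unexpected type.
    """
    summaries = []
    for exc in getattr(group, "exceptions", ()):
        raw = getattr(exc, "status_code", None)
        status = raw if isinstance(raw, int) else None
        failure = SubFailure(
            model_name=str(getattr(exc, "model_name", "unknown")),
            status_code=status,
            reason=classify(status),
        )
        summaries.append(asdict(failure))
    return summaries
```

Because only mapped strings are emitted, the provider's raw response body never reaches the client in this variant; passing through provider text instead would need a redaction step.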

Constraints

  • Never expose secrets — no API keys, bearer tokens, Authorization headers, or AIza… values in the surfaced summary or logs. Include only status code + provider message text (which is secret-free) or a mapped reason string.
  • Keep the RFC 7807 / ErrorEvent shape; classification should be additive.
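
One way these two constraints could be combined is sketched below. The regex patterns, the `[REDACTED]` marker, the 502 status, and the `failures` extension member name are all assumptions for illustration; the real ErrorEvent shape is not shown in this issue.

```python
import re

# Assumed patterns for secret-like material; order matters, since the
# bearer pattern must run before the Authorization pattern.
_SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"),
    re.compile(r"(?i)authorization\s*[:=]\s*\S+"),
    re.compile(r"AIza[0-9A-Za-z_-]+"),
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a credential with a marker."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def build_problem(failures: list) -> dict:
    """Return an RFC 7807-style object.

    The standard members are left as they would be without classification;
    `failures` is an additive extension member, so existing clients that
    ignore unknown members keep working.
    """
    return {
        "type": "about:blank",
        "title": "Bad Gateway",
        "status": 502,  # assumed status for an upstream model failure
        "detail": (
            "All models from FallbackModel failed "
            f"({len(failures)} sub-exceptions)"
        ),
        "failures": [
            {
                "model_name": f["model_name"],
                "status_code": f["status_code"],
                # Defensive: redact even when the reason is a mapped string.
                "reason": redact(f["reason"]),
            }
            for f in failures
        ],
    }
```

Redacting at the point where the payload is built, rather than where the exception is caught, means every path that emits the payload goes through the same filter.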

Tests

  • A FallbackExceptionGroup with mixed sub-errors (404 + 429) produces an error payload listing each model's safe reason.
  • Assert no secret-like material leaks into the surfaced error.
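
The two tests above might look roughly like the following sketch. `FakeModelHTTPError` and `FakeFallbackGroup` are hypothetical stand-ins so the example runs without the real exception classes, `safe_payload` is a placeholder for the code under test, and the key and token values are obviously fake fixtures.

```python
import json
import re

# Leak detector used by the second test (assumed patterns).
SECRET_RE = re.compile(
    r"AIza[0-9A-Za-z_-]+|bearer\s+\S+|authorization\s*[:=]", re.IGNORECASE
)

REASONS = {
    404: "model not found / invalid model name",
    429: "quota/rate limit exhausted",
    401: "authentication/permission error",
    403: "authentication/permission error",
}


class FakeModelHTTPError(Exception):
    """Hypothetical stand-in for a provider HTTP error."""

    def __init__(self, status_code, model_name, body):
        super().__init__(f"{status_code} from {model_name}: {body}")
        self.status_code = status_code
        self.model_name = model_name
        self.body = body


class FakeFallbackGroup(Exception):
    """Hypothetical stand-in for FallbackExceptionGroup (only `.exceptions`)."""

    def __init__(self, message, exceptions):
        super().__init__(message)
        self.exceptions = tuple(exceptions)


def safe_payload(group):
    """Placeholder for the code under test: mapped reasons only, body never copied."""
    return {
        "failures": [
            {
                "model_name": e.model_name,
                "status_code": e.status_code,
                "reason": REASONS.get(e.status_code, "unclassified model failure"),
            }
            for e in group.exceptions
        ]
    }


def make_group():
    # Bodies deliberately contain fake secret-like strings.
    return FakeFallbackGroup(
        "All models from FallbackModel failed",
        [
            FakeModelHTTPError(404, "gemini-3-flash-preview", "key=AIzaFAKEFAKEFAKE"),
            FakeModelHTTPError(429, "gemini-2.5-flash", "Authorization: Bearer fake-token"),
        ],
    )


def test_mixed_sub_errors_listed():
    failures = safe_payload(make_group())["failures"]
    assert [(f["status_code"], f["reason"]) for f in failures] == [
        (404, "model not found / invalid model name"),
        (429, "quota/rate limit exhausted"),
    ]


def test_no_secret_like_material():
    serialized = json.dumps(safe_payload(make_group()))
    assert SECRET_RE.search(serialized) is None
```

Seeding the fixtures with fake credentials is what gives the second test teeth: it fails if the implementation ever starts copying the raw body into the payload.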

Context

Found while investigating the experiment-agent stream failure on 2026-06-01. Companion issue covers rejecting the doubled provider prefix that caused the 404 leg.

Metadata

Assignees

No one assigned

    Labels

    epic (Epic: a delivery surface under an umbrella), fix (Bug fix), flow (flow: command-suite work)