Skip to content

fix(producer): stop retrying capture attempts that made zero progress#1725

Merged
miguel-heygen merged 2 commits into
mainfrom
fix/producer-retry-zero-progress
Jun 25, 2026
Merged

fix(producer): stop retrying capture attempts that made zero progress#1725
miguel-heygen merged 2 commits into
mainfrom
fix/producer-retry-zero-progress

Conversation

@miguel-heygen

@miguel-heygen miguel-heygen commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

What

Caps the adaptive capture retry loop so a structurally broken composition fails fast instead of being re-run at ever-lower parallelism.

Why

A composition that never reaches window.__hf readiness — a never-ready page, zero duration, or unparseable HTML — captures zero frames. executeDiskCaptureWithAdaptiveRetry only checks "are frames still missing?" before retrying, so it walks the same broken render down 16 → 8 → 4 → 2 → 1 workers. Each attempt burns a full playerReadyTimeout (45s) or protocolTimeout (5 min) per worker, turning a render that can never succeed into a multi-minute hang before it finally fails.

Change

captureAttemptMadeProgress(attemptTargetFrameCount, remainingFrameCount): an attempt that leaves at least as many frames missing as it set out to capture made no forward progress → the composition is broken, not the workers flaky → bail immediately. Wired into both retry branches (clean-but-incomplete capture, and recoverable-error capture).

A partially-captured attempt (some frames landed) still retries, so genuine flaky-worker gaps are unaffected. Worst case for the heuristic is one render that would've recovered now fails and is re-run — versus today's repeated full-timeout burn on every broken render.

Test

captureAttemptMadeProgress unit tests added (retry-when-progress, bail-when-zero-progress incl. the >= target guard). Full renderOrchestrator.test.ts suite passes.

A structurally broken composition (never-ready page, zero duration, or
unparseable HTML) captures no frames, so the adaptive retry loop kept
re-running it at halved parallelism — 16->8->4->2->1 workers — each attempt
burning a full readiness/protocol timeout per worker. That multiplied
wall-clock to ~46min on broken renders and was the driver of the render
P95 blowup (~370k -> 2.79M ms) seen Jun 20-22.

Add captureAttemptMadeProgress(): when an attempt leaves at least as many
frames missing as it set out to capture, it made no forward progress, so the
composition is broken rather than the workers being flaky. Bail immediately
instead of retrying. A partially-captured attempt still retries, so genuine
flaky-worker gaps are unaffected.

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — well-scoped 48-line guard, clean unit tests, sound predicate. Race-fix classification: PREEMPT for the broken-composition class (never-ready window.__hf, zero duration, unparseable HTML) — remainingCount >= frameCount is a structural signal, not a probability tweak. A retry only ever covers the missing ranges, so once an attempt fails to capture any of its own target, the parallelism-halving loop has no path to a different outcome; the bail is total, not narrower.

Reviewed at 2a2a2c8.

Findings

💚 Predicate semantics are tight. captureAttemptMadeProgress(target, remaining) := remaining < target correctly handles both the clean-but-incomplete branch (line 698) and the catch branch (line 726). On attempt 0 the target is options.totalFrames; on retry N≥1 the target is countFrameRanges(missingRanges) from the prior attempt — both views are disk-consistent because findMissingFrameRanges always reads the framesDir. Crucially, in the catch branch a partial-progress throw still retries (one frame landed before the timeout ⇒ remainingCount < frameCount), so genuine flaky-worker mid-attempt failures are unaffected. That preserves the existing recovery shape.

💚 The >= target defensive clamp is the right shape. remainingCount > frameCount shouldn t happen given how findMissingFrameRanges is computed against options.totalFrames, but treating it as "no progress" instead of crashing is the safe default. Test at :110 pins it.

⚠️ Helper-only coverage; the loop integration is unverified. renderOrchestrator.test.ts adds 4 assertions on the helper itself but no integration test mocks executeParallelCapture to drive the retry loop through the new bail. Placement matters more than the predicate here — if a future refactor moves the findMissingFrameRanges call below the gate, or recomputes frameCount post-attempt, the helper tests still pass. Not blocking on a 48-line fix, but worth a // TODO(test): or a follow-up that exercises both branches end-to-end. Mirror the shape from the EF#40577 streaming-retry telemetry tests if you take it.

🟡 No telemetry on the new bail. EF#40577 wired stats.incr("hyperframes.render_streaming.retry", tags={attempt, reason}) for the Temporal-activity layer. This is a different layer (in-process producer, not workflow-side retry), but the same diagnostic question applies: how will an oncall know why a render bailed fast vs. exhausted retries? Right now both paths throw the same "[Render] Capture completed but N frame(s) are missing" message. A log.warn("[Render] Bailing — capture attempt made zero progress", { attempt, frameCount, workers: currentWorkers }) before the throw at :700 and :728 would let you cohort "broken composition" vs. "flaky worker exhaustion" in Datadog without parsing call-stack context. Cheap, high-leverage for the observability story the Slack thread says is the motivation.

🟡 Sibling check — executeStreamingCapture? Grep confirms executeDiskCaptureWithAdaptiveRetry has exactly one in-repo caller (captureStage.ts:194) and there s no parallel adaptive-retry helper for the streaming path in this repo. So the gate is correctly scoped. If the streaming path ever grows a similar adaptive-worker-halving retry, this predicate is a candidate for promotion to a shared module — but no action required today. Mentioning it so the answer is on the record.

🟡 Docstring typo. :559"ponytail: total zero-progress == broken" reads like a dictation/copilot residue (probably "TL;DR:"). Doesn t change behavior; one-line fix.

💭 allowRetry: job.config.workers === undefined at captureStage.ts:200 means user-pinned --workers N already disables the entire adaptive retry, so users who want to soak a broken render at fixed parallelism (debugging?) keep the old behavior. Just noting — that gate is fine.

Cross-PR coherence with EF#40577

These compose cleanly. EF#40577 is the outer retry policy (Temporal workflow-activity layer, classifier-driven, 3 attempts with 5s initial backoff). HF#1725 is the inner in-process retry (producer-side adaptive parallelism). Both are now move-to-non-retryable on structural failure, just at different scopes. The diagnostic shapes don t conflict; the only thing missing on this side is the matching telemetry counter (hyperframes.render.adaptive_retry.bail or similar) so an oncall can correlate "broken render bailed fast in producer" with "workflow gave up after 3 activity attempts" in the same query.

Nothing here blocks merge once regression-shards green.

— Rames D Jusso

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caps the adaptive capture retry loop on a structural-failure signal: an attempt that captured zero frames against its target means the composition is broken, not the workers flaky, so further retries at lower parallelism only burn more playerReadyTimeout / protocolTimeout budget. The heuristic — remainingFrameCount < attemptTargetFrameCount — is wired into both retry branches in executeDiskCaptureWithAdaptiveRetry: the clean-but-incomplete capture path (renderOrchestrator.ts:695-700) and the recoverable-error catch path (renderOrchestrator.ts:722-728). Pure-function unit tests cover retry-when-some-frames-captured, bail-when-zero-progress, and a defensive remainingCount >= target guard.

Verdict: LGTM at 2a2a2c84df396b5a5496c3c51358a73dc2b0f855 — converging with Rames's read at the same SHA.

Verified the per-attempt frameCount semantics line up with the heuristic. At renderOrchestrator.ts:626 frameCount is reassigned each loop iteration: missingRanges ? countFrameRanges(missingRanges) : options.totalFrames. So on attempt 0 it's the global total; on retry attempts it's the count of frames the prior pass left missing (missingRanges = remaining at lines 710 / 739). The variable name in the helper (attemptTargetFrameCount) matches that contract — this is "did this specific attempt close at least one frame from the set it set out to capture," not a global-total comparison. Walk-through:

  • Attempt 0: frameCount=100, captures 0 → remainingCount=100 → 100 < 100 false → bail. ✓ (the broken-comp case the PR targets)
  • Attempt 0: frameCount=100, captures 50 → remainingCount=50 → 50 < 100 true → retry.
  • Attempt 1: frameCount=50 (missing from prior), captures 0 more → remainingCount=50 → 50 < 50 false → bail. ✓ (caught even at depth)
  • Attempt 1: frameCount=50, captures 10 more → remainingCount=40 → 40 < 50 true → retry. ✓ (progress preserved)

Rames's "PREEMPT, not probability-tweak" framing nails the structural argument: since a retry only ever re-runs the missing ranges, once an attempt fails to capture any of its own target the worker-halving chain has zero remaining path to a different outcome. The bail is total at that point, not "we're giving up earlier than we could have."

Verified the remainingCount > frameCount defensive case. findMissingFrameRanges(options.totalFrames, framesDir, ...) (lines 686-690 / 713-717) scans for missing frames across the entire composition, not just the attempt's targeted subset. So remainingCount is bounded above by frameCount invariantly under normal operation — a retry only re-runs frames that were already missing, and previously-captured frames stay on disk. The remainingCount > frameCount branch (test captureAttemptMadeProgress(100, 120) === false) only triggers on a worker that corrupts/deletes a prior frame mid-capture; the heuristic correctly treats that as "no forward progress." Belt-and-suspenders, no behavior change in the normal path.

Verified the false-negative cost stated in the PR body. "Worst case is one render that would've recovered now fails" holds: a 16-worker attempt that captured zero frames purely due to a transient infra flake (e.g. start-of-render network glitch before any frame writes) will now bail instead of retrying at 8 workers. Narrow scenario — the throughput win (~46-min broken-render savings) versus the rare false-negative is the right trade. Rames's allowRetry: job.config.workers === undefined check at captureStage.ts:200 is also load-bearing here: a user pinning --workers N for debugging keeps the old behavior entirely, so the heuristic only affects the auto-adaptive default path.

Findings (numbered, with severity tag):

  1. NIT (copy-edit) — concur with Rames #4renderOrchestrator.ts:559 docstring contains ponytail: total zero-progress == broken;. The "ponytail:" prefix is almost certainly a Cursor/Copilot completion artifact (Rames's read of "probably TL;DR:" is the sharpest — that's what was in the dictation buffer). grep -n "ponytail" packages/producer returns only this line at HEAD; no codebase convention. One-line fix, but a future reader will trip on it.

  2. NIT (observability) — concur with Rames #3, identical proposed fix — When the new heuristic fires, the thrown error message is unchanged: [Render] Capture completed but ${remainingCount} frame(s) are missing (line 700) or the original error is re-thrown (line 728). A render that bailed on attempt 0 with !madeProgress is indistinguishable in logs from one that walked all the way down to currentWorkers=1 and exhausted retries — exactly the cohort distinction an oncall needs to confirm the heuristic is doing its job. A options.log.warn("[Render] Bailing early — capture made no forward progress (composition likely structurally broken)", { attempt, frameCount, remainingCount, currentWorkers }) immediately before each throw would land that signal. Rames also flagged the sibling counter idea (stats.incr("hyperframes.render.adaptive_retry.bail", tags={reason})) mirroring the EF#40577 telemetry; that would close the full diagnostic loop across both retry layers (in-process producer + outer Temporal workflow). Worth bundling.

  3. NIT (test coverage) — concur with Rames #2 — Helper-only unit tests are clean, but renderOrchestrator.test.ts doesn't drive executeDiskCaptureWithAdaptiveRetry end-to-end through the new bail. A future refactor that moves the findMissingFrameRanges call relative to the gate, or recomputes frameCount post-attempt, would keep the helper tests green while breaking the integration. Not blocking on a 48-line fix, but worth a // TODO(test): or a follow-up that mocks executeParallelCapture with a zero-capture return and asserts attempts.length === 1. Rames's pointer to mirror the shape of EF#40577's streaming-retry telemetry tests is the right reference.

  4. Style (optional)attemptTargetFrameCount is a precise name; consider mirroring it at the call site by renaming the local frameCount (line 626) to attemptTargetFrameCount too. There's currently a one-step naming hop between caller and helper that costs readers a beat. Trivial; entirely optional.

Cross-PR coherence with EF#40577 (concur with Rames): these two retry-discipline PRs compose cleanly. EF#40577 (yesterday, also Miguel-author) is the outer retry — Temporal workflow-activity layer, classifier-driven _terminal_video_render_failure wrap, maximum_attempts=3 policy with deterministic-error escape now closed. HF#1725 is the inner in-process producer retry, adaptive-parallelism halving with the new no-progress preempt. Both PRs move "structurally failed" toward non-retryable, just at different scopes. The only thing missing on this side to close the diagnostic loop is the matching telemetry counter Rames proposes — so an oncall query like "broken-comp renders that bailed fast in producer" can be joined against "workflow gave up after 3 activity attempts" in the same dashboard.

Side observation (not a finding): the broken-render scenario this fixes is the same structural pattern as a few prior incidents in the producer composition-readiness space (the GSAP loader-validator drift family on HF #1448, AI Studio's STUDIO-5215 "Failed to load GSAP animation," etc.). This PR fixes the retry-layer half of the problem (don't waste budget on something that can't succeed); the readiness-detection half (faster fail on never-ready pages) is orthogonal and would compose well with this. Out of scope for this PR; flagging only because "zero captures across an attempt" and "window.__hf never reached ready state" are tightly coupled signals.

CI state at 2a2a2c84df396b5a5496c3c51358a73dc2b0f855: Required gates all green at read time — Build, Lint, Typecheck, Test, Fallow audit, Format, Semantic PR title, CodeQL JS/TS + Python + actions, Render on windows-latest, Tests on windows-latest, player-perf, preview-regression, SDK unit/contract/smoke, Test: runtime contract, Studio: load smoke, CLI smoke (required), Smoke: global install, File size check. regression-shards 1-8 IN_PROGRESS at read time but the change is a narrow conditional gate addition — should land clean. Test: skills and CLI: npx shim SKIPPED (path-filtered — no skills/ or CLI shim changes). No required failures.

Prior reviewer state: Rames COMMENTED LGTM on 2a2a2c8 at 20:56 UTC with five non-blocking findings (semantics walkthrough, defensive-clamp, helper-only coverage, no telemetry, docstring typo) + a thought-bubble on allowRetry gating and a cross-PR coherence note vs. EF#40577. Convergent with my read at the same SHA — no daylight on verdict; differences are angle (his PREEMPT framing and EF#40577 cross-PR observability framing are sharper than my per-attempt walkthrough framing).

Review by Via

…ion test

Address review feedback on the no-progress capture guard:

- Warn before bailing so an oncall can tell a structurally-broken render that
  bailed fast apart from one that exhausted worker-halving retries (both
  previously threw the same "frame(s) are missing" message).
- Add an integration test that drives executeDiskCaptureWithAdaptiveRetry
  through the bail (capture functions mocked to write nothing) and asserts a
  single attempt runs — the gate would otherwise walk 4->2->1 workers. Guards
  the placement of the gate, not just the predicate.
- Reword the helper docstring (drop stray prefix and internal incident detail).
@miguel-heygen

Copy link
Copy Markdown
Collaborator Author

Thanks both — addressed the review points in 80ac896a:

  • No signal on the new bail → added log.warn("[Render] Capture attempt made no forward progress; composition is likely structurally broken — not retrying.", { attempt, frameCount, remainingCount, workers }) before the throw in both retry branches, fired only when !madeProgress. An oncall can now distinguish a fast structural bail from a worker-halving retry exhaustion in the logs.

  • Helper-only coverage → added an integration test that drives executeDiskCaptureWithAdaptiveRetry end-to-end through the bail: the two capture functions are mocked to write nothing, so the loop runs for real against a temp framesDir, and it asserts executeParallelCapture is called exactly once (without the gate it would walk 4→2→1 = three calls). That guards the gate's placement relative to findMissingFrameRanges, not just the predicate.

  • Docstring → reworded; dropped the stray prefix and trimmed it to the mechanism only.

Left out, with reasoning:

  • Dedicated metrics counter — this layer logs via options.log rather than a stats client, so the log.warn is the idiomatic fit here; a cross-layer counter is a separate change and out of scope for this gate.
  • Rename local frameCountattemptTargetFrameCount — the local feeds the typed CaptureAttemptSummary.frameCount field via shorthand, so renaming it would desync the field name; left as-is.
  • Sibling streaming path — confirmed there's no parallel adaptive-worker-halving retry to mirror this into; correctly scoped to the disk path.

@miguel-heygen miguel-heygen merged commit 690cf1b into main Jun 25, 2026
45 checks passed
@miguel-heygen miguel-heygen deleted the fix/producer-retry-zero-progress branch June 25, 2026 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants