fix(producer): stop retrying capture attempts that made zero progress#1725
Conversation
A structurally broken composition (never-ready page, zero duration, or unparseable HTML) captures no frames, so the adaptive retry loop kept re-running it at halved parallelism — 16->8->4->2->1 workers — each attempt burning a full readiness/protocol timeout per worker. That multiplied wall-clock to ~46min on broken renders and was the driver of the render P95 blowup (~370k -> 2.79M ms) seen Jun 20-22. Add captureAttemptMadeProgress(): when an attempt leaves at least as many frames missing as it set out to capture, it made no forward progress, so the composition is broken rather than the workers being flaky. Bail immediately instead of retrying. A partially-captured attempt still retries, so genuine flaky-worker gaps are unaffected.
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
LGTM — well-scoped 48-line guard, clean unit tests, sound predicate. Race-fix classification: PREEMPT for the broken-composition class (never-ready window.__hf, zero duration, unparseable HTML) — remainingCount >= frameCount is a structural signal, not a probability tweak. A retry only ever covers the missing ranges, so once an attempt fails to capture any of its own target, the parallelism-halving loop has no path to a different outcome; the bail is total, not narrower.
Reviewed at 2a2a2c8.
Findings
💚 Predicate semantics are tight. captureAttemptMadeProgress(target, remaining) := remaining < target correctly handles both the clean-but-incomplete branch (line 698) and the catch branch (line 726). On attempt 0 the target is options.totalFrames; on retry N≥1 the target is countFrameRanges(missingRanges) from the prior attempt — both views are disk-consistent because findMissingFrameRanges always reads the framesDir. Crucially, in the catch branch a partial-progress throw still retries (one frame landed before the timeout ⇒ remainingCount < frameCount), so genuine flaky-worker mid-attempt failures are unaffected. That preserves the existing recovery shape.
💚 The >= target defensive clamp is the right shape. remainingCount > frameCount shouldn t happen given how findMissingFrameRanges is computed against options.totalFrames, but treating it as "no progress" instead of crashing is the safe default. Test at :110 pins it.
renderOrchestrator.test.ts adds 4 assertions on the helper itself but no integration test mocks executeParallelCapture to drive the retry loop through the new bail. Placement matters more than the predicate here — if a future refactor moves the findMissingFrameRanges call below the gate, or recomputes frameCount post-attempt, the helper tests still pass. Not blocking on a 48-line fix, but worth a // TODO(test): or a follow-up that exercises both branches end-to-end. Mirror the shape from the EF#40577 streaming-retry telemetry tests if you take it.
🟡 No telemetry on the new bail. EF#40577 wired stats.incr("hyperframes.render_streaming.retry", tags={attempt, reason}) for the Temporal-activity layer. This is a different layer (in-process producer, not workflow-side retry), but the same diagnostic question applies: how will an oncall know why a render bailed fast vs. exhausted retries? Right now both paths throw the same "[Render] Capture completed but N frame(s) are missing" message. A log.warn("[Render] Bailing — capture attempt made zero progress", { attempt, frameCount, workers: currentWorkers }) before the throw at :700 and :728 would let you cohort "broken composition" vs. "flaky worker exhaustion" in Datadog without parsing call-stack context. Cheap, high-leverage for the observability story the Slack thread says is the motivation.
🟡 Sibling check — executeStreamingCapture? Grep confirms executeDiskCaptureWithAdaptiveRetry has exactly one in-repo caller (captureStage.ts:194) and there s no parallel adaptive-retry helper for the streaming path in this repo. So the gate is correctly scoped. If the streaming path ever grows a similar adaptive-worker-halving retry, this predicate is a candidate for promotion to a shared module — but no action required today. Mentioning it so the answer is on the record.
🟡 Docstring typo. :559 — "ponytail: total zero-progress == broken" reads like a dictation/copilot residue (probably "TL;DR:"). Doesn t change behavior; one-line fix.
💭 allowRetry: job.config.workers === undefined at captureStage.ts:200 means user-pinned --workers N already disables the entire adaptive retry, so users who want to soak a broken render at fixed parallelism (debugging?) keep the old behavior. Just noting — that gate is fine.
Cross-PR coherence with EF#40577
These compose cleanly. EF#40577 is the outer retry policy (Temporal workflow-activity layer, classifier-driven, 3 attempts with 5s initial backoff). HF#1725 is the inner in-process retry (producer-side adaptive parallelism). Both are now move-to-non-retryable on structural failure, just at different scopes. The diagnostic shapes don t conflict; the only thing missing on this side is the matching telemetry counter (hyperframes.render.adaptive_retry.bail or similar) so an oncall can correlate "broken render bailed fast in producer" with "workflow gave up after 3 activity attempts" in the same query.
Nothing here blocks merge once regression-shards green.
— Rames D Jusso
vanceingalls
left a comment
There was a problem hiding this comment.
Caps the adaptive capture retry loop on a structural-failure signal: an attempt that captured zero frames against its target means the composition is broken, not the workers flaky, so further retries at lower parallelism only burn more playerReadyTimeout / protocolTimeout budget. The heuristic — remainingFrameCount < attemptTargetFrameCount — is wired into both retry branches in executeDiskCaptureWithAdaptiveRetry: the clean-but-incomplete capture path (renderOrchestrator.ts:695-700) and the recoverable-error catch path (renderOrchestrator.ts:722-728). Pure-function unit tests cover retry-when-some-frames-captured, bail-when-zero-progress, and a defensive remainingCount >= target guard.
Verdict: LGTM at 2a2a2c84df396b5a5496c3c51358a73dc2b0f855 — converging with Rames's read at the same SHA.
Verified the per-attempt frameCount semantics line up with the heuristic. At renderOrchestrator.ts:626 frameCount is reassigned each loop iteration: missingRanges ? countFrameRanges(missingRanges) : options.totalFrames. So on attempt 0 it's the global total; on retry attempts it's the count of frames the prior pass left missing (missingRanges = remaining at lines 710 / 739). The variable name in the helper (attemptTargetFrameCount) matches that contract — this is "did this specific attempt close at least one frame from the set it set out to capture," not a global-total comparison. Walk-through:
- Attempt 0: frameCount=100, captures 0 → remainingCount=100 →
100 < 100false → bail. ✓ (the broken-comp case the PR targets) - Attempt 0: frameCount=100, captures 50 → remainingCount=50 →
50 < 100true → retry. - Attempt 1: frameCount=50 (missing from prior), captures 0 more → remainingCount=50 →
50 < 50false → bail. ✓ (caught even at depth) - Attempt 1: frameCount=50, captures 10 more → remainingCount=40 →
40 < 50true → retry. ✓ (progress preserved)
Rames's "PREEMPT, not probability-tweak" framing nails the structural argument: since a retry only ever re-runs the missing ranges, once an attempt fails to capture any of its own target the worker-halving chain has zero remaining path to a different outcome. The bail is total at that point, not "we're giving up earlier than we could have."
Verified the remainingCount > frameCount defensive case. findMissingFrameRanges(options.totalFrames, framesDir, ...) (lines 686-690 / 713-717) scans for missing frames across the entire composition, not just the attempt's targeted subset. So remainingCount is bounded above by frameCount invariantly under normal operation — a retry only re-runs frames that were already missing, and previously-captured frames stay on disk. The remainingCount > frameCount branch (test captureAttemptMadeProgress(100, 120) === false) only triggers on a worker that corrupts/deletes a prior frame mid-capture; the heuristic correctly treats that as "no forward progress." Belt-and-suspenders, no behavior change in the normal path.
Verified the false-negative cost stated in the PR body. "Worst case is one render that would've recovered now fails" holds: a 16-worker attempt that captured zero frames purely due to a transient infra flake (e.g. start-of-render network glitch before any frame writes) will now bail instead of retrying at 8 workers. Narrow scenario — the throughput win (~46-min broken-render savings) versus the rare false-negative is the right trade. Rames's allowRetry: job.config.workers === undefined check at captureStage.ts:200 is also load-bearing here: a user pinning --workers N for debugging keeps the old behavior entirely, so the heuristic only affects the auto-adaptive default path.
Findings (numbered, with severity tag):
-
NIT (copy-edit) — concur with Rames #4 —
renderOrchestrator.ts:559docstring containsponytail: total zero-progress == broken;. The "ponytail:" prefix is almost certainly a Cursor/Copilot completion artifact (Rames's read of "probablyTL;DR:" is the sharpest — that's what was in the dictation buffer).grep -n "ponytail" packages/producerreturns only this line at HEAD; no codebase convention. One-line fix, but a future reader will trip on it. -
NIT (observability) — concur with Rames #3, identical proposed fix — When the new heuristic fires, the thrown error message is unchanged:
[Render] Capture completed but ${remainingCount} frame(s) are missing(line 700) or the originalerroris re-thrown (line 728). A render that bailed on attempt 0 with!madeProgressis indistinguishable in logs from one that walked all the way down tocurrentWorkers=1and exhausted retries — exactly the cohort distinction an oncall needs to confirm the heuristic is doing its job. Aoptions.log.warn("[Render] Bailing early — capture made no forward progress (composition likely structurally broken)", { attempt, frameCount, remainingCount, currentWorkers })immediately before eachthrowwould land that signal. Rames also flagged the sibling counter idea (stats.incr("hyperframes.render.adaptive_retry.bail", tags={reason})) mirroring the EF#40577 telemetry; that would close the full diagnostic loop across both retry layers (in-process producer + outer Temporal workflow). Worth bundling. -
NIT (test coverage) — concur with Rames #2 — Helper-only unit tests are clean, but
renderOrchestrator.test.tsdoesn't driveexecuteDiskCaptureWithAdaptiveRetryend-to-end through the new bail. A future refactor that moves thefindMissingFrameRangescall relative to the gate, or recomputesframeCountpost-attempt, would keep the helper tests green while breaking the integration. Not blocking on a 48-line fix, but worth a// TODO(test):or a follow-up that mocksexecuteParallelCapturewith a zero-capture return and assertsattempts.length === 1. Rames's pointer to mirror the shape of EF#40577's streaming-retry telemetry tests is the right reference. -
Style (optional) —
attemptTargetFrameCountis a precise name; consider mirroring it at the call site by renaming the localframeCount(line 626) toattemptTargetFrameCounttoo. There's currently a one-step naming hop between caller and helper that costs readers a beat. Trivial; entirely optional.
Cross-PR coherence with EF#40577 (concur with Rames): these two retry-discipline PRs compose cleanly. EF#40577 (yesterday, also Miguel-author) is the outer retry — Temporal workflow-activity layer, classifier-driven _terminal_video_render_failure wrap, maximum_attempts=3 policy with deterministic-error escape now closed. HF#1725 is the inner in-process producer retry, adaptive-parallelism halving with the new no-progress preempt. Both PRs move "structurally failed" toward non-retryable, just at different scopes. The only thing missing on this side to close the diagnostic loop is the matching telemetry counter Rames proposes — so an oncall query like "broken-comp renders that bailed fast in producer" can be joined against "workflow gave up after 3 activity attempts" in the same dashboard.
Side observation (not a finding): the broken-render scenario this fixes is the same structural pattern as a few prior incidents in the producer composition-readiness space (the GSAP loader-validator drift family on HF #1448, AI Studio's STUDIO-5215 "Failed to load GSAP animation," etc.). This PR fixes the retry-layer half of the problem (don't waste budget on something that can't succeed); the readiness-detection half (faster fail on never-ready pages) is orthogonal and would compose well with this. Out of scope for this PR; flagging only because "zero captures across an attempt" and "window.__hf never reached ready state" are tightly coupled signals.
CI state at 2a2a2c84df396b5a5496c3c51358a73dc2b0f855: Required gates all green at read time — Build, Lint, Typecheck, Test, Fallow audit, Format, Semantic PR title, CodeQL JS/TS + Python + actions, Render on windows-latest, Tests on windows-latest, player-perf, preview-regression, SDK unit/contract/smoke, Test: runtime contract, Studio: load smoke, CLI smoke (required), Smoke: global install, File size check. regression-shards 1-8 IN_PROGRESS at read time but the change is a narrow conditional gate addition — should land clean. Test: skills and CLI: npx shim SKIPPED (path-filtered — no skills/ or CLI shim changes). No required failures.
Prior reviewer state: Rames COMMENTED LGTM on 2a2a2c8 at 20:56 UTC with five non-blocking findings (semantics walkthrough, defensive-clamp, helper-only coverage, no telemetry, docstring typo) + a thought-bubble on allowRetry gating and a cross-PR coherence note vs. EF#40577. Convergent with my read at the same SHA — no daylight on verdict; differences are angle (his PREEMPT framing and EF#40577 cross-PR observability framing are sharper than my per-attempt walkthrough framing).
Review by Via
…ion test Address review feedback on the no-progress capture guard: - Warn before bailing so an oncall can tell a structurally-broken render that bailed fast apart from one that exhausted worker-halving retries (both previously threw the same "frame(s) are missing" message). - Add an integration test that drives executeDiskCaptureWithAdaptiveRetry through the bail (capture functions mocked to write nothing) and asserts a single attempt runs — the gate would otherwise walk 4->2->1 workers. Guards the placement of the gate, not just the predicate. - Reword the helper docstring (drop stray prefix and internal incident detail).
|
Thanks both — addressed the review points in
Left out, with reasoning:
|
What
Caps the adaptive capture retry loop so a structurally broken composition fails fast instead of being re-run at ever-lower parallelism.
Why
A composition that never reaches
window.__hfreadiness — a never-ready page, zero duration, or unparseable HTML — captures zero frames.executeDiskCaptureWithAdaptiveRetryonly checks "are frames still missing?" before retrying, so it walks the same broken render down16 → 8 → 4 → 2 → 1workers. Each attempt burns a fullplayerReadyTimeout(45s) orprotocolTimeout(5 min) per worker, turning a render that can never succeed into a multi-minute hang before it finally fails.Change
captureAttemptMadeProgress(attemptTargetFrameCount, remainingFrameCount): an attempt that leaves at least as many frames missing as it set out to capture made no forward progress → the composition is broken, not the workers flaky → bail immediately. Wired into both retry branches (clean-but-incomplete capture, and recoverable-error capture).A partially-captured attempt (some frames landed) still retries, so genuine flaky-worker gaps are unaffected. Worst case for the heuristic is one render that would've recovered now fails and is re-run — versus today's repeated full-timeout burn on every broken render.
Test
captureAttemptMadeProgressunit tests added (retry-when-progress, bail-when-zero-progress incl. the>= targetguard). FullrenderOrchestrator.test.tssuite passes.