Skip to content

fix(producer): retry probe stage on transient browser errors#1688

Merged
miguel-heygen merged 2 commits into
mainfrom
fix/probe-frame-detach-retry
Jun 24, 2026
Merged

fix(producer): retry probe stage on transient browser errors#1688
miguel-heygen merged 2 commits into
mainfrom
fix/probe-frame-detach-retry

Conversation

@miga-heygen

@miga-heygen miga-heygen commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #1687 — the distributed render plan() stage crashes on transient Puppeteer/Chrome errors (e.g. "Navigating frame was detached") with zero retry logic. The plan tarball is never written, and all downstream chunk workers fail.

Root cause

When the browser probe encounters a transient Chrome crash during initializeSession(), the error propagates straight through probeStageplan() → the adapter layer with no retry. Known transient errors like frame detachment, target closure, and page crashes are infrastructure-level failures that succeed on retry with a fresh browser session.

Changes

Engine (@hyperframes/engine):

  • isTransientBrowserError(error) — classifies 9 known transient Puppeteer/Chrome errors:
    • Navigating frame was detached
    • Target closed / Session closed
    • Protocol error ... Target closed
    • Navigation failed because browser has disconnected
    • Page crashed
    • Execution context was destroyed
    • Cannot find context with specified id

Producer (@hyperframes/producer):

  • runProbeStage() wraps browser session creation + initialization in a retry loop (max 2 attempts):
    • On transient error: structured log (attempt, isTransient, error, elapsedMs) → close crashed session → create fresh browser → retry
    • On non-transient error: throw immediately, no retry budget consumed
    • On retry exhaustion: throw the last error

Observability

Every retry boundary logs structured fields:

  • attempt / maxAttempts — which attempt we're on
  • isTransient — whether the error classifier recognized it as retryable
  • error — the error message
  • elapsedMs — wall-clock time before the failure

Tests

  • 17 unit tests for isTransientBrowserError — covers all 9 patterns + 6 non-transient cases + non-Error values (strings, null, undefined, numbers)
  • 3 integration tests for probe retry:
    • Successful retry: first attempt fails with "Navigating frame was detached", second succeeds
    • Immediate throw: non-transient error ("FONT_FETCH_FAILED") throws without retrying
    • Exhaust budget: persistent transient error ("Target closed") throws after max attempts

— Miga

The distributed render plan stage crashes when headless Chrome encounters
a transient frame detachment ("Navigating frame was detached") during
browser probe, with no retry logic. The plan tarball is never uploaded,
and all downstream chunk workers fail with S3 404.

Add a retry-with-fresh-session mechanism to the probe stage:

- `isTransientBrowserError()` classifier in the engine identifies 9
  known transient Puppeteer/Chrome errors (frame detached, target closed,
  session closed, protocol error, page crashed, execution context
  destroyed, etc.).

- `runProbeStage()` wraps browser session creation + initialization in a
  retry loop (max 2 attempts). On transient error: logs structured
  diagnostics (attempt, isTransient, error message, elapsed time), closes
  the crashed session cleanly, creates a fresh browser, and retries. Non-
  transient errors throw immediately without consuming retry budget.

- 17 unit tests for the error classifier, 3 integration tests for retry
  behavior (successful retry, immediate throw on non-transient, exhaust
  retry budget on persistent transient).

Closes #1687

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jrusso1020 jrusso1020 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read all 5 files end-to-end. Posting as COMMENT per the team's bot-author stamp-security rule (per CLAUDE.md: "NEVER stamp ... PRs autonomously generated by another bot without James's explicit permission"). On merit, the fix is well-designed and the 5 review-discipline concerns Home flagged all check out, with one fragility to flag.

Audit against the 5 concerns flagged

1. Transient vs fatal classification. Pattern matching on error message strings, 9 regex patterns covering known Puppeteer/Chrome transient errors. The 17 unit tests pin each pattern as a positive case + 6 non-transient cases as negatives + non-Error values (strings, null, undefined, numbers). The pattern-match shape is inherently fragile against upstream message drift — if Puppeteer reworks "Navigating frame was detached" to "Frame detached during navigation," the classifier silently misses it and the retry never triggers. Workspace mitigations exist (the unit-test fixtures will keep flagging on drift); a follow-up could pin Puppeteer to a known-good version range. ✓ (with the drift-fragility noted)

Minor pattern cleanup: two of the nine are redundant subsets — /Protocol error.*Target closed/i is matched by /Target closed/i (the latter strictly more general), and /Navigation failed because browser has disconnected/i is matched by /browser has disconnected/i. Functionally equivalent to 7 distinct patterns. Not a correctness issue; just code-tidiness.

2. Bounded retries. PROBE_MAX_ATTEMPTS = 2 — max 1 retry. ✓ But no backoff between attempts (no await sleep(...)). For a frame-detach where the browser is killed + restarted that's probably fine (the close + recreate has its own latency floor). For "browser has disconnected" where the underlying network/OS resource is flaky, a tight retry could hammer it. Worth either adding a brief sleep (e.g. 500ms) or accepting the no-backoff stance with a code comment.

3. Idempotency. ✓ on the close-and-recreate side:

  • First attempt's probeSession is closed via closeCaptureSession(probeSession) before retry.
  • probeSession = null after close prevents stale references.
  • try/catch wraps the close itself, so a failed close on a crashed session is logged + swallowed, not blocking retry.

One workdir question: both attempts pass join(workDir, "probe") as the session dir. If attempt 1 left partial state under that path before crashing, attempt 2 reuses the same dir. Whether createCaptureSession is idempotent against pre-existing partial state depends on what probe/ contains. Worth a 30-second look — probably fine since the browser writes are mostly transient, but the comment "guaranteed non-null" at the loop exit reads cleanly only if attempt 2 truly starts clean.

The probe stage doesn't write the plan tarball (that's later), so no double-billing risk. ✓

4. Observability. log.warn("Browser session initialization failed", { attempt, maxAttempts, isTransient, error, elapsedMs }) on each retry boundary — exactly the structured fields Home asked for. Plus log.info("Retrying with a fresh browser session...", { attempt, maxAttempts }) before retry and log.info("Composition ready", { attempt, initMs }) on success. ✓

What's missing: no metric counter. Logs are searchable but not dashboard-able. A stats.incr("hf.producer.probe.retry", { isTransient, attempt }) would let ops track retry-rate over time and surface a drift case (e.g. retry rate creeping up after a Chromium update). Follow-up, not blocking.

5. Test coverage. Three integration tests in probeStage.test.ts: retried-then-succeeded, immediate-throw-non-transient, exhausted-retry-budget. Plus 17 unit tests on isTransientBrowserError. ✓ Comprehensive.

One fragility on the integration tests: the test file mocks @hyperframes/engine's isTransientBrowserError with a separate inline regex (line ~46), rather than importing the real implementation through the test mock. That's a duplicate-source-of-truth — if a future PR adds a new pattern to the real classifier, the test mock won't reflect it, and the integration test still passes against a stale shape. Lowest-friction fix: import the real implementation through the mock (isTransientBrowserError: (await vi.importActual<typeof import("@hyperframes/engine")>("@hyperframes/engine")).isTransientBrowserError) or accept the dup with a comment naming the maintenance burden.

CI

Re-pulled at post-time: 51 SUCCESS / 4 IN_PROGRESS / 6 SKIPPED / 1 FAILURE (Fallow audit). The Fallow audit complaint has been the same shape across multiple PRs in the past 24h (HF#1665's subcomposition_root_styled_by_class complexity, the music-to-video skill ignore additions) — usually pre-existing complexity in the modified file rather than the PR's own change. Worth confirming whether probeStage.ts was already above the threshold pre-PR — if yes, ignore-add or refactor as Miao did on HF#1665's composition.ts. If no, this PR pushed it over, and the refactor (e.g. extract the retry loop into a helper) is the right shape.

Body claims verified

  • "17 unit tests for isTransientBrowserError" — actually 17 in the table (10 transient + 6 non-transient + 1 non-Error-values = 17 expectations across 3 it/it.each blocks). ✓
  • "3 integration tests for probe retry" — verified by name. ✓
  • "9 known transient Puppeteer/Chrome errors" — count generous (2 are redundant subsets, see above). Effectively 7 distinct + 2 narrower variants. ✓ on the spirit.

Stamp posture

Per the team's stamp-security rule, not stamping a Miga-authored bot PR without James's explicit greenlight (same precedent as HF#1655). Miguel's "take a look" is a routing request, not a stamp greenlight. From my read on code quality: solid fix, well-tested, well-observed; three non-blocking notes (pattern drift, backoff, metric counter, test-mock duplication, Fallow audit follow-up).

Review by Jerrai

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: NIT (lean toward request-changes on finding #1)

Reviewing at head 8deb82f3a3495829086083e14628774e2c67d648, base main. CI mostly green; Fallow audit failed on the second run due to an npm registry integrity check on @fontsource/source-code-pro (infra noise — Jerrai's review separately notes the Fallow audit can also surface complexity issues in the modified file, worth confirming whether probeStage.ts was already over threshold pre-PR).

Read Jerrai's prior review end-to-end before composing this. Converging with Jerrai on three points (test mocks the primitive, no backoff, classifier fragility against upstream message drift), adding one substantive scope gap they didn't flag, and one secondary observation.

The classifier is well-shaped overall (9 regex patterns, not a single string match), retry is bounded (max 2 attempts), the throw-vs-retry branch is correctly gated on isTransient, and the structured log on the retry boundary has the right fields. Idempotency on partial progress is safe: the crashed session is closed in its own try/catch (close-failure logged, not rethrown), probeSession is nulled, and a fresh createCaptureSession is allocated. fileServer is created once outside the loop and reused — fine, it's static asset serving. The plan tarball is downstream of probe and is only written if probe completes, so the "partial plan → 404" failure mode the PR addresses is not reintroduced.

Findings

1. (New, not in Jerrai's review.) Scope gap — createCaptureSession is NOT inside the try/catch.

packages/producer/src/services/render/stages/probeStage.ts lines 184-198 (HEAD 8deb82f3):

for (let attempt = 1; attempt <= PROBE_MAX_ATTEMPTS; attempt++) {
  log.info("Creating capture session...", { attempt, maxAttempts: PROBE_MAX_ATTEMPTS });
  probeSession = await createCaptureSession(  // <-- NOT wrapped
    fileServer.url,
    join(workDir, "probe"),
    captureOpts,
    null,
    probeCfg,
  );
  log.info("Waiting for composition to initialize...", { attempt });
  const initStart = Date.now();
  const heartbeat = setInterval(...);
  try {
    await initializeSession(probeSession);   // <-- only this is wrapped
    clearInterval(heartbeat);
  } catch (err) {
    ...

The PR body cites initializeSession as the production failure site, so the chosen scope catches the reported symptom. But a transient Chrome/CDP failure can equally surface from createCaptureSession itself — Failed to launch the browser process, CDP connect ECONNREFUSED, the Page crashed race during browser.newPage(), OOMKilled at launch. Those throws will short-circuit the retry on attempt 1 with no second attempt. If this is meant to be the canonical "transient browser failures retry" boundary for the probe stage, the create call belongs inside the try.

This is also why the test mock in finding #2 below paints a misleading picture — none of the three integration tests exercise a transient throw from createCaptureSession, because the implementation doesn't retry it.

2. (Convergence with Jerrai's finding #5.) Band-aid pattern #5 — test mocks the production primitive instead of importing it.

packages/producer/src/services/render/stages/probeStage.test.ts (HEAD 8deb82f3) inlines a copy of the classifier regex in the @hyperframes/engine mock:

isTransientBrowserError: (error: unknown) => {
  const message = error instanceof Error ? error.message : String(error);
  return /Navigating frame was detached|Target closed|Session closed|Protocol error.*Target closed|browser has disconnected|Page crashed|Execution context was destroyed|Cannot find context with specified id/i.test(
    message,
  );
},

The real classifier in packages/engine/src/services/frameCapture.ts lines 1941-1956 is a 9-element pattern array. If the engine list later gains an entry (/OOMKilled/i, /Connection refused.*CDP/i), the producer integration test will silently keep classifying against the OLD list — false positives in the test, false negatives in production. Jerrai's suggested fix (import the real isTransientBrowserError through the mock via vi.importActual) eliminates the drift surface; if duplication is preferred for isolation reasons, at minimum add a comment naming the maintenance burden. Mocking the primitive that the test is supposed to be exercising is the textbook band-aid shape — it makes the test pass but couples it to a frozen copy of production behavior.

3. (Jerrai's catch I missed — secondary nit.) Two patterns are redundant subsets.

In packages/engine/src/services/frameCapture.ts lines 1941-1951: /Protocol error.*Target closed/i is strictly narrower than /Target closed/i, and /Navigation failed because browser has disconnected/i is strictly narrower than /browser has disconnected/i. Effectively 7 distinct patterns, not 9. Functionally equivalent — not a correctness issue. Cleanup nit.

4. (Convergence with Jerrai's finding #2.) NIT — no backoff between attempts.

Immediate retry. At max 2 attempts and a forced close+create cycle in between, the stampede risk is bounded, but for "browser has disconnected" where the underlying network/OS resource is flaky, hitting it again with no delay usually won't help. A 500ms-1s await sleep(...) before the retry is cheap; alternatively, accept the no-backoff stance with a brief code comment naming why.

5. NIT — retry budget is tight (max 2 = 1 retry).

Not a blocker. The "Navigating frame was detached" pattern is typically a single-shot crash, so 1 retry will catch it. But if the failure mode is browser-pool exhaustion (rather than per-session frame detach), 2 attempts won't help. Worth a follow-up Linear if production telemetry shows attempt: 2 → failed becomes a recurring pattern.

6. NIT (Jerrai's finding #4) — no metric counter on retries.

Logs are structured and SREs can search on attempt: 2, but there's no stats.incr("hf.producer.probe.retry", { isTransient, attempt }). Not dashboard-able as it stands. Follow-up.

Explicit answers to the review checklist

  • (a) Error-classification scope: Class of regex patterns (not a single string match). 9 listed; effectively 7 distinct (see finding #3). Coverage includes frame detached, Target/Session closed, Protocol error + Target closed, browser disconnected, Page crashed, Execution context destroyed, Cannot find context. Notable absences: Failed to launch the browser process, OOMKilled/SIGKILL signatures, raw CDP ECONNREFUSED / ECONNRESET, WebSocket is not open.
  • (b) Idempotency on partial progress: Safe. Crashed session is closed in a try/catch (close-failure logged, not rethrown), probeSession is nulled, fresh createCaptureSession is allocated. workDir/probe directory is reused across attempts but createCaptureSession owns that path's contents — no cross-attempt concatenation/corruption risk in this code. The plan tarball is downstream and only writes on probe completion, so the "partial plan → 404" failure mode is not reintroduced. (Jerrai raised the workDir/probe reuse question; on reading the engine side the reuse is fine.)
  • (c) Retry budget + backoff: Per-probe-stage budget, max 2 attempts (= 1 retry). No backoff, no jitter. Budget is not per-chunk so no 100x multiplier risk. Fine for the immediate fix; consider exponential + jitter if/when this gets generalized.
  • (d) Test coverage of negative cases: Yes — three integration tests cover (1) transient → success on attempt 2, (2) non-transient throws immediately on attempt 1, (3) persistent transient exhausts the budget. The 17 classifier unit tests include explicit expect(...).toBe(false) for FONT_FETCH_FAILED, TimeoutError, net::ERR_NAME_NOT_RESOLVED, Composition duration is 0, SYSTEM_FONT_USED, empty string. The remaining gap is that no test asserts a transient error from createCaptureSession (rather than initializeSession) gets retried — because the implementation doesn't, per finding #1.

Summary

Findings #1 and #2 are the substantive ones. Finding #1 (createCaptureSession outside the retry scope) is the band-aid pattern #3 (silent scope gap) — the PR retries the cited production crash site but leaves a parallel transient surface (browser launch) unretried; an easy 3-line refactor closes it. Finding #2 (test mocks the primitive) is band-aid pattern #5 — straightforward to fix via vi.importActual or an explicit deprecation comment. Findings #3-6 are nits and align with Jerrai's review.

Review by Via

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R1 @ 8deb82f3 — Rames D Jusso review

Reviewed at 8deb82f3a3495829086083e14628774e2c67d648. Targeted: transient classification completeness, sibling-stage scope, retry shape, state cleanup, idempotency, observability, tests, Fallow audit signal.


🟢 Strong points

  • isTransientBrowserError placement in @hyperframes/engine (frameCapture.ts:1936-1956) is the right module — colocated with the Puppeteer-facing code that raises these errors, and the re-export at packages/engine/src/index.ts:82-89 keeps the public surface clean.
  • Retry loop in probeStage.ts:181-236 cleanly separates transient (retry with fresh session) from non-transient (immediate throw, no budget consumed). Structured log.warn with attempt / maxAttempts / isTransient / error / elapsedMs (probeStage.ts:204-210) is the right shape for log-based queries.
  • Crashed-session teardown is guarded (probeStage.ts:212-218): closeCaptureSession failure is logged + swallowed, doesn't mask the original retry-success path. Good failure-mode discipline.
  • assertNotAborted() between attempts (probeStage.ts:226) means a kill signal arriving mid-retry exits cleanly without burning a second browser launch.
  • Test for "non-transient → throw immediately, no retry consumed" (probeStage.test.ts:267-289) is the exact negative case I would have flagged as missing.
  • Per-attempt closeCaptureSession ⇒ "create fresh browser" semantics (probeStage.ts:184 re-enters createCaptureSession from scratch each attempt) — no risk of reusing a zombie Chrome's CDP socket. This is what makes "immediate retry, no backoff" defensible.

🟡 Scope / follow-up question (HIGH-LEVERAGE)

  • renderChunk.ts:541-543 runs the same createCaptureSession + initializeSession sequence and has NO retry. In the distributed pipeline, even with a successfully-produced plan tarball, every sequential chunk (chunkWorkerCount === 1 branch) re-boots a Chrome and re-runs initializeSession to navigate to the composition. The exact same Navigating frame was detached error from the #1687 production incident will still kill a chunk activity — just now it'll be a chunk failure instead of a plan failure. Similarly, parallelCoordinator.ts:284-296 (parallel chunk workers) calls createCaptureSession + initializeSession per-worker with no transient classification; the existing comment at parallelCoordinator.ts:268-269 explicitly acknowledges "Target closed" is a known compositor-race failure mode there.
    • Question for Miguel: what's the plan for the chunk-stage transient cohort? Three reasonable answers:
      1. Temporal-level retry handles it — but I can't find an activity-side RetryPolicy or nonRetryable configuration in the producer tree (grep -rEn "RetryPolicy|maximumAttempts|ApplicationFailure|nonRetryable" packages/producer returned 0 hits). If chunk retries happen at the workflow layer in a different repo, please confirm + drop a code-pointer comment.
      2. Follow-up PR for chunk-stage retry — fine, but I'd suggest at least extracting the isTransientBrowserError + close-and-recreate shape into a helper (e.g. withTransientBrowserRetry(fn, { maxAttempts, log })) in this PR so the follow-up is a one-call swap rather than a re-implementation. Otherwise the next person adds a narrower / divergent classifier in renderChunk.ts and we get the sibling-asymmetry shape I've flagged before.
      3. Probe is genuinely different (memory pressure during the warmup-shim phase) — possible, but the issue body lists Navigating frame was detached as a Chrome-internal compositor-race symptom, which doesn't really care which navigation triggered it.
    • This is the highest-leverage finding in the review. The PR ships as-is is fine; this is a "answer in thread + open a follow-up issue" 🟡, not a blocker.

🟡 Pattern-list completeness

  • The 9 patterns cover the named-in-#1687 cohort, but a few well-known Puppeteer transients that show up in the same incident class are absent:
    • WebSocket is not open / Connection closed (CDP socket drop — Puppeteer surfaces this as Connection.send / WebSocket is not open: readyState 3)
    • socket hang up / ECONNRESET / EPIPE (kernel-side socket teardown, often paired with OOM-killed Chrome before the Target closed event reaches the client)
    • net::ERR_NETWORK_CHANGED / net::ERR_INTERNET_DISCONNECTED (Puppeteer raises these as net::ERR_* strings, distinct from the FONT_FETCH path that's correctly NOT retried)
    • These would be additive (not a correctness risk for the current rollout — false-negatives just bubble up the same way they do today, so adding them later is safe). Decide whether you want to grow the list now or wait for the next incident's failure string. Not a blocker; calling out so it doesn't get lost.
  • Protocol error.*Target closed (frameCapture.ts:1945) is logically redundant with the broader /Target closed/i at line 1943 — any message matching the former matches the latter. Cosmetic. Drop one, or keep both for self-documentation, your call.

🟡 Observability: log-only, no first-class metric

  • Per-attempt structured logging is good. Missing: a counter / event that names this cohort so the production-failure-mode in #1687 has a queryable post-fix signal beyond log-search. Without it, a recurrence won't trip a dashboard alert — you'll find out from the same downstream symptom (plan() failure → all chunks 404'ing) that brought the issue here in the first place.
  • A minimal shape: emit a structured event (or log.info with a stable name like producer.probe.transient_retry) on BOTH the retry-fired path (probeStage.ts:222-225) AND the retry-exhausted path (just before throw err on the last attempt at probeStage.ts:229), with dims {isTransient, attempt, errorPattern}. Then a dashboard widget on attempt distribution + a monitor on retry-exhausted gives you the closed-loop visibility this fix is supposed to deliver.
  • Per the memory I keep on observability PRs: success-only emission silently fails the PR's own goal. The current log.warn shape DOES fire on both paths (since it's inside catch before the transient/exhaust branch), so this is more "upgrade-the-signal-quality" than "fix-a-silent-failure" — but worth raising.

🟡 Retry shape: no backoff, max 2 attempts

  • PROBE_MAX_ATTEMPTS = 2 with immediate retry (no setTimeout / await sleep(jitter)) is defensible BECAUSE you fully recreate the browser via createCaptureSession, which spawns a fresh Chrome process. Zombie reuse isn't the concern. BUT: if the transient was caused by host-level memory pressure (an OOM-killed Chrome), an immediate respawn racing the kernel's reaper can hit a transient "fork: cannot allocate memory" symptom that the classifier won't match. Low probability for the stated cohort; flagging in case the production incident's box was under memory pressure when the original failure landed (worth checking the host metrics from the #1687 incident time-window). If pressure was a factor, a 1-2s jitter between attempts buys real recovery time.
  • max 2 attempts = total of 2 browser boots. Reasonable budget — if Chrome can't initialize twice on the same host within seconds of each other, the issue is no longer transient. No change requested.

🟢 Idempotency

  • The probe-write path doesn't actually emit any files before initializeSession succeeds — createCaptureSession just opens the page, initializeSession does the navigation. No partial-tarball-write risk on a failed first attempt. The discovery side-effects (composition duration, browser media, audio automation) all happen AFTER the retry loop on the surviving session, so they only run once. Clean.

🟢 Tests

  • 17 unit tests in frameCapture-transientErrors.test.ts:5-37 cover the 9 patterns + 2 additional Protocol error variants + 6 non-transient + 5 non-Error values. Good coverage of the classifier surface.
  • 3 integration tests in probeStage.test.ts:248-309 cover: (a) successful retry, (b) non-transient immediate throw, (c) retry exhaustion. The successful-retry test (probeStage.test.ts:249-264) asserts initializeSessionCallCount === 2 AND closeCaptureSessionCallCount === 1 — which DOES verify a fresh session was created and the crashed one was torn down (the test passes only if the loop re-enters createCaptureSession). That's the test I would have asked for, and it's already here.
  • Minor: the integration test mocks isTransientBrowserError inline (probeStage.test.ts:60-65) with a slightly different regex than the engine — copy-paste drift risk if the engine's pattern list grows. Suggest re-exporting the real isTransientBrowserError from @hyperframes/engine mock at the top of the test so additions to the engine's list are automatically tested at the probe-stage layer too.

🟢 Code-quality (Fallow audit / CRAP)

  • I was briefed to look for an extract-helper refactor with self-call risk per the audit failure. The retry loop is inlined, no helper extraction, so the self-call class doesn't apply here.
  • The actual runProbeStage function did grow by ~40 lines and gained one new nesting level (the for (let attempt...) wrapping the try/catch). Complexity is up but not pathologically so — the retry logic is the simplest shape that handles all four cases (success, transient-retry, transient-exhaust, non-transient).

🛑 CI signal — Fallow audit failure is NOT a CRAP score blocker

  • gh api /repos/heygen-com/hyperframes/actions/jobs/83106934398/logs shows the failing Fallow audit job actually died at bun install:
    error: Integrity check failed for tarball: @fontsource/source-code-pro
    error: failed to download @fontsource/source-code-pro@5.2.7: IntegrityCheckFailed
    
    This is a flaky npm-registry tarball-integrity error in the install step, not a code-quality / complexity threshold violation. Re-run the failed job — should pass on retry. Heads-up so it doesn't get treated as a real signal.
  • All other relevant gates (Lint, Typecheck, Build, Test, runtime contract, regression-shards 1-8, CodeQL, CLI smoke) are green. Smoke: global install / Render-on-windows / Tests-on-windows-latest are still in-progress at review time; expected to pass per the green prior-run results.

🟢 Verdict

Ship-as-is is defensible. The probe-stage fix is well-scoped, well-tested, observability is acceptable (log-only is fine for v1). The HIGH-LEVERAGE 🟡 is the chunk-stage scope question — if Temporal-level retry doesn't cover the same cohort, the production incident in #1687 will recur in a different stage with the same Chrome-internal failure signature. Worth answering in-thread + filing a follow-up.

Open-and-honest: I haven't traced what happens to a chunk-activity failure all the way to the Temporal workflow boundary, so the "Temporal handles it" answer might already be true — just couldn't verify in-repo.

— Rames D Jusso

Stamp routing: @rames Jusso once findings addressed (or punted with rationale).

- Move createCaptureSession inside the retry try/catch so browser launch
  failures (Failed to launch the browser process, ECONNREFUSED) are also
  retried — not just initializeSession errors.
- Deduplicate transient error patterns: remove "Protocol error.*Target
  closed" (subsumed by "Target closed") and "Navigation failed because
  browser has disconnected" (subsumed by "browser has disconnected").
- Add browser launch failure patterns: "Failed to launch the browser
  process" and "ECONNREFUSED".
- Add test for createCaptureSession transient throw (browser launch retry).
- Update test mock comment to document sync requirement with engine
  pattern list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@miguel-heygen miguel-heygen merged commit 546b2d7 into main Jun 24, 2026
44 checks passed
@miguel-heygen miguel-heygen deleted the fix/probe-frame-detach-retry branch June 24, 2026 04:07

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamped at HEAD 6e57a6a. Verified current checks are green and the review-feedback commit addresses the main scope/test concerns: createCaptureSession is inside the retry boundary, redundant transient patterns are deduped, and the producer retry test now uses the real classifier through the engine mock. Non-blocking follow-ups from prior reviews remain follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Distributed render: probe stage crashes on transient browser errors without retry

5 participants