Skip to content

fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags#3889

Merged
d-cs merged 21 commits into
mainfrom
snapshots-since-replica-primary-fallback
Jun 10, 2026
Merged

fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags#3889
d-cs merged 21 commits into
mainfrom
snapshots-since-replica-primary-fallback

Conversation

@d-cs

@d-cs d-cs commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

When RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED is on, RunEngine.getSnapshotsSince reads from the read replica. During write spikes the replica can briefly lag, so the snapshot id a runner just learned from the writer isn't visible there yet: the lookup threw, the worker route returned a 500, and the runner waited for its next poll — turning sub-second snapshot notifications into poll-interval latency exactly when things are busiest. This PR makes the flag safe to enable: a replica miss of the since snapshot gets one jittered retry on the replica (most lag windows are shorter than the ~50–200ms wait, so the writer is never touched), then falls back to the primary, observed via a new run_engine.snapshots_since.replica_miss counter with an outcome attribute (replica_retry vs primary). Only genuine misses — absent on the primary too — remain errors.

Design

  • getExecutionSnapshotsSince now throws a typed ExecutionSnapshotNotFoundError so the engine can distinguish the expected lag miss from real failures. The message string is unchanged and the error never leaves the engine.
  • The recovery path only engages when the flag is on, a distinct replica client is configured, and no transaction client was passed. With the flag off, the path is behaviorally identical to before.
  • Retry delay bounds are configurable (RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS/MAX_MS, default 50/200; MAX_MS=0 skips the replica retry and goes straight to the primary).
  • The warn log fires only when the primary serves the read (the writer spill is the operationally interesting event); replica-retry recoveries are counted but quiet. A permanently-missing snapshot id stays an error-level failure with a failedDuring field, so lag metrics aren't polluted by bogus ids.
  • Stale-tail lag (replica has the since snapshot but not newer rows) deliberately still returns the replica's view; the next poll catches up.
  • The since-snapshot anchor lookup is now scoped to the polled run (where: { id, runId }), so a snapshot id from a different run raises not-found instead of silently anchoring a too-wide window of the run's snapshots.

Test plan

All vitest + testcontainers, no mocks. A new schemaOnlyPrisma fixture (migrated-but-empty clone database) simulates a replica that hasn't caught up, and a real in-memory OTel meter pins the counter semantics per outcome.

  • Replica catches up during the jittered retry window → served by the replica, outcome=replica_retry = 1, primary never consulted
  • Replica permanently missing the since snapshot → served by the primary, outcome=primary = 1
  • Snapshot missing on both replica and primary → null, counter = 0
  • Replica has the since snapshot but lags by one → the replica's view is served, no fallback (verified discriminating power: the test fails if reads secretly hit the primary)
  • Flag off with a replica configured → primary serves the read
  • Transaction client provided → bypasses the replica entirely
  • Since snapshot belonging to a different run → null
  • Existing getSnapshotsSince + waitpoints suites green; run-engine, testcontainers, and webapp typechecks pass

🤖 Generated with Claude Code

@changeset-bot

changeset-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 8418c7b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR implements replica-aware snapshot polling for the Run Engine. When read-replica snapshots are enabled, getSnapshotsSince now validates that the requested snapshot ID belongs to the specified run, retries once on the replica with configurable jittered delay, and falls back to the primary database if the replica hasn't caught up. New test helpers enable copying snapshots to replicas and collecting OpenTelemetry metrics. A new schema-only Prisma fixture supports replica testing. Integration tests validate retry logic, metrics recording, primary fallback, and feature flag control.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding retry logic on the read replica followed by primary fallback for getSnapshotsSince when replica lags.
Description check ✅ Passed The description includes comprehensive sections covering summary, design decisions, test plan with verification checkmarks, and follows the required template structure with most key elements addressed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch snapshots-since-replica-primary-fallback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@d-cs d-cs self-assigned this Jun 10, 2026
coderabbitai[bot]

This comment was marked as resolved.

d-cs and others added 17 commits June 10, 2026 16:22
… for replica-lag tests

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…edundant cast

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ionSnapshotsSince

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lica misses the since snapshot

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…p fallback without a replica

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… and counter semantics for getSnapshotsSince

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…note, trim dead test setup

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… a different run (red)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…etSnapshotsSince

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…stinguish fallback failure logs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ds, tie-stable test assertions, shared clone/drop helpers

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d-cs d-cs force-pushed the snapshots-since-replica-primary-fallback branch from a26a5f0 to f4dac20 Compare June 10, 2026 15:24
@pkg-pr-new

pkg-pr-new Bot commented Jun 10, 2026

Copy link
Copy Markdown

Open in StackBlitz

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@f4dac20

trigger.dev

npm i https://pkg.pr.new/trigger.dev@f4dac20

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@f4dac20

@trigger.dev/plugins

npm i https://pkg.pr.new/@trigger.dev/plugins@f4dac20

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@f4dac20

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@f4dac20

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@f4dac20

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@f4dac20

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@f4dac20

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@f4dac20

commit: f4dac20

@d-cs

d-cs commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Critical path: before / after

How a runner's GET /snapshots/since/:id poll behaves during a replica-lag window with RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED on.

Before (why enabling the flag was previously unsafe)

sequenceDiagram
    participant R as Runner
    participant E as Webapp / RunEngine
    participant RO as Read replica
    participant P as Primary (writer)

    R->>E: poll since snap_X
    E->>RO: findFirst(snap_X)
    RO-->>E: ✗ not replicated yet
    Note over E: ERROR log → alert noise
    E-->>R: 500
    Note over R: wait for next poll interval (~5s)<br/>notification path degrades to polling
    R->>E: poll again
    E->>RO: findFirst(snap_X)
    RO-->>E: ✓ replicated by now
    E-->>R: 200 + snapshots
Loading

Cost per lag hit: error spam + alert noise + up to a full poll interval of added latency on the snapshot transition.

After (this PR)

sequenceDiagram
    participant R as Runner
    participant E as Webapp / RunEngine
    participant RO as Read replica
    participant P as Primary (writer)

    R->>E: poll since snap_X
    E->>RO: findFirst(snap_X)
    RO-->>E: ✗ not replicated yet
    Note over E: sleep 50–200ms (jittered, configurable.<br/>MAX_MS=0 skips straight to primary)
    E->>RO: retry findFirst(snap_X)
    alt replica caught up (common case)
        RO-->>E: ✓ snapshots
        Note over E: counter++ outcome=replica_retry (quiet)
        E-->>R: 200 + snapshots
    else lag outlasts the retry window
        RO-->>E: ✗ still missing
        E->>P: full query on the writer
        P-->>E: ✓ snapshots
        Note over E: counter++ outcome=primary + WARN (writer spill)
        E-->>R: 200 + snapshots
    else permanent miss (bogus / pruned id, or wrong run)
        RO-->>E: ✗ still missing
        E->>P: full query on the writer
        P-->>E: ✗ not found
        Note over E: ERROR log — unchanged from today
        E-->>R: 500
    end
Loading

Cost per lag hit: ≤ ~200ms added latency; the writer is only touched when lag outlasts the retry window.

Invariants

  • Flag off (and/or no distinct replica client configured, and/or a tx passed): none of this exists — the path is the old single primary read, unchanged.
  • Error path preserved: a snapshot id that is genuinely gone — or belongs to a different run, after the scoping fix in this PR — walks all three attempts and still ends in the same error-log-plus-500 as today, so stuck clients stay visible.
  • The run_engine.snapshots_since.replica_miss counter's two outcomes give the rollout signal: replica_retry ≈ lag frequency, primary ≈ writer-spill rate.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 6 additional findings.

Open in Devin Review

…eplica retry

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d-cs d-cs marked this pull request as ready for review June 10, 2026 15:49
@ericallam

Copy link
Copy Markdown
Member

@d-cs probably doesn't need 2 server changes files 👍

…entry

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment thread internal-packages/run-engine/src/engine/index.ts
…the writer fallback

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d-cs

d-cs commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Combined into a single server-changes entry in 51924bb.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d-cs d-cs enabled auto-merge (squash) June 10, 2026 16:53
@d-cs d-cs merged commit 6afc9bf into main Jun 10, 2026
88 of 91 checks passed
@d-cs d-cs deleted the snapshots-since-replica-primary-fallback branch June 10, 2026 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants