feat: route new task runs to a parallel task_run_v2 table by d-cs · Pull Request #4000 · triggerdotdev/trigger.dev

d-cs · 2026-06-19T17:03:52Z

Summary

New task runs can be routed to a parallel task_run_v2 Postgres table instead of the main TaskRun table, decided per-org by a feature flag and keyed purely by the run id's format. Existing runs stay in TaskRun, with no backfill. The flag ships off, so behavior is unchanged until an org is opted in.

This builds on the RunStore adapter that already funnels all Postgres TaskRun access through one place (writes in #3981, reads in #3990). RunStore now routes each run to its physical table by id format: a KSUID id means task_run_v2, anything else (including legacy cuids) means TaskRun.
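As a rough illustration of the id-format routing described above, here is a minimal TypeScript sketch. The 27-character base62 check and the `tableForRunId` helper are assumptions made for this example; the PR only names `isKsuidId`, and its real check may differ.

```typescript
// Physical tables a run can live in (names taken from the PR description).
type RunTable = "TaskRun" | "task_run_v2";

// Assumption: a KSUID is rendered as 27 base62 characters.
// The PR's actual isKsuidId may validate something different.
const KSUID_PATTERN = /^[0-9A-Za-z]{27}$/;

function isKsuidId(id: string): boolean {
  return KSUID_PATTERN.test(id);
}

// Hypothetical helper: anything that is not recognisably a KSUID
// (legacy cuids, malformed ids, empty strings) falls back to the legacy table.
function tableForRunId(id: string): RunTable {
  return isKsuidId(id) ? "task_run_v2" : "TaskRun";
}
```

Because the decision depends only on the id string, a by-id read or write in this sketch never needs the feature flag or a second query.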

Design

The discriminator is the id format. New runs mint a KSUID when their org has the runTableV2 flag on; everyone else keeps minting legacy ids. The flag is read in memory at the single mint site in the trigger path, so the hot path adds no query. RunStore never sees the flag: it routes purely by isKsuidId(id), and a malformed id falls back to legacy.
By-id reads and writes stay single-table (O(1), one table). Only predicate reads that cannot name a table touch both. findRuns does a bounded two-way merged keyset cursor (ordered reads standardize on a (createdAt, id) keyset, since cuid and KSUID do not share a sort range), and a non-id findRun (idempotency-key dedup, or "are there any runs in this environment") queries both tables. Both apply identical scoping to each table, so a merge cannot leak a run across an auth boundary.
Idempotency is three-source while an org has runs in both tables: legacy TaskRun, task_run_v2, and the mollifier buffer, so a reused key is always found and never produces a duplicate run.
The ClickHouse mirror is always ready. The replication service co-publishes task_run_v2 from the start (empty until orgs cut over), streaming its WAL rows through the same transform into the same ClickHouse table.
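The bounded two-way merge on a (createdAt, id) keyset could look roughly like the sketch below. It assumes each table has already been queried with the same scoping, the same ordering, and the same limit; `RunRow`, `compareDesc`, and `mergeRunPages` are hypothetical names, and the keyset continuation (the "rows after this cursor" predicate) is left out.

```typescript
interface RunRow {
  id: string;
  createdAt: Date;
}

// Newest first; ties on createdAt are broken by id, descending.
// localeCompare is used so the in-memory tie-break is locale-aware,
// as the PR describes for the en_US database collation.
function compareDesc(a: RunRow, b: RunRow): number {
  const byTime = b.createdAt.getTime() - a.createdAt.getTime();
  return byTime !== 0 ? byTime : b.id.localeCompare(a.id, "en");
}

// Both inputs are assumed to be sorted by compareDesc already.
function mergeRunPages(legacy: RunRow[], v2: RunRow[], limit: number): RunRow[] {
  const merged: RunRow[] = [];
  let i = 0;
  let j = 0;
  while (merged.length < limit && (i < legacy.length || j < v2.length)) {
    const a = legacy[i];
    const b = v2[j];
    if (a !== undefined && (b === undefined || compareDesc(a, b) <= 0)) {
      merged.push(a);
      i++;
    } else if (b !== undefined) {
      merged.push(b);
      j++;
    }
  }
  return merged;
}
```

Fetching `limit` rows from each table is enough for a correct merged page, because no row beyond the first `limit` of either table can appear in the first `limit` rows of the merge.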

task_run_v2 carries the same relations as TaskRun, and the incoming foreign keys pointing at TaskRun are dropped so the two tables are not coupled by constraints.

Stacked on #3990 (its base), so this PR shows only the routing commits on top of the read adapter.

Before enabling the flag for any org, task_run_v2 needs REPLICA IDENTITY FULL applied the same out-of-band way as TaskRun, so its update and delete events stream to ClickHouse with the old row.

…esolution

Replaces the seven throwing stubs on PostgresRunStore with verbatim relocations of the Prisma statements from runAttemptSystem: startAttempt, completeAttemptSuccess, recordRetryOutcome, requeueRun, recordBulkActionMembership, cancelRun, and failRunPermanently. Each method splices the caller-supplied select/include into the Prisma call. Tests use real Postgres containers and cover each method including edge cases (append semantics, conditional fields in cancelRun).

…int methods

…y-clear, and array-append methods

Replaces the seven throwing stubs in PostgresRunStore with verbatim-relocated Prisma statements sourced from delayedRunSystem, debounceSystem, updateMetadata, idempotencyKeys, resetIdempotencyKey, batchTriggerV3, and the realtime-stream route handlers.

- rescheduleRun: writes delayUntil always; queueTimestamp when provided; nested DELAYED executionSnapshot when snapshot arg provided
- enqueueDelayedRun: sets status PENDING + queuedAt
- rewriteDebouncedRun: pass-through update with associatedWaitpoint include
- updateMetadata: optimistic-lock path (updateMany with version predicate) or direct path (update without predicate); both return { count }
- clearIdempotencyKey: three discriminated-union branches: byId clears both columns, byPredicate clears both, byFriendlyIds clears only idempotencyKey
- pushTags: push-append to runTags array; returns { updatedAt }
- pushRealtimeStream: push-append to realtimeStreams array; returns void

…bapp BaseService Add RunStore field to SystemResources, instantiate PostgresRunStore in RunEngine constructor (after prisma/readOnlyPrisma are set), and expose it on the resources object passed to all systems. Create a webapp singleton (runStore.server.ts) and thread it as a default parameter into BaseService so subclasses can access it without changes.

…ually pass

…eave-unchanged semantics)

…s through RunStore

…nStore

…er input

… debounce writes through RunStore

… writes through RunStore

The service statically imported the db.server-backed runStore singleton, which dragged the Prisma client into otherwise-light test module graphs and opened an eager connection to DATABASE_URL on import. The metadata service test then threw an unhandled connection error whenever no database was reachable at the configured address. Make runStore a required constructor option, pass the singleton at the production construction site, and inject a testcontainer-backed store in the tests.

Add findRun, findRunOrThrow and findRuns to RunStore, mirroring the existing write methods. They pass where/select/include through the same Prisma generics and default to the read replica, while letting the caller pass the writer or a transaction client when needed. This lets Postgres reads of TaskRun be routed through the store the same way writes already are. Additive only; no call sites change yet.

Add a no-args overload to findRun, findRunOrThrow and findRuns that returns the whole TaskRun row, for callers that read a run without a select or include.

Relocate the direct TaskRun reads in the engine and its systems to the RunStore read methods, preserving the exact client (writer, replica, or transaction) at each site. Behavior-preserving; the engine test suite is unchanged.

…tore Relocate the direct TaskRun reads in webapp services, run-engine concerns, realtime, mollifier and metadata to the RunStore read methods, preserving the exact client (writer, replica, or transaction) at each site. The run hydrator now receives the store by injection. Behavior-preserving.

Relocate the dashboard presenter TaskRun reads to the RunStore read methods, preserving the exact client per site. Behavior-preserving.

…store Relocate the route and loader TaskRun reads to the RunStore read methods, preserving the exact client per site, including the replica-resolve then writer-recheck realtime paths. Behavior-preserving.

…store Decompose the three reads that pulled TaskRun in through a parent model's relation include (alert, batch results, attempt dependencies): query the parent without the include, hydrate the run(s) via RunStore in a single batched read, and stitch them back. Preserves field selection, ordering, null handling and the query client. Adds container-backed tests for the batch-results and cancel-dependencies paths.
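The decompose, hydrate, and stitch pattern described above might be sketched as follows. The `Alert` and `Run` shapes and the `findRunsByIds` callback are hypothetical stand-ins for the parent model and for a batched RunStore read; they are not the PR's actual types.

```typescript
interface Alert {
  id: string;
  taskRunId: string | null;
}

interface Run {
  id: string;
  status: string;
}

// Hypothetical stand-in for one batched read through the run store.
type FindRunsByIds = (ids: string[]) => Promise<Run[]>;

async function hydrateAlertsWithRuns(
  alerts: Alert[],
  findRunsByIds: FindRunsByIds
): Promise<Array<Alert & { taskRun: Run | null }>> {
  // Collect the distinct run ids referenced by the parents.
  const referenced = alerts
    .map((alert) => alert.taskRunId)
    .filter((id): id is string => id !== null);
  const ids = Array.from(new Set(referenced));

  // One batched read, skipped entirely when nothing is referenced.
  const runs = ids.length === 0 ? [] : await findRunsByIds(ids);
  const byId = new Map<string, Run>(runs.map((run): [string, Run] => [run.id, run]));

  // Stitch back in the parents' original order; a missing run becomes null.
  return alerts.map((alert) => ({
    ...alert,
    taskRun: alert.taskRunId === null ? null : (byId.get(alert.taskRunId) ?? null),
  }));
}
```

Mapping over the original parent array keeps the parent ordering and null handling independent of whatever order the batched read returns.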

…tover The recovery script joins TaskRunExecutionSnapshot to TaskRun in raw SQL, so it is the one TaskRun read not routed through the run store. Add a note to revisit it at table cutover.

…on id-list reads findRuns now throws when given skip: offset pagination cannot span the two run tables, where each would independently skip N rows from its own result rather than N from the merged result. For an id-list predicate (id in [...]), it now queries only the table whose id format can contain those ids, avoiding a wasted query against an empty task_run_v2 while it is unpopulated during rollout.

…e merge collation A single-format id-list narrows findRuns to one physical table, but the ordered+limited path still built the cross-table comparator and threw the time-key guard; it now delegates natively to the one table (Postgres orders within a single table fine). Separately, the in-memory merge comparator ordered strings by code unit while the Postgres keyset continuation orders by the database collation (en_US); switching the comparator to localeCompare makes them agree, so a tied-createdAt boundary spanning both tables no longer skips or duplicates a row.
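The disagreement between code-unit ordering and locale-aware ordering can be seen with a few short strings. This is only a sketch: the sample ids are made up, and JavaScript's `"en"` collation is not guaranteed to match the database's en_US collation in every case.

```typescript
// Sample ids only; real cuids and KSUIDs are longer.
const sampleIds = ["b", "A", "a", "B"];

// Code-unit order: every uppercase ASCII letter sorts before every lowercase one,
// so this yields ["A", "B", "a", "b"].
const byCodeUnit = [...sampleIds].sort((x, y) => (x < y ? -1 : x > y ? 1 : 0));

// Locale-aware order: letters are grouped alphabetically first, with case
// acting only as a tie-breaker, so "a" sorts before "B".
const byLocale = [...sampleIds].sort((x, y) => x.localeCompare(y, "en"));
```

Lowercase-only cuids and mixed-case KSUIDs are exactly the kind of input where the two orderings diverge, which is why a tied-createdAt boundary could previously skip or duplicate a row.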

The pre-gate idempotency claim was eligible only when the org was on the mollifier. Concurrent same-key triggers that straddle a runTableV2 flip can mint into different physical tables, whose per-table unique constraints can't see each other, so two runs could share one key. The claim is now also eligible when the org is cut over to the v2 run table, serialising those triggers through Redis.

…warn when missing A v2 run DELETE needs the full old row so its ClickHouse soft-delete tombstone carries organization and environment ids; under the default replica identity those are dropped and the tombstone is lost. A migration sets REPLICA IDENTITY FULL on task_run_v2 rather than relying on an out-of-band step, and the replication client now warns when any co-published table that publishes UPDATE/DELETE lacks FULL. Adds a replication test for the v2 DELETE tombstone.

A v2 run can reference a legacy parent/root, or have legacy children, when a hierarchy straddles a runTableV2 flip. Prisma relation selects are bound to one table, so the run, span, and API-retrieve presenters returned null parent/root and dropped cross-table children. They now resolve parent/root by id (RunStore routes by id format) and children by a both-table predicate, via a shared hydrateParentAndRoot/hydrateChildRuns helper.

When a non-id predicate matches a row in both physical tables, findFirstAcrossTables now returns the v2 copy instead of legacy. Under this PR a run is in exactly one table (createRun routes by id format), so this is a no-op today; it forward-aligns with the later slow legacy to v2 migration, which copies a run into task_run_v2 (the canonical, operated-on copy) before operating. A comment in findRuns marks the matching dedup-by-id work for that migration PR.

TaskRunV2 declared implicit many-to-many relations (tags, connectedWaitpoints) whose join tables were never created by any migration and are absent from the database. Nothing reads them (v2 run tags use the scalar runTags array), so they were pure schema-vs-migration drift. Removing them makes the schema match the database with no migration.

findRuns rejects a Prisma cursor or a negative take on a both-tables read (neither can span two tables) instead of silently returning a wrong or empty result, and tablesForWhere now routes a plain id or friendlyId equality to the single matching table by id format, not just id:{in} lists. Also documents that the cross-table merge comparator assumes the en_US database collation and the COLLATE C fix needed for other collations.

… off Concurrent same-key triggers that straddle a runTableV2 flag flip can mint into different physical tables (cuid to TaskRun, ksuid to task_run_v2), whose per-table unique constraints cannot see each other, so neither insert conflicts and two runs share one key. The pre-gate claim now resolves its backend through a claim-only Redis buffer when the mollifier buffer is absent, so it serialises these triggers instead of falling open. v2-cutover orgs are claim-eligible for every idempotency-keyed trigger, including triggerAndWait, debounce, and one-time-use tokens, and the claim-resolved path blocks the parent on the winner's waitpoint.
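The serialising claim can be pictured as a set-if-absent with a TTL, which is what Redis `SET key value NX PX ttl` provides. The sketch below uses an in-memory store so it runs without Redis; `ClaimStore`, `InMemoryClaimStore`, `triggerWithClaim`, the key layout, and the 30-second TTL are all assumptions for illustration, not the PR's implementation.

```typescript
interface ClaimStore {
  // Resolves true only for the first caller while the claim is live.
  claim(key: string, owner: string, ttlMs: number): Promise<boolean>;
}

// In-memory stand-in for a Redis-backed claim buffer.
class InMemoryClaimStore implements ClaimStore {
  private readonly claims = new Map<string, { owner: string; expiresAt: number }>();

  async claim(key: string, owner: string, ttlMs: number): Promise<boolean> {
    const now = Date.now();
    const existing = this.claims.get(key);
    if (existing !== undefined && existing.expiresAt > now) {
      return false; // someone else holds a live claim
    }
    this.claims.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }
}

type TriggerOutcome = { kind: "minted"; runId: string } | { kind: "lost-claim" };

// Only the claim winner goes on to insert a run, whichever table its id routes to.
async function triggerWithClaim(
  store: ClaimStore,
  environmentId: string,
  idempotencyKey: string,
  runId: string
): Promise<TriggerOutcome> {
  const claimKey = `idempotency-claim:${environmentId}:${idempotencyKey}`;
  const won = await store.claim(claimKey, runId, 30000);
  return won ? { kind: "minted", runId } : { kind: "lost-claim" };
}
```

In the PR's description the losing trigger does not simply fail: it resolves to the winner's run, and a triggerAndWait parent blocks on the winner's waitpoint. That part is omitted here.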

A run routed to task_run_v2 was invisible to the Electric realtime feed, whose shapes were bound to the TaskRun table, so subscribeToRun, useRealtimeRun, and run polling returned nothing for those runs. Single-run subscriptions now route the shape to the correct table by id format, and the tag and batch feeds run two upstream shapes (TaskRun and task_run_v2) merged under one composite cursor the client round-trips opaquely, so no SDK change is needed.

runTableV2 is resolved per organization only, so a global toggle on the admin flags page did nothing. Mark it read-only there to remove the misleading control; per-org control stays on the org dialog.

…ment The parent/root/child hydration that resolves a run's hierarchy across both run tables looked runs up by id alone. Those pointers are now plain scalars with no foreign-key enforcement, so a stale or malformed pointer could resolve to a run in another environment and leak its metadata through the run and span presenters. Scope every lookup to the run's runtimeEnvironmentId, restoring the same-environment guarantee the table-bound relation select used to provide.
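A minimal sketch of the environment-scoped lookup follows. The `RunPointer` shape, the `FindRun` callback, and the in-memory finder are hypothetical; the point is only that the predicate carries both the id and the child's `runtimeEnvironmentId`.

```typescript
interface RunPointer {
  id: string;
  runtimeEnvironmentId: string;
  parentTaskRunId: string | null;
}

// Hypothetical stand-in for a run-store read that routes by id format.
type FindRun = (where: { id: string; runtimeEnvironmentId: string }) => Promise<RunPointer | null>;

// Resolve the parent by id AND environment, so a stale or malformed pointer
// can never surface a run from another environment.
async function hydrateParent(run: RunPointer, findRun: FindRun): Promise<RunPointer | null> {
  if (run.parentTaskRunId === null) return null;
  return findRun({
    id: run.parentTaskRunId,
    runtimeEnvironmentId: run.runtimeEnvironmentId,
  });
}

// In-memory finder used only to exercise the sketch.
function inMemoryFindRun(rows: RunPointer[]): FindRun {
  return async (where) =>
    rows.find(
      (row) => row.id === where.id && row.runtimeEnvironmentId === where.runtimeEnvironmentId
    ) ?? null;
}
```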

When the two-table realtime shape merge returns as soon as one upstream shape yields, it aborts the other fetch and returns immediately. That promise was left without a rejection handler, so the abort could surface as an unhandled rejection on the server. Attach a no-op catch to the aborted fetch.

The two-table shape merge could leave one upstream fetch pending without a rejection handler when it aborts the race loser or rethrows from the catch block. Attach a detached no-op catch to both fetches up front so an abandoned fetch can never surface as an unhandled rejection on any path. Also document that a tag/batch subscription opens two upstream Electric connections while an org spans both run tables.
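The "attach a detached no-op catch up front" idea can be sketched in isolation like this. `Fetcher` and `raceAndAbortLoser` are hypothetical names, and the sketch races two generic abortable tasks rather than real Electric shape requests.

```typescript
type Fetcher<T> = (signal: AbortSignal) => Promise<T>;

async function raceAndAbortLoser<T>(a: Fetcher<T>, b: Fetcher<T>): Promise<T> {
  const controllerA = new AbortController();
  const controllerB = new AbortController();
  const pendingA = a(controllerA.signal);
  const pendingB = b(controllerB.signal);

  // Detached no-op handlers, attached before anything can be abandoned.
  // They do not change what Promise.race observes; they only guarantee that
  // a rejection from the abandoned fetch is always considered handled.
  pendingA.catch(() => {});
  pendingB.catch(() => {});

  try {
    return await Promise.race([pendingA, pendingB]);
  } finally {
    // Abort both; aborting the already-settled winner has no effect on its result.
    controllerA.abort();
    controllerB.abort();
  }
}
```

Attaching the handlers before the race, rather than only to the loser afterwards, also covers the path where the race itself rejects and the other fetch is left pending.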

…ectric shapes Electric realtime shapes are bound to a single table, so a task_run_v2 run was invisible to realtime subscriptions. The previous approach merged two Electric shapes per tag/batch feed under a composite cursor, which doubled Electric long-poll connections for those feeds. Electric is being retired in favor of the native realtime backend, which is table-agnostic and already observes both run tables, so that merge is throwaway. Drop the Electric dual-shape merge (revert realtimeClient to its single-table form, remove the merge module) and instead gate runTableV2 on the native backend: a run only routes to task_run_v2 when the deployment has native realtime enabled and the org's realtimeBackend flag is native. This keeps v2 runs realtime-observable without touching Electric, and the gate auto-satisfies once Electric is removed and native is the default. The idempotency pre-gate claim inherits the same gate.

Completes the Electric-merge removal: a run only routes to task_run_v2 when the deployment has native realtime enabled and the org's realtimeBackend flag is native. Electric shapes are single-table and can't observe a v2 run, so without this gate a v2 run would be realtime-invisible. shouldUseV2RunTable takes the native-realtime master switch as a parameter (kept env-free for unit tests); the trigger mint site and the idempotency pre-gate claim both pass it.

Restore the both-table Electric shape merge so tag-list and batch realtime feeds observe runs in TaskRun and task_run_v2 together, and gate the v2 run table on the runTableV2 flag alone (drop the native-realtime coupling). New runs route to task_run_v2 whenever an org has the flag on and stay visible in realtime on the existing Electric backend. Single-run feeds route to one table by id format; only tag and batch feeds fan out to both shapes under one composite continuation.

…ed window

Routes that walk the run hierarchy through a Prisma relation only see one physical table, so during a runTableV2 flag flip (a parent and child on opposite tables) they silently miss the cross-table run. This closes the reachable cases:

- cancelRun resolves child runs across both tables, so cancelling a parent cascades to a child in the other table instead of leaving it executing and holding concurrency.
- updateMetadata routes metadata.parent/root operations to the scalar parent/root id, so they reach a parent in the other table instead of falling back to the child run.
- a one-time-use token with no idempotency key now takes a cross-table claim for v2 orgs, so two presentations straddling a flip cannot each mint a run in a different table.
- the Electric shape merge reports up-to-date only when both tables are caught up, so a multi-chunk initial snapshot no longer drops the rows that arrive after the first chunk.

… mixed window A cuid parent (TaskRun) with a ksuid child (task_run_v2): cancelling the parent must cascade to the child in the other table. Fails against the old table-bound childRuns relation, passes with the cross-table findRuns lookup.

…tables An unordered take capped each run table independently and concatenated the two results, so a both-table read could silently drop one table's rows once the other filled the cap. Reject it like the existing skip and cursor guards; callers that need a bounded cross-table read pass an orderBy for the keyset merge.

The guard added in the previous commit makes that call throw rather than return a non-deterministic cap; this test asserted the removed cap behavior. The throw is covered by the guard test alongside the skip/cursor guards.
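Taken together, the both-tables guards described in these commits (skip, Prisma cursor, negative take, and take without orderBy) could be expressed along these lines. `FindRunsArgs` and `assertCrossTableReadable` are hypothetical names for this sketch, and the error messages are illustrative.

```typescript
// Loosely modelled on Prisma findMany arguments; only the fields the guards inspect.
interface FindRunsArgs {
  skip?: number;
  cursor?: unknown;
  take?: number;
  orderBy?: unknown;
}

// Throws for argument shapes that cannot be answered correctly across two tables.
function assertCrossTableReadable(args: FindRunsArgs): void {
  if (args.skip !== undefined) {
    throw new Error("findRuns: skip cannot span both run tables");
  }
  if (args.cursor !== undefined) {
    throw new Error("findRuns: a Prisma cursor cannot span both run tables");
  }
  if (args.take !== undefined && args.take < 0) {
    throw new Error("findRuns: a negative take cannot span both run tables");
  }
  if (args.take !== undefined && args.orderBy === undefined) {
    throw new Error("findRuns: take needs an orderBy for the cross-table keyset merge");
  }
}
```

A caller that needs a bounded cross-table page would pass something like `{ take: 50, orderBy: { createdAt: "desc" } }`, which passes the guard and can take the merged keyset path.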

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-06-22T17:26:39Z

+export function decodeBackfillCursor(cursor: string): { createdAt: Date; id: string } {
+  const separatorIndex = cursor.indexOf(BACKFILL_CURSOR_SEPARATOR);
+  const createdAt = separatorIndex === -1 ? new Date(NaN) : new Date(cursor.slice(0, separatorIndex));
+  const id = separatorIndex === -1 ? "" : cursor.slice(separatorIndex + 1);
+
+  if (Number.isNaN(createdAt.getTime()) || id.length === 0) {
+    throw new Error(
+      `RunsBackfillerService: malformed cursor "${cursor}" (expected "<createdAt>_<id>")`
+    );
+  }
+
+  return { createdAt, id };
+}


🚩 Backfill cursor format change is backward-incompatible with in-flight batches

The backfill cursor changed from a plain run id (lastRun.id) to a composite <createdAt>_<id> string (runsBackfiller.server.ts:110). decodeBackfillCursor at line 125 throws on a malformed cursor. If a backfill job is in progress when this code deploys, the admin worker will pass the old-format cursor (a bare id) to the new decodeBackfillCursor, which will throw because the separator _ isn't found in a cuid (cuids are [a-z0-9]{25} with no underscore). The error message is clear and the backfill can be restarted from scratch, but an in-flight backfill will fail on the first batch after deploy.
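The incompatibility can be reproduced with a small sketch. `decodeBackfillCursor` is copied from the hunk above; the `"_"` separator value is inferred from the error message, and `encodeBackfillCursor` and `tryDecode` are hypothetical helpers added for illustration.

```typescript
// Assumption: the separator is "_", as implied by the "<createdAt>_<id>" error text.
const BACKFILL_CURSOR_SEPARATOR = "_";

// Hypothetical encoder matching the composite format the decoder expects.
function encodeBackfillCursor(createdAt: Date, id: string): string {
  return `${createdAt.toISOString()}${BACKFILL_CURSOR_SEPARATOR}${id}`;
}

// Copied from the reviewed hunk.
function decodeBackfillCursor(cursor: string): { createdAt: Date; id: string } {
  const separatorIndex = cursor.indexOf(BACKFILL_CURSOR_SEPARATOR);
  const createdAt = separatorIndex === -1 ? new Date(NaN) : new Date(cursor.slice(0, separatorIndex));
  const id = separatorIndex === -1 ? "" : cursor.slice(separatorIndex + 1);

  if (Number.isNaN(createdAt.getTime()) || id.length === 0) {
    throw new Error(
      `RunsBackfillerService: malformed cursor "${cursor}" (expected "<createdAt>_<id>")`
    );
  }

  return { createdAt, id };
}

// Hypothetical helper: returns null instead of throwing, to make the failure visible.
function tryDecode(cursor: string): { createdAt: Date; id: string } | null {
  try {
    return decodeBackfillCursor(cursor);
  } catch {
    return null;
  }
}

// A new-format cursor round-trips; an old-format bare id has no separator and fails.
const roundTripped = tryDecode(encodeBackfillCursor(new Date("2026-06-22T17:26:39.000Z"), "abc123"));
const legacyCursor = tryDecode("c" + "a".repeat(24));
```

Here `roundTripped` carries the original timestamp and id, while `legacyCursor` is `null`, which corresponds to the in-flight batch failing on its first decode after deploy.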

…-token claim)

- The strengthened findRuns guard threw on GET /api/v1/runs/:runId/spans/:spanId, which pages child runs with take and no orderBy across both tables. Add a createdAt order so it takes the bounded cross-table merge (and the 50-row cap is now deterministic, most recent first) instead of throwing for every org.
- Key the one-time-use-token cross-table claim on the token alone (a reserved task slot), matching the task-independent oneTimeUseToken unique constraint, so a multi-task token cannot mint twice across the flip. Stop excluding triggerAndWait from the token claim. Always resolve a held claim on the success path (publish, else release) so it cannot leak until its TTL.

… shape merge The Electric dual-shape merge was a bridge to let the Electric backend observe v2 runs during the cutover, but Electric is short-lived and the merge taxed every tag/batch realtime feed with a second long-poll the moment it deployed. Gate the v2 run table on the native realtime backend instead (the native client is table-agnostic and observes v2 runs directly), so a run only routes to task_run_v2 once its org is on native. Remove the merge module and restore the single-table Electric proxy. The cross-table correctness work stays: a v2 run can still have a cross-table parent or child once an org flips, so the cancelRun cascade, metadata parent/root routing, the one-time-token claim, and the findRuns guard all still apply regardless of realtime backend.

…v2 orgs, add cross-table tests The idempotency-key dedup is a non-id predicate, so RunStore read BOTH run tables in parallel on every idempotency-keyed trigger, including orgs not cut over to v2 (whose runs only live in TaskRun, so the task_run_v2 query is always empty; while native realtime is off that is every org). Add an optional `tables: "legacy" | "both"` scope to findRun and pass "legacy" from the idempotency concern when the org is not on v2, keeping the trigger hot path single-table. Backfills cross-table tests the audit flagged as missing: findRun legacy-scope skips task_run_v2, and clearIdempotencyKey fans out across both tables (byPredicate hits v2; a mixed byFriendlyIds array clears both).
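A sketch of how an optional `tables` scope might keep the hot path single-table is shown below. Only the `"legacy" | "both"` union comes from the commit message; `QueryTable`, `findRunByIdempotencyKey`, and the `Run` shape are assumptions, and the v2-first preference mirrors the findFirstAcrossTables behaviour described earlier.

```typescript
type RunTable = "TaskRun" | "task_run_v2";
type TableScope = "legacy" | "both";

interface Run {
  id: string;
  idempotencyKey: string | null;
}

// Hypothetical per-table query; the real store would go through Prisma.
type QueryTable = (table: RunTable, idempotencyKey: string) => Promise<Run | null>;

async function findRunByIdempotencyKey(
  queryTable: QueryTable,
  idempotencyKey: string,
  tables: TableScope = "both"
): Promise<Run | null> {
  if (tables === "legacy") {
    // Org is not on v2: its runs only live in TaskRun, so skip task_run_v2.
    return queryTable("TaskRun", idempotencyKey);
  }
  const [legacy, v2] = await Promise.all([
    queryTable("TaskRun", idempotencyKey),
    queryTable("task_run_v2", idempotencyKey),
  ]);
  // Prefer the v2 copy when both tables match.
  return v2 ?? legacy;
}
```

The caller decides the scope from what it already knows about the org, for example passing `"legacy"` when the org is not cut over to v2.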

d-cs added 30 commits June 17, 2026 13:35

chore(run-store): scaffold @internal/run-store package

a86635c

feat(run-store): add shared types and the RunStore interface

d4c1ff4

chore(run-store): use .js extensions in index re-exports for Node16 r…

6d7abab

…esolution

feat(run-store): add NoopRunStore test double

010cf17

feat(run-store): add PostgresRunStore with createRun

72a7462

feat(run-store): implement createCancelledRun and createFailedRun

2e63223

feat(run-store): implement expiry, dequeue-lock, version, and checkpo…

f1ab6ae

…int methods

fix(run-store): align create-input types with the columns callers act…

01bbc67

…ually pass

refactor(run-engine): route run creation through RunStore

de52aaa

fix(run-store): allow optional machinePreset in recordRetryOutcome (l…

4826117

…eave-unchanged semantics)

refactor(run-engine): route attempt lifecycle, cancel, and fail write…

8650e40

…s through RunStore

refactor(run-engine): route expiry and dequeue-lock writes through Ru…

d530eb1

…nStore

fix(run-store): allow undefined maxDurationInSeconds in lockRunToWork…

4ec5aab

…er input

refactor(run-engine): route checkpoint, delayed, pending-version, and…

109c6a7

… debounce writes through RunStore

refactor(webapp): route run metadata, idempotency-key, and reschedule…

2fbdc5d

… writes through RunStore

refactor(webapp): route tag and realtime-stream appends through RunStore

1a5ccdc

fix(run-store): short-circuit expireRunsBatch on an empty runIds array

60565cf

Merge main into run-store-write-adapter

3c22b32

feat(run-store): add full-row read overload to the run store

13d5364

refactor(webapp): route presenter TaskRun reads through the run store

5683952

refactor(webapp): route API and loader TaskRun reads through the run …

126b05f

…store Relocate the route and loader TaskRun reads to the RunStore read methods, preserving the exact client per site, including the replica-resolve then writer-recheck realtime paths. Behavior-preserving.

chore(scripts): flag recover-stuck-runs raw TaskRun read for table cu…

cb12430

…tover The recovery script joins TaskRunExecutionSnapshot to TaskRun in raw SQL, so it is the one TaskRun read not routed through the run store. Add a note to revisit it at table cutover.

d-cs added 7 commits June 22, 2026 10:22

Merge branch 'main' into runstore-table-redirect

e72d9fb

d-cs added 5 commits June 22, 2026 14:20

fix(webapp): lock runTableV2 on the global flags page

eeb1079

d-cs added 2 commits June 22, 2026 14:38

d-cs added 9 commits June 22, 2026 15:08

Merge remote-tracking branch 'origin/main' into runstore-table-redirect

e44af57

devin-ai-integration Bot reviewed Jun 22, 2026

d-cs added 3 commits June 22, 2026 18:54

d-cs wants to merge 76 commits into main from runstore-table-redirect
