fix(api): load sheets sequentially at boot to avoid OOM spike #134

Merged
themightychris merged 8 commits into main from fix/boot-oom-sequential-load
Jun 26, 2026

Conversation

@themightychris

Member

Why

Cold boots OOM'd (FATAL ERROR: Reached heap limit) while building in-memory state from the full published import, the incident behind #131/#132. The cause turned out to be one line of concurrent loading, not node size or data volume.

Root cause (found via live-pod heap instrumentation)

loadInMemoryState read all eleven sheets via Promise.all. Running every queryAll() concurrently makes each sheet's transient read/decompress/parse buffers peak at the same instant. On the amd64 runtime that combined spike blew past a 1.5 GB heap — even though the retained state is only ~0.48 GB.

Per-phase logging in the pod made it unambiguous:

  • reconcile + push-daemon + private-store load: heap flat at ~60 MB (none of them was the hog)
  • the balloon was entirely inside loadInMemoryState, during the concurrent queryAll reads
  • sequential reads → boots in ~0.48 GB, people=18203 (the expected count)
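
For readers who want to reproduce this kind of per-phase measurement, here is a minimal sketch of a heap-sampling wrapper. It is not the instrumentation used in the pod: `withHeapPeak`, the log format, and the 50 ms sampling interval are hypothetical. The only real API it relies on is Node's `process.memoryUsage()`. Because sampling runs on a timer, a long synchronous parse can hide a spike between samples, so treat the reported peak as a lower bound.

```typescript
// Hypothetical helper (not from the PR): run one boot phase while sampling
// heapUsed on a timer, then report the highest value observed.
async function withHeapPeak<T>(
  phase: string,
  run: () => Promise<T>,
  sampleMs = 50, // assumed interval; tune to taste
): Promise<{ result: T; peakBytes: number }> {
  let peak = process.memoryUsage().heapUsed;
  const sample = (): void => {
    peak = Math.max(peak, process.memoryUsage().heapUsed);
  };
  const timer = setInterval(sample, sampleMs);
  try {
    const result = await run();
    sample(); // take one last reading before reporting
    return { result, peakBytes: peak };
  } finally {
    clearInterval(timer);
    const mb = (peak / 1024 / 1024).toFixed(0);
    console.log(`[boot] ${phase}: peak heapUsed ~${mb} MB (sampled)`);
  }
}
```

Wrapping each boot phase (for example the reconcile step and the state load) in a call like this would show which phase carries the spike, provided the phase yields to the event loop often enough for the timer to fire.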

What

apps/api/src/store/memory/loader.ts: replace the Promise.all over the eleven queryAll() reads with sequential awaits. Same records, same order, same retained footprint; the peak is bounded to the single largest sheet. Boot isn't latency-sensitive, so serializing the reads costs nothing in practice.
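
The shape of the change can be sketched as follows. This is an illustration, not the actual code in `loader.ts`: the `Sheet` interface, the `NamedSheet` tuple, and both function names are hypothetical, and every sheet shares one record type `T` for brevity. Only `queryAll()` and the use of `Promise.all` come from the description above.

```typescript
// Hypothetical stand-in for a sheet handle; only queryAll() is named in the PR.
interface Sheet<T> {
  queryAll(): Promise<T[]>;
}

type NamedSheet<T> = readonly [string, Sheet<T>];

// Before (sketch): every read starts immediately, so the transient
// read/decompress/parse buffers of all sheets can be live at the same time.
async function loadConcurrently<T>(
  sheets: ReadonlyArray<NamedSheet<T>>,
): Promise<Map<string, T[]>> {
  const rows = await Promise.all(sheets.map(([, sheet]) => sheet.queryAll()));
  const state = new Map<string, T[]>();
  sheets.forEach(([name], i) => state.set(name, rows[i] ?? []));
  return state;
}

// After (sketch): one read at a time. The next queryAll() does not start until
// the previous one has resolved, so at most one read is in flight.
async function loadSequentially<T>(
  sheets: ReadonlyArray<NamedSheet<T>>,
): Promise<Map<string, T[]>> {
  const state = new Map<string, T[]>();
  for (const [name, sheet] of sheets) {
    state.set(name, await sheet.queryAll());
  }
  return state;
}
```

Both functions return the same records under the same keys; the difference is only how many reads overlap, which is what bounds the transient peak.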

No spec change (the storage spec never mandated concurrent loading).

Validation

  • Live sandbox pod boots to /api/health/ready 200 on the standard node size, 0 restarts, ~0.48 GB retained.
  • type-check + lint clean; loader/store/memory tests pass (12/12).

Relationship to the other recent PRs

🤖 Generated with Claude Code

themightychris and others added 8 commits June 26, 2026 00:08
The legacy import carried ~31.8k people, ~61% judged spam by the offline
person-evaluations pass. Loading them all exceeded the in-memory heap budget
on the standard node size, and spam accounts don't belong in the public
civic-transparency dataset anyway.

Spec defines: verdict aggregation (prune iff confident spam, no legit, and no
project membership), the cascade-prune on `published`, idempotency, and the
import → merge → eval → prune → push pipeline ordering. Exclusion happens in
the data pipeline, not the runtime loader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements specs/behaviors/spam-exclusion.md — prune confident-spam people
from published so the runtime loads only real members and fits the node
memory budget without a resize.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-runnable script that reads person-evaluations verdicts from the
spam-detection branch and removes confident-spam people from published with
cascaded deletes of their memberships / help-wanted-interest / person
tag-assignments, nulling authorId on their project-updates. Project members
are protected (real involvement overrides a spam verdict). Reads the ~54k
evaluations via streaming git cat-file (not gitsheets) and applies the prune
in one gitsheets transaction. Idempotent; --dry-run reports counts.

No runtime/loader change — published simply ends up smaller.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
spam-detection.md: replace the "spam-purge is a future plan / filter on the
read path" placeholder with the built prune step (command, rule, cascade,
idempotency) and add the mandatory import → merge → eval → prune → push
ordering warning — a re-import resurrects pruned spam until prune re-runs.
cutover.md: add the prune as a required step after the legacy-import merge in
both the T-1 and T-0 sequences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A cold boot OOM'd (FATAL ERROR: Reached heap limit) building in-memory state from the
full `published` import. Per-phase heap instrumentation in the live pod traced
it to `loadInMemoryState` reading all eleven sheets via Promise.all: every
sheet's transient read/decompress/parse buffers peaked simultaneously, blowing
past a 1.5 GB heap — even though the retained state is only ~0.5 GB.

Read the sheets sequentially instead. Same records, same order, same retained
footprint; the peak is bounded to the single largest sheet. Boot is not
latency-sensitive, so the serialization costs nothing in practice. Verified:
the live pod boots to /api/health/ready in ~0.48 GB on the standard node size.

Tracked against #132 (heap footprint).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@themightychris themightychris merged commit 9c7b5fa into main Jun 26, 2026
1 check passed
@themightychris themightychris deleted the fix/boot-oom-sequential-load branch June 26, 2026 14:23