fix(api): load sheets sequentially at boot to avoid OOM spike #134
Merged
The legacy import carried ~31.8k people, ~61% judged spam by the offline person-evaluations pass. Loading them all exceeded the in-memory heap budget on the standard node size, and spam accounts don't belong in the public civic-transparency dataset anyway. Spec defines: verdict aggregation (prune iff confident spam, no legit, and no project membership), the cascade-prune on `published`, idempotency, and the import → merge → eval → prune → push pipeline ordering. Exclusion happens in the data pipeline, not the runtime loader. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements specs/behaviors/spam-exclusion.md — prune confident-spam people from published so the runtime loads only real members and fits the node memory budget without a resize. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-runnable script that reads person-evaluations verdicts from the spam-detection branch and removes confident-spam people from published with cascaded deletes of their memberships / help-wanted-interest / person tag-assignments, nulling authorId on their project-updates. Project members are protected (real involvement overrides a spam verdict). Reads the ~54k evaluations via streaming git cat-file (not gitsheets) and applies the prune in one gitsheets transaction. Idempotent; --dry-run reports counts. No runtime/loader change — published simply ends up smaller. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
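The prune rule and cascade described in the commit above can be sketched roughly as follows. This is an illustration only, not the script from this PR: the `Verdict` and `Dataset` shapes, the `CONFIDENT` threshold, and the reading of "no legit" as "no legit verdict at all" are assumptions, and the real script works through a gitsheets transaction rather than in-memory arrays.

```typescript
// Hypothetical shapes; the real person-evaluations records are not shown in this PR.
type Verdict = { label: "spam" | "legit" | "uncertain"; confidence: number };

interface Dataset {
  people: { id: string }[];
  memberships: { personId: string; projectId: string }[];
  tagAssignments: { personId: string; tag: string }[];
  projectUpdates: { id: string; authorId: string | null }[];
}

// Assumed cutoff for a "confident" verdict; the spec's actual value is not given here.
const CONFIDENT = 0.9;

// Prune iff there is a confident spam verdict, no legit verdict, and no project membership.
function shouldPrune(verdicts: Verdict[], isProjectMember: boolean): boolean {
  const confidentSpam = verdicts.some(
    (v) => v.label === "spam" && v.confidence >= CONFIDENT,
  );
  const anyLegit = verdicts.some((v) => v.label === "legit");
  return confidentSpam && !anyLegit && !isProjectMember;
}

// Returns a new dataset with pruned people removed, their dependent records
// dropped, and authorId nulled on their project updates.
function prune(data: Dataset, verdictsByPerson: Map<string, Verdict[]>): Dataset {
  const memberIds = new Set(data.memberships.map((m) => m.personId));
  const pruned = new Set(
    data.people
      .filter((p) => shouldPrune(verdictsByPerson.get(p.id) ?? [], memberIds.has(p.id)))
      .map((p) => p.id),
  );
  return {
    people: data.people.filter((p) => !pruned.has(p.id)),
    memberships: data.memberships.filter((m) => !pruned.has(m.personId)),
    tagAssignments: data.tagAssignments.filter((t) => !pruned.has(t.personId)),
    projectUpdates: data.projectUpdates.map((u) =>
      u.authorId !== null && pruned.has(u.authorId) ? { ...u, authorId: null } : u,
    ),
  };
}
```

Because the decision depends only on the verdicts and on current memberships, running `prune` a second time on its own output changes nothing, which is the idempotency property the commit message calls out.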
spam-detection.md: replace the "spam-purge is a future plan / filter on the read path" placeholder with the built prune step (command, rule, cascade, idempotency) and add the mandatory import → merge → eval → prune → push ordering warning — a re-import resurrects pruned spam until prune re-runs. cutover.md: add the prune as a required step after the legacy-import merge in both the T-1 and T-0 sequences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A cold boot OOM'd (FATAL: Reached heap limit) building in-memory state from the full `published` import. Per-phase heap instrumentation in the live pod traced it to `loadInMemoryState` reading all eleven sheets via Promise.all: every sheet's transient read/decompress/parse buffers peaked simultaneously, blowing past a 1.5 GB heap — even though the retained state is only ~0.5 GB. Read the sheets sequentially instead. Same records, same order, same retained footprint; the peak is bounded to the single largest sheet. Boot is not latency-sensitive, so the serialization costs nothing in practice. Verified: the live pod boots to /api/health/ready in ~0.48 GB on the standard node size. Tracked against #132 (heap footprint). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
Cold boots OOM'd (`FATAL ERROR: Reached heap limit`) building in-memory state from the full `published` import, the incident behind #131/#132. The fix turned out to be one line of concurrency, not node size or data volume.
Root cause (found via live-pod heap instrumentation)
`loadInMemoryState` read all eleven sheets via `Promise.all`. Running every `queryAll()` concurrently makes each sheet's transient read/decompress/parse buffers peak at the same instant. On the amd64 runtime that combined spike blew past a 1.5 GB heap, even though the retained state is only ~0.48 GB.
Per-phase logging in the pod made it unambiguous:
- `loadInMemoryState`, during the concurrent `queryAll` reads
- `people=18203` (the expected count)
What
`apps/api/src/store/memory/loader.ts`: replace the `Promise.all` over the eleven `queryAll()` reads with sequential `await`s. Same records, same order, same retained footprint; the peak is bounded to the single largest sheet. Boot isn't latency-sensitive, so serializing the reads costs nothing in practice.
No spec change (the storage spec never mandated concurrent loading).
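To illustrate why the change bounds the peak, here is a minimal, self-contained sketch of the two loading strategies. It is not the code in `loader.ts`: `SheetReader`, `makeReader`, and `InFlightGauge` are hypothetical stand-ins, and the count of reads in flight is used only as a rough proxy for overlapping transient buffers.

```typescript
// Hypothetical stand-in for a sheet's queryAll(); the real reader is not shown here.
type SheetReader = () => Promise<unknown[]>;

// Counts how many reads are in flight at once.
class InFlightGauge {
  current = 0;
  peak = 0;
  enter(): void {
    this.current += 1;
    this.peak = Math.max(this.peak, this.current);
  }
  exit(): void {
    this.current -= 1;
  }
}

function makeReader(name: string, rows: number, gauge: InFlightGauge): SheetReader {
  return async () => {
    gauge.enter();
    try {
      // Yield to the event loop to simulate read/decompress/parse latency.
      await new Promise<void>((resolve) => setTimeout(resolve, 1));
      return Array.from({ length: rows }, (_, i) => ({ sheet: name, id: i }));
    } finally {
      gauge.exit();
    }
  };
}

// Before: every read starts immediately, so all of them overlap.
async function loadConcurrently(readers: SheetReader[]): Promise<unknown[][]> {
  return Promise.all(readers.map((read) => read()));
}

// After: each read completes before the next begins.
async function loadSequentially(readers: SheetReader[]): Promise<unknown[][]> {
  const results: unknown[][] = [];
  for (const read of readers) {
    results.push(await read());
  }
  return results;
}

// Runs one strategy over `sheetCount` fake sheets and reports the overlap it caused.
async function peakInFlight(
  load: (readers: SheetReader[]) => Promise<unknown[][]>,
  sheetCount: number,
): Promise<{ peak: number; totalRows: number }> {
  const gauge = new InFlightGauge();
  const readers = Array.from({ length: sheetCount }, (_, i) =>
    makeReader(`sheet-${i}`, 3, gauge),
  );
  const results = await load(readers);
  const totalRows = results.reduce((n, rows) => n + rows.length, 0);
  return { peak: gauge.peak, totalRows };
}
```

Under these assumptions, `peakInFlight(loadConcurrently, 11)` reports a peak of 11 overlapping reads and `peakInFlight(loadSequentially, 11)` reports 1, while both return the same rows in the same order.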
Validation
- `/api/health/ready` 200 on the standard node size, 0 restarts, ~0.48 GB retained.
- `type-check` + `lint` clean; loader/store/memory tests pass (12/12).
Relationship to the other recent PRs
🤖 Generated with Claude Code