fix(api): load sheets sequentially at boot to avoid OOM spike by themightychris · Pull Request #134 · CodeForPhilly/codeforphilly-ng

themightychris · 2026-06-26T14:12:15Z

Why

Cold boots OOM'd (FATAL ERROR: Reached heap limit) building in-memory state from the full published import — the incident behind #131/#132. The fix turned out to be one line of concurrency, not node size or data volume.

Root cause (found via live-pod heap instrumentation)

loadInMemoryState read all eleven sheets via Promise.all. Running every queryAll() concurrently makes each sheet's transient read/decompress/parse buffers peak at the same instant. On the amd64 runtime that combined spike blew past a 1.5 GB heap — even though the retained state is only ~0.48 GB.

Per-phase logging in the pod made it unambiguous:

reconcile + push-daemon + private-store load: heap flat at ~60 MB (none of them was the hog)
the balloon was entirely inside loadInMemoryState, during the concurrent queryAll reads
sequential reads → boots in ~0.48 GB, people=18203 (the expected count)
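The per-phase logging described above could look roughly like the following. This is a minimal sketch, not the instrumentation actually used in the pod: `measurePhase`, `PhaseReport`, and the log format are hypothetical names. It samples `process.memoryUsage().heapUsed` on a timer so that a transient peak inside an async phase has a chance of being seen; a phase that blocks the event loop would only show its start and end values.

```typescript
// Hypothetical helper for illustration; not taken from the PR.
type PhaseReport = { phase: string; startMB: number; peakMB: number; endMB: number };

const heapMB = (): number => process.memoryUsage().heapUsed / (1024 * 1024);

async function measurePhase<T>(
  phase: string,
  run: () => Promise<T>,
  sampleMs = 25,
): Promise<{ result: T; report: PhaseReport }> {
  const startMB = heapMB();
  let peakMB = startMB;
  // Sample while the phase is awaiting I/O; the timer cannot fire during synchronous work.
  const timer = setInterval(() => {
    peakMB = Math.max(peakMB, heapMB());
  }, sampleMs);
  try {
    const result = await run();
    const endMB = heapMB();
    peakMB = Math.max(peakMB, endMB);
    const report: PhaseReport = { phase, startMB, peakMB, endMB };
    console.log(
      `[boot] ${phase}: start=${startMB.toFixed(1)} MB peak=${peakMB.toFixed(1)} MB end=${endMB.toFixed(1)} MB`,
    );
    return { result, report };
  } finally {
    clearInterval(timer);
  }
}
```

Wrapping each boot phase (reconcile, push-daemon, private-store load, in-memory load) in a call like this would give one log line per phase, which is enough to see which phase the balloon sits in.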

What

apps/api/src/store/memory/loader.ts: replace the Promise.all over the eleven queryAll() reads with sequential awaits. Same records, same order, same retained footprint; the peak is bounded to the single largest sheet. Boot isn't latency-sensitive, so serializing the reads costs nothing in practice.
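To illustrate the shape of the change (not the actual contents of loader.ts), here is a small self-contained sketch. The `Sheet` interface, `fakeSheet`, and the in-flight counter are assumptions made for the example; only `queryAll()` and `Promise.all` come from the description above. The counter stands in for "how many sheets have transient buffers alive at once".

```typescript
// Stand-in for a sheet handle; the real queryAll() lives in the storage layer.
interface Sheet<T> {
  name: string;
  queryAll(): Promise<T[]>;
}

// Tracks how many reads overlap, as a proxy for overlapping transient buffers.
let inFlight = 0;
let peakInFlight = 0;

function fakeSheet(name: string, rows: number): Sheet<number> {
  return {
    name,
    async queryAll(): Promise<number[]> {
      inFlight++;
      peakInFlight = Math.max(peakInFlight, inFlight);
      await new Promise<void>((resolve) => setTimeout(resolve, 5)); // simulated read
      inFlight--;
      return Array.from({ length: rows }, (_, i) => i);
    },
  };
}

// Before: every read starts immediately, so all transient buffers coexist.
async function loadConcurrently<T>(sheets: Sheet<T>[]): Promise<Map<string, T[]>> {
  const entries = await Promise.all(
    sheets.map(async (sheet) => [sheet.name, await sheet.queryAll()] as const),
  );
  return new Map(entries);
}

// After: one read at a time, so the peak is bounded by the largest single sheet.
async function loadSequentially<T>(sheets: Sheet<T>[]): Promise<Map<string, T[]>> {
  const state = new Map<string, T[]>();
  for (const sheet of sheets) {
    state.set(sheet.name, await sheet.queryAll());
  }
  return state;
}
```

Both functions return the same records keyed in the same order; the only difference is how many reads are alive at the same moment.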

No spec change (the storage spec never mandated concurrent loading).

Validation

Live sandbox pod boots to /api/health/ready 200 on the standard node size, 0 restarts, ~0.48 GB retained.
type-check + lint clean; loader/store/memory tests pass (12/12).

Relationship to the other recent PRs

#133 (feat(spam): prune confident-spam from published) is still a good change on its own merits, but it was not what fixed the OOM: the crash signature was identical before and after pruning, which was the tell that the cause was elsewhere.
#131 (fix(deploy): raise heap + memory limit for full published import) bumped the heap to 2048 MB and the memory limit to 2.5Gi as a mitigation; with this fix the original 1.5 GB would suffice, but the headroom is harmless and left as-is.
#132 (perf: investigate in-memory state heap footprint (~60x on-disk-to-heap expansion)) remains open as the broader lever.
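As a side note on the heap headroom mentioned above, a boot-time check of the effective V8 heap limit might look like this. It is a sketch under assumptions: `heapHeadroom` and the 480 MB figure passed to it are illustrative (the latter is the ~0.48 GB retained size quoted above, expressed in MB), and nothing like this is part of the PR. `getHeapStatistics().heap_size_limit` is the standard Node.js way to read the limit in bytes.

```typescript
import { getHeapStatistics } from "node:v8";

// Hypothetical helper: compares the effective V8 heap limit with an expected footprint.
function heapHeadroom(expectedMB: number): { limitMB: number; headroomMB: number } {
  const limitMB = getHeapStatistics().heap_size_limit / (1024 * 1024);
  return { limitMB, headroomMB: limitMB - expectedMB };
}

const { limitMB, headroomMB } = heapHeadroom(480);
console.log(`[boot] heap limit=${limitMB.toFixed(0)} MB, headroom over 480 MB=${headroomMB.toFixed(0)} MB`);
```

The reported limit reflects whatever heap setting the process was started with, so it will not necessarily equal the configured number exactly.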

🤖 Generated with Claude Code

The legacy import carried ~31.8k people, ~61% judged spam by the offline person-evaluations pass. Loading them all exceeded the in-memory heap budget on the standard node size, and spam accounts don't belong in the public civic-transparency dataset anyway. Spec defines: verdict aggregation (prune iff confident spam, no legit, and no project membership), the cascade-prune on `published`, idempotency, and the import → merge → eval → prune → push pipeline ordering. Exclusion happens in the data pipeline, not the runtime loader. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Re-runnable script that reads person-evaluations verdicts from the spam-detection branch and removes confident-spam people from published with cascaded deletes of their memberships / help-wanted-interest / person tag-assignments, nulling authorId on their project-updates. Project members are protected (real involvement overrides a spam verdict). Reads the ~54k evaluations via streaming git cat-file (not gitsheets) and applies the prune in one gitsheets transaction. Idempotent; --dry-run reports counts. No runtime/loader change — published simply ends up smaller. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
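The prune rule stated in these commit messages (prune only if there is a confident spam verdict, no legit verdict, and no project membership) can be sketched as a pure predicate. Everything about the data shape here is an assumption for illustration: the `Evaluation` fields, the verdict labels, and the function names are hypothetical and do not reflect the real person-evaluations schema or the script's code.

```typescript
// Hypothetical shapes; the real evaluation records may differ.
type Verdict = "spam" | "legit" | "unsure";

interface Evaluation {
  personId: string;
  verdict: Verdict;
  confident: boolean;
}

// Prune iff: at least one confident spam verdict, no legit verdict, no project membership.
function shouldPrune(evals: Evaluation[], isProjectMember: boolean): boolean {
  if (isProjectMember) return false; // real involvement overrides a spam verdict
  const hasLegit = evals.some((e) => e.verdict === "legit");
  const hasConfidentSpam = evals.some((e) => e.verdict === "spam" && e.confident);
  return hasConfidentSpam && !hasLegit;
}

// Groups evaluations by person and returns the ids that qualify, sorted for stable output.
function selectPrunable(evals: Evaluation[], memberIds: Set<string>): string[] {
  const byPerson = new Map<string, Evaluation[]>();
  for (const e of evals) {
    const list = byPerson.get(e.personId) ?? [];
    list.push(e);
    byPerson.set(e.personId, list);
  }
  const prunable: string[] = [];
  byPerson.forEach((list, personId) => {
    if (shouldPrune(list, memberIds.has(personId))) prunable.push(personId);
  });
  return prunable.sort();
}
```

Because the selection depends only on the verdicts and memberships, running it again after a prune selects nothing new from what remains, which matches the idempotency the script claims.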

spam-detection.md: replace the "spam-purge is a future plan / filter on the read path" placeholder with the built prune step (command, rule, cascade, idempotency) and add the mandatory import → merge → eval → prune → push ordering warning — a re-import resurrects pruned spam until prune re-runs. cutover.md: add the prune as a required step after the legacy-import merge in both the T-1 and T-0 sequences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A cold boot OOM'd (FATAL ERROR: Reached heap limit) building in-memory state from the full `published` import. Per-phase heap instrumentation in the live pod traced it to `loadInMemoryState` reading all eleven sheets via Promise.all: every sheet's transient read/decompress/parse buffers peaked simultaneously, blowing past a 1.5 GB heap even though the retained state is only ~0.48 GB. Read the sheets sequentially instead. Same records, same order, same retained footprint; the peak is bounded to the single largest sheet. Boot is not latency-sensitive, so the serialization costs nothing in practice. Verified: the live pod boots to /api/health/ready in ~0.48 GB on the standard node size. Tracked against #132 (heap footprint). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

themightychris and others added 8 commits June 26, 2026 00:08

chore(plans): open spam-prune

c6a71e9

Implements specs/behaviors/spam-exclusion.md — prune confident-spam people from published so the runtime loads only real members and fits the node memory budget without a resize. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(plans): mark spam-prune done (PR #133)

b966800

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(plans): open boot-oom-sequential-load

ef79e0c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(plans): mark boot-oom-sequential-load done (PR #134)

463f71d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

themightychris merged commit 9c7b5fa into main Jun 26, 2026
1 check passed

themightychris deleted the fix/boot-oom-sequential-load branch June 26, 2026 14:23

themightychris merged 8 commits into main from fix/boot-oom-sequential-load
