fix(deploy): raise heap + memory limit for full published import (#131)
Merged
A cold boot rebuilding in-memory state from the full `published` import (~31.8k people, ~10.4k tag-assignments, plus secondary indices) OOM'd at the previous 1536Mi V8 old-space ceiling: "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory".

The long-running pod had been serving state from an earlier, lighter boot and never had to rebuild; a rollout restart forced the first cold load of the current import and it no longer fit. The native FTS5 store is off-heap (better-sqlite3), so the V8 heap holds the record maps + indices.

Raise NODE_OPTIONS --max-old-space-size 1536 -> 3072 and the container memory limit 2Gi -> 3.5Gi (nodes are 3.9Gi), with requests 768Mi -> 1Gi.

The ~60x on-disk-to-heap expansion is suspiciously large and is tracked separately for a memory-optimization investigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Correcting this branch's first attempt. The initial 3072 heap / 3.5Gi limit restored the boot but was too large for the ~3.9Gi nodes: as the pod grew it starved the node's kubelet and drove it NodeNotReady, which cascaded into an RWO volume multi-attach deadlock and a longer outage.

The proven-safe values (live for 8h, 0 restarts): heap 2048 / container limit 2560Mi / request 1Gi. 2048 boots the full `published` import cleanly; capping the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take a node down again.

Reducing the footprint further is tracked in the memory-optimization issue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 26, 2026
Why — incident: sandbox boot OOM (and the over-correction that followed)
Deploying the home-CTA fix (#128) included a rollout restart. The Recreate strategy terminated the only replica, and the new pod crash-looped on boot with: "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory".

Root cause (pre-existing, not the CTA change): a cold boot rebuilds in-memory state from the full `published` import (~31.8k people, ~10.4k tag-assignments, 1k tags, 268 projects, plus secondary indices). That no longer fits in the prior 1536 MiB (1.5 GiB) V8 old-space. The long-running pod had been serving state from an earlier, lighter boot; the restart forced the first cold load of the current import. (FTS5 is `better-sqlite3` / off-heap, so the V8 heap holds the record maps + indices.)

Over-correction (resolved): the first fix here raised heap → 3072 / limit → 3.5Gi. On these ~3.9Gi nodes that let the pod grow until it starved a node's kubelet → NodeNotReady, cascading into an RWO volume multi-attach deadlock. Corrected to node-safe values, which have run stable for 8h.
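To see where a booting process stands against the V8 ceiling described above, a minimal sketch using Node's built-in `v8` module might look like the following. The `heapReport` helper name is hypothetical and is not part of this PR. Note that `heap_size_limit` covers the whole V8 heap, so it typically reads slightly higher than the `--max-old-space-size` value, and that `rss` covers the entire process, including native allocations that never show up in `heapUsed`.

```javascript
const v8 = require("node:v8");

const MIB = 1024 * 1024;

// Hypothetical helper (not from this PR): summarise the V8 heap ceiling
// and the current memory use of this process, in MiB.
function heapReport() {
  const stats = v8.getHeapStatistics();
  const mem = process.memoryUsage();
  return {
    // Effective V8 heap ceiling, influenced by --max-old-space-size.
    heapLimitMiB: Math.round(stats.heap_size_limit / MIB),
    // JavaScript objects currently live on the V8 heap.
    heapUsedMiB: Math.round(mem.heapUsed / MIB),
    // Resident set size of the whole process, native memory included.
    rssMiB: Math.round(mem.rss / MIB),
  };
}

console.log(heapReport());
```

Logging this once after the import finishes loading would show how close a cold boot gets to the limit; that logging point is an assumption about where such a check would be useful, not something this change adds.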
What
- `configmap.yaml`: `NODE_OPTIONS` `--max-old-space-size` 1536 → 2048
- `deployment.yaml`: container memory limit 2Gi → 2560Mi, requests 768Mi → 1Gi

2048 boots the full import cleanly; capping the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take a node down again.
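The headroom figures can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this PR; it treats the approximate ~3.9Gi node size as the total available to pods and system daemons, which is an assumption, since real allocatable memory depends on the cluster's reservations.

```javascript
const MIB_PER_GIB = 1024;

// Approximate node memory quoted in this PR (~3.9Gi), expressed in MiB.
const nodeMiB = 3.9 * MIB_PER_GIB;

// Memory left on the node if a single pod grows all the way to its limit.
function headroomGiB(limitMiB) {
  return (nodeMiB - limitMiB) / MIB_PER_GIB;
}

const firstAttempt = headroomGiB(3.5 * MIB_PER_GIB); // 3.5Gi limit
const corrected = headroomGiB(2560); // 2560Mi limit

// Space inside the container above the V8 old-space ceiling, available
// to off-heap allocations such as the native store.
const nonHeapMarginMiB = 2560 - 2048;

console.log(firstAttempt.toFixed(1), corrected.toFixed(1), nonHeapMarginMiB);
// prints "0.4 1.4 512"
```

Under these assumptions the first attempt left roughly 0.4Gi for everything else on the node, while the corrected limit leaves about 1.4Gi.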
Status
`managed-by: kustomize` being added to the selector, so `apply` failed with `spec.selector: immutable`. The Deployment was deleted + recreated, so it now matches the rendered selector and future `apply -k` is clean.

Follow-ups
🤖 Generated with Claude Code