fix(deploy): raise heap + memory limit for full published import by themightychris · Pull Request #131 · CodeForPhilly/codeforphilly-ng

themightychris · 2026-06-25T12:23:16Z

Why — incident: sandbox boot OOM (and the over-correction that followed)

Deploying the home-CTA fix (#128) included a rollout restart. The Recreate strategy terminated the only replica, and the new pod crash-looped on boot with:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Root cause (pre-existing, not the CTA change): a cold boot rebuilds in-memory state from the full published import: ~31.8k people, ~10.4k tag-assignments, ~1k tags, 268 projects, plus secondary indices. That no longer fits in the prior 1536 MiB (1.5 GiB) V8 old-space limit. The long-running pod had been serving state from an earlier, lighter boot; the restart forced the first cold load of the current import. (FTS5 lives in better-sqlite3, which is off-heap, so the V8 heap holds only the record maps + indices.)

Over-correction (resolved): the first fix here raised heap→3072 / limit→3.5Gi. On these ~3.9Gi nodes that let the pod grow until it starved a node's kubelet → NodeNotReady, cascading into an RWO volume multi-attach deadlock. Corrected to node-safe values, which have run stable for 8h.

What

configmap.yaml: NODE_OPTIONS --max-old-space-size 1536 → 2048
deployment.yaml: container memory limit 2Gi → 2560Mi, requests 768Mi → 1Gi

2048 boots the full import cleanly; capping the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take a node down again.
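
For reference, the two changes listed above might look roughly like the fragment below. This is a sketch, not the repo's actual manifests: the `metadata.name` values and the container name are placeholders, and it assumes `NODE_OPTIONS` is a plain ConfigMap data key that reaches the container as an environment variable. Only the three values (2048, 2560Mi, 1Gi) come from this PR.

```yaml
# configmap.yaml (fragment; metadata.name is a placeholder)
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config        # hypothetical name
data:
  # V8 old-space ceiling in MiB; was 1536
  NODE_OPTIONS: "--max-old-space-size=2048"
---
# deployment.yaml (fragment; only the container resources are shown)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app           # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: app           # hypothetical name
          resources:
            requests:
              memory: 1Gi     # was 768Mi
            limits:
              memory: 2560Mi  # was 2Gi
```

The container limit sits 512Mi above the old-space ceiling (2560 - 2048) because `--max-old-space-size` bounds only V8's old generation; off-heap allocations such as better-sqlite3 still count toward the container's memory.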

Status

Live deployment is already on these values (applied during the incident) — 8h stable, 0 restarts. This PR makes the repo/GitOps match what's running.
Also resolved a latent kustomize drift along the way: the live Deployment was created before managed-by: kustomize was added to the selector, so apply failed with spec.selector: immutable. The Deployment was deleted + recreated, so it now matches the rendered selector and future apply -k runs are clean.

Follow-ups

#132 (perf: investigate in-memory state heap footprint): investigate the ~60× on-disk→heap expansion. This is the real lever; 4GB nodes are a tight fit for ~2–2.5Gi RSS.
Consider a larger node type for this workload (infra decision).

🤖 Generated with Claude Code

A cold boot rebuilding in-memory state from the full `published` import (~31.8k people, ~10.4k tag-assignments, plus secondary indices) OOM'd at the previous 1536Mi V8 old-space ceiling: "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory". The long-running pod had been serving state from an earlier, lighter boot and never had to rebuild; a rollout restart forced the first cold load of the current import and it no longer fit. The native FTS5 store is off-heap (better-sqlite3), so the V8 heap holds the record maps + indices.

Raise NODE_OPTIONS --max-old-space-size 1536 -> 3072 and the container memory limit 2Gi -> 3.5Gi (nodes are 3.9Gi), with requests 768Mi -> 1Gi. The ~60x on-disk-to-heap expansion is suspiciously large and is tracked separately for a memory-optimization investigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Correcting this branch's first attempt. The initial 3072 heap / 3.5Gi limit restored the boot but was too large for the ~3.9Gi nodes: as the pod grew it starved the node's kubelet and drove it NodeNotReady, which cascaded into an RWO volume multi-attach deadlock and a longer outage.

The proven-safe values (live for 8h, 0 restarts): heap 2048 / container limit 2560Mi / request 1Gi. 2048 boots the full `published` import cleanly; capping the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take a node down again. Reducing the footprint further is tracked in the memory-optimization issue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

themightychris mentioned this pull request Jun 25, 2026:

perf: investigate in-memory state heap footprint (~60x on-disk-to-heap expansion) #132 (Open)

themightychris merged commit 19fb503 into main Jun 25, 2026
1 check passed

themightychris deleted the fix/boot-oom-heap-bump branch June 25, 2026 23:18

This was referenced Jun 26, 2026:

feat(spam): prune confident-spam from published #133 (Merged)
fix(api): load sheets sequentially at boot to avoid OOM spike #134 (Merged)

themightychris merged 2 commits into main from fix/boot-oom-heap-bump