feat(spam): prune confident-spam from published #133

Open
themightychris wants to merge 5 commits into main from feat/spam-prune

@themightychris

Member

Why

The full published import (~31.8k people, ~61% offline-flagged spam) no longer fits the in-memory heap budget on the 4 GB sandbox nodes — a cold boot OOM'd. Rather than double node cost to hold tens of thousands of spam accounts, prune confident-spam from published so the runtime loads only real members. Spam accounts also don't belong in the public, civic-transparency dataset.

Context: this came out of the boot-OOM incident (see #132). Pruning is the durable fix; the memory bump in #131 just bought headroom.

What

  • specs/behaviors/spam-exclusion.md — the contract: verdict aggregation, prune + cascade scope, idempotency, pipeline ordering.
  • apps/api/scripts/prune-spam.ts — re-runnable operator script. Reads person-evaluations verdicts from spam-detection (streaming git cat-file, not gitsheets — 54k records), aggregates per person, and cascade-prunes confident-spam from published in one gitsheets transaction.
  • plans/spam-prune.md — the plan (links perf: investigate in-memory state heap footprint (~60x on-disk-to-heap expansion) #132).
  • Docs: spam-detection.md + cutover.md updated so the reimport process always runs the prune (with the resurrection-on-reimport ordering warning).
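
To make the "streaming git cat-file" read concrete, here is a minimal sketch of parsing the framing that `git cat-file --batch` writes to stdout. It assumes the script uses `--batch` mode (the PR only says it streams `git cat-file`). `parseCatFileBatch` and `BatchEntry` are hypothetical names, not taken from `prune-spam.ts`. For brevity it parses one in-memory buffer; a real streaming reader would apply the same framing incrementally as chunks arrive.

```typescript
// Hypothetical helper: parse the stdout of `git cat-file --batch`.
// Each found object is framed as "<oid> <type> <size>\n", then <size> bytes
// of content, then one trailing "\n". A missing object is "<name> missing\n".
interface BatchEntry {
  oid: string;
  type: string;
  content: Uint8Array;
}

function parseCatFileBatch(out: Uint8Array): BatchEntry[] {
  const entries: BatchEntry[] = [];
  let pos = 0;
  while (pos < out.length) {
    // The header is plain ASCII and runs up to the next newline (0x0a).
    let nl = pos;
    while (nl < out.length && out[nl] !== 0x0a) nl++;
    if (nl >= out.length) throw new Error("truncated header");
    let header = "";
    for (let i = pos; i < nl; i++) header += String.fromCharCode(out[i]);
    pos = nl + 1;

    const parts = header.split(" ");
    if (parts.length === 2 && parts[1] === "missing") continue; // nothing to read
    if (parts.length !== 3) throw new Error(`unexpected header: ${header}`);

    const size = Number(parts[2]);
    if (!Number.isInteger(size) || size < 0) throw new Error(`bad size: ${header}`);
    if (pos + size > out.length) throw new Error("truncated content");

    entries.push({ oid: parts[0], type: parts[1], content: out.slice(pos, pos + size) });
    pos += size + 1; // skip the content and its trailing newline
  }
  return entries;
}
```

The size is a byte count, which is why the sketch works on bytes rather than decoded strings. Decoding each record's content (whatever format the evaluations are stored in) would happen after this step.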

No runtime/loader change: published simply ends up smaller.

The rule

Prune a person iff: ≥1 spam verdict at confidence ≥ 0.8, and no legit verdict at any confidence, and no project membership (real involvement overrides a spam verdict). Cascade deletes their memberships / help-wanted-interest / person tag-assignments; nulls authorId on their project-updates.
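
The rule above can be sketched as a pure function over the evaluation records. This is an illustration only: the `Evaluation` shape, the field names, and `selectPrunable` are assumptions, and the actual record layout in spam-detection may differ.

```typescript
// Hypothetical record shape; the real person-evaluations fields may differ.
interface Evaluation {
  personId: string;
  verdict: string; // "spam", "legit", or any other label
  confidence: number; // 0..1
}

const SPAM_CONFIDENCE_THRESHOLD = 0.8;

// Returns the ids to prune: at least one spam verdict at confidence >= 0.8,
// no legit verdict at any confidence, and no project membership.
function selectPrunable(evaluations: Evaluation[], memberIds: Set<string>): Set<string> {
  const confidentSpam = new Set<string>();
  const anyLegit = new Set<string>();

  for (const evaluation of evaluations) {
    if (evaluation.verdict === "legit") {
      anyLegit.add(evaluation.personId);
    } else if (
      evaluation.verdict === "spam" &&
      evaluation.confidence >= SPAM_CONFIDENCE_THRESHOLD
    ) {
      confidentSpam.add(evaluation.personId);
    }
  }

  const prunable = new Set<string>();
  confidentSpam.forEach((id) => {
    if (!anyLegit.has(id) && !memberIds.has(id)) prunable.add(id);
  });
  return prunable;
}
```

Aggregating per person before deciding means the order in which verdicts are read does not matter: a single legit verdict vetoes the prune no matter when it appears.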

Validation (on a throwaway clone)

People: 31,832 → 18,203 (pruned 13,629)
Protected by project membership: 1
Person tag-assignments removed: 1,710
Memberships / updates touched: 0 / 0
Idempotent re-run: ✅ 0 changes
Boot (pruned) heap / RSS: 459 MB / 658 MB @ a 1536 cap (full data OOM'd >2.5 GB)
type-check / lint

Spot-checked 10 pruned accounts: 9 unambiguous bulk-created commercial spam; the 1 with a real project membership is now protected by the membership clause.
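
The counts reported above map onto a cascade of roughly this shape. This is a sketch under assumptions: `PublishedData`, its field names, and `cascadePrune` are hypothetical stand-ins for the gitsheets sheets, and the real script applies its changes inside one gitsheets transaction rather than building a new in-memory object.

```typescript
// Hypothetical in-memory stand-in for the sheets on the published branch.
interface PublishedData {
  people: { id: string }[];
  memberships: { personId: string }[];
  helpWantedInterest: { personId: string }[];
  personTagAssignments: { personId: string }[];
  projectUpdates: { id: string; authorId: string | null }[];
}

interface PruneCounts {
  people: number;
  memberships: number;
  helpWantedInterest: number;
  personTagAssignments: number;
  updatesNulled: number;
}

// Pure cascade: returns the pruned dataset plus counts, leaving `data` untouched.
function cascadePrune(
  data: PublishedData,
  prunable: Set<string>,
): { next: PublishedData; counts: PruneCounts } {
  const keepRows = <T extends { personId: string }>(rows: T[]): T[] =>
    rows.filter((row) => !prunable.has(row.personId));

  // Project updates are kept; only the author reference is nulled.
  let updatesNulled = 0;
  const projectUpdates = data.projectUpdates.map((update) => {
    if (update.authorId !== null && prunable.has(update.authorId)) {
      updatesNulled++;
      return { ...update, authorId: null };
    }
    return update;
  });

  const next: PublishedData = {
    people: data.people.filter((person) => !prunable.has(person.id)),
    memberships: keepRows(data.memberships),
    helpWantedInterest: keepRows(data.helpWantedInterest),
    personTagAssignments: keepRows(data.personTagAssignments),
    projectUpdates,
  };

  const counts: PruneCounts = {
    people: data.people.length - next.people.length,
    memberships: data.memberships.length - next.memberships.length,
    helpWantedInterest: data.helpWantedInterest.length - next.helpWantedInterest.length,
    personTagAssignments: data.personTagAssignments.length - next.personTagAssignments.length,
    updatesNulled,
  };
  return { next, counts };
}
```

Because the function returns counts separately from the new state, a dry run in this sketch amounts to reporting `counts` and discarding `next`. Running it again on `next` with the same set yields all-zero counts, which is the idempotency check reported above.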

Ordering (documented, mandatory)

published is the merge target of legacy-import (full raw snapshot). A re-import/merge re-adds pruned spam, so the pipeline must always end with prune: import → merge → (re-)eval → prune → push.
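
A toy model shows why the ordering matters. It assumes the merge behaves like a set union of ids from the raw snapshot into published, which is a simplification of whatever the real merge does. `merge` and `prune` here are hypothetical stand-ins, not the actual pipeline commands.

```typescript
// Toy model: people are just ids, and a merge is a union with the raw snapshot.
function merge(published: Set<string>, legacyImport: Set<string>): Set<string> {
  const out = new Set<string>();
  published.forEach((id) => out.add(id));
  legacyImport.forEach((id) => out.add(id)); // the raw snapshot still contains spam
  return out;
}

function prune(published: Set<string>, prunable: Set<string>): Set<string> {
  const out = new Set<string>();
  published.forEach((id) => {
    if (!prunable.has(id)) out.add(id);
  });
  return out;
}

const legacyImport = new Set(["alice", "bob", "spam-1"]);
const prunable = new Set(["spam-1"]);

// import → merge → eval → prune: spam is gone.
let published = prune(merge(new Set<string>(), legacyImport), prunable);
const cleanAfterFirstRun = !published.has("spam-1");

// A later re-import/merge without a prune resurrects it.
published = merge(published, legacyImport);
const resurrected = published.has("spam-1");

// Ending the pipeline with prune restores the clean state.
published = prune(published, prunable);
const cleanAfterSecondRun = !published.has("spam-1");
```

In this model `prune` is also idempotent: applying it twice gives the same set as applying it once, so always ending the pipeline with it is safe even when nothing new was merged.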

🤖 Generated with Claude Code

themightychris and others added 5 commits June 26, 2026 00:08
The legacy import carried ~31.8k people, ~61% judged spam by the offline
person-evaluations pass. Loading them all exceeded the in-memory heap budget
on the standard node size, and spam accounts don't belong in the public
civic-transparency dataset anyway.

Spec defines: verdict aggregation (prune iff confident spam, no legit, and no
project membership), the cascade-prune on `published`, idempotency, and the
import → merge → eval → prune → push pipeline ordering. Exclusion happens in
the data pipeline, not the runtime loader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implements specs/behaviors/spam-exclusion.md — prune confident-spam people
from published so the runtime loads only real members and fits the node
memory budget without a resize.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Re-runnable script that reads person-evaluations verdicts from the
spam-detection branch and removes confident-spam people from published with
cascaded deletes of their memberships / help-wanted-interest / person
tag-assignments, nulling authorId on their project-updates. Project members
are protected (real involvement overrides a spam verdict). Reads the ~54k
evaluations via streaming git cat-file (not gitsheets) and applies the prune
in one gitsheets transaction. Idempotent; --dry-run reports counts.

No runtime/loader change — published simply ends up smaller.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

spam-detection.md: replace the "spam-purge is a future plan / filter on the
read path" placeholder with the built prune step (command, rule, cascade,
idempotency) and add the mandatory import → merge → eval → prune → push
ordering warning — a re-import resurrects pruned spam until prune re-runs.
cutover.md: add the prune as a required step after the legacy-import merge in
both the T-1 and T-0 sequences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>