From 7419bbda600c4344034153239cea20c158eacc8f Mon Sep 17 00:00:00 2001 From: Chris Alfano Date: Fri, 26 Jun 2026 00:08:56 -0400 Subject: [PATCH 1/5] docs(specs): add spam-exclusion behavior spec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The legacy import carried ~31.8k people, ~61% judged spam by the offline person-evaluations pass. Loading them all exceeded the in-memory heap budget on the standard node size, and spam accounts don't belong in the public civic-transparency dataset anyway. Spec defines: verdict aggregation (prune iff confident spam, no legit, and no project membership), the cascade-prune on `published`, idempotency, and the import → merge → eval → prune → push pipeline ordering. Exclusion happens in the data pipeline, not the runtime loader. Co-Authored-By: Claude Opus 4.8 (1M context) --- specs/behaviors/spam-exclusion.md | 95 +++++++++++++++++++++++++++++++ 1 file changed, 95 insertions(+) create mode 100644 specs/behaviors/spam-exclusion.md diff --git a/specs/behaviors/spam-exclusion.md b/specs/behaviors/spam-exclusion.md new file mode 100644 index 0000000..657e359 --- /dev/null +++ b/specs/behaviors/spam-exclusion.md @@ -0,0 +1,95 @@ +# Spam exclusion + +Status: proposed + +The legacy laddr import carried ~31.8k people, of which an offline evaluation +pass judged ~61% to be spam. The public runtime holds the full public dataset +in memory at boot (see [storage.md](./storage.md)); loading tens of thousands of +spam accounts is both a memory problem (a cold boot exceeded the heap budget on +the standard node size) and wrong on the merits — spam accounts should not +appear in the public, civic-transparency dataset at all. + +This spec defines how spam verdicts are produced, aggregated, and applied so the +**`published` branch contains only non-spam people**. Pruning happens in the data +pipeline, not at runtime: the runtime loader is spam-unaware and simply loads +whatever `published` contains. 
+
+## Where verdicts come from
+
+Spam evaluation runs offline and lands on the **`spam-detection`** branch of the
+data repo, in the **`person-evaluations`** sheet (path template
+`${personSlug}/${evaluator}` — one record per (person, evaluator)). Each record:
+
+| Field | Meaning |
+| ----- | ------- |
+| `personSlug` | the evaluated person |
+| `evaluator` | model/run id (e.g. `haiku-2026-05`) |
+| `verdict` | `"spam"` \| `"legit"` \| `"uncertain"` |
+| `confidence` | 0–1 |
+| `flags` | array of short reason tags |
+| `reasoning` | free-text justification |
+| `evaluatedAt` | ISO 8601 UTC |
+
+The evaluations stay on `spam-detection`; they are **not** merged into
+`published` (they are bulky and not runtime data). The pipeline reads them from
+`spam-detection` and applies the result to `published`.
+
+## Per-person verdict aggregation
+
+A person may have multiple evaluator records. The aggregate decision is
+deliberately **conservative — only confident spam is pruned**:
+
+> A person is **pruned as spam** iff they have at least one `spam` verdict with
+> `confidence ≥ SPAM_CONFIDENCE_THRESHOLD` (default **0.8**), no `legit`
+> verdict at any confidence, **and no `project-membership`** (real project
+> involvement overrides any spam verdict). Otherwise they are **kept** — this
+> includes `uncertain`, `legit`, low-confidence spam, anyone who is a project
+> member, and people with no evaluation.
+
+Rationale: false-positive spam removal is worse than keeping a borderline
+account, so two signals protect a person — a `legit` verdict from any evaluator
+(at any confidence), and any actual project membership (real engagement, not a
+throwaway account). `uncertain` people are kept — inactivity is not spam. In
+practice spam accounts essentially never hold a project membership, so this
+protection is nearly free while it reliably spares real contributors a
+classifier may misjudge on thin evidence (e.g. one off-topic intro message).
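The aggregation rule can be read as a small pure function. The following is a minimal sketch, assuming the record fields from the table above; `Evaluation`, `shouldPrune`, and the `hasProjectMembership` argument are hypothetical names for illustration, not identifiers from the script.

```typescript
// Hypothetical shapes for illustration; field names follow the spec table.
type Verdict = 'spam' | 'legit' | 'uncertain';

interface Evaluation {
  personSlug: string;
  evaluator: string;
  verdict: Verdict;
  confidence: number; // 0–1
}

const SPAM_CONFIDENCE_THRESHOLD = 0.8;

/**
 * Returns true iff the person is pruned under the spec's rule:
 * at least one confident `spam`, no `legit` at any confidence,
 * and no project membership.
 */
function shouldPrune(
  evaluations: readonly Evaluation[],
  hasProjectMembership: boolean,
  threshold: number = SPAM_CONFIDENCE_THRESHOLD,
): boolean {
  // Real project involvement overrides any spam verdict.
  if (hasProjectMembership) return false;
  // A single `legit` from any evaluator protects, whatever its confidence.
  if (evaluations.some((e) => e.verdict === 'legit')) return false;
  // Otherwise prune only on confident spam; no evaluations means keep.
  return evaluations.some((e) => e.verdict === 'spam' && e.confidence >= threshold);
}
```

Both protections are checked before the spam test, so a person with no evaluations, only `uncertain` verdicts, or only low-confidence spam falls through to "kept".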
+
+## The prune operation
+
+Applied to `published`, re-runnably. For each pruned person:
+
+1. Delete the `people` record.
+2. Cascade-delete records that belong to that person:
+   - `project-membership` where `personId` matches (defensive; members are never pruned)
+   - `help-wanted-interest` where `personId` matches
+   - `tag-assignment` where `taggableType = "person"` and the taggable is that person
+3. Unlink (do **not** delete) `project-update` records whose `authorId` is the
+   pruned person: set `authorId = null` so project history is preserved with an
+   unknown author. `project-buzz` is project-scoped and needs no change.
+
+The operation is **idempotent**: re-running with the same verdicts produces no
+new changes; re-running after new verdicts prunes only the newly-confident-spam.
+It coexists with runtime writes on `published` (a targeted delete of specific
+records, not a full-tree replacement like the importer).
+
+## What the runtime sees
+
+Nothing changes in the loader or read services. After a prune, `published` holds
+only kept people (everyone except unprotected confident spam), so the
+in-memory state, indices, and FTS are built over that smaller set. Dangling
+references are avoided by the cascade, so member lists, help-wanted interest, and
+person tags never point at a removed person.
+
+## Re-runnability & operations
+
+The prune is an operator step (a re-runnable script), run whenever a new
+evaluation pass lands on `spam-detection` or a re-import is merged. It is
+documented alongside the other operator-facing scripts. Counts (evaluated,
+pruned, cascade deletions, authors unlinked) are reported each run.
+
+## Open questions
+
+- `SPAM_CONFIDENCE_THRESHOLD` default (0.8) — tune against the verdict
+  distribution once we see false-positive/negative rates.
+- Whether to later surface an admin view of pruned accounts (auditability) —
+  out of scope here; the evaluations remain on `spam-detection` as the record.
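The cascade and its idempotency can be illustrated on plain in-memory arrays. This is a sketch under stated assumptions: `Dataset`, `dropWhere`, and `prune` are hypothetical names, and the real operation works on gitsheets sheets inside a transaction rather than on arrays.

```typescript
// Hypothetical in-memory stand-ins for the sheets named in the spec.
interface Dataset {
  people: { id: string; slug: string }[];
  memberships: { personId: string; projectId: string }[];
  interests: { personId: string; roleId: string }[];
  tagAssignments: { taggableType: string; taggableId: string; tagId: string }[];
  updates: { id: string; authorId: string | null }[];
}

interface PruneCounts {
  peopleDeleted: number;
  cascadeDeleted: number;
  authorsUnlinked: number;
}

/** Returns the rows to keep and how many were dropped. */
function dropWhere<T>(rows: readonly T[], shouldDrop: (row: T) => boolean): [T[], number] {
  const kept = rows.filter((row) => !shouldDrop(row));
  return [kept, rows.length - kept.length];
}

/** Applies the prune in place and reports what changed. */
function prune(data: Dataset, pruneIds: ReadonlySet<string>): PruneCounts {
  const [people, peopleDeleted] = dropWhere(data.people, (p) => pruneIds.has(p.id));
  const [memberships, m] = dropWhere(data.memberships, (r) => pruneIds.has(r.personId));
  const [interests, i] = dropWhere(data.interests, (r) => pruneIds.has(r.personId));
  // Only person-typed tag assignments cascade; other taggable types are untouched.
  const [tagAssignments, t] = dropWhere(
    data.tagAssignments,
    (r) => r.taggableType === 'person' && pruneIds.has(r.taggableId),
  );

  // Updates are unlinked, not deleted, so project history survives.
  let authorsUnlinked = 0;
  const updates = data.updates.map((u) => {
    if (u.authorId !== null && pruneIds.has(u.authorId)) {
      authorsUnlinked++;
      return { ...u, authorId: null };
    }
    return u;
  });

  Object.assign(data, { people, memberships, interests, tagAssignments, updates });
  return { peopleDeleted, cascadeDeleted: m + i + t, authorsUnlinked };
}
```

Running `prune` a second time with the same `pruneIds` finds nothing left to match, so every count is zero, which is the idempotency property described above.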
From c6a71e9842640ff53668c93dd36e073ea4f948cb Mon Sep 17 00:00:00 2001 From: Chris Alfano Date: Fri, 26 Jun 2026 00:08:56 -0400 Subject: [PATCH 2/5] chore(plans): open spam-prune MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements specs/behaviors/spam-exclusion.md — prune confident-spam people from published so the runtime loads only real members and fits the node memory budget without a resize. Co-Authored-By: Claude Opus 4.8 (1M context) --- plans/spam-prune.md | 70 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 plans/spam-prune.md diff --git a/plans/spam-prune.md b/plans/spam-prune.md new file mode 100644 index 0000000..4f40692 --- /dev/null +++ b/plans/spam-prune.md @@ -0,0 +1,70 @@ +--- +status: in-progress +depends: [] +specs: + - specs/behaviors/spam-exclusion.md +issues: + - 132 +pr: +--- + +# Plan: prune confident-spam people from published + +## Scope + +A cold boot rebuilding in-memory state from the full `published` import (31,832 +people, ~61% flagged spam by the offline `person-evaluations` pass) exceeds the +heap budget on the 4 GB sandbox nodes. Rather than double node cost, prune the +confident-spam people from `published` so the runtime loads only real members. + +What ships: + +- **`apps/api/scripts/prune-spam.ts`** — re-runnable operator script that reads + `person-evaluations` verdicts from `spam-detection`, applies the aggregation + rule from the spec, and cascade-prunes confident-spam people from `published`. +- No app/loader change — `published` simply ends up smaller. + +## Implements + +- [spam-exclusion.md](../specs/behaviors/spam-exclusion.md) — verdict + aggregation rule, prune + cascade scope, idempotency. + +## Approach + +1. **Read verdicts** from `spam-detection`'s `person-evaluations` sheet + efficiently (git, read-only — 54k records; avoid loading them into gitsheets + memory). 
+   Aggregate per person: prune iff ≥1 `spam` verdict at
+   confidence ≥ `--threshold` (default 0.8) AND no `legit` verdict AND no project membership.
+2. **Prune on `published`** via one gitsheets transaction (mirroring
+   `import-laddr/importer.ts`): `store.people.delete` each spam person; cascade
+   `project-membership` / `help-wanted-interest` / person `tag-assignment`
+   deletes; `patch` `project-update.authorId → null`. Idempotent.
+3. **`--dry-run`** reports counts + sample without committing.
+4. CLI mirrors the importer: `--data-repo`/`$CFP_DATA_REPO_PATH`,
+   `--evaluations-ref` (default `spam-detection`), `--branch` (default
+   `published`), `--threshold`, `--dry-run`, `--verbose`.
+
+## Validation
+
+- [ ] Dry-run on a fresh clone reports: people before/after (~31.8k → ~12k),
+  cascade deletion counts, authors-unlinked count.
+- [ ] **Spot-check**: sample pruned people + their cascaded records look like
+  spam (empty/throwaway), not real members — a real-looking cascade is a
+  signal the threshold/rule is too aggressive.
+- [ ] Re-running is idempotent (second run = no changes).
+- [ ] After applying to a clone, a local API boot loads the pruned set under
+  ~1.5 GB heap (fits the current nodes).
+- [ ] `npm run type-check && npm run lint` clean.
+
+## Risks
+
+- False-positive pruning of real members — mitigated by the conservative rule
+  (any `legit` verdict or project membership protects) and the spot-check gate.
+  Originals remain on `legacy-import`; evaluations remain on `spam-detection`;
+  re-import recovers anyone wrongly removed.
+- Large single transaction (≈19.5k deletes) — offline script, run with ample
+  heap; chunk if needed.
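If the single transaction turns out to be too heavy, one option is to split the prune list into fixed-size batches and commit one transaction per batch. The helper below is a minimal sketch; `chunk` and any batch size are hypothetical and not part of the script, and batching gives up the single atomic commit in exchange for lower peak memory.

```typescript
/**
 * Split a list into consecutive batches of at most `size` items.
 * Hypothetical helper for illustration only.
 */
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Because the prune is specified as idempotent, a run interrupted between batches could be restarted with the same verdicts and would only process what is still present.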
+ +## Notes + +## Follow-ups From 04778783e636d83bece57d5c96f2526777ad04d6 Mon Sep 17 00:00:00 2001 From: Chris Alfano Date: Fri, 26 Jun 2026 00:09:09 -0400 Subject: [PATCH 3/5] feat(api): prune-spam operator script MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Re-runnable script that reads person-evaluations verdicts from the spam-detection branch and removes confident-spam people from published with cascaded deletes of their memberships / help-wanted-interest / person tag-assignments, nulling authorId on their project-updates. Project members are protected (real involvement overrides a spam verdict). Reads the ~54k evaluations via streaming git cat-file (not gitsheets) and applies the prune in one gitsheets transaction. Idempotent; --dry-run reports counts. No runtime/loader change — published simply ends up smaller. Co-Authored-By: Claude Opus 4.8 (1M context) --- apps/api/package.json | 3 +- apps/api/scripts/prune-spam.ts | 626 +++++++++++++++++++++++++++++++++ 2 files changed, 628 insertions(+), 1 deletion(-) create mode 100644 apps/api/scripts/prune-spam.ts diff --git a/apps/api/package.json b/apps/api/package.json index f5f75cd..0859033 100644 --- a/apps/api/package.json +++ b/apps/api/package.json @@ -16,7 +16,8 @@ "script:import-laddr-credentials": "tsx scripts/import-laddr-credentials.ts", "script:reconcile": "tsx scripts/reconcile.ts", "script:cutover-dry-run": "tsx scripts/cutover-dry-run.ts", - "script:cutover-mailout": "tsx scripts/cutover-mailout.ts" + "script:cutover-mailout": "tsx scripts/cutover-mailout.ts", + "script:prune-spam": "tsx scripts/prune-spam.ts" }, "dependencies": { "@aws-sdk/client-s3": "^3.1048.0", diff --git a/apps/api/scripts/prune-spam.ts b/apps/api/scripts/prune-spam.ts new file mode 100644 index 0000000..dbacc9e --- /dev/null +++ b/apps/api/scripts/prune-spam.ts @@ -0,0 +1,626 @@ +/** + * prune-spam.ts — Re-runnable spam-prune operator script. 
+ * + * Reads spam verdicts from the `spam-detection` branch of the data repo, + * aggregates them per the spec rule, and removes confident-spam people from + * the `published` branch with cascaded deletes of their associated records. + * + * Spec: specs/behaviors/spam-exclusion.md + * + * Usage: + * npm run -w apps/api script:prune-spam -- \ + * --data-repo=/path/to/codeforphilly-data \ + * [--evaluations-ref=spam-detection] \ + * [--branch=published] \ + * [--threshold=0.8] \ + * [--dry-run] [--verbose] + * + * --data-repo Path to a local bare clone of the data repo. + * Falls back to $CFP_DATA_REPO_PATH. + * --evaluations-ref Ref to read person-evaluations from (default: spam-detection). + * --branch Branch to prune (default: published). + * --threshold Spam confidence threshold (default: 0.8). + * --dry-run Report without writing. + * --verbose Increase logging verbosity. + */ +import { execFile } from 'node:child_process'; +import { resolve } from 'node:path'; +import { promisify } from 'node:util'; + +const exec = promisify(execFile); + +import { openPublicStore } from '../src/store/public.js'; +import type { + HelpWantedInterestExpression, + ProjectMembership, + ProjectUpdate, + TagAssignment, +} from '@cfp/shared/schemas'; + +// --------------------------------------------------------------------------- +// CLI argument parsing +// --------------------------------------------------------------------------- + +interface CliArgs { + readonly dataRepo: string; + readonly evaluationsRef: string; + readonly branch: string; + readonly threshold: number; + readonly dryRun: boolean; + readonly verbose: boolean; +} + +function parseArgs(argv: readonly string[]): CliArgs { + const opts: Record = {}; + for (const a of argv) { + if (!a.startsWith('--')) continue; + const eq = a.indexOf('='); + if (eq === -1) opts[a.slice(2)] = true; + else opts[a.slice(2, eq)] = a.slice(eq + 1); + } + + const envRepo = process.env['CFP_DATA_REPO_PATH']; + const dataRepoRaw = + typeof 
opts['data-repo'] === 'string' && opts['data-repo'] !== '' + ? opts['data-repo'] + : envRepo; + if (!dataRepoRaw) { + process.stderr.write('missing --data-repo= (or set CFP_DATA_REPO_PATH)\n'); + process.exit(2); + } + + const thresholdRaw = opts['threshold']; + const threshold = + typeof thresholdRaw === 'string' ? Number.parseFloat(thresholdRaw) : 0.8; + + return { + dataRepo: resolve(dataRepoRaw), + evaluationsRef: + typeof opts['evaluations-ref'] === 'string' && opts['evaluations-ref'] !== '' + ? opts['evaluations-ref'] + : 'spam-detection', + branch: + typeof opts['branch'] === 'string' && opts['branch'] !== '' + ? opts['branch'] + : 'published', + threshold: Number.isFinite(threshold) ? threshold : 0.8, + dryRun: opts['dry-run'] === true, + verbose: opts['verbose'] === true, + }; +} + +// --------------------------------------------------------------------------- +// Verdict aggregation +// --------------------------------------------------------------------------- + +interface PersonVerdict { + /** Whether any evaluator gave spam confidence >= threshold. */ + hasConfidentSpam: boolean; + /** Whether any evaluator gave a legit verdict at any confidence. */ + hasAnyLegit: boolean; +} + +/** + * Parse verdict and confidence from TOML content using line-regex + * (tolerant, avoids pulling in a full TOML parser just for two fields). + */ +function parseEvaluationRecord(tomlContent: string): { + verdict: string | null; + confidence: number | null; +} { + let verdict: string | null = null; + let confidence: number | null = null; + + for (const line of tomlContent.split('\n')) { + const trimmed = line.trim(); + const verdictMatch = trimmed.match(/^verdict\s*=\s*"([^"]+)"/); + if (verdictMatch) { + verdict = verdictMatch[1] ?? null; + continue; + } + const confidenceMatch = trimmed.match(/^confidence\s*=\s*([0-9.]+)/); + if (confidenceMatch) { + const parsed = Number.parseFloat(confidenceMatch[1] ?? 
''); + if (Number.isFinite(parsed)) confidence = parsed; + } + } + + return { verdict, confidence }; +} + +/** + * Read all person-evaluations from the given ref via `git cat-file` bulk read. + * Does NOT go through gitsheets — there are ~54k records and we want a + * streaming git-native read. Returns a Map from personSlug → PersonVerdict. + */ +async function aggregateVerdicts( + repo: string, + evaluationsRef: string, + threshold: number, + log: (msg: string) => void, +): Promise> { + log(`[prune-spam] listing person-evaluations under ref=${evaluationsRef}`); + + // List all blobs under person-evaluations/ in the evaluations ref. + const lsOutput = await exec( + 'git', + ['ls-tree', '-r', '--format=%(objectname) %(path)', evaluationsRef, 'person-evaluations/'], + { cwd: repo, maxBuffer: 64 * 1024 * 1024 }, + ); + + const lines = lsOutput.stdout.trim().split('\n').filter((l) => l.length > 0); + log(`[prune-spam] found ${lines.length} evaluation records`); + + if (lines.length === 0) { + return new Map(); + } + + // Build a batch-check-mailbox input: one object hash per line. + // We'll use git cat-file --batch to stream all blob contents. + const hashPathPairs: Array<{ hash: string; path: string }> = []; + for (const line of lines) { + const spaceIdx = line.indexOf(' '); + if (spaceIdx === -1) continue; + const hash = line.slice(0, spaceIdx).trim(); + const path = line.slice(spaceIdx + 1).trim(); + if (hash && path.endsWith('.toml')) { + hashPathPairs.push({ hash, path }); + } + } + + log(`[prune-spam] streaming ${hashPathPairs.length} blobs via git cat-file`); + + // Stream all blobs. We pass hashes on stdin, get " blob \n\n" back. + // Use child_process.spawn for streaming instead of execFile (fits in memory for this size). 
+  const { spawn } = await import('node:child_process');
+
+  const verdictMap = new Map<string, PersonVerdict>();
+
+  await new Promise<void>((resolvePromise, reject) => {
+    const catFile = spawn('git', ['cat-file', '--batch'], { cwd: repo });
+
+    // Accumulate raw bytes rather than decoded strings: the size in each
+    // cat-file header is a byte count, and a chunk boundary can fall inside a
+    // multi-byte UTF-8 character.
+    let buffer: Buffer = Buffer.alloc(0);
+    let current: { path: string; size: number } | null = null;
+    // cat-file --batch answers in request order, so the n-th response belongs
+    // to hashPathPairs[n] (also correct when two paths share one blob).
+    let responseIdx = 0;
+
+    // Write all hashes at once; the output is streamed back.
+    for (const pair of hashPathPairs) {
+      catFile.stdin.write(pair.hash + '\n');
+    }
+    catFile.stdin.end();
+
+    catFile.stdout.on('data', (chunk: Buffer) => {
+      buffer = Buffer.concat([buffer, chunk]);
+
+      // Process as many complete records as possible from buffer.
+      while (buffer.length > 0) {
+        if (current === null) {
+          // Look for a header line: "<hash> blob <size>\n"
+          const nlIdx = buffer.indexOf(0x0a);
+          if (nlIdx === -1) break; // incomplete header
+          const header = buffer.subarray(0, nlIdx).toString('utf8');
+          buffer = buffer.subarray(nlIdx + 1);
+
+          const pairEntry = hashPathPairs[responseIdx++];
+          const parts = header.trim().split(' ');
+          const size = Number.parseInt(parts[2] ?? '', 10);
+          // e.g. "<hash> missing" carries no body; skip it.
+          if (!pairEntry || parts.length < 3 || !Number.isFinite(size)) continue;
+
+          current = { path: pairEntry.path, size };
+        }
+
+        // We need `size` bytes of content + 1 trailing newline.
+        const needed = current.size + 1;
+        if (buffer.length < needed) break; // not enough yet
+
+        const tomlContent = buffer.subarray(0, current.size).toString('utf8');
+        buffer = buffer.subarray(needed);
+
+        // path is like: person-evaluations/<personSlug>/<evaluator>.toml
+        const personSlug = current.path.split('/')[1];
+        if (personSlug) {
+          const { verdict, confidence } = parseEvaluationRecord(tomlContent);
+          if (verdict !== null && confidence !== null) {
+            let entry = verdictMap.get(personSlug);
+            if (!entry) {
+              entry = { hasConfidentSpam: false, hasAnyLegit: false };
+              verdictMap.set(personSlug, entry);
+            }
+            if (verdict === 'spam' && confidence >= threshold) {
+              entry.hasConfidentSpam = true;
+            }
+            if (verdict === 'legit') {
+              entry.hasAnyLegit = true;
+            }
+          }
+        }
+
+        current = null;
+      }
+    });
+
+    catFile.stderr.on('data', (d: Buffer) => {
+      process.stderr.write(`[git cat-file stderr] ${d.toString()}`);
+    });
+
+    // Surface spawn failures (e.g. git not on PATH) instead of hanging.
+    catFile.on('error', reject);
+
+    catFile.on('close', (code) => {
+      if (code !== 0) {
+        reject(new Error(`git cat-file exited with code ${code}`));
+      } else {
+        resolvePromise();
+      }
+    });
+  });
+
+  return verdictMap;
+}
+
+/**
+ * Compute the set of verdict-flagged candidate slugs:
+ *   candidate iff hasConfidentSpam AND NOT hasAnyLegit.
+ * Project members are filtered out afterwards by partitionCandidates.
+ */
+function computePruneSet(verdictMap: Map<string, PersonVerdict>): Set<string> {
+  const pruneSet = new Set<string>();
+  for (const [slug, v] of verdictMap) {
+    if (v.hasConfidentSpam && !v.hasAnyLegit) {
+      pruneSet.add(slug);
+    }
+  }
+  return pruneSet;
+}
+
+/** Minimal read surface shared by the live store and an open transaction. */
+interface Queryable {
+  people: { query(): AsyncIterable<unknown> };
+  'project-memberships': { query(): AsyncIterable<unknown> };
+}
+
+interface CandidatePerson {
+  readonly record: unknown;
+  readonly id: string;
+  readonly slug: string;
+}
+
+/**
+ * Partition verdict-based candidates into those to prune vs. those protected
+ * by an existing project membership. Real project involvement overrides a spam
+ * verdict (see specs/behaviors/spam-exclusion.md). Scans people once and
+ * project-memberships once against the given queryable.
+ */
+async function partitionCandidates(
+  q: Queryable,
+  candidateSlugs: Set<string>,
+): Promise<{ prune: CandidatePerson[]; protectedByMembership: number }> {
+  const candidates: CandidatePerson[] = [];
+  for await (const person of q.people.query()) {
+    const p = person as Record<string, unknown>;
+    const slug = p['slug'];
+    const id = p['id'];
+    if (typeof slug === 'string' && typeof id === 'string' && candidateSlugs.has(slug)) {
+      candidates.push({ record: person, id, slug });
+    }
+  }
+
+  const memberPersonIds = new Set<string>();
+  for await (const mem of q['project-memberships'].query()) {
+    const pid = (mem as Record<string, unknown>)['personId'];
+    if (typeof pid === 'string') memberPersonIds.add(pid);
+  }
+
+  const prune = candidates.filter((c) => !memberPersonIds.has(c.id));
+  return { prune, protectedByMembership: candidates.length - prune.length };
+}
+
+// ---------------------------------------------------------------------------
+// Prune summary
+// ---------------------------------------------------------------------------
+
+export interface PruneSummary {
+  readonly peopleBefore: number;
+  readonly prunedPeople: number;
+  readonly peopleAfter: number;
+  readonly
protectedByMembership: number; + readonly membershipsDeleted: number; + readonly helpWantedInterestDeleted: number; + readonly personTagAssignmentsDeleted: number; + readonly projectUpdatesAuthorNulled: number; + readonly commitHash: string | null; + readonly noChanges: boolean; +} + +// --------------------------------------------------------------------------- +// Git helpers (mirroring import-laddr/importer.ts) +// --------------------------------------------------------------------------- + +async function ensureBranchCheckedOut(repo: string, branch: string): Promise { + // Ensure the local branch ref exists (it should for 'published' in a fresh clone) + try { + await exec('git', ['rev-parse', '--verify', `refs/heads/${branch}`], { cwd: repo }); + } catch { + // Try to create from origin/ + try { + const result = await exec( + 'git', + ['rev-parse', '--verify', `refs/remotes/origin/${branch}`], + { cwd: repo }, + ); + const parentCommit = result.stdout.trim(); + await exec('git', ['update-ref', `refs/heads/${branch}`, parentCommit], { cwd: repo }); + } catch { + throw new Error(`[prune-spam] branch '${branch}' not found in ${repo}`); + } + } + + // Point HEAD at the branch + await exec('git', ['symbolic-ref', 'HEAD', `refs/heads/${branch}`], { cwd: repo }); +} + +// --------------------------------------------------------------------------- +// Core prune logic +// --------------------------------------------------------------------------- + +const AUTHOR_NAME = 'Code for Philly API'; +const AUTHOR_EMAIL = 'api@users.noreply.codeforphilly.org'; + +async function pruneSpam(args: CliArgs): Promise { + const log = args.verbose + ? (msg: string) => console.log(msg) + : (): void => {}; + + // ------------------------------------------------------------------------- + // 1. 
Read verdicts from evaluations ref (efficient git read) + // ------------------------------------------------------------------------- + log(`[prune-spam] reading verdicts from ref=${args.evaluationsRef}, threshold=${args.threshold}`); + const verdictMap = await aggregateVerdicts( + args.dataRepo, + args.evaluationsRef, + args.threshold, + log, + ); + + const pruneSet = computePruneSet(verdictMap); + log( + `[prune-spam] evaluated=${verdictMap.size} persons, pruneSet=${pruneSet.size} (confident spam with no legit)`, + ); + + // ------------------------------------------------------------------------- + // 2. Open the published store + // ------------------------------------------------------------------------- + log(`[prune-spam] switching ${args.dataRepo} HEAD → refs/heads/${args.branch}`); + await ensureBranchCheckedOut(args.dataRepo, args.branch); + + const { store } = await openPublicStore(args.dataRepo); + + // ------------------------------------------------------------------------- + // 3. Count people before pruning + // ------------------------------------------------------------------------- + let peopleBefore = 0; + // eslint-disable-next-line @typescript-eslint/no-unused-vars + for await (const _p of store.people.query()) { + peopleBefore++; + } + log(`[prune-spam] peopleBefore=${peopleBefore}`); + + if (args.dryRun) { + // Apply the same partition (verdict candidates minus project members) + // the real run uses, so the count is accurate. 
+ const { prune, protectedByMembership } = await partitionCandidates( + store as unknown as Queryable, + pruneSet, + ); + console.log( + `[prune-spam] dry-run: would prune ${prune.length} (of ${pruneSet.size} verdict-flagged slugs); ${protectedByMembership} protected by project membership`, + ); + return { + peopleBefore, + prunedPeople: prune.length, + peopleAfter: peopleBefore - prune.length, + protectedByMembership, + membershipsDeleted: 0, + helpWantedInterestDeleted: 0, + personTagAssignmentsDeleted: 0, + projectUpdatesAuthorNulled: 0, + commitHash: null, + noChanges: true, + }; + } + + // ------------------------------------------------------------------------- + // 4. One atomic transaction: delete people + cascade + // ------------------------------------------------------------------------- + const runAt = new Date().toISOString(); + + let prunedPeople = 0; + let protectedByMembership = 0; + let membershipsDeleted = 0; + let helpWantedInterestDeleted = 0; + let personTagAssignmentsDeleted = 0; + let projectUpdatesAuthorNulled = 0; + + const result = await store.transact( + { + message: `prune: remove confident-spam people from published (${runAt})\n\nThreshold: ${args.threshold}, pruneSet: ${pruneSet.size} slugs evaluated.\n`, + author: { name: AUTHOR_NAME, email: AUTHOR_EMAIL }, + trailers: { + Action: 'prune.spam', + 'Evaluations-Ref': args.evaluationsRef, + 'Threshold': String(args.threshold), + 'Run-At': runAt, + }, + }, + async (tx) => { + // --- Step A: Partition verdict candidates; protect project members --- + // Real project involvement overrides a spam verdict, so candidates who + // hold a project-membership are kept (see the spec). 
+ log(`[prune-spam] partitioning candidates (protecting project members)`); + const { prune, protectedByMembership: protectedCount } = await partitionCandidates( + tx as unknown as Queryable, + pruneSet, + ); + protectedByMembership = protectedCount; + + const prunedIds = new Set(prune.map((c) => c.id)); + for (const c of prune) { + await tx.people.delete(c.record as Parameters[0]); + prunedPeople++; + log(`[prune-spam] deleted person: slug=${c.slug} id=${c.id}`); + } + log(`[prune-spam] prunedPeople=${prunedPeople} protectedByMembership=${protectedByMembership}`); + + if (prunedPeople === 0) { + log(`[prune-spam] nothing to prune (all spam persons already absent or protected)`); + return; + } + + // --- Step B: Cascade-delete project-memberships --- + // By construction members are protected from pruning, so this is + // normally 0; kept as a defensive sweep for any stale membership. + log(`[prune-spam] scanning project-memberships for cascade deletes`); + for await (const mem of tx['project-memberships'].query()) { + const m = mem as unknown as ProjectMembership; + if (prunedIds.has(m.personId)) { + await tx['project-memberships'].delete(mem as unknown as ProjectMembership); + membershipsDeleted++; + } + } + log(`[prune-spam] membershipsDeleted=${membershipsDeleted}`); + + // --- Step C: Cascade-delete help-wanted-interest (path: roleId/personId) --- + log(`[prune-spam] scanning help-wanted-interest for cascade deletes`); + for await (const interest of tx['help-wanted-interest'].query()) { + const hw = interest as unknown as HelpWantedInterestExpression; + if (prunedIds.has(hw.personId)) { + await tx['help-wanted-interest'].delete(interest as unknown as HelpWantedInterestExpression); + helpWantedInterestDeleted++; + } + } + log(`[prune-spam] helpWantedInterestDeleted=${helpWantedInterestDeleted}`); + + // --- Step D: Cascade-delete person tag-assignments (path: taggableType/taggableId/tagId) --- + log(`[prune-spam] scanning tag-assignments for cascade deletes 
(person type only)`); + for await (const ta of tx['tag-assignments'].query()) { + const t = ta as unknown as TagAssignment; + if (t.taggableType === 'person' && prunedIds.has(t.taggableId)) { + await tx['tag-assignments'].delete(ta as unknown as TagAssignment); + personTagAssignmentsDeleted++; + } + } + log(`[prune-spam] personTagAssignmentsDeleted=${personTagAssignmentsDeleted}`); + + // --- Step E: Null authorId on project-updates authored by pruned people --- + log(`[prune-spam] scanning project-updates for authorId nulling`); + for await (const update of tx['project-updates'].query()) { + const u = update as unknown as ProjectUpdate; + if (u.authorId !== null && u.authorId !== undefined && prunedIds.has(u.authorId)) { + // patch applies a JSON Merge Patch — sets authorId to null + await tx['project-updates'].patch( + { id: u.id } as Record, + { authorId: null } as Partial, + ); + projectUpdatesAuthorNulled++; + } + } + log(`[prune-spam] projectUpdatesAuthorNulled=${projectUpdatesAuthorNulled}`); + }, + ); + + const peopleAfter = peopleBefore - prunedPeople; + + return { + peopleBefore, + prunedPeople, + peopleAfter, + protectedByMembership, + membershipsDeleted, + helpWantedInterestDeleted, + personTagAssignmentsDeleted, + projectUpdatesAuthorNulled, + commitHash: result.commitHash, + noChanges: result.commitHash === null, + }; +} + +// --------------------------------------------------------------------------- +// Entry point +// --------------------------------------------------------------------------- + +async function main(): Promise { + const args = parseArgs(process.argv.slice(2)); + + console.log(`[prune-spam] data-repo=${args.dataRepo}`); + console.log(`[prune-spam] evaluations-ref=${args.evaluationsRef}`); + console.log(`[prune-spam] branch=${args.branch}`); + console.log(`[prune-spam] threshold=${args.threshold}`); + console.log(`[prune-spam] dry-run=${args.dryRun} verbose=${args.verbose}`); + + const summary = await pruneSpam(args); + 
printSummary(summary, args); +} + +function printSummary(summary: PruneSummary, args: CliArgs): void { + const lines: string[] = []; + lines.push('\n=== prune-spam report ==='); + lines.push(`peopleBefore: ${summary.peopleBefore}`); + lines.push(`prunedPeople: ${summary.prunedPeople}`); + lines.push(`peopleAfter: ${summary.peopleAfter}`); + lines.push(`protectedByMembership: ${summary.protectedByMembership}`); + lines.push(`membershipsDeleted: ${summary.membershipsDeleted}`); + lines.push(`helpWantedInterestDeleted: ${summary.helpWantedInterestDeleted}`); + lines.push(`personTagAssignmentsDeleted: ${summary.personTagAssignmentsDeleted}`); + lines.push(`projectUpdatesAuthorNulled: ${summary.projectUpdatesAuthorNulled}`); + if (args.dryRun) { + lines.push(`(dry-run: no writes performed)`); + } else if (summary.noChanges) { + lines.push(`(no changes — branch unchanged, idempotent)`); + } else if (summary.commitHash) { + lines.push(`commit: ${summary.commitHash} on ${args.branch}`); + } + console.log(lines.join('\n')); +} + +const isMain = + process.argv[1] !== undefined && + import.meta.url.endsWith(process.argv[1].replace(/\\/g, '/')); + +if (isMain) { + main().catch((err: unknown) => { + console.error('[prune-spam] failed:', err); + process.exit(1); + }); +} From bb231fa0e63676d4c38fcefb724875c2d09a1049 Mon Sep 17 00:00:00 2001 From: Chris Alfano Date: Fri, 26 Jun 2026 00:09:09 -0400 Subject: [PATCH 4/5] docs(ops): document the prune step in the reimport process MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit spam-detection.md: replace the "spam-purge is a future plan / filter on the read path" placeholder with the built prune step (command, rule, cascade, idempotency) and add the mandatory import → merge → eval → prune → push ordering warning — a re-import resurrects pruned spam until prune re-runs. cutover.md: add the prune as a required step after the legacy-import merge in both the T-1 and T-0 sequences. 

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/operations/cutover.md        | 28 +++++++++++++++++++++----
 docs/operations/spam-detection.md | 34 +++++++++++++++++++++++--------
 2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/docs/operations/cutover.md b/docs/operations/cutover.md
index d6bb14c..4fa9095 100644
--- a/docs/operations/cutover.md
+++ b/docs/operations/cutover.md
@@ -120,7 +120,24 @@ S3-compat bucket).
 5. Push the `legacy-import` branch to the production GitHub remote.
 6. Merge `legacy-import` into `main` (operator step — review the diff in a PR,
    resolve any path-template conflicts, then merge).
-7. Run reconciliation:
+7. **Prune confident-spam** from the runtime branch before it goes live. The
+   merge in step 6 re-adds the full raw import (spam included); the deployed pod
+   cannot hold the unpruned set in memory, so this step is **mandatory** after
+   every import/merge. See
+   [spam-detection.md → Applying spam decisions](./spam-detection.md#applying-spam-decisions--the-prune-step):
+
+   ```bash
+   npm run -w apps/api script:prune-spam -- \
+     --data-repo=/scratch/codeforphilly-data \
+     --evaluations-ref=spam-detection \
+     --branch=published \
+     --dry-run   # review counts, then drop --dry-run and push the branch
+   ```
+
+   Newly-imported accounts with no spam verdict yet are kept (the rule only
+   removes *confident* spam), so an incomplete eval pass is safe — it just keeps
+   more people than strictly necessary.
+8. Run reconciliation:
 
    ```bash
    npm run -w apps/api script:reconcile -- --json=/scratch/reconcile-T1.json
@@ -129,13 +146,13 @@ S3-compat bucket).
    Every counter should be zero in the orphan + inconsistent categories. If
    anything is flagged, **stop** and investigate before T-0.
 
-8. Deploy the rewrite to production via the production GitOps repo (a
+9. Deploy the rewrite to production via the production GitOps repo (a
    sibling to [`cfp-sandbox-cluster`](https://github.com/CodeForPhilly/cfp-sandbox-cluster)
    — see [deploy.md](deploy.md)).
The pod will boot against the just-imported
    data + bucket but receive no public traffic yet (Gateway hostname not
    pointed at the prod LoadBalancer yet).
 
-9. Smoke-test the production hostname through `/etc/hosts` or via direct
+10. Smoke-test the production hostname through `/etc/hosts` or via direct
    cluster IP: hit `/api/health`, `/api/people/`, `/api/projects/`. Don't yet
    flip DNS.
 
@@ -150,7 +167,10 @@ engineering second has the runbook open and reads checks back.
 2. **0:01 — final delta.** Re-run the importer against the live laddr site
    into the same data-repo path. UUIDs are read-forward from the previous
    snapshot's tree, so the diff between this commit and the T-1 commit is
-   exactly the records that changed upstream since T-1.
+   exactly the records that changed upstream since T-1. **Then merge into the
+   runtime branch and re-run the spam prune (step 7 above) before the pod
+   reloads** — the final-delta merge re-adds raw import records, so an unpruned
+   reload would re-bloat memory.
 3. **0:05 — DNS flip.** Update the `codeforphilly.org` A/CNAME to point at
    the rewrite's ingress. TTL was lowered to 60s a week ago, so propagation
    completes in under two minutes for most resolvers.
diff --git a/docs/operations/spam-detection.md b/docs/operations/spam-detection.md
index 7170667..7cec414 100644
--- a/docs/operations/spam-detection.md
+++ b/docs/operations/spam-detection.md
@@ -157,6 +157,10 @@ npm run evaluate-heuristic
 
 # 5. LLM-eval the new uncertain bucket
 npm run evaluate-llm
+
+# 6. Apply the verdicts — prune confident-spam from `published` (see below).
+# Run from the codeforphilly-rewrite repo against a bare clone, then push.
+# THIS STEP IS MANDATORY after any import/merge — see "Applying spam decisions".
 ```
 
 The heuristic re-evaluates everyone (deterministic, fast, free). Pass A skips slugs already in its cache — only new uncertains cost LLM tokens.
Estimate: typical refresh after a `legacy-import` snapshot is dominated by Pass A's cost on newly-imported uncertain accounts — usually under $1 unless a huge batch of new signups landed.
@@ -165,19 +169,33 @@ The heuristic re-evaluates everyone (deterministic, fast, free). Pass A skips sl
 
 When source records get updated (e.g., a previously-empty profile gets a new bio in a re-imported snapshot), the existing evaluation may no longer reflect the current data. The intended re-eval trigger is `person.updatedAt > evaluation.evaluatedAt`. The current scripts don't implement this filter — they just skip cached entries — so to force re-eval on a slug whose source changed, delete the cache entry first or use `--refresh`.
 
-## Applying spam decisions
+## Applying spam decisions — the prune step
 
-Currently the eval records are advisory — they describe verdicts but don't mutate the source data. The plan is to eventually run a **spam-purge** pass that hard-deletes confirmed-spam records and their associated content (memberships, buzz, updates) from `published`. Git history preserves everything for recovery; the deployed app no longer sees them.
+Verdicts are advisory until the **prune** step applies them. Prune is not a read-path filter (the runtime loader stays spam-unaware); it **removes confident-spam people from `published`** so the deployed app never loads them into memory or shows them. This is what keeps the in-memory footprint within the node budget — see [specs/behaviors/spam-exclusion.md](../../specs/behaviors/spam-exclusion.md) for the full contract.
 
-Until that purge is written and run, code on the read path can filter person-evaluations records inline:
+The tool is `apps/api/scripts/prune-spam.ts` in the **`codeforphilly-rewrite`** repo (not the data repo).
Run it against a bare clone of the data repo that carries both `published` and `spam-detection`, dry-run first, then push:
 
-```typescript
-const evals = await evaluationsSheet.queryAll({ personSlug });
-const verdict = pickVerdict(evals); // human > haiku > heuristic
-if (verdict === 'spam') return null; // skip this person
+```bash
+# From the codeforphilly-rewrite repo
+npm run -w apps/api script:prune-spam -- \
+  --data-repo=/path/to/codeforphilly-data.git \
+  --evaluations-ref=spam-detection \
+  --branch=published \
+  --threshold=0.8 \
+  --dry-run   # drop --dry-run to commit the prune
+
+git -C /path/to/codeforphilly-data.git push origin published
 ```
 
-No `tag-assignments` or moderation tags are used — verdicts live entirely in `person-evaluations` as the separate, dedicated record set.
+**Rule** (from the spec): a person is pruned iff they have a `spam` verdict at confidence ≥ threshold (default **0.8**), **no** `legit` verdict at any confidence, **and no project membership** — real project involvement overrides a spam verdict. **Cascade:** also deletes that person's `project-membership`, `help-wanted-interest`, and person `tag-assignment` records, and nulls `authorId` on their `project-update`s (history is kept). The run is **idempotent** and reports counts (pruned, protected-by-membership, cascade deletions).
+
+> ⚠️ **Ordering — prune AFTER every import/merge.** `published` is the merge target of `legacy-import`, the full raw snapshot (spam included). A fresh import or a merge into `published` **re-adds the pruned spam**, bloating `published` until the next prune — and re-introducing the boot-time OOM that motivated this whole step. The publish pipeline must always end with prune:
+>
+> **import → merge into `published` → (re-)eval new accounts → prune → push.**
+>
+> Never push a freshly-merged `published` without re-running the prune.
+
+`legacy-import` is kept complete and raw for audit/recovery; spam exclusion happens **only** at the `published` layer.
Git history preserves every original, so a wrongly-pruned person is recoverable by re-import. No `tag-assignments` or moderation tags are used for verdicts — they live entirely in `person-evaluations`.
 
 ## Inspection / auditing
 
From b966800e21d65468f3f4aac788f9f02f67bf10e1 Mon Sep 17 00:00:00 2001
From: Chris Alfano
Date: Fri, 26 Jun 2026 00:10:28 -0400
Subject: [PATCH 5/5] chore(plans): mark spam-prune done (PR #133)

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 plans/spam-prune.md | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/plans/spam-prune.md b/plans/spam-prune.md
index 4f40692..f051855 100644
--- a/plans/spam-prune.md
+++ b/plans/spam-prune.md
@@ -1,11 +1,11 @@
 ---
-status: in-progress
+status: done
 depends: []
 specs:
   - specs/behaviors/spam-exclusion.md
 issues:
   - 132
-pr:
+pr: 133
 ---
 
 # Plan: prune confident-spam people from published
@@ -46,15 +46,15 @@ What ships:
 
 ## Validation
 
-- [ ] Dry-run on a fresh clone reports: people before/after (~31.8k → ~12k),
-      cascade deletion counts, authors-unlinked count.
-- [ ] **Spot-check**: sample pruned people + their cascaded records look like
-      spam (empty/throwaway), not real members — a real-looking cascade is a
-      signal the threshold/rule is too aggressive.
-- [ ] Re-running is idempotent (second run = no changes).
-- [ ] After applying to a clone, a local API boot loads the pruned set under
-      ~1.5 GB heap (fits the current nodes).
-- [ ] `npm run type-check && npm run lint` clean.
+- [x] Dry-run on a fresh clone reports: 31,832 → 18,203 (pruned 13,629),
+      1,710 person tag-assignments deleted, 0 memberships/authors touched.
+- [x] **Spot-check**: 9/10 sampled prunes were unambiguous bulk commercial
+      spam; the 1 with a real project membership (quinn / phillytruce) is now
+      protected by the membership clause added after the spot-check.
+- [x] Re-running is idempotent (second run = 0 changes).
+- [x] Pruned set loads + builds FTS in **459 MB heap / 658 MB RSS** at a 1536
+      ceiling (full data OOM'd >2.5 GB) — fits the current nodes with margin.
+- [x] `npm run type-check && npm run lint` clean.
 
 ## Risks
 
@@ -67,4 +67,26 @@ What ships:
 
 ## Notes
 
+- The memory win is **super-linear**, not proportional to the 43% people cut:
+  removing spam dropped the cold-boot heap from >2.5 GB (OOM) to ~459 MB. Spam
+  accounts carry long marketing-copy bios, so they're individually large records
+  and heavy FTS terms — pruning them removes outsized memory, not just a head
+  count. So even the original 1536 heap now fits comfortably; the #131 bump to
+  2048/2.5Gi is just headroom.
+- **Membership protection added after the spot-check.** The first pass would have
+  pruned `quinn`, who had a real `phillytruce` membership (spam verdict rested on
+  one crypto-framed intro message). Added "no project membership" to the protect
+  rule — across all 13,629 prunes only 1 person was protected, confirming spam
+  accounts essentially never hold memberships (so the clause is near-free).
+- The prune is a **data-pipeline** step, not runtime — `loader.ts` is untouched.
+  `published` is spam-free only by running prune after every import/merge; this
+  ordering is now documented in spam-detection.md + cutover.md.
+
 ## Follow-ups
+
+- **Tracked as #132** — investigate the in-memory footprint itself (the per-record
+  heap cost), so growth doesn't re-pressure the budget even with spam pruned.
+- **Deferred (ops):** fold merge+prune into a single `publish`/rebuild step so the
+  ordering can't be skipped by hand. Documented for now; not yet automated. No
+  issue filed — revisit when the publish flow gets automated past the manual
+  cutover runbook.
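
Reviewer sketch: the prune rule stated in the spam-detection.md hunk of this series (a `spam` verdict at or above the threshold, no `legit` verdict at any confidence, and no project membership) can be written as a small pure function. This is a minimal illustration and not the code in `apps/api/scripts/prune-spam.ts`. The `PersonEvaluation` interface, the `shouldPrune` name, and the `hasProjectMembership` flag are assumptions made for the example; only the field names and the 0.8 default come from the spec.

```typescript
// Hypothetical shape: field names follow the person-evaluations sheet in the
// spec, but this interface itself is an assumption made for illustration.
type Verdict = 'spam' | 'legit' | 'uncertain';

interface PersonEvaluation {
  personSlug: string;
  evaluator: string;
  verdict: Verdict;
  confidence: number; // 0 to 1
}

// Default taken from the spec (SPAM_CONFIDENCE_THRESHOLD = 0.8).
const SPAM_CONFIDENCE_THRESHOLD = 0.8;

// Returns true only when every keep-signal is absent and at least one
// evaluator reported spam at or above the threshold.
function shouldPrune(
  evaluations: PersonEvaluation[],
  hasProjectMembership: boolean,
  threshold: number = SPAM_CONFIDENCE_THRESHOLD,
): boolean {
  // Real project involvement overrides any spam verdict.
  if (hasProjectMembership) return false;
  // A single legit verdict, at any confidence, protects the person.
  if (evaluations.some((e) => e.verdict === 'legit')) return false;
  // Uncertain, low-confidence spam, and unevaluated people fall through to "keep".
  return evaluations.some((e) => e.verdict === 'spam' && e.confidence >= threshold);
}
```

Checking the two protective signals before looking for confident spam keeps the conservative default explicit: anything that is not positively identified as confident spam is kept.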