Skip to content

fix: redact current_stage from student_public bundles#56

Merged
shaypal5 merged 3 commits into
mainfrom
fix/current-stage-leakage-redaction
May 4, 2026
Merged

fix: redact current_stage from student_public bundles#56
shaypal5 merged 3 commits into
mainfrom
fix/current-stage-leakage-redaction

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes the current_stage leakage that was surviving into student_public task splits at the 90-day horizon (where the column held terminal stages closed_won/closed_lost, perfectly encoding the label). Closes the "Known issue" entry under Phase 5 of .agent-plan.md.

Why this approach (option 2 — exposure-layer redaction)

I considered both options listed in .agent-plan.md:

  1. Windowed snapshot (snapshot at day < label horizon, like v6/v7 datasets at day 14/20). Fundamentally more correct — features anchored before label resolution — but blast radius is huge: every aggregated feature value shifts, all downstream documentation and conversion-rate copy needs rewriting, and every byte of every published Parquet changes for non-leakage reasons. The v6/v7 datasets that do use windowed snapshots are simplified one-off teaching CSVs (lead_scoring_intro_v*.csv), not the full relational release/ bundle.
  2. Exposure-layer column redaction (chosen). Strip leakage_risk and not is_leakage_trap columns in student_public mode. Smaller blast radius (only the leaky columns disappear), generalisable rule, naturally reuses the existing BundleFilter machinery, and preserves the deliberately included pedagogical trap total_touches_all. The simulation, the relational tables, and every non-leakage feature are unchanged.

A new is_leakage_trap field on FeatureSpec distinguishes the trap (total_touches_all, intentionally included for teaching) from a true leak (current_stage, the bug). The redaction set is derived from features at import time — any future leakage_risk=True column without is_leakage_trap=True is automatically stripped from student_public exports.

is_mql / is_sql were reviewed at the same time per the task brief: they are point-in-time achievement flags that correlate with the label but don't deterministically encode it (is_sql=True for many non-converted leads who reached SQL but didn't close). Their existing leakage_risk=False is correct; no change needed.

Changes

Schema layer

  • schema/features.py: new is_leakage_trap: bool = False on FeatureSpec; set True on total_touches_all; export STUDENT_PUBLIC_REDACTED_COLUMNS.
  • schema/dictionaries.py: emit is_leakage_trap column in feature_dictionary.csv so consumers can tell traps from leaks.

Exposure layer

  • exposure/filters.py: BundleFilter.redacted_columns: frozenset[str]. Populated for student_public from the feature spec; empty for research_instructor.

Bundle assembly

  • api/bundle.py: drop redacted columns from the snapshot DataFrame before writing task splits, and pass the visible feature tuple to the dictionary writer and dataset card renderer. Manifest hashes therefore reflect the published column set.
  • narrative/dataset_card.py: accept a features arg so the categories / leakage section reflects what's actually published.

Validation

  • validation/bundle_checks.py: new _check_exposure_redaction() reads exposure_mode from the manifest and asserts that the redacted columns are not present in any task split or in the feature dictionary. Wires CLAUDE.md invariant docs: add mandatory branch/PR workflow rules #1 into validate_bundle() instead of relying on hope.
  • validation/invariants.py: relax check_exposure_monotonicity() from byte identity on feature_dictionary.csv and task splits to subset-with-shared-content semantics (student rows ⊆ instructor rows; shared columns hold identical values).

Build pipeline

  • scripts/build_public_release.py: drop _FLAT_CSV_DROP_COLS — the bundle writer now redacts at source, so the flat CSV inherits the redaction transitively.

Documentation

  • release/README.md and release/HF_DATASET_CARD.md: update column counts (33 + trap + target in student_public; 34 + trap + target in instructor) and rewrite the leakage section to explain stripped vs. deliberately retained.
  • .agent-plan.md: flip the "Known issue: current_stage leakage" entry to resolved.

Tests

tests/exposure/test_redaction.py (19 new tests):

  • Static checks: BundleFilter.redacted_columns matches the derived feature-spec set; research_instructor redacts nothing; the redaction set contains current_stage and excludes total_touches_all.
  • End-to-end (student_public): every task split lacks current_stage, every task split keeps total_touches_all, no redacted column survives in any split, the target column is intact, the feature dictionary excludes current_stage, includes the trap, and has the expected row count, and validate_bundle() produces no redaction errors.
  • End-to-end (research_instructor): every task split keeps current_stage and total_touches_all; the dictionary still has all 35 rows.
  • Cross-mode: student columns ⊆ instructor columns; the difference is exactly the redacted set; for shared columns the values match row-by-row (post-shuffle, by determinism).
  • Validation enforcement: an instructor bundle whose manifest has been hand-edited to claim student_public is correctly flagged by validate_bundle().

tests/schema/test_features.py extended with five new assertions covering the trap flag, the redaction set, and current_stage's role.

Verification

  • pytest: 899 passing (was 880; +19 new redaction tests).
  • ruff check / ruff format / mypy: clean.
  • python scripts/build_public_release.py release: all 4 bundles generated and validated. Confirmed via pyarrow.parquet.read_schema:
    • release/{intro,intermediate,advanced}/.../train.parquet → 34 columns, no current_stage, total_touches_all present.
    • release/intermediate_instructor/.../train.parquet → 35 columns including current_stage.
  • python scripts/verify_hash_determinism.py: 73/73 files hash-identical across two consecutive builds. Determinism preserved.

Out of scope

Per the task brief: uploading to Kaggle / HuggingFace and announcing the release (manual user steps, last in Phase 5).

Test plan

  • pytest passes (899/899)
  • ruff check and ruff format --check clean
  • mypy leadforge/ clean
  • scripts/build_public_release.py regenerates all 4 bundles, all pass validate_bundle()
  • scripts/verify_hash_determinism.py reports 73/73 identical
  • Manual inspection confirms current_stage is absent from student_public task splits and present in research_instructor
  • feature_dictionary.csv row counts: 34 in student_public (no current_stage), 35 in research_instructor

🤖 Generated with Claude Code

shaypal5 and others added 2 commits May 4, 2026 11:29
Add an `is_leakage_trap` flag to FeatureSpec and an exposure-layer
column-redaction pass that drops `leakage_risk and not is_leakage_trap`
columns from student_public snapshots, task splits, feature dictionary,
and dataset card. `current_stage` (true label leak) is now removed in
student_public; `total_touches_all` (intentional pedagogical trap) is
preserved via the new flag.

- `schema/features.py`: add `is_leakage_trap` field; mark
  `total_touches_all`; export `STUDENT_PUBLIC_REDACTED_COLUMNS`.
- `schema/dictionaries.py`: emit `is_leakage_trap` column in
  `feature_dictionary.csv`.
- `exposure/filters.py`: `BundleFilter.redacted_columns`; populated for
  student_public.
- `api/bundle.py`: drop redacted columns from snapshot before splits;
  thread the visible feature tuple to dataset card and dictionary writers.
- `narrative/dataset_card.py`: accept a `features` arg so categories /
  leakage section reflect what is actually published.
- `validation/bundle_checks.py`: new `_check_exposure_redaction()`
  enforces CLAUDE.md invariant #1 — redacted columns must not appear in
  the published task splits or dictionary.
- `validation/invariants.py`: relax exposure-monotonicity from byte
  identity to subset-with-shared-content for `feature_dictionary.csv`
  and task splits, so redaction doesn't trip the check.

Tests: extend `tests/schema/test_features.py` with five new assertions
covering the trap flag, the redaction set, and `current_stage`'s role.

All 880 existing tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- `tests/exposure/test_redaction.py` (19 tests): bundle-level proof that
  `current_stage` is absent from every student_public task split, that
  `total_touches_all` (the pedagogical trap) is preserved, that the
  research_instructor bundle still contains both, that the student
  feature dictionary differs from the instructor's by exactly the
  redacted set, that shared columns hold identical values across modes,
  and that `validate_bundle()` flags a tampered student_public bundle
  whose task split still carries `current_stage`.
- `scripts/build_public_release.py`: drop `_FLAT_CSV_DROP_COLS` — the
  bundle writer now redacts at source, so the flat CSV inherits it.
- `release/README.md`, `release/HF_DATASET_CARD.md`: update column
  counts (33 features + 1 trap + 1 target in student_public; 34 + 1 + 1
  in the instructor companion) and rewrite the leakage section to
  distinguish stripped columns from the deliberate trap.
- `.agent-plan.md`: flip "Known issue: current_stage leakage" to
  resolved with a link to the redaction mechanism.

Bundles regenerated; all four pass `leadforge validate`;
`scripts/verify_hash_determinism.py` confirms 73/73 files identical
across two consecutive builds. Full test suite: 899 passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 shaypal5 added type: bugfix Fixes a bug layer: render render/ bundle and artifact output layer: exposure exposure/ truth-mode filtering layer: validation validation/ invariants and checks labels May 4, 2026
Copilot AI review requested due to automatic review settings May 4, 2026 08:34
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a label leakage issue in student_public bundles by introducing an exposure-layer redaction rule that removes leakage-risk columns (except explicitly marked pedagogical traps) from published task splits and their associated metadata (feature dictionary + dataset card).

Changes:

  • Add is_leakage_trap to FeatureSpec and derive STUDENT_PUBLIC_REDACTED_COLUMNS from the canonical feature spec.
  • Apply redaction during bundle writing (before task splits / hashes) and thread the “visible” feature list into the feature dictionary + dataset card.
  • Strengthen validation/invariants and add end-to-end tests to ensure redacted columns never appear in student_public outputs.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
leadforge/schema/features.py Adds is_leakage_trap and derives the student-public redaction set from feature specs.
leadforge/schema/dictionaries.py Emits is_leakage_trap in feature_dictionary.csv.
leadforge/exposure/filters.py Extends BundleFilter with redacted_columns and wires student-public defaults.
leadforge/api/bundle.py Drops redacted columns pre-write and passes visible features to dictionary + dataset card rendering.
leadforge/narrative/dataset_card.py Accepts a features argument so the card reflects the published column set.
leadforge/validation/bundle_checks.py Adds _check_exposure_redaction() and wires it into validate_bundle().
leadforge/validation/invariants.py Updates exposure monotonicity checks to allow subset semantics for redacted outputs.
scripts/build_public_release.py Removes flat-CSV-specific dropping in favor of source redaction.
tests/exposure/test_redaction.py Adds end-to-end tests covering student/instructor behavior, cross-mode consistency, and validation enforcement.
tests/schema/test_features.py Adds schema-level assertions for traps, redaction set, and dictionary columns.
release/README.md Updates leakage section and column-count documentation to match redaction behavior.
release/HF_DATASET_CARD.md Updates dataset card leakage explanation and feature/column counts.
.agent-plan.md Marks the current_stage leakage issue as resolved with the chosen approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread release/HF_DATASET_CARD.md Outdated
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Features | 35 | 35 | 35 |
| Features | 33 (+ 1 trap, + 1 target) | 33 (+ 1 trap, + 1 target) | 33 (+ 1 trap, + 1 target) |
Comment thread release/README.md Outdated
| Contacts | 4,200 | 4,200 | 4,200 |
| Columns | 35 (34 features + 1 target) | 35 | 35 |
| Columns (student_public) | 34 (33 features + 1 target) | 34 | 34 |
| Columns (instructor) | 35 (34 features + 1 target, +`current_stage`) | -- | -- |
extra_in_student = set(s_df.index) - set(i_df.index)
if extra_in_student:
errors.append(
"feature_dictionary.csv: student has rows missing from instructor: "
extra_in_student = set(s_df.columns) - set(i_df.columns)
if extra_in_student:
errors.append(
f"Task {rel}: student has columns missing from instructor: "
Comment thread leadforge/validation/bundle_checks.py Outdated
Comment on lines +192 to +196
For ``student_public`` bundles, the columns listed in the bundle filter's
``redacted_columns`` must not appear in any task split or in the feature
dictionary. This guards CLAUDE.md invariant #1 ("no post-snapshot
features in student_public exports") against accidental regressions in
the build pipeline.
Addresses senior-dev review of the previous two commits:

1. **Single source of truth.** Replaced the awkward
   `leakage_risk + is_leakage_trap` flag pair with a single prescriptive
   field `FeatureSpec.redact_in_modes: frozenset[ExposureMode]`.
   `leakage_risk` reverts to purely descriptive ("post-snapshot
   correlated"). `current_stage` carries `redact_in_modes={student_public}`;
   `total_touches_all` keeps `leakage_risk=True` with empty `redact_in_modes`
   (deliberate trap). New `redacted_columns_for(mode, features=...)`
   parameterizable function replaces the frozen module-level set.

2. **Validation independent of the writer.** `_check_exposure_redaction`
   now derives the expected redaction set directly from `LEAD_SNAPSHOT_FEATURES`
   via `redacted_columns_for`, *not* from `BundleFilter`. A bug in the
   filter no longer agrees-with-itself past the validator. Added a real
   regression test that writes a real student_public bundle, mutates a
   parquet to reinsert `current_stage`, and asserts validation fails.

3. **Self-describing manifest.** `manifest.json` now records
   `redacted_columns: [...]`. The validator cross-checks the manifest's
   declared set against the feature-spec-derived expected set; a
   second new test mutates the manifest to claim nothing was redacted
   and asserts validation flags the disagreement.

4. **Stricter exposure monotonicity.** `check_exposure_monotonicity`
   now asserts that `instructor_columns - student_columns` *equals*
   `redacted_columns_for(student_public)` — not just "is a superset".
   A future column drop in student_public outside the redaction set is
   now caught.

5. **Reverted feature_dictionary.csv schema.** The previous commit added
   an `is_leakage_trap` column to the published CSV without flagging the
   schema change. Removed: the redaction policy is package-internal and
   the bundle's actual published schema is observable from the parquet
   files and `manifest.redacted_columns`. CSV stays back-compat with
   prior releases (6 columns: name, dtype, description, category,
   is_target, leakage_risk).

6. **De-duplicated documentation.** `release/README.md` no longer
   maintains a per-category feature count table by hand — that data is
   in the auto-generated `dataset_card.md`. Removed the duplicated
   table; added explicit caveats covering known structural-leakage
   issues (event aggregates over the label window, `is_mql` zero
   variance, `is_sql=False` near-deterministic for non-conversion).

7. **Documented the structural follow-up.** `.agent-plan.md` now
   includes a clearly-scoped "Follow-up: structural leakage" section
   covering: (a) windowed-snapshot fix for event aggregates;
   (b) measured `is_sql` leakage P(conv|is_sql=False) = 0.038/0.015/0.006
   across tiers; (c) `is_mql` is constant True (zero variance). Suggests
   filing a tracked issue (deferred — needs user confirmation per
   shared-state-action rules).

Removed: `BundleFilter.redacted_columns` (replaced by direct calls to
`redacted_columns_for(mode)` from the bundle writer).

Tests: schema tests rewritten for `redact_in_modes` semantics; redaction
tests updated for the new function-based API; +2 manifest-redaction
tests; +1 real regression test (mutated parquet) replacing the previous
manifest-string-edit test. 902 tests passing (was 899).

Hash determinism verified: 73/73 files identical across two consecutive
builds. All four release bundles regenerated; `feature_dictionary.csv`
is now back to its pre-PR 6-column schema.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5

shaypal5 commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Self-review pass: 10 issues raised, 10 addressed in commit 66b96ee

I took an outsider's view on the original two commits and found ten substantive problems. Concrete fixes per critique:

# Issue Action
1 Symptomatic vs structural fix (event aggregates also leak) Documented as a clearly-scoped follow-up in .agent-plan.md covering windowed-snapshot, is_sql, and is_mql. Not opening a tracked issue without explicit confirmation.
2 is_leakage_trap contradicts leakage_risk Collapsed: leakage_risk is now purely descriptive; new redact_in_modes: frozenset[ExposureMode] is the single prescriptive field.
3 _check_exposure_redaction was self-referential Validator now derives expected redactions from LEAD_SNAPSHOT_FEATURES directly via redacted_columns_for, independent of BundleFilter. New regression test mutates a real parquet to reinsert current_stage and asserts validation fails.
4 is_mql / is_sql review hand-waved Measured on regenerated bundle. P(conv | is_sql=False) = 0.038 / 0.015 / 0.006 (intro/intermediate/advanced) — near-deterministic for the negative class. P(conv | is_mql=True) = base rate (because is_mql is constant True — separate zero-variance bug). Captured in the follow-up note.
5 Bundle wasn't self-describing manifest.json now carries redacted_columns: [...]; validator cross-checks. New test verifies the manifest matches the spec-derived expected set.
6 README duplicated auto-generated counts Removed the per-category table; readers are pointed at dataset_card.md and feature_dictionary.csv.
7 Added is_leakage_trap column to feature_dictionary.csv without asking Reverted. CSV is back to its pre-PR 6-column schema. The redaction policy stays package-internal; the published schema is observable from parquet + manifest.redacted_columns.
8 check_exposure_monotonicity was too lenient Now asserts instructor_cols - student_cols == redacted_columns_for(student_public) exactly, not just superset.
9 Module-level STUDENT_PUBLIC_REDACTED_COLUMNS not parameterizable Replaced by redacted_columns_for(mode, features=LEAD_SNAPSHOT_FEATURES) -> frozenset[str].
10 Tautological leakage_trap_implies_leakage_risk test Dropped with #2. New schema tests assert positive properties (target never redacted; current_stage is in the student_public set; trap is not).

Verification after refactor:

  • pytest: 902 passing (was 899; +3 new validator/manifest tests, no regressions)
  • ruff / mypy: clean
  • scripts/build_public_release.py regenerates all four bundles, all pass validate_bundle()
  • scripts/verify_hash_determinism.py: 73/73 files identical across two consecutive builds
  • feature_dictionary.csv reverted to pre-PR schema; manifest.json now has redacted_columns

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #56 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: release/HF_DATASET_CARD.md
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311658
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The summary table’s “Features” count appears inconsistent with the bundle’s actual published schema described elsewhere (e.g., `student_public` task splits are 34 columns total = 33 non-target columns + 1 target, where the 33 includes the `total_touches_all` trap). “33 (+ 1 trap, + 1 target)” reads like 35 columns. Please clarify whether the 33 includes the trap or update the numbers/labels to match the actual Parquet column counts.

## COPILOT-2
Location: release/README.md
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311717
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The “Columns (instructor)” cell is confusing/inconsistent: “35 (34 features + 1 target, +`current_stage`)” suggests `current_stage` is in addition to the 34 features, but it should be included in the 34 non-target columns if the total is 35. Please rephrase to avoid double-counting (e.g., “35 (34 features incl. `current_stage` + 1 target)”).

## COPILOT-3
Location: leadforge/validation/invariants.py:174
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311750
Root author: copilot-pull-request-reviewer

Comment:
    In `check_exposure_monotonicity()`, `extra_in_student = set(s_df.index) - set(i_df.index)` represents feature names that are present in the student dictionary but *not* in the instructor dictionary. The error text currently says “student has rows missing from instructor”, which reads backwards/ambiguous; it should say that student has rows *not present* in instructor (or similar) so the violation is clear.

## COPILOT-4
Location: leadforge/validation/invariants.py:265
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311775
Root author: copilot-pull-request-reviewer

Comment:
    `extra_in_student = set(s_df.columns) - set(i_df.columns)` means the student task split has columns that are *not present* in the instructor split. The emitted message says “student has columns missing from instructor”, which is misleading (it sounds like the inverse). Please reword to something like “student has extra columns not in instructor” to match the condition being checked.

## COPILOT-5
Location: leadforge/validation/bundle_checks.py
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311801
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The docstring for `_check_exposure_redaction()` says it enforces “no post-snapshot features in student_public exports”, but the code explicitly allows leakage-risk features marked as pedagogical traps (e.g. `total_touches_all`). Consider updating the wording to reflect the actual contract (e.g., “no leakage-risk columns except explicit traps”), otherwise the comment/invariant reference is misleading.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25313065556 attempt 1
Comment timestamp: 2026-05-04T10:06:24.394223+00:00
PR head commit: 66b96ee1d567ad4fb1080c8326b25960401cda3f

@shaypal5 shaypal5 merged commit 841c6a8 into main May 4, 2026
8 checks passed
@shaypal5 shaypal5 deleted the fix/current-stage-leakage-redaction branch May 4, 2026 10:41
shaypal5 added a commit that referenced this pull request May 4, 2026
Addresses senior-dev review of #58.  Six fixes:

1. **Close the relational-table bypass.**  The previous commit redacted
   columns only from the snapshot/task splits.  Users following the
   README's "Option 3" (feature engineering off `tables/leads.parquet`)
   could trivially rejoin redacted columns by `account_id`/`contact_id`.
   Apply `redacted_columns_for(mode)` uniformly to every published
   parquet under both `tables/` and `tasks/`.  In student_public
   bundles, `tables/leads.parquet` no longer carries `current_stage` or
   `is_sql`.  This also closes the equivalent bypass that PR #56 left
   open for `current_stage`.

2. **Fully remove `is_mql`.**  Previously left as a half-measure: the
   field was dropped from the snapshot but kept on `LeadRow` and so
   still appeared as a constant-True column in `tables/leads.parquet`.
   Now removed from the entity dataclass, `DTYPE_MAP`,
   `population.py`, and `engine.py`.  Gone from every artifact in
   every mode.

3. **Multi-seed measurement for the is_sql claim.**  The PR description's
   single-seed numbers overstated the case ("99.4% deterministic").
   Re-measured across 5 seeds on full-size bundles:
   P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010
   (intermediate) / 0.011 ± 0.004 (advanced).  Pattern holds, framing
   is now honest.  CHANGELOG and release/README.md updated with the
   new numbers.

4. **Strengthen the zero-variance test.**  The strict `nunique >= 2`
   guard caught `is_mql` but missed the weaker "rarest class is 1
   row" pattern.  For low-cardinality columns on bundles ≥ 200 rows,
   additionally assert the rarest class has ≥ 1% of rows (skipped on
   tiny test fixtures where the threshold would false-positive).

5. **Add an explicit v3 schema contract.**
   `tests/render/test_bundle_schema_v3_contract.py` pins the exact
   column set per mode for both `tasks/` and `tables/leads.parquet`.
   The duplication is intentional: any future change touching the
   feature spec, LeadRow, or redaction policy must update this file
   *and* bump `BUNDLE_SCHEMA_VERSION`.  A bare schema-changing PR
   that doesn't touch this file fails here.

6. **Parameterize the realism boolean test.**  `test_detects_non_boolean
   _feature` no longer hard-codes `opportunity_created`.  It picks
   the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test
   time so future column renames don't break it for unrelated reasons.

Validation knock-on: `_check_stage_distribution` (realism.py) and the
seed-drift `current_stage` collection (drift.py) now skip when the
column is missing — consistent with the new redaction surface.
`check_exposure_monotonicity` extended to allow legitimate column
differences in `tables/leads.parquet` between modes (subset of the
expected redaction set).

Tests:
- 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance,
  -1 removed `is_mql` simulation test, others updated).
- ruff + mypy clean.
- All 4 release bundles regenerate and validate.
- `verify_hash_determinism.py`: 73/73 files identical across builds.

Per-bundle column counts after this commit:
  student_public: task=32, leads.parquet=9
  research_instructor: task=34, leads.parquet=11

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 4, 2026
* fix: redact is_sql, remove zero-variance is_mql (closes 2/3 of #57)

Bundle schema v2 → v3.  Addresses two of the three structural-leakage
concerns surfaced in PR #56's self-review and tracked in issue #57.

**`is_mql` removed from the canonical feature list.** Every lead is
initialised at MQL stage in `simulation/population.py`, making the
column constant `True` and zero-variance — no information for modelling.
The `LeadRow.is_mql` field is retained on the relational `leads.parquet`
table; only the snapshot/task-split column and feature-dictionary row
are removed. Affects all exposure modes.

**`is_sql` redacted in `student_public` mode.** Measured on the v2
bundles: P(converted | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across
intro / intermediate / advanced — at advanced tier essentially
deterministic for the negative class.  `is_sql` is added to
`redact_in_modes={ExposureMode.student_public}`; the redaction
machinery from PR #56 picks it up automatically.  `is_sql` remains in
`research_instructor` exports.

**`BUNDLE_SCHEMA_VERSION` bumped to "3".** Removing `is_mql` changes the
canonical feature set in every mode (not just a mode-conditional
redaction), so a future external consumer pinning the v2 schema and
checking for `is_mql` would silently break.  Bumping is what the field
is for.

**New regression test.** `test_no_zero_variance_features` in
`tests/exposure/test_redaction.py` reads each published student_public
task split and asserts every non-ID, non-target column has at least
two distinct values.  Guards against the `is_mql` zero-variance
regression and any future zero-variance feature slipping in.

Bundle column counts (v3):
  - student_public/{intro,intermediate,advanced}: 32 (was 34)
  - research_instructor/intermediate_instructor: 34 (was 35)

Verification:
  - 903 tests passing (was 902; +1 zero-variance regression test)
  - ruff + mypy clean
  - All 4 release bundles regenerated; all pass `validate_bundle()`
  - `verify_hash_determinism.py` reports 73/73 identical across builds

Open follow-up: issue #57 sub-item 1 (event-aggregate features computed
over the label window) — its own PR, will likely bump the schema again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor: close redaction bypass via tables/, fully remove is_mql

Addresses senior-dev review of #58.  Six fixes:

1. **Close the relational-table bypass.**  The previous commit redacted
   columns only from the snapshot/task splits.  Users following the
   README's "Option 3" (feature engineering off `tables/leads.parquet`)
   could trivially rejoin redacted columns by `account_id`/`contact_id`.
   Apply `redacted_columns_for(mode)` uniformly to every published
   parquet under both `tables/` and `tasks/`.  In student_public
   bundles, `tables/leads.parquet` no longer carries `current_stage` or
   `is_sql`.  This also closes the equivalent bypass that PR #56 left
   open for `current_stage`.

2. **Fully remove `is_mql`.**  Previously left as a half-measure: the
   field was dropped from the snapshot but kept on `LeadRow` and so
   still appeared as a constant-True column in `tables/leads.parquet`.
   Now removed from the entity dataclass, `DTYPE_MAP`,
   `population.py`, and `engine.py`.  Gone from every artifact in
   every mode.

3. **Multi-seed measurement for the is_sql claim.**  The PR description's
   single-seed numbers overstated the case ("99.4% deterministic").
   Re-measured across 5 seeds on full-size bundles:
   P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010
   (intermediate) / 0.011 ± 0.004 (advanced).  Pattern holds, framing
   is now honest.  CHANGELOG and release/README.md updated with the
   new numbers.

4. **Strengthen the zero-variance test.**  The strict `nunique >= 2`
   guard caught `is_mql` but missed the weaker "rarest class is 1
   row" pattern.  For low-cardinality columns on bundles ≥ 200 rows,
   additionally assert the rarest class has ≥ 1% of rows (skipped on
   tiny test fixtures where the threshold would false-positive).

5. **Add an explicit v3 schema contract.**
   `tests/render/test_bundle_schema_v3_contract.py` pins the exact
   column set per mode for both `tasks/` and `tables/leads.parquet`.
   The duplication is intentional: any future change touching the
   feature spec, LeadRow, or redaction policy must update this file
   *and* bump `BUNDLE_SCHEMA_VERSION`.  A bare schema-changing PR
   that doesn't touch this file fails here.

6. **Parameterize the realism boolean test.**  `test_detects_non_boolean
   _feature` no longer hard-codes `opportunity_created`.  It picks
   the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test
   time so future column renames don't break it for unrelated reasons.

Validation knock-on: `_check_stage_distribution` (realism.py) and the
seed-drift `current_stage` collection (drift.py) now skip when the
column is missing — consistent with the new redaction surface.
`check_exposure_monotonicity` extended to allow legitimate column
differences in `tables/leads.parquet` between modes (subset of the
expected redaction set).

Tests:
- 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance,
  -1 removed `is_mql` simulation test, others updated).
- ruff + mypy clean.
- All 4 release bundles regenerate and validate.
- `verify_hash_determinism.py`: 73/73 files identical across builds.

Per-bundle column counts after this commit:
  student_public: task=32, leads.parquet=9
  research_instructor: task=34, leads.parquet=11

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: exposure exposure/ truth-mode filtering layer: render render/ bundle and artifact output layer: validation validation/ invariants and checks type: bugfix Fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants