fix: redact current_stage from student_public bundles by shaypal5 · Pull Request #56 · leadforge-dev/leadforge

shaypal5 · 2026-05-04T08:34:42Z

Summary

Fixes the current_stage leakage that was surviving into student_public task splits at the 90-day horizon (where the column held terminal stages closed_won/closed_lost, perfectly encoding the label). Closes the "Known issue" entry under Phase 5 of .agent-plan.md.

Why this approach (option 2 — exposure-layer redaction)

I considered both options listed in .agent-plan.md:

Windowed snapshot (snapshot at day < label horizon, like v6/v7 datasets at day 14/20). Fundamentally more correct — features anchored before label resolution — but blast radius is huge: every aggregated feature value shifts, all downstream documentation and conversion-rate copy needs rewriting, and every byte of every published Parquet changes for non-leakage reasons. The v6/v7 datasets that do use windowed snapshots are simplified one-off teaching CSVs (lead_scoring_intro_v*.csv), not the full relational release/ bundle.
Exposure-layer column redaction (chosen). Strip leakage_risk and not is_leakage_trap columns in student_public mode. Smaller blast radius (only the leaky columns disappear), generalisable rule, naturally reuses the existing BundleFilter machinery, and preserves the deliberately included pedagogical trap total_touches_all. The simulation, the relational tables, and every non-leakage feature are unchanged.

A new is_leakage_trap field on FeatureSpec distinguishes the trap (total_touches_all, intentionally included for teaching) from a true leak (current_stage, the bug). The redaction set is derived from features at import time — any future leakage_risk=True column without is_leakage_trap=True is automatically stripped from student_public exports.

is_mql / is_sql were reviewed at the same time per the task brief: they are point-in-time achievement flags that correlate with the label but don't deterministically encode it (is_sql=True for many non-converted leads who reached SQL but didn't close). Their existing leakage_risk=False is correct; no change needed.

Changes

Schema layer

schema/features.py: new is_leakage_trap: bool = False on FeatureSpec; set True on total_touches_all; export STUDENT_PUBLIC_REDACTED_COLUMNS.
schema/dictionaries.py: emit is_leakage_trap column in feature_dictionary.csv so consumers can tell traps from leaks.

Exposure layer

exposure/filters.py: BundleFilter.redacted_columns: frozenset[str]. Populated for student_public from the feature spec; empty for research_instructor.

Bundle assembly

api/bundle.py: drop redacted columns from the snapshot DataFrame before writing task splits, and pass the visible feature tuple to the dictionary writer and dataset card renderer. Manifest hashes therefore reflect the published column set.
narrative/dataset_card.py: accept a features arg so the categories / leakage section reflects what's actually published.

Validation

validation/bundle_checks.py: new _check_exposure_redaction() reads exposure_mode from the manifest and asserts that the redacted columns are not present in any task split or in the feature dictionary. Wires CLAUDE.md invariant docs: add mandatory branch/PR workflow rules #1 into validate_bundle() instead of relying on hope.
validation/invariants.py: relax check_exposure_monotonicity() from byte identity on feature_dictionary.csv and task splits to subset-with-shared-content semantics (student rows ⊆ instructor rows; shared columns hold identical values).

Build pipeline

scripts/build_public_release.py: drop _FLAT_CSV_DROP_COLS — the bundle writer now redacts at source, so the flat CSV inherits the redaction transitively.

Documentation

release/README.md and release/HF_DATASET_CARD.md: update column counts (33 + trap + target in student_public; 34 + trap + target in instructor) and rewrite the leakage section to explain stripped vs. deliberately retained.
.agent-plan.md: flip the "Known issue: current_stage leakage" entry to resolved.

Tests

tests/exposure/test_redaction.py (19 new tests):

Static checks: BundleFilter.redacted_columns matches the derived feature-spec set; research_instructor redacts nothing; the redaction set contains current_stage and excludes total_touches_all.
End-to-end (student_public): every task split lacks current_stage, every task split keeps total_touches_all, no redacted column survives in any split, the target column is intact, the feature dictionary excludes current_stage, includes the trap, and has the expected row count, and validate_bundle() produces no redaction errors.
End-to-end (research_instructor): every task split keeps current_stage and total_touches_all; the dictionary still has all 35 rows.
Cross-mode: student columns ⊆ instructor columns; the difference is exactly the redacted set; for shared columns the values match row-by-row (post-shuffle, by determinism).
Validation enforcement: an instructor bundle whose manifest has been hand-edited to claim student_public is correctly flagged by validate_bundle().

tests/schema/test_features.py extended with five new assertions covering the trap flag, the redaction set, and current_stage's role.

Verification

pytest: 899 passing (was 880; +19 new redaction tests).
ruff check / ruff format / mypy: clean.
python scripts/build_public_release.py release: all 4 bundles generated and validated. Confirmed via pyarrow.parquet.read_schema:
- release/{intro,intermediate,advanced}/.../train.parquet → 34 columns, no current_stage, total_touches_all present.
- release/intermediate_instructor/.../train.parquet → 35 columns including current_stage.
python scripts/verify_hash_determinism.py: 73/73 files hash-identical across two consecutive builds. Determinism preserved.

Out of scope

Per the task brief: uploading to Kaggle / HuggingFace and announcing the release (manual user steps, last in Phase 5).

Test plan

pytest passes (899/899)
ruff check and ruff format --check clean
mypy leadforge/ clean
scripts/build_public_release.py regenerates all 4 bundles, all pass validate_bundle()
scripts/verify_hash_determinism.py reports 73/73 identical
Manual inspection confirms current_stage is absent from student_public task splits and present in research_instructor
feature_dictionary.csv row counts: 34 in student_public (no current_stage), 35 in research_instructor

🤖 Generated with Claude Code

Add an `is_leakage_trap` flag to FeatureSpec and an exposure-layer column-redaction pass that drops `leakage_risk and not is_leakage_trap` columns from student_public snapshots, task splits, feature dictionary, and dataset card. `current_stage` (true label leak) is now removed in student_public; `total_touches_all` (intentional pedagogical trap) is preserved via the new flag. - `schema/features.py`: add `is_leakage_trap` field; mark `total_touches_all`; export `STUDENT_PUBLIC_REDACTED_COLUMNS`. - `schema/dictionaries.py`: emit `is_leakage_trap` column in `feature_dictionary.csv`. - `exposure/filters.py`: `BundleFilter.redacted_columns`; populated for student_public. - `api/bundle.py`: drop redacted columns from snapshot before splits; thread the visible feature tuple to dataset card and dictionary writers. - `narrative/dataset_card.py`: accept a `features` arg so categories / leakage section reflect what is actually published. - `validation/bundle_checks.py`: new `_check_exposure_redaction()` enforces CLAUDE.md invariant #1 — redacted columns must not appear in the published task splits or dictionary. - `validation/invariants.py`: relax exposure-monotonicity from byte identity to subset-with-shared-content for `feature_dictionary.csv` and task splits, so redaction doesn't trip the check. Tests: extend `tests/schema/test_features.py` with five new assertions covering the trap flag, the redaction set, and `current_stage`'s role. All 880 existing tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- `tests/exposure/test_redaction.py` (19 tests): bundle-level proof that `current_stage` is absent from every student_public task split, that `total_touches_all` (the pedagogical trap) is preserved, that the research_instructor bundle still contains both, that the student feature dictionary differs from the instructor's by exactly the redacted set, that shared columns hold identical values across modes, and that `validate_bundle()` flags a tampered student_public bundle whose task split still carries `current_stage`. - `scripts/build_public_release.py`: drop `_FLAT_CSV_DROP_COLS` — the bundle writer now redacts at source, so the flat CSV inherits it. - `release/README.md`, `release/HF_DATASET_CARD.md`: update column counts (33 features + 1 trap + 1 target in student_public; 34 + 1 + 1 in the instructor companion) and rewrite the leakage section to distinguish stripped columns from the deliberate trap. - `.agent-plan.md`: flip "Known issue: current_stage leakage" to resolved with a link to the redaction mechanism. Bundles regenerated; all four pass `leadforge validate`; `scripts/verify_hash_determinism.py` confirms 73/73 files identical across two consecutive builds. Full test suite: 899 passing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes a label leakage issue in student_public bundles by introducing an exposure-layer redaction rule that removes leakage-risk columns (except explicitly marked pedagogical traps) from published task splits and their associated metadata (feature dictionary + dataset card).

Changes:

Add is_leakage_trap to FeatureSpec and derive STUDENT_PUBLIC_REDACTED_COLUMNS from the canonical feature spec.
Apply redaction during bundle writing (before task splits / hashes) and thread the “visible” feature list into the feature dictionary + dataset card.
Strengthen validation/invariants and add end-to-end tests to ensure redacted columns never appear in student_public outputs.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
leadforge/schema/features.py	Adds `is_leakage_trap` and derives the student-public redaction set from feature specs.
leadforge/schema/dictionaries.py	Emits `is_leakage_trap` in `feature_dictionary.csv`.
leadforge/exposure/filters.py	Extends `BundleFilter` with `redacted_columns` and wires student-public defaults.
leadforge/api/bundle.py	Drops redacted columns pre-write and passes visible features to dictionary + dataset card rendering.
leadforge/narrative/dataset_card.py	Accepts a `features` argument so the card reflects the published column set.
leadforge/validation/bundle_checks.py	Adds `_check_exposure_redaction()` and wires it into `validate_bundle()`.
leadforge/validation/invariants.py	Updates exposure monotonicity checks to allow subset semantics for redacted outputs.
scripts/build_public_release.py	Removes flat-CSV-specific dropping in favor of source redaction.
tests/exposure/test_redaction.py	Adds end-to-end tests covering student/instructor behavior, cross-mode consistency, and validation enforcement.
tests/schema/test_features.py	Adds schema-level assertions for traps, redaction set, and dictionary columns.
release/README.md	Updates leakage section and column-count documentation to match redaction behavior.
release/HF_DATASET_CARD.md	Updates dataset card leakage explanation and feature/column counts.
.agent-plan.md	Marks the `current_stage` leakage issue as resolved with the chosen approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 |---|---|---|---|
 | Leads | 5,000 | 5,000 | 5,000 |
-| Features | 35 | 35 | 35 |
+| Features | 33 (+ 1 trap, + 1 target) | 33 (+ 1 trap, + 1 target) | 33 (+ 1 trap, + 1 target) |


 | Contacts | 4,200 | 4,200 | 4,200 |
-| Columns | 35 (34 features + 1 target) | 35 | 35 |
+| Columns (student_public) | 34 (33 features + 1 target) | 34 | 34 |
+| Columns (instructor) | 35 (34 features + 1 target, +`current_stage`) | -- | -- |


+        extra_in_student = set(s_df.index) - set(i_df.index)
+        if extra_in_student:
+            errors.append(
+                "feature_dictionary.csv: student has rows missing from instructor: "


+        extra_in_student = set(s_df.columns) - set(i_df.columns)
+        if extra_in_student:
+            errors.append(
+                f"Task {rel}: student has columns missing from instructor: "


+    For ``student_public`` bundles, the columns listed in the bundle filter's
+    ``redacted_columns`` must not appear in any task split or in the feature
+    dictionary.  This guards CLAUDE.md invariant #1 ("no post-snapshot
+    features in student_public exports") against accidental regressions in
+    the build pipeline.


Addresses senior-dev review of the previous two commits: 1. **Single source of truth.** Replaced the awkward `leakage_risk + is_leakage_trap` flag pair with a single prescriptive field `FeatureSpec.redact_in_modes: frozenset[ExposureMode]`. `leakage_risk` reverts to purely descriptive ("post-snapshot correlated"). `current_stage` carries `redact_in_modes={student_public}`; `total_touches_all` keeps `leakage_risk=True` with empty `redact_in_modes` (deliberate trap). New `redacted_columns_for(mode, features=...)` parameterizable function replaces the frozen module-level set. 2. **Validation independent of the writer.** `_check_exposure_redaction` now derives the expected redaction set directly from `LEAD_SNAPSHOT_FEATURES` via `redacted_columns_for`, *not* from `BundleFilter`. A bug in the filter no longer agrees-with-itself past the validator. Added a real regression test that writes a real student_public bundle, mutates a parquet to reinsert `current_stage`, and asserts validation fails. 3. **Self-describing manifest.** `manifest.json` now records `redacted_columns: [...]`. The validator cross-checks the manifest's declared set against the feature-spec-derived expected set; a second new test mutates the manifest to claim nothing was redacted and asserts validation flags the disagreement. 4. **Stricter exposure monotonicity.** `check_exposure_monotonicity` now asserts that `instructor_columns - student_columns` *equals* `redacted_columns_for(student_public)` — not just "is a superset". A future column drop in student_public outside the redaction set is now caught. 5. **Reverted feature_dictionary.csv schema.** The previous commit added an `is_leakage_trap` column to the published CSV without flagging the schema change. Removed: the redaction policy is package-internal and the bundle's actual published schema is observable from the parquet files and `manifest.redacted_columns`. CSV stays back-compat with prior releases (6 columns: name, dtype, description, category, is_target, leakage_risk). 6. **De-duplicated documentation.** `release/README.md` no longer maintains a per-category feature count table by hand — that data is in the auto-generated `dataset_card.md`. Removed the duplicated table; added explicit caveats covering known structural-leakage issues (event aggregates over the label window, `is_mql` zero variance, `is_sql=False` near-deterministic for non-conversion). 7. **Documented the structural follow-up.** `.agent-plan.md` now includes a clearly-scoped "Follow-up: structural leakage" section covering: (a) windowed-snapshot fix for event aggregates; (b) measured `is_sql` leakage P(conv|is_sql=False) = 0.038/0.015/0.006 across tiers; (c) `is_mql` is constant True (zero variance). Suggests filing a tracked issue (deferred — needs user confirmation per shared-state-action rules). Removed: `BundleFilter.redacted_columns` (replaced by direct calls to `redacted_columns_for(mode)` from the bundle writer). Tests: schema tests rewritten for `redact_in_modes` semantics; redaction tests updated for the new function-based API; +2 manifest-redaction tests; +1 real regression test (mutated parquet) replacing the previous manifest-string-edit test. 902 tests passing (was 899). Hash determinism verified: 73/73 files identical across two consecutive builds. All four release bundles regenerated; `feature_dictionary.csv` is now back to its pre-PR 6-column schema. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

shaypal5 · 2026-05-04T10:06:16Z

Self-review pass: 10 issues raised, 10 addressed in commit `66b96ee`

I took an outsider's view on the original two commits and found ten substantive problems. Concrete fixes per critique:

#	Issue	Action
1	Symptomatic vs structural fix (event aggregates also leak)	Documented as a clearly-scoped follow-up in `.agent-plan.md` covering windowed-snapshot, `is_sql`, and `is_mql`. Not opening a tracked issue without explicit confirmation.
2	`is_leakage_trap` contradicts `leakage_risk`	Collapsed: `leakage_risk` is now purely descriptive; new `redact_in_modes: frozenset[ExposureMode]` is the single prescriptive field.
3	`_check_exposure_redaction` was self-referential	Validator now derives expected redactions from `LEAD_SNAPSHOT_FEATURES` directly via `redacted_columns_for`, independent of `BundleFilter`. New regression test mutates a real parquet to reinsert `current_stage` and asserts validation fails.
4	`is_mql` / `is_sql` review hand-waved	Measured on regenerated bundle. P(conv \| is_sql=False) = 0.038 / 0.015 / 0.006 (intro/intermediate/advanced) — near-deterministic for the negative class. P(conv \| is_mql=True) = base rate (because `is_mql` is constant True — separate zero-variance bug). Captured in the follow-up note.
5	Bundle wasn't self-describing	`manifest.json` now carries `redacted_columns: [...]`; validator cross-checks. New test verifies the manifest matches the spec-derived expected set.
6	README duplicated auto-generated counts	Removed the per-category table; readers are pointed at `dataset_card.md` and `feature_dictionary.csv`.
7	Added `is_leakage_trap` column to `feature_dictionary.csv` without asking	Reverted. CSV is back to its pre-PR 6-column schema. The redaction policy stays package-internal; the published schema is observable from parquet + `manifest.redacted_columns`.
8	`check_exposure_monotonicity` was too lenient	Now asserts `instructor_cols - student_cols == redacted_columns_for(student_public)` exactly, not just superset.
9	Module-level `STUDENT_PUBLIC_REDACTED_COLUMNS` not parameterizable	Replaced by `redacted_columns_for(mode, features=LEAD_SNAPSHOT_FEATURES) -> frozenset[str]`.
10	Tautological `leakage_trap_implies_leakage_risk` test	Dropped with #2. New schema tests assert positive properties (target never redacted; `current_stage` is in the student_public set; trap is not).

Verification after refactor:

pytest: 902 passing (was 899; +3 new validator/manifest tests, no regressions)
ruff / mypy: clean
scripts/build_public_release.py regenerates all four bundles, all pass validate_bundle()
scripts/verify_hash_determinism.py: 73/73 files identical across two consecutive builds
feature_dictionary.csv reverted to pre-PR schema; manifest.json now has redacted_columns

github-actions · 2026-05-04T10:07:13Z

pr-agent-context report:

This run includes unresolved review comments on PR #56 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: release/HF_DATASET_CARD.md
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311658
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The summary table’s “Features” count appears inconsistent with the bundle’s actual published schema described elsewhere (e.g., `student_public` task splits are 34 columns total = 33 non-target columns + 1 target, where the 33 includes the `total_touches_all` trap). “33 (+ 1 trap, + 1 target)” reads like 35 columns. Please clarify whether the 33 includes the trap or update the numbers/labels to match the actual Parquet column counts.

## COPILOT-2
Location: release/README.md
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311717
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The “Columns (instructor)” cell is confusing/inconsistent: “35 (34 features + 1 target, +`current_stage`)” suggests `current_stage` is in addition to the 34 features, but it should be included in the 34 non-target columns if the total is 35. Please rephrase to avoid double-counting (e.g., “35 (34 features incl. `current_stage` + 1 target)”).

## COPILOT-3
Location: leadforge/validation/invariants.py:174
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311750
Root author: copilot-pull-request-reviewer

Comment:
    In `check_exposure_monotonicity()`, `extra_in_student = set(s_df.index) - set(i_df.index)` represents feature names that are present in the student dictionary but *not* in the instructor dictionary. The error text currently says “student has rows missing from instructor”, which reads backwards/ambiguous; it should say that student has rows *not present* in instructor (or similar) so the violation is clear.

## COPILOT-4
Location: leadforge/validation/invariants.py:265
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311775
Root author: copilot-pull-request-reviewer

Comment:
    `extra_in_student = set(s_df.columns) - set(i_df.columns)` means the student task split has columns that are *not present* in the instructor split. The emitted message says “student has columns missing from instructor”, which is misleading (it sounds like the inverse). Please reword to something like “student has extra columns not in instructor” to match the condition being checked.

## COPILOT-5
Location: leadforge/validation/bundle_checks.py
URL: https://github.com/leadforge-dev/leadforge/pull/56#discussion_r3180311801
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The docstring for `_check_exposure_redaction()` says it enforces “no post-snapshot features in student_public exports”, but the code explicitly allows leakage-risk features marked as pedagogical traps (e.g. `total_touches_all`). Consider updating the wording to reflect the actual contract (e.g., “no leakage-risk columns except explicit traps”), otherwise the comment/invariant reference is misleading.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25313065556 attempt 1
Comment timestamp: 2026-05-04T10:06:24.394223+00:00
PR head commit: 66b96ee1d567ad4fb1080c8326b25960401cda3f

Addresses senior-dev review of #58. Six fixes: 1. **Close the relational-table bypass.** The previous commit redacted columns only from the snapshot/task splits. Users following the README's "Option 3" (feature engineering off `tables/leads.parquet`) could trivially rejoin redacted columns by `account_id`/`contact_id`. Apply `redacted_columns_for(mode)` uniformly to every published parquet under both `tables/` and `tasks/`. In student_public bundles, `tables/leads.parquet` no longer carries `current_stage` or `is_sql`. This also closes the equivalent bypass that PR #56 left open for `current_stage`. 2. **Fully remove `is_mql`.** Previously left as a half-measure: the field was dropped from the snapshot but kept on `LeadRow` and so still appeared as a constant-True column in `tables/leads.parquet`. Now removed from the entity dataclass, `DTYPE_MAP`, `population.py`, and `engine.py`. Gone from every artifact in every mode. 3. **Multi-seed measurement for the is_sql claim.** The PR description's single-seed numbers overstated the case ("99.4% deterministic"). Re-measured across 5 seeds on full-size bundles: P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) / 0.011 ± 0.004 (advanced). Pattern holds, framing is now honest. CHANGELOG and release/README.md updated with the new numbers. 4. **Strengthen the zero-variance test.** The strict `nunique >= 2` guard caught `is_mql` but missed the weaker "rarest class is 1 row" pattern. For low-cardinality columns on bundles ≥ 200 rows, additionally assert the rarest class has ≥ 1% of rows (skipped on tiny test fixtures where the threshold would false-positive). 5. **Add an explicit v3 schema contract.** `tests/render/test_bundle_schema_v3_contract.py` pins the exact column set per mode for both `tasks/` and `tables/leads.parquet`. The duplication is intentional: any future change touching the feature spec, LeadRow, or redaction policy must update this file *and* bump `BUNDLE_SCHEMA_VERSION`. A bare schema-changing PR that doesn't touch this file fails here. 6. **Parameterize the realism boolean test.** `test_detects_non_boolean _feature` no longer hard-codes `opportunity_created`. It picks the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test time so future column renames don't break it for unrelated reasons. Validation knock-on: `_check_stage_distribution` (realism.py) and the seed-drift `current_stage` collection (drift.py) now skip when the column is missing — consistent with the new redaction surface. `check_exposure_monotonicity` extended to allow legitimate column differences in `tables/leads.parquet` between modes (subset of the expected redaction set). Tests: - 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance, -1 removed `is_mql` simulation test, others updated). - ruff + mypy clean. - All 4 release bundles regenerate and validate. - `verify_hash_determinism.py`: 73/73 files identical across builds. Per-bundle column counts after this commit: student_public: task=32, leads.parquet=9 research_instructor: task=34, leads.parquet=11 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: redact is_sql, remove zero-variance is_mql (closes 2/3 of #57) Bundle schema v2 → v3. Addresses two of the three structural-leakage concerns surfaced in PR #56's self-review and tracked in issue #57. **`is_mql` removed from the canonical feature list.** Every lead is initialised at MQL stage in `simulation/population.py`, making the column constant `True` and zero-variance — no information for modelling. The `LeadRow.is_mql` field is retained on the relational `leads.parquet` table; only the snapshot/task-split column and feature-dictionary row are removed. Affects all exposure modes. **`is_sql` redacted in `student_public` mode.** Measured on the v2 bundles: P(converted | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across intro / intermediate / advanced — at advanced tier essentially deterministic for the negative class. `is_sql` is added to `redact_in_modes={ExposureMode.student_public}`; the redaction machinery from PR #56 picks it up automatically. `is_sql` remains in `research_instructor` exports. **`BUNDLE_SCHEMA_VERSION` bumped to "3".** Removing `is_mql` changes the canonical feature set in every mode (not just a mode-conditional redaction), so a future external consumer pinning the v2 schema and checking for `is_mql` would silently break. Bumping is what the field is for. **New regression test.** `test_no_zero_variance_features` in `tests/exposure/test_redaction.py` reads each published student_public task split and asserts every non-ID, non-target column has at least two distinct values. Guards against the `is_mql` zero-variance regression and any future zero-variance feature slipping in. Bundle column counts (v3): - student_public/{intro,intermediate,advanced}: 32 (was 34) - research_instructor/intermediate_instructor: 34 (was 35) Verification: - 903 tests passing (was 902; +1 zero-variance regression test) - ruff + mypy clean - All 4 release bundles regenerated; all pass `validate_bundle()` - `verify_hash_determinism.py` reports 73/73 identical across builds Open follow-up: issue #57 sub-item 1 (event-aggregate features computed over the label window) — its own PR, will likely bump the schema again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: close redaction bypass via tables/, fully remove is_mql Addresses senior-dev review of #58. Six fixes: 1. **Close the relational-table bypass.** The previous commit redacted columns only from the snapshot/task splits. Users following the README's "Option 3" (feature engineering off `tables/leads.parquet`) could trivially rejoin redacted columns by `account_id`/`contact_id`. Apply `redacted_columns_for(mode)` uniformly to every published parquet under both `tables/` and `tasks/`. In student_public bundles, `tables/leads.parquet` no longer carries `current_stage` or `is_sql`. This also closes the equivalent bypass that PR #56 left open for `current_stage`. 2. **Fully remove `is_mql`.** Previously left as a half-measure: the field was dropped from the snapshot but kept on `LeadRow` and so still appeared as a constant-True column in `tables/leads.parquet`. Now removed from the entity dataclass, `DTYPE_MAP`, `population.py`, and `engine.py`. Gone from every artifact in every mode. 3. **Multi-seed measurement for the is_sql claim.** The PR description's single-seed numbers overstated the case ("99.4% deterministic"). Re-measured across 5 seeds on full-size bundles: P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) / 0.011 ± 0.004 (advanced). Pattern holds, framing is now honest. CHANGELOG and release/README.md updated with the new numbers. 4. **Strengthen the zero-variance test.** The strict `nunique >= 2` guard caught `is_mql` but missed the weaker "rarest class is 1 row" pattern. For low-cardinality columns on bundles ≥ 200 rows, additionally assert the rarest class has ≥ 1% of rows (skipped on tiny test fixtures where the threshold would false-positive). 5. **Add an explicit v3 schema contract.** `tests/render/test_bundle_schema_v3_contract.py` pins the exact column set per mode for both `tasks/` and `tables/leads.parquet`. The duplication is intentional: any future change touching the feature spec, LeadRow, or redaction policy must update this file *and* bump `BUNDLE_SCHEMA_VERSION`. A bare schema-changing PR that doesn't touch this file fails here. 6. **Parameterize the realism boolean test.** `test_detects_non_boolean _feature` no longer hard-codes `opportunity_created`. It picks the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test time so future column renames don't break it for unrelated reasons. Validation knock-on: `_check_stage_distribution` (realism.py) and the seed-drift `current_stage` collection (drift.py) now skip when the column is missing — consistent with the new redaction surface. `check_exposure_monotonicity` extended to allow legitimate column differences in `tables/leads.parquet` between modes (subset of the expected redaction set). Tests: - 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance, -1 removed `is_mql` simulation test, others updated). - ruff + mypy clean. - All 4 release bundles regenerate and validate. - `verify_hash_determinism.py`: 73/73 files identical across builds. Per-bundle column counts after this commit: student_public: task=32, leads.parquet=9 research_instructor: task=34, leads.parquet=11 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

shaypal5 and others added 2 commits May 4, 2026 11:29

shaypal5 added this to the v1.0.0 — Polished OSS release milestone May 4, 2026

shaypal5 added type: bugfix Fixes a bug layer: render render/ bundle and artifact output layer: exposure exposure/ truth-mode filtering layer: validation validation/ invariants and checks labels May 4, 2026

Copilot AI review requested due to automatic review settings May 4, 2026 08:34

Copilot started reviewing on behalf of shaypal5 May 4, 2026 08:35 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 4, 2026

View reviewed changes

shaypal5 mentioned this pull request May 4, 2026

Structural leakage in student_public bundles (post-#56 follow-up) #57

Closed

shaypal5 merged commit 841c6a8 into main May 4, 2026
8 checks passed

shaypal5 deleted the fix/current-stage-leakage-redaction branch May 4, 2026 10:41

shaypal5 mentioned this pull request May 4, 2026

fix: redact is_sql, remove zero-variance is_mql (2/3 of #57) #58

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: redact current_stage from student_public bundles#56

fix: redact current_stage from student_public bundles#56
shaypal5 merged 3 commits into
mainfrom
fix/current-stage-leakage-redaction

shaypal5 commented May 4, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

shaypal5 commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 4, 2026

Summary

Why this approach (option 2 — exposure-layer redaction)

Changes

Tests

Verification

Out of scope

Test plan

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

shaypal5 commented May 4, 2026

Self-review pass: 10 issues raised, 10 addressed in commit 66b96ee

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Self-review pass: 10 issues raised, 10 addressed in commit `66b96ee`