fix: redact is_sql, remove zero-variance is_mql (2/3 of #57)#58
Merged
Conversation
Bundle schema v2 → v3. Addresses two of the three structural-leakage concerns surfaced in PR #56's self-review and tracked in issue #57. **`is_mql` removed from the canonical feature list.** Every lead is initialised at MQL stage in `simulation/population.py`, making the column constant `True` and zero-variance — no information for modelling. The `LeadRow.is_mql` field is retained on the relational `leads.parquet` table; only the snapshot/task-split column and feature-dictionary row are removed. Affects all exposure modes. **`is_sql` redacted in `student_public` mode.** Measured on the v2 bundles: P(converted | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across intro / intermediate / advanced — at advanced tier essentially deterministic for the negative class. `is_sql` is added to `redact_in_modes={ExposureMode.student_public}`; the redaction machinery from PR #56 picks it up automatically. `is_sql` remains in `research_instructor` exports. **`BUNDLE_SCHEMA_VERSION` bumped to "3".** Removing `is_mql` changes the canonical feature set in every mode (not just a mode-conditional redaction), so a future external consumer pinning the v2 schema and checking for `is_mql` would silently break. Bumping is what the field is for. **New regression test.** `test_no_zero_variance_features` in `tests/exposure/test_redaction.py` reads each published student_public task split and asserts every non-ID, non-target column has at least two distinct values. Guards against the `is_mql` zero-variance regression and any future zero-variance feature slipping in. Bundle column counts (v3): - student_public/{intro,intermediate,advanced}: 32 (was 34) - research_instructor/intermediate_instructor: 34 (was 35) Verification: - 903 tests passing (was 902; +1 zero-variance regression test) - ruff + mypy clean - All 4 release bundles regenerated; all pass `validate_bundle()` - `verify_hash_determinism.py` reports 73/73 identical across builds Open follow-up: issue #57 sub-item 1 (event-aggregate features computed over the label window) — its own PR, will likely bump the schema again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR updates the public-release feature surface to address two issue #57 leakage follow-ups: removing zero-variance is_mql from the canonical snapshot feature list and redacting is_sql from student-facing task exports, with the matching schema/version, docs, and tests updated around that change.
Changes:
- Remove
is_mqlfrom canonical snapshot features and markis_sqlforstudent_publicredaction. - Bump bundle schema version to v3 and refresh release/changelog/docs to describe the new published column set.
- Add/adjust regression tests around redaction and realism validation inputs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
tests/validation/test_realism.py |
Updates the boolean-dtype realism regression to corrupt opportunity_created instead of removed is_mql. |
tests/exposure/test_redaction.py |
Adds a new zero-variance regression test for student-public task exports. |
release/README.md |
Updates public release summary counts and leakage-handling documentation for schema v3. |
release/HF_DATASET_CARD.md |
Updates Hugging Face dataset card counts and leakage notes for the public bundles. |
leadforge/schema/features.py |
Removes is_mql from canonical snapshot features and redacts is_sql in student_public. |
leadforge/render/manifests.py |
Bumps BUNDLE_SCHEMA_VERSION from 2 to 3 and documents schema history. |
CHANGELOG.md |
Adds unreleased notes for schema v3, column-count changes, and follow-up status. |
.agent-plan.md |
Marks issue #57 sub-items 2 and 3 as resolved and leaves the windowed-snapshot work open. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+156
to
+164
| "Whether the lead had achieved SQL status at snapshot date. " | ||
| "Strongly correlated with the label: the simulator only converts " | ||
| "non-SQL leads via a rare direct-conversion path, so " | ||
| "is_sql=False predicts non-conversion with very high probability " | ||
| "(P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across difficulty " | ||
| "tiers). Redacted from student_public bundles.", | ||
| "lead_meta", | ||
| leakage_risk=True, | ||
| redact_in_modes=frozenset({ExposureMode.student_public}), |
Comment on lines
+132
to
145
| """ | ||
| df = pd.read_parquet(bundle / "tasks/converted_within_90_days/train.parquet") | ||
| exempt = {"account_id", "contact_id", "lead_id", "lead_created_at"} | ||
| target = next(f.name for f in LEAD_SNAPSHOT_FEATURES if f.is_target) | ||
| exempt.add(target) | ||
| for col in df.columns: | ||
| if col in exempt: | ||
| continue | ||
| assert df[col].nunique(dropna=False) >= 2, ( | ||
| f"feature {col!r} has zero variance in the published " | ||
| f"student_public bundle ({df[col].nunique(dropna=False)} distinct value)" | ||
| ) | ||
|
|
||
|
|
Comment on lines
+163
to
+164
| leakage_risk=True, | ||
| redact_in_modes=frozenset({ExposureMode.student_public}), |
Comment on lines
+30
to
+31
| `validate_bundle()` now flags any zero-variance feature in the published | ||
| student_public task split (excluding ID columns and the target). |
Comment on lines
+133
to
+138
| Three columns in the original simulator output were either label-encoding or near-deterministic for the negative class. The exposure layer handles them as follows: | ||
|
|
||
| **Known caveats** (see [PR #56](https://github.com/leadforge-dev/leadforge/pull/56) for the discussion): | ||
| - `current_stage` -- at the 90-day horizon, contains terminal stages (`closed_won`/`closed_lost`) that encode the label directly. **Stripped from `student_public` bundles**; available in `intermediate_instructor/`. | ||
| - `is_sql` -- `is_sql=False` predicts non-conversion with very high probability (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across intro / intermediate / advanced). **Stripped from `student_public` bundles**; available in `intermediate_instructor/`. | ||
| - `is_mql` -- removed entirely (in all modes): every lead is initialised at MQL stage in the simulator, so the column was constant `True` and zero-variance. | ||
| - `total_touches_all` -- counts touches over the full 90-day window, including post-snapshot events. **Deliberately retained** as a pedagogical trap (flagged `leakage_risk=True` in the dictionary). Use it as an exercise: train with and without it, compare AUC, explain the gap. |
Comment on lines
+97
to
+98
| - **Stripped from public bundles:** `current_stage` (label-encoding at the 90-day horizon) and `is_sql` (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across tiers — near-deterministic for non-conversion). Both available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped. | ||
| - **Removed entirely:** `is_mql` (constant `True` in the simulator — zero variance, no information). |
Addresses senior-dev review of #58. Six fixes: 1. **Close the relational-table bypass.** The previous commit redacted columns only from the snapshot/task splits. Users following the README's "Option 3" (feature engineering off `tables/leads.parquet`) could trivially rejoin redacted columns by `account_id`/`contact_id`. Apply `redacted_columns_for(mode)` uniformly to every published parquet under both `tables/` and `tasks/`. In student_public bundles, `tables/leads.parquet` no longer carries `current_stage` or `is_sql`. This also closes the equivalent bypass that PR #56 left open for `current_stage`. 2. **Fully remove `is_mql`.** Previously left as a half-measure: the field was dropped from the snapshot but kept on `LeadRow` and so still appeared as a constant-True column in `tables/leads.parquet`. Now removed from the entity dataclass, `DTYPE_MAP`, `population.py`, and `engine.py`. Gone from every artifact in every mode. 3. **Multi-seed measurement for the is_sql claim.** The PR description's single-seed numbers overstated the case ("99.4% deterministic"). Re-measured across 5 seeds on full-size bundles: P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) / 0.011 ± 0.004 (advanced). Pattern holds, framing is now honest. CHANGELOG and release/README.md updated with the new numbers. 4. **Strengthen the zero-variance test.** The strict `nunique >= 2` guard caught `is_mql` but missed the weaker "rarest class is 1 row" pattern. For low-cardinality columns on bundles ≥ 200 rows, additionally assert the rarest class has ≥ 1% of rows (skipped on tiny test fixtures where the threshold would false-positive). 5. **Add an explicit v3 schema contract.** `tests/render/test_bundle_schema_v3_contract.py` pins the exact column set per mode for both `tasks/` and `tables/leads.parquet`. The duplication is intentional: any future change touching the feature spec, LeadRow, or redaction policy must update this file *and* bump `BUNDLE_SCHEMA_VERSION`. A bare schema-changing PR that doesn't touch this file fails here. 6. **Parameterize the realism boolean test.** `test_detects_non_boolean _feature` no longer hard-codes `opportunity_created`. It picks the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test time so future column renames don't break it for unrelated reasons. Validation knock-on: `_check_stage_distribution` (realism.py) and the seed-drift `current_stage` collection (drift.py) now skip when the column is missing — consistent with the new redaction surface. `check_exposure_monotonicity` extended to allow legitimate column differences in `tables/leads.parquet` between modes (subset of the expected redaction set). Tests: - 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance, -1 removed `is_mql` simulation test, others updated). - ruff + mypy clean. - All 4 release bundles regenerate and validate. - `verify_hash_determinism.py`: 73/73 files identical across builds. Per-bundle column counts after this commit: student_public: task=32, leads.parquet=9 research_instructor: task=34, leads.parquet=11 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
Self-review pass: 8 issues raised, 8 addressed in commit 55e5948
Verification after refactor:
|
|
pr-agent-context report: This run includes unresolved review comments on PR #58 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: leadforge/schema/features.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550070
Root author: copilot-pull-request-reviewer
Comment:
Marking `is_sql` as redacted here only removes it from the snapshot/task exports. `write_bundle()` still publishes `tables/leads.parquet` unchanged in `student_public`, so anyone following the documented raw-table workflow can still recover `is_sql` from the public bundle and get the same near-deterministic signal this PR is trying to remove.
## COPILOT-2
Location: tests/exposure/test_redaction.py:169
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550105
Root author: copilot-pull-request-reviewer
Comment:
This regression check only reads `train.parquet` from the default-difficulty student bundle. The release publishes three student_public tiers and three splits per tier, so a zero-variance column that only shows up in `valid`/`test` or in intro/advanced would still slip through CI.
## COPILOT-3
Location: leadforge/schema/features.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550115
Root author: copilot-pull-request-reviewer
Comment:
There still isn't a test that pins `is_sql` as a required public-bundle redaction. The current redaction assertions derive their expected set from `redacted_columns_for()`, so if a future edit drops `is_sql` from `redact_in_modes`, the writer, validator, and tests will all change in lockstep and CI will stay green.
## COPILOT-4
Location: CHANGELOG.md
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550135
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The release notes say `validate_bundle()` now rejects zero-variance features, but no validator code was added for that. `validate_bundle()` still delegates to the existing realism checks, which do not inspect per-column variance, so this changelog entry overstates the shipped behavior.
## COPILOT-5
Location: release/README.md
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550158
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
These bullets read as if `is_sql` and `is_mql` are absent from the entire public bundle, but `student_public/tables/leads.parquet` is still written unchanged and still contains both columns. That will mislead users following the raw-table feature-engineering workflow documented earlier in this README.
## COPILOT-6
Location: release/HF_DATASET_CARD.md:98
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550179
Root author: copilot-pull-request-reviewer
Comment:
This section currently says `is_sql` is stripped from public bundles and `is_mql` is removed entirely, but the published `student_public/tables/leads.parquet` still contains both fields. Consumers who build features from the relational tables will get a different schema than this card promises.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bundle schema v2 → v3. Closes issue #57 sub-items 2 and 3, and (after the self-review pass in commit 55e5948) closes a redaction-bypass that was already present in v2 from PR #56.
Closes 2 of 3 in #57:
is_sql=Falsenear-deterministic for non-conversion → redact in student_public (now applied to both task splits andtables/leads.parquet)is_mqlconstant True → fully remove from the schema (LeadRow, simulator, all parquets, all modes)Bonus fix in this PR: the redaction now applies to relational tables too, closing a bypass that affected
current_stagefrom v2 onward.Why these changes
is_mqlis constant True in the simulator. Every lead is initialised atcurrent_stage="mql"insimulation/population.py, sois_mql == Truefor 100% of leads. Zero-variance — no information for modelling. Solution: fully remove fromLeadRow,DTYPE_MAP,population.py,engine.py, and the canonical feature list. Gone from every artifact in every mode.is_sql=Falseis near-deterministic for non-conversion. Measured across 5 seeds on full-size bundles:At advanced tier,
is_sql=Falsepredicts non-conversion with ~99% probability across all measured seeds. A student trained their first model on(label = is_sql)would clear the baseline AUC for free — embarrassing on the hardest tier. Solution: tag withredact_in_modes={ExposureMode.student_public}, retain inresearch_instructorfor DGP-aware research.Bypass closure (added in commit 55e5948 after self-review). In v2, redaction applied only to
tasks/converted_within_90_days/*.parquet. Users following the README's "Option 3" (feature engineering off the rawtables/leads.parquet) could trivially joincurrent_stage(or nowis_sql) back into their feature set bylead_id. In v3,redacted_columns_for(mode)is applied uniformly to every published parquet under bothtables/andtasks/. This also retroactively closes the same bypass that PR #56 left open forcurrent_stage.Why bump
BUNDLE_SCHEMA_VERSIONRemoving
is_mqlchanges the canonical feature list in every exposure mode — not just a mode-conditional redaction like PR #56. Combined with the new relational-table redaction surface, downstream consumers may legitimately want to gate on the schema version.The bump was confirmed in chat with the user before commit (not in the PR thread — apologies for the prior PR description's misleading wording).
Bundle column counts (after this PR)
tables/leads.parquetstudent_public/{intro,intermediate,advanced}research_instructor/intermediate_instructormanifest.redacted_columnsfor student_public bundles lists["current_stage", "is_sql"].New automated checks
tests/render/test_bundle_schema_v3_contract.py(8 tests): hardcoded expected column lists per mode for both task splits andtables/leads.parquet. The duplication is the point — any future change to the feature spec,LeadRow, or redaction policy that also changes the published column set must update this contract. A bare schema-changing PR that doesn't touch this file fails here, forcing explicit version coordination.test_no_zero_variance_featuresstrengthened: alongside the strictnunique >= 2guard, on bundles ≥ 200 rows it also asserts low-cardinality columns have rarest class ≥ 1% of rows. Skipped on tiny test fixtures.Files changed
leadforge/schema/entities.py— dropis_mqlfromLeadRow+DTYPE_MAP.leadforge/schema/features.py— dropis_mqlFeatureSpec; addredact_in_modes={student_public}tois_sql.leadforge/simulation/population.py,engine.py— dropis_mql=Trueconstructor calls.leadforge/api/bundle.py— applyredacted_columns_for(mode)to every relational table write before snapshot.leadforge/render/manifests.py— bumpBUNDLE_SCHEMA_VERSIONto"3".leadforge/validation/realism.py,drift.py— skipcurrent_stagechecks when the column is absent (student_public mode).leadforge/validation/invariants.py—check_exposure_monotonicityaccepts legitimate column differences intables/leads.parquet(subset of the redaction set).tests/exposure/test_redaction.py— strengthen the zero-variance test.tests/render/test_bundle_schema_v3_contract.py— new schema contract.tests/schema/test_entities.py,tests/simulation/test_engine.py,tests/simulation/test_population.py,tests/validation/test_realism.py— drop / replaceis_mqlreferences; parameterize boolean-column choice.release/README.md,release/HF_DATASET_CARD.md— table-form leakage handling section, multi-seed numbers.CHANGELOG.md— describe v3 with multi-seed measurements and the relational-table redaction..agent-plan.md— strike sub-items 2 and 3 from Structural leakage in student_public bundles (post-#56 follow-up) #57.Verification
pytest: 911 passing (was 902 before this PR; +9 net: +8 contract, +1 stronger zero-variance, -1 removedis_mqlsim test, others updated)ruff check/ruff format --check/mypy leadforge/: cleanvalidate_bundle()python scripts/verify_hash_determinism.py: 73/73 files identical across two consecutive buildscurrent_stage,is_sql,is_mqlall absent from every student_public artifact (task split + relational table + feature dictionary).Out of scope
Issue #57 sub-item 1: event-aggregate features (
touch_count,session_count,pricing_page_views,expected_acv,days_since_last_touch, ...) are still computed over the same 90-day window the label resolves in. The structural fix is a windowed snapshot rebuild (snapshot_day < label_window_days). That PR will likely bump the schema again and recalibrate documented conversion rates.Test plan
pytest— 911 passingruff check/mypy— cleanbundle_schema_version: "3"and the expectedredacted_columnscurrent_stage/is_sqlintables/leads.parquetfor student_public (new contract test)is_mqlanywhere in any bundle (new contract test)🤖 Generated with Claude Code