Improve expected_acv realism and teaching affordances for intro lead-scoring datasets
Context
The current lead_scoring_intro_v7 dataset is strong enough for teaching the lead-scoring module: it supports a student-safe file, an instructor leakage file, a realistic day-20 snapshot, top-K ranking, value-aware ranking, cohort/shift analysis, and a causal temporal leakage trap.
However, while preparing the teaching notebook, one issue became visually obvious: the expected_acv distribution looks highly discretized / synthetic, with large spikes around a few repeated values. This is not a blocker for teaching, but it makes the dataset feel less realistic and requires extra explanation in class.
This issue proposes improving the realism and teaching affordances around expected_acv and similar business-value fields.
Problem 1 — expected_acv looks too discretized
The histogram of expected_acv has large spikes around common values. This suggests the value is generated mainly from pricing tiers, company-size bands, or revenue-band midpoints, with limited continuous variation.
That is plausible in B2B SaaS, but the current distribution looks more synthetic than ideal.
For teaching, this is manageable if we explain:
Many B2B SaaS companies estimate ACV using pricing tiers, company-size bands, or sales estimate buckets.
But for future releases, the distribution should feel more like a CRM field: tiered, but not overly rigid.
Suggested improvement
Generate expected_acv using a layered process:
- Start from a base tier or midpoint:
  - company size
  - company revenue
  - product/package tier
  - opportunity stage, if available pre-snapshot
- Add realistic variation:
  - discounting / negotiation noise
  - implementation complexity
  - seat-count variation
  - expansion potential
  - buyer segment adjustments
  - random sales-estimation uncertainty
- Avoid hard clipping artifacts:
  - use soft winsorization instead of aggressive clipping
  - avoid too many exact values at $120,000
  - preserve the narrative range, but allow smoother spread inside it
Ideal behavior:
- visible pricing-tier structure
- no huge artificial bars
- realistic right-skew / tiered distribution
- still easy for students to interpret
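The layered process above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the tier midpoints, tier probabilities, noise parameters, and the function names `soft_cap` and `generate_expected_acv` are all hypothetical and are not taken from the current generator. Only the $120,000 upper value comes from this issue. The tanh-based soft cap is one possible stand-in for "soft winsorization", not necessarily the right one.

```python
import numpy as np

# Hypothetical tier midpoints in USD (assumption, not from the real generator).
TIER_MIDPOINTS = np.array([12_000.0, 30_000.0, 60_000.0, 100_000.0])
TIER_PROBS = [0.4, 0.3, 0.2, 0.1]  # assumed mix of tiers
SOFT_CAP = 120_000.0  # upper end of the narrative range mentioned in this issue


def soft_cap(values, cap, knee=0.8):
    """Smoothly compress values above knee * cap toward cap instead of clipping."""
    start = knee * cap
    room = cap - start
    excess = np.maximum(values - start, 0.0)
    # tanh maps [0, inf) onto [0, 1), so outputs approach the cap gradually
    # rather than piling up on one exact value.
    return np.where(values > start, start + room * np.tanh(excess / room), values)


def generate_expected_acv(n, seed=0):
    """Layered draw: tier midpoint, then multiplicative variation, then soft cap."""
    rng = np.random.default_rng(seed)
    # Layer 1: base tier or midpoint.
    tier = rng.choice(len(TIER_MIDPOINTS), size=n, p=TIER_PROBS)
    base = TIER_MIDPOINTS[tier]
    # Layer 2: realistic variation (all parameters are illustrative).
    seat_factor = rng.lognormal(mean=0.0, sigma=0.25, size=n)  # seat-count variation
    discount = 1.0 - 0.5 * rng.beta(2.0, 8.0, size=n)  # discounts up to 50%, mostly small
    estimate_noise = rng.normal(1.0, 0.05, size=n)  # sales-estimation uncertainty
    raw = base * seat_factor * discount * estimate_noise
    # Layer 3: soft cap instead of hard clipping.
    smoothed = soft_cap(raw, SOFT_CAP)
    # Round to the nearest $100, as a hand-entered CRM estimate might be.
    return np.round(smoothed, -2)


acv = generate_expected_acv(10_000, seed=1)
values, counts = np.unique(acv, return_counts=True)
print("share of most common value:", counts.max() / acv.size)
```

Printing the share of the most common value gives a quick check for the "no huge artificial bars" goal. A histogram of `acv` should still show the tier structure, because the multiplicative noise is centered on each midpoint.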
Problem 2 — expected_acv has two different roles
In the teaching notebook, expected_acv is used in two ways:
- As a possible model feature.
- As a business-value field for decision policy:
expected_value = P(convert) × expected_acv
These are conceptually different roles.
A column like expected_acv is not just a numeric feature. It is also a decision-value column. It determines which leads are economically valuable once the model has estimated conversion probability.
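A small illustration of the decision-value role, using made-up numbers: ranking by conversion probability alone and ranking by `expected_value = P(convert) × expected_acv` can pick different leads. The function name `top_k_by_expected_value` and the three example leads are hypothetical and exist only for this sketch.

```python
import numpy as np


def top_k_by_expected_value(p_convert, expected_acv, k):
    """Return indices of the k leads with the highest P(convert) * expected_acv."""
    p = np.asarray(p_convert, dtype=float)
    acv = np.asarray(expected_acv, dtype=float)
    expected_value = p * acv
    # Stable sort on the negated values gives descending order with
    # ties broken by original position.
    order = np.argsort(-expected_value, kind="stable")
    return order[:k], expected_value


# Three made-up leads: likely but small, moderate, unlikely but large.
p_convert = [0.60, 0.20, 0.05]
expected_acv = [10_000, 50_000, 100_000]

top_by_value, ev = top_k_by_expected_value(p_convert, expected_acv, k=1)
top_by_probability = np.argsort(-np.asarray(p_convert), kind="stable")[:1]

print("top-1 by expected value:", top_by_value)
print("top-1 by probability:", top_by_probability)
```

Here the probability-only ranking picks the first lead, while the value-aware ranking picks the second, even though `expected_acv` was never used as a model feature.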
Suggested improvement
Add semantic metadata for columns, beyond type:
expected_acv:
  type: numeric
  role: decision_value
  available_at_snapshot: true
  description: Estimated annual contract value if the lead converts
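As a sketch of how a notebook might consume this metadata, the snippet below selects columns by role. Only the `expected_acv` entry mirrors the proposal above. The `company_size` and `converted` entries, the `feature` and `target` role names, and the helper `columns_with_role` are assumptions added for illustration.

```python
# Column metadata as a plain dict; in practice it might be loaded from a
# YAML or JSON sidecar file shipped with the dataset.
COLUMN_METADATA = {
    "expected_acv": {
        "type": "numeric",
        "role": "decision_value",
        "available_at_snapshot": True,
    },
    # Hypothetical entries, not part of the proposal above.
    "company_size": {
        "type": "categorical",
        "role": "feature",
        "available_at_snapshot": True,
    },
    "converted": {
        "type": "binary",
        "role": "target",
        "available_at_snapshot": False,
    },
}


def columns_with_role(metadata, role, snapshot_only=True):
    """List columns with the given role, optionally only those known at snapshot time."""
    return [
        name
        for name, meta in metadata.items()
        if meta["role"] == role
        and (meta["available_at_snapshot"] or not snapshot_only)
    ]


model_features = columns_with_role(COLUMN_METADATA, "feature")
value_columns = columns_with_role(COLUMN_METADATA, "decision_value")
print(model_features, value_columns)
```

Keeping the role separate from the type lets the notebook treat `expected_acv` as a decision-value column by default, and makes using it as a model feature an explicit choice.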