
Improve expected_acv realism and teaching affordances for intro lead-scoring datasets #68

@shaypal5


Context

The current lead_scoring_intro_v7 dataset is strong enough for teaching the lead-scoring module: it supports a student-safe file, an instructor leakage file, a realistic day-20 snapshot, top-K ranking, value-aware ranking, cohort/shift analysis, and a causal temporal leakage trap.

However, while preparing the teaching notebook, one issue became visually obvious: the expected_acv distribution looks highly discretized / synthetic, with large spikes around a few repeated values. This is not a blocker for teaching, but it makes the dataset feel less realistic and requires extra explanation in class.

This issue proposes improving the realism and teaching affordances around expected_acv and similar business-value fields.


Problem 1 — expected_acv looks too discretized

The histogram of expected_acv has large spikes around common values. This suggests the value is generated mainly from pricing tiers, company-size bands, or revenue-band midpoints, with limited continuous variation.

That is plausible in B2B SaaS, but the current distribution looks more synthetic than ideal.

For teaching, this is manageable if we explain:

Many B2B SaaS companies estimate ACV using pricing tiers, company-size bands, or sales estimate buckets.

But for future releases, the distribution should feel more like a CRM field: tiered, but not overly rigid.
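One way to make "too discretized" measurable, rather than relying only on the histogram, is to check how much of the column is taken up by its few most frequent exact values. The sketch below is only an illustration: the function name `top_value_share` and the toy samples are hypothetical and are not taken from lead_scoring_intro_v7.

```python
from collections import Counter


def top_value_share(values, k=5):
    """Return the fraction of rows occupied by the k most frequent exact values."""
    if not values:
        return 0.0
    counts = Counter(values)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(values)


# Hypothetical toy samples, not real dataset values.
rigid = [12_000] * 40 + [48_000] * 40 + [120_000] * 20   # three hard tiers
smooth = [12_000 + 37 * i for i in range(100)]           # every value distinct

print(top_value_share(rigid, k=3))   # 1.0
print(top_value_share(smooth, k=3))  # 0.03
```

A share close to 1.0 for a small k matches the "large spikes" symptom; a tiered but realistic CRM field would be expected to land somewhere in between. Any acceptance threshold would have to be chosen by the dataset maintainers.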


Suggested improvement

Generate expected_acv using a layered process:

  1. Start from a base tier or midpoint:

    • company size
    • company revenue
    • product/package tier
    • opportunity stage, if available pre-snapshot
  2. Add realistic variation:

    • discounting / negotiation noise
    • implementation complexity
    • seat-count variation
    • expansion potential
    • buyer segment adjustments
    • random sales-estimation uncertainty
  3. Avoid hard clipping artifacts:

    • use soft winsorization instead of aggressive clipping
    • avoid too many exact values at $120,000
    • preserve the narrative range, but allow smoother spread inside it

Ideal behavior:

  • visible pricing-tier structure
  • no huge artificial bars
  • realistic right-skew / tiered distribution
  • still easy for students to interpret
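The layered process above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the generator's actual implementation: the tier names and midpoints, the noise parameters, the helper names `soft_cap` and `sample_expected_acv`, and the treatment of $120,000 as the upper end of the range are all hypothetical. A tanh-based soft cap stands in here for "soft winsorization".

```python
import math
import random

# Hypothetical tier midpoints in USD; real values would come from generator config.
TIER_MIDPOINTS = {"starter": 12_000, "growth": 36_000, "enterprise": 90_000}


def soft_cap(value, cap, softness=0.25):
    """Compress values smoothly toward `cap` instead of hard-clipping them.

    Values below the knee pass through unchanged; values above it are
    squeezed with tanh so they approach the cap without piling up on it.
    """
    knee = cap * (1.0 - softness)
    if value <= knee:
        return value
    room = cap - knee
    return knee + room * math.tanh((value - knee) / room)


def sample_expected_acv(tier, rng, seats_sigma=0.30, discount_max=0.20, cap=120_000):
    """Layer 1: tier midpoint. Layer 2: multiplicative noise. Layer 3: soft cap."""
    base = TIER_MIDPOINTS[tier]
    seat_factor = rng.lognormvariate(0.0, seats_sigma)  # seat-count / complexity variation
    discount = 1.0 - rng.uniform(0.0, discount_max)     # discounting / negotiation noise
    raw = base * seat_factor * discount
    return round(soft_cap(raw, cap), 2)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
samples = [sample_expected_acv("enterprise", rng) for _ in range(1000)]
exact_cap_share = sum(1 for s in samples if s == 120_000) / len(samples)
print(min(samples), max(samples), exact_cap_share)
```

The multiplicative lognormal factor keeps values positive and gives a right-skewed spread around each tier midpoint, so the tier structure stays visible while the exact repeated values disappear. The soft cap preserves the narrative range without producing a tall bar at the cap.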

Problem 2 — expected_acv has two different roles

In the teaching notebook, expected_acv is used in two ways:

  1. As a possible model feature.
  2. As a business-value field for decision policy:

expected_value = P(convert) × expected_acv

These are conceptually different roles.

A column like expected_acv is not just a numeric feature. It is also a decision-value column. It determines which leads are economically valuable once the model has estimated conversion probability.
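A small worked example may help show why the second role matters. The three leads and their numbers below are made up for illustration; the only thing taken from the issue is the formula expected_value = P(convert) × expected_acv.

```python
# Hypothetical leads: (lead_id, p_convert, expected_acv in USD)
leads = [
    ("A", 0.60, 10_000),
    ("B", 0.25, 80_000),
    ("C", 0.40, 30_000),
]


def rank_by_probability(rows):
    """Order leads by predicted conversion probability only."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [lead_id for lead_id, _, _ in ordered]


def rank_by_expected_value(rows):
    """Order leads by P(convert) * expected_acv."""
    ordered = sorted(rows, key=lambda r: r[1] * r[2], reverse=True)
    return [lead_id for lead_id, _, _ in ordered]


print(rank_by_probability(leads))     # ['A', 'C', 'B']
print(rank_by_expected_value(leads))  # ['B', 'C', 'A']
```

Lead B is the least likely to convert but has the highest expected value (0.25 × 80,000 = 20,000), so the two policies pick opposite leads for top-1. Here expected_acv never enters the model; it only enters the decision.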


Suggested improvement

Add semantic metadata for columns, beyond type:

expected_acv:
  type: numeric
  role: decision_value
  available_at_snapshot: true
  description: Estimated annual contract value if the lead converts
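As a rough sketch of how such metadata might be consumed in a notebook, the snippet below filters columns by role. Only the expected_acv entry mirrors the proposal above; the other columns, the role names `feature` and `target`, and the helper `columns_with_role` are hypothetical placeholders.

```python
# Column metadata as a plain dict; the expected_acv entry follows the proposal,
# the other two entries are hypothetical examples.
COLUMN_METADATA = {
    "expected_acv": {
        "type": "numeric",
        "role": "decision_value",
        "available_at_snapshot": True,
    },
    "company_size": {  # hypothetical column
        "type": "categorical",
        "role": "feature",
        "available_at_snapshot": True,
    },
    "converted": {  # hypothetical target column
        "type": "boolean",
        "role": "target",
        "available_at_snapshot": False,
    },
}


def columns_with_role(metadata, role, snapshot_safe_only=True):
    """Return column names with the given role, optionally only snapshot-safe ones."""
    return [
        name
        for name, meta in metadata.items()
        if meta["role"] == role
        and (meta["available_at_snapshot"] or not snapshot_safe_only)
    ]


feature_columns = columns_with_role(COLUMN_METADATA, "feature")
value_columns = columns_with_role(COLUMN_METADATA, "decision_value")
print(feature_columns)  # ['company_size']
print(value_columns)    # ['expected_acv']
```

With a single-valued `role`, expected_acv is excluded from the feature list by default, which makes using it as a model feature an explicit choice. If both uses need to be declared, `role` could instead hold a list; that is a design decision left open here.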

Labels: bug, enhancement, layer: core, layer: mechanisms, question