Improve expected_acv realism and teaching affordances for intro lead-scoring datasets

## Improve expected_acv realism and teaching affordances for intro lead-scoring datasets

### Context

The current `lead_scoring_intro_v7` dataset is strong enough for teaching the lead-scoring module: it supports a student-safe file, an instructor leakage file, a realistic day-20 snapshot, top-K ranking, value-aware ranking, cohort/shift analysis, and a causal temporal leakage trap.

However, while preparing the teaching notebook, one issue became visually obvious: the `expected_acv` distribution looks highly discretized / synthetic, with large spikes around a few repeated values. This is not a blocker for teaching, but it makes the dataset feel less realistic and requires extra explanation in class.

This issue proposes improving the realism and teaching affordances around `expected_acv` and similar business-value fields.

---

### Problem 1 — `expected_acv` looks too discretized

The histogram of `expected_acv` shows large spikes at a small number of repeated values. This suggests the value is generated mainly from pricing tiers, company-size bands, or revenue-band midpoints, with little continuous variation.

That is plausible in B2B SaaS, but the current distribution looks more synthetic than ideal.

For teaching, this is manageable if we explain:

> Many B2B SaaS companies estimate ACV using pricing tiers, company-size bands, or sales estimate buckets.

But for future releases, the distribution should feel more like a CRM field: tiered, but not overly rigid.
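To make "too discretized" measurable rather than a visual impression, one option is to report the share of rows that sit on the few most frequent exact values. The sketch below is only illustrative: `top_value_share` is a hypothetical helper, and the sample values are made up for the example, not taken from `lead_scoring_intro_v7`.

```python
from collections import Counter


def top_value_share(values, k=5):
    """Return the fraction of rows that fall on the k most frequent exact values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    rows_on_top_values = sum(n for _, n in counts.most_common(k))
    return rows_on_top_values / len(values)


# Hypothetical sample (not real data): three repeated values dominate,
# with only a handful of rows showing continuous variation.
sample = (
    [12_000] * 40
    + [48_000] * 35
    + [120_000] * 20
    + [13_250, 51_900, 47_100, 98_400, 15_075]
)

share = top_value_share(sample, k=3)
print(f"Share of rows on the top 3 values: {share:.0%}")
```

A share close to 1.0 for a small `k` points to the kind of spikes described above. Any threshold for "too spiky" would be a judgment call for the dataset maintainers.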

---

### Suggested improvement

Generate `expected_acv` using a layered process:

1. Start from a base tier or midpoint:
   - company size
   - company revenue
   - product/package tier
   - opportunity stage, if available pre-snapshot

2. Add realistic variation:
   - discounting / negotiation noise
   - implementation complexity
   - seat-count variation
   - expansion potential
   - buyer segment adjustments
   - random sales-estimation uncertainty

3. Avoid hard clipping artifacts:
   - use soft winsorization instead of aggressive clipping
   - avoid piling many rows onto one exact value such as `$120,000`
   - preserve the narrative range, but allow smoother spread inside it

Ideal behavior:
- visible pricing-tier structure
- no huge artificial bars
- realistic right-skew / tiered distribution
- still easy for students to interpret
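The layered process above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dataset's actual generator: the tier names and midpoints, the noise parameters, the discount range, and the `knee` / `ceiling` values are all hypothetical, and the rational `soft_cap` function is just one possible reading of "soft winsorization".

```python
import random

# Hypothetical tier midpoints in dollars; the real tiers are not stated in this issue.
TIER_MIDPOINTS = {"starter": 12_000, "growth": 48_000, "enterprise": 110_000}


def soft_cap(x, knee, ceiling):
    """Leave values up to `knee` unchanged; compress larger values smoothly.

    The output approaches `ceiling` but never reaches it, so no pile of rows
    forms at the cap the way it does with hard clipping.
    """
    if x <= knee:
        return x
    span = ceiling - knee
    excess = x - knee
    return knee + span * excess / (excess + span)


def sample_expected_acv(tier, rng, sigma=0.25, knee=100_000, ceiling=120_000):
    """Draw one expected_acv value for a tier (all parameters are assumptions)."""
    base = TIER_MIDPOINTS[tier]                 # step 1: base tier midpoint
    noise = rng.lognormvariate(0.0, sigma)      # step 2: multiplicative estimation noise
    discount = 1.0 - rng.uniform(0.0, 0.15)     # step 2: assumed 0-15% negotiation discount
    raw = base * noise * discount
    return round(soft_cap(raw, knee, ceiling))  # step 3: soft cap, whole dollars


rng = random.Random(0)  # fixed seed so the sketch is reproducible
enterprise_values = [sample_expected_acv("enterprise", rng) for _ in range(2000)]
print(min(enterprise_values), max(enterprise_values), len(set(enterprise_values)))
```

The multiplicative noise keeps each tier visible as a cluster while spreading values within it. Because the soft cap only approaches the ceiling, the top of the range stays bounded without producing a single tall bar.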

---

### Problem 2 — `expected_acv` has two different roles

In the teaching notebook, `expected_acv` is used in two ways:

1. As a possible model feature.
2. As a business-value field for decision policy:

`expected_value = P(convert) × expected_acv`

These are conceptually different roles.

A column like `expected_acv` is not just a numeric feature. It is also a **decision-value column**. It determines which leads are economically valuable once the model has estimated conversion probability.
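A small example may help show why the two roles differ. The leads, probabilities, and ACV figures below are invented for illustration, and `rank_by` is a hypothetical helper. The point is only that ranking by `P(convert)` and ranking by `P(convert) × expected_acv` can disagree.

```python
# Hypothetical leads: p_convert stands in for a model's predicted probability.
leads = [
    {"lead_id": "A", "p_convert": 0.60, "expected_acv": 10_000},
    {"lead_id": "B", "p_convert": 0.25, "expected_acv": 90_000},
    {"lead_id": "C", "p_convert": 0.40, "expected_acv": 30_000},
]


def rank_by(rows, key):
    """Return lead ids sorted from highest to lowest score."""
    return [row["lead_id"] for row in sorted(rows, key=key, reverse=True)]


# Probability-only ranking: expected_acv plays no part in the decision.
by_probability = rank_by(leads, key=lambda row: row["p_convert"])

# Value-aware ranking: expected_acv acts as the decision-value column.
by_expected_value = rank_by(
    leads, key=lambda row: row["p_convert"] * row["expected_acv"]
)

print(by_probability)     # ['A', 'C', 'B']
print(by_expected_value)  # ['B', 'C', 'A']
```

Lead B is the least likely to convert but has the highest expected value, so it moves from last to first once `expected_acv` is used as a decision-value column.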

---

### Suggested improvement

Add semantic metadata for columns, beyond type:

```yaml
expected_acv:
  type: numeric
  role: decision_value
  available_at_snapshot: true
  description: Estimated annual contract value if the lead converts
```
