feat: Milestone 5 — population generation and latent state initialisation #10

Merged: shaypal5 merged 2 commits into main from feat/milestone-5-population-generation on Apr 28, 2026

Conversation

@shaypal5

Summary

  • Adds leadforge/simulation/population.py with build_population() as the single entry point
  • Generates accounts, contacts, and leads with all observable fields, FK consistency guaranteed (lead.account_id == contact.account_id)
  • Initialises LatentState with 8 hidden traits across 3 entity types (account: fit, budget readiness, process maturity; contact: problem awareness, authority, responsiveness, engagement propensity; lead: sales friction)
  • Motif-family biases in _MOTIF_LATENT_BIAS shift latent means to create structurally coherent worlds (e.g. fit_dominant raises latent_account_fit mean)
  • All randomness via RNGRoot named substreams — fully deterministic given (config.seed, world_graph.motif_family)
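
The named-substream idea in the last bullet can be illustrated with a minimal sketch. This is not the RNGRoot implementation, which is not shown in this thread; `substream` and `draw` are hypothetical helpers, and hashing the seed together with the stream name is only one assumed way to derive independent generators.

```python
import hashlib
import random


def substream(seed: int, name: str) -> random.Random:
    """Return a generator whose state depends only on (seed, name)."""
    # Hashing the pair keeps streams with different names decorrelated, and
    # adding a new stream never shifts the draws of an existing one.
    digest = hashlib.sha256(f"{seed}:{name}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def draw(seed: int, name: str, n: int) -> list[float]:
    """Draw n uniform values from the substream keyed by (seed, name)."""
    rng = substream(seed, name)
    return [rng.random() for _ in range(n)]
```

With this layout, `draw(42, "accounts", 3)` returns the same values on every call, while changing either the seed or the stream name yields an unrelated sequence.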

Test plan

  • 26 tests in tests/simulation/test_population.py
  • Entity counts match config n_accounts / n_contacts / n_leads
  • Determinism: same seed → identical rows and latent values
  • FK integrity: contact → account, lead → contact, lead account = contact account
  • All latent values in [0, 1]
  • All expected trait keys present for each entity type
  • Lead observable fields: stage=mql, is_mql=True, is_sql=False, valid channels, valid rep IDs, created_at in base window
  • Account observable fields: industry and region within narrative ICP
  • Motif bias property tests: fit_dominant > buying_committee_friction for latent_account_fit across 15 seeds
  • 358 total tests passing; ruff + mypy clean
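
The FK integrity checks listed above can be sketched as a small validator. The dict-based row shape and the `check_fk_integrity` helper are assumptions made for illustration; the real AccountRow, ContactRow, and LeadRow types and the actual test code are not shown in this thread.

```python
from __future__ import annotations


def check_fk_integrity(
    accounts: list[dict], contacts: list[dict], leads: list[dict]
) -> list[str]:
    """Return a list of human-readable FK violations (empty if consistent)."""
    account_ids = {a["account_id"] for a in accounts}
    contact_to_account = {c["contact_id"]: c["account_id"] for c in contacts}
    errors: list[str] = []

    # contact -> account
    for c in contacts:
        if c["account_id"] not in account_ids:
            errors.append(f"contact {c['contact_id']}: unknown account")

    # lead -> contact, and lead account == contact account
    for lead in leads:
        parent_account = contact_to_account.get(lead["contact_id"])
        if parent_account is None:
            errors.append(f"lead {lead['lead_id']}: unknown contact")
        elif lead["account_id"] != parent_account:
            errors.append(f"lead {lead['lead_id']}: account mismatch")
    return errors
```

Returning all violations instead of raising on the first one makes a failing test report every broken row at once.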

🤖 Generated with Claude Code

…tion

Implements build_population() in leadforge/simulation/population.py:
- AccountRow generation: industry, region, employee/revenue/maturity bands,
  account created_at spread 30-730 days before world base date
- ContactRow generation: persona-driven title/role/buyer_role, conditional
  account FK, contact created_at anchored to parent account
- LeadRow generation: GTM-weighted lead_source, rep assignment from internal
  pool, lead_created_at within 30-day base window; initial stage = mql
- LatentState: 8 hidden traits across 3 entity types, all in [0,1], sampled
  from clipped Gaussians with motif-family-aware mean biases
- FK invariant: lead.account_id always equals contact.account_id
- All randomness via RNGRoot named substreams — fully deterministic

26 tests: counts, determinism, FK integrity, latent range/completeness,
motif bias properties (fit_dominant vs buying_committee_friction), and
observable field validity.
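
As a rough illustration of the latent sampling described above, the sketch below draws a trait from a Gaussian whose mean is shifted per motif family and then clips the result to [0, 1]. The base mean, standard deviation, and bias values are assumptions chosen for the example; the actual contents of _MOTIF_LATENT_BIAS are not shown in this thread.

```python
import random

BASE_MEAN = 0.5   # assumed neutral mean
SIGMA = 0.15      # assumed spread

# Hypothetical bias table: motif family -> trait -> shift of the mean.
EXAMPLE_BIAS = {
    "fit_dominant": {"latent_account_fit": 0.15},
    "buying_committee_friction": {},
}


def sample_trait(rng: random.Random, motif_family: str, trait: str) -> float:
    """Draw one latent value from a mean-shifted Gaussian, clipped to [0, 1]."""
    mean = BASE_MEAN + EXAMPLE_BIAS.get(motif_family, {}).get(trait, 0.0)
    return min(1.0, max(0.0, rng.gauss(mean, SIGMA)))


def mean_trait(seed: int, motif_family: str, trait: str, n: int = 500) -> float:
    """Average of n draws, useful for a motif-bias property check."""
    rng = random.Random(seed)
    return sum(sample_trait(rng, motif_family, trait) for _ in range(n)) / n
```

Note that clipping piles probability mass onto 0 and 1 when the mean sits near a bound; rejection sampling would avoid that at the cost of a variable number of draws.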

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 14:03
@shaypal5 added the labels "type: feature" (New capability) and "layer: simulation" (simulation/ discrete-time engine) on Apr 27, 2026

Copilot AI left a comment

Pull request overview

Introduces a new simulation-layer entry point to generate an initial “world population” (accounts/contacts/leads) alongside per-entity latent traits, with deterministic randomness derived from RNGRoot substreams and motif-family-dependent latent mean shifts.

Changes:

  • Added leadforge/simulation/population.py implementing build_population() plus account/contact/lead generation and LatentState initialization.
  • Added tests/simulation/test_population.py with coverage for counts, determinism, FK integrity, observable field validity, latent trait completeness/ranges, and motif-bias properties.
  • Updated .agent-plan.md to mark Milestone 5 complete and advance the plan to Milestone 6.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| leadforge/simulation/population.py | Adds deterministic population + latent-state generation with motif-family biasing. |
| tests/simulation/test_population.py | Adds comprehensive tests validating population structure, determinism, and trait constraints. |
| .agent-plan.md | Updates milestone tracking and next-task breakdown. |

Comment thread leadforge/simulation/population.py Outdated
Comment thread leadforge/simulation/population.py
Comment thread leadforge/simulation/population.py

- Docstring: correct determinism contract to include narrative and
  world_graph.motif_family (COPILOT-1)
- build_population: add _validate_narrative() up-front guard that raises
  InvalidConfigError for empty industries, geographies, personas, or
  channels (COPILOT-2)
- _channel_weights: fall back to uniform distribution when all GTM shares
  sum to zero, preventing random.choices ValueError (COPILOT-3)
- 5 new tests covering all three fixes (363 total passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
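
The COPILOT-2 and COPILOT-3 fixes can be sketched together: reject an empty channel set up front, and fall back to uniform weights when every share is zero, since `random.choices` raises ValueError when the weights sum to zero. The function signatures and the local `InvalidConfigError` stand-in are assumptions; the project's real `_channel_weights` and exception class are not shown in this thread.

```python
import random


class InvalidConfigError(ValueError):
    """Local stand-in for the project's exception of the same name."""


def channel_weights(shares: dict[str, float]) -> tuple[list[str], list[float]]:
    """Return (channels, weights), using uniform weights if all shares are zero."""
    if not shares:
        raise InvalidConfigError("at least one channel is required")
    channels = list(shares)
    weights = [max(0.0, shares[name]) for name in channels]
    if sum(weights) == 0:
        # All-zero shares carry no preference, so treat every channel equally.
        weights = [1.0] * len(channels)
    return channels, weights


def pick_channel(rng: random.Random, shares: dict[str, float]) -> str:
    """Pick one channel according to its (possibly defaulted) weight."""
    channels, weights = channel_weights(shares)
    return rng.choices(channels, weights=weights, k=1)[0]
```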
@github-actions

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #10. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: commit pushed
Workflow run: 25032051451 attempt 1
Comment timestamp: 2026-04-28T03:20:30.335940+00:00
PR head commit: 0559a83d15f375cdfa53f9d4abb3b38ed15e91e4

@shaypal5 merged commit 2bc3566 into main on Apr 28, 2026
5 checks passed
@shaypal5 deleted the feat/milestone-5-population-generation branch on April 28, 2026 03:26
shaypal5 added a commit that referenced this pull request May 6, 2026
* PR 5.2: HuggingFace release packager + load_dataset smoke test

Add `scripts/package_hf_release.py` to generate `release/huggingface/README.md`
with G12.1-compliant YAML frontmatter (pretty_name, license, language,
task_categories, size_categories, tags, three configs with `default: true`
on intermediate per G12.2), inlining the rewritten `release/README.md`
body with HF-specific link rewrites.  `--variant=instructor` packages the
companion repo (G12.4) from `release/intermediate_instructor/` into a
separate `release/huggingface-instructor/` upload tree.  G12.3 covered
by a parametrised `load_dataset()` smoke test gated on the optional
`datasets` SDK.

Extract shared release-packaging primitives (link rewriter, dir-safety
guard, cover-image validator) into `scripts/_release_common.py`; refactor
the Kaggle packager to import them.  `release/kaggle/dataset-metadata.json`
is byte-stable across the refactor.

Delete the legacy `release/HF_DATASET_CARD.md` stub — superseded by the
generated card.  Gitignore `release/huggingface{,-instructor}/*` except
the committed README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 5.2 self-review fixes (Kaggle + HF packagers)

Fold the brutal self-review's findings back into the PR before review.

Bugs:
- (#1) run_packager validate→write order — both packagers wrote
  README/metadata on validation failure, leaving corrupt artifacts on
  disk that would silently get committed.  Gated on `errors == ()`;
  added no-write tests for both packagers.
- (#2) Instructor README inlined the public 3-tier README into a
  1-tier dataset card.  Replaced with a dedicated `INSTRUCTOR_BODY`
  constant that links to the public dataset and describes only the
  instructor-specific additions (full-horizon tables, hidden DAG,
  latent registry, mechanism summary).
- (#3) validate_upload_dir_safe also blocks strict descendants of
  release_dir; `--huggingface-dir release/intro` would otherwise
  rmtree the intro bundle.
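
The descendant check in (#3) comes down to a path-containment test, sketched below. `is_strict_descendant` is a hypothetical helper; how the real validate_upload_dir_safe decides which locations under release_dir remain allowed (the generated upload trees also live under release/) is not shown in this thread.

```python
from pathlib import Path


def is_strict_descendant(candidate: Path, parent: Path) -> bool:
    """True if candidate lies strictly inside parent after resolving both."""
    # resolve() normalises ".." segments and symlinks, so a path such as
    # "release/../other" is not mistaken for something inside "release".
    resolved_candidate = candidate.resolve()
    resolved_parent = parent.resolve()
    return resolved_parent in resolved_candidate.parents
```

A guard built on this would refuse to rmtree or overwrite a target such as `release/intro` before any filesystem change happens.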

Architecture:
- (#5) Finished shared-primitives extraction: SOURCE_TREE_BLOCK,
  validate_readme_substitution, replace_file, replace_dir,
  load_manifest now live in scripts/_release_common.py.  Both
  packagers reduced to imports.
- (#6) Replaced 60-line hand-rolled YAML renderer with yaml.safe_dump
  + a 4-line _IndentedDumper subclass.
- (#7) Removed dead --owner / --dataset-slug CLI flags.
- (#8) assemble_upload_dir now takes rendered_readme and writes it.
- (#9) build_config_for_tier made pure (no I/O); cheap manifest-stat
  preflight via _assert_tier_dir_exists.
- (#10) --default-config with --variant=instructor errors loudly.

CI:
- (#4) Added [publish] extra (datasets>=2.14, kaggle>=1.6) so the
  gated G12.3 / G12.4 / G11.3 tests install in one line.

Cleanups: visual cruft (#13-#16), test cruft (#17 — unused tmp_path,
dead tag_lines), em-dash YAML round-trip parametrised for the
instructor pretty_name.

Verification: 1223 tests pass + 5 gated skips; ruff + mypy clean;
hash determinism PASS 67/67; leakage probes 0/3 reconstruct on every
tier; validate_release_candidate --no-rebuild exits 0.
release/{kaggle,huggingface,huggingface-instructor}/dataset-metadata
.json|README.md regenerated; audit-artifact-sync tests guard them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 5.2 Copilot-review fixes (Kaggle + HF packagers)

Fold Copilot's two real findings on the self-review revision back in.

COPILOT-1 — validate_upload_dir_safe was only invoked inside
assemble_upload_dir, which --dry-run skips.  A dry-run with
--huggingface-dir release (or .) would write the README into the
unsafe path BEFORE the safety net fired.  Hoist the check into
run_packager (both packagers) so it runs before any mkdir or write;
the inner assemble_upload_dir call stays as defence-in-depth for
direct callers.  New tests: dry-run with unsafe upload-dir raises
without writing; the same path through main() returns rc=2.

COPILOT-2 — Cover-image path resolution was inconsistent:
validate_cover_image used cover_image as passed, while
assemble_upload_dir did a separate ``release_dir / cover_image.name``
fallback.  Diverged for bare-basename inputs (false validation
failures) and two-paths-sharing-a-basename (assembler shadowing the
explicit path).  Added resolve_cover_image_path() to
_release_common.py (explicit-wins, release-dir fallback);
run_packager calls it once and threads the resolved path through
validation, the metadata's image field, and assembly.  New
tests/scripts/test_release_common.py covers the four resolution
branches; new packager-side tests confirm bare-basename success +
metadata field plumbing.
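
The "explicit-wins, release-dir fallback" rule from COPILOT-2 can be sketched as follows. The signature and the behaviour for a file that exists in neither place are assumptions; the real resolve_cover_image_path() in scripts/_release_common.py is not shown in this thread.

```python
import tempfile
from pathlib import Path


def resolve_cover_image(cover_image: Path, release_dir: Path) -> Path:
    """Resolve once, so validation and assembly see the same path."""
    if cover_image.exists():
        return cover_image            # the explicit path wins
    fallback = release_dir / cover_image.name
    if fallback.exists():
        return fallback               # a bare basename resolves in release_dir
    return cover_image                # assumed: let validation report it missing


with tempfile.TemporaryDirectory() as tmp:
    release = Path(tmp) / "release"
    other = Path(tmp) / "other"
    release.mkdir()
    other.mkdir()
    (release / "gr_cover_demo.png").write_bytes(b"")
    (other / "gr_cover_demo.png").write_bytes(b"")

    explicit_wins = (
        resolve_cover_image(other / "gr_cover_demo.png", release)
        == other / "gr_cover_demo.png"
    )
    bare_uses_fallback = (
        resolve_cover_image(Path("gr_cover_demo.png"), release)
        == release / "gr_cover_demo.png"
    )
    missing_unchanged = (
        resolve_cover_image(Path("gr_missing_demo.png"), release)
        == Path("gr_missing_demo.png")
    )
```

Resolving in one place and threading the result through every consumer removes the divergence between validator and assembler that the finding describes.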

COPILOT-3 — outdated; already addressed by self-review fix #8 in
commit f2fc4a2.  Resolved as already treated; no code change.

Verification: 1232/1232 tests pass + 5 gated skips; ruff + mypy
clean; hash determinism PASS 67/67; leakage probes rc=0 on every
tier; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5.
release/{kaggle,huggingface,huggingface-instructor}/* artifacts
regenerated byte-identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>