leadforge

Opinionated framework for generating synthetic CRM and GTM datasets from simulated commercial worlds.

leadforge generates narrative-grounded synthetic revenue datasets — starting with lead scoring — designed for teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world: a specific company, selling a specific product, to a specific kind of buyer, and renders realistic CRM-style outputs from that world.

Docs: leadforge-dev.github.io/leadforge · Dataset: HuggingFace · Kaggle: Intro · Intermediate · Advanced

What Makes LeadForge Different

World-first generation: datasets are rendered from simulated companies, products, buyers, activities, opportunities, and outcomes.
Relational CRM shape: output includes normalized tables plus task-ready train/validation/test splits for lead scoring.
Pedagogical realism: snapshot discipline, redaction modes, leakage traps, calibration issues, and difficulty tiers are deliberate teaching material.

Installation

Requires Python 3.11+.

pip install leadforge

Or install directly from GitHub:

pip install git+https://github.com/leadforge-dev/leadforge.git

For development:

git clone https://github.com/leadforge-dev/leadforge.git
cd leadforge
pip install -e ".[dev]"
pre-commit install

Quickstart

CLI

# List available recipes
leadforge list-recipes

# Generate a dataset bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 \
  --out ./out/demo_bundle

# Inspect bundle metadata
leadforge inspect ./out/demo_bundle

# Or pipe the manifest into jq
leadforge inspect ./out/demo_bundle --json | jq .snapshot_day

# Validate bundle integrity
leadforge validate ./out/demo_bundle

Python API

from leadforge.api import Generator

gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(n_leads=5000, difficulty="intermediate")
bundle.save("./out/demo_bundle")
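
Once saved, a bundle is plain files on disk, so its metadata can be read without importing leadforge. The following is a minimal sketch, not part of the leadforge API. It assumes `manifest.json` sits at the bundle root (as in the Output Bundle layout) and carries the `snapshot_day` key that the CLI example pipes into `jq`. The helper names `read_manifest` and `snapshot_day` are hypothetical, and no other manifest keys are assumed.

```python
import json
from pathlib import Path


def read_manifest(bundle_root):
    """Load manifest.json from a bundle directory (hypothetical helper)."""
    manifest_path = Path(bundle_root) / "manifest.json"
    with manifest_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def snapshot_day(bundle_root):
    """Return the manifest's snapshot_day entry, or None if the key is absent."""
    return read_manifest(bundle_root).get("snapshot_day")
```

For the bundle written above, `snapshot_day("./out/demo_bundle")` should mirror what `leadforge inspect ./out/demo_bundle --json | jq .snapshot_day` prints, provided the CLI emits the manifest unchanged.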

Generated Data Preview

A generated bundle looks like CRM and GTM data, not a generic tabular benchmark. This compact slice comes from the intermediate lead-scoring bundle:

split	industry	region	employee_band	lead_source	touch_count	session_count	opportunity_created	expected_acv	converted_within_90_days
train	logistics	UK	200-499	inbound_marketing	0	0	False	66,699	False
train	logistics	UK	500-999	inbound_marketing	5	2	False	58,372	False
train	logistics	US	200-499	partner_referral	9	3	True	15,462	False
train	healthcare_non_clinical	US	200-499	inbound_marketing	5	1	True	30,490	False
train	manufacturing	US	1000-1999	sdr_outbound	missing	1	True	42,999	False

The full bundle also includes accounts, contacts, leads, touches, sessions, sales activities, opportunities, feature dictionaries, manifests, and model-ready Parquet task splits.

Exposure Modes

Control how much of the underlying ground truth is exposed in the output bundle:

Mode	Purpose	Includes
`student_public`	Teaching / portfolio use	Tables, features, task splits, dataset card
`research_instructor`	Full truth for instructors / researchers	All of the above + hidden graph, world spec, latent registry, mechanism summary

Set via --mode on the CLI or exposure_mode= in the Python API.

Difficulty Profiles

Each recipe ships with difficulty profiles that control the signal-to-noise ratio:

Profile	Description
`intro`	Strong signal, low noise — good for first-time learners
`intermediate`	Moderate signal, realistic noise
`advanced`	Weak signal, high noise — challenges experienced practitioners

Set via --difficulty on the CLI or difficulty= in generate().
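
To compare the tiers side by side, one option is to generate one bundle per profile while holding the recipe, seed, and lead count fixed. The sketch below only builds the command lines and runs nothing. The flags are the ones shown in the Quickstart. The helper name `build_generate_command` and the `<difficulty>_bundle` output naming are assumptions made for illustration.

```python
import shlex

DIFFICULTIES = ("intro", "intermediate", "advanced")


def build_generate_command(difficulty, out_root="./out", seed=42, n_leads=5000):
    """Build the argv for one `leadforge generate` run (nothing is executed)."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    return [
        "leadforge", "generate",
        "--recipe", "b2b_saas_procurement_v1",
        "--seed", str(seed),
        "--mode", "student_public",
        "--difficulty", difficulty,
        "--n-leads", str(n_leads),
        # Output directory naming is a hypothetical convention.
        "--out", f"{out_root}/{difficulty}_bundle",
    ]


commands = [build_generate_command(d) for d in DIFFICULTIES]
for argv in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to run the generation.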

Output Bundle

bundle_root/
  manifest.json            # provenance, row counts, file hashes
  dataset_card.md          # human-readable dataset documentation
  feature_dictionary.csv   # feature names, types, descriptions
  tables/                  # 9 relational Parquet tables
  tasks/
    converted_within_90_days/
      train.parquet
      valid.parquet
      test.parquet
      task_manifest.json
  metadata/                # (research_instructor only) hidden graph, world spec, latents
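
As a quick sanity check before loading anything, the layout above can be turned into a list of expected paths. This is a rough sketch with a hypothetical helper, `missing_paths`. It checks only that the entries exist, and it is not a substitute for `leadforge validate`.

```python
from pathlib import Path

TASK = "converted_within_90_days"

# Paths taken from the layout above, relative to the bundle root.
COMMON_PATHS = [
    "manifest.json",
    "dataset_card.md",
    "feature_dictionary.csv",
    "tables",
    f"tasks/{TASK}/train.parquet",
    f"tasks/{TASK}/valid.parquet",
    f"tasks/{TASK}/test.parquet",
    f"tasks/{TASK}/task_manifest.json",
]


def missing_paths(bundle_root, mode="student_public"):
    """Return the layout entries that are absent under bundle_root."""
    expected = list(COMMON_PATHS)
    if mode == "research_instructor":
        # metadata/ is listed as research_instructor only.
        expected.append("metadata")
    root = Path(bundle_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_paths("./out/demo_bundle")` means every listed entry is present. Passing `mode="research_instructor"` also requires `metadata/`.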

Key Design Principles

Deterministic: same (recipe, seed, version) → identical output.
Relational-first: 9 normalized tables; flat ML exports are derived.
No external APIs: core generation never requires network access.
Simulation-driven labels: converted_within_90_days emerges from simulated events, not sampled directly.
Leakage-safe: no feature uses events after the snapshot anchor.
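
The leakage-safe principle can be illustrated with a toy sketch. The event tuples, the `touch_count_at_snapshot` helper, and the day numbers are hypothetical and do not reflect leadforge internals. The only idea taken from the list above is that a feature counts events up to the snapshot anchor and ignores anything later.

```python
def touch_count_at_snapshot(events, snapshot_day):
    """Count touch events observed on or before the snapshot day."""
    return sum(
        1
        for day, kind in events
        if kind == "touch" and day <= snapshot_day
    )


# (day, kind) pairs for one hypothetical lead.
events = [
    (2, "touch"),
    (10, "session"),
    (18, "touch"),
    (45, "touch"),       # happens after the snapshot below
    (60, "conversion"),
]

leak_free = touch_count_at_snapshot(events, snapshot_day=30)
leaky = sum(1 for _, kind in events if kind == "touch")  # ignores the anchor

print(leak_free, leaky)  # 2 3
```

The leaky count includes the day-45 touch, which would not be known at the snapshot and sits closer to the eventual conversion. This is the kind of post-snapshot signal the principle rules out.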

Documentation

Full documentation: leadforge-dev.github.io/leadforge

Development

pip install -e ".[dev]"
pytest                        # run all tests (~800)
ruff check .                  # lint
ruff format .                 # format
mypy leadforge/               # type check
pre-commit run --all-files    # full pre-commit suite

License

MIT. See LICENSE.

Credits

Created by Shay Palachy Affek.
