Skip to content

Latest commit

 

History

History
242 lines (172 loc) · 12.1 KB

File metadata and controls

242 lines (172 loc) · 12.1 KB

Eval Guide

Internal tooling for measuring how reliably a model + backend combo navigates multi-step tool-calling workflows. Not a test suite — run manually against a live backend.

Eval Harness

Quick Start

# Ollama — all scenarios, 10 runs each
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --stream --verbose

# llama-server — start server in one terminal, run eval in another
llama-server --jinja -m path/to/Ministral-3-14B-Instruct-2512-Q4_K_M.gguf -ngl 999 --port 8080
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode native --gguf path/to/Ministral-3-14B-Instruct-2512-Q4_K_M.gguf --runs 10 --stream --verbose

# Anthropic API
python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20251001 --runs 5 --stream --verbose

eval_runner Flags

Flag Values Default Description
--backend ollama, llamafile, anthropic ollama Backend to target
--model string (required for ollama/anthropic) Model name (Ollama-style or Anthropic model ID). Rejected for llamafile (use --gguf).
--gguf path (required for llamafile) Path to GGUF / llamafile model file. Rejected for ollama/anthropic (use --model).
--runs int 10 Runs per scenario
--stream flag off Use streaming mode
--verbose, -v flag off Print live per-message trace
--tags plumbing, model_quality, advanced_reasoning, compaction, stateful, reasoning, error_recovery all Filter scenarios by tag
--scenario name(s) all Run specific scenario(s) by name
--llamafile-mode native, prompt, auto auto FC mode for llamafile/llama-server backend
--think true, false, auto auto Thinking mode. Ollama: controls think param. Llamafile: captures [THINK] tags and reasoning_content
--budget-mode backend, manual, forge-full, forge-fast forge-full Context budget strategy. Compaction scenarios always override with their own budget
--num-ctx int none Exact token budget (requires --budget-mode manual)
--no-history flag off Disable message history collection (lighter, fewer metrics)
--probe flag off Print resolved budget from backend and exit (no eval run)
--base-url URL none Override backend base URL
--ablation reforged, no_rescue, no_nudge, no_steps, no_recovery, no_compact, bare reforged Ablation preset: selectively disable guardrails
--tool-choice auto, any none Anthropic tool_choice type. any forces tool calls
--no-cache-prompt flag off Disable llama-server prompt caching
--compact-strategy tiered, sliding, none auto Override compaction strategy for all scenarios

Scenarios

30 scenarios across five categories. The 26 non-compaction scenarios split into two difficulty tiers — OG-18 (baseline) and advanced_reasoning (hard) — with the dashboard's Suite scope filtering between them.

Plumbing (does forge's tool-calling loop work?):

  • basic_2step, sequential_3step, error_recovery

Model quality (does the model reason correctly?):

  • tool_selection, argument_fidelity, sequential_reasoning, conditional_routing, data_gap_recovery, relevance_detection

Advanced reasoning (top-tier separators — designed to weed out 8B-class winners after sampling-defaults closed the OG-18 gap):

  • data_gap_recovery_extended, argument_transformation, inconsistent_api_recovery, grounded_synthesis

Compaction chain (multi-phase compaction retention):

  • compaction_chain_baseline, compaction_chain_p1, compaction_chain_p2, compaction_chain_p3

Stateful variants (state carries between calls — wrong arguments cascade):

  • All scenarios above (except compaction chain) ship a _stateful pair: basic_2step_stateful, sequential_3step_stateful, error_recovery_stateful, tool_selection_stateful, argument_fidelity_stateful, sequential_reasoning_stateful, conditional_routing_stateful, data_gap_recovery_stateful, relevance_detection_stateful, data_gap_recovery_extended_stateful, argument_transformation_stateful, inconsistent_api_recovery_stateful, grounded_synthesis_stateful.

Lambda vs stateful: Lambda scenarios use hardcoded echo tools — tool arguments don't affect the result. Stateful scenarios use backend classes where arguments matter and state carries between calls. The delta between lambda and stateful scores for the same model isolates model reasoning quality from forge correctness.

OG-18 vs advanced_reasoning: OG-18 is the 18-scenario baseline (plumbing + model_quality + their stateful pairs). advanced_reasoning is the 8 scenarios tagged for top-tier-only batching. Most published results split aggregates across the two; see MODEL_GUIDE.md for context.

Examples

# Filter by tag
python -m tests.eval.eval_runner --backend ollama --model "qwen3:8b-q4_K_M" --runs 5 --tags plumbing

# Specific scenarios
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --scenario basic_2step sequential_3step

# Qwen3 with thinking on llama-server
llama-server --jinja -m path/to/Qwen3-8B-Q4_K_M.gguf -ngl 999 --port 8080 --reasoning-format auto
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode native --gguf path/to/Qwen3-8B-Q4_K_M.gguf --runs 10 --stream --think true

# Probe budget without running eval
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --probe

# Ablation — bare (all guardrails off)
python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20251001 --runs 5 --stream --ablation bare

All OG-18 non-stateful scenarios (copy-paste friendly):

--scenario basic_2step sequential_3step error_recovery tool_selection argument_fidelity sequential_reasoning conditional_routing data_gap_recovery relevance_detection

All OG-18 stateful scenarios:

--scenario basic_2step_stateful sequential_3step_stateful error_recovery_stateful tool_selection_stateful argument_fidelity_stateful sequential_reasoning_stateful conditional_routing_stateful data_gap_recovery_stateful relevance_detection_stateful

All advanced_reasoning scenarios (lambda + stateful, 8 total):

--scenario data_gap_recovery_extended argument_transformation inconsistent_api_recovery grounded_synthesis data_gap_recovery_extended_stateful argument_transformation_stateful inconsistent_api_recovery_stateful grounded_synthesis_stateful

Or via tag (equivalent):

--tags advanced_reasoning

Batch Eval

Run large-scale model comparisons across all backends. Results append to JSONL with automatic resume. Ollama auto-loads models, llama-server is auto-managed (start/stop/health check per GGUF), llamafile binaries require a manual server.

batch_eval Flags

Flag Values Default Description
--config all, ollama, llamaserver, llamafile, llamaserver-native, llamaserver-prompt, anthropic, anthropic-any, haiku, sonnet, opus, haiku-any, sonnet-any, opus-any all Config set to run
--runs int 50 Runs per scenario
--output path eval_results.jsonl JSONL output path
--scenario name(s) all Run specific scenario(s)
--tags tag(s) all Filter scenarios by tag
--budget-mode backend, manual, forge-full, forge-fast forge-full Context budget strategy
--num-ctx int none Exact token budget (requires --budget-mode manual)
--ablation preset name reforged Ablation preset
--model substring none Filter configs to models containing this substring
--dry-run flag off Show what would run without executing
--verbose, -v flag off Print per-run details

Examples

# Ollama (11 models, fully unattended)
python -m tests.eval.batch_eval --config ollama --runs 50

# llama-server (auto-managed, starts/stops per GGUF)
python -m tests.eval.batch_eval --config llamaserver --runs 50

# Anthropic (costs money)
python -m tests.eval.batch_eval --config anthropic --runs 50

# Dry run
python -m tests.eval.batch_eval --config all --runs 50 --dry-run

# Filter to specific model
python -m tests.eval.batch_eval --config llamaserver --model 8b-reasoning --runs 20

# Specific scenarios only
python -m tests.eval.batch_eval --config ollama --runs 50 --scenario basic_2step sequential_reasoning

Resume is automatic: re-run the same command and it skips completed scenarios.


Reports

Forge eval report

# Full table + list
python -m tests.eval.report eval_results.jsonl

# Progress (for incomplete runs)
python -m tests.eval.report eval_results.jsonl --progress

# Compact list only (phone-friendly)
python -m tests.eval.report eval_results.jsonl --list-only

# Include partially-completed configs
python -m tests.eval.report eval_results.jsonl --include-partial

# Filter by ablation
python -m tests.eval.report eval_results.jsonl --ablation reforged bare

# Filter by scenario tag
python -m tests.eval.report eval_results.jsonl --tags stateful

# Exclude specific scenarios
python -m tests.eval.report eval_results.jsonl --exclude-scenario error_recovery

# HTML dashboard (requires Node.js)
python -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html

# Markdown views
python -m tests.eval.report eval_results.jsonl --markdown docs/results/

# Both
python -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html --markdown docs/results/

report Flags

Flag Values Default Description
jsonl path eval_results.jsonl JSONL input file (positional, optional)
--list-only flag off Skip table, show list view only
--progress flag off Show progress for all configs (including incomplete)
--include-partial flag off Include configs that haven't finished all scenarios
--ablation preset name(s) all Filter to specific ablation preset(s)
--exclude-scenario name(s) none Exclude scenario(s) from aggregates and columns
--tags stateful, lambda, compaction all Filter to scenarios matching tag(s)
--html path none Write interactive HTML dashboard
--markdown dir none Write pre-filtered markdown views

BFCL Benchmark (removed)

Forge previously included a Berkeley Function Calling Leaderboard v4 integration (11 categories, ~2,183 entries). It was removed in favor of forge's own eval harness, which measures multi-step workflow completion rather than single-call argument matching. Last commit with BFCL code: a9b0257.


Ablation Presets

Ablation selectively disables forge guardrails to isolate their contribution to model performance.

Preset Rescue Retry Nudge Step Enforcement Error Recovery Compaction
reforged yes yes (5 retries) yes yes (2 errors) yes
no_rescue no yes yes yes yes
no_nudge no no yes yes yes
no_steps yes yes no yes yes
no_recovery yes yes yes no yes
no_compact yes yes yes yes no
bare no no no no no

Backend Notes

See BACKEND_SETUP.md for installation, server launch, and verification instructions for each backend (Ollama, llama-server, llamafile).

Key points for eval:

  • Ollama runs as a background service — no manual server launch needed
  • llama-server needs --jinja for native function calling; use --backend llamafile --llamafile-mode native
  • llamafile has no native FC — use --llamafile-mode prompt
  • Anthropic needs ANTHROPIC_API_KEY env var; compaction scenarios are skipped (200K context)