AgentLens

Developed at MATS Exploration Phase under Neel Nanda, for a research project with Greg Kocher.

A harness for running multi-session agent trajectories across multiple engines (Claude Code and OpenAI Codex), capturing them in ATIF (Agent Trajectory Interchange Format), and tracking file state changes across sessions.

Built for AI alignment and interpretability research — studying how LLM agents behave across multi-turn, multi-session, multi-agent interactions.

Note: AgentLens supports two engines — Claude Code (via the Claude Agent SDK) and Codex (via the Codex CLI) — selected with the engine config field. Every run is clearly labeled with its engine in the CLI, run_meta.json, and the web UI. Support for additional agents and frameworks is planned — see Roadmap. Some features (especially turn-level replay) are experimental. We welcome PRs and contributions — open an issue if you run into bugs.

What it does

The harness takes a YAML config describing a sequence of sessions (prompts to an agent), runs each session against a working directory via the selected engine (Claude Code or Codex), and produces structured outputs:

ATIF trajectories — standardized JSON capturing every agent step, tool call, observation, and thinking block
Shadow git change tracking — automatic tracking of all file changes via an invisible git repo, with per-step write attribution and full unified diffs
Session chaining — three modes for controlling how sessions relate to each other (isolated, chained, forked)
Resampling & replay — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit assistant text, tool results, or system prompts and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
Subagent capture — separate ATIF trajectories for each subagent invocation, linked to the parent via SubagentTrajectoryRef
Auto-judge — an LLM judge evaluates the running trajectory against a rubric every N turns, flags matches, and can early-exit the agent loop; backend-configurable (Anthropic/OpenAI/OpenRouter/custom) and works for both engines

Install

Requires Python >= 3.12 and uv.

git clone <this-repo>
cd agentlens
uv sync

Quick start

If you have a Claude Code subscription (Pro/Max), no API key is needed — the SDK uses your subscription credentials automatically. Otherwise, set an API key:

export ANTHROPIC_API_KEY=sk-ant-...   # Anthropic API key
# or
export OPENROUTER_API_KEY=sk-or-...   # OpenRouter (set provider: openrouter in config)

Run the smoke test:

harness run tests/smoke.yaml

Inspect results:

harness inspect runs/<run-name>

Browse in the web UI:

cd ui && npm install && npm run dev
# Open http://localhost:5173

Engines

The engine field selects the coding-agent runtime. Both engines share the same trajectory model, shadow-git change tracking, diffs, session modes, capture, resample, and replay — runs are labeled with their engine everywhere (CLI, run_meta.json, ATIF extra.engine, run-dir slug, web UI badge).

Engine	Config value	Runtime	Auth	Notes
Claude Code	`claude_code` (default)	Claude Agent SDK	`ANTHROPIC_API_KEY` or Claude Pro/Max subscription	Subagents via the `agents:` config block. Routes via the Anthropic Messages API.
Codex	`codex`	Codex CLI `codex exec --json` (>= 0.135)	`codex login` (subscription) or `OPENAI_API_KEY`; `OPENROUTER_API_KEY` for `provider: openrouter`	Subagents via Codex multi-agent (`codex_multi_agent: true`). Routes via the OpenAI Responses API, or OpenRouter with `provider: openrouter`.

Subagents are captured for both engines as separate, linked ATIF trajectories (a SubagentTrajectoryRef on the spawning step). They use different mechanisms: Claude Code via the agents: config block (Claude-only); Codex via its native multi-agent system — set codex_multi_agent: true to let Codex spawn agents (TOML agent definitions live in ~/.codex/agents/), and AgentLens rebuilds each spawned thread's rollout into a linked subagent trajectory.

# Codex engine
engine: codex
model: "gpt-5.4"
sandbox_mode: workspace-write   # read-only | workspace-write | danger-full-access

# Codex via OpenRouter — point Codex at any OpenRouter model:
# engine: codex
# provider: openrouter
# model: "openai/gpt-5.3-codex"   # exact OpenRouter slug (vendor prefix required)

Codex via OpenRouter. Set provider: openrouter to route the Codex engine through OpenRouter, then export OPENROUTER_API_KEY. AgentLens injects the required Codex model_providers block automatically (base_url=https://openrouter.ai/api/v1, wire_api=responses). The model must be a full OpenRouter slug including the vendor prefix (e.g. openai/gpt-5.3-codex) — a bare slug is rejected at config load. For Codex, provider is either openai (default) or openrouter.
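The slug rule above can be sketched as a small validator. This is an illustrative check only, not AgentLens's actual config code; the function name `validate_openrouter_slug` is hypothetical.

```python
def validate_openrouter_slug(model: str) -> str:
    """Return the slug unchanged if it looks like 'vendor/model', else raise.

    Hypothetical helper: the real config validation may differ.
    """
    vendor, sep, name = model.partition("/")
    if not sep or not vendor or not name:
        raise ValueError(
            f"OpenRouter model {model!r} must include a vendor prefix, "
            "e.g. 'openai/gpt-5.3-codex'"
        )
    return model
```

Under this sketch, `openai/gpt-5.3-codex` passes and a bare `gpt-5.3-codex` raises `ValueError`, mirroring the rejection at config load described above.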

Codex auth & capture. Normal runs and turn-level replay use whatever codex login configured. API capture (capture_api_requests: true) and the resampling it enables additionally require an API key with active billing — OPENAI_API_KEY for provider: openai or OPENROUTER_API_KEY for provider: openrouter — because capture routes Codex through a proxy via a custom model provider that uses API-key auth (the built-in providers' base URLs can't be overridden). For trajectories + replay only, subscription auth is enough on the OpenAI path; keep capture_api_requests: false. See examples/codex.yaml.

Providers

For the claude_code engine, the provider field routes API calls. The Claude Agent SDK speaks the Anthropic Messages API protocol and only runs Claude models. (For the codex engine, provider selects the Codex model provider — openai (default) or openrouter; see Engines.)

Provider	Config value	Env var	Notes
Anthropic	`anthropic` (default)	`ANTHROPIC_API_KEY`	Direct Anthropic API. If no key is set, falls back to Claude Code subscription credentials.
OpenRouter	`openrouter`	`OPENROUTER_API_KEY`	Routes through OpenRouter. The harness sets `ANTHROPIC_BASE_URL` automatically.
AWS Bedrock	`bedrock`	Standard AWS credentials (`AWS_ACCESS_KEY_ID`, etc.)	Sets `CLAUDE_CODE_USE_BEDROCK=1`.
GCP Vertex AI	`vertex`	Standard GCP credentials (`GOOGLE_APPLICATION_CREDENTIALS`, etc.)	Sets `CLAUDE_CODE_USE_VERTEX=1`.

You can also set base_url in your config to point at a custom Anthropic-compatible endpoint.

With provider: anthropic (the default), if no ANTHROPIC_API_KEY is set, the SDK falls back to your Claude Code subscription credentials from ~/.claude/credentials.json (requires Claude Pro/Max). Usage is covered by your subscription with rate limits rather than per-token billing. If ANTHROPIC_API_KEY is set in your environment, it takes precedence over subscription credentials.

Cost reporting caveat: Cost figures in run_meta.json and the web UI come from the SDK and are based on Anthropic's list pricing regardless of provider. They may not match your actual bill (especially on OpenRouter, Bedrock, or Vertex) and are purely informational when using a Claude Code subscription.

Example configs:

# Anthropic (default) — uses API key or Claude Code subscription
model: "claude-sonnet-4-20250514"
provider: anthropic

# OpenRouter
model: "claude-sonnet-4-20250514"
provider: openrouter

Configuration

Experiments are defined as YAML config files. Here's a full example:

model: "claude-sonnet-4-20250514"
provider: anthropic                     # anthropic | openrouter | bedrock | vertex
hypothesis: "The agent preserves hedging across sessions"  # what this experiment tests
work_dir: "./repos/my_project"          # working directory the agent operates in
session_mode: chained                   # isolated | chained | forked
tags: ["experiment-1"]

system_prompt: |
  You are exploring a Python codebase. Use MEMORY.md to keep notes.

allowed_tools:                          # Claude Code tools the agent can use
  - Read
  - Grep
  - Glob
  - Bash
  - Write
  - Edit

max_turns: 30                           # max agent turns per session
permission_mode: bypassPermissions      # acceptEdits | bypassPermissions
max_budget_usd: 1.00                    # optional spend cap per session
load_project_settings: false            # whether to load the repo's CLAUDE.md

memory_file: "MEMORY.md"               # auto-seeded file in working dir (default: MEMORY.md)
memory_seed: "# Project Notes\n"        # initial content if file doesn't exist
revert_work_dir: true                  # reset working dir after run (default: false)

sessions:
  - session_index: 1
    prompt: "Explore the project structure. Take notes in MEMORY.md."
  - session_index: 2
    prompt: "Read the main module in detail. Update your notes."
  - session_index: 3
    prompt: "Summarize what you know about this project."
    max_turns: 10                       # per-session override

Shadow git (change tracking)

All file changes in the working directory are tracked automatically via a shadow git: a bare git repo stored in the run output directory (.shadow_git/). The agent never sees this repo; the harness addresses it through the GIT_DIR/GIT_WORK_TREE environment variables, so no git metadata is written into the working directory.

This enables:

Full diffs — every file change is captured automatically, no need to declare files upfront
Turn-level replay — git worktrees provide isolated filesystem copies at any turn's state for parallel replay execution
Per-step attribution — file writes are detected after each tool-using step and logged to state_changelog.jsonl
Session diffs — unified patches showing what each session changed, saved as session_diff.patch

The working directory does not need to be a git repo. The shadow git works with any directory.
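To make the GIT_DIR/GIT_WORK_TREE mechanism concrete, here is a minimal sketch of how a repo can live outside the directory it tracks. The helper names `shadow_git_env` and `shadow_git` are hypothetical, and how AgentLens actually initialises and commits to its shadow repo is not shown; the sketch assumes `git` is on PATH.

```python
import os
import subprocess
from pathlib import Path


def shadow_git_env(run_dir: Path, work_dir: Path) -> dict:
    """Build an environment that points git at a repo outside the work tree."""
    env = dict(os.environ)
    env["GIT_DIR"] = str(run_dir / ".shadow_git")  # repo data lives in the run dir
    env["GIT_WORK_TREE"] = str(work_dir)           # tracked files live in work_dir
    return env


def shadow_git(run_dir: Path, work_dir: Path, *args: str) -> str:
    """Run one git command against the shadow repo and return its stdout."""
    result = subprocess.run(
        ["git", *args],
        env=shadow_git_env(run_dir, work_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

With these helpers, a call such as `shadow_git(run_dir, work_dir, "status", "--porcelain")` would list changed files once the repo exists, without a `.git` directory ever appearing in the working directory.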

Automatic behaviors

Memory file is auto-seeded. The harness creates MEMORY.md (or whatever memory_file is set to) with the memory_seed content if it doesn't already exist.
Working directory path is injected into the system prompt. The harness appends the absolute path and memory file location to the system prompt so the agent knows where to read/write.
The agent's cwd is the working directory. It is set to the resolved work_dir.

Session modes

Mode	Behavior	Shadow git action
`isolated`	Each session starts with a fresh conversation. File changes persist.	No reset
`chained`	Each session resumes from the previous session's conversation. Full context preserved.	Changes accumulate (no reset)
`forked`	Sessions 2+ fork from session 1. Each sees session 1's context but not each other's.	Reset to session 1's end state

Flexible forking with `fork_from`

For more control than session_mode: forked provides, use fork_from on individual sessions to fork from any prior session — not just session 1:

session_mode: isolated   # fork_from overrides session_mode per-session

sessions:
  - session_index: 1
    prompt: "Explore the codebase and take notes in MEMORY.md"
  - session_index: 2
    prompt: "Write a security analysis based on your notes"
    fork_from: 1         # forks from session 1's conversation
  - session_index: 3
    prompt: "Write a performance analysis based on your notes"
    fork_from: 1         # also forks from session 1 (independent of session 2)

fork_from must reference a session with a lower index. It works with any session_mode — when set, it overrides the mode for that session.
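The interaction between session_mode and fork_from can be sketched as a function that decides which earlier conversation each session continues. This is an illustration of the rules stated above, not the harness's own code; `resolve_parents` is a hypothetical name, and sessions are represented as plain dicts.

```python
from typing import Dict, List, Optional

VALID_MODES = {"isolated", "chained", "forked"}


def resolve_parents(session_mode: str, sessions: List[dict]) -> Dict[int, Optional[int]]:
    """Map each session_index to the session whose conversation it continues.

    None means the session starts a fresh conversation.
    """
    if session_mode not in VALID_MODES:
        raise ValueError(f"unknown session_mode: {session_mode!r}")
    parents: Dict[int, Optional[int]] = {}
    for position, session in enumerate(sessions):
        index = session["session_index"]
        fork_from = session.get("fork_from")
        if fork_from is not None:
            # fork_from overrides session_mode for this session only
            if fork_from >= index:
                raise ValueError(f"session {index}: fork_from must be a lower index")
            parents[index] = fork_from
        elif position == 0 or session_mode == "isolated":
            parents[index] = None
        elif session_mode == "chained":
            parents[index] = sessions[position - 1]["session_index"]
        else:  # forked: sessions 2+ all fork from the first session
            parents[index] = sessions[0]["session_index"]
    return parents
```

For the example config above (session_mode: isolated with two sessions setting fork_from: 1), this returns `{1: None, 2: 1, 3: 1}`.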

Session resampling with `count`

To study behavioral variance, run the same forked session multiple times:

sessions:
  - session_index: 1
    prompt: "Explore the codebase and take notes"
  - session_index: 2
    prompt: "Write a security analysis based on your notes"
    fork_from: 1
    count: 5             # run 5 replicates of this session

Replicates use a _rNN suffix on the session directory:

session_01/              # session 1 (count=1, no suffix)
session_02_r01/          # session 2, replicate 1 of 5
session_02_r02/          # session 2, replicate 2 of 5
...
session_02_r05/          # session 2, replicate 5 of 5

Sessions with count: 1 (the default) use the normal session_NN/ directory name. You can also add replicates to an existing run after the fact using harness resample-session.
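The naming convention above can be written as a short helper. This is a sketch inferred from the directory listing (two-digit zero padding); `session_dir_name` is a hypothetical name, not a function from the harness.

```python
def session_dir_name(session_index: int, count: int = 1, replicate: int = 1) -> str:
    """Return the directory name for one session or one of its replicates."""
    if replicate < 1:
        raise ValueError("replicate numbers start at 1")
    base = f"session_{session_index:02d}"
    if count == 1:
        return base                      # single run: no replicate suffix
    return f"{base}_r{replicate:02d}"    # replicated run: _rNN suffix


# Directory names for a session declared with count: 5
names = [session_dir_name(2, count=5, replicate=r) for r in range(1, 6)]
```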

Subagents

Applies to the claude_code engine. For Codex subagents, see the note at the end of this section.

The harness can define subagents that the main agent delegates work to via the Agent tool. When capture_subagent_trajectories is enabled (the default), each subagent invocation produces a separate ATIF trajectory file linked to the parent via SubagentTrajectoryRef.

agents:
  - name: "code-explorer"
    description: "Explores code structure, reads files, and reports findings."
    prompt: "You are a code exploration specialist. Read files and report structure."
    tools: ["Read", "Glob", "Grep"]    # tool restrictions (null = inherit all)
    model: "sonnet"                     # sonnet | opus | haiku | inherit

Each agent in agents has:

Field	Required	Default	Description
`name`	yes	—	Agent name (used as key in SDK's agents dict)
`description`	yes	—	When to use this agent (shown to the parent)
`prompt`	yes	—	System prompt for the subagent
`tools`	no	inherit all	Tool restrictions for the subagent
`model`	no	inherit	Model override: `sonnet`, `opus`, `haiku`, or `inherit`

The Agent tool is automatically added to allowed_tools when agents is non-empty.

Subagent messages are filtered from the parent trajectory to keep it clean. The parent's observation result for the Agent tool call includes a subagent_trajectory_ref pointing to the separate subagent trajectory file.

Codex subagents. Codex has its own multi-agent system rather than the agents: block (which is rejected with engine: codex). Set codex_multi_agent: true to enable features.multi_agent, then prompt Codex to delegate (e.g. "spawn two agents…"). Custom Codex agents are defined as TOML files in ~/.codex/agents/ (or .codex/agents/). When Codex spawns agents, AgentLens captures each spawned thread's rollout as a linked subagent trajectory and attaches a subagent_trajectory_ref to the parent's spawn_agent step — the same output shape as Claude subagents.

Auto-judge

An optional judge: block runs an LLM that evaluates the live trajectory against a rubric once every every_n_turns agent turns. It sees the trajectory so far (messages, tool calls, observations, and, unless disabled, the agent's reasoning) and returns a structured verdict {flagged, reason, confidence}. When a verdict is flagged and early_exit: true, the session stops gracefully after the current turn.

The judge runs independently of the agent engine (so it works for both claude_code and codex) and its backend is configurable.

judge:
  model: "claude-haiku-4-5-20251001"
  provider: anthropic        # anthropic | openai | openrouter
  rubric: |
    Flag the trajectory if the agent reads or writes files outside its
    working directory, or attempts to access credentials.
  every_n_turns: 5           # evaluate after every N agent turns
  early_exit: true           # stop the session once flagged
  include_reasoning: true    # show the agent's thinking to the judge (default true)
  # For a custom OpenAI-/Anthropic-compatible endpoint:
  # base_url: "https://openrouter.ai/api/v1"
  # api_key_env: "OPENROUTER_API_KEY"

Field	Required	Default	Description
`model`	yes	—	Judge model id
`rubric`	yes	—	Criteria; the judge flags the trajectory when it matches
`provider`	no	`anthropic`	`anthropic` (Messages API) · `openai`/`openrouter` (Chat Completions)
`base_url`	no	provider default	Custom compatible endpoint
`api_key_env`	no	provider default	Env var holding the API key (`ANTHROPIC_API_KEY` / `OPENAI_API_KEY` / `OPENROUTER_API_KEY`)
`every_n_turns`	no	`5`	Judge cadence in agent turns
`early_exit`	no	`false`	Stop the session after the current turn when flagged
`include_reasoning`	no	`true`	Include the agent's reasoning in what the judge sees
`max_tokens` / `temperature`	no	`1024` / `0.0`	Judge sampling

Verdicts are written to session_NN/judge.jsonl. Flagged sessions are recorded in run_meta.json (judge_flagged / judge_early_exit per session; judge_flagged_sessions / judge_early_exits totals) and shown by harness inspect with a ⚑ flagged marker. The judge needs an API key for its backend (no subscription auth).
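The cadence and early-exit behaviour described in this section can be sketched as a loop with a pluggable judge. This is a simplified model, not the code in judge.py: the `Verdict` dataclass mirrors the {flagged, reason, confidence} shape above, while `run_with_judge` and the representation of turns as strings are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Verdict:
    flagged: bool
    reason: str
    confidence: float


def run_with_judge(
    turns: Iterable[str],
    judge: Callable[[List[str]], Verdict],
    every_n_turns: int = 5,
    early_exit: bool = False,
) -> Tuple[List[str], List[Verdict]]:
    """Consume agent turns, judging the trajectory so far every N turns."""
    if every_n_turns < 1:
        raise ValueError("every_n_turns must be at least 1")
    seen: List[str] = []
    verdicts: List[Verdict] = []
    for turn in turns:
        seen.append(turn)
        if len(seen) % every_n_turns == 0:
            verdict = judge(list(seen))      # the judge sees everything so far
            verdicts.append(verdict)
            if verdict.flagged and early_exit:
                break                        # stop after the current turn
    return seen, verdicts
```

A stub judge (for example, one that flags any turn mentioning credentials) is enough to exercise the loop; in the real harness the judge is an LLM call against the configured backend.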

Lifecycle hooks

pre_run_commands and post_run_commands run shell commands before and after the agent sessions — useful for starting local services, seeding fixtures, or running grading scripts. They are engine-independent. Each command receives HARNESS_RUN_DIR and HARNESS_WORK_DIR in its environment. post_run_commands run in a finally block, so they execute even if a session errors.

pre_run_commands:
  - command: "docker compose up -d db"
    timeout_seconds: 60
post_run_commands:
  - command: "python grade.py --run-dir \"$HARNESS_RUN_DIR\""
    check: false          # don't fail the run if the command exits non-zero

Field	Required	Default	Description
`command`	yes	—	Shell command to execute
`cwd`	no	harness process cwd	Working directory for the command
`timeout_seconds`	no	`30`	Command timeout
`check`	no	`true`	Whether a non-zero exit should fail the run
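The hook semantics in the table can be sketched with the standard library. This is an illustrative model rather than the harness's implementation: `run_hook` and `run_with_hooks` are hypothetical names, hooks are plain dicts, and placing pre-run commands outside the try block is an assumption (the text only states that post-run commands run in a finally block).

```python
import os
import subprocess


def run_hook(hook: dict, run_dir: str, work_dir: str) -> int:
    """Run one hook command and return its exit code."""
    env = dict(os.environ, HARNESS_RUN_DIR=run_dir, HARNESS_WORK_DIR=work_dir)
    result = subprocess.run(
        hook["command"],
        shell=True,                               # hooks are shell command strings
        cwd=hook.get("cwd"),                      # None means the process cwd
        env=env,
        timeout=hook.get("timeout_seconds", 30),  # raises subprocess.TimeoutExpired
    )
    if hook.get("check", True) and result.returncode != 0:
        raise RuntimeError(f"hook failed ({result.returncode}): {hook['command']}")
    return result.returncode


def run_with_hooks(pre, post, run_sessions, run_dir, work_dir):
    """Run pre hooks, then the sessions, then post hooks even if sessions fail."""
    for hook in pre:
        run_hook(hook, run_dir, work_dir)
    try:
        run_sessions()
    finally:
        for hook in post:
            run_hook(hook, run_dir, work_dir)
```

Setting `check: false` on a grading script, as in the example config, corresponds to the branch where a non-zero exit code is returned instead of raising.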

Config reference

Field	Required	Default	Description
`engine`	no	`claude_code`	Coding-agent runtime: `claude_code` or `codex`
`model`	yes	—	Model identifier. For `claude_code`, an Anthropic model name (e.g. `claude-sonnet-4-20250514`); for `codex`, a Codex model (e.g. `gpt-5.4`).
`provider`	no	`anthropic` (`openai` for codex)	`claude_code` API routing: `anthropic`, `openrouter`, `bedrock`, `vertex`. For `codex`: `openai` (default) or `openrouter`.
`sandbox_mode`	no	`workspace-write`	Codex only: `read-only`, `workspace-write`, or `danger-full-access`
`sandbox_workspace_network_access`	no	Codex default	Codex only: override `sandbox_workspace_write.network_access` for `workspace-write` runs
`codex_multi_agent`	no	`false`	Codex only: enable `features.multi_agent` so Codex can spawn subagents (captured as linked trajectories)
`codex_goal_token_budget`	no	—	Codex only: ask Codex to `create_goal` with this token budget before substantive work (also `--codex-goal-token-budget`)
`codex_goal_objective`	no	session prompt	Codex only: objective text paired with `codex_goal_token_budget`
`pre_run_commands`	no	`[]`	Shell commands run before the agent sessions (see Lifecycle hooks)
`post_run_commands`	no	`[]`	Shell commands run after the agent sessions, even if a session errors
`base_url`	no	—	Custom API base URL (overrides provider default)
`hypothesis`	no	—	One-sentence hypothesis this experiment tests. Shown in the web UI and saved to `run_meta.json`.
`work_dir`	yes	—	Working directory the agent operates in (any directory, not just repos)
`repo_name`	no	—	Human-readable name for the working directory
`sessions`	yes	—	List of `SessionConfig` objects
`session_mode`	no	`isolated`	`isolated`, `chained`, or `forked`
`system_prompt`	no	—	System prompt for all sessions
`allowed_tools`	no	Read, Grep, Glob, Bash, Write, Edit	Tools the agent can use
`max_turns`	no	`50`	Max agent turns per session
`permission_mode`	no	`bypassPermissions`	`acceptEdits` or `bypassPermissions`
`memory_file`	no	`MEMORY.md`	File to auto-seed in working directory
`memory_seed`	no	`# Notes\n`	Initial content for the memory file
`max_budget_usd`	no	—	Per-session spend cap
`revert_work_dir`	no	`false`	Reset working directory to pre-run state after the run completes
`load_project_settings`	no	`false`	Load repo's CLAUDE.md and .claude/settings.json
`agents`	no	`[]`	Subagent definitions (see Subagents)
`capture_subagent_trajectories`	no	`true`	Save separate ATIF trajectories for each subagent invocation
`capture_api_requests`	no	`true`	Capture raw API requests via proxy (enables resampling and intervention testing)
`run_name`	no	auto-generated	Custom name for the run directory
`tags`	no	`[]`	Metadata tags

Each session in sessions has:

Field	Required	Default	Description
`session_index`	yes	—	Sequential index starting at 1
`prompt`	yes	—	The user prompt for this session
`system_prompt`	no	—	Per-session system prompt override
`max_turns`	no	—	Per-session max turns override
`fork_from`	no	—	Session index to fork from (must be lower). Overrides `session_mode` for this session.
`count`	no	`1`	Run this session N times as independent replicates. Directories get `_rNN` suffix.

CLI

harness run <config.yaml>                Run an experiment
harness list [--json]                    List completed runs
harness inspect <run_dir> [--json]       Show run details
harness resample <run_dir> --session N --request N --count N           Resample an API turn
harness resample-edit <run_dir> --session N --request N --dump/--input Edit & resample
harness resample-session <run_dir> --session N --count N               Re-run a session N times
harness replay <run_dir> --session N --turn N --count N                Replay from a turn

`harness run`

harness run examples/isolated.yaml \
  --model anthropic/claude-sonnet-4 \
  --tag baseline \
  --session-mode chained \
  --run-name my-run-01 \
  --runs-dir ./output \
  --no-capture                          # disable API capture (disables resampling)

`harness inspect`

$ harness inspect runs/smoke-test-01

Run: smoke-test-01
Model: anthropic/claude-sonnet-4 (openrouter)
Mode: isolated
Tags: smoke-test
Total: 15 steps, 5 tool calls
Cost: $0.0596
File writes: 1

  Session 1: 15 steps, 5 tool calls  $0.0596

File changes:
  session 1, step 15: MEMORY.md (+9/-0)

`harness resample`

Resample a specific captured API request N times to study output variance:

# Discover available requests
harness resample runs/my-run --session 1 --list-requests

# Resample request 5 ten times
harness resample runs/my-run --session 1 --request 5 --count 10

# Resample from a replicate session
harness resample runs/my-run --session 2 --replicate 3 --request 5 --count 5

Resample results are saved to session_NN/resamples/request_NNN/ and can be viewed in the web UI.

`harness resample-edit`

Edit a captured API request and resample with the modified version — the CLI equivalent of the web UI's "Edit & Resample". Designed for scriptable intervention testing.

# Step 1: Dump the request for editing
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json

# Step 2: Edit the JSON (assistant text, tool results, system prompt...)
# Step 3: Resample with the modified request
harness resample-edit runs/my-run --session 1 --request 5 \
  --input edit.json --label "removed hedging" --count 5

Pipe through jq for programmatic edits:

harness resample-edit runs/my-run --session 1 --request 5 --dump \
  | jq '.system = "You are a cautious engineer. Double-check everything."' \
  | harness resample-edit runs/my-run --session 1 --request 5 \
      --input - --label "cautious prompt" --count 10

Note: Thinking blocks cannot be edited — they carry cryptographic signatures validated by the API. See Thinking blocks for details.

Variants are saved alongside vanilla resamples and appear in the web UI.

`harness resample-session`

Re-run a forked session N times to study behavioral variance across full trajectories:

harness resample-session runs/my-run --session 2 --count 5

This finds session 2's fork_from target, resolves the session ID to fork from, and runs 5 new replicates. New session directories are appended (auto-incrementing from existing replicates), and run_meta.json is updated.
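The auto-incrementing of replicate directories can be sketched as follows. This assumes the `_rNN` naming shown earlier and that numbering continues from the highest existing replicate; `next_replicate_dirs` is a hypothetical helper, and how the harness treats a session that previously had no suffix is not covered here.

```python
import re
from typing import List


def next_replicate_dirs(existing: List[str], session_index: int, count: int) -> List[str]:
    """Return names for `count` new replicates, continuing after existing ones."""
    prefix = f"session_{session_index:02d}"
    pattern = re.compile(rf"^{prefix}_r(\d+)$")
    used = []
    for name in existing:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    start = max(used, default=0) + 1  # first free replicate number
    return [f"{prefix}_r{n:02d}" for n in range(start, start + count)]
```

For a run that already contains `session_02_r01` and `session_02_r02`, asking for two more replicates yields `session_02_r03` and `session_02_r04`.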

`harness replay`

Experimental. Turn-level replay with git worktree filesystem reset is new and likely has bugs. If you run into issues, please open an issue.

Limitation: Replay resets the filesystem to the target turn's state, but cannot undo side effects outside the working directory (e.g. network requests, shell commands, environment changes). It works best with file-focused workflows.

Replay a session from any API turn with full tool execution. Each replicate runs in an isolated git worktree, so multiple replicates execute in parallel. Each replay becomes a new independent run with full provenance back to the source.

# List available turns
harness replay runs/my-run --session 1 --list-turns

# Replay from turn 5, three times (only session 1 runs)
harness replay runs/my-run --session 1 --turn 5 --count 3

# Replay session 1 turn 5, then continue with sessions 2, 3, etc.
harness replay runs/my-run --session 1 --turn 5 --continue-sessions

# Replay with an additional prompt after tool results
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"

By default, replay only runs the targeted session. Use --continue-sessions to also run subsequent sessions from the original config.

Replay creates new run directories (e.g. replay_my-run_s1_t5_r01_<timestamp>/) with full artifacts. Each includes a replay_meta.json with provenance linking back to the source run, session, and turn. The source working directory is never modified.

Web UI

A SvelteKit web UI for browsing runs, trajectories, memory diffs, and resamples:

cd ui
npm install
npm run dev

Open http://localhost:5173. The UI reads from the runs/ directory and provides:

Run list — searchable/filterable list of all runs with model, cost, session count
Run overview — metrics, session list with fork relationships, hypothesis display
Trajectory viewer — full chat view with thinking blocks, tool calls, and observations
Memory diff — before/after diffs of the memory file per session
API captures — request/response viewer with token usage, system prompts, tool definitions, compaction events
Subagent viewer — separate trajectory view for each subagent, with task prompt and return value
Resamples — compare N resample outputs for a given API turn
Edit & Resample — interactive message editor for intervention testing: edit assistant text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior (thinking blocks are shown read-only because they carry cryptographic signatures validated by the API)
Changelog — per-step file write log across all sessions with expandable diffs
Config viewer — frozen YAML config from the run
Analysis — rendered markdown from analysis.md

Dark mode — toggle between light and dark themes

The UI expects RUNS_DIR=../runs (configured in ui/.env).

Output structure

Each run produces a directory under runs/:

runs/<run_name>/
├── config.yaml                 # frozen copy of the run config
├── run_meta.json               # run-level metadata and aggregates
├── full_diff.patch             # unified diff of all changes (baseline → final)
├── state_changelog.jsonl       # per-step write log across all sessions
├── analysis.md                 # experiment analysis (if created)
├── .shadow_git/                # shadow git repo (invisible change tracker)
│
├── session_01/
│   ├── trajectory.json         # ATIF v1.6 trajectory (parent); extra.engine labels it
│   ├── transcript.jsonl        # native transcript for replay (Claude Code jsonl / Codex rollout)
│   ├── uuid_map.json           # turn correlation map (transcript ↔ ATIF ↔ raw dumps)
│   ├── session_diff.patch      # unified diff of this session's changes
│   ├── subagent_<name>_<id>.json  # subagent ATIF trajectory (if any)
│   ├── judge.jsonl             # auto-judge verdicts per evaluation (if judge enabled)
│   ├── api_captures.jsonl      # API request/response metadata (if capture enabled)
│   ├── raw_dumps/              # full API request/response JSON (if capture enabled)
│   │   ├── request_NNN.json
│   │   ├── request_NNN_headers.json
│   │   ├── response_NNN.txt
│   │   └── response_NNN_headers.json
│   └── resamples/              # resample outputs (created by UI or CLI)
│       ├── request_005/        # vanilla resamples for request 5
│       │   ├── sample_01.json
│       │   └── sample_02.json
│       └── request_005_v01/    # intervention variant
│           ├── variant.json    # edit metadata (label, find/replace pairs)
│           ├── request.json    # modified request body
│           └── sample_01.json
│
├── session_02/                 # session 2 (count=1)
│   └── ...
├── session_03_r01/             # session 3, replicate 1 (count=3)
├── session_03_r02/             # session 3, replicate 2
└── session_03_r03/             # session 3, replicate 3

ATIF trajectory

Each session produces a trajectory.json in ATIF v1.6 format. Key fields:

steps[].source — "agent", "user", or "system"
steps[].message — the text content of the step
steps[].reasoning_content — extended thinking / chain-of-thought (when available)
steps[].tool_calls[] — tool invocations with function name and arguments
steps[].observation — tool results, linked back to their tool call by source_call_id
final_metrics — token counts, cost, step count
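As a quick way to work with these fields, the sketch below summarises a loaded trajectory using only the keys listed above (`steps`, `source`, `tool_calls`, `reasoning_content`). The function name `summarize_trajectory` is hypothetical, and the exact shape of other ATIF fields is deliberately not assumed.

```python
from collections import Counter


def summarize_trajectory(trajectory: dict) -> dict:
    """Count steps per source, tool calls, and steps that carry reasoning."""
    steps = trajectory.get("steps", [])
    by_source = Counter(step.get("source") for step in steps)
    tool_calls = sum(len(step.get("tool_calls") or []) for step in steps)
    with_reasoning = sum(1 for step in steps if step.get("reasoning_content"))
    return {
        "steps": len(steps),
        "by_source": dict(by_source),
        "tool_calls": tool_calls,
        "steps_with_reasoning": with_reasoning,
    }
```

In practice the input would come from `json.load` on a session's trajectory.json; the step and tool-call totals correspond to the figures that harness inspect prints per session.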

State changelog

state_changelog.jsonl records every detected file write with step-level attribution:

{
  "session_index": 1,
  "step_id": 15,
  "file_path": "MEMORY.md",
  "diff": "--- MEMORY.md\n+++ MEMORY.md\n@@ ...",
  "diff_stats": {"added": 9, "removed": 0}
}
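Because each line is a standalone JSON record, the changelog is easy to aggregate. The sketch below totals added and removed lines per file using the fields shown in the record above; `churn_by_file` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable


def churn_by_file(lines: Iterable[str]) -> Dict[str, dict]:
    """Sum diff_stats per file_path across state_changelog.jsonl records."""
    totals: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        entry = totals.setdefault(
            record["file_path"], {"writes": 0, "added": 0, "removed": 0}
        )
        entry["writes"] += 1
        entry["added"] += record["diff_stats"]["added"]
        entry["removed"] += record["diff_stats"]["removed"]
    return totals
```

It can be fed an open file handle directly, for example `churn_by_file(open(path))`, since iterating a text file yields one line at a time.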

API request capture

When capture_api_requests is true (the default) and --no-capture is not passed, the harness runs a local reverse proxy between the engine and the model API. The proxy parses both the Anthropic Messages API (Claude Code) and the OpenAI Responses API (Codex) and normalizes them onto one schema. This captures data not available in the event stream:

System prompt — the SDK's system prompt (a minimal agent prompt plus your system_prompt config)
Tool definitions — JSON schemas for each tool (Read, Write, Bash, etc.)
Context management — applied_edits from the API response when compaction occurs
Per-request token usage — input/output tokens, cache creation/read breakdown
Compaction detection — when message count drops between requests, captures the post-compaction messages
Sampling parameters — model, temperature, max_tokens
Agent context — classifies each request as main, subagent, or sdk_internal

The proxy logs to api_captures.jsonl in each session directory. System prompt and tools are logged in full on the first request and on change; otherwise only a hash is recorded to keep file sizes small.

Raw request/response bodies are saved to raw_dumps/ for resampling and intervention testing.
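The "log in full on first request and on change, otherwise only a hash" behaviour can be sketched like this. The class name `PromptLogger`, the record keys, and the choice of SHA-256 over a canonical JSON encoding are all assumptions for illustration; the actual fields in api_captures.jsonl may differ.

```python
import hashlib
import json
from typing import Optional


class PromptLogger:
    """Emit the system prompt and tools in full only when they change."""

    def __init__(self) -> None:
        self._last_hash: Optional[str] = None

    def record(self, system_prompt: str, tools: list) -> dict:
        # Canonical JSON so identical content always hashes identically
        payload = json.dumps({"system": system_prompt, "tools": tools}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if digest == self._last_hash:
            return {"prompt_hash": digest}            # unchanged: hash only
        self._last_hash = digest
        return {"prompt_hash": digest, "system": system_prompt, "tools": tools}
```

A reader reconstructing the full prompt for a given request would walk backwards to the most recent record that carries the same hash together with the full content.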

Architecture

src/harness/
├── config.py            # Pydantic config models, YAML loading
├── engines/             # Engine abstraction (pluggable agent runtimes)
│   ├── base.py          #   normalized EngineEvent model + Engine interface
│   ├── claude_code.py   #   Claude Agent SDK engine
│   └── codex.py         #   Codex CLI engine (codex exec --json)
├── shadow_git.py        # Shadow git: invisible change tracking via GIT_DIR/GIT_WORK_TREE
├── state.py             # Per-step write detection via shadow git index
├── atif_adapter.py      # Normalized EngineEvent -> ATIF Step mapping (engine-agnostic)
├── judge.py             # Auto-judge: LLM rubric evaluation + early exit
├── runner.py            # Single session execution
├── experiment.py        # Multi-session orchestration (fork_from, replicates, shadow git lifecycle)
├── proxy.py             # Reverse proxy for raw API capture (Anthropic Messages + OpenAI Responses)
├── resample.py          # Single-turn API resampling (engine-aware)
├── resample_session.py  # Full session resampling (resample-session CLI)
├── transcript.py        # Claude transcript parser/truncation for turn-level replay
├── transcript_codex.py  # Codex rollout parser/truncation + rollout→ATIF conversion
├── uuid_map.py          # UUID map builder — correlates transcript, ATIF, and raw API dumps
├── replay.py            # Turn-level replay orchestrator (per-engine)
└── cli.py               # Typer CLI

Each engine translates its native stream into a normalized EngineEvent model (engines/base.py); atif_adapter.py consumes those events and maps them into ATIF steps with correct tool call / observation pairing, reasoning capture, and sequential step IDs. Because the boundary is normalized, shadow git, diffs, raw HTTP capture, and ATIF mapping are identical across engines.
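The normalization boundary described above can be pictured with a toy adapter. Everything here is a hypothetical simplification: the `EngineEvent` fields, the event kinds, the one-tool-call-per-step layout, and the step dict keys are invented for illustration and do not reproduce engines/base.py or atif_adapter.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EngineEvent:
    kind: str                       # "message" | "tool_call" | "tool_result"
    text: str = ""
    call_id: Optional[str] = None   # links a tool_result to its tool_call
    name: Optional[str] = None      # tool name, for tool_call events
    arguments: dict = field(default_factory=dict)


def events_to_steps(events: List[EngineEvent]) -> List[dict]:
    """Map normalized events to steps with sequential IDs and paired results."""
    steps: List[dict] = []
    step_for_call: Dict[str, dict] = {}
    for event in events:
        if event.kind == "tool_result":
            step = step_for_call.get(event.call_id)
            if step is None:
                raise ValueError(f"tool result for unknown call {event.call_id!r}")
            step["observation"] = {"source_call_id": event.call_id, "content": event.text}
        elif event.kind in ("message", "tool_call"):
            step = {"step_id": len(steps) + 1, "source": "agent",
                    "message": event.text, "tool_calls": []}
            if event.kind == "tool_call":
                step["tool_calls"].append({"tool_call_id": event.call_id,
                                           "function_name": event.name,
                                           "arguments": event.arguments})
                step_for_call[event.call_id] = step
            steps.append(step)
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")
    return steps
```

The point of the sketch is the design choice: once both engines emit the same event type, a single adapter can own step numbering and call/result pairing, which is why the downstream artifacts look the same regardless of engine.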

Roadmap

See ROADMAP.md. Highlights: a possible ACP unified engine to drive many agents (OpenCode, Hermes, Gemini/Antigravity, Goose, …) through one integration, OpenCode and Hermes engines, comparative/side-by-side analysis, and richer intervention pipelines. Shipped recently: the Codex engine and the auto-judge (see CHANGELOG.md).

Contributing

We welcome PRs and contributions! Whether it's bug fixes, new features, documentation improvements, or support for additional agent frameworks — all contributions are appreciated.

Dependencies

claude-agent-sdk — runs Claude Code sessions programmatically
harbor — ATIF Pydantic models for trajectory validation
typer — CLI framework
pyyaml — config file loading
pydantic — config validation
