bench(parity): cg HTTP and cg-mcp share the same 8-verb surface #696
DvirDukhan wants to merge 4 commits into
Pairs with #api-v2 (api/v2/* MCP-parity endpoints). With those endpoints in place, the bench harness can now run the HTTP-transport sibling (cg) on the same verb surface as the stdio-MCP sibling (cg-mcp), so a head-to-head benchmark measures *transport overhead* rather than API-surface differences.

Changes:

* bench/agents/code_graph_adapter.py: add v2 client methods on CodeGraphClient that POST to the new /api/v2/* endpoints (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path_v2, ask_v2). Existing UI-shaped methods (graph_entities, get_neighbors, find_paths, ...) kept for back-compat with tests/test_cli.py.
* bench/cli/cg.py: rewrite to expose the 8 MCP-style verbs (index_repo, search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, ask) alongside the legacy UI verbs. Mirrors cg_mcp.py's _compact_list / _strip_worktree_prefix helpers so token compaction is byte-identical between transports.
* bench/runners/mini_runner.py: INSTANCE_TEMPLATE_CODE_GRAPH now documents the new verb surface. The cg track exports PROJECT_NAME + BRANCH like the MCP track, and indexes via /api/analyze_folder with explicit branch=_default so both tracks share the code:<project>:<branch> graph namespace.
* bench/tools/code_graph/system_preamble.md: rewritten to mirror bench/tools/code_graph_mcp/system_preamble.md verb-for-verb.

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph: cg search_code/get_callers/get_callees/impact_analysis returns identical output to the cg-mcp equivalents (1 KB payload diff'd). All 27 existing bench + CLI tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
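The byte-for-byte parity check described above could be scripted roughly as follows. This is a minimal sketch, not the harness's actual validation code: `assert_byte_parity` is a hypothetical helper, and the two payloads are stand-in strings rather than captured cg / cg-mcp output.

```python
import difflib


def assert_byte_parity(http_out: bytes, mcp_out: bytes) -> None:
    """Raise AssertionError carrying a unified diff if the payloads differ."""
    if http_out == mcp_out:
        return
    diff = difflib.unified_diff(
        mcp_out.decode("utf-8", errors="replace").splitlines(),
        http_out.decode("utf-8", errors="replace").splitlines(),
        fromfile="cg-mcp",
        tofile="cg",
        lineterm="",
    )
    raise AssertionError("transport outputs differ:\n" + "\n".join(diff))


# Stand-in payloads (assumed shape). In practice these would be the captured
# stdout of the same verb run through each transport on the same indexed graph.
mcp_payload = b'[{"name": "foo", "path": "src/a.py", "line": 12}]\n'
http_payload = b'[{"name": "foo", "path": "src/a.py", "line": 12}]\n'

assert_byte_parity(http_payload, mcp_payload)  # identical, so nothing is raised
```

Comparing raw bytes rather than parsed JSON is deliberate here: any difference in whitespace or key order also changes the token count the agent sees.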
Iter3 root-cause: with the verb surfaces and tool outputs now byte-identical between the HTTP (cg) and MCP (cg-mcp) tracks, the remaining token gap traced entirely to reading strategy. On 2/10 instances the agent fell into a 19x full-file `cat` loop instead of reading the bounded span the graph already pointed at, inflating input tokens 3-4x on those instances.

Both preambles now explicitly forbid `cat`-ing a whole source file and require `sed -n 'START,ENDp'` anchored on the graph's line number. This attacks the actual token driver and applies equally to both transports so a head-to-head stays apples-to-apples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
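For readers less familiar with `sed -n 'START,ENDp'`, the bounded read that the preambles require can be illustrated in Python. This is only a sketch of the idea: `read_span` is a hypothetical helper, not part of the bench tooling, and the demo file is generated on the fly.

```python
from itertools import islice
from pathlib import Path
import tempfile


def read_span(path, start, end):
    """Return lines start..end (1-indexed, inclusive), like sed -n 'start,endp'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    with open(path, encoding="utf-8") as fh:
        # islice stops consuming the file after line `end`, so the rest of
        # the source file never reaches the caller.
        return list(islice(fh, start - 1, end))


# Demo on a throwaway 200-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "module.py"
    demo.write_text("".join(f"line {i}\n" for i in range(1, 201)), encoding="utf-8")
    span = read_span(demo, 40, 42)

print("".join(span), end="")  # prints lines 40 through 42 only
```

Three lines are returned instead of 200, which is the same saving the preamble change aims for on real source files.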
sample_instances() was called with only `stage` (size from STAGE_SIZES), then the result was sliced `[:limit]`. That let --limit shrink the sample below the stage size but never grow it, so `--stage calibration --limit 40` silently ran just 10 instances.

Pass n=args.limit straight into sample_instances so the limit sets the exact sample size (falling back to the stage size when unset).

Because random.sample is prefix-stable for our seed, the n=10 calibration set stays a subset of the n=40 set, so existing trajectories/indexed graphs still resume-skip cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
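The before/after behaviour can be reproduced with a small stand-in. Everything below is an assumption made for illustration: the real `sample_instances` signature, its seed handling, and the instance pool are not shown in this PR, and only the calibration size of 10 comes from the text above. The prefix check is run rather than assumed, because prefix stability is a CPython implementation detail that holds only when both sample sizes take the same internal code path (which they do for this 500-item pool).

```python
import random

# Only the calibration size (10) is stated in the commit message.
STAGE_SIZES = {"calibration": 10}


def sample_instances(instances, stage, n=None, seed=0):
    """Hypothetical stand-in: draw n instances, or the stage size if n is None."""
    size = n if n is not None else STAGE_SIZES[stage]
    return random.Random(seed).sample(instances, size)


instances = [f"instance-{i}" for i in range(500)]  # made-up pool

# Old behaviour: sample at the stage size, then slice. A limit of 40 cannot
# grow a 10-element list.
old = sample_instances(instances, "calibration")[:40]

# New behaviour: the limit is passed through as the sample size.
new = sample_instances(instances, "calibration", n=40)

default = sample_instances(instances, "calibration")

print(len(old), len(new))      # 10 40
print(new[:10] == default)     # True for this pool and seed
```

If the pool size fell in a range where the two sample sizes used different internal strategies, the last check could fail, so it is worth keeping as a guard wherever resume-skip depends on it.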
…ace)

Cascades the corrected #695 (which removed /api/v2/ask) and aligns the benchmark's parity surface with the now-ask-less MCP/HTTP tool set. Removes the GraphRAG `ask` verb from both transports so cg and cg-mcp expose the same 7 structural verbs (index_repo, search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path):

- bench/agents/code_graph_adapter.py: drop ask_v2() (was POST /api/v2/ask)
- bench/agents/code_graph_mcp_adapter.py: drop ask() (was call_tool("ask"))
- bench/cli/cg.py, cg_mcp.py: drop the `ask` subcommand + handler + docs
- scripts/mcp_smoke.py: drop "ask" from the expected MCP tool set
- system_preamble.md / tools.yaml / AGENTS.md: 8 -> 7 verbs

Tests: tests/mcp (54) and bench suites (40) pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
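A check in the spirit of the expected-tool-set assertion might look like the sketch below. It is not the contents of scripts/mcp_smoke.py, which this PR does not show: `surface_drift` and the way the exposed verbs are obtained are hypothetical, and only the seven verb names come from the commit message.

```python
# The seven structural verbs listed in the commit message.
EXPECTED_VERBS = frozenset({
    "index_repo",
    "search_code",
    "get_callers",
    "get_callees",
    "get_dependencies",
    "impact_analysis",
    "find_path",
})


def surface_drift(exposed):
    """Report verbs that are missing from, or extra to, the expected surface."""
    exposed = set(exposed)
    return {
        "missing": sorted(EXPECTED_VERBS - exposed),
        "unexpected": sorted(exposed - EXPECTED_VERBS),
    }


# A transport that still advertises the removed `ask` verb is flagged.
stale_surface = EXPECTED_VERBS | {"ask"}
print(surface_drift(stale_surface))   # {'missing': [], 'unexpected': ['ask']}

# A transport exposing exactly the seven verbs reports no drift.
print(surface_drift(EXPECTED_VERBS))  # {'missing': [], 'unexpected': []}
```

Running the same check against both transports gives a quick signal that cg and cg-mcp still agree on the verb surface before any token comparison is made.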
Closing: we're standardizing the code-graph benchmark on the MCP arm (the real-world agent integration), so HTTP↔MCP bench parity isn't needed. The MCP arm in #693 is aligned to the live consolidated tool surface (get_neighbors/find_symbol/get_file_neighbors). Branch preserved if we revisit.
Prerequisites (merge order)
Merge in order — this PR is stacked on:
/api/v2/* MCP-parity endpoints
Base: #695.
Runtime prerequisite: exercises the shared 8-verb surface, so the MCP nav product stack (#701/#702) must be deployed/indexed to run the parity arm.
Summary
Pairs with #api-v2 (the /api/v2/* MCP-parity endpoints). With those endpoints in place, the SWE-bench harness can now run the HTTP-transport sibling (cg) on the same verb surface as the stdio-MCP sibling (cg-mcp), so a head-to-head benchmark measures transport overhead rather than API-surface differences.

Changes

* bench/agents/code_graph_adapter.py: add v2 client methods on CodeGraphClient that POST to the new /api/v2/* endpoints (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path_v2, ask_v2). Existing UI-shaped methods kept for back-compat with tests/test_cli.py.
* bench/cli/cg.py: rewrite to expose the 8 MCP-style verbs (index_repo, search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, ask) alongside the legacy UI verbs. Mirrors cg_mcp.py's _compact_list / _strip_worktree_prefix helpers so token compaction is byte-identical between transports.
* bench/runners/mini_runner.py: INSTANCE_TEMPLATE_CODE_GRAPH now documents the new verb surface. The cg track exports PROJECT_NAME + BRANCH like the MCP track, and indexes via /api/analyze_folder with explicit branch=_default so both tracks share the code:<project>:<branch> graph namespace.
* bench/tools/code_graph/system_preamble.md: rewritten to mirror bench/tools/code_graph_mcp/system_preamble.md verb-for-verb.

Validation

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph: cg search_code/get_callers/get_callees/impact_analysis returns identical output to the cg-mcp equivalents (1 KB payload diff'd). All 27 existing bench + CLI tests still pass.

Stacked

dvirdukhan/api-v2-mcp-parity (needs the v2 endpoints).