Add root management and autonomous indexing tools to MCP server #15

Closed
rickross wants to merge 62 commits into colbymchenry:main from rickross:main

rickross commented Feb 5, 2026

Hi Colby!

We've started using CodeGraph extensively and find it incredibly valuable. We added several MCP tools that make it easier to manage multiple projects and that let AI agents handle project lifecycle tasks (init/index/sync). We hope these will be helpful contributions.

What's in this PR

1. Root Management Tools

  • codegraph_get_root - Get currently active root path
  • codegraph_set_root - Switch between indexed projects
    • Now returns status immediately (files, nodes, edges, DB size) - no separate status call needed
    • Better error messages for uninitialized roots

2. Project Lifecycle Tools

AI assistants can now manage CodeGraph project lifecycle on behalf of users:

  • codegraph_init - Initialize CodeGraph in current root
  • codegraph_index - Full index of current root
  • codegraph_sync - Incremental sync of current root
  • codegraph_uninit - Remove CodeGraph from current root (cleanup)

These operate on the "current root" set by set_root, providing a clean context-based API.

3. Performance Optimization (80x speedup for reference resolution)

Optimized reference resolution with symbol caching:

  • Before: 80+ seconds for sync on projects with 20K+ unresolved refs (9% CPU)
  • After: ~1 second (95% CPU utilization)
  • Cause: Repeated SQLite queries for symbol lookups
  • Fix: Pre-load all symbols into memory maps indexed by name/qualified name

Tested on 681-file codebase with 26,233 unresolved references.
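
The caching idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `SymbolNode` shape and the `SymbolCache` class are hypothetical, and only the method names `getNodesByName` and `getNodesByQualifiedName` come from the commit messages. The point is that every symbol is loaded once and each later lookup is a map read instead of a SQLite query.

```typescript
// Hypothetical node shape; the real CodeGraph schema may differ.
interface SymbolNode {
  id: string;
  name: string;
  qualifiedName: string;
}

class SymbolCache {
  private byName = new Map<string, SymbolNode[]>();
  private byQualifiedName = new Map<string, SymbolNode[]>();

  // Build both indexes in one pass over rows loaded by a single query.
  warm(nodes: SymbolNode[]): void {
    this.byName.clear();
    this.byQualifiedName.clear();
    for (const node of nodes) {
      SymbolCache.push(this.byName, node.name, node);
      SymbolCache.push(this.byQualifiedName, node.qualifiedName, node);
    }
  }

  // Lookups return an empty list for unknown keys instead of querying.
  getNodesByName(name: string): SymbolNode[] {
    return this.byName.get(name) ?? [];
  }

  getNodesByQualifiedName(qualifiedName: string): SymbolNode[] {
    return this.byQualifiedName.get(qualifiedName) ?? [];
  }

  private static push(index: Map<string, SymbolNode[]>, key: string, node: SymbolNode): void {
    const bucket = index.get(key);
    if (bucket) bucket.push(node);
    else index.set(key, [node]);
  }
}

const cache = new SymbolCache();
cache.warm([
  { id: "1", name: "login", qualifiedName: "AuthService.login" },
  { id: "2", name: "login", qualifiedName: "AdminService.login" },
]);
```

The trade-off is memory for speed: the maps hold every symbol for the duration of resolution, so the cache has to be rebuilt whenever the underlying nodes change.
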

4. Tree-sitter Fixes

Merged @jasques' PR #9 fixes for tree-sitter installation issues (with full credit).

Use Cases

Multi-project workflows:

codegraph_set_root("/path/to/project-a")  // Shows: 500 files, 2000 nodes...
codegraph_search("AuthService")

codegraph_set_root("/path/to/project-b")  // Shows: 800 files, 3500 nodes...
codegraph_search("AuthService")

Autonomous project lifecycle:

codegraph_set_root("/path/to/new-project")  // Detects not initialized
codegraph_init()                              // Initialize it
codegraph_index()                             // Index it (now 80x faster!)
codegraph_search("main")                     // Ready to use!
// ... later ...
codegraph_uninit()                            // Clean up when done

Testing

Verified across 3 different projects (TypeScript, Python, mixed) with successful initialization, indexing, syncing, and querying.

Let us know if you'd like any changes or have questions about the design!

jasques and others added 30 commits January 31, 2026 18:31

Adds codegraph_get_project and codegraph_set_project tools to enable
AI assistants to work across multiple indexed projects in a single session.

- codegraph_get_project: Returns currently active project path
- codegraph_set_project: Switches to different project (closes old, opens new)
- Updated CLAUDE.md documentation

This enables multi-project workflows without restarting the MCP server.

Merges jasques' fix for tree-sitter installation issues.

- Pins tree-sitter dependencies to exact versions (no ^ ranges)
- Adds npm overrides to force tree-sitter@0.22.4
- Removes non-existent queries copy from build script

Credit: #9
Co-authored-by: Łukasz Jakóbiec <jasques@users.noreply.github.com>

Adds init, index, and sync tools to enable autonomous project management:

- codegraph_init_project: Initialize CodeGraph in a new project
- codegraph_index_project: Perform full index of all files
- codegraph_sync_project: Incremental update (changed files only)

These tools enable AI assistants to discover and index new projects
without requiring manual shell commands, making multi-project workflows
fully autonomous.

Renamed tools for clarity and simplicity:
- codegraph_get_project → codegraph_get_root
- codegraph_set_project → codegraph_set_root
- codegraph_init_project → codegraph_init (operates on current root)
- codegraph_index_project → codegraph_index (operates on current root)
- codegraph_sync_project → codegraph_sync (operates on current root)

Benefits:
- Clearer mental model: set a root, then operations work on that root
- No redundant path parameters
- Simpler API surface

Shows immediate feedback about the root you just switched to:
- Files indexed
- Total nodes/edges
- Database size

No need to run separate status command after switching.

Changed 'codegraph init' → 'codegraph_init' to reference
the correct MCP tool name instead of CLI command.

Completes the lifecycle management:
- init: create .codegraph/
- index/sync: populate/update index
- uninit: remove .codegraph/ (cleanup)

Calls CodeGraph.uninitialize() which closes DB and
deletes the .codegraph directory.

Documents how to configure CodeGraph with OpenCode
in addition to existing Claude Code instructions.

Changed from OpenCode-specific to generic MCP client config.
Keeps it neutral and broadly applicable.

Makes it clear the example is for OpenCode and shows
the typical config file location.

Problem:
- sync was taking 80+ seconds on projects with 20K+ unresolved refs
- Low CPU utilization (9%) indicated I/O bottleneck
- Root cause: 26K repeated SQLite queries for symbol lookups

Solution:
- Pre-load all symbols into memory maps indexed by name/qualified name
- Cache lookup in getNodesByName() and getNodesByQualifiedName()
- warmCaches() called once at start of resolveAll()

Results:
- sync time: 80s → 1s (80x speedup)
- CPU utilization: 9% → 95% (actually using available resources)
- Memory trade-off: a few MB for the symbol cache (negligible)

Tested on 681-file codebase with 26,233 unresolved references.

- Add insertUnresolvedRefsBatch() method using SQLite transaction
- Replace N individual inserts with single batched transaction
- Expected 10-100x speedup on post-parsing phase depending on ref count

This avoids repeated transaction overhead when indexing files with
many unresolved references.
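
As a rough illustration of the batching described in this commit, the sketch below wraps all inserts in one explicit transaction. Only the name `insertUnresolvedRefsBatch` comes from the commit message. The `SqlDb` interface, the `unresolved_refs` table, and its columns are assumptions made up for this example, and `RecordingDb` is a fake that only records statements so the sketch runs without a real SQLite driver.

```typescript
// Minimal stand-in for a SQLite handle; hypothetical, not a real driver API.
interface SqlDb {
  exec(sql: string): void;
  run(sql: string, params: unknown[]): void;
}

// Hypothetical reference shape for illustration.
interface UnresolvedRef {
  fromNodeId: string;
  name: string;
}

// Insert all refs inside one transaction, rolling back if any insert fails.
function insertUnresolvedRefsBatch(db: SqlDb, refs: UnresolvedRef[]): void {
  if (refs.length === 0) return;
  db.exec("BEGIN");
  try {
    for (const ref of refs) {
      // Table and column names are assumptions for illustration.
      db.run("INSERT INTO unresolved_refs (from_node_id, name) VALUES (?, ?)", [
        ref.fromNodeId,
        ref.name,
      ]);
    }
    db.exec("COMMIT");
  } catch (err) {
    db.exec("ROLLBACK");
    throw err;
  }
}

// Recording fake so the sketch runs without a database.
class RecordingDb implements SqlDb {
  statements: string[] = [];
  exec(sql: string): void {
    this.statements.push(sql);
  }
  run(sql: string, _params: unknown[]): void {
    this.statements.push(sql);
  }
}

const recordingDb = new RecordingDb();
insertUnresolvedRefsBatch(recordingDb, [
  { fromNodeId: "a", name: "foo" },
  { fromNodeId: "a", name: "bar" },
  { fromNodeId: "b", name: "baz" },
]);
```

Without the explicit `BEGIN`/`COMMIT`, SQLite runs each insert in its own implicit transaction, which is where the per-row overhead comes from.
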
- Add timing breakdown to IndexResult (scanning, parsing, storing, resolving)
- Report progress during 'storing' phase (was silent before)
- Track per-file parse times to identify bottlenecks
- Users can now see where time is spent during indexing

This provides visibility into performance bottlenecks and makes
long indexing operations less mysterious.

- Index command now calls resolveReferences() after indexing
- Added progress logging during resolution (every 100 refs)
- Shows resolved/unresolved counts at completion
- This was the missing 'resolving' phase that took most of the time

The 'index' command was only parsing+storing but not resolving,
so edges weren't being created. Now the full pipeline runs.

- resolveReferences() now accepts onProgress callback
- CLI shows real-time progress bar during resolution
- Updates every 100ms with current/total refs
- Shows resolution duration separately from indexing
- Much better UX during the slow resolution phase
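
One way to get the "updates every 100ms" behaviour is to throttle the progress callback, as in the sketch below. This is an assumption about how such throttling could be done, not the PR's implementation: `throttleProgress` and `ProgressFn` are hypothetical names, and the clock is injectable so the example can be run with simulated time.

```typescript
type ProgressFn = (current: number, total: number) => void;

// Wrap a progress callback so it fires at most once per `intervalMs`,
// but always fires for the final item. `now` is injectable for testing.
function throttleProgress(
  onProgress: ProgressFn,
  intervalMs = 100,
  now: () => number = Date.now,
): ProgressFn {
  let last = -Infinity;
  return (current, total) => {
    const t = now();
    if (current >= total || t - last >= intervalMs) {
      last = t;
      onProgress(current, total);
    }
  };
}

// Simulated run: a fake clock advances 30 ms per resolved reference.
let fakeTime = 0;
const reported: number[] = [];
const report = throttleProgress((current) => reported.push(current), 100, () => fakeTime);
for (let i = 1; i <= 10; i++) {
  fakeTime += 30;
  report(i, 10);
}
```

With these simulated timings the wrapped callback fires for items 1, 5, 9, and 10, so the progress bar is redrawn four times instead of ten.
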
- Changed fs.readFileSync to async fs.promises.readFile
- Process files in batches of 20 with Promise.all
- Overlaps I/O operations instead of sequential reads
- Should utilize idle CPU cores better (was 25% CPU, I/O bound)

Expected impact: 2-4x faster indexing on projects with many files
since file reading is now parallel instead of sequential.
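
The batching described in this commit could look roughly like the sketch below. The helper names `chunk` and `readInBatches` are hypothetical, and the reader is passed in as a parameter so the example has no imports. In Node, a reader such as `(p) => fs.promises.readFile(p, "utf8")` could be supplied.

```typescript
// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Read files batch by batch: reads within a batch overlap, batches run in order.
// `read` is injected so the sketch stays free of imports.
async function readInBatches(
  paths: string[],
  read: (path: string) => Promise<string>,
  batchSize = 20,
): Promise<Map<string, string>> {
  const contents = new Map<string, string>();
  for (const batch of chunk(paths, batchSize)) {
    const entries = await Promise.all(
      batch.map(async (p) => [p, await read(p)] as const),
    );
    for (const [p, text] of entries) contents.set(p, text);
  }
  return contents;
}
```

Capping the batch size bounds the number of files open at once, which a single `Promise.all` over every path would not do.
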
- synchronous=NORMAL: Faster writes (safe with WAL mode)
- cache_size=64MB: Larger cache for better read performance
- temp_store=MEMORY: Keep temporary tables in RAM
- mmap_size=256MB: Memory-mapped I/O for faster access

These pragmas should improve both read and write performance
significantly without compromising data integrity.
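
For reference, the four settings above map onto SQLite pragma statements roughly as shown below. The `performancePragmas` list and `applyPragmas` helper are hypothetical names for this sketch. The unit conventions are SQLite's own: a negative `cache_size` is a size in KiB, and `mmap_size` is in bytes.

```typescript
const MIB = 1024 * 1024;

// Pragma values taken from the commit message; units follow SQLite's conventions.
const performancePragmas: string[] = [
  "PRAGMA synchronous = NORMAL",
  // A negative cache_size is a size in KiB rather than a page count.
  `PRAGMA cache_size = ${-(64 * MIB) / 1024}`,
  "PRAGMA temp_store = MEMORY",
  // mmap_size is given in bytes.
  `PRAGMA mmap_size = ${256 * MIB}`,
];

// Apply the pragmas through any handle that can execute raw SQL.
function applyPragmas(exec: (sql: string) => void): void {
  for (const pragma of performancePragmas) exec(pragma);
}
```

Note that with `synchronous = NORMAL` in WAL mode the database stays consistent after a crash, but the most recently committed transactions may be rolled back after a power loss. For a rebuildable index that is usually an acceptable trade.
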
Documents all 6 major optimizations made in performance branch:
1. Batch insert for unresolved refs
2. Detailed timing breakdown
3. Progress reporting for all phases
4. Reference resolution in index command
5. Parallel file I/O
6. SQLite performance pragmas

Includes expected impact, benchmarks, and testing instructions.

Resolution was not clearing old edges before creating new ones,
causing edges to accumulate on each index run.

Now deletes existing edges from source nodes before inserting
new resolved edges, preventing duplicates.
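
The delete-then-insert approach can be illustrated with an in-memory store. This is only a sketch of the idea: `Edge`, `EdgeStore`, and `replaceEdges` are hypothetical names, and the real fix operates on the SQLite edges table rather than a map.

```typescript
// Hypothetical edge shape for illustration.
interface Edge {
  source: string;
  target: string;
  kind: string;
}

class EdgeStore {
  private bySource = new Map<string, Edge[]>();

  // Drop existing edges for every source node in the batch, then insert the
  // newly resolved ones, so repeating the same run does not accumulate edges.
  replaceEdges(resolved: Edge[]): void {
    const sources = new Set(resolved.map((e) => e.source));
    sources.forEach((s) => this.bySource.delete(s));
    for (const edge of resolved) {
      const list = this.bySource.get(edge.source) ?? [];
      list.push(edge);
      this.bySource.set(edge.source, list);
    }
  }

  count(): number {
    let n = 0;
    this.bySource.forEach((list) => {
      n += list.length;
    });
    return n;
  }
}

const store = new EdgeStore();
const resolvedEdges: Edge[] = [
  { source: "a", target: "b", kind: "calls" },
  { source: "a", target: "c", kind: "calls" },
];
store.replaceEdges(resolvedEdges);
store.replaceEdges(resolvedEdges); // second index run over the same files
```

After both runs the store still holds two edges, which is the idempotent behaviour the fix aims for.
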
Schema.sql execution was accidentally removed when adding performance
pragmas, causing 'no such table' errors on fresh init.

This restores the schema initialization that creates all tables.

indexAll() was showing 'Resolving refs: 0%' placeholder that did nothing,
confusing users before the real resolution started.

Resolution happens separately after indexing via resolveReferences(),
so removed the misleading progress indicator.

Adds 'codegraph uninit' command to match MCP tool functionality.
Includes confirmation prompt to prevent accidental data loss.

Usage: codegraph uninit [path]

Users expect 'y' to work, not just 'yes'.
Also changed prompt to (y/n) to be clearer.

Shows feedback during the initialization phase (getUnresolvedReferences + warmCaches)
so users know the process hasn't hung. Message clears when progress starts.

Changed from clearing at current===100 to clearing on first callback.
This ensures the message clears properly when resolution starts.

Displays same stats block as 'codegraph status' after resolution,
showing accurate file/node/edge counts and DB size.

Eliminates confusion between intermediate counts (412 edges)
and final totals (12,159 edges after resolution).

warmCaches was calling getNodesByFile() for each file (880 queries).
Changed to single getAllNodes() query and build caches in memory.

This was causing ~60 second 'Preparing resolver' delay.
Expected to reduce to <1 second.

Added DEBUG logs to measure:
- getUnresolvedReferences
- getAllNodes (warmCaches)
- Convert refs format (22K getNodeById calls)
- resolveAndPersist

This will identify where the 60s 'stuck' phase is happening.

Fixes by GPT-5.3 Codex addressing code review findings:

Critical fixes (P0):
- Add await to sync() resolution calls (prevents DB race conditions)
- Remove dual DB handle (eliminates connection leaks)
- Fix edge cleanup key parsing (handles IDs with colons)

Important fixes (P1):
- Fix graph traversal 'both' direction (correct neighbor selection)
- Fix type hierarchy descendants (separate visited sets)
- Add arrow function extraction support

Lower priority (P2):
- Add missing languages to config validation
- Fix VSS search LIMIT issue
- Fix toFloat32Array() data copying

Improvements:
- Restore getDetectedFrameworks() API
- Enhanced test coverage
- All 200 tests passing

Co-authored-by: GPT-5.3 Codex
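
To make the 'both' direction fix concrete, here is a minimal sketch of neighbor selection during a breadth-first traversal. It is an illustration under assumed types, not the repository's traversal code: `GraphEdge`, `nextNode`, and `bfs` are hypothetical. When following edges in both directions, the next node is whichever endpoint is not the current node.

```typescript
interface GraphEdge {
  source: string;
  target: string;
}
type Direction = "outgoing" | "incoming" | "both";

// Return the node on the far side of `edge` from `current`, or null if the
// edge should not be followed in the requested direction.
function nextNode(edge: GraphEdge, current: string, direction: Direction): string | null {
  const isOutgoing = edge.source === current;
  const isIncoming = edge.target === current;
  if (isOutgoing && direction !== "incoming") return edge.target;
  if (isIncoming && direction !== "outgoing") return edge.source;
  return null;
}

// Breadth-first traversal that returns nodes in visit order.
function bfs(edges: GraphEdge[], start: string, direction: Direction): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    order.push(current);
    for (const edge of edges) {
      const next = nextNode(edge, current, direction);
      if (next !== null && !visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// "b" has two incoming edges, so reaching "c" from "a" needs both directions.
const sampleEdges: GraphEdge[] = [
  { source: "a", target: "b" },
  { source: "c", target: "b" },
];
```

Starting from "a", an outgoing-only traversal stops at "b", while a 'both' traversal also reaches "c" through the incoming edge on "b".
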
Recovered from commit 76e6e7b. Guide contains best practices for
AI assistants using CodeGraph tools effectively.

colbymchenry added a commit that referenced this pull request Feb 10, 2026
- Fix Float32Array embedder bug: was creating zero-filled array instead
  of copying data from TypedArray-like objects
- Fix VSS search query: use subquery pattern so LIMIT applies before JOIN
- Pin tree-sitter versions: remove caret ranges for ABI stability, add
  overrides to lock tree-sitter core at 0.22.4
- Lazy grammar loading: load native bindings on first use per language
  instead of all at startup, so one missing grammar doesn't affect others
- Remove stale src/extraction/queries copy from copy-assets script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
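
The Float32Array fix in this commit can be illustrated as follows. The exact cause in the embedder is not shown in this thread, so this is a sketch of the general pitfall only. The function name `toFloat32Array` appears in the PR's commit list, but the body here is an assumption. Allocating by length yields zeros, whereas `Float32Array.from` copies the values.

```typescript
// Convert embedder output to a Float32Array, copying the values.
function toFloat32Array(data: ArrayLike<number>): Float32Array {
  if (data instanceof Float32Array) return data;
  // Float32Array.from copies element by element from any array-like value.
  // By contrast, `new Float32Array(data.length)` only allocates zeros.
  return Float32Array.from(data);
}

// A TypedArray-like object: indexed values plus a length, but not a real array.
const tensorLike: ArrayLike<number> = { length: 3, 0: 0.5, 1: -1, 2: 2 };
const vector = toFloat32Array(tensorLike);
```

A zero-filled embedding is easy to miss because nothing throws: similarity search simply returns meaningless scores.
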
colbymchenry added a commit that referenced this pull request Feb 10, 2026
- SQLite performance pragmas: synchronous=NORMAL, 64MB cache,
  memory temp store, 256MB mmap (safe with WAL mode)
- Batch insert for unresolved refs: single transaction instead of
  N individual inserts per file
- Symbol caching (warmCaches): pre-load all nodes into memory maps
  before resolution, eliminating repeated SQLite queries per ref
- Async file I/O: fs.stat/readFile in indexFile() are now non-blocking
- Denormalize filePath/language onto UnresolvedReference: avoids N
  node lookups during resolution, with schema migration v2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

colbymchenry added a commit that referenced this pull request Feb 10, 2026
- Fix arrow function extraction: explicitly call extractFunction() for
  arrow functions/function expressions in variable declarations instead
  of silently skipping them (all 6 arrow function tests now pass)
- Best-candidate resolution: collect candidates from all strategies and
  return highest confidence match instead of first match
- Fix graph traversal 'both' direction: correctly determine next node
  for mixed incoming/outgoing edges in BFS and DFS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
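
Best-candidate resolution, as described in this commit, might be sketched like this. The `Candidate` and `Strategy` types, the confidence values, and the example strategies are all hypothetical. The sketch only shows the difference from first-match resolution: every strategy runs, and the highest-confidence candidate wins regardless of order.

```typescript
interface Candidate {
  targetId: string;
  confidence: number;
  strategy: string;
}
type Strategy = (refName: string) => Candidate[];

// Run every strategy and keep the highest-confidence candidate overall.
function resolveBest(refName: string, strategies: Strategy[]): Candidate | null {
  let best: Candidate | null = null;
  for (const strategy of strategies) {
    for (const candidate of strategy(refName)) {
      if (best === null || candidate.confidence > best.confidence) best = candidate;
    }
  }
  return best;
}

// Hypothetical strategies for illustration only.
const byLocalName: Strategy = (name) => [
  { targetId: `local:${name}`, confidence: 0.6, strategy: "local" },
];
const byImport: Strategy = (name) => [
  { targetId: `import:${name}`, confidence: 0.9, strategy: "import" },
];
const noMatch: Strategy = () => [];

const best = resolveBest("login", [byLocalName, byImport, noMatch]);
```

A first-match resolver would have returned the lower-confidence local candidate here because `byLocalName` runs first.
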
colbymchenry (Owner) commented

Hey @rickross, thanks for this massive PR! While there were too many merge conflicts to merge directly, we went through it thoroughly and cherry-picked the most valuable changes into PR #19. This includes bug fixes (Float32Array, VSS query), performance improvements (SQLite pragmas, batch inserts, symbol caching), extraction quality fixes (arrow function handling, best-candidate resolution), the CLI uninit command, and MCP improvements. Appreciate the effort you put into this!

colbymchenry added a commit that referenced this pull request Feb 10, 2026
Port quality improvements from PR #15
jorgerobles pushed a commit to jorgerobles/codegraph that referenced this pull request Jun 1, 2026
jorgerobles pushed a commit to jorgerobles/codegraph that referenced this pull request Jun 1, 2026
jorgerobles pushed a commit to jorgerobles/codegraph that referenced this pull request Jun 1, 2026
jorgerobles pushed a commit to jorgerobles/codegraph that referenced this pull request Jun 1, 2026
3 participants