Skip to content

spec(codegraph-rs): external-package references design#827

Closed
gyorgybalazsi wants to merge 21 commits into
colbymchenry:mainfrom
gyorgybalazsi:spec/external-packages
Closed

spec(codegraph-rs): external-package references design#827
gyorgybalazsi wants to merge 21 commits into
colbymchenry:mainfrom
gyorgybalazsi:spec/external-packages

Conversation

@gyorgybalazsi

Copy link
Copy Markdown

Summary

Design spec for resolving references into external dependency packages — the Daml SDK stdlib/prelude/script today (the 6,669 no_candidate calls + 20 unresolved DA.*/Prelude/Daml.Script imports left after PR #15), and any third-party DAR later. External symbols are ingested as resolution-only targets: marked external, excluded from embeddings and search, but valid CALLS/TYPE_OF/REFERENCES/IMPLEMENTS targets.

Adds docs/superpowers/specs/2026-06-12-codegraph-rs-external-packages-design.md.

Locked decisions

  • First-class Package layer + package-qualified resolution.
  • General mechanism driven by compiled artifacts (works for SDK stdlib and any DAR).
  • Ingestion vendors the proven prost-based Daml-LF decoder from the sibling canton-toolbox codegen crate (no from-scratch LF reader).
  • Dependency mapping from daml.yaml + DAR manifests; SDK packages are implicit deps.
  • External symbols resolution-only (excluded from embed/search); IMPLEMENTS/REQUIRES may target external interfaces.

Reviewed against source (codegraph-rs + canton-toolbox)

The spec was plan-verified and corrected before commit:

  • No schema_version bump — it's a strict-equality config gate (config/workspace.rs:95); bumping would break every existing workspace. The new Package(qid) constraint is additive/idempotent.
  • LF names are interned — module/def names are indices into the package string/dotted-name tables; the walk must dereference them (reuse canton-toolbox/.../resolve_type.rs).
  • Pin one LF version — the SDK ships stdlib/prim for multiple LF versions with identical module names; ingesting >1 breaks module lookup.
  • Dup-tolerant module lookupfind_module_by_qualified_name currently bails on >1; package-scoped lookup must return the in-scope match instead.
  • Local packages keyed synthetically (pkg:local:<name>) since they have no compiled package_id.

Phasing (slices)

11a model + visibility (additive, no behavior change) → 11b vendor LF decoder + DAR/DALF ingestion → 11c package-qualified resolution → 11d corpus validation.

Implementation has not started. Sign-off requested on the design.

gyorgybalazsi and others added 21 commits April 6, 2026 09:29
The embedding pipeline (ONNX model loading, batch embedding, vector
storage, cosine similarity search) was fully implemented but never
called from any user-facing code path. This wires it in:

- indexAll: generate embeddings after resolving references
- sync: generate embeddings for new/changed nodes
- MCP server: initialize embeddings on startup for semantic search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captures decisions from the brainstorming session: Rust + Neo4j Desktop +
fastembed-rs, workspace model with multiple roots, codegraph_rs_* MCP tools,
v1 languages Rust + Daml (via tree-sitter-haskell proxy), Grafeo evaluated
and rejected for visualization and maturity reasons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- qid no longer includes start_line; line-shift edits would have cascaded
  edge invalidation across the graph
- (:UnresolvedRef) is now a permanent source-of-truth record; resolution
  edges are derived/rebuilt from it. Fixes the cross-file edge corruption
  bug in incremental sync
- No separate Impl node; methods carry type prefix in qualified_name
- Neo4j 5 database naming rules documented; examples use dashes
- CREATE DATABASE privilege requirement called out explicitly
- Symbol multi-label requirement made explicit; Impl handling specified
- Edge confidence property formalized in schema
- Embedding write Cypher pattern made explicit
- Sync DETACH DELETE extended to UnresolvedRef nodes via HAS_REF
- DAML parse_reliable property added to schema, replacing risk-only mention
- Removed Protocol label (out of scope for v1 languages)
- Tool flow documented for MCP consumers
- CLI 'query' scope clarified
- Drop ignore_extra in TOML; .codegraphignore is sole escape hatch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Resolution rebuild edge-drop now type-restricted; was unindexed full-graph scan
- Resolution ordering made explicit (Phase A imports, then B calls/refs, then C name-match);
  Stage 1 had a chicken-and-egg dependency on IMPORTS that had to come first
- IMPORTS added to confidence-bearing edges table (it's resolution-derived, not structural)
- Edge bucket assignment in dataflow updated: HAS_REF as structural; IMPORTS as resolution-derived
- init order-of-operations spec'd: validate Neo4j + CREATE DATABASE + apply schema
  BEFORE writing config files (no half-initialized state on disk)
- --workspace nonexistent-dir behavior clarified for init vs other commands
- Embedding 'model' field clarified: forward-compat reservation, v1 rejects non-default values
- Node-label table reformatted with explicit "produced in v1?" column to disambiguate
  reserved-for-future labels from active ones
- ExtractionResult struct field naming made consistent across §3 dataflow and §6 trait
- UnresolvedRef.qid stability note added (it's position-dependent but always co-deleted)
- Tool input parameter conventions documented (root = name, kinds = labels from §4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ton)

12 tasks covering Cargo project init, CLI scaffolding, config types and
loaders, Neo4j connection wrapper, schema application, and the init/status
commands. Subsequent slices (extraction, resolution, vectors, sync, MCP,
Daml, CI) will be planned separately as each prior slice ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- max_last_indexed compile bug: `??` on Option<Row> doesn't compile in a
  function returning Result; replaced with explicit match handling both the
  zero-rows and NULL-value cases
- Connection::ping and single_count: corrected backwards .context message
  ordering (first context attaches to DB error, second to empty stream)
- Note that the build is intentionally broken between Task 4 steps 4 and 6
- Note about parallel-test env-var mutations and Rust 1.91 set_var warning
- Note clarifying "append" placement of new use statements

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 tasks covering: extraction module skeleton + deps; ExtractionResult
types (TDD); ignore-crate file walker (TDD); BLAKE3 hashing (TDD);
per-thread tree-sitter parser (TDD); Rust extractor for function/struct/
enum/trait/type/mod/use/call (TDD across 4 tasks); db::write_extraction
with batched UNWIND/MERGE; CLI integration; tiny-rust integration test.

Slice 2 produces File/Symbol/Function/Struct/Enum/Trait/TypeAlias/Module/
Import nodes with CONTAINS edges, plus UnresolvedRef records with HAS_REF
edges. No CALLS/IMPORTS-target/REFERENCES yet — those come in Slice 3
(resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A — Critical: enable neo4rs `json` feature flag in Task 1 (BoltType::From<serde_json::Value>
  only exists with that feature; Task 10's writer would fail to build otherwise).
  Also moved serde_json dep from Task 10 to Task 1 to consolidate.
B+D — Add content_hash + last_indexed_at to RawNode (File-only, Option-typed) and
  populate them in build_file_node via source_hash() + SystemTime; write them in the
  Cypher SET clause. Couples the hash module (Task 4) to a real consumer in slice 2;
  also matches the "What gets indexed" table claim that File nodes carry these.
C — Verification notes added to Tasks 8 and 9 about tree-sitter-rust 0.24 field names
  (`argument:` on use_declaration, `function:` on call_expression). If the test produces
  zero matches, the field name needs adjustment via node-types.json.
H — Fixed final test-count breakdown in Task 12 (30+13+2+2+1+1+1 = 50, not "~37").
I — Renamed the cli_smoke test from slice1_stub_commands_fail_with_not_implemented to
  stub_commands_fail_with_not_implemented in Task 11's instructions.

Plus: updated per-task "expected pass count" lines through Tasks 7, 8, 9 to reflect
the two new file-property tests added in Task 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11 tasks covering: resolution module skeleton; ResolutionStats type (TDD);
db::queries read helpers; idempotent rebuild (drop confidence-bearing edges +
reset UnresolvedRef state); Phase A (resolve Import nodes -> IMPORTS edges);
Phase B (import-driven candidate lookup for calls/type_refs/references);
Phase C (workspace name-match fallback with ambiguity detection); resolve()
orchestrator; CLI integration (resolution runs after extraction in `index`);
integration test verifying CALLS edges on tiny-rust; README + tag.

Slice 3 produces (:Function|:Method)-[:CALLS]->(...), (:File)-[:IMPORTS]->(:Module),
plus REFERENCES and TYPE_OF edges, all with a `confidence` property. Unresolvable
refs persist with `unresolved_reason` set. Idempotent on each `index` invocation.

Known limitation: slice 2's call-site extraction loses path prefixes
(`util::helper()` becomes leaf name `helper`), so resolution is leaf-name-only.
Real codebases with name collisions will see `unresolved_reason = "ambiguous"`.
Tiny-rust passes cleanly because its function names are globally unique.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks covering: sync module skeleton; FileDiff types + diff algorithm
(TDD); db::queries::list_files_in_root; db::write::delete_file_subtree;
sync orchestrator (walk/hash/diff/delete/extract/resolve); CLI integration;
integration test for add/modify/delete lifecycle on tiny-rust; README + tag.

Slice 5 makes `codegraph-rs sync` work end-to-end: re-running on unchanged
workspace is cheap (no-op), per-file changes update only what changed,
resolution edges self-heal across edits via the global rebuild from §7.

Reuses existing infrastructure: BLAKE3 hashing (slice 2), file walker
(slice 2), extract_rust + write_extraction (slice 2), resolve (slice 3).
Only new code is the diff algorithm + per-file delete + orchestrator
glue + CLI command.

Skipping Slice 4 (vector embeddings) for now — vectors are inert until
slice 6 (MCP) lands semantic search tools, while sync removes a real
workflow blocker (every `index` currently does full re-extraction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Orchestrator hash phase now matches slice 2's warn-and-skip behavior for
  unreadable files (was propagating via `?`, killing the whole sync on first
  bad read). Unreadable files that are already in the DB have their DB hash
  preserved in fs_hashes so the diff treats them as Unchanged, preventing
  transient I/O failures from triggering destructive Delete actions.
- Source contents are stashed (`fs_sources: HashMap<String, String>`) during
  the walk-and-hash pass, eliminating the double-read previously done during
  extraction. Saves one read per added/changed file.
- Out-of-scope section now explicitly notes:
    (a) full-workspace re-resolution cost — `resolve()` rebuilds for the
        whole graph on every sync, not just affected files (spec §7 v1
        acknowledged limitation; future-slice optimization).
    (b) concurrent invocations are NOT locked — same caveat as `index`;
        documented as "don't run two at once" per spec §9.
- Rephrased forward references to use "the Daml slice" / "the embeddings
  slice" / "the MCP slice" instead of slice numbers (slice 4 was skipped
  in the original 8-slice numbering, so absolute references became
  misleading).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-slice v2.0 push deepening Daml extraction:

- slice-10a-daml-structure: replace regex pass_d with tree-sitter
  queries that filter (variable) nodes by Daml keyword. Add NodeKinds
  Interface, Viewtype, Exception, Field. Templates/Choices get
  byte-precise spans; Fields become first-class CONTAINS-children.

- slice-10b-daml-edges: extract signatory/controller/observer party
  expressions, plus implements/requires between Templates/Interfaces.
  5 new EdgeKinds, 5 new UnresolvedKinds, new scope_qid resolution
  phase (A.5) for scoped name lookup.

Approach validated by inspection of DLC-link/zed-daml-lsp, which uses
tree-sitter-haskell as a proxy grammar identical to our v1 approach and
detects Daml keywords via #any-of? on (variable) nodes. Spec rules out
contract keys (deprecated in Daml 2.x) and a real tree-sitter-daml
grammar (multi-month external project, no community version exists).

Schema additions are fully additive; schema_version stays at 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 10 issues found in the self-review pass:

Critical:
- §3.2: spell out Field-vs-top-level-signature disambiguation. The
  (signature name: ... type: ...) query matches both with-block fields
  AND top-level function signatures; the extractor must check the
  ancestor chain at emit time and only emit Field when nested under a
  Template/Choice/Interface/Exception keyword anchor.
- §3.3: spell out multi-party expression decomposition. Add a sub-walk
  step that extracts identifier leaves from signatory/controller/
  observer argument subtrees, filtering known wrapper identifiers
  (Set.fromList, fromOptional, etc.). Handle [a, b], $ Set.fromList ...,
  and `parties config` shapes.
- §5.1: flag Daml-version-dependent fixture syntax. The 2.7+ form of
  `interface instance` inside a template body is not universal; Task 1
  of slice 10a now includes a daml-compile verification step.

Important:
- §2.1: spell out Field qualified_name shape for Choice and Interface
  containers (not just Template).
- §4.1 / §4.2: align the fall-through wording. Phase A.5 → Phase C
  (skipping B, since party names are scope-local not workspace-imported).
  Also document the HashMap algorithm explicitly.
- §3.2: add note on curried `(apply)` shape — haskell's left-associative
  application produces nested apply chains; queries match the innermost
  apply by default.
- §2.1: justify Viewtype-as-NodeKind (vs the cleaner edge model) with
  an explicit trade-off note.

Minor:
- §5.2: fix test count discrepancies; add 4 new tests (qualified-name
  nesting under Choice, disambiguation guard, multi-party decomposition
  variants).
- §1.2: drop confusing "literal Party" phrase.
- §2.4 / §4.3: deduplicate; §2.4 now points at §4.3 for the Rust ↔
  Neo4j compat story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan for v2.0 slice 10a per the v2-daml-depth spec.
8 tasks: NodeKind additions, tree-sitter queries for templates /
choices / interfaces / viewtypes / exceptions / fields (with
ancestor-span disambiguation), regex Pass D deletion + fixture
extension + integration test, README + tag.

Plan self-reviewed twice. Second pass fixed:
- Process narration leaked into Tasks 2-3 (cleaned)
- Function name mismatch (pass_e_template_tree_sitter → pass_e_templates)
- Probe-step disconnect: probe shape must drive query shape, not be
  "adjusted later"
- Cross-task pass_d state summary added so readers don't have to
  reconstruct what's commented out vs deleted
- Byte-precise test strengthened (end_col > 0 distinguishes
  tree-sitter from regex; >= 7 was permissive of either)
- Probe test disposition explicit (REMOVE before commit)
- Field count assertion 8 → 6 with per-source breakdown
- Function signature variation across pass_e_* documented
- Task 4 ordering callout (interfaces before viewtypes; reversal
  silently produces zero Viewtype nodes)
- Daml-compile fallback path explicit (target Daml 2.7+ documented
  in fixture comment when CLI absent)
- Per-task regression guard step added (slice-9 integration test
  must still SKIP cleanly after extractor changes)

Final test count target: 115 (97 baseline + 18 new unit tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan for v2.0 slice 10b per the v2-daml-depth spec.
8 tasks: EdgeKind/UnresolvedKind additions, scope_qid field on
UnresolvedRef, signatory/controller/observer extraction with
multi-party decomposition, implements/requires extraction, Phase A.5
(scoped name match) resolution, orchestrator wiring, integration test
extensions, README + tag.

Builds on slice 10a's tree-sitter+tree-walk pattern; reuses helpers
enclosing_template_qid / enclosing_interface_qid / enclosing_choice_qid
/ first_identifier_in_subtree. Adds party_identifiers_in_subtree with
PARTY_WRAPPER_FILTERS for Set.fromList / fromOptional / etc.

Resolution Phase A.5 is a new file (src/resolution/phase_a5_scoped_match.rs);
runs between Phase A (imports) and Phase B (import-driven name match).
Builds one HashMap<qualified_name, qid> per resolve invocation;
party-kind refs resolve by `<scope_qname>.<unresolved_name>`,
implements/requires by suffix-match against workspace-wide Interface
qnames.

Final test count target: 133 (118 baseline + 10 + 2 + 6 + 3 = 19 new
unit tests + 0 new integration tests, since integration_daml_index
just gains Cypher assertions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 8 issues found in the self-review pass:

Critical:
- Test count math wrong (133 → 139 throughout). Task 3 adds 6 unit
  tests; the plan's later tasks consistently said "133" as if Task 3
  added 0.
- Task 7 modifies the fixture (adds `interface Owned requires
  Authorizable`) but didn't update the inherited slice 10a assertion
  `interface_count == 1`. New Step 2 in Task 7 explicitly bumps
  inherited assertions; remaining steps renumbered.

Important:
- Controller scope changed from Template-only to Choice-first with
  Phase A.5 fallback to parent Template. This handles the real-world
  case `choice Foo with x : Int controller x` (where x is a Choice
  field, not a Template field) without breaking the integration test
  (Bump's `controller delegate` references a Template field, resolved
  via the fallback). Test in Task 3 updated to assert Choice scope.
- Task 4 query examples now show both flat (Shape A) and nested apply-
  chain (Shape B) forms so the implementer doesn't commit to one shape
  before the probe.

Logical gaps:
- Phase A.5 doc comment documents the stale-edge limitation (MERGE is
  idempotent for same source+target but doesn't clear edges to removed
  targets when signatory expressions change between syncs). Acknowledges
  this is a v1-era resolution model issue, not new to 10b.

Minor:
- `info!` log line split into two records (resolution + auth) to fit
  reasonable terminal widths.
- Task 2 Step 4 clarifies that `cargo build` is the safety net for
  missed UnresolvedRef construction sites, not grep alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design for resolving references into external dependency packages (Daml
SDK stdlib/prelude/script today; any DAR later). External symbols are
ingested as resolution-only targets — marked `external`, excluded from
embeddings and search.

Key decisions: first-class Package layer + package-qualified resolution;
general DAR/DALF-metadata ingestion via a vendored copy of the
canton-toolbox prost LF decoder; dependency mapping from daml.yaml + DAR
manifests; IMPLEMENTS/REQUIRES may target external interfaces.

Reviewed against codegraph-rs + canton-toolbox source: no schema_version
bump (strict-equality config gate would break existing workspaces); LF
names are interned and need table dereferencing; pin one LF version to
avoid duplicate stdlib module names; dup-tolerant module lookup; local
packages keyed synthetically (no compiled package_id).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gyorgybalazsi

Copy link
Copy Markdown
Author

Closing — opened against the wrong base by mistake (meant for my own fork gyorgybalazsi/codegraph, not upstream). Apologies for the noise.

@gyorgybalazsi gyorgybalazsi deleted the spec/external-packages branch June 12, 2026 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant