spec(codegraph-rs): external-package references design#827
Closed
gyorgybalazsi wants to merge 21 commits into
Closed
spec(codegraph-rs): external-package references design#827gyorgybalazsi wants to merge 21 commits into
gyorgybalazsi wants to merge 21 commits into
Conversation
The embedding pipeline (ONNX model loading, batch embedding, vector storage, cosine similarity search) was fully implemented but never called from any user-facing code path. This wires it in: - indexAll: generate embeddings after resolving references - sync: generate embeddings for new/changed nodes - MCP server: initialize embeddings on startup for semantic search Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captures decisions from the brainstorming session: Rust + Neo4j Desktop + fastembed-rs, workspace model with multiple roots, codegraph_rs_* MCP tools, v1 languages Rust + Daml (via tree-sitter-haskell proxy), Grafeo evaluated and rejected for visualization and maturity reasons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- qid no longer includes start_line; line-shift edits would have cascaded edge invalidation across the graph - (:UnresolvedRef) is now a permanent source-of-truth record; resolution edges are derived/rebuilt from it. Fixes the cross-file edge corruption bug in incremental sync - No separate Impl node; methods carry type prefix in qualified_name - Neo4j 5 database naming rules documented; examples use dashes - CREATE DATABASE privilege requirement called out explicitly - Symbol multi-label requirement made explicit; Impl handling specified - Edge confidence property formalized in schema - Embedding write Cypher pattern made explicit - Sync DETACH DELETE extended to UnresolvedRef nodes via HAS_REF - DAML parse_reliable property added to schema, replacing risk-only mention - Removed Protocol label (out of scope for v1 languages) - Tool flow documented for MCP consumers - CLI 'query' scope clarified - Drop ignore_extra in TOML; .codegraphignore is sole escape hatch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Resolution rebuild edge-drop now type-restricted; was unindexed full-graph scan - Resolution ordering made explicit (Phase A imports, then B calls/refs, then C name-match); Stage 1 had a chicken-and-egg dependency on IMPORTS that had to come first - IMPORTS added to confidence-bearing edges table (it's resolution-derived, not structural) - Edge bucket assignment in dataflow updated: HAS_REF as structural; IMPORTS as resolution-derived - init order-of-operations spec'd: validate Neo4j + CREATE DATABASE + apply schema BEFORE writing config files (no half-initialized state on disk) - --workspace nonexistent-dir behavior clarified for init vs other commands - Embedding 'model' field clarified: forward-compat reservation, v1 rejects non-default values - Node-label table reformatted with explicit "produced in v1?" column to disambiguate reserved-for-future labels from active ones - ExtractionResult struct field naming made consistent across §3 dataflow and §6 trait - UnresolvedRef.qid stability note added (it's position-dependent but always co-deleted) - Tool input parameter conventions documented (root = name, kinds = labels from §4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ton) 12 tasks covering Cargo project init, CLI scaffolding, config types and loaders, Neo4j connection wrapper, schema application, and the init/status commands. Subsequent slices (extraction, resolution, vectors, sync, MCP, Daml, CI) will be planned separately as each prior slice ships. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- max_last_indexed compile bug: `??` on Option<Row> doesn't compile in a function returning Result; replaced with explicit match handling both the zero-rows and NULL-value cases - Connection::ping and single_count: corrected backwards .context message ordering (first context attaches to DB error, second to empty stream) - Note that the build is intentionally broken between Task 4 steps 4 and 6 - Note about parallel-test env-var mutations and Rust 1.91 set_var warning - Note clarifying "append" placement of new use statements Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 tasks covering: extraction module skeleton + deps; ExtractionResult types (TDD); ignore-crate file walker (TDD); BLAKE3 hashing (TDD); per-thread tree-sitter parser (TDD); Rust extractor for function/struct/ enum/trait/type/mod/use/call (TDD across 4 tasks); db::write_extraction with batched UNWIND/MERGE; CLI integration; tiny-rust integration test. Slice 2 produces File/Symbol/Function/Struct/Enum/Trait/TypeAlias/Module/ Import nodes with CONTAINS edges, plus UnresolvedRef records with HAS_REF edges. No CALLS/IMPORTS-target/REFERENCES yet — those come in Slice 3 (resolution). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A — Critical: enable neo4rs `json` feature flag in Task 1 (BoltType::From<serde_json::Value> only exists with that feature; Task 10's writer would fail to build otherwise). Also moved serde_json dep from Task 10 to Task 1 to consolidate. B+D — Add content_hash + last_indexed_at to RawNode (File-only, Option-typed) and populate them in build_file_node via source_hash() + SystemTime; write them in the Cypher SET clause. Couples the hash module (Task 4) to a real consumer in slice 2; also matches the "What gets indexed" table claim that File nodes carry these. C — Verification notes added to Tasks 8 and 9 about tree-sitter-rust 0.24 field names (`argument:` on use_declaration, `function:` on call_expression). If the test produces zero matches, the field name needs adjustment via node-types.json. H — Fixed final test-count breakdown in Task 12 (30+13+2+2+1+1+1 = 50, not "~37"). I — Renamed the cli_smoke test from slice1_stub_commands_fail_with_not_implemented to stub_commands_fail_with_not_implemented in Task 11's instructions. Plus: updated per-task "expected pass count" lines through Tasks 7, 8, 9 to reflect the two new file-property tests added in Task 6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11 tasks covering: resolution module skeleton; ResolutionStats type (TDD); db::queries read helpers; idempotent rebuild (drop confidence-bearing edges + reset UnresolvedRef state); Phase A (resolve Import nodes -> IMPORTS edges); Phase B (import-driven candidate lookup for calls/type_refs/references); Phase C (workspace name-match fallback with ambiguity detection); resolve() orchestrator; CLI integration (resolution runs after extraction in `index`); integration test verifying CALLS edges on tiny-rust; README + tag. Slice 3 produces (:Function|:Method)-[:CALLS]->(...), (:File)-[:IMPORTS]->(:Module), plus REFERENCES and TYPE_OF edges, all with a `confidence` property. Unresolvable refs persist with `unresolved_reason` set. Idempotent on each `index` invocation. Known limitation: slice 2's call-site extraction loses path prefixes (`util::helper()` becomes leaf name `helper`), so resolution is leaf-name-only. Real codebases with name collisions will see `unresolved_reason = "ambiguous"`. Tiny-rust passes cleanly because its function names are globally unique. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks covering: sync module skeleton; FileDiff types + diff algorithm (TDD); db::queries::list_files_in_root; db::write::delete_file_subtree; sync orchestrator (walk/hash/diff/delete/extract/resolve); CLI integration; integration test for add/modify/delete lifecycle on tiny-rust; README + tag. Slice 5 makes `codegraph-rs sync` work end-to-end: re-running on unchanged workspace is cheap (no-op), per-file changes update only what changed, resolution edges self-heal across edits via the global rebuild from §7. Reuses existing infrastructure: BLAKE3 hashing (slice 2), file walker (slice 2), extract_rust + write_extraction (slice 2), resolve (slice 3). Only new code is the diff algorithm + per-file delete + orchestrator glue + CLI command. Skipping Slice 4 (vector embeddings) for now — vectors are inert until slice 6 (MCP) lands semantic search tools, while sync removes a real workflow blocker (every `index` currently does full re-extraction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Orchestrator hash phase now matches slice 2's warn-and-skip behavior for
unreadable files (was propagating via `?`, killing the whole sync on first
bad read). Unreadable files that are already in the DB have their DB hash
preserved in fs_hashes so the diff treats them as Unchanged, preventing
transient I/O failures from triggering destructive Delete actions.
- Source contents are stashed (`fs_sources: HashMap<String, String>`) during
the walk-and-hash pass, eliminating the double-read previously done during
extraction. Saves one read per added/changed file.
- Out-of-scope section now explicitly notes:
(a) full-workspace re-resolution cost — `resolve()` rebuilds for the
whole graph on every sync, not just affected files (spec §7 v1
acknowledged limitation; future-slice optimization).
(b) concurrent invocations are NOT locked — same caveat as `index`;
documented as "don't run two at once" per spec §9.
- Rephrased forward references to use "the Daml slice" / "the embeddings
slice" / "the MCP slice" instead of slice numbers (slice 4 was skipped
in the original 8-slice numbering, so absolute references became
misleading).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-slice v2.0 push deepening Daml extraction: - slice-10a-daml-structure: replace regex pass_d with tree-sitter queries that filter (variable) nodes by Daml keyword. Add NodeKinds Interface, Viewtype, Exception, Field. Templates/Choices get byte-precise spans; Fields become first-class CONTAINS-children. - slice-10b-daml-edges: extract signatory/controller/observer party expressions, plus implements/requires between Templates/Interfaces. 5 new EdgeKinds, 5 new UnresolvedKinds, new scope_qid resolution phase (A.5) for scoped name lookup. Approach validated by inspection of DLC-link/zed-daml-lsp, which uses tree-sitter-haskell as a proxy grammar identical to our v1 approach and detects Daml keywords via #any-of? on (variable) nodes. Spec rules out contract keys (deprecated in Daml 2.x) and a real tree-sitter-daml grammar (multi-month external project, no community version exists). Schema additions are fully additive; schema_version stays at 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 10 issues found in the self-review pass: Critical: - §3.2: spell out Field-vs-top-level-signature disambiguation. The (signature name: ... type: ...) query matches both with-block fields AND top-level function signatures; the extractor must check the ancestor chain at emit time and only emit Field when nested under a Template/Choice/Interface/Exception keyword anchor. - §3.3: spell out multi-party expression decomposition. Add a sub-walk step that extracts identifier leaves from signatory/controller/ observer argument subtrees, filtering known wrapper identifiers (Set.fromList, fromOptional, etc.). Handle [a, b], $ Set.fromList ..., and `parties config` shapes. - §5.1: flag Daml-version-dependent fixture syntax. The 2.7+ form of `interface instance` inside a template body is not universal; Task 1 of slice 10a now includes a daml-compile verification step. Important: - §2.1: spell out Field qualified_name shape for Choice and Interface containers (not just Template). - §4.1 / §4.2: align the fall-through wording. Phase A.5 → Phase C (skipping B, since party names are scope-local not workspace-imported). Also document the HashMap algorithm explicitly. - §3.2: add note on curried `(apply)` shape — haskell's left-associative application produces nested apply chains; queries match the innermost apply by default. - §2.1: justify Viewtype-as-NodeKind (vs the cleaner edge model) with an explicit trade-off note. Minor: - §5.2: fix test count discrepancies; add 4 new tests (qualified-name nesting under Choice, disambiguation guard, multi-party decomposition variants). - §1.2: drop confusing "literal Party" phrase. - §2.4 / §4.3: deduplicate; §2.4 now points at §4.3 for the Rust ↔ Neo4j compat story. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan for v2.0 slice 10a per the v2-daml-depth spec. 8 tasks: NodeKind additions, tree-sitter queries for templates / choices / interfaces / viewtypes / exceptions / fields (with ancestor-span disambiguation), regex Pass D deletion + fixture extension + integration test, README + tag. Plan self-reviewed twice. Second pass fixed: - Process narration leaked into Tasks 2-3 (cleaned) - Function name mismatch (pass_e_template_tree_sitter → pass_e_templates) - Probe-step disconnect: probe shape must drive query shape, not be "adjusted later" - Cross-task pass_d state summary added so readers don't have to reconstruct what's commented out vs deleted - Byte-precise test strengthened (end_col > 0 distinguishes tree-sitter from regex; >= 7 was permissive of either) - Probe test disposition explicit (REMOVE before commit) - Field count assertion 8 → 6 with per-source breakdown - Function signature variation across pass_e_* documented - Task 4 ordering callout (interfaces before viewtypes; reversal silently produces zero Viewtype nodes) - Daml-compile fallback path explicit (target Daml 2.7+ documented in fixture comment when CLI absent) - Per-task regression guard step added (slice-9 integration test must still SKIP cleanly after extractor changes) Final test count target: 115 (97 baseline + 18 new unit tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan for v2.0 slice 10b per the v2-daml-depth spec. 8 tasks: EdgeKind/UnresolvedKind additions, scope_qid field on UnresolvedRef, signatory/controller/observer extraction with multi-party decomposition, implements/requires extraction, Phase A.5 (scoped name match) resolution, orchestrator wiring, integration test extensions, README + tag. Builds on slice 10a's tree-sitter+tree-walk pattern; reuses helpers enclosing_template_qid / enclosing_interface_qid / enclosing_choice_qid / first_identifier_in_subtree. Adds party_identifiers_in_subtree with PARTY_WRAPPER_FILTERS for Set.fromList / fromOptional / etc. Resolution Phase A.5 is a new file (src/resolution/phase_a5_scoped_match.rs); runs between Phase A (imports) and Phase B (import-driven name match). Builds one HashMap<qualified_name, qid> per resolve invocation; party-kind refs resolve by `<scope_qname>.<unresolved_name>`, implements/requires by suffix-match against workspace-wide Interface qnames. Final test count target: 133 (118 baseline + 10 + 2 + 6 + 3 = 19 new unit tests + 0 new integration tests, since integration_daml_index just gains Cypher assertions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 8 issues found in the self-review pass: Critical: - Test count math wrong (133 → 139 throughout). Task 3 adds 6 unit tests; the plan's later tasks consistently said "133" as if Task 3 added 0. - Task 7 modifies the fixture (adds `interface Owned requires Authorizable`) but didn't update the inherited slice 10a assertion `interface_count == 1`. New Step 2 in Task 7 explicitly bumps inherited assertions; remaining steps renumbered. Important: - Controller scope changed from Template-only to Choice-first with Phase A.5 fallback to parent Template. This handles the real-world case `choice Foo with x : Int controller x` (where x is a Choice field, not a Template field) without breaking the integration test (Bump's `controller delegate` references a Template field, resolved via the fallback). Test in Task 3 updated to assert Choice scope. - Task 4 query examples now show both flat (Shape A) and nested apply- chain (Shape B) forms so the implementer doesn't commit to one shape before the probe. Logical gaps: - Phase A.5 doc comment documents the stale-edge limitation (MERGE is idempotent for same source+target but doesn't clear edges to removed targets when signatory expressions change between syncs). Acknowledges this is a v1-era resolution model issue, not new to 10b. Minor: - `info!` log line split into two records (resolution + auth) to fit reasonable terminal widths. - Task 2 Step 4 clarifies that `cargo build` is the safety net for missed UnresolvedRef construction sites, not grep alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design for resolving references into external dependency packages (Daml SDK stdlib/prelude/script today; any DAR later). External symbols are ingested as resolution-only targets — marked `external`, excluded from embeddings and search. Key decisions: first-class Package layer + package-qualified resolution; general DAR/DALF-metadata ingestion via a vendored copy of the canton-toolbox prost LF decoder; dependency mapping from daml.yaml + DAR manifests; IMPLEMENTS/REQUIRES may target external interfaces. Reviewed against codegraph-rs + canton-toolbox source: no schema_version bump (strict-equality config gate would break existing workspaces); LF names are interned and need table dereferencing; pin one LF version to avoid duplicate stdlib module names; dup-tolerant module lookup; local packages keyed synthetically (no compiled package_id). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
|
Closing — opened against the wrong base by mistake (meant for my own fork gyorgybalazsi/codegraph, not upstream). Apologies for the noise. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Design spec for resolving references into external dependency packages — the Daml SDK stdlib/prelude/script today (the 6,669
no_candidatecalls + 20 unresolvedDA.*/Prelude/Daml.Scriptimports left after PR #15), and any third-party DAR later. External symbols are ingested as resolution-only targets: markedexternal, excluded from embeddings and search, but validCALLS/TYPE_OF/REFERENCES/IMPLEMENTStargets.Adds
docs/superpowers/specs/2026-06-12-codegraph-rs-external-packages-design.md.Locked decisions
Packagelayer + package-qualified resolution.canton-toolboxcodegencrate (no from-scratch LF reader).daml.yaml+ DAR manifests; SDK packages are implicit deps.IMPLEMENTS/REQUIRESmay target external interfaces.Reviewed against source (codegraph-rs + canton-toolbox)
The spec was plan-verified and corrected before commit:
schema_versionbump — it's a strict-equality config gate (config/workspace.rs:95); bumping would break every existing workspace. The newPackage(qid)constraint is additive/idempotent.canton-toolbox/.../resolve_type.rs).find_module_by_qualified_namecurrently bails on >1; package-scoped lookup must return the in-scope match instead.pkg:local:<name>) since they have no compiledpackage_id.Phasing (slices)
11a model + visibility (additive, no behavior change) → 11b vendor LF decoder + DAR/DALF ingestion → 11c package-qualified resolution → 11d corpus validation.
Implementation has not started. Sign-off requested on the design.