code2skill does not try to train a model over the repository. It builds a
small structural graph and a compact AST skeleton, then gives the LLM grounded
evidence for planning and Skill generation.
- AST path evidence: inspired by code2vec, which showed that paths through code structure are stronger signals than plain token lists.
- Program graph evidence: inspired by graph-based program representation work and Code Property Graphs, which combine syntax and semantic edges instead of treating code as isolated files.
- Data-flow evidence: inspired by GraphCodeBERT, which uses data-flow structure to connect variables and operations beyond lexical proximity.
- Python AST extraction records imports, exports, functions, classes, methods, route decorators, model/schema signals, call targets, type references, raised exceptions, dynamic imports, class attributes, and simple data-flow edges such as `scope:target<-source`.
- Import graph construction uses detailed `ImportInfo`, including `from ... import ...` names and dynamic imports, so package-level imports resolve to concrete internal files when possible.
- Symbol-aware dependency resolution maps extracted call targets, instantiated classes, type references, decorators, and raised exceptions back to internal files when they match an imported alias, a package re-export, or a unique repository symbol.
- File priority combines path heuristics with content evidence. Route, service, model, main-guard, call-target, type-reference, and data-flow signals can raise selection priority.
- Evidence coverage is summarized in the blueprint and project summary so users can see how many source files, symbols, routes, calls, types, flows, dynamic imports, exceptions, and dependency edges were captured.
- Planner prompts receive dependency, call, type, and flow evidence for core modules. Generation prompts use the same skeleton lines when large files are summarized instead of inlined.
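
To make the `scope:target<-source` edge format concrete, here is a minimal sketch of shallow data-flow extraction built on the standard-library `ast` module. It is not code2skill's actual extractor. The names `FlowCollector` and `extract_flows`, the `<module>` scope label, and the choice to skip callee names and non-name targets are all assumptions made for illustration.

```python
import ast


class FlowCollector(ast.NodeVisitor):
    """Collect shallow data-flow edges as 'scope:target<-source' strings."""

    def __init__(self):
        self.scope = ["<module>"]  # assumed label for top-level code
        self.edges = []

    def _targets(self, node):
        # Only plain names and tuple/list unpacking are treated as targets;
        # attributes, subscripts, and starred targets are ignored here.
        if isinstance(node, ast.Name):
            return [node.id]
        if isinstance(node, (ast.Tuple, ast.List)):
            return [name for elt in node.elts for name in self._targets(elt)]
        return []

    def _sources(self, node):
        # Identifiers read by an expression. The callee name of a direct
        # call such as open(...) is skipped so it does not become a source.
        found = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if isinstance(current, ast.Call):
                stack.extend(current.args)
                stack.extend(kw.value for kw in current.keywords)
                if not isinstance(current.func, ast.Name):
                    stack.append(current.func)
            elif isinstance(current, ast.Name):
                found.add(current.id)
            else:
                stack.extend(ast.iter_child_nodes(current))
        return sorted(found)

    def _link(self, target, source):
        for t in self._targets(target):
            for s in self._sources(source):
                self.edges.append(f"{self.scope[-1]}:{t}<-{s}")

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Assign(self, node):
        for target in node.targets:
            self._link(target, node.value)
        self.generic_visit(node)

    def visit_For(self, node):
        self._link(node.target, node.iter)
        self.generic_visit(node)

    def visit_With(self, node):
        for item in node.items:
            if item.optional_vars is not None:
                self._link(item.optional_vars, item.context_expr)
        self.generic_visit(node)


def extract_flows(source):
    """Parse Python source text and return its shallow data-flow edges."""
    collector = FlowCollector()
    collector.visit(ast.parse(source))
    return collector.edges


SAMPLE = '''
def load(path):
    with open(path) as handle:
        rows = parse(handle)
    for row in rows:
        total = row
    return total
'''

for edge in extract_flows(SAMPLE):
    print(edge)
```

For the sample function, the sketch emits `load:handle<-path`, `load:rows<-handle`, `load:row<-rows`, and `load:total<-row`. Each edge comes from a single statement, so nothing is tracked across function calls or branches.
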
The extractor is deliberately conservative. It records shallow data-flow edges from assignments, loops, and context managers, but it does not attempt full interprocedural static analysis, control-flow reconstruction, type inference, or runtime import evaluation. Missing or ambiguous evidence should still be marked as uncertain by generated Skills. Plain symbol references are linked only when the symbol is unique in the repository or tied to an import alias/re-export.
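
The linking rule for plain symbol references can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `build_symbol_index`, `resolve_symbol`, and the shapes of their arguments are hypothetical, and package re-exports are assumed to have already been folded into the `import_aliases` mapping.

```python
from collections import defaultdict


def build_symbol_index(definitions):
    """Map each symbol name to the set of files that define it.

    `definitions` is an assumed mapping of file path -> iterable of names.
    """
    index = defaultdict(set)
    for path, names in definitions.items():
        for name in names:
            index[name].add(path)
    return index


def resolve_symbol(name, import_aliases, symbol_index):
    """Return the internal file a reference points to, or None if uncertain."""
    # 1. A name bound by an import (or re-export) in the referencing file wins.
    if name in import_aliases:
        return import_aliases[name]
    # 2. Otherwise link only when exactly one file in the repository defines it.
    owners = symbol_index.get(name, set())
    if len(owners) == 1:
        return next(iter(owners))
    # 3. Ambiguous or unknown symbols stay unresolved instead of being guessed.
    return None


definitions = {
    "app/models.py": ["User", "Base"],
    "app/services.py": ["UserService", "Base"],
}
index = build_symbol_index(definitions)

print(resolve_symbol("User", {}, index))
print(resolve_symbol("Base", {}, index))
print(resolve_symbol("Base", {"Base": "app/models.py"}, index))
```

Here `User` resolves to `app/models.py` because it is unique. `Base` is defined in two files, so it resolves to `None` unless the referencing file imports it under a known alias. Returning `None` rather than the first match is what lets downstream prompts mark the evidence as uncertain.
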
- Alon et al., code2vec: Learning Distributed Representations of Code: https://arxiv.org/abs/1803.09473
- Allamanis et al., Learning to Represent Programs with Graphs: https://arxiv.org/abs/1711.00740
- Yamaguchi et al., Modeling and Discovering Vulnerabilities with Code Property Graphs: https://ieeexplore.ieee.org/document/6956581
- Guo et al., GraphCodeBERT: Pre-training Code Representations with Data Flow: https://arxiv.org/abs/2009.08366