docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183
docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183dimitri-yatsenko wants to merge 2 commits into
Conversation
…venance Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3: - Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(), walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the Cascade Specification. Supports trace[TableClass] returning a pre-restricted QueryExpression, and trace[str] returning a FreeTable. - self.upstream — property on AutoPopulate that the framework sets to Diagram.trace(self & key) before each make() invocation. Pre-restricted ancestor access for ergonomic, provenance-safe reads. - dj.config["strict_provenance"] — runtime flag (default False). When True, gates make() so reads must go through self.upstream and writes must target self or self's Parts with key-consistent primary keys. Spec structure: - Why this exists — the convention/enforcement gap, why three pieces. - Concepts — trace as cascade's mirror, OR convergence, the make() provenance boundary, why upward rules are shared with cascade. - §1 Diagram.trace — signature, behavior, indexing (class + string), counts() / heading() / iter, worked examples. - §2 self.upstream — construction lifecycle, allowed table set, per-key behavior, examples. - §3 strict_provenance — config key, read enforcement table, write enforcement table, key-consistency rule, opt-in rationale, examples of compliant + violating make() bodies. - Integration — end-to-end strict-mode example, properties guaranteed. - Migration path — staged adoption from on-default-off to enforcement. - Out of scope — static analysis, default flip, downstream provenance metadata persistence. Examples use core DataJoint types (int32, float64, longblob) per project convention. Cross-links to cascade.md (shared upward rules), autopopulate.md, diagram.md, entity-integrity.md. Nav entry under Reference > Specifications > Data Operations between AutoPopulate and Job Metadata. Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c (#1425) implementations land against this spec.
MilagrosMarin
left a comment
There was a problem hiding this comment.
Read carefully. This is a thoughtful spec — the three-piece architecture (graph operation + ergonomic surface + enforcement layer) is well-motivated, and the migration path gives teams a concrete 5-step rollout. A few observations worth raising before the implementation work begins:
Context
This is a forward-looking spec for three unshipped features. All three referenced issues (#1423/#1424/#1425) exist as OPEN in datajoint-python. No implementation exists in source yet — grep returned zero matches for load_all_upstream, def trace, self.upstream, strict_provenance across the relevant modules.
Sequencing concern
cascade.md cross-link will be broken until #182 lands. The spec links to cascade.md repeatedly:
- Line 14: "see Cascade Specification"
- Line 44: "upward propagation rules (U1, U2, U3) defined in the Cascade Specification"
- Line 179: "paralleling Cascade Spec §Worked Example 1"
- Line 404: References
These are load-bearing — the upward rules are defined exclusively in #182. Sequencing requirement: #182 must merge first (or batch together).
Observations
1. Forward-looking surface needs a more prominent banner. The current "version-added" admonition at line 11 helps but doesn't convey "implementation lands across three separate PRs — this is the design target, not shipped behavior". An "Implementation status" admonition near the top (like the one #177 eventually adopted) would make this clearer. A reader could mistake the current framing for "available now in 2.3".
2. load_all_upstream is also unshipped. Line 84 references connection.dependencies.load_all_upstream() as if it exists; it doesn't. The spec should note this is also new (or cite the issue/PR where it lands).
3. Dangling "Implementation notes" reference. Line 245 says "the value seen depends on thread-local context — see Implementation notes". There is no "Implementation notes" section in the spec. Either add the section or drop the reference.
4. AND-flavored upstream dismissed too quickly. Line 40: "The AND-flavored upstream analog — 'ancestors that contributed via every path' — is not a useful query and is not provided." Strong claim. Audit / intersection use cases might want it. Softer wording: "… not provided in 2.3" or "… not currently considered useful for typical compute pipelines".
5. Strict-provenance edge cases worth addressing:
- Does
connection.query(raw_sql)get blocked under strict? - What about an audit log inside
make()? Under strict, a team can't write toAuditLog. Suggesting "log to a Part ofself" or "use a temporary escape via context manager" would close this loop. - The runtime mechanism — overriding
fetch/insert, descriptor, AST analysis? A brief implementation notes paragraph (resolving the dangling reference from #3) would close this.
6. make_kwargs mention without definition. Line 235: "make_kwargs and key are unaffected." — make_kwargs isn't defined elsewhere in the spec. Readers unfamiliar with this internal will be confused.
7. Caching/lazy behavior of self.upstream. Line 187 says construction is lazy. If make() accesses self.upstream[Session] twice, are both fetches independent SELECTs, or is there per-call caching? Worth one sentence either way.
8. Concurrency clarification at line 245. "the value seen depends on thread-local context" — dj.config["strict_provenance"] reads from a module-level dict, which is global, not thread-local. The implementation might need to make it thread-local but the spec should say so explicitly.
9. self.PartName vs self.PartName() inconsistency. Examples use both forms in different places (e.g., line 257 "Reading self or self.PartName" vs line 309 self.Bin.insert1(...)). Pick one canonical form.
10. Cosmetic: example error message at line 326-329 shows key['subject_id'] (Python expression) rather than the runtime value. Real error messages would have the value substituted in.
Strengths to highlight
- The cascade↔trace comparison table (line 34) is a clear pedagogical anchor
- Three-piece argument (line 20-26) makes the case for why all three are needed together
- "Why opt-in" rationale (lines 285-289) is well-argued — production vs development is the right framing
- 5-step migration path is concrete and incremental
- "What is not in this specification" appropriately defers static analysis and default-flip
Holding off formal approval until the cross-link dependency on #182 is resolved and the forward-looking framing is sharpened. The spec content is solid; this is mostly about presentation and sequencing.
Summary
Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3. Implementation against this spec follows in three sub-tasks (T2.2.a/b/c):
Spec structure
Examples use core DataJoint types (`int32`, `float64`, `longblob`) per project convention.
Why a spec-first approach
The three sub-tasks (T2.2.a/b/c) form a tightly-coupled feature where partial shipping is worse than not shipping (you can't enforce against an upstream view that doesn't exist; you can't expose `self.upstream` ergonomically without the trace primitive). Writing the spec first locks the design contract that the three PRs implement against, surfaces design questions early, and gives reviewers a single coherent artifact to evaluate before the code arrives.
Cross-links
Sequencing
Reviewable now as a design document. The three implementation PRs (datajoint-python) will land against this spec. Order:
Test plan