docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance by dimitri-yatsenko · Pull Request #183 · datajoint/datajoint-docs

dimitri-yatsenko · 2026-06-10T23:08:09Z

Summary

Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3. Implementation against this spec follows in three sub-tasks (T2.2.a/b/c):

`Diagram.trace(table_expr)` (#1423) — upstream mirror of `Diagram.cascade()`. Walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the just-merged Cascade Specification.
`self.upstream` (#1424) — property on `AutoPopulate` that the framework sets to `Diagram.trace(self & key)` before each `make()`. Pre-restricted ancestor access for ergonomic, provenance-safe reads.
`dj.config["strict_provenance"]` (#1425) — runtime flag (default `False`). When `True`, gates `make()` so reads must go through `self.upstream` and writes must target `self` or `self`'s Parts with key-consistent primary keys.
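For reviewers who want a concrete picture of OR convergence, here is a toy, pure-Python model of the upstream walk. It does not use DataJoint: the `TABLES` and `FOREIGN_KEYS` structures, the table names, and the `trace_upstream` helper are all hypothetical, and the real `Diagram.trace` is specified in terms of the U1/U2/U3 rules rather than this simplified row matching. The sketch only illustrates that an ancestor row is kept when it is reachable through any foreign-key path.

```python
from collections import deque

# Hypothetical toy schema: rows are plain dicts.
TABLES = {
    "Lab": [{"lab": "A"}, {"lab": "B"}],
    "Subject": [
        {"subject_id": 1, "lab": "A"},
        {"subject_id": 2, "lab": "A"},
        {"subject_id": 3, "lab": "B"},
    ],
    "Comparison": [
        {"comparison_id": 10, "subject_a": 1, "subject_b": 2},
        {"comparison_id": 11, "subject_a": 3, "subject_b": 3},
    ],
}

# (child, parent, {child attribute: parent attribute}); Comparison reaches
# Subject through two separate foreign keys, so two paths converge on Subject.
FOREIGN_KEYS = [
    ("Subject", "Lab", {"lab": "lab"}),
    ("Comparison", "Subject", {"subject_a": "subject_id"}),
    ("Comparison", "Subject", {"subject_b": "subject_id"}),
]


def trace_upstream(seed_table, seed_rows):
    """Return {table: rows} for the seed and its ancestors, OR-ing rows across paths."""
    selected = {seed_table: list(seed_rows)}
    queue = deque([seed_table])
    while queue:
        child = queue.popleft()
        for fk_child, parent, attr_map in FOREIGN_KEYS:
            if fk_child != child:
                continue
            # Parent key values referenced by the currently selected child rows.
            wanted = {tuple(row[c] for c in attr_map) for row in selected[child]}
            matches = [
                row for row in TABLES[parent]
                if tuple(row[p] for p in attr_map.values()) in wanted
            ]
            kept = selected.setdefault(parent, [])
            new_rows = [row for row in matches if row not in kept]
            if new_rows:
                kept.extend(new_rows)   # union with rows found via other paths
                queue.append(parent)    # propagate further up from the grown set
    return selected


result = trace_upstream("Comparison", [TABLES["Comparison"][0]])
```

In this toy, tracing comparison 10 keeps subjects 1 and 2 (the union of both paths) and lab A. An AND-flavored walk would require a subject to be referenced by both foreign keys and would keep nothing here.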

Spec structure

| Section | What it covers |
| --- | --- |
| Why this exists | The convention/enforcement gap; why three pieces are needed together |
| Concepts | trace as cascade's mirror; OR convergence; the `make()` provenance boundary; shared upward rules |
| §1 `Diagram.trace` | Signature, behavior, `trace[T]` indexing (class + string), `counts()`/`heading()`/`iter`, worked examples |
| §2 `self.upstream` | Construction lifecycle, allowed table set, per-`key` behavior, examples |
| §3 `strict_provenance` | Config key, read/write enforcement tables, key-consistency rule, opt-in rationale, compliant + violating examples |
| Integration | End-to-end strict-mode example, properties guaranteed |
| Migration path | Staged adoption from default-off to full enforcement |
| Out of scope | Static analysis, default flip, downstream provenance-metadata persistence |

Examples use core DataJoint types (`int32`, `float64`, `longblob`) per project convention.

Why a spec-first approach

The three sub-tasks (T2.2.a/b/c) form a tightly-coupled feature where partial shipping is worse than not shipping (you can't enforce against an upstream view that doesn't exist; you can't expose `self.upstream` ergonomically without the trace primitive). Writing the spec first locks the design contract that the three PRs implement against, surfaces design questions early, and gives reviewers a single coherent artifact to evaluate before the code arrives.

Cross-links

Cascade Specification (PR #182, "docs(#1429): add Cascade Specification"): defines the upward rules that `trace` reuses. This PR should land after #182.
AutoPopulate, Diagram, Entity Integrity — existing spec/concept pages referenced.

Sequencing

Reviewable now as a design document. The three implementation PRs (datajoint-python) will land against this spec. Order:

This PR (spec) — review and merge
T2.2.a: `Diagram.trace()` implementation (depends on #1468 cascade-rules landing first)
T2.2.b: `self.upstream` implementation (depends on T2.2.a)
T2.2.c: `strict_provenance` implementation (depends on T2.2.b)

Test plan

`mkdocs serve` renders the new spec under Reference → Specifications → Data Operations
Cross-links resolve (cascade.md, autopopulate.md, diagram.md, entity-integrity.md, the three issues)
Examples use core DataJoint types
Read enforcement table and write enforcement table are precise enough to be implementable as-is
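The cross-link item in the test plan can be approximated offline. The following is a minimal sketch, assuming the spec uses ordinary relative markdown links; the `missing_links` helper, the sample text, and the page set are hypothetical and are not part of the docs tooling.

```python
import re

# Matches [text](target.md) or [text](target.md#anchor) and captures the target.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+\.md)(?:#[^)\s]*)?\)")


def missing_links(markdown_text, existing_pages):
    """Return relative .md link targets that are absent from existing_pages."""
    targets = LINK_RE.findall(markdown_text)
    return sorted(
        {
            target
            for target in targets
            if not target.startswith(("http://", "https://"))
            and target not in existing_pages
        }
    )


# Hypothetical sample: cascade.md is not present until its own PR merges.
sample_spec = (
    "See the [Cascade Specification](cascade.md#upward-rules), "
    "[AutoPopulate](autopopulate.md), and [Diagram](diagram.md)."
)
pages_on_main = {"autopopulate.md", "diagram.md", "entity-integrity.md"}
broken = missing_links(sample_spec, pages_on_main)
```

With this sample, `broken` contains only `cascade.md`, which mirrors the dependency on the Cascade Specification landing first.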

…venance

Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3:

- Diagram.trace(table_expr): upstream mirror of Diagram.cascade(), walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the Cascade Specification. Supports trace[TableClass] returning a pre-restricted QueryExpression, and trace[str] returning a FreeTable.
- self.upstream: property on AutoPopulate that the framework sets to Diagram.trace(self & key) before each make() invocation. Pre-restricted ancestor access for ergonomic, provenance-safe reads.
- dj.config["strict_provenance"]: runtime flag (default False). When True, gates make() so reads must go through self.upstream and writes must target self or self's Parts with key-consistent primary keys.

Spec structure:

- Why this exists: the convention/enforcement gap, why three pieces.
- Concepts: trace as cascade's mirror, OR convergence, the make() provenance boundary, why upward rules are shared with cascade.
- §1 Diagram.trace: signature, behavior, indexing (class + string), counts() / heading() / iter, worked examples.
- §2 self.upstream: construction lifecycle, allowed table set, per-key behavior, examples.
- §3 strict_provenance: config key, read enforcement table, write enforcement table, key-consistency rule, opt-in rationale, examples of compliant + violating make() bodies.
- Integration: end-to-end strict-mode example, properties guaranteed.
- Migration path: staged adoption from default-off to enforcement.
- Out of scope: static analysis, default flip, downstream provenance metadata persistence.

Examples use core DataJoint types (int32, float64, longblob) per project convention. Cross-links to cascade.md (shared upward rules), autopopulate.md, diagram.md, entity-integrity.md. Nav entry under Reference > Specifications > Data Operations between AutoPopulate and Job Metadata.

Slated for DataJoint 2.3.
T2.2.a (#1423), T2.2.b (#1424), T2.2.c (#1425) implementations land against this spec.

MilagrosMarin

Read carefully. This is a thoughtful spec — the three-piece architecture (graph operation + ergonomic surface + enforcement layer) is well-motivated, and the migration path gives teams a concrete 5-step rollout. A few observations worth raising before the implementation work begins:

Context

This is a forward-looking spec for three unshipped features. All three referenced issues (#1423/#1424/#1425) exist as OPEN in datajoint-python. No implementation exists in source yet — grep returned zero matches for load_all_upstream, def trace, self.upstream, strict_provenance across the relevant modules.

Sequencing concern

cascade.md cross-link will be broken until #182 lands. The spec links to cascade.md repeatedly:

Line 14: "see Cascade Specification"
Line 44: "upward propagation rules (U1, U2, U3) defined in the Cascade Specification"
Line 179: "paralleling Cascade Spec §Worked Example 1"
Line 404: References

These are load-bearing — the upward rules are defined exclusively in #182. Sequencing requirement: #182 must merge first (or batch together).

Observations

1. Forward-looking surface needs a more prominent banner. The current "version-added" admonition at line 11 helps but doesn't convey "implementation lands across three separate PRs — this is the design target, not shipped behavior". An "Implementation status" admonition near the top (like the one #177 eventually adopted) would make this clearer. A reader could mistake the current framing for "available now in 2.3".

2. load_all_upstream is also unshipped. Line 84 references connection.dependencies.load_all_upstream() as if it exists; it doesn't. The spec should note this is also new (or cite the issue/PR where it lands).

3. Dangling "Implementation notes" reference. Line 245 says "the value seen depends on thread-local context — see Implementation notes". There is no "Implementation notes" section in the spec. Either add the section or drop the reference.

4. AND-flavored upstream dismissed too quickly. Line 40: "The AND-flavored upstream analog — 'ancestors that contributed via every path' — is not a useful query and is not provided." Strong claim. Audit / intersection use cases might want it. Softer wording: "… not provided in 2.3" or "… not currently considered useful for typical compute pipelines".

5. Strict-provenance edge cases worth addressing:

Does connection.query(raw_sql) get blocked under strict?
What about an audit log inside make()? Under strict, a team can't write to AuditLog. Suggesting "log to a Part of self" or "use a temporary escape via context manager" would close this loop.
What is the runtime mechanism: overriding fetch/insert, a descriptor, or AST analysis? A brief "Implementation notes" paragraph (which would also resolve the dangling reference in observation 3) would close this.

6. make_kwargs mention without definition. Line 235: "make_kwargs and key are unaffected." — make_kwargs isn't defined elsewhere in the spec. Readers unfamiliar with this internal will be confused.

7. Caching/lazy behavior of self.upstream. Line 187 says construction is lazy. If make() accesses self.upstream[Session] twice, are both fetches independent SELECTs, or is there per-call caching? Worth one sentence either way.
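The two behaviors in question can be illustrated with a toy stand-in. `ToyUpstream` below is hypothetical and collapses indexing and fetching into a single step, which the real `self.upstream` presumably does not; it only shows what "independent SELECTs" versus "per-call caching" would mean for repeated access.

```python
class ToyUpstream:
    """Hypothetical stand-in for self.upstream that counts simulated SELECTs."""

    def __init__(self, cache=False):
        self._cache = {} if cache else None
        self.select_count = 0

    def _select(self, table_name):
        # Stand-in for a database round trip.
        self.select_count += 1
        return [{"table": table_name, "row": 1}]

    def __getitem__(self, table_name):
        if self._cache is None:
            return self._select(table_name)       # every access hits the database
        if table_name not in self._cache:
            self._cache[table_name] = self._select(table_name)
        return self._cache[table_name]            # later accesses reuse the result


uncached = ToyUpstream(cache=False)
uncached["Session"]
uncached["Session"]

cached = ToyUpstream(cache=True)
cached["Session"]
cached["Session"]
```

Here `uncached.select_count` ends at 2 and `cached.select_count` at 1. Either answer seems defensible; the spec just needs to state which one applies within a single `make()` call.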

8. Concurrency clarification at line 245. "the value seen depends on thread-local context" — dj.config["strict_provenance"] reads from a module-level dict, which is global, not thread-local. The implementation might need to make it thread-local but the spec should say so explicitly.
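The difference can be shown with the standard library alone. In this sketch, `config` is a plain dict standing in for a module-level settings object and `local_state` is a `threading.local`; neither is DataJoint's actual config object, and the names are mine.

```python
import threading

config = {"strict_provenance": False}  # stand-in for a module-level config dict
local_state = threading.local()        # per-thread storage


def read_flags(out):
    """Record what the calling thread sees for each storage style."""
    out["global_dict"] = config["strict_provenance"]
    out["thread_local"] = getattr(local_state, "strict_provenance", False)


# The main thread turns strict mode on in both places.
config["strict_provenance"] = True
local_state.strict_provenance = True

seen_in_worker = {}
worker = threading.Thread(target=read_flags, args=(seen_in_worker,))
worker.start()
worker.join()

seen_in_main = {}
read_flags(seen_in_main)
```

The worker thread sees the dict change but not the thread-local one, while the main thread sees both. If the spec intends per-thread behavior, it has to name a mechanism like the second; a plain dict gives the first.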

9. self.PartName vs self.PartName() inconsistency. Examples use both forms in different places (e.g., line 257 "Reading self or self.PartName" vs line 309 self.Bin.insert1(...)). Pick one canonical form.

10. Cosmetic: the example error message at lines 326-329 shows key['subject_id'] (a Python expression) rather than the runtime value. A real error message would have the value substituted in.
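For observation 10, a substituted message might look like the following. The wording and the `key_mismatch_message` helper are hypothetical; the spec's actual message text may differ.

```python
def key_mismatch_message(table, attr, row, key):
    """Format a key-consistency error with the runtime values filled in."""
    return (
        f"{table}: inserted row has {attr}={row[attr]!r}, "
        f"but make() key has {attr}={key[attr]!r}"
    )


message = key_mismatch_message(
    "Analysis.Bin", "subject_id", {"subject_id": 8}, {"subject_id": 7}
)
```

This produces `Analysis.Bin: inserted row has subject_id=8, but make() key has subject_id=7`, which shows the reader both values rather than the expression that produced them.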

Strengths to highlight

The cascade↔trace comparison table (line 34) is a clear pedagogical anchor
Three-piece argument (lines 20-26) makes the case for why all three are needed together
"Why opt-in" rationale (lines 285-289) is well-argued — production vs development is the right framing
5-step migration path is concrete and incremental
"What is not in this specification" appropriately defers static analysis and default-flip

Holding off formal approval until the cross-link dependency on #182 is resolved and the forward-looking framing is sharpened. The spec content is solid; this is mostly about presentation and sequencing.

dimitri-yatsenko requested review from MilagrosMarin, esutlie and ttngu207 June 10, 2026 23:08

dimitri-yatsenko marked this pull request as draft June 10, 2026 23:14

MilagrosMarin reviewed Jun 10, 2026

Merge branch 'main' into feat/provenance-trinity-spec

3d582db

dimitri-yatsenko wants to merge 2 commits into main from feat/provenance-trinity-spec
