Skip to content

docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183

Draft
dimitri-yatsenko wants to merge 2 commits into
mainfrom
feat/provenance-trinity-spec
Draft

docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183
dimitri-yatsenko wants to merge 2 commits into
mainfrom
feat/provenance-trinity-spec

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3. Implementation against this spec follows in three sub-tasks (T2.2.a/b/c):

  • `Diagram.trace(table_expr)` (#1423) — upstream mirror of `Diagram.cascade()`. Walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the just-merged Cascade Specification.
  • `self.upstream` (#1424) — property on `AutoPopulate` that the framework sets to `Diagram.trace(self & key)` before each `make()`. Pre-restricted ancestor access for ergonomic, provenance-safe reads.
  • `dj.config["strict_provenance"]` (#1425) — runtime flag (default `False`). When `True`, gates `make()` so reads must go through `self.upstream` and writes must target `self` or `self`'s Parts with key-consistent primary keys.

Spec structure

Section What it covers
Why this exists The convention/enforcement gap; why three pieces are needed together
Concepts trace as cascade's mirror; OR convergence; the `make()` provenance boundary; shared upward rules
§1 `Diagram.trace` Signature, behavior, `trace[T]` indexing (class + string), `counts()`/`heading()`/`iter`, worked examples
§2 `self.upstream` Construction lifecycle, allowed table set, per-`key` behavior, examples
§3 `strict_provenance` Config key, read/write enforcement tables, key-consistency rule, opt-in rationale, compliant + violating examples
Integration End-to-end strict-mode example, properties guaranteed
Migration path Staged adoption from default-off to full enforcement
Out of scope Static analysis, default flip, downstream provenance-metadata persistence

Examples use core DataJoint types (`int32`, `float64`, `longblob`) per project convention.

Why a spec-first approach

The three sub-tasks (T2.2.a/b/c) form a tightly-coupled feature where partial shipping is worse than not shipping (you can't enforce against an upstream view that doesn't exist; you can't expose `self.upstream` ergonomically without the trace primitive). Writing the spec first locks the design contract that the three PRs implement against, surfaces design questions early, and gives reviewers a single coherent artifact to evaluate before the code arrives.

Cross-links

Sequencing

Reviewable now as a design document. The three implementation PRs (datajoint-python) will land against this spec. Order:

  1. This PR (spec) — review and merge
  2. T2.2.a: `Diagram.trace()` implementation (depends on #1468 cascade-rules landing first)
  3. T2.2.b: `self.upstream` implementation (depends on T2.2.a)
  4. T2.2.c: `strict_provenance` implementation (depends on T2.2.b)

Test plan

  • `mkdocs serve` renders the new spec under Reference → Specifications → Data Operations
  • Cross-links resolve (cascade.md, autopopulate.md, diagram.md, entity-integrity.md, the three issues)
  • Examples use core DataJoint types
  • Read enforcement table and write enforcement table are precise enough to be implementable as-is

…venance

Detailed normative specification for the canonical provenance trinity
landing in DataJoint 2.3:

- Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(),
  walks the FK graph from a restricted seed to every ancestor with OR
  convergence. Reuses the upward propagation rules (U1/U2/U3) defined
  in the Cascade Specification. Supports trace[TableClass] returning a
  pre-restricted QueryExpression, and trace[str] returning a FreeTable.
- self.upstream — property on AutoPopulate that the framework sets to
  Diagram.trace(self & key) before each make() invocation. Pre-restricted
  ancestor access for ergonomic, provenance-safe reads.
- dj.config["strict_provenance"] — runtime flag (default False). When
  True, gates make() so reads must go through self.upstream and writes
  must target self or self's Parts with key-consistent primary keys.

Spec structure:
- Why this exists — the convention/enforcement gap, why three pieces.
- Concepts — trace as cascade's mirror, OR convergence, the make()
  provenance boundary, why upward rules are shared with cascade.
- §1 Diagram.trace — signature, behavior, indexing (class + string),
  counts() / heading() / iter, worked examples.
- §2 self.upstream — construction lifecycle, allowed table set,
  per-key behavior, examples.
- §3 strict_provenance — config key, read enforcement table, write
  enforcement table, key-consistency rule, opt-in rationale, examples
  of compliant + violating make() bodies.
- Integration — end-to-end strict-mode example, properties guaranteed.
- Migration path — staged adoption from on-default-off to enforcement.
- Out of scope — static analysis, default flip, downstream provenance
  metadata persistence.

Examples use core DataJoint types (int32, float64, longblob) per
project convention. Cross-links to cascade.md (shared upward rules),
autopopulate.md, diagram.md, entity-integrity.md.

Nav entry under Reference > Specifications > Data Operations between
AutoPopulate and Job Metadata.

Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c
(#1425) implementations land against this spec.
@dimitri-yatsenko dimitri-yatsenko marked this pull request as draft June 10, 2026 23:14

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read carefully. This is a thoughtful spec — the three-piece architecture (graph operation + ergonomic surface + enforcement layer) is well-motivated, and the migration path gives teams a concrete 5-step rollout. A few observations worth raising before the implementation work begins:

Context

This is a forward-looking spec for three unshipped features. All three referenced issues (#1423/#1424/#1425) exist as OPEN in datajoint-python. No implementation exists in source yet — grep returned zero matches for load_all_upstream, def trace, self.upstream, strict_provenance across the relevant modules.

Sequencing concern

cascade.md cross-link will be broken until #182 lands. The spec links to cascade.md repeatedly:

These are load-bearing — the upward rules are defined exclusively in #182. Sequencing requirement: #182 must merge first (or batch together).

Observations

1. Forward-looking surface needs a more prominent banner. The current "version-added" admonition at line 11 helps but doesn't convey "implementation lands across three separate PRs — this is the design target, not shipped behavior". An "Implementation status" admonition near the top (like the one #177 eventually adopted) would make this clearer. A reader could mistake the current framing for "available now in 2.3".

2. load_all_upstream is also unshipped. Line 84 references connection.dependencies.load_all_upstream() as if it exists; it doesn't. The spec should note this is also new (or cite the issue/PR where it lands).

3. Dangling "Implementation notes" reference. Line 245 says "the value seen depends on thread-local context — see Implementation notes". There is no "Implementation notes" section in the spec. Either add the section or drop the reference.

4. AND-flavored upstream dismissed too quickly. Line 40: "The AND-flavored upstream analog — 'ancestors that contributed via every path' — is not a useful query and is not provided." Strong claim. Audit / intersection use cases might want it. Softer wording: "… not provided in 2.3" or "… not currently considered useful for typical compute pipelines".

5. Strict-provenance edge cases worth addressing:

  • Does connection.query(raw_sql) get blocked under strict?
  • What about an audit log inside make()? Under strict, a team can't write to AuditLog. Suggesting "log to a Part of self" or "use a temporary escape via context manager" would close this loop.
  • The runtime mechanism — overriding fetch/insert, descriptor, AST analysis? A brief implementation notes paragraph (resolving the dangling reference from #3) would close this.

6. make_kwargs mention without definition. Line 235: "make_kwargs and key are unaffected."make_kwargs isn't defined elsewhere in the spec. Readers unfamiliar with this internal will be confused.

7. Caching/lazy behavior of self.upstream. Line 187 says construction is lazy. If make() accesses self.upstream[Session] twice, are both fetches independent SELECTs, or is there per-call caching? Worth one sentence either way.

8. Concurrency clarification at line 245. "the value seen depends on thread-local context"dj.config["strict_provenance"] reads from a module-level dict, which is global, not thread-local. The implementation might need to make it thread-local but the spec should say so explicitly.

9. self.PartName vs self.PartName() inconsistency. Examples use both forms in different places (e.g., line 257 "Reading self or self.PartName" vs line 309 self.Bin.insert1(...)). Pick one canonical form.

10. Cosmetic: example error message at line 326-329 shows key['subject_id'] (Python expression) rather than the runtime value. Real error messages would have the value substituted in.

Strengths to highlight

  • The cascade↔trace comparison table (line 34) is a clear pedagogical anchor
  • Three-piece argument (line 20-26) makes the case for why all three are needed together
  • "Why opt-in" rationale (lines 285-289) is well-argued — production vs development is the right framing
  • 5-step migration path is concrete and incremental
  • "What is not in this specification" appropriately defers static analysis and default-flip

Holding off formal approval until the cross-link dependency on #182 is resolved and the forward-looking framing is sharpened. The spec content is solid; this is mostly about presentation and sequencing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants