docs: dedicated how-to page for staged insert by dimitri-yatsenko · Pull Request #175 · datajoint/datajoint-docs

dimitri-yatsenko · 2026-05-21T01:06:27Z

Summary

staged_insert1 is the right tool for inserting multi-GB Zarr/HDF5/streaming objects with atomic database+storage semantics, but it had no nav entry and no cross-link from insert-data.md — it was a subsection inside use-object-storage.md that readers only found by scrolling. Promote it to a dedicated How-To with full coverage.

Changes

- New page: src/how-to/staged-insert.md, covering an overview, "How it Works" (the atomicity contract on both clean exit and exception), the full API (rec, store, open, fs), patterns for Zarr/HDF5/instrument streaming, limitations, troubleshooting, and See Also.
- Nav: added under How-To → Object Storage, between Use Object Storage and Use NPY Codec.
- use-object-storage.md: replaced the "Write Directly to Object Storage" subsection with a brief pointer and link, so readers in the object-storage flow still get the handoff while the canonical content lives on the new page.
- insert-data.md: added an inline note after "Insert with Blobs" and an entry at the top of See Also, so readers landing on the standard insert page discover the right tool for large objects.

Why a dedicated page (not a more prominent subsection)

Per offline discussion: the topic is valuable enough to warrant its own discoverability surface. Keeping it as a subsection — even renamed and repositioned — wouldn't appear in the left sidebar TOC, which is the highest-traffic discovery path.

Test plan

- Run mkdocs serve and verify Staged Insert appears in the left sidebar under How-To → Object Storage
- Open /how-to/staged-insert/ and confirm the page renders end-to-end, all code blocks syntax-highlight, and See Also links resolve
- Open /how-to/use-object-storage/ and confirm the pointer near the bottom links to the new page
- Open /how-to/insert-data/ and confirm both the inline "For multi-GB arrays..." note and the See Also entry link to the new page
- Run mkdocs-linkcheck (or equivalent) and confirm it is clean

`staged_insert1` was buried in `use-object-storage.md` (subsection "Write Directly to Object Storage") with no nav entry, so readers scanning the How-To sidebar couldn't find it and `insert-data.md` didn't link to it. Promote it to its own how-to:

- New `src/how-to/staged-insert.md` covering overview, atomicity guarantees, full API (`rec`, `store`, `open`, `fs`), Zarr/HDF5/streaming patterns, limitations, and troubleshooting.
- Add nav entry under Object Storage in `mkdocs.yaml`.
- Replace the section in `use-object-storage.md` with a short pointer and link to the new page.
- Cross-link from `insert-data.md` (inline after "Insert with Blobs" and in See Also).

MilagrosMarin

Thanks @dimitri-yatsenko! Verified the API surface and atomicity contract against src/datajoint/staged_insert.py and tests/integration/test_object.py::TestStagedInsert:

✅ API matches source — staged.rec, staged.store(field, ext='') → fsspec.FSMap, staged.open(field, ext='', mode='wb') → IO, staged.fs → AbstractFileSystem.
✅ Atomicity is accurately described end-to-end. Worth highlighting: the doc correctly says database-insert failure during exit also cleans up — confirmed because _finalize() is called inside the outer try (staged_insert.py:304-310), so a duplicate-PK error gets caught and _cleanup() runs.
✅ Cross-links from insert-data.md and use-object-storage.md solve the discoverability gap. Nav placement under "How-To → Object Storage" is right.
✅ The pattern of not explicitly assigning staged.rec['field'] = z (which is shown in the dj-python source docstring) is actually more correct than the source docstring — _finalize overwrites it with computed metadata anyway. Good call.
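
For readers who want the atomicity contract in miniature, the following is a toy model of the lifecycle described in the points above. It is a sketch under stated assumptions, not the code in staged_insert.py: the class name `StagedInsertSketch`, the dict standing in for the database table, the `id` primary-key field, and the local temporary directory standing in for object storage are all hypothetical. It shows two behaviors: staged objects are removed when either the block body or the final insert raises an `Exception`, and metadata computed at finalize replaces anything the caller assigned to the staged field.

```python
import os
import tempfile
import uuid


class StagedInsertSketch:
    """Toy model: stage objects first, insert the row last, clean up on failure."""

    def __init__(self, table, storage_root):
        self.table = table              # dict keyed by primary key; stands in for the database
        self.storage_root = storage_root
        self.rec = {}                   # row under construction
        self._staged = {}               # field name -> staged file path

    def open(self, field, ext="", mode="wb"):
        # A unique name keeps a failed attempt from touching another row's object.
        path = os.path.join(self.storage_root, f"{field}_{uuid.uuid4().hex}{ext}")
        self._staged[field] = path
        return open(path, mode)

    def _finalize(self):
        # Computed metadata overwrites whatever the caller assigned to rec[field].
        for field, path in self._staged.items():
            self.rec[field] = {"path": path, "size": os.path.getsize(path)}
        key = self.rec["id"]
        if key in self.table:
            raise KeyError(f"duplicate primary key: {key!r}")
        self.table[key] = dict(self.rec)

    def _cleanup(self):
        for path in self._staged.values():
            if os.path.exists(path):
                os.remove(path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            if issubclass(exc_type, Exception):
                self._cleanup()         # the body failed: discard staged objects
            return False                # let the original exception propagate
        try:
            self._finalize()
        except Exception:
            self._cleanup()             # the insert failed: discard staged objects
            raise
        return False


root = tempfile.mkdtemp()
table = {}

with StagedInsertSketch(table, root) as staged:
    staged.rec["id"] = 1
    staged.rec["raw"] = "ignored"       # replaced by computed metadata at finalize
    with staged.open("raw", ".bin") as f:
        f.write(b"abcd")

try:
    with StagedInsertSketch(table, root) as staged:
        staged.rec["id"] = 1            # duplicate key: finalize raises
        with staged.open("raw", ".bin") as f:
            f.write(b"zz")
except KeyError:
    pass

remaining = os.listdir(root)            # only the first row's object survives
```

After the second block fails on the duplicate key, the storage directory holds only the object written by the first, successful insert.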

Two accuracy issues worth fixing before merge:

1. <blob@> support claim is incorrect. Line 15 says:

It is only available for object-typed fields (<...@> syntax) and codecs that support direct storage handles — primarily <object@> (Zarr / HDF5 / multi-file) and <blob@> written via a file handle.

But staged_insert.py:100-101 explicitly rejects every codec other than object:

if not (attr.codec and attr.codec.name == "object"):
    raise DataJointError(f"Attribute '{field}' is not an <object> type")

Only <object@> passes. The tests only exercise <object@local>. A <blob@> field would raise today. Either drop the <blob@> clause or note it as a planned extension.
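
To make the effect of that gate concrete, here is a self-contained sketch that wraps the quoted condition in a function and runs it against three fields. The `Codec` and `Attribute` dataclasses, the `check_staged_field` helper, the dict-based `heading`, and the local `DataJointError` class are stand-ins invented for illustration; only the condition and the error message come from the snippet quoted above.

```python
from dataclasses import dataclass
from typing import Optional


class DataJointError(Exception):
    """Local stand-in for datajoint's error type so the sketch runs on its own."""


@dataclass
class Codec:
    name: str


@dataclass
class Attribute:
    codec: Optional[Codec] = None


def check_staged_field(heading, field):
    """Raise unless `field` is declared with the object codec."""
    attr = heading[field]
    if not (attr.codec and attr.codec.name == "object"):
        raise DataJointError(f"Attribute '{field}' is not an <object> type")


heading = {
    "raw": Attribute(Codec("object")),   # e.g. declared as <object@>
    "mask": Attribute(Codec("blob")),    # e.g. declared as <blob@>
    "subject_id": Attribute(None),       # plain column, no codec
}

check_staged_field(heading, "raw")       # passes silently

rejected = []
for name in ("mask", "subject_id"):
    try:
        check_staged_field(heading, name)
    except DataJointError as err:
        rejected.append(str(err))
```

Both the blob-typed field and the plain column end up in `rejected`, which is the behavior the doc should describe for anything other than <object@>.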

2. "Content hash for hash-addressed codecs" describes nothing the implementation does. Line 64:

Computes object metadata (size, manifest, content hash for hash-addressed codecs) from the staged objects.

_compute_metadata (:165-249) sets "hash": None in both the directory and single-file branches. Since the only supported codec is <object@> (path-addressed), no hash is ever computed. Suggest just "Computes object metadata (size, manifest) from the staged objects."
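
As an illustration of what "size, manifest" with no hash can mean in practice, the sketch below walks a staged file or directory on the local filesystem. It is not the body of _compute_metadata: the function name `compute_metadata_sketch`, the exact dict keys other than `hash`, and the use of `os.walk` instead of an fsspec filesystem are assumptions made so the example runs standalone.

```python
import os
import tempfile


def compute_metadata_sketch(path):
    """Return size and manifest for a staged file or directory; hash stays None."""
    if os.path.isdir(path):
        manifest = []
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, path).replace(os.sep, "/")
                manifest.append({"path": rel, "size": os.path.getsize(full)})
        manifest.sort(key=lambda item: item["path"])
        return {
            "is_dir": True,
            "size": sum(item["size"] for item in manifest),
            "item_count": len(manifest),
            "manifest": manifest,
            "hash": None,                # path-addressed: no content hash is computed
        }
    return {"is_dir": False, "size": os.path.getsize(path), "hash": None}


# Build a tiny directory that loosely resembles a Zarr store layout.
root = tempfile.mkdtemp()
store = os.path.join(root, "data.zarr")
os.makedirs(os.path.join(store, "0"))
with open(os.path.join(store, ".zarray"), "wb") as f:
    f.write(b"{}")
with open(os.path.join(store, "0", "0"), "wb") as f:
    f.write(b"\x00" * 16)

dir_meta = compute_metadata_sketch(store)
file_meta = compute_metadata_sketch(os.path.join(store, ".zarray"))
```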

Minor (optional):

- The <object@store> named-store form isn't mentioned. Quick Start uses <object@> (default store), but tests use <object@local>. A one-line note in API Reference would round it out.
- The context manager catches Exception, not BaseException, so KeyboardInterrupt mid-write leaves staged objects behind. Worth a bullet in Limitations so readers know not to rely on cleanup for Ctrl+C scenarios.
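
The Ctrl+C point follows from Python's exception hierarchy: `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so an `except Exception` handler never sees it. A minimal demonstration, with the hypothetical helper `run_with_cleanup` standing in for any cleanup path guarded this way:

```python
cleaned = []


def run_with_cleanup(label, error):
    """Mimic a cleanup path guarded by `except Exception`."""
    try:
        raise error
    except Exception:
        cleaned.append(label)    # runs only for Exception subclasses
        raise


for label, error in [
    ("value-error", ValueError("bad write")),
    ("ctrl-c", KeyboardInterrupt()),
]:
    try:
        run_with_cleanup(label, error)
    except BaseException:
        pass                     # swallowed here only so the demo can continue
```

Only the `ValueError` case is recorded in `cleaned`; the simulated interrupt passes straight through the handler.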

Otherwise this is a clean, well-structured how-to — much better than the buried subsection. Happy to approve once the two accuracy items are fixed.

From @MilagrosMarin's review on #175:

- Drop the inaccurate '<blob@> written via a file handle' claim. staged_insert.py:100-101 explicitly rejects anything except codec name == 'object', so only <object@> is supported. Note the actual error behavior instead.
- Drop 'content hash for hash-addressed codecs' from the metadata list. _compute_metadata always sets hash: None for both directory and single-file branches; no hash is ever computed.
- Mention the named-store form '<object@name>' alongside '<object@>' in the Table.staged_insert1 API reference.
- Add a Limitations bullet noting that cleanup catches Exception not BaseException, so KeyboardInterrupt mid-write can leave staged objects behind; point to the garbage-collection how-to.

dimitri-yatsenko · 2026-05-21T14:07:35Z

Thanks @MilagrosMarin — pushed e0f4de0 addressing all four points.

Required fixes:

1. <blob@> claim: dropped. The line now says only <object@> is supported and that attempting staged.store() / staged.open() on any other field type raises DataJointError. Confirmed against staged_insert.py:100-101.
2. Content hash: dropped from the metadata list. The "How It Works" step now reads "Computes object metadata (size, manifest)", which matches _compute_metadata's hash: None in both branches.

Optional (both done):
3. Named-store form — added a sentence in the Table.staged_insert1 API reference: "<object@> uses stores.default, and <object@name> uses the named store."
4. BaseException / Ctrl+C cleanup gap — added a Limitations bullet calling out that KeyboardInterrupt bypasses the cleanup path, with a pointer to Clean Up Storage for reclaiming orphaned staged objects.

Ready for another look.

dimitri-yatsenko · 2026-05-21T14:42:21Z

@MilagrosMarin — thanks again for the careful read here. Your accuracy points led directly to a broader design question that's worth flagging: once we asked why only <object@> is supported by staged insert, it turned out the gate is narrower than necessary. The same lifecycle generalizes cleanly to <npy@> (schema-addressed) and — with a staging-then-rename step — to <blob@> and <attach@> (hash-addressed). The "content hash for hash-addressed codecs" framing that we removed from this how-to is in fact real — for hash-addressed codecs the hash is the path — it just doesn't apply to <object@> today.
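
To illustrate the staging-then-rename idea mentioned in the comment above, here is a rough sketch on a local filesystem. It describes a proposed design, not shipped behavior, and every name in it is an assumption: the `stage_then_rename` helper, the `_staging` and `_hash` directory layout, and the hex MD5 digest used as a placeholder token. The point it shows is that for hash-addressed content the final path cannot be known until the bytes are written, so the write goes to a temporary location and is moved (or discarded as a duplicate) afterwards.

```python
import hashlib
import os
import tempfile
import uuid


def stage_then_rename(root, chunks):
    """Write chunks to a staging path, then move the file to a path derived from its hash."""
    staging_dir = os.path.join(root, "_staging")
    os.makedirs(staging_dir, exist_ok=True)
    staged_path = os.path.join(staging_dir, uuid.uuid4().hex)

    digest = hashlib.md5()               # placeholder digest; token format is illustrative
    with open(staged_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            digest.update(chunk)         # hash incrementally while streaming

    final_dir = os.path.join(root, "_hash")
    os.makedirs(final_dir, exist_ok=True)
    final_path = os.path.join(final_dir, digest.hexdigest())
    if os.path.exists(final_path):
        os.remove(staged_path)           # identical content already stored: deduplicate
    else:
        os.replace(staged_path, final_path)
    return final_path


root = tempfile.mkdtemp()
first = stage_then_rename(root, [b"abc", b"def"])
second = stage_then_rename(root, [b"abcdef"])    # same bytes, different chunking
leftover = os.listdir(os.path.join(root, "_staging"))
```

Both calls resolve to the same final path, and nothing is left in the staging directory.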

I've opened a parallel spec PR that pins this down end-to-end: #177 — Staged Insert Specification. It defines the lifecycle, the codec-side protocol (staged_handle / finalize_staged / cleanup_staged), path/metadata contracts for both addressing schemes, the atomicity model (including your BaseException point), the concurrency model, and the codec compatibility matrix. The implementation PR in datajoint-python will reference it and land after the spec is reviewed.

For #175: it's correct as-is for the current <object@>-only implementation. Once the implementation PR lands, a small follow-up will update this how-to to reflect the broader codec support and link to the spec for normative details. Happy for you to approve this PR standalone in the meantime so the discoverability/structure improvements ship.

MilagrosMarin

Thanks @dimitri-yatsenko — all four points verified in e0f4de0. <blob@> claim dropped, "content hash" phrase dropped, named-store form added, BaseException/Ctrl+C caveat in Limitations with pointer to garbage-collection.md (target exists). Spec-PR scoping (#177) is sensible — will review separately.

Approving.

* docs(specs): add Staged Insert Specification

Defines the staged-insert contract as a normative spec so the implementation has a single source of truth and third-party codec authors have a documented protocol to implement.

Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged / cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical path, finalize computes metadata) and hash-addressed (handle at _staging path, finalize hashes content and renames to canonical _hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction, BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in, in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants, and resumable inserts

Implementation is deferred to a follow-up PR in datajoint-python; this spec is the design that PR will reference.

Nav: add under Reference → Specifications → Data Operations alongside data-manipulation.md and autopopulate.md.

* docs(specs): elevate <zarr@> to first-class staged-insert codec

Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in the staged-insert spec:
- New "Concrete protocol behavior" subsection describing both usage paths: ordinary insert1 (canonical for in-memory arrays) and staged_insert1 (for arrays too large to materialize, via direct FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the generic <object@> example to a multi-file/directory fallback.

* docs(specs): address review feedback from MilagrosMarin on staged-insert spec

Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`; corrected to `_hash/{schema}/{content_hash}` (flat) or `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching hash_registry.build_hash_path. The {schema} segment is load-bearing for isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual output `{path, store, size, ext, is_dir, item_count, timestamp}` (builtin_codecs/object.py:166-174). Noted the two-place convergence work the impl PR will do (StagedInsert._compute_metadata refactor; earlier draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and AttachCodec.encode return raw bytes, and the dict shape comes from the chained <hash@> codec; the impl PR refactors them to return dicts directly. Also noted that HashCodec's three-way documented inconsistency will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces are forward-looking vs as-shipped, with source line numbers as anchors.

Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)

* docs(specs): add table declarations to all staged-insert examples

Every example in §Examples now includes the @Schema class declaration with definition string, matching the house style in codec-api.md. Readers can copy a complete, self-contained snippet rather than mentally fill in the table schema. int32 used throughout per the core-types-in-docs convention. Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>, <blob@>, <attach@>.

* docs(specs): scope staged-insert spec to <object@> only (minimal spec)

Staged insert applies only to codecs whose content format has an incremental-write API. Of the built-in codecs, only <object@> qualifies:
- <blob@>, <hash@>: atomic byte sequences from a materialized Python object
- <npy@>: np.save takes a materialized array
- <attach@>: file is already on disk; ordinary insert1 suffices
- <filepath@>: reference, not copy

Rewrite the spec around this principle:
- Drop the codec-protocol generalization (staged_handle / finalize_staged / cleanup_staged on Codec base class); defer until a second codec actually needs it.
- Drop the hash-addressed lifecycle and all <blob@>/<attach@>/<hash@> staged paths, since no pathway exists.
- Drop the <zarr@> staged example: the proper dj-zarr-codecs API is insert1(array); a staged <zarr@> path is future work. The Zarr example shown is honest about column type: declared <object@>, written through <object@>'s FSMap with zarr.open.
- Drop the "Implementation status" admonition: no forward-looking pieces remain; the spec now documents shipped behavior.
- Trim 421 → 154 lines.

Future work section captures the deferred surface: staged insert for other codecs (incremental-API candidates: <zarr@>, <hdf5@>, parquet), hash-addressed staged, multi-row, resumable.

Also fix data-manipulation.md §2.9: drop stale "codec protocol" / "codec compatibility matrix" cross-link phrasing (those sections no longer exist), and fix the snippet that incorrectly assigned the zarr handle to staged.rec. The framework computes the metadata dict; the caller does not assign anything to the staged field.
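
The "MD5+base32 → 26-char lowercase token" correction in the review-feedback commit above can be checked with a few lines of standard-library Python. This is a sketch of the arithmetic only, not a copy of hash_registry.compute_hash or hash_registry.build_hash_path: the function names `content_token` and `flat_hash_path`, the padding-stripping step, and the schema name `my_schema` are assumptions. A 16-byte MD5 digest base32-encodes to 26 significant characters plus 6 padding characters, which is where the 26 comes from.

```python
import base64
import hashlib


def content_token(data: bytes) -> str:
    """MD5 digest, base32-encoded, padding stripped, lowercased."""
    digest = hashlib.md5(data).digest()                  # 16 bytes
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


def flat_hash_path(schema: str, token: str) -> str:
    """Flat layout `_hash/{schema}/{content_hash}`; subfolding is not modeled here."""
    return f"_hash/{schema}/{token}"


token = content_token(b"hello")
path = flat_hash_path("my_schema", token)
```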

dimitri-yatsenko requested a review from MilagrosMarin May 21, 2026 01:06

MilagrosMarin requested changes May 21, 2026

dimitri-yatsenko mentioned this pull request May 21, 2026

docs(specs): add Staged Insert Specification #177

Merged

4 tasks

MilagrosMarin approved these changes May 21, 2026

MilagrosMarin merged commit 6f867bf into main May 21, 2026
3 checks passed

dimitri-yatsenko deleted the docs/staged-insert-howto branch May 21, 2026 16:28

MilagrosMarin merged 2 commits into main from docs/staged-insert-howto
