docs: dedicated how-to page for staged insert#175
Conversation
`staged_insert1` was buried in `use-object-storage.md` (subsection "Write Directly to Object Storage") with no nav entry, so readers scanning the How-To sidebar couldn't find it and `insert-data.md` didn't link to it. Promote it to its own how-to: - New `src/how-to/staged-insert.md` covering overview, atomicity guarantees, full API (`rec`, `store`, `open`, `fs`), Zarr/HDF5/ streaming patterns, limitations, and troubleshooting. - Add nav entry under Object Storage in `mkdocs.yaml`. - Replace the section in `use-object-storage.md` with a short pointer and link to the new page. - Cross-link from `insert-data.md` (inline after "Insert with Blobs" and in See Also).
MilagrosMarin
left a comment
There was a problem hiding this comment.
Thanks @dimitri-yatsenko! Verified the API surface and atomicity contract against src/datajoint/staged_insert.py and tests/integration/test_object.py::TestStagedInsert:
✅ API matches source — staged.rec, staged.store(field, ext='') → fsspec.FSMap, staged.open(field, ext='', mode='wb') → IO, staged.fs → AbstractFileSystem.
✅ Atomicity is accurately described end-to-end. Worth highlighting: the doc correctly says database-insert failure during exit also cleans up — confirmed because _finalize() is called inside the outer try (staged_insert.py:304-310), so a duplicate-PK error gets caught and _cleanup() runs.
✅ Cross-links from insert-data.md and use-object-storage.md solve the discoverability gap. Nav placement under "How-To → Object Storage" is right.
✅ The pattern of not explicitly assigning staged.rec['field'] = z (which is shown in the dj-python source docstring) is actually more correct than the source docstring — _finalize overwrites it with computed metadata anyway. Good call.
Two accuracy issues worth fixing before merge:
1. <blob@> support claim is incorrect. Line 15 says:
It is only available for object-typed fields (
<...@>syntax) and codecs that support direct storage handles — primarily<object@>(Zarr / HDF5 / multi-file) and<blob@>written via a file handle.
But staged_insert.py:100-101 explicitly:
if not (attr.codec and attr.codec.name == "object"):
raise DataJointError(f"Attribute '{field}' is not an <object> type")Only <object@> passes. The tests only exercise <object@local>. A <blob@> field would raise today. Either drop the <blob@> clause or note it as a planned extension.
2. "Content hash for hash-addressed codecs" describes nothing the implementation does. Line 64:
Computes object metadata (size, manifest, content hash for hash-addressed codecs) from the staged objects.
_compute_metadata (:165-249) sets "hash": None in both the directory and single-file branches. Since the only supported codec is <object@> (path-addressed), no hash is ever computed. Suggest just "Computes object metadata (size, manifest) from the staged objects."
Minor (optional):
<object@store>named-store form isn't mentioned — Quick Start uses<object@>(default store), but tests use<object@local>. One-line note in API Reference would round it out.- The context manager catches
Exception, notBaseException, soKeyboardInterruptmid-write leaves staged objects behind. Worth a bullet in Limitations so readers know not to rely on cleanup for Ctrl+C scenarios.
Otherwise this is a clean, well-structured how-to — much better than the buried subsection. Happy to approve once the two accuracy items are fixed.
From @MilagrosMarin's review on #175: - Drop the inaccurate '<blob@> written via a file handle' claim. staged_insert.py:100-101 explicitly rejects anything except codec name == 'object', so only <object@> is supported. Note the actual error behavior instead. - Drop 'content hash for hash-addressed codecs' from the metadata list. _compute_metadata always sets hash: None for both directory and single-file branches; no hash is ever computed. - Mention the named-store form '<object@name>' alongside '<object@>' in the Table.staged_insert1 API reference. - Add a Limitations bullet noting that cleanup catches Exception not BaseException, so KeyboardInterrupt mid-write can leave staged objects behind; point to the garbage-collection how-to.
|
Thanks @MilagrosMarin — pushed e0f4de0 addressing all four points. Required fixes:
Optional (both done): Ready for another look. |
|
@MilagrosMarin — thanks again for the careful read here. Your accuracy points led directly to a broader design question that's worth flagging: once we asked why only I've opened a parallel spec PR that pins this down end-to-end: #177 — Staged Insert Specification. It defines the lifecycle, the codec-side protocol ( For #175: it's correct as-is for the current |
MilagrosMarin
left a comment
There was a problem hiding this comment.
Thanks @dimitri-yatsenko — all four points verified in e0f4de0. <blob@> claim dropped, "content hash" phrase dropped, named-store form added, BaseException/Ctrl+C caveat in Limitations with pointer to garbage-collection.md (target exists). Spec-PR scoping (#177) is sensible — will review separately.
Approving.
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
* docs(specs): add Staged Insert Specification
Defines the staged-insert contract as a normative spec so the
implementation has a single source of truth and third-party codec
authors have a documented protocol to implement.
Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged
/ cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical
path, finalize computes metadata) and hash-addressed (handle at
_staging path, finalize hashes content and renames to canonical
_hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each
codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction,
BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in,
in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants,
and resumable inserts
Implementation is deferred to a follow-up PR in datajoint-python; this
spec is the design that PR will reference.
Nav: add under Reference → Specifications → Data Operations alongside
data-manipulation.md and autopopulate.md.
* docs(specs): elevate <zarr@> to first-class staged-insert codec
Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in
the staged-insert spec:
- New "Concrete protocol behavior" subsection describing both usage
paths: ordinary insert1 (canonical for in-memory arrays) and
staged_insert1 (for arrays too large to materialize, via direct
FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the
generic <object@> example to a multi-file/directory fallback.
* docs(specs): address review feedback from MilagrosMarin on staged-insert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
* docs(specs): add table declarations to all staged-insert examples
Every example in §Examples now includes the @Schema class declaration
with definition string, matching the house style in codec-api.md.
Readers can copy a complete, self-contained snippet rather than
mentally fill in the table schema. int32 used throughout per the
core-types-in-docs convention.
Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>,
<blob@>, <attach@>.
* docs(specs): scope staged-insert spec to <object@> only (minimal spec)
Staged insert applies only to codecs whose content format has an
incremental-write API. Of the built-in codecs, only <object@> qualifies:
- <blob@>, <hash@>: atomic byte sequences from a materialized Python object
- <npy@>: np.save takes a materialized array
- <attach@>: file is already on disk; ordinary insert1 suffices
- <filepath@>: reference, not copy
Rewrite the spec around this principle:
- Drop the codec-protocol generalization (staged_handle / finalize_staged
/ cleanup_staged on Codec base class) — defer until a second codec
actually needs it.
- Drop the hash-addressed lifecycle and all <blob@>/<attach@>/<hash@>
staged paths — no pathway exists.
- Drop the <zarr@> staged example — the proper dj-zarr-codecs API is
insert1(array); a staged <zarr@> path is future work. The Zarr example
shown is honest about column type: declared <object@>, written through
<object@>'s FSMap with zarr.open.
- Drop the "Implementation status" admonition — no forward-looking
pieces remain; the spec now documents shipped behavior.
- Trim 421 → 154 lines.
Future work section captures the deferred surface: staged insert for
other codecs (incremental-API candidates: <zarr@>, <hdf5@>, parquet),
hash-addressed staged, multi-row, resumable.
Also fix data-manipulation.md §2.9: drop stale "codec protocol" /
"codec compatibility matrix" cross-link phrasing (those sections no
longer exist), and fix the snippet that incorrectly assigned the zarr
handle to staged.rec — the framework computes the metadata dict; the
caller does not assign anything to the staged field.
Summary
staged_insert1is the right tool for inserting multi-GB Zarr/HDF5/streaming objects with atomic database+storage semantics, but it had no nav entry and no cross-link frominsert-data.md— it was a subsection insideuse-object-storage.mdthat readers only found by scrolling. Promote it to a dedicated How-To with full coverage.Changes
src/how-to/staged-insert.md— overview, "How it Works" (atomicity contract on both clean exit and exception), full API (rec,store,open,fs), patterns for Zarr/HDF5/instrument streaming, limitations, troubleshooting, See Also.How-To → Object Storage, sandwiched betweenUse Object StorageandUse NPY Codec.use-object-storage.md: replaced the "Write Directly to Object Storage" subsection with a brief pointer + link, so readers in the object-storage flow still get the handoff but the canonical content lives on the new page.insert-data.md: inline note after "Insert with Blobs" and an entry at the top of See Also, so readers landing on the standard insert page discover the right tool for large objects.Why a dedicated page (not a more prominent subsection)
Per offline discussion: the topic is valuable enough to warrant its own discoverability surface. Keeping it as a subsection — even renamed and repositioned — wouldn't appear in the left sidebar TOC, which is the highest-traffic discovery path.
Test plan
mkdocs serveand verifyStaged Insertappears in the left sidebar under How-To → Object Storage/how-to/staged-insert/— confirm page renders end-to-end, all code blocks syntax-highlight, See Also links resolve/how-to/use-object-storage/— confirm the pointer near the bottom links to the new page/how-to/insert-data/— confirm both the inline "For multi-GB arrays..." and the See Also entry link to the new pagemkdocs-linkcheck(or equivalent) clean