fix(staged_insert): converge metadata shape with ObjectCodec.encode#1465
fix(staged_insert): converge metadata shape with ObjectCodec.encode#1465dimitri-yatsenko wants to merge 1 commit into
Conversation
Bring the staged-insert metadata dict into structural equality with the
ordinary insert1 path, per the Staged Insert Specification
(datajoint-docs#177). Without this change the same content stored via
the two paths yields different column dicts:
ObjectCodec.encode -> {path, store, size, ext, is_dir, item_count, timestamp}
staged (directory) -> {path, size, hash, ext, is_dir, item_count, timestamp}
staged (single file)-> {path, size, hash, ext, is_dir, timestamp, mime_type?}
The staged path is now the canonical shape. Drops:
- hash: None (never carried information)
- mime_type (file case) (not in encode shape)
The file case now also carries item_count: None (matching encode).
Also fix the store_name divergence: staged_insert resolved the backend
from stores.default regardless of the field's type spec. A field declared
<object@local> would write through stores.default — and the store key
recorded in the metadata column would point at the wrong store. Now
resolve store_name from attr.store via resolve_dtype() and use that for
both path/backend resolution and the metadata's store field.
Drop the .manifest.json sidecar that the staged path wrote and the
encode path didn't. The metadata dict already records total size and
item_count; per-file listings are recoverable by walking the canonical
directory if ever needed.
Fix docstrings that showed `staged.rec['raw_data'] = z` — the framework
computes the metadata; the caller does not assign anything to the
staged field on staged.rec.
Add tests/integration/test_object.py::TestStagedInsert::
test_staged_insert_metadata_shape_matches_encode covering both the
single-file and directory cases against ObjectCodec.encode for
equivalent content.
Slated for DataJoint 2.3.
MilagrosMarin
left a comment
There was a problem hiding this comment.
Verified end-to-end against the spec contract in #177:
✅ Metadata shape {path, store, size, ext, is_dir, item_count, timestamp} matches ObjectCodec.encode (builtin_codecs/object.py:166-174) exactly — drops hash: None, drops mime_type, adds store and item_count consistently across file and directory cases.
✅ Per-field store resolution via resolve_dtype(..., store_name=attr.store) — <object@local> now correctly writes through stores.local and records "store": "local" in the metadata.
✅ ObjectRef.from_json is forward-compatible (data.get("hash"), data.get("item_count") defaults) — existing rows with hash: None still parse cleanly.
✅ The new test_staged_insert_metadata_shape_matches_encode is exactly the structural-equality assertion the spec calls for — keys match, and store/ext/is_dir/item_count agree across both paths.
Two minor notes, neither blocking:
- Test covers
<object@local>viaObjectFile/ObjectFolder. The default-store path (<object@>,store_name=None) uses the same code branch but isn't directly exercised — worth a one-line note if you want comprehensive coverage. - Existing
.manifest.jsonsidecars from earlier writes become orphans. Garbage collector should reclaim them; not a regression.
Approving — this closes the convergence concern from #177.
Summary
Bring `staged_insert1` into compliance with the Staged Insert Specification (
datajoint-docs#177). The spec defines a normative metadata-dict shape; today's implementation diverges from it in two ways that are observable in the database.Slated for the DataJoint 2.3 release.
What changes
1. Metadata shape now matches
ObjectCodec.encodeexactly.The same content stored via the two insert paths used to yield different column dicts:
ObjectCodec.encode(builtin_codecs/object.py:166-174){path, store, size, ext, is_dir, item_count, timestamp}{path, size, hash, ext, is_dir, item_count, timestamp}{path, size, hash, ext, is_dir, timestamp, mime_type?}After this PR,
_compute_metadatareturns the encode shape in both cases:hash: None(never carried information)mime_type(not in encode shape)item_count: Nonestore: <resolved store name>2. Per-field store resolution.
Today's
staged_insertresolves the backend fromstores.defaultregardless of the field's type spec. A field declared<object@local>would write throughstores.default, and thestorekey recorded in the row's metadata column would name the wrong store (or be missing entirely). Now we resolvestore_namefromattr.storeviaresolve_dtype()and use that for both path/backend resolution and the metadata'sstorefield — matching what ordinaryinsert1does throughtable.py:1342.3. Drops the
.manifest.jsonsidecar.The staged path was writing a per-object manifest file that the encode path doesn't produce. The metadata dict already records total
sizeanditem_count; per-file listings are recoverable by walking the canonical directory. Removing the sidecar eliminates a small storage-layout divergence between the two paths.4. Docstring fix.
Module docstring and the
staged_insert1function docstring both showedstaged.rec['raw_data'] = z. Per the spec, the caller does not assign anything tostaged.rec[<staged field>]— the framework computes the metadata dict. The implementation already overwrote the user's assignment in_finalize, so this was harmless but misleading. Examples updated.Test added
tests/integration/test_object.py::TestStagedInsert::test_staged_insert_metadata_shape_matches_encode— exercises both the single-file and directory paths, captures the metadata dict assigned to the row, and asserts:ObjectCodec.encode's output exactly for equivalent contentstore,ext,is_dir,item_count)Test plan
tests/integration/test_object.pypassRelated
datajoint-docs#177's "Implementation status" note refers to once merged. After both merge, the spec ceases to be forward-looking for the<object@>path.