Commit 0a9ca68

timsaucer and claude authored
docs: user guide + runnable examples for distributing expressions (#1547)
* docs: user guide page + runnable examples for distributing expressions

  Wraps up the Expr-pickle work with the user-facing material:

  * docs/source/user-guide/io/distributing_work.rst - new user guide page covering the multiprocessing, Ray, and datafusion-distributed patterns. Includes the Security section that is the canonical home for the cloudpickle / pickle.loads threat model.
  * docs/source/user-guide/io/index.rst - toctree entry.
  * examples/multiprocessing_pickle_expr.py - runnable example: a Pool.map of a closure-capturing UDF across processes, with worker context registration in the initializer.
  * examples/ray_pickle_expr.py - Ray actor analogue.
  * examples/datafusion-ffi-example/python/tests/_test_pickle_strict_ffi.py - exercises the strict-mode refusal end to end against an FFI capsule scalar UDF (kept under the FFI example crate because the test needs that crate's compiled artifacts).
  * examples/README.md - index entries for the new files.

  Also tightens three docstrings that previously duplicated the security warning so they point at the canonical Security section instead:

  * PythonLogicalCodec::with_python_udf_inlining (rustdoc): one-line summary plus a relative pointer to distributing_work.rst and the upstream Python pickle module security warning.
  * SessionContext.with_python_udf_inlining: one-sentence summary plus :doc: link to the user guide.
  * datafusion.ipc module docstring: cross-reference to the user guide for the full pattern.

  The crate-level codec.rs module rustdoc also updates "pure-Python scalar UDFs" to "scalar / aggregate / window UDFs" now that all three are covered.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document Python-version and import portability caveats for inline UDFs

  Reviewer feedback on the Expr-pickle PRs (#1544) asked that the cloudpickle portability caveats be discoverable on the user-facing page, not only in docstrings. The distributing_work.rst page is the designated canonical home for the distribution story, so add them here:

  * New 'Portability requirements for inline Python UDFs' subsection covering the matching-Python-minor-version requirement and the by-value vs by-reference import-capture rule (imported modules must be importable on the worker).
  * Qualify the 'fully portable' Python-UDF bullet to point at the new requirements.
  * Cross-reference the new subsection from the closure-capture note.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: restore version-byte and cloudpickle-cache rustdoc wording

  Two codec.rs docstrings were reworded in PR4 in ways that dropped information:

  * try_encode_python_scalar_udf: restore the `DFPYUDF` family prefix + version byte description of the payload framing (PR4 had collapsed it to `DFPYUDF1` prefix, dropping the version-byte mention).
  * cloudpickle cached-handle comment: restore "The encode/decode helpers above" wording.

* docs: fix reversed tuple order in multiprocessing example docstring

  The 'Worker layout' docstring described tasks as `(expr, label)` but the code builds and unpacks them as `(label, expr)`. Correct the doc to match.

* Respond to first batch of reviewer comments

* docs: relocate and restructure distributing-work guide

  Move the page from user-guide/io/ to the top level of user-guide/ - distributing work is a runtime/operational concern, not a file-format topic, and the shorter "Distributing work" title fits the sidebar cleanly.

  Restructure the body to lead with the practical worker-setup pattern instead of the four-slot SessionContext taxonomy. The taxonomy survives at the bottom as a reference subsection; the worker-init example and portability rules now reach the reader before they need it.

  Also addresses reviewer NIT: wrap the `if __name__ == "__main__":` guidance in a `.. note::` admonition and link to the Python multiprocessing docs.

  Add a header paragraph to each runnable example pointing to the user-guide page so a reader who jumps straight to the example gets the surrounding context.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
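
The by-value vs by-reference rule named in the portability caveats can be illustrated with the standard library alone. The sketch below uses plain `pickle` rather than cloudpickle or any datafusion API, so it shows the general mechanism and not this project's codec; `make_adder` is a hypothetical helper introduced only for the illustration.

```python
import json
import pickle


def make_adder(offset):
    """Hypothetical helper: returns a closure that captures `offset`."""
    return lambda x: x + offset


# An importable function is pickled by reference: the payload records the
# module name and the function name, not the function's code.
by_reference = pickle.dumps(json.dumps)
print(b"json" in by_reference, b"dumps" in by_reference)

# Unpickling re-imports the module and looks the name up again, so the
# receiving process must be able to import `json` itself.
restored = pickle.loads(by_reference)
print(restored is json.dumps)

# A closure has no importable name, so stdlib pickle refuses it. Shipping it
# requires a by-value serializer (the commit message names cloudpickle).
try:
    pickle.dumps(make_adder(10))
    closure_pickled = True
except (pickle.PicklingError, AttributeError):
    closure_pickled = False
print(closure_pickled)
```

The by-reference half is what makes "imported modules must be importable on the worker" a hard requirement: the bytes carry a name to look up, not the code behind it.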
1 parent baec559 commit 0a9ca68

7 files changed

Lines changed: 771 additions & 7 deletions

crates/core/src/codec.rs

Lines changed: 12 additions & 7 deletions
@@ -19,11 +19,11 @@
 //!
 //! Datafusion-python plans can carry references to Python-defined
 //! objects that the upstream protobuf codecs do not know how to
-//! serialize: pure-Python scalar UDFs, Python query-planning
-//! extensions, and so on. Their state lives inside `Py<PyAny>`
-//! callables and closures rather than being recoverable from a name
-//! in the receiver's function registry. To ship a plan across a
-//! process boundary (pickle, `multiprocessing`, Ray actor,
+//! serialize: pure-Python scalar / aggregate / window UDFs, Python
+//! query-planning extensions, and so on. Their state lives inside
+//! `Py<PyAny>` callables and closures rather than being recoverable
+//! from a name in the receiver's function registry. To ship a plan
+//! across a process boundary (pickle, `multiprocessing`, Ray actor,
 //! `datafusion-distributed`, etc.) those payloads have to be encoded
 //! into the proto wire format itself.
 //!
@@ -256,7 +256,12 @@ impl PythonLogicalCodec {
     /// `cloudpickle.loads` on the inline `DFPY*` payload. It does
     /// **not** make `pickle.loads(untrusted_bytes)` safe; treat every
     /// `pickle.loads` on untrusted input as unsafe regardless of this
-    /// setting.
+    /// setting. See `docs/source/user-guide/io/distributing_work.rst`
+    /// (Security section) for the full threat model, and Python's
+    /// [pickle module security warning][1] for why `pickle.loads` is
+    /// unsafe in general.
+    ///
+    /// [1]: https://docs.python.org/3/library/pickle.html#module-pickle
     pub fn with_python_udf_inlining(mut self, enabled: bool) -> Self {
         self.python_udf_inlining = enabled;
         self
@@ -433,7 +438,7 @@ fn refuse_inline_payload(kind: &str, name: &str) -> datafusion::error::DataFusio
 /// encoding on this layer too — otherwise a plan with a Python UDF
 /// would round-trip at the logical level but break at the physical
 /// level. Both layers reuse the shared payload framing
-/// ([`PY_SCALAR_UDF_FAMILY`]) so the wire format is identical.
+/// ([`PY_SCALAR_UDF_FAMILY`] et al.) so the wire format is identical.
 #[derive(Debug)]
 pub struct PythonPhysicalCodec {
     inner: Arc<dyn PhysicalExtensionCodec>,
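
The payload framing referenced above (a `DFPYUDF` family prefix followed by a version byte) and the strict-mode refusal can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the codec's actual implementation: the version value `1`, its encoding as a single raw byte, the function names, and the exception type are all hypothetical.

```python
FAMILY_PREFIX = b"DFPYUDF"  # family prefix named in the commit message
FORMAT_VERSION = 1          # assumption: real value and encoding may differ


class InlinePayloadRefused(Exception):
    """Hypothetical error for a payload rejected because inlining is off."""


def frame_payload(body: bytes) -> bytes:
    """Prepend the family prefix and a one-byte version to the body."""
    return FAMILY_PREFIX + bytes([FORMAT_VERSION]) + body


def unframe_payload(data: bytes, inlining_enabled: bool = True) -> bytes:
    """Validate the header and return the body without interpreting it."""
    if not data.startswith(FAMILY_PREFIX):
        raise ValueError("not an inline UDF payload")
    # Refuse before touching the body, so nothing is unpickled in strict mode.
    if not inlining_enabled:
        raise InlinePayloadRefused("inline Python UDF payloads are disabled")
    header_len = len(FAMILY_PREFIX) + 1
    if len(data) < header_len:
        raise ValueError("payload is truncated before the version byte")
    version = data[len(FAMILY_PREFIX)]
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported payload version {version}")
    return data[header_len:]


framed = frame_payload(b"opaque-bytes")
print(unframe_payload(framed))
```

Keeping the version in its own byte, separate from the family prefix, lets a reader recognize the payload family first and then decide whether it understands that particular version.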

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ Example
    user-guide/common-operations/index
    user-guide/io/index
    user-guide/configuration
+   user-guide/distributing-work
    user-guide/sql
    user-guide/upgrade-guides
    user-guide/ai-coding-assistants
