Cache ModelMixin.dtype to avoid named_parameters walk per access#13571
akshan-main wants to merge 5 commits into
Profiled SD3 too (eager + compile, RTX PRO 6000 Blackwell, 2 steps) following the profiling guide. Denoising loop is clean. 0 syncs in Pre-loop has 2x ~10ms Tested adding
The sync was queue-drain. GPU has to do that work anyway, CPU just doesn't wait for it. Unlike Z-Image #13461, no per-step
DN6
left a comment
Proposal looks good to me. But we also need to account for enable_layerwise_casting.
@akshan-main Could you take a look at the CI failures, please?

@DN6 test_to_dtype broke because StableUnCLIP's custom .to() bypasses _apply; I added register_parameter invalidation. That's a third hook, though. If you'd rather not chase every mutation path, I'm happy to instead fix the normalizer's .to() or drop the cache. wdyt?
What does this PR do?
Addresses #13401
ModelMixin.dtype calls get_parameter_dtype(), which walks named_parameters() on every access. Pipelines read self.transformer.dtype / text_encoder.dtype / vae.dtype inside denoise loops, so the walk fires every step.

This PR caches dtype on first access and invalidates in _apply, which .to(), .cpu(), .cuda(), .half(), and .bfloat16() all route through. Generation outputs are bit-identical. A microbench on AutoencoderKL drops cached .dtype access from ~88us to ~0.1us (~1000x).

device is intentionally not cached: with group offloading the effective device changes per-forward, so a cache there would be wrong.

Same shape as the cache_context._set_context cache in #13356.

Profiling: 10 pipelines (eager, 2 inference steps, A100)
The fix removes the walk where it appears. The largest single saving is hunyuanv15 at 82.59 ms over 2 inference steps, scaling linearly to ~2.1 s at a typical 50 steps. Pipelines without the walk (chroma, ltx2) are unaffected.

Reproduction notebook (Colab)
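
For readers skimming the description, the cache-on-first-access, invalidate-in-_apply idea can be sketched with a toy, pure-Python stand-in. This is not the diffusers or torch implementation: ToyParam, ToyModel, _cached_dtype, and the walks counter are hypothetical names used only for illustration, and dtypes are plain strings.

```python
# Toy stand-in for the caching pattern described in this PR.
# Not diffusers or torch code; every name below is hypothetical.


class ToyParam:
    def __init__(self, dtype):
        self.dtype = dtype


class ToyModel:
    def __init__(self, dtypes):
        self._params = {f"p{i}": ToyParam(d) for i, d in enumerate(dtypes)}
        self._cached_dtype = None  # filled lazily on first .dtype access
        self.walks = 0  # counts parameter walks, for illustration only

    def named_parameters(self):
        self.walks += 1
        yield from self._params.items()

    def _apply(self, fn):
        # Dtype-changing calls funnel through here, so drop the cache first.
        self._cached_dtype = None
        for param in self._params.values():
            fn(param)
        return self

    def to(self, dtype):
        def cast(param):
            param.dtype = dtype

        return self._apply(cast)

    @property
    def dtype(self):
        if self._cached_dtype is None:
            # Slow path: walk the parameters once and remember the answer.
            _, first = next(iter(self.named_parameters()))
            self._cached_dtype = first.dtype
        return self._cached_dtype


model = ToyModel(["float32", "float32"])
for _ in range(50):  # e.g. 50 denoising steps, each reading .dtype
    assert model.dtype == "float32"
walks_before_cast = model.walks

model.to("float16")  # goes through _apply, which invalidates the cache
dtype_after_cast = model.dtype
walks_after_cast = model.walks
```

In this sketch, any mutation path that skips _apply would leave the cached value stale, which is the kind of failure reported for test_to_dtype in the conversation.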
Before submitting
Who can review?
@sayakpaul @dg845