Fix fp16 LoRA unscale crash after validation in train_dreambooth_lora.py by HaozheZhang6 · Pull Request #13895 · huggingface/diffusers

HaozheZhang6 · 2026-06-09T08:45:25Z

What does this PR do?

Training train_dreambooth_lora.py with --mixed_precision="fp16" and --validation_prompt crashes on the first optimizer step after a validation run:

ValueError: Attempting to unscale FP16 gradients.

Removing --validation_prompt avoids it, which points at log_validation.

Root cause

Under fp16, cast_training_params(models, dtype=torch.float32) keeps the trainable LoRA params in fp32 (the standard fp16 mitigation, see #6514 / #6554).

The in-loop validation pipeline is built with the same live unet object:

pipeline = DiffusionPipeline.from_pretrained(..., unet=unwrap_model(unet), torch_dtype=weight_dtype)

log_validation then runs pipeline.to(accelerator.device, dtype=torch_dtype) with torch_dtype=weight_dtype (fp16). That .to(..., dtype=fp16) downcasts the shared unet's fp32 LoRA params back to fp16, so the next backward produces fp16 grads and GradScaler.unscale_ raises.
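The downcast can be illustrated without diffusers. The following is a minimal sketch, assuming a hypothetical `TinyPipeline` class as a stand-in for the real pipeline and a small `nn.ModuleDict` as a stand-in for the UNet (neither name comes from the training script). It only shows that a dtype cast applied through an object holding a reference to the training module changes that module's parameters in place.

```python
import torch
from torch import nn


class TinyPipeline:
    """Hypothetical stand-in for a pipeline that holds the live training module."""

    def __init__(self, unet: nn.Module):
        self.unet = unet  # same object as the one being trained, not a copy

    def to(self, device, dtype=None):
        # nn.Module.to casts parameters in place when a dtype is given.
        if dtype is None:
            self.unet.to(device)
        else:
            self.unet.to(device, dtype=dtype)
        return self


# Stand-in for the UNet: frozen fp16 base weights plus trainable fp32 "LoRA" weights.
unet = nn.ModuleDict(
    {
        "base": nn.Linear(4, 4).to(torch.float16).requires_grad_(False),
        "lora": nn.Linear(4, 4),  # left in fp32, as cast_training_params would leave it
    }
)

pipeline = TinyPipeline(unet)
dtype_before = unet["lora"].weight.dtype  # torch.float32

# Simulated validation-time move with a dtype argument.
pipeline.to("cpu", dtype=torch.float16)
dtype_after = unet["lora"].weight.dtype  # torch.float16

print(dtype_before, "->", dtype_after)
```

Because `pipeline.unet` and `unet` are the same object, the trainable parameters seen by the optimizer are now fp16 as well.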

train_dreambooth_lora_sdxl.py does not hit this — its log_validation moves the pipeline with .to(accelerator.device) only (no dtype). This PR makes the SD1.5 script consistent.

Fix

Drop the dtype=torch_dtype from the .to(...) in log_validation (plus an explanatory comment) so the shared unet keeps its fp32 LoRA params. The validation pipeline is already built with torch_dtype=weight_dtype, and inference runs under torch.amp.autocast, so validation behavior is unchanged.

Verification

The crash is GPU-only and hard to exercise in the CPU-based example CI (see "On a regression test" below), so I verified the mechanism directly on CPU: a module with fp16 base weights + fp32 LoRA params, run through autocast forward → backward → GradScaler.unscale_:

| Pipeline move | LoRA param dtype | `unscale_` |
| --- | --- | --- |
| `.to(device, dtype=fp16)` (before) | fp16 | `ValueError: Attempting to unscale FP16 gradients.` |
| `.to(device)` (after) | fp32 | OK |

ruff check and ruff format --check pass on the changed file.

On a regression test

I looked into adding a subprocess test under examples/dreambooth/test_dreambooth_lora.py, but this bug cannot be reproduced by the CPU-based example CI, for two independent reasons:

1. The fp16 GradScaler is CUDA-only. With Accelerator(mixed_precision="fp16") on CPU, accelerator.scaler is None, so unscale_ is never called and the error can't fire. This is also why the example suite currently has no --mixed_precision fp16 tests.
2. Validation inference crashes on the tiny test checkpoint regardless of this fix. Running the script with --validation_prompt on hf-internal-testing/tiny-stable-diffusion-pipe fails earlier inside log_validation with ValueError: Input image size (224*224) doesn't match model (30*30), before the post-validation step is ever reached.

So a green/red CPU test isn't achievable here. I kept the change minimal and consistent with the SDXL script instead. Happy to add a @require_torch_gpu nightly test (or anything else you'd prefer) if that's the convention you'd like for this path.

Note: a prior attempt (#13510) was self-closed unreviewed; it used the alternative approach of re-running cast_training_params after validation. This PR instead removes the source of the downcast, matching the SDXL script.
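For comparison, the alternative approach can be sketched in plain PyTorch. The helper `upcast_trainable_params` below is a hypothetical, torch-only stand-in for re-running `cast_training_params` after validation; it is not the code from #13510.

```python
import torch
from torch import nn


def upcast_trainable_params(module: nn.Module, dtype: torch.dtype = torch.float32) -> None:
    """Hypothetical stand-in: cast only the trainable parameters back to `dtype`."""
    for param in module.parameters():
        if param.requires_grad:
            param.data = param.data.to(dtype)


model = nn.ModuleDict(
    {
        "base": nn.Linear(4, 4, bias=False).to(torch.float16).requires_grad_(False),
        "lora": nn.Linear(4, 4, bias=False),
    }
)

# Pick a value that fp32 can represent but fp16 cannot.
with torch.no_grad():
    model["lora"].weight.fill_(1.0 + 2.0**-12)
original = model["lora"].weight.detach().clone()

model.to(dtype=torch.float16)   # simulated validation-time downcast
upcast_trainable_params(model)  # repair step applied after validation

restored = model["lora"].weight.detach()
print(restored.dtype)                   # fp32 again
print(torch.equal(original, restored))  # values were rounded on the way through fp16
```

In this sketch the dtype is restored, but the round trip through fp16 rounds the trained values. Removing the cast avoids that round trip altogether.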

Before submitting

Did you read the contributor guideline?
Was this discussed/approved via a GitHub issue? Fixes #13124 (train_dreambooth_lora.py: ValueError: Attempting to unscale FP16 gradients caused by the "--validation_prompt" param).
Did you write any new necessary tests? See "On a regression test" above — a CPU CI test can't reproduce this GPU-only crash; verified manually.

Who can review?

@sayakpaul

When training with `--mixed_precision="fp16"` and `--validation_prompt`, the first optimizer step after a validation run fails with `ValueError: Attempting to unscale FP16 gradients`.

Under fp16, `cast_training_params` keeps the trainable LoRA params in fp32. The in-loop validation pipeline is built with the same live `unet` object, and `log_validation` then calls `pipeline.to(device, dtype=torch_dtype)`, which downcasts those fp32 LoRA params back to fp16. The next backward therefore produces fp16 grads and `GradScaler.unscale_` raises.

Drop the dtype cast from that `.to(...)` so the shared `unet` keeps its fp32 LoRA params. This matches train_dreambooth_lora_sdxl.py, which moves the validation pipeline with `.to(accelerator.device)` only.

Fixes huggingface#13124

HuggingFaceDocBuilderDev · 2026-06-09T09:57:00Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…LoRA scripts

Follow-up to huggingface#13895, which fixed this for examples/dreambooth/train_dreambooth_lora.py.

The same fp16 footgun is present in the other DreamBooth LoRA training scripts: under `--mixed_precision="fp16"`, `cast_training_params(..., dtype=torch.float32)` keeps the trainable LoRA params in fp32, but `log_validation` rebuilds the in-loop validation pipeline around the *live* training transformer (`transformer=unwrap_model(transformer)`) and then casts it to fp16. That downcasts the fp32 LoRA params, so the next optimizer step raises `ValueError: Attempting to unscale FP16 gradients`.

Apply the same fix across the remaining scripts:

- flux, flux_kontext, qwen_image, hidream, and advanced flux: drop `dtype=torch_dtype` from `pipeline.to(accelerator.device, ...)` (keep the device move), matching huggingface#13895.
- z_image and the flux2 variants: the cast is `pipeline.to(dtype=torch_dtype)` with no device move, immediately followed by `enable_model_cpu_offload()`, so just drop the cast line. Frozen weights already use `weight_dtype` and the offload call handles device placement.

The final (post-training) validation in every script builds a fresh pipeline from the saved weights, so it is unaffected either way.

github-actions Bot added the fixes-issue, size/S (PR with diff < 50 LOC), and examples labels and removed the size/S (PR with diff < 50 LOC) label Jun 9, 2026

HaozheZhang6 force-pushed the fix-dreambooth-lora-fp16-validation-unscale branch from fd60288 to 4a9c51a June 9, 2026 08:54

github-actions Bot added the size/S (PR with diff < 50 LOC) label Jun 9, 2026

sayakpaul approved these changes Jun 9, 2026


Merge branch 'main' into fix-dreambooth-lora-fp16-validation-unscale

1de289a

sayakpaul merged commit e377c0a into huggingface:main Jun 9, 2026
27 of 30 checks passed

HaozheZhang6 mentioned this pull request Jun 9, 2026

Fix fp16 LoRA unscale crash after validation in remaining DreamBooth LoRA scripts #13899

Open

3 tasks


sayakpaul merged 2 commits into huggingface:main from HaozheZhang6:fix-dreambooth-lora-fp16-validation-unscale
