Fix fp16 LoRA unscale crash after validation in train_dreambooth_lora.py #13895

Merged: sayakpaul merged 2 commits into huggingface:main from HaozheZhang6:fix-dreambooth-lora-fp16-validation-unscale on Jun 9, 2026

@HaozheZhang6 HaozheZhang6 commented Jun 9, 2026

Contributor

What does this PR do?

Fixes #13124

Training train_dreambooth_lora.py with --mixed_precision="fp16" and --validation_prompt crashes on the first optimizer step after a validation run:

ValueError: Attempting to unscale FP16 gradients.

Removing --validation_prompt avoids it, which points at log_validation.

Root cause

Under fp16, cast_training_params(models, dtype=torch.float32) keeps the trainable LoRA params in fp32 (the standard fp16 mitigation, see #6514 / #6554).

The in-loop validation pipeline is built with the same live unet object:

pipeline = DiffusionPipeline.from_pretrained(..., unet=unwrap_model(unet), torch_dtype=weight_dtype)

log_validation then runs pipeline.to(accelerator.device, dtype=torch_dtype) with torch_dtype=weight_dtype (fp16). That .to(..., dtype=fp16) downcasts the shared unet's fp32 LoRA params back to fp16, so the next backward produces fp16 grads and GradScaler.unscale_ raises.

train_dreambooth_lora_sdxl.py does not hit this because its log_validation moves the pipeline with .to(accelerator.device) only (no dtype). This PR makes the SD1.5 script consistent.
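The downcast can be illustrated without diffusers. The sketch below is a toy reproduction, not the training script: `TinyLoRAModule`, `ToyPipeline`, `make_training_unet`, and `trainable_dtypes` are hypothetical names, and the upcast loop only stands in for `cast_training_params`. It relies on the fact that `nn.Module.to(...)` converts a module's parameters in place, so any object holding a reference to the live module sees the new dtype.

```python
import torch
from torch import nn


class TinyLoRAModule(nn.Module):
    """Toy stand-in for a UNet: one frozen base weight plus a LoRA pair."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4, bias=False)
        self.lora_A = nn.Linear(4, 2, bias=False)
        self.lora_B = nn.Linear(2, 4, bias=False)
        self.base.requires_grad_(False)  # only the LoRA weights are trainable


class ToyPipeline:
    """Hypothetical pipeline that holds the live module, not a copy."""

    def __init__(self, unet):
        self.unet = unet

    def to(self, device, dtype=None):
        if dtype is None:
            self.unet.to(device)
        else:
            self.unet.to(device, dtype=dtype)
        return self


def trainable_dtypes(module):
    """Collect the dtypes of all trainable parameters."""
    return {p.dtype for p in module.parameters() if p.requires_grad}


def make_training_unet():
    unet = TinyLoRAModule().to(torch.float16)  # everything in weight_dtype
    for p in unet.parameters():                # stand-in for cast_training_params
        if p.requires_grad:
            p.data = p.data.to(torch.float32)
    return unet


# Move with an explicit dtype: the shared LoRA params are downcast.
unet_before = make_training_unet()
ToyPipeline(unet_before).to("cpu", dtype=torch.float16)
dtypes_with_cast = trainable_dtypes(unet_before)

# Move with the device only: the LoRA params keep their fp32 dtype.
unet_after = make_training_unet()
ToyPipeline(unet_after).to("cpu")
dtypes_without_cast = trainable_dtypes(unet_after)
```

In this toy setup the frozen base weight stays in fp16 in both cases; only the trainable parameters differ between the two moves.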

Fix

Drop the dtype=torch_dtype from the .to(...) in log_validation (plus an explanatory comment) so the shared unet keeps its fp32 LoRA params. The validation pipeline is already built with torch_dtype=weight_dtype, and inference runs under torch.amp.autocast, so validation behavior is unchanged.

Verification

The crash is GPU-only and hard to exercise in the CPU-based example CI (see "On a regression test" below), so I verified the mechanism directly on CPU: a module with fp16 base weights + fp32 LoRA params, run through autocast forward → backward → GradScaler.unscale_:

| Pipeline move | LoRA param dtype | unscale_ |
|---|---|---|
| .to(device, dtype=fp16) (before) | fp16 | ValueError: Attempting to unscale FP16 gradients. |
| .to(device) (after) | fp32 | OK |
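For readers who want to see why the parameter dtype decides the outcome, here is a minimal CPU sketch. It is not the verification script used for this PR. `check_unscalable` and `backward_on_param` are hypothetical helpers; the former only mirrors the dtype check that makes `GradScaler.unscale_` raise, and the real scaler is left out so the snippet does not depend on CUDA. The forward math is done in fp32 via `.float()` purely to keep the example portable.

```python
import torch


def check_unscalable(params):
    """Hypothetical mirror of the check in unscale_: fp16 gradients are rejected."""
    for p in params:
        if p.grad is not None and p.grad.dtype == torch.float16:
            raise ValueError("Attempting to unscale FP16 gradients.")


def backward_on_param(param_dtype):
    """Run one backward pass and return the leaf parameter with its gradient."""
    param = torch.ones(3, dtype=param_dtype, requires_grad=True)
    x = torch.arange(3, dtype=torch.float32)
    loss = (param.float() * x).sum()  # compute in fp32 regardless of param dtype
    loss.backward()                   # the gradient is stored in the param's own dtype
    return param


fp32_param = backward_on_param(torch.float32)
check_unscalable([fp32_param])  # passes: the gradient is fp32

fp16_param = backward_on_param(torch.float16)
try:
    check_unscalable([fp16_param])
    raised = False
except ValueError:
    raised = True  # an fp16 parameter yields an fp16 gradient, which is rejected
```

The point is that the gradient of a leaf parameter takes the parameter's dtype, so once validation has downcast the LoRA params, every later backward produces fp16 grads even if the forward pass runs under autocast.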

ruff check and ruff format --check pass on the changed file.

On a regression test

I looked into adding a subprocess test under examples/dreambooth/test_dreambooth_lora.py, but this bug cannot be reproduced by the CPU-based example CI, for two independent reasons:

  1. The fp16 GradScaler is CUDA-only. With Accelerator(mixed_precision="fp16") on CPU, accelerator.scaler is None, so unscale_ is never called and the error can't fire. This is also why the example suite currently has no --mixed_precision fp16 tests.
  2. Validation inference crashes on the tiny test checkpoint regardless of this fix. Running the script with --validation_prompt on hf-internal-testing/tiny-stable-diffusion-pipe fails earlier inside log_validation with ValueError: Input image size (224*224) doesn't match model (30*30), before the post-validation step is ever reached.

So a green/red CPU test isn't achievable here. I kept the change minimal and consistent with the SDXL script instead. Happy to add a @require_torch_gpu nightly test (or anything else you'd prefer) if that's the convention you'd like for this path.

Note: a prior attempt (#13510) was self-closed unreviewed; it used the alternative approach of re-running cast_training_params after validation. This PR instead removes the source of the downcast, matching the SDXL script.
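One possible reason to prefer removing the downcast over repairing it afterwards (an observation added here, not something stated in the PR or in #13510) is that an fp32 → fp16 → fp32 round trip restores the dtype but not the original values. The sketch below shows this with a single tensor; `upcast_trainable_params` is a hypothetical stand-in for re-running `cast_training_params`, and the sample values are arbitrary.

```python
import torch


def upcast_trainable_params(params, dtype=torch.float32):
    """Hypothetical stand-in for re-running cast_training_params after validation."""
    for p in params:
        if p.requires_grad:
            p.data = p.data.to(dtype)


original = torch.tensor([0.1, 1e-5, 3.14159], dtype=torch.float32)
lora_weight = torch.nn.Parameter(original.clone())

# Simulate the validation-time downcast, then repair the dtype afterwards.
lora_weight.data = lora_weight.data.to(torch.float16)
upcast_trainable_params([lora_weight])

dtype_restored = lora_weight.dtype == torch.float32
values_preserved = torch.equal(lora_weight.data, original)
```

Here `dtype_restored` is true while `values_preserved` is false, because the values were rounded to fp16 precision on the way through.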

Who can review?

@sayakpaul

@github-actions github-actions Bot added fixes-issue size/S PR with diff < 50 LOC examples and removed size/S PR with diff < 50 LOC labels Jun 9, 2026
When training with `--mixed_precision="fp16"` and `--validation_prompt`,
the first optimizer step after a validation run fails with
`ValueError: Attempting to unscale FP16 gradients`.

Under fp16, `cast_training_params` keeps the trainable LoRA params in
fp32. The in-loop validation pipeline is built with the same live `unet`
object, and `log_validation` then calls `pipeline.to(device, dtype=torch_dtype)`,
which downcasts those fp32 LoRA params back to fp16. The next backward
therefore produces fp16 grads and `GradScaler.unscale_` raises.

Drop the dtype cast from that `.to(...)` so the shared `unet` keeps its
fp32 LoRA params. This matches train_dreambooth_lora_sdxl.py, which moves
the validation pipeline with `.to(accelerator.device)` only.

Fixes huggingface#13124
@HaozheZhang6 HaozheZhang6 force-pushed the fix-dreambooth-lora-fp16-validation-unscale branch from fd60288 to 4a9c51a Compare June 9, 2026 08:54
@github-actions github-actions Bot added the size/S PR with diff < 50 LOC label Jun 9, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul sayakpaul merged commit e377c0a into huggingface:main Jun 9, 2026
27 of 30 checks passed
HaozheZhang6 added a commit to HaozheZhang6/diffusers that referenced this pull request Jun 10, 2026
…LoRA scripts

Follow-up to huggingface#13895, which fixed this for examples/dreambooth/train_dreambooth_lora.py.
The same fp16 footgun is present in the other DreamBooth LoRA training scripts: under
`--mixed_precision="fp16"`, `cast_training_params(..., dtype=torch.float32)` keeps the
trainable LoRA params in fp32, but `log_validation` rebuilds the in-loop validation
pipeline around the *live* training transformer (`transformer=unwrap_model(transformer)`)
and then casts it to fp16. That downcasts the fp32 LoRA params, so the next optimizer
step raises `ValueError: Attempting to unscale FP16 gradients`.

Apply the same fix across the remaining scripts:
- flux, flux_kontext, qwen_image, hidream, and advanced flux: drop `dtype=torch_dtype`
  from `pipeline.to(accelerator.device, ...)` (keep the device move), matching huggingface#13895.
- z_image and the flux2 variants: the cast is `pipeline.to(dtype=torch_dtype)` with no
  device move, immediately followed by `enable_model_cpu_offload()`, so just drop the
  cast line. Frozen weights already use `weight_dtype` and the offload call handles
  device placement.

The final (post-training) validation in every script builds a fresh pipeline from the
saved weights, so it is unaffected either way.
Development

Successfully merging this pull request may close these issues.

train_dreambooth_lora.py -- ValueError: Attempting to unscale FP16 gradients caused by "--validation_prompt" param.