nnUNetV2Runner cannot be run with NVIDIA MIG configuration #7497

Description

@che85

Describe the bug

python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

When I provide the UUID of the MIG device as gpu_id, I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 113, in run_ddp
    torch.cuda.set_device(torch.device('cuda', dist.get_rank()))
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Similarly, setting CUDA_VISIBLE_DEVICES before the command (CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model) does not work, because nnUNetV2Runner overwrites the variable.

Running nnUNet natively works fine with:

CUDA_VISIBLE_DEVICES={MIG_UUID} nnUNetv2_train ... 2d 4

To Reproduce
Steps to reproduce the behavior:

  1. Use a machine with a MIG-enabled GPU
  2. Run:
python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

OR

CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 

Expected behavior

nnUNetV2Runner should not overwrite CUDA_VISIBLE_DEVICES if it is already set in the environment.
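
To make the expected behavior concrete, here is a minimal sketch of how the value could be resolved. It is not MONAI's actual code: `resolve_visible_devices` is a hypothetical helper, and giving a pre-set CUDA_VISIBLE_DEVICES priority over `gpu_id` is an assumption based on the expectation stated above. Device identifiers are treated as opaque strings, so a MIG UUID passes through unchanged.

```python
import os


def resolve_visible_devices(gpu_id=None, environ=None):
    """Return the value to use for CUDA_VISIBLE_DEVICES, or None to leave it unset.

    Hypothetical helper, not part of MONAI or nnU-Net.
    """
    environ = os.environ if environ is None else environ

    # Assumption: a value the user already exported wins over gpu_id.
    existing = environ.get("CUDA_VISIBLE_DEVICES")
    if existing:
        return existing

    if gpu_id is None:
        return None

    # Accept a single id (int index or MIG UUID string) or a sequence of ids,
    # and never try to convert them to integers.
    if isinstance(gpu_id, (list, tuple)):
        return ",".join(str(g) for g in gpu_id)
    return str(gpu_id)
```

With this logic, a runner would only set the variable from `gpu_id` when the user has not already exported it, and the number of processes to launch could be derived from the number of comma-separated entries rather than from the numeric value of an id.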
