Fix Windows unittest CI: force CPU-only build (CUDA 13.2 toolkit on runner breaks _portable_lib load)#20527
Gasoonjia wants to merge 2 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20527
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Pending, 2 Unclassified Failures
As of commit d2b3fac with merge base 6021a58:
NEW FAILURE - The following job has failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…failure

The Windows CI image ships CUDA toolkits on PATH. After adding (13, 2) to SUPPORTED_CUDA_VERSIONS (#20440), install_executorch's auto-detection (setup.py: is_cuda_available() via nvcc) started returning True on the Windows runner (which has the CUDA 13.2 toolkit), so it flipped EXECUTORCH_BUILD_CUDA=ON.

But the unittest jobs install CPU torch, so the resulting CUDA build of _portable_lib fails to load its CUDA DLLs at import time:

    ImportError: DLL load failed while importing _portable_lib

causing all pytest collection to error out (unittest / unittest-editable / unittest-release on windows).

Add a -cpuOnly switch to setup-windows.ps1 that forces -DEXECUTORCH_BUILD_CUDA=OFF via CMAKE_ARGS, and pass it from the CPU unittest workflow. The CUDA Windows jobs (cuda-windows.yml) keep the default and are unaffected.
747da69 to 952e121
…LL load failure

Same root cause as the unittest fix in this PR, second site. The Windows wheel build (build-wheels-windows.yml -> .ci/scripts/wheel/) does not go through setup-windows.ps1.

The Windows CI image has the CUDA 13.2 toolkit on PATH, so after #20440 added (13, 2) to SUPPORTED_CUDA_VERSIONS, install_executorch's auto-detection enables EXECUTORCH_BUILD_CUDA and bakes a CUDA _portable_lib + aoti_cuda_shims.lib into the CPU wheel. The smoke test then fails with:

    ImportError: DLL load failed while importing _portable_lib

Windows wheels are CPU-only (with-cuda: disabled), so force -DEXECUTORCH_BUILD_CUDA=OFF via CMAKE_ARGS in pre_build_script.sh on Windows.
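As a rough illustration of the override both commits describe, here is a minimal Python sketch of appending the CUDA-off flag to CMAKE_ARGS. The real changes live in setup-windows.ps1 (PowerShell) and pre_build_script.sh (shell); the helper name force_cpu_only is hypothetical, and the sketch assumes, as the commit messages state, that an explicit -DEXECUTORCH_BUILD_CUDA=OFF in CMAKE_ARGS takes effect over the auto-detection.

```python
CPU_ONLY_FLAG = "-DEXECUTORCH_BUILD_CUDA=OFF"
OPTION_PREFIX = "-DEXECUTORCH_BUILD_CUDA="


def force_cpu_only(env):
    """Return a copy of `env` whose CMAKE_ARGS sets the CUDA option to OFF.

    Hypothetical helper for illustration only; not part of the PR.
    """
    updated = dict(env)
    # Naive whitespace split: fine for simple flags, but it would break
    # arguments that contain quoted spaces.
    existing = updated.get("CMAKE_ARGS", "").split()
    # Drop any earlier value of the same option so only one setting remains.
    kept = [arg for arg in existing if not arg.startswith(OPTION_PREFIX)]
    updated["CMAKE_ARGS"] = " ".join(kept + [CPU_ONLY_FLAG])
    return updated
```

A caller would pass a copy of the process environment (for example dict(os.environ)) and hand the result to the subprocess that runs the install. Other flags already present in CMAKE_ARGS are preserved.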
Summary
Fixes the Windows unittest CI breakage introduced by #20440 (Add CUDA 13.2 support and drop unsupported 12.8/12.9).
unittest / windows, unittest-editable / windows, and unittest-release / windows have been red on main since c0643f5 (parent was green).
Root cause
The Windows CI image ships CUDA toolkits on PATH (it has both v13.2 and v13.0; nvcc resolves to 13.2.78). install_executorch auto-enables the CUDA backend when install_utils.is_cuda_available() returns True (setup.py ~L882-889), and that check is driven purely by the nvcc version being in SUPPORTED_CUDA_VERSIONS.
Before #20440: 13.2 ∉ SUPPORTED_CUDA_VERSIONS → is_cuda_available() = False → CPU-only build → green.
After #20440: (13, 2) makes is_cuda_available() = True on the Windows runner → setup.py flips -DEXECUTORCH_BUILD_CUDA=ON. But the unittest jobs install CPU torch, so the CUDA build of _portable_lib can't find its CUDA DLLs and the import fails with ImportError: DLL load failed while importing _portable_lib.
That aborts pytest collection (24 errors during collection) and fails the job.
Fix
Add a -cpuOnly switch to the shared .ci/scripts/setup-windows.ps1 that forces -DEXECUTORCH_BUILD_CUDA=OFF via CMAKE_ARGS, and pass it from the CPU unittest workflow (_unittest.yml). This restores the pre-#20440 CPU-only behavior for these jobs.
The CUDA Windows jobs (cuda-windows.yml) call the same script without -cpuOnly, so they are unaffected and keep building CUDA.
Note / follow-up
The deeper issue is that the auto-detection keys off nvcc presence rather than whether the installed torch is actually a CUDA build. A more general fix would be to only enable EXECUTORCH_BUILD_CUDA when torch.version.cuda is set. Left out here to keep the unblock low-risk; happy to follow up.
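To make the difference concrete, the following is a minimal sketch of the two checks, not the code in setup.py or install_utils. The function names are hypothetical, the SUPPORTED_CUDA_VERSIONS value here is an illustrative subset rather than the repository's real list, and the torch_cuda_version parameter stands in for torch.version.cuda (which is None on a CPU-only torch build) so the sketch runs without torch installed.

```python
import re

# Illustrative subset only; the real list lives in the repository.
SUPPORTED_CUDA_VERSIONS = {(13, 2)}


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def nvcc_based_check(nvcc_output, supported=SUPPORTED_CUDA_VERSIONS):
    """Current behavior as described above: only the nvcc version matters."""
    return parse_nvcc_release(nvcc_output) in supported


def torch_gated_check(nvcc_output, torch_cuda_version,
                      supported=SUPPORTED_CUDA_VERSIONS):
    """Proposed follow-up: also require that torch itself is a CUDA build.

    `torch_cuda_version` is a stand-in for torch.version.cuda.
    """
    if torch_cuda_version is None:
        return False
    return nvcc_based_check(nvcc_output, supported)
```

With the runner's toolkit (nvcc reporting release 13.2) and CPU torch, nvcc_based_check returns True while torch_gated_check returns False, which is the CPU-only outcome the unittest jobs need.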