Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939
Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939vlad-perevezentsev wants to merge 8 commits into
Conversation
|
View rendered docs @ https://intelpython.github.io/dpnp/pull/2939/index.html |
|
Array API standard conformance tests for dpnp=0.21.0dev1=py313h509198e_17 ran successfully. |
| _shape = a.shape | ||
| # Max element displacement reachable by the strides. | ||
| # Negative strides are handled by _copy_array, so only | ||
| # positive strides are possible here |
There was a problem hiding this comment.
There is another regression also, which can be handled separately:
import dpnp as np
a = np.arange(4, dtype='c8')
b = np.broadcast_to(a, (3, 4))
np.fft.fft(b, axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 np.fft.fft(b, axis=0)
File ~/code/dpnp/dpnp/fft/dpnp_iface_fft.py:122, in fft(a, n, axis, norm, out)
47 """
48 Compute the one-dimensional discrete Fourier Transform.
49
(...) 118
119 """
121 dpnp.check_supported_arrays_type(a)
--> 122 return dpnp_fft(
123 a, forward=True, real=False, n=n, axis=axis, norm=norm, out=out
124 )
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:664, in dpnp_fft(a, forward, real, n, axis, norm, out)
658 if c2r:
659 # input array should be Hermitian for c2r FFT
660 a = _make_array_hermitian(
661 a, axis, dpnp.are_same_logical_tensors(a, a_orig)
662 )
--> 664 return _fft(
665 a,
666 norm=norm,
667 out=out,
668 forward=forward,
669 # TODO: currently in-place is only implemented for c2c, see SAT-7154
670 in_place=in_place and c2c,
671 c2c=c2c,
672 axes=axis,
673 batch_fft=a_ndim != 1,
674 )
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:447, in _fft(a, norm, out, forward, in_place, c2c, axes, batch_fft)
443 a_strides = _standardize_strides_to_nonzero(strides, a.shape)
444 dsc, out_strides = _commit_descriptor(
445 a, forward, in_place, c2c, a_strides, index, batch_fft
446 )
--> 447 res = _compute_result(dsc, a, out, forward, c2c, out_strides)
448 res = _scale_result(res, a.shape, norm, forward, index)
450 # Revert swapped axes
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:239, in _compute_result(dsc, a, out, forward, c2c, out_strides)
231 result = dpnp_array(
232 out_shape,
233 dtype=out_dtype,
(...) 236 sycl_queue=exec_q,
237 )
238 res_usm = result.get_array()
--> 239 ht_fft_event, fft_event = fi._fft_out_of_place(
240 dsc, a_usm, res_usm, forward, depends=dep_evs
241 )
242 _manager.add_event_pair(ht_fft_event, fft_event)
244 if not isinstance(result, dpnp_array):
ValueError: Memory addressed by the output array is not sufficiently ample.There was a problem hiding this comment.
That is a great catch!
I will address it in the following PR
Thank you!
antonwolfy
left a comment
There was a problem hiding this comment.
Thank you @vlad-perevezentsev, LGTM with small nits
| # for the output array and unnecessary copy to contiguous | ||
| # after oneMKL FFT | ||
| elem_strides = dpnp.get_usm_ndarray(a).strides | ||
| _shape = a.shape |
There was a problem hiding this comment.
I'd rename and move away above the if branch, since the same is used at lines 439, 444 and so on below
| _shape = a.shape | |
| a_shape = a.shape |
| # If so, copy to contiguous to avoid oversized allocation | ||
| # for the output array and unnecessary copy to contiguous | ||
| # after oneMKL FFT | ||
| elem_strides = dpnp.get_usm_ndarray(a).strides |
There was a problem hiding this comment.
We have exactly the same call at line 443, it'd be better to move above the if-branch to call once.
This PR fixes an issue discovered while implementing #2927 where several FFT tests started failing after adding validation for stride configurations with oversized memory footprints.
The problem was that FFT could allocate output arrays using the same strided layout as a non-contiguous input. For inputs such as
a[::2, this resulted in oversized allocations followed by an additional copy to a contiguous array.This fix checks whether the memory footprint implied by the input strides exceeds the number of elements in the array. If so, the input is copied to a contiguous layout before configuring the FFT descriptor allowing oneMKL FFT to produce a contiguous output directly
By avoiding oversized allocations and the extra copy this significantly improves the performance of all FFT operations in dpnp with complex strided inputs
Additionally, the cuFFT C-contiguous requirement is now enforced for all code paths. Previously it was only applied within
batch_fftbranch which was a latent bug for 1D non-batch inputs on CUDA.