GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679
GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679Alex-PLACET wants to merge 29 commits into
Conversation
|
|
- Added a new benchmark file `vector_search_sorted_benchmark.cc` to evaluate the performance of the SearchSorted function for various data types including Int64, String, and Binary. - Created a comprehensive test suite in `vector_search_sorted_test.cc` to validate the correctness of SearchSorted across different scenarios, including handling of null values, scalar needles, and run-end encoded arrays. - Ensured that the benchmarks cover both left and right search options, as well as edge cases like empty arrays and arrays with leading/trailing nulls.
…rks for needles with null runs
…tation overview and flow
…lize ranges for leading/trailing null counts
…ensive tests for supported types
8e09ea3 to
4ec630e
Compare
…xcept to length method
pitrou
left a comment
There was a problem hiding this comment.
I haven't looked at the implementation yet, but have reviewed the tests and benchmarks (which are quite comprehensive, thank you!).
One missing item is support for chunked arrays. Besides that, see comments below :)
| SearchSortedOptions(SearchSortedOptions::Left))); | ||
| ASSERT_OK_AND_ASSIGN(auto right, | ||
| SearchSorted(Datum(values), Datum(needles), | ||
| SearchSortedOptions(SearchSortedOptions::Right))); |
There was a problem hiding this comment.
Let's call ValidateFull on both results here too?
(also I'm curious, why not reuse CheckSimpleSearchSorted?)
| std::string scalar_needle_json; | ||
| uint64_t expected_scalar_left; | ||
| uint64_t expected_scalar_right; |
There was a problem hiding this comment.
Note that you could also generate the scalar needle tests automatically by calling GetScalar on the array needles and the expected results. This would make this easier to maintain later.
…and add tests for chunked values and needles
|
@Alex-PLACET Is this ready for review again? |
|
@Alex-PLACET It looks like the benchmarks don't compile anymore? |
pitrou
left a comment
There was a problem hiding this comment.
Now I also took a look at the implementation and some aspects of it seem overly-complicated (but I may miss something). See comments below.
| "[1, 3, 3, 5]", | ||
| "[0, 3, 4, 6]", |
There was a problem hiding this comment.
Can we make the values and needles arrays different lengths?
| auto values = ArrayFromJSON(int16(), "[]"); | ||
| auto needles = ArrayFromJSON(int16(), "[1, 2, 3]"); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto result, SearchSorted(Datum(values), Datum(needles))); |
There was a problem hiding this comment.
Why not reuse CheckSimpleSearchSorted? It would validate the results and also stress scalar needles.
There was a problem hiding this comment.
You are right, now it's the case
| auto values = ArrayFromJSON(int32(), "[null, 200, 300, 300]"); | ||
| auto needles = ArrayFromJSON(int32(), "[50, 200, 250, 400]"); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto left, |
There was a problem hiding this comment.
Again, let's re-use the testing helper CheckSimpleSearchSorted everywhere possible, otherwise we're forgetting some checks.
| {"Float32", float32(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]", | ||
| "[0, 1, 3, 4]", "[0, 3, 3, 4]"}, | ||
| {"Float64", float64(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]", | ||
| "[0, 1, 3, 4]", "[0, 3, 3, 4]"}, |
There was a problem hiding this comment.
Do we want to test NaNs here? They will probably need special treatment, whether in the values or the needles.
For the record, sorting an array with Arrow always put NaNs between nulls and non-null values.
e.g. either [1.0, 3.0, 3.0, 5.0, NaN, NaN, null] or [null, NaN, NaN, 1.0, 3.0, 3.0, 5.0].
There was a problem hiding this comment.
I completely missed this case, I added some tests for that:
FloatValuesWithTrailingNaNsAndNulls and FloatValuesWithTrailingNaNsAndNulls
| ASSERT_OK_AND_ASSIGN(auto left, | ||
| SearchSorted(Datum(values), Datum(needles), | ||
| SearchSortedOptions(SearchSortedOptions::Left))); | ||
| ASSERT_OK_AND_ASSIGN(auto right, | ||
| SearchSorted(Datum(values), Datum(needles), | ||
| SearchSortedOptions(SearchSortedOptions::Right))); |
There was a problem hiding this comment.
Either you write or reuse a helper here, or you need to call ValidateFull directly.
There was a problem hiding this comment.
And ideally we should also check scalar needles too.
| auto values_type = run_end_encoded(int32(), int32()); | ||
| ASSERT_OK_AND_ASSIGN(auto ree_values, | ||
| REEFromJSON(values_type, "[0, 0, 1, 1, 1, 4, 4, 9]")); | ||
| auto sliced = ree_values->Slice(2, 5); |
There was a problem hiding this comment.
Can we use a slice that does not happily falls on run boundaries? (for example Slice(1, 5))
| AssertArraysEqual(*ArrayFromJSON(uint64(), "[0, 0, 3, 3, 5]"), *result.make_array()); | ||
| } | ||
|
|
||
| TEST(SearchSorted, BinaryValues) { |
There was a problem hiding this comment.
This is already part of the smoke tests, right?
There was a problem hiding this comment.
It was a left over your are right, it's fixed now
| ValueType Value(int64_t index) const { | ||
| const auto physical_index = span_.PhysicalIndex(index); | ||
| return GetViewType<ArrowType>::LogicalValue(values_.GetView(physical_index)); | ||
| } |
There was a problem hiding this comment.
Hmm, so we're processing each run-end logical value individually, and then finding it in the array of physical values? This is wasteful, as it's O(n_needles * log n_values), while each physical value will produce the same output after all.
What run-end-encoded values should do instead is:
- run search_sorted on the run values, giving "physical results"
- turn the "physical results" into "logical results" by looking up the run start for each physical result
In other words, given:
values = run_end_encoded([10,10,10,30,30])(i.e. run values[10, 30]and run ends[3, 5])needles = [10, 20, 30, 40]
we would first compute the physical result [0, 1, 1, 2] of searching the needles in [10, 30], then turn it into the logical results [0, 3, 3, 5].
This would be O(n_needles * log n_value_runs) and therefore more efficient. It should also generate less template specializations, because you don't need to template this algorithm per (physical value type, run end type) pair.
| const auto& needle_data = needles.array(); | ||
| if (needle_data->type->id() == Type::RUN_END_ENCODED) { | ||
| RunEndEncodedArray ree(needle_data); | ||
| return DispatchRunEndEncodedByRunEndType<Status>( |
There was a problem hiding this comment.
Similarly to what I pointed above about run-end-encoded values, you don't need complex indexing for run-end-encoded needles. It's actually even simpler as you can just search each physical needle and run-end-decode the (needle run ends, physical results).
| const int64_t leading_null_count = CountLeadingNulls( | ||
| values.length, [&](int64_t index) { return values.IsNull(index); }); | ||
| const int64_t trailing_null_count = | ||
| leading_null_count > 0 ? 0 : CountTrailingNulls(values.length, [&](int64_t index) { | ||
| return values.IsNull(index); | ||
| }); | ||
|
|
There was a problem hiding this comment.
This is... complicated. We know there are null_count nulls, and they are either clustered at the start or at the end. We just have to look up the first validity bit (values.IsNull(0)) and then we know how many leading/trailing nulls there are.
And we can do the exact same thing for chunked arrays.
There was a problem hiding this comment.
You are right, I simplified and refactored it
Co-authored-by: Copilot <copilot@github.com>
…ckSearchSorted function Co-authored-by: Copilot <copilot@github.com>
…ns for clarity and reusability Co-authored-by: Copilot <copilot@github.com>
…tency, remove redundant tests Co-authored-by: Copilot <copilot@github.com>
…ases. Remove RunEndCType template from RunEndEncodedValuesAccessor
…Values with ChunkedNeedles
Co-authored-by: Copilot <copilot@github.com>
…sts for edge cases Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
pitrou
left a comment
There was a problem hiding this comment.
Thanks for the update! This is getting better, though I think there are still opportunities for making the implementation simpler.
| ArrayVector{empty_chunk, null_chunk, empty_chunk, last_null_chunk}); | ||
| auto needles = ArrayFromJSON(int32(), "[1, 4, null]"); | ||
|
|
||
| CheckSearchSorted(Datum(values), Datum(needles), "[3, 3, null]", "[3, 3, null]"); |
There was a problem hiding this comment.
This one raises an interesting question: what is the intended null placement of the sorted values? At end or at start? If nulls are at end, we should return 0 (before the nulls), if nulls are at start, we should return 3 (after the nulls).
The default null placement for sorting in Arrow C++ is NullPlacement::AtEnd so I think this should return 0, not 3.
(we can make it configurable later if there's some demand)
| const auto values_length = values.length(); | ||
| const auto needles_length = needles.length(); | ||
| state.counters["values_length"] = static_cast<double>(values_length); | ||
| state.counters["needles_length"] = static_cast<double>(needles_length); |
There was a problem hiding this comment.
Let's perhaps also add the null percentage of the values:
| state.counters["needles_length"] = static_cast<double>(needles_length); | |
| state.counters["values_null_percentage"] = values.null_count() * 100.0 / values_length; | |
| state.counters["needles_length"] = static_cast<double>(needles_length); |
| VISIT(UInt16Type) \ | ||
| VISIT(UInt32Type) \ | ||
| VISIT(UInt64Type) \ | ||
| VISIT(FloatType) \ |
There was a problem hiding this comment.
We can probably add HalfFloatType?
There was a problem hiding this comment.
Thanks, but can you also add a test for it?
| RunEndEncodedValuesAccessor<ArrowType> non_null_values(values_accessor.array(), | ||
| non_null_values_range.offset, | ||
| non_null_values_range.length); |
There was a problem hiding this comment.
It seems that you are assuming that nulls in a REE array can be anywhere? But actually the nulls are stored in the REE values (a REE array does not have a top-level null bitmap).
So you don't need all the complication that seems to be related to computing physical offsets depending on the logical null range of the REE array. Just use the physical null range of the REE values.
You should therefore be able to simplify a lot of stuff away, and perhaps even all the REE Accessor classes. You might have to keep the Accessor / ChunkedAccessor distinction, however.
| /// This preserves the cheapest materialization strategy: write repeated | ||
| /// insertion indices directly into the preallocated result buffer, while still | ||
| /// sharing the same EmitInsertionIndices traversal used by the nullable path. | ||
| class PreallocatedInsertionIndexOutput { |
There was a problem hiding this comment.
If you like to have this optimized insertion strategy, you might want to use a TypedBufferBuilder<uint64_t> for both insertion strategies and a TypedBufferBuilder<bool> for the null bitmap (perhaps with a bool template parameter to say whether nulls should be emitted or not).
Something like (rough sketch):
template <bool kEmitNulls>
class InsertionIndexBuilder {
public:
explicit InsertionIndexBuilder(MemoryPool* pool)
: indices_builder_(pool), bitmap_builder_pool_(pool) {}
Status Init(int64_t length) {
RETURN_NOT_OK(indices_builder_.Reserve(length));
if constexpr (kEmitNulls) {
RETURN_NOT_OK(bitmap_builder_.Reserve(length));
}
return Status::OK();
}
void AppendIndices(uint64_t value, int64_t run_length) {
indices_builder_.UnsafeAppend(run_length, value);
if constexpr (kEmitNulls) {
bitmap_builder_.UnsafeAppend(run_length, /*value=*/true);
}
}
void AppendNulls(int64_t run_length) {
if constexpr (kEmitNulls) {
bitmap_builder_.UnsafeAppend(run_length, /*value=*/false);
}
}
private:
TypedBufferBuilder<uint64_t> indices_builder_;
TypedBufferBuilder<uint64_t> bitmap_builder_;
};| } | ||
|
|
||
| TEST(SearchSorted, FloatValuesWithLeadingNullsAndTrailingNaNs) { | ||
| CheckSimpleSearchSortedAndScalar(float64(), "[null, 1.0, 3.0, 3.0, 5.0, NaN, NaN]", |
There was a problem hiding this comment.
This is not a valid sorted array (nulls and NaNs should be at the same end, so we can remove this test).
| "[0, 0, 1, 3]", "[0, 1, 1, 3]"); | ||
| } | ||
|
|
||
| TEST(SearchSorted, FloatValuesWithTrailingNaNsAndNulls) { |
There was a problem hiding this comment.
Can we also have a leading NaNs and nulls test?
| "[0.0, 3.0, 4.0, NaN]", "[0, 1, 3, 4]", | ||
| "[0, 3, 3, 6]"); |
There was a problem hiding this comment.
Can you also add a null in the needles?
…or improved null bitmap management and streamline test cases for consistency in expected results. Co-authored-by: Copilot <copilot@github.com>
…d encoded values, ensuring null runs outside the slice are ignored. Co-authored-by: Copilot <copilot@github.com>
|
I took a look at the compiled code size here and we're really generating too much code: 48M for the search_sorted kernels alone is almost 20% of the total size of compute kernels (256M for libarrow_compute): This is also confirmed by code sizes (the |
|
To alleviate this combinatorial explosion, it seems several axes of code generation can be decoupled:
Also, one important improvement is to use |
pitrou
left a comment
There was a problem hiding this comment.
Some more comments below. Most important though is the combinatorial explosion I mentioned in the comments.
| VISIT(UInt16Type) \ | ||
| VISIT(UInt32Type) \ | ||
| VISIT(UInt64Type) \ | ||
| VISIT(FloatType) \ |
There was a problem hiding this comment.
Thanks, but can you also add a test for it?
|
|
||
| private: | ||
| int64_t LogicalRunEnd(int64_t physical_index) const { | ||
| const int64_t logical_run_end = std::max<int64_t>( |
There was a problem hiding this comment.
The std::max is needed because the logical offset can fall in the middle of a physical run, right? Can you add a comment about that?
| GetRunEndValue(::arrow::ree_util::RunEndsArray(array_span_), physical_index) - | ||
| array_.offset(), | ||
| 0); | ||
| return std::min(logical_run_end, array_.length()); |
There was a problem hiding this comment.
And for the same reason we need std::min here, right?
| SearchSortedOptions::Side side, uint64_t insertion_offset, | ||
| Output* output) { | ||
| auto emit_search_result = [&](const VisitedNeedle<ArrowType, EmitNulls>& needle, | ||
| int64_t run_length) -> Status { |
There was a problem hiding this comment.
run_length is always 1 in all call sites, right?
| template <typename ArrowType, typename ValuesAccessor> | ||
| Result<Datum> ComputeRunEndEncodedNeedleInsertionIndices( | ||
| const ValuesAccessor& sorted_values, const RunEndEncodedArray& needles, | ||
| SearchSortedOptions::Side side, uint64_t insertion_offset, ExecContext* ctx) { |
There was a problem hiding this comment.
I don't think this needs to be templated on ArrowType or ValuesAccessor. Running search_sorted on the needles should be an opaque operation. This function is mostly interested in:
- unpacking the physical needles from the REE needles
- running search_sorted on the physical needles
- re-packing the search_sorted results with the original run-ends
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new search_sorted compute function across C++ and Python, including options, docs, tests, and benchmarks.
Changes:
- Introduces C++
SearchSortedAPI + kernel registration and implementation, plus benchmarks. - Adds Python bindings for
SearchSortedOptionsand Python-level tests/docstrings forpc.search_sorted. - Updates C++ user docs to list
search_sortedand document its constraints/behavior.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| python/pyarrow/tests/test_compute.py | Adds Python tests validating pc.search_sorted behavior (left/right, nulls, REE, errors). |
| python/pyarrow/includes/libarrow.pxd | Exposes SearchSortedOptions and side enum to Cython. |
| python/pyarrow/compute.py | Exports SearchSortedOptions in the public Python compute namespace. |
| python/pyarrow/_compute_docstrings.py | Adds Python docstring examples for search_sorted. |
| python/pyarrow/_compute.pyx | Adds SearchSortedOptions Python option class + side parsing. |
| docs/source/cpp/compute.rst | Documents search_sorted in the C++ compute function table and notes constraints. |
| cpp/src/arrow/meson.build | Adds the new kernel source to Meson build. |
| cpp/src/arrow/compute/registry_internal.h | Declares RegisterVectorSearchSorted. |
| cpp/src/arrow/compute/kernels/vector_search_sorted_test.cc | Adds comprehensive C++ kernel tests. |
| cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc | Adds microbenchmarks for SearchSorted. |
| cpp/src/arrow/compute/kernels/vector_search_sorted.cc | Implements the search_sorted meta-function/kernel (arrays, chunked, REE, null handling). |
| cpp/src/arrow/compute/kernels/meson.build | Adds the new benchmark target to Meson. |
| cpp/src/arrow/compute/kernels/CMakeLists.txt | Adds the new C++ test and benchmark targets. |
| cpp/src/arrow/compute/initialize.cc | Registers the kernel at startup. |
| cpp/src/arrow/compute/api_vector.h | Adds SearchSortedOptions and the public SearchSorted API declaration/docs. |
| cpp/src/arrow/compute/api_vector.cc | Registers options type and wires SearchSorted to CallFunction. |
| cpp/src/arrow/CMakeLists.txt | Adds the new kernel source to CMake build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// The `values` datum must be a plain array or run-end encoded array sorted in | ||
| /// ascending order. `needles` may be a scalar, plain array, or run-end encoded | ||
| /// array whose logical value type matches `values`. |
| #include <memory> | ||
| #include <numeric> | ||
| #include <optional> | ||
| #include <ranges> |
There was a problem hiding this comment.
Apache Arrow targets C++20 now
| const auto it = std::ranges::find_if( | ||
| values.chunks(), | ||
| [](const std::shared_ptr<Array>& chunk) { return chunk->length() != 0; }); |
There was a problem hiding this comment.
Apache Arrow targets C++20 now
| auto chunk_null_counts = values.chunks() | std::views::transform([](const auto& chunk) { | ||
| return GetLogicalNullCount(*chunk->data()); | ||
| }); | ||
| return std::reduce(chunk_null_counts.begin(), chunk_null_counts.end(), int64_t{0}); |
There was a problem hiding this comment.
Apache Arrow targets C++20 now
| std::static_pointer_cast<Int64Array>(rand.Int64(length, 0, max_value, 0.0)); | ||
| std::vector<int64_t> data(values->raw_values(), | ||
| values->raw_values() + values->length()); | ||
| std::ranges::sort(data); |
There was a problem hiding this comment.
Apache Arrow targets C++20 now
| for (int64_t index = 0; index < values->length(); ++index) { | ||
| data.push_back(values->GetString(index)); | ||
| } | ||
| std::ranges::sort(data); |
There was a problem hiding this comment.
Apache Arrow targets C++20 now
Rationale for this change
Add the implemenation of the search sorted compute kernel based on the numpy function: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
What changes are included in this PR?
Implementation of the C++ kernel + Python API.
Tests in C++ and Python
Are these changes tested?
Yes
Are there any user-facing changes?
No breaking change
search_sortedkernel for all primitive types and run-end encoded arrays #49677