Fix duplicate filtering path in Arrow task batches#3448
Conversation
Replace the two weak tests that passed on the unfixed code with proper regression coverage: 1. test_task_to_record_batches_does_not_use_table_filter_without_positional_deletes Replaces pyarrow.Table with a sentinel class whose from_batches raises AssertionError. A custom metaclass preserves isinstance checks so the rest of the code-path is unaffected. With the fix the sentinel is never reached; without the fix it fires immediately, detecting the old pa.Table.from_batches / filter / combine_chunks / to_batches[0] path that caused the SIGSEGV on Apple Silicon. 2. test_task_to_record_batches_filter_applied_after_positional_deletes Uses data [1,2,3,4,5] with positional deletes on positions 1 and 3 (removing values 2 and 4), then applies filter id > 2. Expected result [3, 5] would be wrong if either deletes or the subsequent filter were skipped.
Fokko
left a comment
There was a problem hiding this comment.
This is a great catch @GayathriSrividya This logic must have been messed up somewhere throughout refactoring.
I left one comment to avoid having two blocks for positional_deletes, let me know what you think. This will reduce the number of lines quite a bit.
- Combine the two positional-delete handling blocks into one: apply
take(indices) and the row filter together inside the single
'if positional_deletes' block, removing the redundant mid-loop
empty-batch check (filtering an empty batch is free).
- Replace the sentinel-based regression test (which passed even when
the fix was reverted) with a behavioral test that picks a scenario
where the old bug produces a distinct wrong answer:
data=[1,2,3,4], pos_delete=2 (value 3), filter=id>2
correct: [4] old bug (scanner pre-filters): [3,4]
The assertion on [4] will fail against any regression.
Addresses review feedback from @ebyhr and @Fokko.
Fokko
left a comment
There was a problem hiding this comment.
Amazing, thanks @GayathriSrividya 🙌
PyArrow < 21 raises IndexError when RecordBatch.filter(Expression) produces zero rows. Wrap the call in try/except and fall back to an empty slice, which is already handled by the num_rows == 0 guard below. Add regression test covering the positional-delete path where the post-delete filter eliminates all remaining rows. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
…ndexError RecordBatch.filter(Expression) raises IndexError on PyArrow 17-20 when the result has zero rows (fixed upstream in apache/arrow#46057). Replace the try/except approach with the Table-based workaround that avoids the bug, consistent with the pattern already used for the non-positional- deletes path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM, the refactor makes sense.
pyarrow_filter is already applied for the none positional delete path:
iceberg-python/pyiceberg/io/pyarrow.py
Lines 1663 to 1697 in fb96df1
Looks like it was accidentally added to the non positional delete path in https://github.com/apache/iceberg-python/pull/1621/changes#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1401-R1423
|
Thanks for the PR @GayathriSrividya and thank you everyone for the reviews! |
Closes #3272
What this changes
This PR updates the Arrow scan path in
_task_to_record_batchesto avoid redundant filtering when there are no positional deletes.Scanner.from_fragmentas the only filter path whenpositional_deletesis absent.current_batch.filter(pyarrow_filter)only in the positional-delete path, after deletes are applied.Why
The previous flow could perform an extra table-level refilter even when the scanner already applied the predicate. This change removes that stale workaround path while keeping correct behavior for positional delete scenarios.
Tests
Added regression coverage in
tests/io/test_pyarrow.py:test_task_to_record_batches_filter_without_positional_deletes_avoids_table_refiltertest_task_to_record_batches_filter_with_positional_deletes_handles_empty_batchValidated locally:
python -m pytest tests/io/test_pyarrow.py -q -k "task_to_record_batches_nanos or filter_without_positional_deletes_avoids_table_refilter or filter_with_positional_deletes_handles_empty_batch"make lint