feat: add dictionary_columns to scan API for memory-efficient string reads#3234
feat: add dictionary_columns to scan API for memory-efficient string reads#3234tanmayrauth wants to merge 1 commit into
Conversation
946d70a to
52b2070
Compare
52b2070 to
9fc3b0c
Compare
|
@kevinjqliu @Fokko can you please review and approve this? |
|
@geruh @kevinjqliu @Fokko can you please review this implementation? |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Closes #3170 ## Rationale Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its `dictionary_columns` kwarg, which deduplicates values and can dramatically reduce peak memory usage. This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale. ## Changes - Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`, `TableScan.__init__`, and `StagedTable.scan()`. - Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` → `_get_file_format()`. - Only applied when `task.file.file_format == FileFormat.PARQUET`; silently ignored for ORC (which does not support this kwarg). ## Usage ```python # Read the "payload" column as dictionary-encoded to save memory df = table.scan(dictionary_columns=("payload",)).to_arrow() ``` ## Verification - Added `test_dictionary_columns_produces_dict_encoded_output` — confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical. - `make lint` ✓ - `pytest tests/table/ tests/io/test_pyarrow.py` ✓ --------- Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Exposes
dictionary_columns: tuple[str, ...] | None = NoneonTable.scan()andDataScan, threading it through to PyArrow'sParquetFileFormatso that named columns are read asDictionaryArrayinstead of plainlarge_utf8. This dramatically reduces memory usage for high-cardinality repeated JSON/string columns (issue #3168) and addresses the general scan parameter extensibility request (issue #3170).Key implementation details:
dictionary_columnsis only passed for ParquetArrowScan.to_table()rebuilds the Arrow schema with dict types before the empty-table fast-path so schema is consistent regardless of row countDataScan.to_arrow_batch_reader()rebuildstarget_schemawith dict types to prevent.cast()from silently decoding DictionaryArray back to plain stringDataScan.__init__declares and stores the param soTableScan.update()(which usesinspect.signature) preserves it across scan copiesFixes #3168, closes #3170
Rationale for this change
Are these changes tested? Yes
Are there any user-facing changes? No