feat: add dictionary_columns to scan API for memory-efficient string reads by tanmayrauth · Pull Request #3234 · apache/iceberg-python

tanmayrauth · 2026-04-13T17:44:39Z

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan, threading it through to PyArrow's ParquetFileFormat so that the named columns are read as DictionaryArray instead of plain large_utf8. This can substantially reduce memory usage for JSON/string columns whose values repeat often (issue #3168), and it addresses the general scan-parameter extensibility request (issue #3170).

Key implementation details:

- ORC files are guarded: dictionary_columns is only passed for Parquet.
- ArrowScan.to_table() rebuilds the Arrow schema with dict types before the empty-table fast-path, so the schema is consistent regardless of row count.
- DataScan.to_arrow_batch_reader() rebuilds target_schema with dict types to prevent .cast() from silently decoding DictionaryArray back to plain string.
- DataScan.__init__ declares and stores the param so that TableScan.update() (which uses inspect.signature) preserves it across scan copies.
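The last point can be sketched with a simplified stand-in. MiniScan below is a hypothetical class, not PyIceberg's TableScan; it only assumes that update() rebuilds the object from the constructor's parameter names, as the description above says. Under that assumption, a parameter survives a copy only if it is both declared in __init__ and stored as an attribute of the same name.

```python
from inspect import signature
from typing import Any, Optional, Tuple


class MiniScan:
    """Hypothetical stand-in for a scan object; not the real PyIceberg class."""

    def __init__(
        self,
        selected_fields: Tuple[str, ...] = ("*",),
        limit: Optional[int] = None,
        dictionary_columns: Optional[Tuple[str, ...]] = None,
    ) -> None:
        # Each constructor parameter is stored under the same attribute name,
        # which is what update() relies on.
        self.selected_fields = selected_fields
        self.limit = limit
        self.dictionary_columns = dictionary_columns

    def update(self, **overrides: Any) -> "MiniScan":
        # Collect the constructor parameter names, skipping "self".
        params = signature(type(self).__init__).parameters.keys() - {"self"}
        # Read the current value of each parameter from the instance.
        kwargs = {name: getattr(self, name) for name in params}
        # Build a copy, letting explicit overrides win.
        return type(self)(**{**kwargs, **overrides})


scan = MiniScan(dictionary_columns=("payload",))
limited = scan.update(limit=10)
print(limited.dictionary_columns, limited.limit)
```

If dictionary_columns were accepted but not stored on the instance, getattr would raise AttributeError; if it were stored but not declared in __init__, the copy would silently fall back to the default.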

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? No

…reads

tanmayrauth · 2026-04-16T04:53:03Z

@kevinjqliu @Fokko can you please review and approve this?

tanmayrauth · 2026-04-16T18:55:35Z

@geruh @kevinjqliu @Fokko can you please review this implementation?

github-actions · 2026-05-17T00:46:43Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2026-05-24T00:48:56Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Closes #3170

## Rationale

Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its `dictionary_columns` kwarg, which deduplicates values and can dramatically reduce peak memory usage. This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

## Changes

- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`, `TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` → `_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`; silently ignored for ORC (which does not support this kwarg).

## Usage

```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```

## Verification

- Added `test_dictionary_columns_produces_dict_encoded_output`: confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓

---------

Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

tanmayrauth mentioned this pull request Apr 13, 2026

feature request: pass optional parameters to DataScan/pyarrow #3170

Closed

tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 946d70a to 52b2070 Compare April 13, 2026 18:45

feat: add dictionary_columns to scan API for memory-efficient string …

9fc3b0c

…reads

tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c Compare April 13, 2026 21:48

github-actions Bot added the stale label May 17, 2026

github-actions Bot closed this May 24, 2026

GayathriSrividya mentioned this pull request Jun 5, 2026

feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads #3461

Merged

tanmayrauth wants to merge 1 commit into apache:main from tanmayrauth:feat/dictionary-columns-scan
