Skip to content

feat: add dictionary_columns to scan API for memory-efficient string reads#3234

Closed
tanmayrauth wants to merge 1 commit into
apache:mainfrom
tanmayrauth:feat/dictionary-columns-scan
Closed

feat: add dictionary_columns to scan API for memory-efficient string reads#3234
tanmayrauth wants to merge 1 commit into
apache:mainfrom
tanmayrauth:feat/dictionary-columns-scan

Conversation

@tanmayrauth

Copy link
Copy Markdown

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan, threading it through to PyArrow's ParquetFileFormat so that named columns are read as DictionaryArray instead of plain large_utf8. This dramatically reduces memory usage for high-cardinality repeated JSON/string columns (issue #3168) and addresses the general scan parameter extensibility request (issue #3170).

Key implementation details:

  • ORC files are guarded — dictionary_columns is only passed for Parquet
  • ArrowScan.to_table() rebuilds the Arrow schema with dict types before the empty-table fast-path so schema is consistent regardless of row count
  • DataScan.to_arrow_batch_reader() rebuilds target_schema with dict types to prevent .cast() from silently decoding DictionaryArray back to plain string
  • DataScan.__init__ declares and stores the param so TableScan.update() (which uses inspect.signature) preserves it across scan copies

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? No

@tanmayrauth tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c Compare April 13, 2026 21:48
@tanmayrauth

Copy link
Copy Markdown
Author

@kevinjqliu @Fokko can you please review and approve this?

@tanmayrauth

tanmayrauth commented Apr 16, 2026

Copy link
Copy Markdown
Author

@geruh @kevinjqliu @Fokko can you please review this implementation?

@github-actions

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 17, 2026
@github-actions

Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this May 24, 2026
Fokko pushed a commit that referenced this pull request Jun 18, 2026
Closes #3170

## Rationale

Columns that contain large or frequently repeated string values (e.g.
JSON blobs, low-cardinality categoricals) can exhaust memory when
PyArrow loads them as plain string arrays. PyArrow's Parquet reader
natively supports dictionary-encoded reads via its `dictionary_columns`
kwarg, which deduplicates values and can dramatically reduce peak memory
usage.

This was previously discussed in #3168 and a prior implementation
(#3234) was closed as stale.

## Changes

- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`,
`TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()`
→ `ArrowScan.__init__` → `_task_to_record_batches` →
`_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`;
silently ignored for ORC (which does not support this kwarg).

## Usage

```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```

## Verification

- Added `test_dictionary_columns_produces_dict_encoded_output` —
confirms the requested column is dict-encoded, non-requested columns are
plain, and values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓

---------

Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

1 participant