[spark] Lazy partition pruning for engine format table by Zouxxyy · Pull Request #8300 · apache/paimon

Zouxxyy · 2026-06-20T08:54:15Z

Purpose

Replace the eager InMemoryFileIndex, which recursively lists all files at construction time, with LazyPartitionPruningFileIndex, which defers file listing until listFiles() is called and prunes partition directories level by level using the partition filters.

For a table with 20×30=600 partitions, querying a single partition (p1=1 AND p2=1) now discovers 1 file instead of 600. Range queries (p1>15) and non-leading column filters (p2=1) also benefit from per-level pruning.
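The per-level pruning described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `Dir`, `Filters`, and `listPruned` are hypothetical names invented for the illustration, the directory tree lives in memory, and each leaf partition is assumed to hold exactly one file. It is not the PR's LazyPartitionPruningFileIndex.

```scala
// Hypothetical in-memory model of a partitioned directory tree.
// None of these names come from Paimon or Spark.
object LazyPruningSketch {
  // A directory named "col=value"; only leaf directories hold data files.
  final case class Dir(col: String, value: Int, children: Seq[Dir], files: Seq[String])

  // One predicate per partition column; a column without a filter keeps every directory.
  type Filters = Map[String, Int => Boolean]

  // Walks the tree one level at a time.
  // Returns (matching files, number of directories that were descended into).
  def listPruned(dirs: Seq[Dir], filters: Filters): (Seq[String], Int) =
    dirs.foldLeft((Seq.empty[String], 0)) { case ((files, listed), dir) =>
      val keep = filters.get(dir.col).forall(p => p(dir.value))
      if (!keep) {
        (files, listed) // pruned: the subtree below this directory is never listed
      } else {
        val (childFiles, childListed) = listPruned(dir.children, filters)
        (files ++ dir.files ++ childFiles, listed + 1 + childListed)
      }
    }

  // 20 x 30 = 600 leaf partitions, one file each (an assumption of this sketch).
  val table: Seq[Dir] = (1 to 20).map { p1 =>
    Dir("p1", p1, (1 to 30).map { p2 =>
      Dir("p2", p2, Nil, Seq(s"p1=$p1/p2=$p2/part-0.parquet"))
    }, Nil)
  }
}
```

In this model, filtering on p1 = 1 AND p2 = 1 returns a single file after descending into two directories, while an empty filter map returns all 600 files. A filter on p2 alone still has to descend into every p1 directory, but it skips 29 of the 30 children under each one.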

Controlled by spark.paimon.format-table.engine.lazy-partition-pruning (default true). Set to false to fall back to eager listing.
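As a usage sketch, the option could be set when building the session. The key is the one named above. The local master, the app name, and the assumption that the option is read from the session configuration when the table is loaded are illustrative and not confirmed by this description.

```scala
import org.apache.spark.sql.SparkSession

object LazyPruningConfigExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")                        // illustrative local session
      .appName("lazy-partition-pruning-config")  // hypothetical app name
      // Key taken from the PR description; "false" selects the eager listing fallback.
      .config("spark.paimon.format-table.engine.lazy-partition-pruning", "false")
      .getOrCreate()

    // Read the value back from the session configuration.
    println(spark.conf.get("spark.paimon.format-table.engine.lazy-partition-pruning"))
    spark.stop()
  }
}
```

Whether changing the value mid-session affects tables that are already loaded is not stated here, so setting it before the first query is the safer assumption.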

Tests

Engine static/dynamic partition overwrite
Partition pruning on multi-level partitions with config comparison (lazy vs eager)
Partition pruning with all basic types (11 types)
Null partition values and multi-column predicate fallback
Data visibility after insert to same/new partitions

JingsongLi · 2026-06-20T13:20:59Z

+      leafDirToChildrenFiles
+  }
+
+  override def refresh(): Unit = fileStatusCache.invalidateAll()


This only invalidates the shared FileStatusCache. Once fullIndex has been initialized, it still keeps its own cached leaf files, leaf-dir map, and partition spec, while Spark's InMemoryFileIndex.refresh() also calls refresh0() to rebuild those fields. I reproduced this by listing an index with pt=1, creating pt=2, calling refresh(), and then listFiles(Nil, Nil) still returned only pt=1. Please refresh/recreate fullIndex here so REFRESH TABLE and write refresh paths do not leave unfiltered scans/allFiles/partitionSpec stale.
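The scenario in this comment can be modelled without Spark. The sketch below is an assumption-laden illustration: `FakeFs`, `Index`, and the `resetFullIndexOnRefresh` flag are hypothetical names, and the real fullIndex holds more state (leaf files, leaf-dir map, partition spec) than the single list used here.

```scala
import scala.collection.mutable

// Hypothetical model of the stale-refresh issue; not the PR's code.
object RefreshSketch {
  // Stands in for the file system: the partition directories that currently exist.
  final class FakeFs { val partitions: mutable.Set[String] = mutable.Set.empty }

  final class Index(fs: FakeFs, resetFullIndexOnRefresh: Boolean) {
    // Stands in for the shared FileStatusCache.
    private val statusCache = mutable.Map.empty[String, List[String]]
    // Stands in for fullIndex: the unfiltered listing, populated on first use.
    private var fullIndex: Option[List[String]] = None

    def listAll(): List[String] = {
      if (fullIndex.isEmpty) {
        fullIndex = Some(statusCache.getOrElseUpdate("root", fs.partitions.toList.sorted))
      }
      fullIndex.get
    }

    def refresh(): Unit = {
      statusCache.clear()                           // what the reviewed refresh() does
      if (resetFullIndexOnRefresh) fullIndex = None // the extra step requested in this comment
    }
  }
}
```

With the flag off, an index that listed pt=1 keeps returning only pt=1 after pt=2 is created and refresh() is called. With the flag on, the next listAll() rebuilds the listing and returns both partitions.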

Zouxxyy · 2026-06-20

Thanks for the review. I investigated the refresh() path:

REFRESH TABLE in Spark V2 goes through RefreshTableExec → catalog.invalidateTable(ident), which causes the next query to call loadTable() and recreate the entire table instance (including a fresh LazyPartitionPruningFileIndex). FileIndex.refresh() is not called in this path. This is the same pattern as CatalogFileIndex, which also uses an immutable val sizeInBytes and whose refresh() only clears fileStatusCache.

FileIndex.refresh() is only called from the V1 write path (InsertIntoHadoopFsRelationCommand) and CacheManager.recacheByPath. Engine format tables use the V2 write path, so refresh() is not triggered during normal read/write operations.
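The invalidate-then-reload argument can be sketched as a toy catalog. Everything here is a hypothetical stand-in (`Catalog`, `Table`, `Index`, and the `listPartitions` callback are not Spark's classes), and it assumes the catalog caches table instances per identifier, which depends on the catalog implementation.

```scala
import scala.collection.mutable

// Hypothetical model of "invalidate, then rebuild on next load"; not Spark's API.
object InvalidateSketch {
  // Immutable once built, like a listing captured at construction time.
  final class Index(val snapshot: List[String])

  final class Table(val index: Index)

  final class Catalog(listPartitions: () => List[String]) {
    private val cache = mutable.Map.empty[String, Table]

    // Builds a new table, and with it a new index, only when nothing is cached.
    def loadTable(ident: String): Table =
      cache.getOrElseUpdate(ident, new Table(new Index(listPartitions())))

    // Drops the cached instance so the next loadTable starts from scratch.
    def invalidateTable(ident: String): Unit = {
      cache.remove(ident)
      ()
    }
  }
}
```

In this model the index never needs its own refresh(): a stale snapshot survives only as long as the cached table does, and invalidateTable removes both together.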

Added a config spark.paimon.format-table.engine.lazy-partition-pruning (default true). Set it to false to fall back to eager listing, which may be better for small tables queried repeatedly without partition filters: eager listing caches all files at construction and avoids per-query directory traversal overhead.

Replace the eager InMemoryFileIndex (which recursively lists all files at construction time) with LazyPartitionPruningFileIndex that defers file listing until listFiles() is called and prunes partition directories level-by-level using partition filters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Zouxxyy force-pushed the xinyu/optimize-format-rw branch from d294391 to c97bb41 on June 20, 2026 09:38

JingsongLi reviewed Jun 20, 2026


Zouxxyy force-pushed the xinyu/optimize-format-rw branch from c97bb41 to 495f0c7 on June 20, 2026 17:53

Zouxxyy wants to merge 1 commit into apache:master from Zouxxyy:xinyu/optimize-format-rw
