[vector] Support raw fallback for vector search #8302

Open
JingsongLi wants to merge 1 commit into apache:master from JingsongLi:codex/vector-search-raw-fallback

@JingsongLi JingsongLi commented Jun 20, 2026

Contributor

Summary

Support vector search over unindexed raw rows when global-index.search-mode is configured as full or detail, while keeping the default fast mode index-only. This keeps vector search results fresh when new data has been written but the vector global index has not yet been rebuilt.

Changes

  • Add a VectorGlobalIndexer interface so vector index implementations can expose their metric for raw score computation.
  • Extend Java vector and batch vector reads to merge indexed results with raw rows outside indexed row-id ranges in full / detail modes.
  • Preserve Spark vector search behavior by using Spark execution for fast index-only mode and local raw fallback when raw rows need scanning.
  • Add Java tests covering default fast behavior, full-mode raw fallback, filtered raw fallback, and batch vector search.
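
For readers skimming the description, the merge step in full / detail modes can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the code in this PR: the class `RawFallbackMergeSketch`, its nested `Scored` and `Range` types, and the method `mergeTopK` are hypothetical names, and it assumes a higher score means a closer match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Paimon.
final class RawFallbackMergeSketch {

    /** A scored candidate; higher score means closer (assumption for this sketch). */
    static final class Scored {
        final long rowId;
        final double score;

        Scored(long rowId, double score) {
            this.rowId = rowId;
            this.score = score;
        }
    }

    /** Inclusive row-id range that the vector index already covers. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }

        boolean contains(long rowId) {
            return rowId >= from && rowId <= to;
        }
    }

    /** Keeps indexed hits, adds raw hits outside every indexed range, returns the top k. */
    static List<Scored> mergeTopK(
            List<Scored> indexed, List<Scored> raw, List<Range> indexedRanges, int k) {
        List<Scored> all = new ArrayList<>(indexed);
        for (Scored candidate : raw) {
            boolean covered = false;
            for (Range range : indexedRanges) {
                if (range.contains(candidate.rowId)) {
                    covered = true;
                    break;
                }
            }
            // Raw rows inside an indexed range are skipped to avoid duplicates.
            if (!covered) {
                all.add(candidate);
            }
        }
        // Sort by descending score, then cut to k.
        all.sort((a, b) -> Double.compare(b.score, a.score));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

With an indexed range of [0, 9], a raw row with id 10 competes with the indexed hits on score, while a raw row with id 2 is ignored because the index already answers for it.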

Testing

  • mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorSearchBuilderTest test
  • mvn -pl paimon-common,paimon-core,paimon-lumina,paimon-vector,paimon-spark/paimon-spark-common -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
  • mvn -pl paimon-common,paimon-core,paimon-lumina,paimon-vector,paimon-spark/paimon-spark-common -DskipTests spotless:check
  • git diff --check origin/master..HEAD

Notes

  • Python support is intentionally left out of this PR.
  • mvnflink is not available in this local environment (mvnflink not found), so verification used Maven directly.

@JingsongLi JingsongLi force-pushed the codex/vector-search-raw-fallback branch 7 times, most recently from 0221d5c to c3f0b4f Compare June 20, 2026 13:37

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the PR. I found two correctness issues in the raw-fallback path that should be fixed before merging:

  1. Scalar pre-filter can drop valid rows when the scalar index only partially overlaps a vector-indexed range. VectorScanImpl attaches any scalar index whose row range intersects the vector range, and AbstractVectorRead.preFilter then turns the scalar-index result into a global includeRowIds bitmap. Rows inside the vector-indexed range but outside the scalar-index coverage are therefore excluded from vector search even though they still need to be evaluated by the residual table filter. I reproduced this with a vector index covering [0, 9], a btree index on id covering only [3, 7], and filter id >= 8; the query should return row 8, but returns an empty result. The test command was:
mvn -pl paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip=true -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=VectorSearchBuilderTest#testPartialScalarPreFilterMustNotDropUnindexedScalarRows test
  2. Raw-only fallback uses the wrong metric when there is no vector index split. In FULL/DETAIL mode, a table can have raw ranges but no vector index files yet. In that case globalIndexer is null and rawSearchMetric falls back to l2, ignoring the configured vector-index metric. For a cosine table with vectors [100, 0] and [0.9, 0.1], querying [1, 0] should return row 0, but raw-only search returns row 1 because it ranks by L2. I reproduced this with:
mvn -pl paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip=true -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=VectorSearchBuilderTest#testFullModeRawOnlyUsesConfiguredMetric test
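
To make the metric mismatch concrete, here is a small standalone sketch of the arithmetic, using the same vectors as the repro. It does not touch Paimon code; `MetricRankingSketch` and its methods are hypothetical names chosen only for this illustration.

```java
// Hypothetical illustration only; shows why L2 and cosine disagree on this data.
final class MetricRankingSketch {

    static double l2Distance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the row with the smallest L2 distance to the query. */
    static int nearestByL2(float[][] rows, float[] query) {
        int best = 0;
        for (int i = 1; i < rows.length; i++) {
            if (l2Distance(rows[i], query) < l2Distance(rows[best], query)) {
                best = i;
            }
        }
        return best;
    }

    /** Index of the row with the largest cosine similarity to the query. */
    static int nearestByCosine(float[][] rows, float[] query) {
        int best = 0;
        for (int i = 1; i < rows.length; i++) {
            if (cosineSimilarity(rows[i], query) > cosineSimilarity(rows[best], query)) {
                best = i;
            }
        }
        return best;
    }
}
```

For rows `{100, 0}` and `{0.9, 0.1}` with query `{1, 0}`, the L2 distances are 99 and roughly 0.14, so `nearestByL2` picks row 1. The cosine similarities are 1.0 and roughly 0.994, so `nearestByCosine` picks row 0, which is the result a cosine table should return.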

I think the first issue needs a conservative scalar pre-filter: either only use scalar indexes when their coverage is complete for the vector split, or add the scalar-index-uncovered portions of the vector split back to the candidate bitmap so the residual filter can still be applied. For the second issue, raw search needs to derive the metric from the configured vector index type/options even when no index file exists, rather than defaulting to L2.
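
As a rough sketch of the second option for the pre-filter (adding the uncovered rows back), something along these lines would keep row 8 in play for the repro above. This is only an illustration under assumptions: it uses `java.util.BitSet` as a stand-in for whatever bitmap type the real code uses, treats the row id as equal to the `id` column for simplicity, and `ConservativePreFilterSketch`, `candidates`, and `applyResidual` are hypothetical names.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical illustration only; BitSet stands in for the real row-id bitmap.
final class ConservativePreFilterSketch {

    /**
     * Candidates = scalar-index hits inside the vector range, plus every row of the
     * vector range that the scalar index does not cover. All bounds are inclusive.
     */
    static BitSet candidates(
            int vecFrom, int vecTo, int scalarFrom, int scalarTo, BitSet scalarHits) {
        BitSet vectorRange = new BitSet();
        vectorRange.set(vecFrom, vecTo + 1);

        // Rows the scalar index cannot answer for must stay in the candidate set.
        BitSet uncovered = (BitSet) vectorRange.clone();
        if (scalarFrom <= scalarTo) {
            uncovered.clear(scalarFrom, scalarTo + 1);
        }

        BitSet result = (BitSet) scalarHits.clone();
        result.and(vectorRange);
        result.or(uncovered);
        return result;
    }

    /** Applies the residual table filter to the surviving candidates. */
    static BitSet applyResidual(BitSet candidates, IntPredicate residual) {
        BitSet kept = new BitSet();
        for (int row = candidates.nextSetBit(0); row >= 0; row = candidates.nextSetBit(row + 1)) {
            if (residual.test(row)) {
                kept.set(row);
            }
        }
        return kept;
    }
}
```

With a vector range of [0, 9], scalar coverage of [3, 7], and no scalar hits for `id >= 8`, `candidates` yields rows 0, 1, 2, 8, and 9. The residual filter then narrows that to rows 8 and 9, so row 8 is still evaluated by vector search instead of being dropped by the bitmap.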

@JingsongLi JingsongLi force-pushed the codex/vector-search-raw-fallback branch from c3f0b4f to b3a9cc1 Compare June 20, 2026 15:36
@JingsongLi JingsongLi force-pushed the codex/vector-search-raw-fallback branch from b3a9cc1 to 7a5af13 Compare June 20, 2026 15:43