feat: support CLUSTER BY AUTO and CLUSTER BY NONE for Databricks liquid clustering#5846

Open
EhabEasee wants to merge 1 commit into
SQLMesh:mainfrom
EhabEasee:feat/clustered-by-auto-none

Databricks supports two keyword forms of liquid clustering that don't take column arguments:

  • CLUSTER BY AUTO — lets Databricks automatically select clustering columns
  • CLUSTER BY NONE — disables liquid clustering on a table

Previously, SQLMesh had no way to express these in a model definition. This PR adds support for both.

Changes

constants.py: Adds LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"}) as a shared constant used across the parser, validator, and adapter.

Parsing (dialect.py): The clustered_by property parser now recognises bare AUTO and NONE tokens (unquoted VAR tokens) as liquid clustering keywords rather than column references. Backtick-quoted `auto` / `none` are still treated as regular column names, preserving backwards compatibility for columns that happen to share those names.
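
The bare-versus-quoted distinction can be illustrated with a small stand-alone sketch. This is not the code in dialect.py: `Keyword`, `ColumnRef`, and `classify_clustered_by` are hypothetical names, the input is a raw string rather than a token stream, and treating the keywords case-insensitively is an assumption.

```python
from dataclasses import dataclass
from typing import Union

# Mirrors the constant described for constants.py.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for an exp.Var keyword sentinel."""
    name: str


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical stand-in for a regular column reference."""
    name: str


def classify_clustered_by(raw: str) -> Union[Keyword, ColumnRef]:
    """Classify one clustered_by value as a keyword or a column."""
    text = raw.strip()
    # Backtick-quoted identifiers are always columns, even `auto` / `none`.
    if len(text) >= 2 and text.startswith("`") and text.endswith("`"):
        return ColumnRef(text[1:-1])
    # Bare AUTO / NONE become keywords (case-insensitivity is an assumption).
    if text.upper() in LIQUID_CLUSTERING_KEYWORDS:
        return Keyword(text.upper())
    # Anything else is an ordinary column name.
    return ColumnRef(text)
```

Under these assumptions, `classify_clustered_by("AUTO")` yields a keyword, while ``classify_clustered_by("`auto`")`` yields a column named `auto`.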

Validation (meta.py): A single string passed to clustered_by is normalised to a list before processing. The validator then skips the column-count check for exp.Var(AUTO|NONE), but only when the field is clustered_by and the dialect is databricks. On deserialisation from JSON, keyword strings are restored to exp.Var sentinels before list_of_fields_validator can normalise them into quoted columns.
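
A rough sketch of the gating described above, using plain strings in place of exp.Var. The helper names `as_list` and `is_keyword_form` are hypothetical, the strings are assumed to represent bare (unquoted) tokens already stored in upper case, and requiring exactly one item is an assumption based on the mixed-list tests.

```python
from typing import List, Union

# Mirrors the constant described for constants.py.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


def as_list(value: Union[str, List[str]]) -> List[str]:
    """Normalise a single string to a one-element list."""
    return [value] if isinstance(value, str) else list(value)


def is_keyword_form(value: Union[str, List[str]], field: str, dialect: str) -> bool:
    """Return True when the column-count check should be skipped.

    The skip applies only to clustered_by on the databricks dialect,
    and only when the value is a lone AUTO or NONE (assumption).
    """
    items = as_list(value)
    return (
        field == "clustered_by"
        and dialect == "databricks"
        and len(items) == 1
        and items[0] in LIQUID_CLUSTERING_KEYWORDS
    )
```

With this gate, the same `"AUTO"` value on another field or another dialect falls through to the normal column handling.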

Validation (definition.py): The validate_definition column-existence check skips keyword sentinels for the same clustered_by + databricks scope.
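
The column-existence skip might look roughly like the following. This is a simplified sketch rather than the actual validate_definition logic: `Keyword` and `missing_clustering_columns` are hypothetical, and columns are modelled as plain strings.

```python
from dataclasses import dataclass
from typing import List, Set, Union

# Mirrors the constant described for constants.py.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for an exp.Var keyword sentinel."""
    name: str


def missing_clustering_columns(
    clustered_by: List[Union[Keyword, str]],
    known_columns: Set[str],
    dialect: str,
) -> List[str]:
    """Return clustered_by entries that are not columns of the model."""
    missing = []
    for item in clustered_by:
        # Keyword sentinels are skipped, but only on databricks.
        if (
            dialect == "databricks"
            and isinstance(item, Keyword)
            and item.name in LIQUID_CLUSTERING_KEYWORDS
        ):
            continue
        name = item.name if isinstance(item, Keyword) else item
        if name not in known_columns:
            missing.append(name)
    return missing
```

A quoted column named `auto` is a plain string here, so it is still checked against the model's columns.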

Code generation (databricks.py): _build_table_properties_exp detects a single exp.Var in clustered_by and emits CLUSTER BY AUTO / CLUSTER BY NONE without wrapping it in a tuple. If the Var holds any other value, it raises a ValueError. Multi-column paths are unchanged.
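
As an illustration of the emitted SQL, here is a minimal string-based sketch. The real adapter builds sqlglot expressions rather than strings; `Keyword` and `render_cluster_by` are hypothetical, and the backtick quoting of column names is an assumption.

```python
from dataclasses import dataclass
from typing import List, Union

# Mirrors the constant described for constants.py.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for an exp.Var keyword sentinel."""
    name: str


def render_cluster_by(clustered_by: List[Union[Keyword, str]]) -> str:
    """Render a CLUSTER BY clause for keyword or column forms."""
    # Keyword form: a single sentinel, emitted without parentheses.
    if len(clustered_by) == 1 and isinstance(clustered_by[0], Keyword):
        keyword = clustered_by[0].name
        if keyword not in LIQUID_CLUSTERING_KEYWORDS:
            raise ValueError(f"Unexpected clustering keyword: {keyword!r}")
        return f"CLUSTER BY {keyword}"
    # Column form: every entry is rendered as a quoted column name.
    names = [c.name if isinstance(c, Keyword) else c for c in clustered_by]
    return "CLUSTER BY (" + ", ".join(f"`{n}`" for n in names) + ")"
```

Keeping the keyword branch separate from the column branch means the existing multi-column rendering does not need to change.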

Usage

-- In a SQLMesh model definition
MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by AUTO
);

MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by NONE
);

Via the Python API, both a plain string and an exp.Var (from sqlglot) are accepted:

create_sql_model(..., dialect="databricks", clustered_by="AUTO")
create_sql_model(..., dialect="databricks", clustered_by=exp.Var(this="AUTO"))

Columns with the names auto or none are still supported via backtick quoting:

MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by (`auto`, `none`)
);

Tests

  • tests/core/test_dialect.py — parser round-trips: AUTO/NONE keywords, backtick-quoted columns, paren-wrapped single columns, multi-column lists, mixed list (a, AUTO), non-Databricks dialect
  • tests/core/test_model.py — model DDL; Python API with both exp.Var and plain string; backtick-quoted column names; render_definition output; JSON serialisation round-trip; non-Databricks dialect rejection; mixed-list column treatment
  • tests/core/engine_adapter/test_databricks.py — adapter emits CLUSTER BY AUTO / CLUSTER BY NONE without column parens

…id clustering

Adds parser, validator, and Databricks adapter support for the keyword
forms of liquid clustering. Bare AUTO/NONE (unquoted VAR tokens) are
recognised as keywords; backtick-quoted `auto`/`none` and
parenthesised forms remain real column references.

- Add LIQUID_CLUSTERING_KEYWORDS constant to avoid repeating the
  sentinel set across dialect, meta, definition, and adapter
- Parser (dialect.py): detect VAR-token AUTO/NONE on clustered_by;
  strip Paren from single-column clustered_by to match partitioned_by
  normalisation
- Validator (meta.py): normalise single string input to list; restore
  keyword sentinels from JSON strings on deserialisation; skip
  column-count check for keywords, gated on clustered_by + databricks
- validate_definition (definition.py): skip keyword sentinels in the
  column-existence check, same gate
- Adapter (databricks.py): emit CLUSTER BY AUTO / CLUSTER BY NONE
  without a tuple wrapper; raise ValueError on unexpected bare Var
- Tests: parser round-trips, Python API (exp.Var and plain string),
  backtick-quoted columns, render_definition, JSON round-trip,
  non-Databricks rejection, mixed-list behaviour, adapter SQL emission

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>