Ignore remote rewrites of vector search direct_access_index_spec.schema_json #5481
Merged
janniklasrose merged 8 commits on Jun 12, 2026
The Vector Search backend canonicalizes the SQL type aliases in a DIRECT_ACCESS index's schema_json on create (e.g. "int" -> "integer") and returns the normalized form on read. The fake server echoed the request verbatim, so the create -> get round-trip didn't match the real API and couldn't reproduce the schema drift this normalization causes.

Fold the aliases to their canonical spelling, matching brickindex-common/src/utils/ColumnSpec.scala. HTML escaping is disabled when re-serializing so array<...> keeps its angle brackets.

Co-authored-by: Isaac
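The HTML-escaping remark in the commit message above can be illustrated with Go's standard `encoding/json` package. This is a minimal sketch, not the test server's actual code: the helper name `marshalNoHTMLEscape` and the sample column are hypothetical. It shows that `json.Marshal` escapes angle brackets by default, while an encoder with `SetEscapeHTML(false)` leaves `array<float>` intact.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// marshalNoHTMLEscape (hypothetical helper) serializes v as JSON without
// escaping '<', '>' and '&', which json.Marshal escapes by default.
func marshalNoHTMLEscape(v any) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	// Encode appends a trailing newline; trim it.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	// Hypothetical single-column schema.
	schema := map[string]string{"embedding": "array<float>"}

	escaped, err := json.Marshal(schema)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(escaped)) // {"embedding":"array\u003cfloat\u003e"}

	plain, err := marshalNoHTMLEscape(schema)
	if err != nil {
		panic(err)
	}
	fmt.Println(plain) // {"embedding":"array<float>"}
}
```

Both outputs decode to the same value; the difference only matters when the serialized string itself is compared or shown to the user.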
A DIRECT_ACCESS index's direct_access_index_spec is immutable, so any diff against it plans a destructive recreate. The Vector Search backend canonicalizes the SQL type aliases in schema_json (e.g. "int" -> "integer") and returns the normalized spelling, so a config that uses an alias drifts against the remote schema on every redeploy and silently recreates the index, dropping all upserted vectors.

Fold the aliases to their canonical form (matching brickindex-common/src/utils/ColumnSpec.scala) in OverrideChangeDesc and skip the recreate when the two schemas differ only by aliases. A genuine schema change still recreates.

Co-authored-by: Isaac
Create a DIRECT_ACCESS index whose schema_json uses the SQL type aliases the backend canonicalizes (int, bigint, smallint, tinyint, array<int>), alongside types that pass through unchanged (float, string, array<float>). The test server returns the normalized schema and the redeploy plan reports no changes, exercising the alias round-trip end-to-end without a live workspace.

Pinned to local: a live workspace returns a different spelling than the test server's canonical form, so only the hermetic output is stable.

Co-authored-by: Isaac
janniklasrose (Collaborator) commented on Jun 9, 2026
Commit: 49097b3
22 interesting tests: 15 SKIP, 7 KNOWN
Top 28 slowest tests (at least 2 minutes):
The previous approach (b63b466) had the backend behavior backwards: the user provides user-facing type names ("integer", "long", "short", "byte") in a DIRECT_ACCESS index's schema_json and Unity Catalog stores them as Spark type names ("int", "bigint", "smallint", "tinyint"). GET returns the Spark names - and the columns in sorted key order, which alias folding alone could not account for.

Replace the OverrideChangeDesc alias comparison with ignore_remote_changes on direct_access_index_spec.schema_json: remote rewrites of the field are ignored while a genuine local schema edit still recreates via the parent spec's recreate_on_changes rule.

Flip the test server normalization to match the real API (user-facing -> Spark names, sorted keys) and update the schema_normalization acceptance test to cover the rewrite round-trip, including deliberately unsorted keys. The bind fixture switches to the Spark spelling: bind seeds state from the remote, so the user-facing spelling would read as a local edit after bind and plan a recreate of the adopted index.

Co-authored-by: Isaac
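The rewrite described in this commit message (user-facing names to Spark names, keys returned sorted) can be sketched roughly as follows. This is an illustration under stated assumptions, not the test server's implementation: the function `normalizeSchemaJSON` and the table `sparkNames` are hypothetical names, the schema is assumed to be a flat JSON object of column name to type string, only the four type names listed above are mapped, and nested types such as `array<...>` pass through unchanged.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// sparkNames maps the user-facing type names named in the commit message
// to the Spark type names returned on read. Hypothetical table for this sketch.
var sparkNames = map[string]string{
	"integer": "int",
	"long":    "bigint",
	"short":   "smallint",
	"byte":    "tinyint",
}

// normalizeSchemaJSON (hypothetical) rewrites a flat schema_json object:
// known user-facing type names become Spark names and keys come back sorted.
func normalizeSchemaJSON(schemaJSON string) (string, error) {
	var cols map[string]string
	if err := json.Unmarshal([]byte(schemaJSON), &cols); err != nil {
		return "", err
	}
	for name, typ := range cols {
		if spark, ok := sparkNames[typ]; ok {
			cols[name] = spark
		}
	}
	// encoding/json writes map keys in sorted order; HTML escaping is
	// disabled so array<float> keeps its angle brackets.
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(cols); err != nil {
		return "", err
	}
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	// Deliberately unsorted keys, as in the acceptance test description.
	in := `{"text": "string", "id": "integer", "embedding": "array<float>", "count": "long"}`
	out, err := normalizeSchemaJSON(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// {"count":"bigint","embedding":"array<float>","id":"int","text":"string"}
}
```

The output differs from the input both in spelling and in key order, which is why folding aliases alone could not make the two strings compare equal.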
pietern approved these changes on Jun 12, 2026
….com:databricks/cli into vector-search-acceptance-tests
Contributor
Author
Integration tests hanging on lack of available Windows runners. No code change since last successful run https://go/deco-tests/27410539804
Collaborator
Integration test report
Commit: 6996d63
469 interesting tests: 415 MISS, 42 FAIL, 7 KNOWN, 3 PANIC, 2 SKIP
Top 50 slowest tests (at least 2 minutes):
deco-sdk-tagging (Bot) added a commit that referenced this pull request on Jun 17, 2026
## Release v1.4.0

### CLI

* Improved error messages for `ssh connect`: when an SSH connection attempt fails, the client now fetches and prints the server's recent error logs ([#5555](#5555)).
* Increase the SSH server startup timeout from 10 to 45 minutes when a GPU accelerator is requested via `databricks ssh connect --accelerator` ([#5569](#5569)).
* Fix authentication falling back to the default profile in `.databrickscfg` when a host is already configured via the environment (e.g. `DATABRICKS_HOST` with `DATABRICKS_TOKEN`) ([#5616](#5616)).
* ssh: fix opening remote environment in Cursor, which previously hung on default-extension install and never opened the editor ([#5619](#5619)).
* Improve the error shown when `databricks labs install` cannot find a project's `labs.yml`: the message now explains that either the requested version does not exist or the project is not installable with the CLI, and links to the repository ([#5559](#5559)).

### Bundles

* Remove API enum values and types that are still in development from the `databricks-bundles` Python package; these were never accepted by the backend ([#5484](#5484)).
* direct: Fix resolving a resource reference that is used more than once within the same field ([#5558](#5558)).
* Bundle variable references now accept Unicode letters in path segments (e.g. `${var.变量}`). ([#5532](#5532))
* Ignore remote changes for vector search direct_access_index_spec.schema_json to prevent drift when the backend normalizes the schema ([#5481](#5481)).
* Remove hidden, never-functional `--existing-dashboard-id`, `--existing-dashboard-path`, `--existing-alert-id`, and `--existing-genie-space-id` alias flags from `bundle generate`; use the documented `--existing-id` / `--existing-path` flags instead ([#5591](#5591)).
* engine/direct: Fix WAL corruption after two consecutive failed deploys ([#5606](#5606)).
* engine/direct: Don't open the deployment state WAL when a deploy's plan fails ([#5607](#5607)).
* Ignore unity catalog managed schema property defaults to avoid unnecessary drift ([#5195](#5195)).
* Add `postgres_roles` and `postgres_databases` resources to create Postgres roles and databases on a Lakebase branch ([#5467](#5467), [#5627](#5627)).
* direct: Stop spurious recreate/rename on redeploy when the backend normalizes a resource's name-based ID (e.g. Unity Catalog lowercasing a schema or volume name) ([#5599](#5599)).
* Fix the generated pipeline README to suggest `databricks bundle run <pipeline> --refresh <table>` for running a single transformation; the previously documented `--select` flag is not supported by `bundle run` ([#5252](#5252)).
Changes
* Add `ignore_remote_changes` for `direct_access_index_spec.schema_json`, replacing the earlier alias-folding in `OverrideChangeDesc`. Remote rewrites of the field are ignored; a genuine local schema edit still recreates via the parent spec's `recreate_on_changes` rule.
* The test server now matches the real API: user-facing type names (`integer`, `long`, `short`, `byte`) are stored as Spark type names (`int`, `bigint`, `smallint`, `tinyint`) and the columns are returned in sorted key order. The first revision of this PR had the mapping direction backwards.
* The `schema_normalization` acceptance test covers the rewrite round-trip, including deliberately unsorted keys. The `bind` fixture switches to the Spark spelling: bind seeds state from the remote, so any other spelling reads as a local edit after bind and would plan a recreate of the adopted index.

Why
When a Direct Access Vector Search index is created, the user provides type names like `integer`, `long`, `short`, `byte` in `schema_json`. These are stored in Unity Catalog as Spark type names (`int`, `bigint`, `smallint`, `tinyint`). When the API reads the index back, it returns the Spark type names in sorted key order instead of the original user input, causing `databricks bundle plan` to see a false diff on the immutable `schema_json` field and propose an unnecessary recreate that would drop all upserted vectors.

The downside of not having drift detection is eliminated by the field being immutable in the first place. Out-of-band changes would surface via an ID change.
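The false diff and the rule that avoids it can be sketched as a small decision function. This is a hedged illustration of the behavior described above, not the CLI's planner: `naivePlan`, `planIgnoringRemote`, and the `action` type are hypothetical, and the three inputs (the value saved at the last deploy, the current local config, and the value returned by the API) are a simplification of the real state model.

```go
package main

import "fmt"

// action is a hypothetical plan outcome for the immutable schema_json field.
type action string

const (
	skip     action = "skip"
	recreate action = "recreate"
)

// naivePlan compares the local config with the remote value, so a backend
// rewrite of schema_json looks like drift and triggers a recreate.
func naivePlan(config, remote string) action {
	if config != remote {
		return recreate
	}
	return skip
}

// planIgnoringRemote ignores the remote value for this field and compares
// the local config only with the value saved at the last deploy.
func planIgnoringRemote(saved, config, remote string) action {
	_ = remote // remote rewrites are deliberately not consulted
	if config != saved {
		return recreate // a genuine local schema edit
	}
	return skip
}

func main() {
	local := `{"id":"integer","text":"string"}`
	remote := `{"id":"int","text":"string"}` // rewritten by the backend

	fmt.Println(naivePlan(local, remote))                 // recreate
	fmt.Println(planIgnoringRemote(local, local, remote)) // skip

	edited := `{"id":"long","text":"string"}`
	fmt.Println(planIgnoringRemote(local, edited, remote)) // recreate

	// After bind, the saved value is seeded from the remote, so the
	// user-facing spelling in the config reads as a local edit.
	fmt.Println(planIgnoringRemote(remote, local, remote)) // recreate
}
```

The last case mirrors why the bind fixture uses the Spark spelling: with state seeded from the remote, only a config that matches the remote string plans no changes.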
Tests
* `libs/testserver`, `bundle/direct/...`
* `bundle/resources/vector_search_indexes/...`, `bundle/deployment/bind/vector_search_index`, and `bundle/invariant`, whose deploy-then-plan no-drift check now exercises the new rule with the user-facing spelling.

This pull request and its description were written by Isaac, an AI coding agent.