Recovering a header-only WAL (a deploy that opened the WAL but committed nothing, e.g. a crash between UpgradeToWrite and Finalize) advanced the in-memory serial without persisting it. After a second such failure the next WAL header was written two serials ahead of the committed state, and every later command failed WAL recovery until the WAL was deleted by hand.

Only advance the serial when the WAL carried entries, i.e. when the merged state is actually written back.

Co-authored-by: Isaac
This was referenced Jun 15, 2026
Name it after the WAL condition it recovers, matching sibling tests (empty-wal, stale-wal, future-serial-wal). Use a generic test-bundle name.

Co-authored-by: Isaac
Integration test report
Commit: ec65a6b
25 interesting tests: 15 SKIP, 7 KNOWN, 3 flaky
andrewnester approved these changes Jun 15, 2026
Integration test report
Commit: 8f270ba
530 interesting tests: 471 MISS, 41 FAIL, 7 KNOWN, 5 PANIC, 4 flaky, 2 SKIP
deco-sdk-tagging Bot added a commit that referenced this pull request Jun 17, 2026
## Release v1.4.0

### CLI

* Improved error messages for `ssh connect`: when an SSH connection attempt fails, the client now fetches and prints the server's recent error logs ([#5555](#5555)).
* Increase the SSH server startup timeout from 10 to 45 minutes when a GPU accelerator is requested via `databricks ssh connect --accelerator` ([#5569](#5569)).
* Fix authentication falling back to the default profile in `.databrickscfg` when a host is already configured via the environment (e.g. `DATABRICKS_HOST` with `DATABRICKS_TOKEN`) ([#5616](#5616)).
* ssh: fix opening remote environment in Cursor, which previously hung on default-extension install and never opened the editor ([#5619](#5619)).
* Improve the error shown when `databricks labs install` cannot find a project's `labs.yml`: the message now explains that either the requested version does not exist or the project is not installable with the CLI, and links to the repository ([#5559](#5559)).

### Bundles

* Remove API enum values and types that are still in development from the `databricks-bundles` Python package; these were never accepted by the backend ([#5484](#5484)).
* direct: Fix resolving a resource reference that is used more than once within the same field ([#5558](#5558)).
* Bundle variable references now accept Unicode letters in path segments (e.g. `${var.变量}`). ([#5532](#5532))
* Ignore remote changes for vector search direct_access_index_spec.schema_json to prevent drift when the backend normalizes the schema ([#5481](#5481)).
* Remove hidden, never-functional `--existing-dashboard-id`, `--existing-dashboard-path`, `--existing-alert-id`, and `--existing-genie-space-id` alias flags from `bundle generate`; use the documented `--existing-id` / `--existing-path` flags instead ([#5591](#5591)).
* engine/direct: Fix WAL corruption after two consecutive failed deploys ([#5606](#5606)).
* engine/direct: Don't open the deployment state WAL when a deploy's plan fails ([#5607](#5607)).
* Ignore unity catalog managed schema property defaults to avoid unnecessary drift ([#5195](#5195)).
* Add `postgres_roles` and `postgres_databases` resources to create Postgres roles and databases on a Lakebase branch ([#5467](#5467), [#5627](#5627)).
* direct: Stop spurious recreate/rename on redeploy when the backend normalizes a resource's name-based ID (e.g. Unity Catalog lowercasing a schema or volume name) ([#5599](#5599)).
* Fix the generated pipeline README to suggest `databricks bundle run <pipeline> --refresh <table>` for running a single transformation; the previously documented `--select` flag is not supported by `bundle run` ([#5252](#5252)).
Changes
The direct engine's local state is `resources.json` (committed) plus a `resources.json.wal` write-ahead log whose serial must be exactly one ahead of the committed serial. When a deploy opened the WAL but committed nothing (a header-only WAL, e.g. a crash between `UpgradeToWrite` and `Finalize`), recovery advanced the in-memory serial without persisting it. After a second such failure the next WAL header was written two serials ahead of the committed state, and every later `bundle` command then failed WAL recovery (`WAL serial (N) is ahead of expected`) until the WAL was deleted by hand.

Fix: only advance the serial when the WAL carried entries, i.e. when the merged state is actually written back, so a no-op WAL is discarded without drift.
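
The serial drift can be reproduced in a small toy model. The sketch below is not the engine's code: `state`, `wal`, `recoverWAL`, `crashedDeploy`, and `twoCrashesThenCommand` are hypothetical names, the starting serial is arbitrary, and silently discarding a stale WAL is an assumption made to keep the model short. It only illustrates the invariant described above (WAL serial equals committed serial plus one) and why advancing the in-memory serial for a header-only WAL breaks it on the second crash.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a hypothetical stand-in for the engine's local state.
// committed mirrors the serial persisted in resources.json;
// inMemory is the serial the running process carries after recovery.
type state struct {
	committed int
	inMemory  int
}

// wal is a hypothetical stand-in for resources.json.wal.
// entries == 0 models a header-only WAL.
type wal struct {
	serial  int
	entries int
}

var errAhead = errors.New("WAL serial is ahead of expected")

// recoverWAL replays a leftover WAL against the committed state.
// advanceOnEmpty=true models the behaviour before the fix.
func recoverWAL(s *state, w *wal, advanceOnEmpty bool) error {
	switch {
	case w == nil:
		return nil
	case w.serial > s.committed+1:
		return errAhead
	case w.serial < s.committed+1:
		return nil // assumption: a stale WAL is simply discarded here
	}
	if w.entries > 0 {
		// Merged state is written back, so both serials move together.
		s.committed = w.serial
		s.inMemory = w.serial
	} else if advanceOnEmpty {
		// Old behaviour: serial advanced in memory but never persisted.
		s.inMemory = w.serial
	}
	return nil
}

// crashedDeploy recovers any leftover WAL, writes a new WAL header,
// and then "crashes" before appending a single entry.
func crashedDeploy(s *state, leftover *wal, advanceOnEmpty bool) (*wal, error) {
	s.inMemory = s.committed // a fresh process starts from the committed serial
	if err := recoverWAL(s, leftover, advanceOnEmpty); err != nil {
		return leftover, err
	}
	return &wal{serial: s.inMemory + 1}, nil
}

// twoCrashesThenCommand runs two crashed deploys and then the recovery
// step that any later command would perform.
func twoCrashesThenCommand(advanceOnEmpty bool) error {
	s := &state{committed: 5}
	var w *wal
	var err error
	for i := 0; i < 2; i++ {
		if w, err = crashedDeploy(s, w, advanceOnEmpty); err != nil {
			return err
		}
	}
	s.inMemory = s.committed
	return recoverWAL(s, w, advanceOnEmpty)
}

func main() {
	fmt.Println("advance on header-only WAL:", twoCrashesThenCommand(true))
	fmt.Println("advance only with entries: ", twoCrashesThenCommand(false))
}
```

In this model the first call returns the "ahead of expected" error because the second crashed deploy writes its header at committed + 2, while the second call recovers cleanly because the header-only WAL is dropped without touching the serial.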
Why
Fixes #5557. Reported by a customer: any error that took more than one deploy
attempt to clear wedged the bundle on the second failure.
Tests
* `bundle/deploy/wal/two-crashed-deploys`: two deploys killed mid-apply recover without wedging.
* `TestHeaderOnlyWALRecoveryDoesNotAdvanceSerial`.

Both confirmed to fail when the fix is reverted.
This pull request and its description were written by Isaac.