ssh: surface server errors from a running server on failed connections #5555
Merged
The ssh server keeps its recent warning/error log lines in a bounded in-memory buffer and serves them at /logs next to /metadata. When the spawned ssh client exits with a connection-level failure (code 255), "ssh connect" fetches that endpoint and prints the server's actual errors instead of only a generic hint. The Jobs API exposes no stdout logs for a running notebook task, so this is the only way to read the server's errors while the bootstrap job is still alive.

Co-authored-by: Isaac
Integration test report (commit d56d64f)
22 interesting tests: 15 SKIP, 7 KNOWN
rclarey approved these changes on Jun 12, 2026
Integration test report (commit 0ded6ef)
524 interesting tests: 468 MISS, 38 FAIL, 7 KNOWN, 4 PANIC, 3 flaky, 2 SKIP, 2 RECOVERED
deco-sdk-tagging Bot added a commit that referenced this pull request on Jun 17, 2026
## Release v1.4.0

### CLI

* Improved error messages for `ssh connect`: when an SSH connection attempt fails, the client now fetches and prints the server's recent error logs ([#5555](#5555)).
* Increase the SSH server startup timeout from 10 to 45 minutes when a GPU accelerator is requested via `databricks ssh connect --accelerator` ([#5569](#5569)).
* Fix authentication falling back to the default profile in `.databrickscfg` when a host is already configured via the environment (e.g. `DATABRICKS_HOST` with `DATABRICKS_TOKEN`) ([#5616](#5616)).
* ssh: fix opening remote environment in Cursor, which previously hung on default-extension install and never opened the editor ([#5619](#5619)).
* Improve the error shown when `databricks labs install` cannot find a project's `labs.yml`: the message now explains that either the requested version does not exist or the project is not installable with the CLI, and links to the repository ([#5559](#5559)).

### Bundles

* Remove API enum values and types that are still in development from the `databricks-bundles` Python package; these were never accepted by the backend ([#5484](#5484)).
* direct: Fix resolving a resource reference that is used more than once within the same field ([#5558](#5558)).
* Bundle variable references now accept Unicode letters in path segments (e.g. `${var.变量}`) ([#5532](#5532)).
* Ignore remote changes for vector search `direct_access_index_spec.schema_json` to prevent drift when the backend normalizes the schema ([#5481](#5481)).
* Remove hidden, never-functional `--existing-dashboard-id`, `--existing-dashboard-path`, `--existing-alert-id`, and `--existing-genie-space-id` alias flags from `bundle generate`; use the documented `--existing-id` / `--existing-path` flags instead ([#5591](#5591)).
* engine/direct: Fix WAL corruption after two consecutive failed deploys ([#5606](#5606)).
* engine/direct: Don't open the deployment state WAL when a deploy's plan fails ([#5607](#5607)).
* Ignore unity catalog managed schema property defaults to avoid unnecessary drift ([#5195](#5195)).
* Add `postgres_roles` and `postgres_databases` resources to create Postgres roles and databases on a Lakebase branch ([#5467](#5467), [#5627](#5627)).
* direct: Stop spurious recreate/rename on redeploy when the backend normalizes a resource's name-based ID (e.g. Unity Catalog lowercasing a schema or volume name) ([#5599](#5599)).
* Fix the generated pipeline README to suggest `databricks bundle run <pipeline> --refresh <table>` for running a single transformation; the previously documented `--select` flag is not supported by `bundle run` ([#5252](#5252)).
Changes
* The server serves its recent warning/error log lines at `/logs`, next to the existing `/metadata` endpoint, behind the same driver-proxy auth. Implemented as a tee `slog.Handler`, so all records still flow to stdout (the run-page logs) unchanged.
* When the spawned `ssh` client exits with a connection-level failure (code 255), `ssh connect` fetches `/logs` and prints the server's actual errors (e.g. `failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`). The generic "install openssh-server" hint remains as the fallback when no logs are available (e.g. older server versions without `/logs`); the fetch is best-effort.
* `newDriverProxyRequest` is factored out of `getServerMetadata` and shared by the new logs fetch.

Why

When a connection attempt fails against a healthy-looking bootstrap job (`FAILURE_MODES.md` Mode 1: the container lacks `sshd`, the server logs the error per connection and keeps running), the real error was unreachable from the client: the Jobs API exposes no stdout logs for a running notebook task (`GetRunOutput` requires a terminal state and `RunOutput.Logs` is unsupported for notebook tasks). The server's own HTTP service behind the driver proxy is the only channel available while the job is alive. Complements #5552, which covers the terminated-job case.

Tests

* sessionattrs, HTTP handler); `./task test-exp-ssh` and full lint pass.
* `The SSH connection closed unexpectedly. Recent SSH server errors:` followed by the server's `failed to start SSHD process: fork/exec ...: no such file or directory` log line. A regular `ssh connect` (no plant) still connects end-to-end.

This pull request and its description were written by Isaac.