experimental/ssh: show compute provisioning status during ssh connect startup by TanishqDatabricks · Pull Request #5576 · databricks/cli

TanishqDatabricks · 2026-06-12T16:21:56Z

Changes

While the SSH server bootstrap job's compute spins up, the spinner now reads "Waiting for compute to start..." (all connection types) instead of "Starting SSH server...". For GPU accelerators, a persistent notice is printed upfront: "Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...".

Why

`ssh connect --accelerator=GPU_8xH100` frequently fails with:

Error: failed to ensure that ssh server is running: failed to submit and start ssh server job: timed out: waiting for task to start (current state: PENDING)

GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.

Tests

`go build`, `go vet`, and `go test ./experimental/ssh/...` all pass; `TestWaitForJobToStartSurfacesFailure` updated for the `waitForJobToStart` signature change.
The change is display-only (spinner and notice text); no control flow or error behavior is modified.

This pull request and its description were written by Isaac.

… startup

GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at P90 to acquire, but while waiting `ssh connect` only showed a generic "Starting SSH server... (task: PENDING)" spinner, so users assumed a long wait meant a service outage (see the Zillow report in #remote-development-help).

Show "Waiting for compute to start..." while the bootstrap job's compute spins up (all connection types, including dedicated-cluster auto-start), and print an upfront notice for GPU accelerators that provisioning can take upwards of 10 minutes. The startup timeout increase for GPU accelerators is handled separately.

Co-authored-by: Isaac

github-actions · 2026-06-12T16:22:27Z

Waiting for approval

Based on git history, these people are best suited to review:

@ilia-db -- recent work in experimental/ssh/internal/client/
@anton-107 -- recent work in experimental/ssh/internal/client/

Eligible reviewers: @andrewnester, @denik, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

eng-dev-ecosystem-bot · 2026-06-12T17:09:58Z

Integration test report

Commit: ea6ce07

Run: 27428511968

	Env	🟨KNOWN	💚RECOVERED	🙈SKIP	✅pass	🙈skip	Time
🟨	aws linux	7		15	264	977	8:10
🟨	aws windows	7		15	266	975	13:17
💚	aws-ucws linux		7	15	360	891	7:56
💚	aws-ucws windows		7	15	362	889	12:18
💚	azure linux		1	17	267	975	7:11
💚	azure windows		1	17	269	973	11:21
💚	azure-ucws linux		1	17	365	887	8:01
💚	azure-ucws windows		1	17	367	885	12:46
💚	gcp linux		1	17	263	978	8:26
💚	gcp windows		1	17	265	976	11:44

22 interesting tests: 15 SKIP, 7 KNOWN

	Test Name	aws linux	aws windows	aws-ucws linux	aws-ucws windows	azure linux	azure windows	azure-ucws linux	azure-ucws windows	gcp linux	gcp windows
🟨	TestAccept	🟨K	🟨K	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R
🙈	TestAccept/bundle/invariant/no_drift	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/permissions	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions	🟨K	🟨K	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct	🟨K	🟨K	💚R	💚R
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	🟨K	🟨K	💚R	💚R
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions	🟨K	🟨K	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct	🟨K	🟨K	💚R	💚R
🟨	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	🟨K	🟨K	💚R	💚R
🙈	TestAccept/bundle/resources/postgres_branches/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/recreate	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/replace_existing	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/update_protected	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/without_branch_id	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_endpoints/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_endpoints/recreate	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_projects/update_display_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/synced_database_tables/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_indexes/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_indexes/grants/select	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/ssh/connection	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S

Top 28 slowest tests (at least 2 minutes):

duration	env	testname
6:16	gcp windows	TestAccept
6:12	aws-ucws windows	TestAccept
6:05	azure-ucws windows	TestAccept
6:02	azure windows	TestAccept
4:54	gcp linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:19	gcp windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:14	gcp windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:08	gcp linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:57	azure-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:41	azure linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:29	aws-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:17	aws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:16	azure windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:12	aws-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:05	aws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:58	azure windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:58	azure-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:56	azure linux	TestAccept
2:54	gcp linux	TestAccept
2:49	aws-ucws linux	TestAccept
2:47	aws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:46	azure-ucws linux	TestAccept
2:46	azure-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:45	azure-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44	aws-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:36	azure linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:34	aws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:32	aws-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

anton-107

Thanks — the diff is clean and the intent is right. Two requested changes on the provisioning notice, both about the wording.

1. Differentiate the message by accelerator type

Right now GPU_1xA10 and GPU_8xH100 get the identical "upwards of 10 minutes" notice, but their provisioning latencies differ a lot — a single A10 is typically acquired much faster than an 8×H100 node. Telling an A10 user to expect 10+ minutes is misleading, and the 8×H100 case arguably warrants a stronger heads-up (P90 ~30 min).

Suggest keying the message off opts.Accelerator — e.g. a small map[string]string of accelerator → expected-time phrasing, with a generic fallback for anything not in the map. That also keeps it correct as new accelerator types are added.

2. Tighten the wording

"upwards of 10 minutes" is a touch informal and slightly misrepresents the data: with P50 ≈ 10 min it implies 10 min is the floor, when in fact roughly half the time it finishes faster — and the real pain is the ~30 min P90 that drove the 45-min timeout in #5569. Anchoring on a range is more useful to someone staring at a long PENDING state. The trailing ... also reads casual for a one-time sentence (vs. the ongoing spinner text, where it fits).

Suggested wording:

GPU_8xH100: Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.
GPU_1xA10: Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained. (adjust to the latency we actually observe)

The matching spinner text can stay short, e.g. "Provisioning GPU_8xH100 compute...".
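A minimal sketch of the two suggestions above, assuming the notice is built from a plain accelerator string. The names `provisioningHints`, `defaultProvisioningHint`, `provisioningNotice`, and `provisioningSpinner` are hypothetical and not taken from the diff, the fallback wording is an assumption, and `GPU_FUTURE` is a made-up accelerator used only to exercise the fallback:

```go
package main

import "fmt"

// provisioningHints maps an accelerator to its expected-time phrasing.
// The wording is copied from the review suggestion; the A10 text should be
// adjusted to the latency actually observed.
var provisioningHints = map[string]string{
	"GPU_8xH100": "This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.",
	"GPU_1xA10":  "This usually takes a few minutes, longer when capacity is constrained.",
}

// defaultProvisioningHint is a generic fallback for accelerators that are not
// in the map. This wording is an assumption, not part of the PR.
const defaultProvisioningHint = "This can take several minutes depending on capacity."

// provisioningNotice returns the one-time notice printed before the spinner starts.
func provisioningNotice(accelerator string) string {
	hint, ok := provisioningHints[accelerator]
	if !ok {
		hint = defaultProvisioningHint
	}
	return fmt.Sprintf("Provisioning %s compute. %s", accelerator, hint)
}

// provisioningSpinner returns the short text shown while the wait is ongoing.
func provisioningSpinner(accelerator string) string {
	return fmt.Sprintf("Provisioning %s compute...", accelerator)
}

func main() {
	// GPU_FUTURE is a hypothetical accelerator that falls through to the default hint.
	for _, acc := range []string{"GPU_8xH100", "GPU_1xA10", "GPU_FUTURE"} {
		fmt.Println(provisioningNotice(acc))
	}
	fmt.Println(provisioningSpinner("GPU_8xH100"))
}
```

Keeping the phrasing in a map means a new accelerator type only needs one new entry, and anything unlisted still gets a truthful, if generic, notice.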

TanishqDatabricks mentioned this pull request Jun 12, 2026

experimental/ssh: show compute provisioning status during ssh connect startup #5572

Closed

TanishqDatabricks temporarily deployed to test-trigger-is June 12, 2026 16:22 — with GitHub Actions Inactive

TanishqDatabricks requested a review from anton-107 June 12, 2026 16:23

anton-107 requested changes Jun 17, 2026

TanishqDatabricks wants to merge 1 commit into main from ssh-connect-gpu-startup-ux

3 participants
