experimental/ssh: show compute provisioning status during ssh connect startup #5576
TanishqDatabricks wants to merge 1 commit into
… startup
GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at P90 to acquire, but while waiting `ssh connect` only showed a generic "Starting SSH server... (task: PENDING)" spinner, so users assumed a long wait meant a service outage (see the Zillow report in #remote-development-help).
Show "Waiting for compute to start..." while the bootstrap job's compute spins up (all connection types, including dedicated-cluster auto-start), and print an upfront notice for GPU accelerators that provisioning can take upwards of 10 minutes. The startup timeout increase for GPU accelerators is handled separately.
Co-authored-by: Isaac
Waiting for approval
Integration test report
Commit: ea6ce07
22 interesting tests: 15 SKIP, 7 KNOWN
Top 28 slowest tests (at least 2 minutes):
Thanks — the diff is clean and the intent is right. Two requested changes on the provisioning notice, both about the wording.
1. Differentiate the message by accelerator type
Right now GPU_1xA10 and GPU_8xH100 get the identical "upwards of 10 minutes" notice, but their provisioning latencies differ a lot — a single A10 is typically acquired much faster than an 8×H100 node. Telling an A10 user to expect 10+ minutes is misleading, and the 8×H100 case arguably warrants a stronger heads-up (P90 ~30 min).
Suggest keying the message off opts.Accelerator — e.g. a small map[string]string of accelerator → expected-time phrasing, with a generic fallback for anything not in the map. That also keeps it correct as new accelerator types are added.
2. Tighten the wording
"upwards of 10 minutes" is a touch informal and slightly misrepresents the data: with P50 ≈ 10 min it implies 10 min is the floor, when in fact roughly half the time it finishes faster — and the real pain is the ~30 min P90 that drove the 45-min timeout in #5569. Anchoring on a range is more useful to someone staring at a long PENDING state. The trailing ... also reads casual for a one-time sentence (vs. the ongoing spinner text, where it fits).
Suggested wording:
- GPU_8xH100: "Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained."
- GPU_1xA10: "Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained." (adjust to the latency we actually observe)
The matching spinner text can stay short, e.g. Provisioning GPU_8xH100 compute....
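To make point 1 concrete, here is a minimal sketch of the accelerator-keyed lookup, using the wording suggested above. The names `provisioningNotices` and `provisioningNotice` are hypothetical and not taken from the diff; the fallback sentence and the choice to print nothing for an empty accelerator are assumptions, and the caller would pass in whatever `opts.Accelerator` holds.

```go
package main

import "fmt"

// provisioningNotices maps an accelerator type to its one-time upfront notice.
// Hypothetical name; the wording is the reviewer's suggestion and should be
// adjusted to the latencies actually observed.
var provisioningNotices = map[string]string{
	"GPU_8xH100": "Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.",
	"GPU_1xA10":  "Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained.",
}

// provisioningNotice returns the notice for the given accelerator.
// An empty accelerator yields no notice (assumption: non-GPU sessions print
// nothing). Accelerators missing from the map get a generic fallback, so new
// types stay covered without a code change.
func provisioningNotice(accelerator string) string {
	if accelerator == "" {
		return ""
	}
	if msg, ok := provisioningNotices[accelerator]; ok {
		return msg
	}
	// Assumed fallback wording; deliberately avoids quoting a specific duration.
	return fmt.Sprintf("Provisioning %s compute. This can take several minutes, longer when capacity is constrained.", accelerator)
}

func main() {
	// "GPU_NEW" is a made-up type used only to exercise the fallback path.
	for _, acc := range []string{"GPU_8xH100", "GPU_1xA10", "GPU_NEW"} {
		fmt.Println(provisioningNotice(acc))
	}
}
```

Keeping the strings in one map means a wording or latency update touches a single place, and the fallback avoids promising a number we have not measured.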
Changes
While the SSH server bootstrap job's compute spins up, the spinner now reads `Waiting for compute to start...` (all connection types) instead of `Starting SSH server...`. For GPU accelerators, a persistent notice is printed upfront: `Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...`.
Why
`ssh connect --accelerator=GPU_8xH100` frequently fails with a startup timeout error. GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.
Tests
`go build`, `go vet`, and `go test ./experimental/ssh/...` all pass; `TestWaitForJobToStartSurfacesFailure` updated for the `waitForJobToStart` signature change.
This pull request and its description were written by Isaac.