Heartbeat loss / session close terminates in-flight jobs (§6.4, §6.7) #37

@nficano

Category: spec-conformance
Severity: blocker
Location: src/Arcp.Runtime/SessionState.Jobs.cs:28-45
Spec: ARCP v1.1 §6.4, §6.7

What

Each running job is launched with the session's _cts.Token, and Job.CancellationSource is linked to that same token through the parentCancellation argument. SessionState.CloseAsync (SessionState.cs:122) and the ReceiverLoop finally block (SessionState.cs:90) both call _cts.Cancel(), and HeartbeatLoop calls CloseAsync once the session has been idle for ≥ 2 intervals (SessionState.Outbound.cs:49). As a result, a heartbeat timeout, a graceful close, or a transient transport drop cancels every job running in that session. This contradicts the spec: §6.4 states the runtime MUST NOT terminate jobs on heartbeat loss, and §6.7 states in-flight jobs are not affected by close and remain resumable within the resume window.

Evidence

var submission = await _server.JobManager
    .SubmitAsync(submit, SessionId, Principal?.Subject, emit, inboundTraceId, _cts.Token, cancellationToken)
    .ConfigureAwait(false);
...
_ = Task.Run(() => _server.JobManager.RunAsync(job, resolved, emit, _cts.Token), _cts.Token);

Proposed fix

Decouple job lifetime from session lifetime: give each job (or the JobManager) its own CancellationTokenSource rooted at the server/runtime, not the submitting session's _cts. Session teardown should stop streaming to that transport but leave the job running and resumable.
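
A minimal sketch of the decoupling, assuming a runtime-wide shutdown token exists. JobHandle, runtimeCts, and sessionCts are hypothetical names for illustration and are not the actual Arcp.Runtime types; the real change would go through JobManager.SubmitAsync and RunAsync. The point it shows is that the job's CancellationTokenSource is linked only to the runtime token, so cancelling the session token leaves the job untouched.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for per-job cancellation state (not an Arcp.Runtime type).
sealed class JobHandle : IDisposable
{
    private readonly CancellationTokenSource _cts;

    public JobHandle(CancellationToken runtimeShutdown)
    {
        // Linked to the runtime-wide token only, never to a session's token.
        _cts = CancellationTokenSource.CreateLinkedTokenSource(runtimeShutdown);
    }

    public CancellationToken Token => _cts.Token;

    // Explicit job.cancel path (or a lease/budget limit) ends the job.
    public void Cancel() => _cts.Cancel();

    public void Dispose() => _cts.Dispose();
}

static class Demo
{
    static async Task Main()
    {
        using var runtimeCts = new CancellationTokenSource(); // runtime lifetime
        using var sessionCts = new CancellationTokenSource(); // one session's lifetime
        using var job = new JobHandle(runtimeCts.Token);

        // The job observes only its own token.
        Task work = Task.Run(() => Task.Delay(Timeout.Infinite, job.Token));

        // Heartbeat loss, graceful close, or transport drop: session token is cancelled.
        sessionCts.Cancel();
        await Task.Delay(100);
        Console.WriteLine($"after session close: completed={work.IsCompleted}");

        // Only an explicit cancel (or runtime shutdown) stops the job.
        job.Cancel();
        try { await work; } catch (OperationCanceledException) { }
        Console.WriteLine($"after job.cancel: canceled={work.IsCanceled}");
    }
}
```

With this shape, SessionState would still cancel _cts to stop its own receiver and heartbeat loops, but that token would no longer be passed as parentCancellation or to RunAsync.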

Acceptance criteria

  • A job submitted in a session keeps running (and remains resumable) after that session's transport drops or its heartbeat is declared lost; only an explicit job.cancel or runtime/lease/budget limit terminates it.
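
One way the "keeps running and remains resumable" criterion could be exercised is with a per-job event log that the session attaches to and detaches from. This is a sketch under assumptions: JobEventLog, Attach, Detach, and the sequence-number replay are hypothetical and do not describe the existing emit callback or the spec's resume mechanism, and resume-window expiry is not modelled.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical per-job event log (illustrative only, not an Arcp.Runtime API).
sealed class JobEventLog
{
    private readonly object _gate = new object();
    private readonly List<string> _events = new List<string>();
    private Action<long, string>? _sink; // current session's transport, if attached

    // Called by the job: always records, forwards only while a session is attached.
    public void Emit(string payload)
    {
        lock (_gate)
        {
            _events.Add(payload);
            _sink?.Invoke(_events.Count, payload); // sequence numbers start at 1
        }
    }

    // Session teardown: stop streaming, leave the job and its log untouched.
    public void Detach()
    {
        lock (_gate) { _sink = null; }
    }

    // Resume: replay everything after lastSeenSeq, then stream live events.
    public void Attach(long lastSeenSeq, Action<long, string> sink)
    {
        lock (_gate)
        {
            for (int i = (int)lastSeenSeq; i < _events.Count; i++)
                sink(i + 1, _events[i]);
            _sink = sink;
        }
    }
}

static class ResumeDemo
{
    static void Main()
    {
        var log = new JobEventLog();
        var seen = new List<long>();

        log.Attach(0, (seq, payload) => seen.Add(seq));
        log.Emit("a");   // streamed live as seq 1
        log.Detach();    // transport drops or heartbeat is declared lost
        log.Emit("b");   // job keeps producing; recorded as seq 2
        log.Emit("c");   // recorded as seq 3

        // A new session resumes after the last sequence number it saw.
        log.Attach(1, (seq, payload) => seen.Add(seq));
        Console.WriteLine(string.Join(",", seen)); // prints 1,2,3
    }
}
```

An acceptance test along these lines would assert that events emitted while no session is attached are still delivered after resume, and that the job's task has not completed in between.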
