feat(tracing): OTel span queue and export telemetry (SGPINF-1863) by james-cardenas · Pull Request #373 · scaleapi/scale-agentex-python

james-cardenas · 2026-05-28T22:37:54Z

Summary

SGPINF-1863 - Validate SDK Connection Fix (PR #362) to Unblock 2x Export Capacity

Add OpenTelemetry metrics for async span queue processing: depth, batch lag, drain duration, and shutdown flush timing.
Record SGP export success/failure counters on async upsert_batch completion with bounded HTTP status labels.
Introduce AGENTEX_TRACING_METRICS=0|false|no|off kill switch to disable SDK-side recording without code changes.
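
A minimal sketch of how such a kill switch could be parsed, for illustration only. The helper name `tracing_metrics_enabled` and the case-insensitive, whitespace-tolerant comparison are assumptions; only the environment variable name and the four disabling values come from this PR.

```python
import os

# Values that switch recording off, as listed in the PR summary.
_DISABLED_VALUES = {"0", "false", "no", "off"}


def tracing_metrics_enabled(env=None):
    """Return False when AGENTEX_TRACING_METRICS holds a disabling value.

    Hypothetical helper: the SDK's real function name and parsing may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENTEX_TRACING_METRICS", "")
    # Assumption: the comparison ignores case and surrounding whitespace.
    return raw.strip().lower() not in _DISABLED_VALUES
```

With this shape, an unset variable leaves recording on, so the switch only takes effect when an operator sets it explicitly.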

Stacked on PR #362 (stas/tracing-perf-linger-keepalive) for load-test observability before the M5 re-run.

Test plan

uv run pytest src/agentex/lib/core/observability/tests/test_tracing_metrics.py src/agentex/lib/core/observability/tests/test_tracing_metrics_recording.py tests/lib/core/tracing/test_span_queue.py tests/lib/core/tracing/processors/test_sgp_tracing_processor.py -o addopts=
Pin SDK in rocket mock agent and verify metrics appear in staging Mimir during a short load test
Confirm AGENTEX_TRACING_METRICS=0 disables recording with no export overhead regression

Greptile Summary

This PR adds OpenTelemetry metrics instrumentation to the async span queue and SGP exporter: queue depth, batch lag, drain duration, export success/failure counters, and shutdown flush timing. An AGENTEX_TRACING_METRICS kill switch disables recording without code changes. All three issues flagged in the prior review (queue depth under-reporting, unbounded http_code label cardinality, and hard-coded "sgp" in record_export_success) have been addressed.

tracing_metrics.py / tracing_metrics_recording.py: New modules define all OTel instruments and thin "best-effort" recording helpers that lazy-load the OTel SDK only on first use, keeping hot-path import overhead near zero.
span_queue.py: Recording calls are woven into enqueue, drain-batch coalesce, per-phase dispatch timing, retry-exhausted failure, permanent failure, and shutdown-timeout paths; the record_batch_coalesced call correctly passes self._queue.qsize() + len(batch) to capture pre-drain depth.
sgp_tracing_processor.py: record_export_success is called after each successful upsert_batch, taking an explicit processor label rather than a hard-coded string.
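
The lazy-load plus best-effort pattern described above might look roughly like the following. This is a sketch, not the module's actual code: the decorator name `best_effort`, the meter name, and the counter name are hypothetical, while `get_meter`, `create_counter`, and `add` are part of the public OpenTelemetry Python metrics API.

```python
import importlib

_metrics_api = None  # cached opentelemetry.metrics module, resolved on first use
_enqueued_counter = None  # cached instrument


def _load_metrics_api():
    """Import the OTel metrics API lazily so importing this module stays cheap."""
    global _metrics_api
    if _metrics_api is None:
        _metrics_api = importlib.import_module("opentelemetry.metrics")
    return _metrics_api


def best_effort(fn):
    """Wrap a recording helper so it can never raise into the export path."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            return None  # swallow: telemetry must not break span export
    return wrapper


@best_effort
def record_span_enqueued(count=1):
    """Hypothetical helper: add `count` to an enqueue counter."""
    global _enqueued_counter
    if _enqueued_counter is None:
        meter = _load_metrics_api().get_meter("agentex.tracing.sketch")
        _enqueued_counter = meter.create_counter("spans_enqueued_sketch")
    _enqueued_counter.add(count)
```

If the OTel package is missing or the provider is misconfigured, the import or instrument call raises inside the wrapper and the caller simply sees `None`.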

Confidence Score: 5/5

Safe to merge. All metrics calls are wrapped in best-effort try/except blocks and gated by the kill switch, so a misconfigured OTel provider cannot disrupt span export.

All three issues raised in the previous review have been correctly resolved. The failure-metric paths in _handle_failure are mutually exclusive — the retryable branch returns early, so the permanent-failure counter at the bottom is never double-fired. Recording helpers lazy-load OTel and swallow all exceptions, keeping the hot path safe. Test coverage is thorough across unit and integration layers.
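
The mutual exclusivity of the failure paths can be illustrated with a small sketch. Everything here is a stand-in under stated assumptions: `FailureRecorder`, `handle_failure`, the `(span_id, attempts)` item shape, and the reason strings are hypothetical and only mirror the early-return structure the review describes.

```python
from dataclasses import dataclass, field


@dataclass
class FailureRecorder:
    """Stand-in for the metrics helpers; collects (reason, span_count) pairs."""
    failures: list = field(default_factory=list)

    def record_export_failure(self, reason, span_count):
        self.failures.append((reason, span_count))


def handle_failure(items, retryable, max_retries, recorder, reenqueue):
    """Record each failed span at most once.

    items: list of (span_id, attempts) pairs (hypothetical shape).
    """
    if retryable:
        exhausted = [i for i in items if i[1] + 1 >= max_retries]
        retry = [(sid, n + 1) for sid, n in items if n + 1 < max_retries]
        if exhausted:
            recorder.record_export_failure("retries_exhausted", len(exhausted))
        reenqueue(retry)
        return  # early return: the permanent-failure counter below is skipped
    recorder.record_export_failure("permanent", len(items))
```

Because the retryable branch returns before reaching the last line, a span is counted either as exhausted or as permanently failed, never both.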

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/tracing_metrics.py | New module: defines all OTel instruments for span queue + export telemetry. Cardinality is well bounded; classify_export_error now maps out-of-range codes to the sentinel 'other', preventing new timeseries per unique integer. |
| src/agentex/lib/core/observability/tracing_metrics_recording.py | New recording helpers: lazy OTel load, kill-switch guard, best-effort try/except on every call. record_export_success now accepts an explicit processor parameter. Logic is clean and test-covered. |
| src/agentex/lib/core/tracing/span_queue.py | Metrics integration across enqueue, drain-batch coalesce (depth fix with + len(batch)), per-phase timing, and failure paths. Failure paths are mutually exclusive (retryable returns early), so no double-counting. |
| src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py | Adds record_export_success after each successful upsert_batch with explicit processor='sgp' label; called only after the await completes successfully so failures propagate normally. |
| tests/lib/core/tracing/test_span_queue.py | New TestAsyncSpanQueueMetrics class covers depth-inclusive batch coalescing, enqueued counter, dropped-on-shutdown counter, export failure on processor error, and kill-switch overhead check. |
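
To make the cardinality point concrete, here is a hedged sketch of clamping an HTTP status to a bounded label set. The function name `bounded_http_label` and the choice of 100 to 599 as the valid range are assumptions; the PR only states that out-of-range codes map to the sentinel 'other'.

```python
def bounded_http_label(code):
    """Map an HTTP status to a label drawn from a fixed, finite set.

    Hypothetical helper. Assumption: valid codes are the integers 100..599,
    which caps the label at 500 possible values plus "other".
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(code, int) and not isinstance(code, bool) and 100 <= code <= 599:
        return str(code)
    return "other"
```

Without the clamp, a buggy or hostile upstream returning arbitrary integers would create one new timeseries per distinct value.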

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[enqueue] -->|put_nowait| B[_SpanQueueItem\nenqueued_at = monotonic]
    A -->|QueueFull| C[record_span_dropped\nreason=queue_full]
    A -->|_stopping| D[record_span_dropped\nreason=shutdown]
    A --> E[record_span_enqueued]
    B --> F[_drain_loop\ncollect batch]
    F --> G[record_batch_coalesced\nqueue_depth = qsize + len_batch]
    G --> H[_process_items START]
    H --> I[await processor.on_spans_start]
    I -->|success| J[record_export_success]
    I -->|exception| K{_is_retryable_exc?}
    K -->|yes, exhausted| L[record_export_failure\nspan_count=exhausted]
    K -->|yes, retriable| M[_reenqueue items]
    K -->|no| N[record_export_failure\nspan_count=all items]
    G --> O[await barrier]
    O --> P[_process_items END]
    P --> Q[await processor.on_spans_end]
    Q -->|success| R[record_export_success]
    Q -->|exception| S{_is_retryable_exc?}
    S -->|yes, exhausted| T[record_export_failure]
    S -->|yes, retriable| U[_reenqueue items]
    S -->|no| V[record_export_failure]
    W[shutdown timeout] --> X[record_shutdown_timeout]

Reviews (11): Last reviewed commit: "Standardize on len(spans) in tracing pro..."
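
The `queue_depth = qsize + len_batch` node in the flowchart reflects that `qsize()` is read after the batch has already been removed from the queue. A minimal sketch, assuming a plain `asyncio.Queue` and a hypothetical `drain_batch` helper:

```python
import asyncio


def drain_batch(queue, max_batch):
    """Pull up to max_batch items without waiting.

    Returns (batch, pre_drain_depth). Hypothetical helper for illustration.
    """
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            break
    # qsize() no longer counts the items just removed, so add them back
    # to report the depth as it was before this drain.
    depth = queue.qsize() + len(batch)
    return batch, depth
```

Reporting `qsize()` alone would under-report by exactly the batch size, and would read as zero whenever a single drain empties the queue.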

Two compounding causes of slow SGP trace export under load:

- The async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. Adds a 100ms linger (tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call.
- httpx keepalive was disabled (max_keepalive_connections=0) in SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule to avoid "bound to a different event loop" errors in sync-ACP. Each span paid a full TLS handshake. Replaced with a per-event-loop client cache keyed on id(asyncio.get_running_loop()); connections are reused within a loop and cross-loop safety is preserved.

Tests cover linger coalescing, batch-size cap interaction, per-loop client caching, a keepalive-enabled regression guard, and disabled-processor null-client behavior.
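
The linger idea can be sketched as follows. This is not the SDK's drain loop: `collect_batch` and its parameters are hypothetical, and the sketch only shows the general technique of waiting for a first item and then collecting until a deadline or a size cap.

```python
import asyncio


async def collect_batch(queue, max_batch, linger_s):
    """Wait for one item, then keep collecting until the linger window
    closes or the batch reaches max_batch. Hypothetical helper."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + linger_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # linger window elapsed with nothing new
    return batch
```

The deadline is fixed when the first item arrives, so a steady trickle of spans cannot extend the wait beyond one linger window.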

Addresses Greptile review feedback on PR #362. The original `dict[int, AsyncSGPClient]` cache used `id(asyncio.get_running_loop())` as the key. In CPython `id()` returns a memory address, and once a loop is garbage-collected its address can be assigned to a new loop — a fresh loop hashing to a stale entry would receive a client whose httpx.AsyncClient was bound to the dead loop, reintroducing the "bound to a different event loop" error this PR was built to prevent. Switching the cache to `weakref.WeakKeyDictionary` keyed on the loop object itself fixes the bug: the entry is evicted automatically when the loop is collected, so id() recycling can't cause stale-client reuse. Multi-loop caching benefit is preserved (better than the single-slot pattern in TracingModule for agents that bounce between loops). Same fix applied to AgentexAsyncTracingProcessor. Added a regression test verifying the cache evicts a closed/dropped loop's entry after gc.collect().

Addresses both Greptile P3 findings on PR #362:

- AgentexAsyncTracingProcessor implemented the same per-loop client cache pattern as SGPAsyncTracingProcessor but had no dedicated test file. Added test_agentex_tracing_processor.py mirroring the SGP coverage: single-build-per-loop, keepalive-enabled regression guard, and WeakKeyDictionary eviction after GC. Skipped cleanly with pytest.importorskip when pydantic_ai isn't installed (the SDK dev venv state), since agentex_tracing_processor pulls in agentex.lib.adk which requires it.
- test_linger_respects_batch_size_cap used linger_ms=500, forcing the tail singleton batch to wait out the full 500ms timeout. The test only asserts no batch exceeds the cap, so dropping to linger_ms=50 keeps correctness while cutting wall time by ~10x.

Emit queue depth, batch lag, drain timing, and export success/failure counters for async span processing. Failures include bounded HTTP status labels; disable SDK recording with AGENTEX_TRACING_METRICS=0.

Co-authored-by: Cursor <cursoragent@cursor.com>

Include batch size in queue_depth sampling, clamp out-of-range HTTP codes to a bounded label, and parameterize export success processor.

smoreinis

left a few minor comments inline, but nothing to block getting this in - I did merge the underlying PR into next so you will need a rebase but hopefully it should be relatively painless. will be great to have more observability on the spans behavior!!

Span export was strictly serial: the drain loop awaited each upsert_batch to completion before sending the next, so per-pod egress was capped at ~1/request-latency (~150ms server-side ⇒ ~6-7 PUT/s/pod) regardless of CPU or backend headroom. Under load the queue backlogged and only drained after the run. The drain now dispatches each batch as its own task, bounded by AGENTEX_SPAN_QUEUE_CONCURRENCY (default 8), so multiple upsert_batch requests are in flight at once and a pod can keep up with span production. The per-span START-before-END invariant is preserved: END send-tasks snapshot the in-flight START tasks and await them before issuing. Since a span's START is always enqueued (and thus dispatched) before its END, that span's START send is either still in flight (waited on) or already done. Independent spans export fully concurrently. Setting concurrency=1 restores the old serial behavior.
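
The START-before-END barrier can be sketched with plain asyncio primitives. This is a simplified illustration, not the drain loop itself: `dispatch`, the `(phase, payload)` batch shape, and the `send` callable are hypothetical, and the sketch omits retries and the in-flight task bound.

```python
import asyncio


async def dispatch(batches, send, concurrency=8):
    """Send batches concurrently; each END batch first waits for the START
    sends that were in flight when it was dispatched.

    batches: iterable of (phase, payload), phase is "start" or "end".
    send: async callable taking (phase, payload).
    """
    sem = asyncio.Semaphore(concurrency)
    start_tasks = set()
    all_tasks = []

    async def run(phase, payload, barrier):
        if barrier:
            # Wait outside the semaphore so a blocked END holds no permit.
            await asyncio.gather(*barrier, return_exceptions=True)
        async with sem:
            await send(phase, payload)

    for phase, payload in batches:
        # Snapshot the START tasks that are still pending at dispatch time.
        barrier = [t for t in start_tasks if not t.done()] if phase == "end" else []
        task = asyncio.create_task(run(phase, payload, barrier))
        if phase == "start":
            start_tasks.add(task)
            task.add_done_callback(start_tasks.discard)
        all_tasks.append(task)
    await asyncio.gather(*all_tasks)
```

Waiting on the barrier before acquiring the semaphore is a deliberate choice in this sketch: with `concurrency=1`, an END that held the only permit while waiting for a START would deadlock.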

- Re-check backpressure before dispatching the END task so a batch carrying both event types can't push _inflight past the concurrency cap (the semaphore was already the hard limit; this tightens the in-flight task bound to match).
- Document the retry-ordering caveat directly in _reenqueue: a re-enqueued START goes to the back of the queue and may miss a concurrently-dispatched END's barrier snapshot when retries are enabled (benign at the default max_retries=1).

Bring in span queue reliability (bounded queue, retries, drop counting), ACP protocol package move (#371), openai-agents bump (#375), and related next changes without mixing in span-queue telemetry integration.

Co-authored-by: Cursor <cursoragent@cursor.com>

Wire OTel metrics into the post-merge span queue: enqueue/drop counters, batch coalescing depth/lag, drain phase timing, and export failure recording on permanent failure or exhausted retries. Preserves enqueued_at across re-enqueue for lag histograms.

Co-authored-by: Cursor <cursoragent@cursor.com>
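
Why preserving enqueued_at matters for the lag histogram can be shown with a small sketch. `QueueItem`, `reenqueue`, and `batch_lag_ms` are hypothetical stand-ins; the only detail taken from this PR is that the timestamp comes from a monotonic clock and survives re-enqueue.

```python
import time
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class QueueItem:
    """Hypothetical queue entry; enqueued_at is set once, on first enqueue."""
    span_id: str
    attempts: int = 0
    enqueued_at: float = field(default_factory=time.monotonic)


def reenqueue(item):
    """Copy the item for a retry, bumping attempts but keeping enqueued_at."""
    return replace(item, attempts=item.attempts + 1)


def batch_lag_ms(item, now=None):
    """Milliseconds between first enqueue and `now` (monotonic clock)."""
    now = time.monotonic() if now is None else now
    return (now - item.enqueued_at) * 1000.0
```

If a retry reset the timestamp, the histogram would measure only the wait since the last attempt and hide the total time a span spent in the queue.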

socket-security · 2026-05-29T16:32:07Z

No dependency changes detected in pull request.

stainless-app Bot and others added 8 commits May 18, 2026 22:40

feat(api): add schedule, checkpoints, and deployment endpoints

53b5c36

fix: resolve lint and test failures from new endpoints (#360)

bdf129c

feat: added Pydantic AI sync, async, temporal integration (#359)

781dfe1

fix(lint): resolve ruff import ordering in tracing metrics

7893cfd

james-cardenas marked this pull request as ready for review May 28, 2026 22:48

greptile-apps Bot reviewed May 28, 2026

Comment thread src/agentex/lib/core/tracing/span_queue.py Outdated

Comment thread src/agentex/lib/core/observability/tracing_metrics.py Outdated

Comment thread src/agentex/lib/core/observability/tracing_metrics_recording.py Outdated

fix(tracing): address Greptile review on queue depth and labels

2b2a25b

Include batch size in queue_depth sampling, clamp out-of-range HTTP codes to a bounded label, and parameterize export success processor.

james-cardenas assigned smoreinis May 28, 2026

Base automatically changed from stas/tracing-perf-linger-keepalive to next May 29, 2026 01:03

smoreinis reviewed May 29, 2026

Comment thread src/agentex/lib/core/observability/tracing_metrics_recording.py

smoreinis reviewed May 29, 2026

Comment thread src/agentex/lib/core/observability/tracing_metrics_recording.py Outdated

smoreinis reviewed May 29, 2026

Comment thread src/agentex/lib/core/observability/tracing_metrics_recording.py

smoreinis reviewed May 29, 2026

Comment thread src/agentex/lib/core/observability/tracing_metrics_recording.py Outdated

smoreinis approved these changes May 29, 2026

james-cardenas and others added 8 commits May 28, 2026 18:42

Fix redundant '_ms' from duration histogram instruments

da4f6cc

fix(tracing): drop outcome label and rename export span failure metric

8ecee7b

clean-up repeated inline import in tracing metrics recording module

87b5dfc

Update tracing_metrics_recording::record_export_success() method sign…

b4fff59

…ature to require processor arg

james-cardenas added 3 commits May 29, 2026 10:40

Merge remote-tracking branch 'origin/next' into SGPINF-1863-implement…

7f956b7

…-span-queue-telemetry

Merge remote-tracking branch 'origin/stas/tracing-concurrent-egress' …

f0a28d9

…into SGPINF-1863-span-queue-telemetry-stas-tracing-concurrent-egress

Merge remote-tracking branch 'origin/next' into SGPINF-1863-implement…

4efa25f

…-span-queue-telemetry

james-cardenas added 2 commits June 1, 2026 11:37

chore(deps): restore uv.lock from next

8a84798

Resolve greptile deadcode catch

1fa0323

smoreinis reviewed Jun 1, 2026

Comment thread src/agentex/lib/core/tracing/span_queue.py

smoreinis reviewed Jun 1, 2026

Comment thread src/agentex/lib/core/observability/tracing_metrics.py

smoreinis reviewed Jun 1, 2026

Comment thread src/agentex/lib/core/tracing/span_queue.py Outdated

smoreinis reviewed Jun 1, 2026

Comment thread src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py Outdated

james-cardenas added 2 commits June 1, 2026 14:45

fix(tracing): record span drop metrics in a single counter increment

f2757ae

Standardize on len(spans) in tracing processor logs

6322f3b

james-cardenas merged commit 6669012 into next Jun 1, 2026
44 checks passed

james-cardenas deleted the SGPINF-1863-implement-span-queue-telemetry branch June 1, 2026 22:01

stainless-app Bot mentioned this pull request Jun 1, 2026

release: 0.11.8 #386

Merged

james-cardenas merged 24 commits into next from SGPINF-1863-implement-span-queue-telemetry

4 participants
