feat(tracing): OTel span queue and export telemetry (SGPINF-1863)#373
Merged
Conversation
Two compounding causes of slow SGP trace export under load: - The async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. Adds a 100ms linger (tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call. - httpx keepalive was disabled (max_keepalive_connections=0) in SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule to avoid "bound to a different event loop" errors in sync-ACP. Each span paid a full TLS handshake. Replaced with a per-event-loop client cache keyed on id(asyncio.get_running_loop()); connections are reused within a loop and cross-loop safety is preserved. Tests cover linger coalescing, batch-size cap interaction, per-loop client caching, a keepalive-enabled regression guard, and disabled-processor null-client behavior.
Addresses Greptile review feedback on PR #362. The original `dict[int, AsyncSGPClient]` cache used `id(asyncio.get_running_loop())` as the key. In CPython `id()` returns a memory address, and once a loop is garbage-collected its address can be assigned to a new loop — a fresh loop hashing to a stale entry would receive a client whose httpx.AsyncClient was bound to the dead loop, reintroducing the "bound to a different event loop" error this PR was built to prevent. Switching the cache to `weakref.WeakKeyDictionary` keyed on the loop object itself fixes the bug: the entry is evicted automatically when the loop is collected, so id() recycling can't cause stale-client reuse. Multi-loop caching benefit is preserved (better than the single-slot pattern in TracingModule for agents that bounce between loops). Same fix applied to AgentexAsyncTracingProcessor. Added a regression test verifying the cache evicts a closed/dropped loop's entry after gc.collect().
Addresses both Greptile P3 findings on PR #362: - AgentexAsyncTracingProcessor implemented the same per-loop client cache pattern as SGPAsyncTracingProcessor but had no dedicated test file. Added test_agentex_tracing_processor.py mirroring the SGP coverage: single-build-per-loop, keepalive-enabled regression guard, and WeakKeyDictionary eviction after GC. Skipped cleanly with pytest.importorskip when pydantic_ai isn't installed (the SDK dev venv state), since agentex_tracing_processor pulls in agentex.lib.adk which requires it. - test_linger_respects_batch_size_cap used linger_ms=500, forcing the tail singleton batch to wait out the full 500ms timeout — the test only asserts no batch exceeds the cap, so dropping to linger_ms=50 keeps correctness while cutting wall time by ~10x.
Emit queue depth, batch lag, drain timing, and export success/failure counters for async span processing. Failures include bounded HTTP status labels; disable SDK recording with AGENTEX_TRACING_METRICS=0. Co-authored-by: Cursor <cursoragent@cursor.com>
Include batch size in queue_depth sampling, clamp out-of-range HTTP codes to a bounded label, and parameterize export success processor.
smoreinis
reviewed
May 29, 2026
smoreinis
reviewed
May 29, 2026
smoreinis
reviewed
May 29, 2026
smoreinis
reviewed
May 29, 2026
smoreinis
approved these changes
May 29, 2026
Contributor
smoreinis
left a comment
There was a problem hiding this comment.
left a few minor comments inline, but nothing to block getting this in - I did merge the underlying PR into next so you will need a rebase but hopefully it should be relatively painless. will be great to have more observability on the spans behavior!!
Span export was strictly serial: the drain loop awaited each upsert_batch to completion before sending the next, so per-pod egress was capped at ~1/request-latency (~150ms server-side ⇒ ~6-7 PUT/s/pod) regardless of CPU or backend headroom. Under load the queue backlogged and only drained after the run. The drain now dispatches each batch as its own task, bounded by AGENTEX_SPAN_QUEUE_CONCURRENCY (default 8), so multiple upsert_batch requests are in flight at once and a pod can keep up with span production. The per-span START-before-END invariant is preserved: END send-tasks snapshot the in-flight START tasks and await them before issuing. Since a span's START is always enqueued (and thus dispatched) before its END, that span's START send is either still in flight (waited on) or already done. Independent spans export fully concurrently. Setting concurrency=1 restores the old serial behavior.
…ature to require processor arg
- Re-check backpressure before dispatching the END task so a batch carrying both event types can't push _inflight past the concurrency cap (the semaphore was already the hard limit; this tightens the in-flight task bound to match). - Document the retry-ordering caveat directly in _reenqueue: a re-enqueued START goes to the back of the queue and may miss a concurrently-dispatched END's barrier snapshot when retries are enabled (benign at the default max_retries=1).
Wire OTel metrics into the post-merge span queue: enqueue/drop counters, batch coalescing depth/lag, drain phase timing, and export failure recording on permanent failure or exhausted retries. Preserves enqueued_at across re-enqueue for lag histograms. Co-authored-by: Cursor <cursoragent@cursor.com>
|
No dependency changes detected. Learn more about Socket for GitHub. 👍 No dependency changes detected in pull request |
…-span-queue-telemetry
…into SGPINF-1863-span-queue-telemetry-stas-tracing-concurrent-egress
…-span-queue-telemetry
smoreinis
reviewed
Jun 1, 2026
smoreinis
reviewed
Jun 1, 2026
smoreinis
reviewed
Jun 1, 2026
smoreinis
reviewed
Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SGPINF-1863 - Validate SDK Connection Fix (PR #362) to Unblock 2x Export Capacity
upsert_batchcompletion with bounded HTTP status labels.AGENTEX_TRACING_METRICS=0|false|no|offkill switch to disable SDK-side recording without code changes.Stacked on PR362 (
stas/tracing-perf-linger-keepalive) for load-test observability before M5 re-run.Test plan
uv run pytest src/agentex/lib/core/observability/tests/test_tracing_metrics.py src/agentex/lib/core/observability/tests/test_tracing_metrics_recording.py tests/lib/core/tracing/test_span_queue.py tests/lib/core/tracing/processors/test_sgp_tracing_processor.py -o addopts=AGENTEX_TRACING_METRICS=0disables recording with no export overhead regressionGreptile Summary
This PR adds OpenTelemetry metrics instrumentation to the async span queue and SGP exporter: queue depth, batch lag, drain duration, export success/failure counters, and shutdown flush timing. An
AGENTEX_TRACING_METRICSkill switch disables recording without code changes. All three issues flagged in the prior review (queue depth under-reporting, unboundedhttp_codelabel cardinality, and hard-coded"sgp"inrecord_export_success) have been addressed.tracing_metrics.py/tracing_metrics_recording.py: New modules define all OTel instruments and thin "best-effort" recording helpers that lazy-load the OTel SDK only on first use, keeping hot-path import overhead near zero.span_queue.py: Recording calls are woven into enqueue, drain-batch coalesce, per-phase dispatch timing, retry-exhausted failure, permanent failure, and shutdown-timeout paths; therecord_batch_coalescedcall correctly passesself._queue.qsize() + len(batch)to capture pre-drain depth.sgp_tracing_processor.py:record_export_successis called after each successfulupsert_batch, taking an explicitprocessorlabel rather than a hard-coded string.Confidence Score: 5/5
Safe to merge. All metrics calls are wrapped in best-effort try/except blocks and gated by the kill switch, so a misconfigured OTel provider cannot disrupt span export.
All three issues raised in the previous review have been correctly resolved. The failure-metric paths in _handle_failure are mutually exclusive — the retryable branch returns early, so the permanent-failure counter at the bottom is never double-fired. Recording helpers lazy-load OTel and swallow all exceptions, keeping the hot path safe. Test coverage is thorough across unit and integration layers.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[enqueue] -->|put_nowait| B[_SpanQueueItem\nenqueued_at = monotonic] A -->|QueueFull| C[record_span_dropped\nreason=queue_full] A -->|_stopping| D[record_span_dropped\nreason=shutdown] A --> E[record_span_enqueued] B --> F[_drain_loop\ncollect batch] F --> G[record_batch_coalesced\nqueue_depth = qsize + len_batch] G --> H[_process_items START] H --> I[await processor.on_spans_start] I -->|success| J[record_export_success] I -->|exception| K{_is_retryable_exc?} K -->|yes, exhausted| L[record_export_failure\nspan_count=exhausted] K -->|yes, retriable| M[_reenqueue items] K -->|no| N[record_export_failure\nspan_count=all items] G --> O[await barrier] O --> P[_process_items END] P --> Q[await processor.on_spans_end] Q -->|success| R[record_export_success] Q -->|exception| S{_is_retryable_exc?} S -->|yes, exhausted| T[record_export_failure] S -->|yes, retriable| U[_reenqueue items] S -->|no| V[record_export_failure] W[shutdown timeout] --> X[record_shutdown_timeout]Reviews (11): Last reviewed commit: "Standardize on len(spans) in tracing pro..." | Re-trigger Greptile