durable: extract trace context from checkpoints #818

Merged: joeyzhao2018 merged 17 commits into main from joey/durable-tracecontext-extraction on Jun 2, 2026
Conversation

Contributor

@joeyzhao2018 joeyzhao2018 commented May 6, 2026

Summary

https://datadoghq.atlassian.net/browse/APMSVLS-493

Adds Datadog trace-context extraction for AWS Lambda Durable Execution events.
The injection side of trace-context propagation (i.e. checkpoint creation) is implemented in dd-trace-py PR #17773.

Durable executions can resume with prior execution state instead of receiving the original trigger shape directly. This change teaches extract_dd_trace_context to recover Datadog propagation context from durable execution checkpoints, or from the original execution input payload when no checkpoint exists.

Changes

  • Detect durable execution events via DurableExecutionArn.
  • Normalize InitialExecutionState.Operations whether it is provided as a list or operation map.
  • Extract trace context from the latest _datadog_{N} checkpoint operation.
  • Fall back to Datadog headers in the original ExecutionDetails.InputPayload.
  • Preserve custom extractor precedence before durable-specific extraction.
  • Continue through the existing extraction chain when no complete durable context is found.
  • Add tests for:
    • latest checkpoint selection from an operation map
    • fallback extraction from JSON input payload headers
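
As a rough illustration of the first three bullets above (detection, normalization, latest-checkpoint selection), here is a minimal sketch. It is not the code from this PR: the `Name` and `Payload` operation fields, and both helper names, are assumptions made for the example; only `DurableExecutionArn`, `InitialExecutionState.Operations`, and the `_datadog_{N}` naming come from the description.

```python
import json
import re

# Matches the reserved checkpoint names described above, e.g. "_datadog_3".
_CHECKPOINT_NAME = re.compile(r"^_datadog_(\d+)$")


def normalize_operations(operations):
    """Return operations as a list, whether given as a list or an id -> operation map."""
    if isinstance(operations, dict):
        return list(operations.values())
    if isinstance(operations, list):
        return operations
    return []


def latest_checkpoint_headers(event):
    """Return the headers stored in the highest-numbered _datadog_{N} operation, or None."""
    # Only durable execution events carry DurableExecutionArn.
    if not isinstance(event, dict) or "DurableExecutionArn" not in event:
        return None
    state = event.get("InitialExecutionState") or {}
    best_n, best_payload = -1, None
    for op in normalize_operations(state.get("Operations")):
        if not isinstance(op, dict):
            continue
        # "Name" and "Payload" are hypothetical field names for this sketch.
        match = _CHECKPOINT_NAME.match(str(op.get("Name", "")))
        if match and int(match.group(1)) > best_n:
            best_n, best_payload = int(match.group(1)), op.get("Payload")
    if best_payload is None:
        return None
    try:
        headers = json.loads(best_payload) if isinstance(best_payload, str) else best_payload
    except ValueError:
        return None
    return headers if isinstance(headers, dict) else None
```

Comparing `N` numerically rather than as a string matters here: `_datadog_10` must win over `_datadog_2`. Returning `None` on any malformed input lets the caller continue through the existing extraction chain.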

Testing

  • python -m pytest -o addopts='' tests/test_tracing.py::TestExtractAndGetDDTraceContext::test_extracts_durable_trace_context_from_latest_checkpoint_operation_map tests/test_tracing.py::TestExtractAndGetDDTraceContext::test_extracts_durable_trace_context_from_input_payload_when_no_checkpoint
    • 2 passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20b00544e3

Comment thread datadog_lambda/tracing.py Outdated
@joeyzhao2018 joeyzhao2018 requested a review from mabdinur May 13, 2026 00:00
Comment thread datadog_lambda/tracing.py Outdated
Comment thread datadog_lambda/tracing.py Outdated
Comment thread datadog_lambda/tracing.py Outdated
@joeyzhao2018 joeyzhao2018 changed the title from "durable: extract trace context from checkpoints and input payload" to "durable: extract trace context from checkpoints" May 20, 2026
gh-worker-dd-mergequeue-cf854d Bot pushed a commit to DataDog/dd-trace-py that referenced this pull request May 31, 2026
…ume (#17773)

## Description
https://datadoghq.atlassian.net/browse/APMSVLS-493

This adds Datadog trace-context checkpointing for AWS Durable Execution workflows so traces can continue across Lambda invocations when a workflow suspends and later resumes.

When a durable handler raises `SuspendExecution`, the integration now appends an async synthetic `_datadog_{N}` STEP containing the current propagation headers. On replay, `datadog-lambda-python` can read the latest checkpoint and reactivate the trace context before the workflow continues; that part is in [datadog-lambda-python PR #818](DataDog/datadog-lambda-python#818).

The checkpoint writer also:
- anchors the saved parent ID to the first durable execute span, then reuses that parent ID for the durable execute spans of all following invocations (so all Lambda invocations are siblings)
- skips redundant checkpoint writes when only volatile per-span fields change
- allocates checkpoint names with a thread-safe per-state counter so concurrent/map/parallel saves do not collide
- writes checkpoints only on suspend, not when workflows complete or fail terminally 
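
For illustration only, the thread-safe per-state counter mentioned in the list above could look roughly like the sketch below. The `CheckpointNamer` class, its `allocate` method, and the choice to start numbering at 0 are assumptions for this example, not names or behavior taken from the dd-trace-py change.

```python
import threading


class CheckpointNamer:
    """Hands out unique _datadog_{N} names for one execution state (hypothetical helper)."""

    def __init__(self, start=0):
        # One lock and one counter per state object, so separate executions do not contend.
        self._lock = threading.Lock()
        self._next = start

    def allocate(self):
        """Return the next unused checkpoint name; safe to call from several threads."""
        with self._lock:
            n = self._next
            self._next += 1
        return f"_datadog_{n}"
```

Holding the lock across both the read and the increment is what keeps concurrent map or parallel branches from being handed the same `N`.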

## Testing

- Added targeted unit coverage in `tests/contrib/aws_durable_execution_sdk_python/test_trace_checkpoint.py` for header stabilization, parent-id anchoring/reuse, `traceparent` rewriting, replay diff suppression, checkpoint numbering, concurrent allocation, and no-op failure paths.
- Tested against a real Lambda durable function.
<img width="949" height="818" alt="Screenshot 2026-05-08 at 2 53 02 PM" src="https://github.com/user-attachments/assets/8872719e-730e-4681-a2b9-d8bb85d8977b" />
<img width="1138" height="562" alt="Screenshot 2026-05-08 at 2 55 26 PM" src="https://github.com/user-attachments/assets/b54d7e47-befe-4bac-9b8a-4719327d2b70" />


## Risks

Low to medium. This only writes Datadog checkpoint metadata on the `SuspendExecution` path, and failures in the checkpoint writer are swallowed so workflow execution should not be affected.

The main behavioral risks are:
- a user-defined step name colliding with the reserved `_datadog_*` prefix
- unexpected SDK changes around `state.operations`, `create_checkpoint`, or operation parent IDs


## Note
To avoid creating too many checkpoints, the new headers are diffed against the stored payload of the highest-N existing `_datadog_*` operation, and the write is skipped when nothing relevant has changed. Only `x-datadog-parent-id` and the `dd=p:` entry of `tracestate` are stripped before comparison, so changes to sampling priority, decision maker, origin, or propagation tags still trigger a fresh save.


Co-authored-by: pablomartinezbernardo <pablo.martinezbernardo@datadoghq.com>
Co-authored-by: joey.zhao <joey.zhao@datadoghq.com>
Comment thread datadog_lambda/durable.py
Comment on lines +63 to +68
# Checkpoint data is written by dd-trace-py in Datadog style
# (x-datadog-* headers). Extraction goes through the standard
# propagator.extract path, which honors DD_TRACE_PROPAGATION_STYLE_EXTRACT.
# The default extract list (datadog, tracecontext, baggage) already
# includes datadog. Customers who override the extract list MUST keep
# datadog in it.
Contributor

Should we document this somewhere? Do we have a workaround so compatibility doesn't depend on customer config?

Contributor Author

We will put this in the public documentation. The alternative was my previous commit, which introduced some unnecessary complexity.
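
To make the configuration constraint in this thread concrete, here is a small sketch of a check a customer could run. It assumes the value of `DD_TRACE_PROPAGATION_STYLE_EXTRACT` is a comma-separated list and uses the default list quoted in the code comment above; the function names are hypothetical, and the sketch ignores any other propagation-style settings the tracer may honor.

```python
import os

# Default extract list as stated in the code comment under review.
DEFAULT_EXTRACT_STYLES = ("datadog", "tracecontext", "baggage")


def extract_styles(environ=None):
    """Parse the configured extract styles, falling back to the default list."""
    environ = os.environ if environ is None else environ
    raw = environ.get("DD_TRACE_PROPAGATION_STYLE_EXTRACT", "")
    styles = tuple(s.strip().lower() for s in raw.split(",") if s.strip())
    return styles or DEFAULT_EXTRACT_STYLES


def durable_checkpoints_readable(environ=None):
    """True when the datadog style is enabled, which checkpoint extraction relies on."""
    return "datadog" in extract_styles(environ)
```

For example, an override of `tracecontext` alone would make this check return `False`, which is the situation the code comment warns about.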

@joeyzhao2018 joeyzhao2018 merged commit f040916 into main Jun 2, 2026
104 checks passed
@joeyzhao2018 joeyzhao2018 deleted the joey/durable-tracecontext-extraction branch June 2, 2026 14:12
2 participants