durable: extract trace context from checkpoints #818
Merged
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20b00544e3
…k.readthedocs.io/en/stable/guides/using_black_with_other_tools.html#flake8 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… takes higher priority than the durable checkpoints
gh-worker-dd-mergequeue-cf854d (Bot) pushed a commit to DataDog/dd-trace-py that referenced this pull request, May 31, 2026
…ume (#17773)

## Description

https://datadoghq.atlassian.net/browse/APMSVLS-493

This adds Datadog trace-context checkpointing for AWS Durable Execution workflows so traces can continue across Lambda invocations when a workflow suspends and later resumes.

When a durable handler raises `SuspendExecution`, the integration now appends an async synthetic `_datadog_{N}` STEP containing the current propagation headers. On replay, `datadog-lambda-python` can read the latest checkpoint and reactivate the trace context before the workflow continues. And that part is in the [datadog-lambda-python PR#818](DataDog/datadog-lambda-python#818)

The checkpoint writer also:
- anchors saved parent ID to the first durable execute span, then reuses that parent ID on all following invocations' durable execute spans (All lambda invocations are siblings)
- skips redundant checkpoint writes when only volatile per-span fields change
- allocates checkpoint names with a thread-safe per-state counter so concurrent/map/parallel saves do not collide
- writes checkpoints only on suspend, not when workflows complete or fail terminally

## Testing

- Added targeted unit coverage in `tests/contrib/aws_durable_execution_sdk_python/test_trace_checkpoint.py` for header stabilization, parent-id anchoring/reuse, `traceparent` rewriting, replay diff suppression, checkpoint numbering, concurrent allocation, and no-op failure paths.
- Tested using a real lambda durable function.
- <img width="949" height="818" alt="Screenshot 2026-05-08 at 2 53 02 PM" src="https://github.com/user-attachments/assets/8872719e-730e-4681-a2b9-d8bb85d8977b" /> <img width="1138" height="562" alt="Screenshot 2026-05-08 at 2 55 26 PM" src="https://github.com/user-attachments/assets/b54d7e47-befe-4bac-9b8a-4719327d2b70" />

## Risks

Low to medium. This only writes Datadog checkpoint metadata on the `SuspendExecution` path, and failures in the checkpoint writer are swallowed so workflow execution should not be affected.

The main behavioral risks are:
- a user-defined step name colliding with the reserved `_datadog_*` prefix
- unexpected SDK changes around `state.operations`, `create_checkpoint`, or operation parent IDs

# Note

To avoid creating too many checkpoints, we are excluding the ``dd=p:`` part when comparing the tracecontext. Diffing the new headers against the stored payload of the highest-N existing ``_datadog_*`` operation suppresses redundant writes; only ``x-datadog-parent-id`` and the ``dd=p:`` entry of ``tracestate`` are stripped before comparison so sampling priority, decision-maker, origin, and propagation tags still trigger a fresh save when they change.

Co-authored-by: pablomartinezbernardo <pablo.martinezbernardo@datadoghq.com>
Co-authored-by: joey.zhao <joey.zhao@datadoghq.com>
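The comparison described in the note can be illustrated with a minimal sketch. This is not the dd-trace-py implementation: the function names below are hypothetical, and only the two stripped fields (`x-datadog-parent-id` and the `p:` entry inside the `dd=` member of `tracestate`) come from the commit message. Any `traceparent` handling is deliberately left out.

```python
from typing import Dict, Optional

# Field named in the note as volatile per span; everything else is compared.
VOLATILE_HEADER = "x-datadog-parent-id"


def strip_dd_p(tracestate: str) -> str:
    """Remove the p:<span id> entry from the dd= member of a tracestate value."""
    members = []
    for member in tracestate.split(","):
        member = member.strip()
        if member.startswith("dd="):
            entries = [e for e in member[3:].split(";") if e and not e.startswith("p:")]
            if not entries:
                continue  # the dd= member held only p:, so drop it entirely
            member = "dd=" + ";".join(entries)
        if member:
            members.append(member)
    return ",".join(members)


def comparable(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with the volatile fields removed."""
    out = {k: v for k, v in headers.items() if k != VOLATILE_HEADER}
    if "tracestate" in out:
        out["tracestate"] = strip_dd_p(out["tracestate"])
    return out


def needs_new_checkpoint(new: Dict[str, str], stored: Optional[Dict[str, str]]) -> bool:
    """True when no checkpoint is stored or a non-volatile field has changed."""
    if stored is None:
        return True
    return comparable(new) != comparable(stored)
```

With this shape, a replay that only produces a new span ID yields no write, while a change in sampling priority still does.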
pablomartinezbernardo approved these changes, Jun 2, 2026
Comment on lines +63 to +68
# Checkpoint data is written by the dd-trace-py in Datadog style
# (x-datadog-* headers). Extraction goes through the standard
# propagator.extract path, which honors DD_TRACE_PROPAGATION_STYLE_EXTRACT.
# The default extract list (datadog, tracecontext, baggage) already
# includes datadog. Customers who override the extract list MUST keep
# datadog in it.
Contributor
Should we document this somewhere? Do we have a workaround so compatibility doesn't depend on customer config?
Contributor
Author
We will put this in the public documentation. The alternative was my previous commit, which introduced some unnecessary complexity.
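To make the configuration constraint concrete, here is a small sketch of a check a customer could run against their environment. The helper names are hypothetical, and the parsing is an assumption for illustration: the tracer's own configuration resolution may consider other settings and is not replicated here. Only the environment variable name and the default list (datadog, tracecontext, baggage) come from the code comment under review.

```python
from typing import List, Mapping

# Default extract list as stated in the reviewed code comment.
DEFAULT_EXTRACT_STYLES = ["datadog", "tracecontext", "baggage"]


def configured_extract_styles(env: Mapping[str, str]) -> List[str]:
    """Parse the extract-style override, falling back to the stated default."""
    raw = env.get("DD_TRACE_PROPAGATION_STYLE_EXTRACT", "")
    styles = [s.strip().lower() for s in raw.split(",") if s.strip()]
    return styles or list(DEFAULT_EXTRACT_STYLES)


def can_read_durable_checkpoints(env: Mapping[str, str]) -> bool:
    """Checkpoints are written in Datadog style, so 'datadog' must be extractable."""
    return "datadog" in configured_extract_styles(env)
```

Calling `can_read_durable_checkpoints(os.environ)` at startup would flag an override that drops `datadog` before a resumed workflow silently loses its trace context.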
Summary
https://datadoghq.atlassian.net/browse/APMSVLS-493
Adds Datadog trace-context extraction for AWS Lambda Durable Execution events.
The tracecontext propagation injection (i.e. checkpoint creation) part is done in dd-trace-py PR#17773
Durable executions can resume with prior execution state instead of receiving the original trigger shape directly. This change teaches `extract_dd_trace_context` to recover Datadog propagation context from durable execution checkpoints, or from the original execution input payload when no checkpoint exists.

Changes
- … `DurableExecutionArn`.
- … `InitialExecutionState.Operations` whether it is provided as a list or operation map.
- … `_datadog_{N}` checkpoint operation.
- … `ExecutionDetails.InputPayload`.

Testing
python -m pytest -o addopts='' tests/test_tracing.py::TestExtractAndGetDDTraceContext::test_extracts_durable_trace_context_from_latest_checkpoint_operation_map tests/test_tracing.py::TestExtractAndGetDDTraceContext::test_extracts_durable_trace_context_from_input_payload_when_no_checkpoint
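
The lookup order described in the summary (latest `_datadog_{N}` checkpoint first, original input payload as the fallback) might look roughly like the sketch below. It is not the code from this PR. The function names, the `Name` and `Payload` operation keys, and the JSON-encoded payloads are assumptions for illustration; only `DurableExecutionArn`, `InitialExecutionState.Operations`, `ExecutionDetails.InputPayload`, and the `_datadog_{N}` naming come from the description.

```python
import json
import re
from typing import Any, Dict, List, Optional

_CHECKPOINT_NAME = re.compile(r"^_datadog_(\d+)$")


def _operations(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Accept Operations as either a list or a map keyed by operation ID."""
    state = event.get("InitialExecutionState")
    ops = state.get("Operations") if isinstance(state, dict) else None
    if isinstance(ops, dict):
        ops = list(ops.values())
    if not isinstance(ops, list):
        return []
    return [op for op in ops if isinstance(op, dict)]


def _as_dict(payload: Any) -> Optional[Dict[str, Any]]:
    """Return payload as a dict, decoding JSON strings; None if not possible."""
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, str):
        try:
            parsed = json.loads(payload)
        except ValueError:
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def find_durable_carrier(event: Any) -> Optional[Dict[str, Any]]:
    """Pick the dict that propagation headers would be extracted from."""
    if not isinstance(event, dict) or "DurableExecutionArn" not in event:
        return None  # not a durable execution event
    ops = _operations(event)
    best_n, best = -1, None
    for op in ops:
        match = _CHECKPOINT_NAME.match(str(op.get("Name", "")))
        if match and int(match.group(1)) > best_n:  # numeric, so 10 beats 9
            carrier = _as_dict(op.get("Payload"))  # "Payload" key is hypothetical
            if carrier is not None:
                best_n, best = int(match.group(1)), carrier
    if best is not None:
        return best
    for op in ops:  # no checkpoint: fall back to the original input payload
        details = op.get("ExecutionDetails")
        if isinstance(details, dict):
            payload = _as_dict(details.get("InputPayload"))
            if payload is not None:
                return payload
    return None
```

Comparing the numeric suffix rather than the name string matters once a workflow has suspended more than ten times, since `_datadog_10` sorts before `_datadog_9` lexically.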