[Common] EP C API: version config structs and extend nvte_ep_prepare with total_recv_tokens_per_rank placeholder #3154

Merged
phu0ngng merged 9 commits into NVIDIA:main from phu0ngng:ep-c-api
Jun 30, 2026

Conversation

@phu0ngng

Collaborator

Description

  • Versions the EP config structs (NVTEEpGroupConfig, NVTEEpLayerConfig) with a leading struct_size field and passes them by pointer, so fields can be added without breaking ABI.
  • Adds a total_recv_tokens_per_rank placeholder output to nvte_ep_prepare for future use (accepted, may be null, ignored for now).
  • Renames the nvte_ep_prepare output token_counts to recv_tokens_per_expert for clarity.
  • Updates all call sites (JAX bindings, C++ distributed tests) and docs accordingly.
  • PENDING: Update PyT callers' side after PR [PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035 is merged.
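
The caller-side pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual Transformer Engine API: `DemoEpConfig`, `DEMO_EP_CONFIG_INIT`, `demo_initialize`, and `demo_prepare` are hypothetical names, and the fields are invented for the example. It only shows a leading `struct_size` field, a config passed by pointer, and a nullable placeholder output that is accepted but ignored.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a versioned config struct; field names are illustrative.
struct DemoEpConfig {
  std::size_t struct_size;  // leading field: the caller's sizeof(DemoEpConfig)
  int ep_size;
  int num_experts;
};

// Hypothetical initializer in the spirit of the NVTE_EP_*_CONFIG_INIT macros:
// fills struct_size and zeroes every other field.
#define DEMO_EP_CONFIG_INIT {sizeof(DemoEpConfig), 0, 0}

// The config is passed by pointer so the callee can inspect struct_size
// before it touches any other field.
int demo_initialize(const DemoEpConfig* cfg) {
  if (cfg == nullptr || cfg->struct_size < sizeof(DemoEpConfig)) return -1;
  return 0;
}

// total_recv_tokens_per_rank is a placeholder: accepted, may be null, ignored.
int demo_prepare(const DemoEpConfig* cfg, int* recv_tokens_per_expert,
                 int* total_recv_tokens_per_rank) {
  (void)total_recv_tokens_per_rank;  // reserved for future use
  if (cfg == nullptr || recv_tokens_per_expert == nullptr) return -1;
  assert(cfg->num_experts >= 0);
  // Dummy work: a real implementation would write per-expert receive counts.
  for (int e = 0; e < cfg->num_experts; ++e) recv_tokens_per_expert[e] = 0;
  return 0;
}
```

A caller would declare `DemoEpConfig cfg = DEMO_EP_CONFIG_INIT;`, set the fields it cares about, and pass `&cfg` together with `nullptr` for the placeholder output.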

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Contributor

Greptile Summary

This PR versions the EP C API config structs (NVTEEpGroupConfig, NVTEEpLayerConfig) by prepending a struct_size field and changing all API functions to accept them by pointer, enabling future ABI-compatible field additions. It also renames token_counts → recv_tokens_per_expert for clarity and adds a total_recv_tokens_per_rank null-accepted placeholder to nvte_ep_prepare.

  • normalize_ep_config() in ep_api.cpp handles the versioning contract cleanly: struct_size == 0 is treated as the base layout, values below min_size are rejected with a clear diagnostic, and a partial memcpy lets older callers omit unknown trailing fields which then default to zero.
  • kGroupConfigMinSize and kLayerConfigMinSize are marked "frozen" and cover all current fields, so new fields appended in future versions will transparently default to zero for old callers — the design is correct for the first versioned release.
  • The NVTE_EP_*_CONFIG_INIT macros and designated-initialiser updates at all call sites (JAX, PyTorch, C++ tests) are consistent; PyTorch's ep_prepare Python-parameter rename is intentionally deferred pending PR [PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035.
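
The normalization contract summarized above (zero means base layout, sizes below the minimum are rejected, a partial copy leaves newer fields zeroed) might look roughly like the sketch below. It is not the code from ep_api.cpp: `DemoLayerConfig`, `DemoLayerConfigV1`, `kDemoMinSize`, and `normalize_config` are hypothetical names. Rejecting a `struct_size` larger than the library's own struct is an assumption of this sketch, since the summary only mentions a range check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical layout as seen by an older caller (first versioned release).
struct DemoLayerConfigV1 {
  std::size_t struct_size;
  int num_local_experts;
  int top_k;
};

// Hypothetical layout as seen by the library after one field was appended.
struct DemoLayerConfig {
  std::size_t struct_size;
  int num_local_experts;
  int top_k;
  int future_option;  // appended later; 0 means "default"
};

// Frozen size of the base layout: everything before the first appended field.
constexpr std::size_t kDemoMinSize = offsetof(DemoLayerConfig, future_option);

template <typename T>
T normalize_config(const T* user, std::size_t min_size) {
  assert(min_size <= sizeof(T));
  if (user == nullptr) throw std::invalid_argument("config pointer is null");
  // Read only the leading field; the caller's struct may be shorter than T.
  std::size_t size = 0;
  std::memcpy(&size, user, sizeof(size));
  if (size == 0) size = min_size;  // 0 is treated as the base layout
  // Assumption: sizes above sizeof(T) are rejected as well as sizes below min_size.
  if (size < min_size || size > sizeof(T)) {
    throw std::invalid_argument("unsupported struct_size " + std::to_string(size));
  }
  T out{};                          // trailing fields default to zero
  std::memcpy(&out, user, size);    // copy only what the caller provided
  out.struct_size = sizeof(T);      // downstream code sees the full layout
  return out;
}
```

With this shape, an older caller that fills in `DemoLayerConfigV1` and reports `sizeof(DemoLayerConfigV1)` gets `future_option == 0` after normalization, which is the behaviour the summary describes for old callers.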

Confidence Score: 5/5

Safe to merge; the ABI versioning design is sound, all call sites have been updated, and the ignored placeholder is clearly annotated.

The normalize_ep_config() logic correctly handles all struct_size cases, the frozen min_size constants preserve backward compatibility for future field additions, and the rename/pointer-conversion is consistently applied across JAX, PyTorch, and C++ test call sites. The only finding is a style-level duplication in test code that carries no correctness risk.

No files require special attention. The PyTorch ep_prepare parameter name (token_counts) is intentionally left as-is pending a dependent PR.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/include/transformer_engine/ep.h | Adds struct_size versioning field to NVTEEpGroupConfig and NVTEEpLayerConfig; renames max_num_sms→num_comm_sms; adds NVTE_EP_*_CONFIG_INIT macros; extends nvte_ep_prepare with nullable total_recv_tokens_per_rank placeholder and switches config args to pointers. |
| transformer_engine/common/ep/ep_api.cpp | Introduces normalize_ep_config() template that handles struct_size versioning (0→min_size, range check, partial memcpy); rewires all public entry points to use pointer-typed config args; stubs correctly updated in the !NVTE_WITH_NCCL_EP branch. |
| transformer_engine/common/ep/ep_backend.cpp | Renames max_num_sms→num_comm_sms in validate_config and init(); adds total_recv_tokens_per_rank as an explicitly ignored parameter to prepare() with a clear "reserved placeholder" comment; max_token_dtype range check retained. |
| transformer_engine/common/ep/ep_backend.h | Updates prepare() signature to accept total_recv_tokens_per_rank alongside the renamed recv_tokens_per_expert; no other changes. |
| transformer_engine/jax/csrc/extensions/ep.cpp | Updates all config construction to use designated-initialiser syntax with struct_size, switches nvte_ep_* calls to pointer args, renames token_counts→recv_tokens_per_expert locally; passes nullptr for total_recv_tokens_per_rank. |
| transformer_engine/pytorch/csrc/extensions/ep.cpp | Migrates config construction to designated initialisers with struct_size, switches nvte_ep_* calls to pointer args; ep_prepare() Python-facing parameter name intentionally kept as token_counts pending PR #3035. |
| tests/cpp_distributed/test_ep.cu | Renames token_counts→recv_tokens_per_expert throughout; adds NVTEEpLayerConfig layer_cfg_ to both EPBuffers and EPTensors and initializes it with NVTE_EP_LAYER_CONFIG_INIT; all nvte_ep_prepare call sites updated to new pointer-based signature. |
| tests/cpp_distributed/test_ep_common.h | Replaces NVTEEpGroupConfig{} zero-init with NVTE_EP_GROUP_CONFIG_INIT in ep_bootstrap() and ep_reinitialize(); updates nvte_ep_initialize() calls to pass by pointer. |

Reviews (3): Last reviewed commit: "Merge branch 'main' into ep-c-api"

Comment thread transformer_engine/common/ep/ep_backend.cpp
Comment thread transformer_engine/common/ep/ep_api.cpp

@jberchtold-nvidia jberchtold-nvidia left a comment

Collaborator


LGTM pending CI. I think the struct_size field is sufficient for versioning for now. We have a lot of structs that have needed versioning, so in the future it would be great for someone to align the C API around shared versioning functionality, such as a VERSIONED_STRUCT macro. But that is out of scope for this PR.

@phu0ngng phu0ngng added the 2.17 label Jun 29, 2026
phu0ngng added 6 commits June 29, 2026 09:45
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
… test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Collaborator Author

/te-ci L1

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng phu0ngng merged commit 3df5e19 into NVIDIA:main Jun 30, 2026
44 of 54 checks passed
@phu0ngng phu0ngng deleted the ep-c-api branch June 30, 2026 17:45
KshitijLakhani pushed a commit that referenced this pull request Jul 1, 2026
…` with `total_recv_tokens_per_rank` placeholder (#3154)

* versioning EP C configs

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Rename EP prepare token_counts to recv_tokens_per_expert

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Add total_recv_tokens_per_rank placeholder to nvte_ep_prepare

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Adapt PyTorch EP binding to versioned nvte_ep C config API

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Rename EP group config max_num_sms to num_comm_sms

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>