Skip to content

Do not retry on certification errors#211

Open
ashwgit wants to merge 1 commit into
release-engineering:mainfrom
ashwinifork:fail-on-cert-error
Open

Do not retry on certification errors#211
ashwgit wants to merge 1 commit into
release-engineering:mainfrom
ashwinifork:fail-on-cert-error

Conversation

@ashwgit

@ashwgit ashwgit commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

@lslebodn @JAVGan PTAL. This one prevents cloudpub from retrying when there's a certification error.

Summary by Sourcery

Handle Azure certification failures as non-retriable errors and adjust error modeling accordingly.

Bug Fixes:

  • Prevent Azure job polling and publish operations from retrying when failures are caused by certification errors.

Enhancements:

  • Model Azure configure job errors as structured dictionaries and introduce helpers to detect certification-related failures.
  • Introduce a dedicated CertificationError type and integrate it into Azure job status and publish flows to surface clearer failure reasons.

Tests:

  • Extend Azure service and utils tests with fixtures and cases covering certification error detection, non-retry behavior, and updated error formats.

@sourcery-ai

sourcery-ai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Reviewer's Guide

Introduce explicit handling of Azure Marketplace certification failures by detecting structured certification errors in job responses, preventing retries for these permanent failures, and updating models and tests to work with structured error objects instead of plain strings.

Sequence diagram for Azure publish flow without retries on certification errors

sequenceDiagram
    actor Caller
    participant RetryWrapper
    participant AzureService as MsAzureService
    participant AzureAPI as AzureMarketplace

    Caller->>RetryWrapper: _publish_live(product, product_name)
    RetryWrapper->>AzureService: _publish_live(product, product_name)
    AzureService->>AzureAPI: submit_to_status(product_id, status='live')
    AzureAPI-->>AzureService: ConfigureStatus(job_result='failed', errors)
    AzureService->>AzureService: is_certification_error(errors)
    alt [errors indicate certification failure]
        AzureService->>RetryWrapper: raise CertificationError
        RetryWrapper-->>Caller: propagate CertificationError (no retry)
    else [non certification failure]
        AzureService->>RetryWrapper: raise RuntimeError
        RetryWrapper->>AzureService: retry _publish_live(product, product_name)
    end
Loading

File-Level Changes

Change Details Files
Detect Azure certification errors from structured job error payloads and expose a helper API for callers.
  • Add a recursive helper _contains_certification_error to inspect error dicts and nested details for invalidState certification messages.
  • Add is_certification_error(errors) utility that returns True when any error indicates a certification failure.
  • Add unit test coverage for is_certification_error with certification and non-certification error inputs.
cloudpub/ms_azure/utils.py
tests/ms_azure/test_utils.py
Prevent retries and raise a dedicated CertificationError when Azure jobs fail due to certification issues during status polling and publish operations.
  • Extend query_job_status to raise CertificationError instead of InvalidStateError when job_details.errors indicate certification failure.
  • Update _publish_preview and _publish_live to stop joining errors as strings, use the structured errors list in failure messages, and raise CertificationError on certification failures.
  • Configure tenacity retry decorators on _publish_preview and _publish_live to not retry when CertificationError is raised.
  • Adjust publish preview/live failure tests to work with structured error objects and to assert behavior with and without certification errors, including no-retry semantics and expected exception types.
cloudpub/ms_azure/service.py
tests/ms_azure/test_service.py
Represent Azure configure job errors as structured objects and introduce a dedicated CertificationError exception type.
  • Change ConfigureStatus.errors from List[str] to List[Dict[str, Any]] to align with Azure error payloads and support nested error details.
  • Introduce CertificationError as a subclass of InvalidStateError to model permanent certification failures.
  • Add fixtures for certification-style error payloads used across tests.
cloudpub/models/ms_azure.py
cloudpub/error.py
tests/ms_azure/conftest.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • There is a leftover print("errors : ", res.errors) in _publish_preview; consider removing or replacing it with proper logging to avoid noisy stdout in production.
  • Now that ConfigureStatus.errors is a List[Dict[str, Any]], error messages in _publish_preview and _publish_live rely on the raw list str() representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
  • In _publish_preview you use self._raise_error(CertificationError, ...) while _publish_live directly raises CertificationError; consider using a consistent error-raising path in both methods for easier handling and logging.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- There is a leftover `print("errors : ", res.errors)` in `_publish_preview`; consider removing or replacing it with proper logging to avoid noisy stdout in production.
- Now that `ConfigureStatus.errors` is a `List[Dict[str, Any]]`, error messages in `_publish_preview` and `_publish_live` rely on the raw list `str()` representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
- In `_publish_preview` you use `self._raise_error(CertificationError, ...)` while `_publish_live` directly raises `CertificationError`; consider using a consistent error-raising path in both methods for easier handling and logging.

## Individual Comments

### Comment 1
<location path="cloudpub/ms_azure/service.py" line_range="201-202" />
<code_context>
         """
         job_details = self._query_job_details(job_id=job_id)
         if job_details.job_result == "failed":
             error_message = f"Job {job_id} failed: \n{job_details.errors}"
+            if is_certification_error(job_details.errors):
+                self._raise_error(CertificationError, error_message)
             self._raise_error(InvalidStateError, error_message)
</code_context>
<issue_to_address>
**issue:** Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.

`job_details.errors` is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the `message` fields) before formatting, and align this behavior with any other call sites that include `errors` in messages.
</issue_to_address>

### Comment 2
<location path="cloudpub/ms_azure/service.py" line_range="729-735" />
<code_context>
             product.id, state="preview"
         ):
-            errors = "\n".join(res.errors)
+            errors = res.errors
+            print("errors : ", res.errors)
             failure_msg = (
                 f"Failed to submit the product {product_name} ({product.id}) to preview. "
                 f"Status: {res.job_result} Errors: {errors}"
             )
+            if is_certification_error(res.errors):
+                self._raise_error(CertificationError, failure_msg)
             raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**issue:** The plain `print` call and raw `errors` formatting are inconsistent with the surrounding logging behavior.

This block now uses `print("errors : ", res.errors)` and includes the raw `errors` list (list of dicts) directly in the message. Please avoid `print` in this context and use the existing logging approach (e.g., `log.*`) or rely on `failure_msg`. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
</issue_to_address>

### Comment 3
<location path="cloudpub/ms_azure/service.py" line_range="735-744" />
<code_context>
                 f"Failed to submit the product {product_name} ({product.id}) to preview. "
                 f"Status: {res.job_result} Errors: {errors}"
             )
+            if is_certification_error(res.errors):
+                self._raise_error(CertificationError, failure_msg)
             raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**suggestion:** Use of `self._raise_error` for `CertificationError` would align behavior with other failure paths.

`query_job_status` uses `self._raise_error(CertificationError, error_message)`, but here `_publish_live` calls `CertificationError` via `_raise_error` as well. To keep error handling consistent and ensure any shared logic in `_raise_error` (logging, metrics, error mapping) always applies, use `self._raise_error(CertificationError, failure_msg)` instead of raising `CertificationError` directly.

Suggested implementation:

```python
            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)

```

```python
    def _publish_live(self, product: Product, product_name: str) -> None:
        res = self.submit_to_status(product_id=product.id, status='live')

        # Example structure of the failure handling logic that should
        # now be using `self._raise_error` for CertificationError:
        if res.job_result != 'succeeded' or not self.get_submission_state(
            product.id, state="live"
        ):
            errors = res.errors
            failure_msg = (
                f"Failed to submit the product {product_name} ({product.id}) to live. "
                f"Status: {res.job_result} Errors: {errors}"
            )
            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)

```

I only see the beginning of `_publish_live` in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:

1. Locate the existing error-handling code inside `_publish_live` (likely similar to the preview path) that currently does something like:
   ```python
   if is_certification_error(res.errors):
       raise CertificationError(failure_msg)
   ```
2. Replace that `raise CertificationError(...)` with:
   ```python
   self._raise_error(CertificationError, failure_msg)
   ```
3. Ensure the `failure_msg` variable name and the condition (`res.job_result`, `get_submission_state`, etc.) match your actual implementation; the block I provided is a template to show how to integrate `_raise_error`, not a strict structural change to your business logic.

The key behavioral change is: all `CertificationError` raises in `_publish_live` should now go through `self._raise_error`.
</issue_to_address>

### Comment 4
<location path="tests/ms_azure/test_service.py" line_range="989-994" />
<code_context>
         azure_service._publish_preview.retry.sleep = mock.Mock()  # type: ignore
         expected_err = (
             f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
-            "Status: failed Errors: failure1\nfailure2"
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard

In `test_publish_live_fail_on_retry`, the regex `Status: failed Errors: *` is too permissive and will match almost any error content. Since `ConfigureStatus.errors` is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using `re.escape(str(err_resp.errors))`, as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread cloudpub/ms_azure/service.py
Comment thread cloudpub/ms_azure/service.py Outdated
Comment on lines +729 to +735
errors = res.errors
print("errors : ", res.errors)
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: The plain print call and raw errors formatting are inconsistent with the surrounding logging behavior.

This block now uses print("errors : ", res.errors) and includes the raw errors list (list of dicts) directly in the message. Please avoid print in this context and use the existing logging approach (e.g., log.*) or rely on failure_msg. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.

Comment thread cloudpub/ms_azure/service.py
Comment on lines 989 to 994
expected_err = (
f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
"Status: failed Errors: failure1\nfailure2"
f"Status: failed Errors: {re.escape(str(err_resp.errors))}"
)

# Test

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard

In test_publish_live_fail_on_retry, the regex Status: failed Errors: * is too permissive and will match almost any error content. Since ConfigureStatus.errors is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using re.escape(str(err_resp.errors)), as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.

@JAVGan JAVGan left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes please, but they will make the code future proof

Comment thread cloudpub/ms_azure/service.py Outdated
product.id, state="preview"
):
errors = "\n".join(res.errors)
errors = res.errors

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop this variable errors as you're not using it anywhere. I see you're directly passing res.errors to the lower functions. Or replace res.errors with errors but I do prefer to have less variables.

f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwgit Since we're having this if two times, here and in L202 what about encapsulating it in some private function within this own module instead? Then we will reduce the nested code:

def __check_raise_certification_errors(errors):
    if is_certification_error(errors):
        self._raise_error(CertificationError, error_message)

[...]
# On line 202
    self.__check_raise_certification_errors(job_details.errors)
[...]
# On this line
    self.__check_raise_certification_errors(res.errors)

Comment thread cloudpub/ms_azure/service.py Outdated
Comment on lines +765 to +766
if is_certification_error(res.errors):
raise CertificationError(failure_msg)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use here what Sourcery-AI pointed out:

Suggested change
if is_certification_error(res.errors):
raise CertificationError(failure_msg)
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)

Or even better, call the private function I mentioned in the other thread:

Suggested change
if is_certification_error(res.errors):
raise CertificationError(failure_msg)
self.__check_raise_certification_errors(res.errors)

Comment on lines +549 to +550
def _contains_certification_error(item: Any) -> bool:
code: str = item.get("code", "")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Item here points to Any, can you please change it to Dict[str, Any]

OR

Verify inside whether it's a dictionary, like:

if not isinstance(item, dict):
    raise [...]

message: str = item.get("message", "")
if code == "invalidState" and "certification" in message.lower():
return True
for detail in item.get("details") or []:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please validate somehow that item["details"] is a list.

I know we should receive a list from MS but if they change somehow their format the code shouldn't break.

Comment on lines +641 to +650
"details": [
{
"code": "invalidState",
"message": "Certification",
"details": [
{
"code": "invalidState",
"message": "Issues found during Certification.",
}
],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Please also test some invalid error messages, like details not being present or present as a dict or something unexpected.

Make sure we can parse anything and just retrieve the certification error when we're sure the format is this one

@ashwgit ashwgit force-pushed the fail-on-cert-error branch from 70d18be to ab75c04 Compare June 26, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants