Do not retry on certification errors#211
Conversation
Reviewer's GuideIntroduce explicit handling of Azure Marketplace certification failures by detecting structured certification errors in job responses, preventing retries for these permanent failures, and updating models and tests to work with structured error objects instead of plain strings. Sequence diagram for Azure publish flow without retries on certification errorssequenceDiagram
actor Caller
participant RetryWrapper
participant AzureService as MsAzureService
participant AzureAPI as AzureMarketplace
Caller->>RetryWrapper: _publish_live(product, product_name)
RetryWrapper->>AzureService: _publish_live(product, product_name)
AzureService->>AzureAPI: submit_to_status(product_id, status='live')
AzureAPI-->>AzureService: ConfigureStatus(job_result='failed', errors)
AzureService->>AzureService: is_certification_error(errors)
alt [errors indicate certification failure]
AzureService->>RetryWrapper: raise CertificationError
RetryWrapper-->>Caller: propagate CertificationError (no retry)
else [non certification failure]
AzureService->>RetryWrapper: raise RuntimeError
RetryWrapper->>AzureService: retry _publish_live(product, product_name)
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- There is a leftover
print("errors : ", res.errors)in_publish_preview; consider removing or replacing it with proper logging to avoid noisy stdout in production. - Now that
ConfigureStatus.errorsis aList[Dict[str, Any]], error messages in_publish_previewand_publish_liverely on the raw liststr()representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure. - In
_publish_previewyou useself._raise_error(CertificationError, ...)while_publish_livedirectly raisesCertificationError; consider using a consistent error-raising path in both methods for easier handling and logging.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There is a leftover `print("errors : ", res.errors)` in `_publish_preview`; consider removing or replacing it with proper logging to avoid noisy stdout in production.
- Now that `ConfigureStatus.errors` is a `List[Dict[str, Any]]`, error messages in `_publish_preview` and `_publish_live` rely on the raw list `str()` representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
- In `_publish_preview` you use `self._raise_error(CertificationError, ...)` while `_publish_live` directly raises `CertificationError`; consider using a consistent error-raising path in both methods for easier handling and logging.
## Individual Comments
### Comment 1
<location path="cloudpub/ms_azure/service.py" line_range="201-202" />
<code_context>
"""
job_details = self._query_job_details(job_id=job_id)
if job_details.job_result == "failed":
error_message = f"Job {job_id} failed: \n{job_details.errors}"
+ if is_certification_error(job_details.errors):
+ self._raise_error(CertificationError, error_message)
self._raise_error(InvalidStateError, error_message)
</code_context>
<issue_to_address>
**issue:** Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.
`job_details.errors` is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the `message` fields) before formatting, and align this behavior with any other call sites that include `errors` in messages.
</issue_to_address>
### Comment 2
<location path="cloudpub/ms_azure/service.py" line_range="729-735" />
<code_context>
product.id, state="preview"
):
- errors = "\n".join(res.errors)
+ errors = res.errors
+ print("errors : ", res.errors)
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
+ if is_certification_error(res.errors):
+ self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**issue:** The plain `print` call and raw `errors` formatting are inconsistent with the surrounding logging behavior.
This block now uses `print("errors : ", res.errors)` and includes the raw `errors` list (list of dicts) directly in the message. Please avoid `print` in this context and use the existing logging approach (e.g., `log.*`) or rely on `failure_msg`. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
</issue_to_address>
### Comment 3
<location path="cloudpub/ms_azure/service.py" line_range="735-744" />
<code_context>
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
+ if is_certification_error(res.errors):
+ self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**suggestion:** Use of `self._raise_error` for `CertificationError` would align behavior with other failure paths.
`query_job_status` uses `self._raise_error(CertificationError, error_message)`, but here `_publish_live` calls `CertificationError` via `_raise_error` as well. To keep error handling consistent and ensure any shared logic in `_raise_error` (logging, metrics, error mapping) always applies, use `self._raise_error(CertificationError, failure_msg)` instead of raising `CertificationError` directly.
Suggested implementation:
```python
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
```
```python
def _publish_live(self, product: Product, product_name: str) -> None:
res = self.submit_to_status(product_id=product.id, status='live')
# Example structure of the failure handling logic that should
# now be using `self._raise_error` for CertificationError:
if res.job_result != 'succeeded' or not self.get_submission_state(
product.id, state="live"
):
errors = res.errors
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to live. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
```
I only see the beginning of `_publish_live` in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:
1. Locate the existing error-handling code inside `_publish_live` (likely similar to the preview path) that currently does something like:
```python
if is_certification_error(res.errors):
raise CertificationError(failure_msg)
```
2. Replace that `raise CertificationError(...)` with:
```python
self._raise_error(CertificationError, failure_msg)
```
3. Ensure the `failure_msg` variable name and the condition (`res.job_result`, `get_submission_state`, etc.) match your actual implementation; the block I provided is a template to show how to integrate `_raise_error`, not a strict structural change to your business logic.
The key behavioral change is: all `CertificationError` raises in `_publish_live` should now go through `self._raise_error`.
</issue_to_address>
### Comment 4
<location path="tests/ms_azure/test_service.py" line_range="989-994" />
<code_context>
azure_service._publish_preview.retry.sleep = mock.Mock() # type: ignore
expected_err = (
f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
- "Status: failed Errors: failure1\nfailure2"
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard
In `test_publish_live_fail_on_retry`, the regex `Status: failed Errors: *` is too permissive and will match almost any error content. Since `ConfigureStatus.errors` is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using `re.escape(str(err_resp.errors))`, as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| errors = res.errors | ||
| print("errors : ", res.errors) | ||
| failure_msg = ( | ||
| f"Failed to submit the product {product_name} ({product.id}) to preview. " | ||
| f"Status: {res.job_result} Errors: {errors}" | ||
| ) | ||
| if is_certification_error(res.errors): |
There was a problem hiding this comment.
issue: The plain print call and raw errors formatting are inconsistent with the surrounding logging behavior.
This block now uses print("errors : ", res.errors) and includes the raw errors list (list of dicts) directly in the message. Please avoid print in this context and use the existing logging approach (e.g., log.*) or rely on failure_msg. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
| expected_err = ( | ||
| f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. " | ||
| "Status: failed Errors: failure1\nfailure2" | ||
| f"Status: failed Errors: {re.escape(str(err_resp.errors))}" | ||
| ) | ||
|
|
||
| # Test |
There was a problem hiding this comment.
suggestion (testing): Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard
In test_publish_live_fail_on_retry, the regex Status: failed Errors: * is too permissive and will match almost any error content. Since ConfigureStatus.errors is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using re.escape(str(err_resp.errors)), as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
JAVGan
left a comment
There was a problem hiding this comment.
Some minor changes please, but they will make the code future proof
| product.id, state="preview" | ||
| ): | ||
| errors = "\n".join(res.errors) | ||
| errors = res.errors |
There was a problem hiding this comment.
You can drop this variable errors as you're not using it anywhere. I see you're directly passing res.errors to the lower functions. Or replace res.errors with errors but I do prefer to have less variables.
| f"Failed to submit the product {product_name} ({product.id}) to preview. " | ||
| f"Status: {res.job_result} Errors: {errors}" | ||
| ) | ||
| if is_certification_error(res.errors): |
There was a problem hiding this comment.
@ashwgit Since we're having this if two times, here and in L202 what about encapsulating it in some private function within this own module instead? Then we will reduce the nested code:
def __check_raise_certification_errors(errors):
if is_certification_error(errors):
self._raise_error(CertificationError, error_message)
[...]
# On line 202
self.__check_raise_certification_errors(job_details.errors)
[...]
# On this line
self.__check_raise_certification_errors(res.errors)
| if is_certification_error(res.errors): | ||
| raise CertificationError(failure_msg) |
There was a problem hiding this comment.
Please use here what Sourcery-AI pointed out:
| if is_certification_error(res.errors): | |
| raise CertificationError(failure_msg) | |
| if is_certification_error(res.errors): | |
| self._raise_error(CertificationError, failure_msg) |
Or even better, call the private function I mentioned in the other thread:
| if is_certification_error(res.errors): | |
| raise CertificationError(failure_msg) | |
| self.__check_raise_certification_errors(res.errors) |
| def _contains_certification_error(item: Any) -> bool: | ||
| code: str = item.get("code", "") |
There was a problem hiding this comment.
Item here points to Any, can you please change it to Dict[str, Any]
OR
Verify inside whether it's a dictionary, like:
if not isinstance(item, dict):
raise [...]
| message: str = item.get("message", "") | ||
| if code == "invalidState" and "certification" in message.lower(): | ||
| return True | ||
| for detail in item.get("details") or []: |
There was a problem hiding this comment.
Also please validate somehow that item["details"] is a list.
I know we should receive a list from MS but if they change somehow their format the code shouldn't break.
| "details": [ | ||
| { | ||
| "code": "invalidState", | ||
| "message": "Certification", | ||
| "details": [ | ||
| { | ||
| "code": "invalidState", | ||
| "message": "Issues found during Certification.", | ||
| } | ||
| ], |
There was a problem hiding this comment.
Nice!
Please also test some invalid error messages, like details not being present or present as a dict or something unexpected.
Make sure we can parse anything and just retrieve the certification error when we're sure the format is this one
70d18be to
ab75c04
Compare
@lslebodn @JAVGan PTAL. This one prevents cloudpub from retrying when there's a certification error.
Summary by Sourcery
Handle Azure certification failures as non-retriable errors and adjust error modeling accordingly.
Bug Fixes:
Enhancements:
Tests: