Skip to content

ref(utils): Make ParityChecker print out mismatches in a PII safe way#116038

Open
Christinarlong wants to merge 3 commits into
masterfrom
crl/webhook-payload-validation
Open

ref(utils): Make ParityChecker print out mismatches in a PII safe way#116038
Christinarlong wants to merge 3 commits into
masterfrom
crl/webhook-payload-validation

Conversation

@Christinarlong
Copy link
Copy Markdown
Contributor

@Christinarlong Christinarlong commented May 21, 2026

Context, I want to use ParityChecker (PC) for programmatically validating webhook payloads so we don't log PII. Currently, PC if it finds differences will output the repr. describe_value instead will give us a "censored" form of what the mismatches were. See the tests in #116040

This is 99% the same as the old ParityChecker but

  1. moves the class to be in utils
  2. Takes in a function parameter (format_value) that when a mismatch is found gets run on the old/new value so the default was repr.

…param

Move _ParityChecker and _qualify from testutils/helpers/serializer_parity.py
to sentry/utils/payload_comparison.py so it can be used by production code.

Changes from the original:
- Renamed _ParityChecker -> PayloadComparator, _qualify -> qualify
- Added format_value parameter (defaults to repr, preserving current behavior)
- Added describe_value() for PII-safe value formatting in production

serializer_parity.py now imports from the new location; all existing callers
of assert_serializer_parity are unaffected.
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label May 21, 2026
Move _ParityChecker and _qualify from testutils/helpers/serializer_parity.py
to sentry/utils/payload_comparison.py so it can be used by production code.

Changes from the original:
- Added format_value parameter (defaults to repr, preserving current behavior)
- Added describe_value() for PII-safe value formatting in production

serializer_parity.py now imports from the new location; all existing callers
of assert_serializer_parity are unaffected.
@Christinarlong Christinarlong changed the title ref: Extract PayloadComparator from _ParityChecker with format_value param ref(utils): Make ParityChecker print out mismatches in a PII safe way May 21, 2026
ParityChecker only recursed into nested structures when known_diffs
or unreliable field paths referenced something inside. When known_diffs
is empty (production validation case), differing nested dicts were
reported as a single flat mismatch instead of drilling into specific
fields. Now always recurses into Mapping and list types.
)
elif old_item != new_item:
self.mismatches.append(
f"{item_path}: old={self.format_value(old_item)}, new={self.format_value(new_item)}"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is new and are to address when the nested dictionary has a mismatch but we those mismatches aren't known before hand (thus not in nested_diffs or nested_unreliable)

elif isinstance(old_val, Mapping) and isinstance(new_val, Mapping):
      self.compare(old_val, new_val, frozenset(), full_path, _qualify(diffs_path, key))

This conditional is for when both values are dicts , recur into the nested dict to find which of the keys val pairs are different. Previously, this would just print the keys of the old and new value dicts.

elif isinstance(old_val, list) and isinstance(new_val, list):
                if len(old_val) != len(new_val):
                    self.mismatches.append(
                        f"{full_path} count: old={len(old_val)}, new={len(new_val)}"
                    )
                for i, (old_item, new_item) in enumerate(zip(old_val, new_val)):
                    item_path = f"{full_path}[{i}]"
                    if isinstance(old_item, Mapping) and isinstance(new_item, Mapping):
                        self.compare(
                            old_item, new_item, frozenset(), item_path, _qualify(diffs_path, key)
                        )
                    elif old_item != new_item:
                        self.mismatches.append(
                            f"{item_path}: old={self.format_value(old_item)}, new={self.format_value(new_item)}"

If both old and new are lists, first report if the length is same or not. Then we iterate through the list and recur if the items are and compare again. Didn't add something for nested lists, since the webhook payload doesn't have anything of that format.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we can reformulate this to have just a generalized parpallel walk and diff phase that is basically a Generator[tuple[Path, ComparisonResult]] that is consumed by book-keeping code to validate known diffs, ignore unreliables, and produce a nice result at the end. Probably not literally a generator, as I think we'd want to be able to tell it to not bother recursing into know diffs once they're actually different or something.

We shouldn't have to have nested data walking code that is limited by how much we're willing to write out, and we can probably generalize list/dict and treat both as [K,V] sequences, differentiating by type checks.

That said, not important if this is good enough for our purposes.
But we do need to reconsidere whether DeepDiff could work here instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think DeepDiff would solve all the cases I'm trying to tack onto the paritychecker. Honestly would prefer that as it would also be way less maintenance and understanding work required than refactoring the checker to be handle lists/dicts generically and separating out the walk and diff phase. Perhaps llms can roll through it, but even taking the time to understand and grok the checker is a bit much imo

Though reading the last pr descr, it seems like it was considered and said nah too(?).

@Christinarlong Christinarlong marked this pull request as ready for review May 21, 2026 20:20
@Christinarlong Christinarlong requested a review from a team as a code owner May 21, 2026 20:20
@Christinarlong Christinarlong requested review from a team and kcons May 21, 2026 20:52
Copy link
Copy Markdown
Member

@kcons kcons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable; but I think we need to test this within the commit that changes it.

)
elif old_item != new_item:
self.mismatches.append(
f"{item_path}: old={self.format_value(old_item)}, new={self.format_value(new_item)}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we can reformulate this to have just a generalized parpallel walk and diff phase that is basically a Generator[tuple[Path, ComparisonResult]] that is consumed by book-keeping code to validate known diffs, ignore unreliables, and produce a nice result at the end. Probably not literally a generator, as I think we'd want to be able to tell it to not bother recursing into know diffs once they're actually different or something.

We shouldn't have to have nested data walking code that is limited by how much we're willing to write out, and we can probably generalize list/dict and treat both as [K,V] sequences, differentiating by type checks.

That said, not important if this is good enough for our purposes.
But we do need to reconsidere whether DeepDiff could work here instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants