Skip to content

fix(core): disambiguate truncated observation permalinks to prevent index collisions#931

Merged
phernandez merged 2 commits into
mainfrom
fix/909-observation-permalink-collision
Jun 10, 2026
Merged

fix(core): disambiguate truncated observation permalinks to prevent index collisions#931
phernandez merged 2 commits into
mainfrom
fix/909-observation-permalink-collision

Conversation

@phernandez

@phernandez phernandez commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

Observation.permalink truncates content to 200 chars to stay under PostgreSQL's 2704-byte btree index row limit. Two distinct observations with the same category and an identical 200-char prefix collided on the same synthetic permalink, and index_entity_markdown's permalink-keyed dedup silently dropped the second one from the search index, making it unfindable.

This change appends a short stable digest (first 12 hex chars of sha256 of the full content) to the truncated prefix, so distinct content always yields distinct permalinks while exact duplicates still dedupe and the permalink stays well under the btree limit. The permalink is a computed property, not a stored column, so no migration is needed.

Upgrade note: existing search_index rows for >200-char observations keep their old digest-less permalinks until the entity is re-synced or bm reindex --search runs; release notes should mention this.

Test plan

  • uv run pytest tests/repository/test_observation_repository.py tests/services/test_search_service.py -x -q — 73 passed

Closes #909

🤖 Generated with Claude Code

Reviewed SHA: e1fe35560c58b4f0f825b55a243cb361796b726e
Verdict: approve
Status: success - BM Bossbot approved this head SHA

Summary:
Reviewed the full provided diff against the Basic Memory engineering style and the affected observation permalink/search-indexing path. The change preserves short-content permalinks, keeps exact long duplicates dedupable, and disambiguates distinct long observations that share a truncated prefix. The added repository and search-service regression tests cover the reported failure mode. I found no blocking issues.

Blocking findings:

  • None

Non-blocking findings:

  • None

BM Bossbot infographic for PR #931

BM Bossbot image provenance
  • Pull request: #931
  • Generated asset: /home/runner/work/basic-memory/basic-memory/docs/assets/infographics/pr-931.webp
  • Image model: gpt-image-2
  • Size: 1536x1024
  • Quality: high
  • Visual format: auto
  • Theme source: none

Theme / choice instruction:
None supplied.

Image prompt sent to gpt-image-2:

Create a polished landscape WebP visual for Basic Memory PR #931.

This is a non-gating visual summary. The authoritative merge gate is the
GitHub commit status named BM Bossbot Approval, not this image.

Use the BM Bossbot review summary below as source material. Preserve the
concrete before/after value story without inventing facts or turning
implementation details into clutter.

Choose the most appropriate visual form: infographic, map, scene, poster,
painting, tableau, cover image, illustrated artifact, or another image form that
best communicates the PR intent. Choose exactly one visual mode and follow only
that mode's rules. Do not blend the modes.

Mode A - infographic/map:
- Use a readable map backbone with structured information design: sections,
  route lines, checkpoints, nodes, annotations, status badges, compact evidence
  bullets, and a legend.
- Use this mode when the PR needs several facts, gates, checks, or before/after
  points to be read explicitly.

Mode B - editorial scene/poster/painting:
- Use image-first composition: an actual scene, movie poster, painting, tableau,
  cover image, or symbolic illustrated artifact.
- Use this mode when the PR intent can be shown through one staged visual moment
  with minimal text.
- Avoid dashboard layouts, data panels, timeline strips, flowcharts, legends,
  before/after boxes, bullet lists, and checklist columns in this mode.

The visual theme should drive the composition through original style cues while
the engineering meaning stays easy to scan.

Use high contrast, smooth anti-aliased text when text is present, clean edges,
and non-tiny labels. Text is optional for scene-first images, but any text that
appears must be readable.

Avoid fake screenshots, code blocks, invented claims, copyrighted characters,
logos, named fictional universes, direct band logos, album art, celebrity
likenesses, or decorations that obscure content.

BM Bossbot summary:
Reviewed SHA: `e1fe35560c58b4f0f825b55a243cb361796b726e`
Verdict: `approve`
Status: `success` - BM Bossbot approved this head SHA

Summary:
Reviewed the full provided diff against the Basic Memory engineering style and the affected observation permalink/search-indexing path. The change preserves short-content permalinks, keeps exact long duplicates dedupable, and disambiguates distinct long observations that share a truncated prefix. The added repository and search-service regression tests cover the reported failure mode. I found no blocking issues.

Blocking findings:
- None

Non-blocking findings:
- None

Images API revised prompt:
Not provided by the Images API.

phernandez and others added 2 commits June 9, 2026 17:39
…ndex collisions

Observation.permalink truncates content to 200 chars to stay under
PostgreSQL's 2704-byte btree index row limit. Two distinct observations
with the same category and an identical 200-char prefix therefore
collided on the same synthetic permalink, and index_entity_markdown's
permalink-keyed dedup silently dropped the second one from the search
index, making it unfindable.

Append a short stable digest (first 12 hex chars of sha256 of the FULL
content) to the truncated prefix so distinct content always yields
distinct permalinks, while exact duplicates still dedupe and the
permalink stays well under the btree limit. The permalink is a computed
property, not a stored column, so no migration is needed.

Fixes #909

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1fe35560c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +264 to +265
digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()[:12]
content_for_permalink = f"{self.content[:200]}-{digest}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Delete legacy observation index rows during upgrade

For projects that already have indexed >200-character observations, this changes the computed permalink while the old digest-less search_index rows remain until reindex. If such a note is deleted before it is re-synced, both delete paths (SearchService.handle_delete and SyncService.handle_delete) compute the new digest permalink from the ORM observation and call delete_by_permalink, so the legacy observation row is never matched and remains searchable with a deleted entity_id; consider deleting by entity_id for observation cleanup or also removing the pre-digest permalink during the transition.

Useful? React with 👍 / 👎.

@phernandez phernandez merged commit 93ed340 into main Jun 10, 2026
29 checks passed
@phernandez phernandez deleted the fix/909-observation-permalink-collision branch June 10, 2026 01:20
groksrc added a commit that referenced this pull request Jun 10, 2026
Distinct observations sharing a category and a 200-char content prefix
used to collide on their truncated synthetic permalink and the second
was silently dropped from the search index (#909, fixed by #931 via a
content digest suffix).

These end-to-end MCP tests pin the behavior independent of the
disambiguation mechanism: both observations stay searchable, and
delete_note removes every index row — including disambiguated ones,
which would otherwise survive as ghost rows since search_index has no
FK cascade from entity. Post-delete assertions inspect the index via
search_service directly because the MCP search pipeline hides rows
whose entity is gone, which would mask an orphan.

Refs #909

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
phernandez pushed a commit that referenced this pull request Jun 10, 2026
…926)

Refs #909. Integration coverage over the #931 fix.

Signed-off-by: phernandez <paul@basicmemory.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Duplicate observation permalink collision drops the second observation from the search index

1 participant