fix(core): disambiguate truncated observation permalinks to prevent index collisions#931
Conversation
…ndex collisions Observation.permalink truncates content to 200 chars to stay under PostgreSQL's 2704-byte btree index row limit. Two distinct observations with the same category and an identical 200-char prefix therefore collided on the same synthetic permalink, and index_entity_markdown's permalink-keyed dedup silently dropped the second one from the search index, making it unfindable. Append a short stable digest (first 12 hex chars of sha256 of the FULL content) to the truncated prefix so distinct content always yields distinct permalinks, while exact duplicates still dedupe and the permalink stays well under the btree limit. The permalink is a computed property, not a stored column, so no migration is needed. Fixes #909 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e1fe35560c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()[:12] | ||
| content_for_permalink = f"{self.content[:200]}-{digest}" |
There was a problem hiding this comment.
Delete legacy observation index rows during upgrade
For projects that already have indexed >200-character observations, this changes the computed permalink while the old digest-less search_index rows remain until reindex. If such a note is deleted before it is re-synced, both delete paths (SearchService.handle_delete and SyncService.handle_delete) compute the new digest permalink from the ORM observation and call delete_by_permalink, so the legacy observation row is never matched and remains searchable with a deleted entity_id; consider deleting by entity_id for observation cleanup or also removing the pre-digest permalink during the transition.
Useful? React with 👍 / 👎.
Distinct observations sharing a category and a 200-char content prefix used to collide on their truncated synthetic permalink and the second was silently dropped from the search index (#909, fixed by #931 via a content digest suffix). These end-to-end MCP tests pin the behavior independent of the disambiguation mechanism: both observations stay searchable, and delete_note removes every index row — including disambiguated ones, which would otherwise survive as ghost rows since search_index has no FK cascade from entity. Post-delete assertions inspect the index via search_service directly because the MCP search pipeline hides rows whose entity is gone, which would mask an orphan. Refs #909 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Drew Cain <groksrc@gmail.com>
Summary
Observation.permalinktruncates content to 200 chars to stay under PostgreSQL's 2704-byte btree index row limit. Two distinct observations with the same category and an identical 200-char prefix collided on the same synthetic permalink, andindex_entity_markdown's permalink-keyed dedup silently dropped the second one from the search index, making it unfindable.This change appends a short stable digest (first 12 hex chars of sha256 of the full content) to the truncated prefix, so distinct content always yields distinct permalinks while exact duplicates still dedupe and the permalink stays well under the btree limit. The permalink is a computed property, not a stored column, so no migration is needed.
Upgrade note: existing
search_indexrows for >200-char observations keep their old digest-less permalinks until the entity is re-synced orbm reindex --searchruns; release notes should mention this.Test plan
uv run pytest tests/repository/test_observation_repository.py tests/services/test_search_service.py -x -q— 73 passedCloses #909
🤖 Generated with Claude Code
Reviewed SHA:
e1fe35560c58b4f0f825b55a243cb361796b726eVerdict:
approveStatus:
success- BM Bossbot approved this head SHASummary:
Reviewed the full provided diff against the Basic Memory engineering style and the affected observation permalink/search-indexing path. The change preserves short-content permalinks, keeps exact long duplicates dedupable, and disambiguates distinct long observations that share a truncated prefix. The added repository and search-service regression tests cover the reported failure mode. I found no blocking issues.
Blocking findings:
Non-blocking findings:
BM Bossbot image provenance
#931/home/runner/work/basic-memory/basic-memory/docs/assets/infographics/pr-931.webpgpt-image-21536x1024highautononeTheme / choice instruction:
None supplied.
Image prompt sent to
gpt-image-2:Images API revised prompt:
Not provided by the Images API.