fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract by tk-pkm111 · Pull Request #843 · basicmachines-co/basic-memory

tk-pkm111 · 2026-05-21T10:07:43Z

Summary

sqlite_search_repository.py converts vector distance to similarity with similarity = 1 - distance²/2, a formula that assumes unit-normalized vectors (the requirement is explicitly stated in comments on lines 65-67 of that file).
Some embedding models — notably multilingual ones such as paraphrase-multilingual-MiniLM-L12-v2 — return vectors with norm ≈ 2.9 rather than 1.0. This causes every similarity score to collapse near 0.0, making vector/hybrid search effectively non-functional for those models.
This PR applies L2 normalization inside _embed_batch() after converting raw model output to Python floats. Zero vectors are returned as-is to prevent division errors.

Root cause

FastEmbed's default English model (bge-small-en-v1.5) happens to return unit-normalized vectors, so the bug is invisible with the default config. Switching to any model that does not self-normalize (e.g., the multilingual MiniLM family) exposed the contract violation.

Changes

File	Change
`src/basic_memory/repository/fastembed_provider.py`	Add `import math`; L2-normalize each vector in `_embed_batch()`
`tests/repository/test_fastembed_provider.py`	Two new tests: norm==1.0 assertion for unnormalized stub model; zero-vector no-raise assertion

Test plan

All 8 existing + new unit tests pass: uv run --python 3.12 pytest tests/repository/test_fastembed_provider.py -v
New test test_fastembed_provider_l2_normalizes_output_vectors confirms output norm is 1.0 ± 1e-6 for a stub model that returns norm ≈ 2.9
New test test_fastembed_provider_zero_vector_does_not_raise confirms no ZeroDivisionError on zero vector

🤖 Generated with Claude Code

…ract sqlite_search_repository uses the formula `similarity = 1 - distance²/2` which is only correct for unit-normalized vectors (see comment on lines 65-67 of sqlite_search_repository.py). Models such as paraphrase-multilingual-MiniLM-L12-v2 return vectors with norm ≈ 2.9, causing cosine similarity scores to collapse near 0 for every query. Apply L2 normalization in `_embed_batch()` after converting to Python floats, so the contract holds regardless of which model is configured. Zero vectors are returned as-is to avoid division errors. Adds two unit tests: one that asserts the output norm is 1.0 for an unnormalized stub model, and one that verifies no exception is raised for zero vectors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: tk-pkm111 <133480534+tk-pkm111@users.noreply.github.com>

CLAassistant · 2026-05-21T10:07:55Z

All committers have signed the CLA.

phernandez · 2026-05-26T18:25:29Z

Hi @tk-pkm111 — thanks for the careful fix and the clear reproduction. The change looks solid: the unit-norm contract on sqlite_search_repository.py lines 65-67 is real, multilingual MiniLM models do break it, and the patch addresses it at the right boundary. Tests look meaningful too.

The one blocker before we can merge is the CLA — the bot still shows it as unsigned for this PR. Could you sign it via the link in CLAassistant's comment above? Once that's in we can move toward landing this.

Bring the LiteLLM provider in line with the unit-norm contract from sqlite_search_repository.py (lines 65-67): the cosine-similarity formula `1 - L²/2` is correct only for unit-normalized vectors. LiteLLM routes to many backends (Cohere, Vertex, Bedrock, etc.) that do not return normalized embeddings, so normalize at the provider boundary — same fix shape as the parallel FastEmbed change in basicmachines-co#843. Also align the response handling with OpenAIEmbeddingProvider: - attribute access on response items (item.index / item.embedding) - explicit duplicate-index guard Tests cover the three behaviors directly (unit norm, zero-vector pass-through, duplicate-index error) and the existing ordering test now reconstructs the expected normalized vectors so a normalization regression would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

tk-pkm111 · 2026-05-26T22:32:25Z

CLA is now signed — could you merge this when you get a chance?

phernandez

Reviewed the FastEmbed normalization change against the sqlite_search_repository unit-vector contract. No blockers found.\n\nLocal verification on PR head 4ef062b:\n- uv run pytest tests/repository/test_fastembed_provider.py -> 8 passed\n- uv run ruff check src/basic_memory/repository/fastembed_provider.py tests/repository/test_fastembed_provider.py -> passed\n\nNote: the repo's full Tests workflow is push-only and did not run for this external PR branch, so this review is based on focused local verification rather than the full CI matrix.

phernandez mentioned this pull request May 26, 2026

feat(core): add LiteLLM embedding provider #809

Open

phernandez approved these changes May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843

fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843
tk-pkm111 wants to merge 1 commit into
basicmachines-co:mainfrom
tk-pkm111:fix/fastembed-l2-normalization

tk-pkm111 commented May 21, 2026 •

edited

Loading

Uh oh!

CLAassistant commented May 21, 2026 •

edited

Loading

Uh oh!

phernandez commented May 26, 2026

Uh oh!

tk-pkm111 commented May 26, 2026

Uh oh!

phernandez left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tk-pkm111 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Changes

Test plan

Uh oh!

CLAassistant commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phernandez commented May 26, 2026

Uh oh!

tk-pkm111 commented May 26, 2026

Uh oh!

phernandez left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tk-pkm111 commented May 21, 2026 •

edited

Loading

CLAassistant commented May 21, 2026 •

edited

Loading