fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843
fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843tk-pkm111 wants to merge 1 commit into
Conversation
…ract sqlite_search_repository uses the formula `similarity = 1 - distance²/2` which is only correct for unit-normalized vectors (see comment on lines 65-67 of sqlite_search_repository.py). Models such as paraphrase-multilingual-MiniLM-L12-v2 return vectors with norm ≈ 2.9, causing cosine similarity scores to collapse near 0 for every query. Apply L2 normalization in `_embed_batch()` after converting to Python floats, so the contract holds regardless of which model is configured. Zero vectors are returned as-is to avoid division errors. Adds two unit tests: one that asserts the output norm is 1.0 for an unnormalized stub model, and one that verifies no exception is raised for zero vectors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: tk-pkm111 <133480534+tk-pkm111@users.noreply.github.com>
|
Hi @tk-pkm111 — thanks for the careful fix and the clear reproduction. The change looks solid: the unit-norm contract on The one blocker before we can merge is the CLA — the bot still shows it as unsigned for this PR. Could you sign it via the link in CLAassistant's comment above? Once that's in we can move toward landing this. |
Bring the LiteLLM provider in line with the unit-norm contract from sqlite_search_repository.py (lines 65-67): the cosine-similarity formula `1 - L²/2` is correct only for unit-normalized vectors. LiteLLM routes to many backends (Cohere, Vertex, Bedrock, etc.) that do not return normalized embeddings, so normalize at the provider boundary — same fix shape as the parallel FastEmbed change in basicmachines-co#843. Also align the response handling with OpenAIEmbeddingProvider: - attribute access on response items (item.index / item.embedding) - explicit duplicate-index guard Tests cover the three behaviors directly (unit norm, zero-vector pass-through, duplicate-index error) and the existing ordering test now reconstructs the expected normalized vectors so a normalization regression would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
|
CLA is now signed — could you merge this when you get a chance? |
phernandez
left a comment
There was a problem hiding this comment.
Reviewed the FastEmbed normalization change against the sqlite_search_repository unit-vector contract. No blockers found.\n\nLocal verification on PR head 4ef062b:\n- uv run pytest tests/repository/test_fastembed_provider.py -> 8 passed\n- uv run ruff check src/basic_memory/repository/fastembed_provider.py tests/repository/test_fastembed_provider.py -> passed\n\nNote: the repo's full Tests workflow is push-only and did not run for this external PR branch, so this review is based on focused local verification rather than the full CI matrix.
Summary
sqlite_search_repository.pyconverts vector distance to similarity withsimilarity = 1 - distance²/2, a formula that assumes unit-normalized vectors (the requirement is explicitly stated in comments on lines 65-67 of that file).paraphrase-multilingual-MiniLM-L12-v2— return vectors with norm ≈ 2.9 rather than 1.0. This causes every similarity score to collapse near 0.0, making vector/hybrid search effectively non-functional for those models._embed_batch()after converting raw model output to Pythonfloats. Zero vectors are returned as-is to prevent division errors.Root cause
FastEmbed's default English model (
bge-small-en-v1.5) happens to return unit-normalized vectors, so the bug is invisible with the default config. Switching to any model that does not self-normalize (e.g., the multilingual MiniLM family) exposed the contract violation.Changes
src/basic_memory/repository/fastembed_provider.pyimport math; L2-normalize each vector in_embed_batch()tests/repository/test_fastembed_provider.pyTest plan
uv run --python 3.12 pytest tests/repository/test_fastembed_provider.py -vtest_fastembed_provider_l2_normalizes_output_vectorsconfirms output norm is 1.0 ± 1e-6 for a stub model that returns norm ≈ 2.9test_fastembed_provider_zero_vector_does_not_raiseconfirms noZeroDivisionErroron zero vector🤖 Generated with Claude Code